Merge pull request #4000 from clayjohn/Optimization

Initial port of optimization tutorials to master
This commit is contained in:
Nathan Lovato
2020-09-12 08:12:26 -06:00
committed by GitHub
14 changed files with 1074 additions and 194 deletions


@@ -7,7 +7,6 @@
introduction_to_3d
using_transforms
optimizing_3d_performance
3d_rendering_limitations
standard_material_3d
lights_and_shadows


@@ -1,192 +0,0 @@
.. meta::
:keywords: optimization
.. _doc_optimizing_3d_performance:
Optimizing 3D performance
=========================
Introduction
~~~~~~~~~~~~
Godot follows a balanced performance philosophy. In the performance world,
there are always trade-offs, which consist of trading speed for
usability and flexibility. Some practical examples of this are:
- Rendering objects efficiently in high amounts is easy, but when a
large scene must be rendered, it can become inefficient. To solve
this, visibility computation must be added to the rendering, which
makes rendering less efficient, but, at the same time, fewer objects are
rendered, so efficiency overall improves.
- Configuring the properties of every material for every object that
needs to be rendered is also slow. To solve this, objects are sorted
by material to reduce the costs, but at the same time sorting has a
cost.
- In 3D physics a similar situation happens. The best algorithms to
handle large amounts of physics objects (such as SAP) are slow
at insertion/removal of objects and ray-casting. Algorithms that
allow faster insertion and removal, as well as ray-casting, will not
be able to handle as many active objects.
And there are many more examples of this! Game engines strive to be
general purpose in nature, so balanced algorithms are always favored
over algorithms that might be fast in some situations and slow in
others, or algorithms that are fast but make usability more difficult.
Godot is not an exception and, while it is designed to have backends
swappable for different algorithms, the default ones (or more like, the
only ones that are there for now) prioritize balance and flexibility
over performance.
With this clear, the aim of this tutorial is to explain how to get the
maximum performance out of Godot.
Rendering
~~~~~~~~~
3D rendering is one of the most difficult areas to get performance from,
so this section will have a list of tips.
Reuse shaders and materials
---------------------------
The Godot renderer is a little different to what is out there. It's designed
to minimize GPU state changes as much as possible.
:ref:`class_StandardMaterial3D`
does a good job at reusing materials that need similar shaders but, if
custom shaders are used, make sure to reuse them as much as possible.
Godot's priorities will be like this:
- **Reusing Materials**: The fewer different materials in the
scene, the faster the rendering will be. If a scene has a huge amount
of objects (in the hundreds or thousands) try reusing the materials
or in the worst case use atlases.
- **Reusing Shaders**: If materials can't be reused, at least try to
re-use shaders (or StandardMaterial3Ds with different parameters but the same
configuration).
If a scene has, for example, 20,000 objects with 20,000 different
materials each, rendering will be slow. If the same scene has
20,000 objects, but only uses 100 materials, rendering will be blazingly
fast.
Pixel cost vs vertex cost
-------------------------
It is a common thought that the lower the number of polygons in a model, the
faster it will be rendered. This is *really* relative and depends on
many factors.
On a modern PC and console, vertex cost is low. GPUs
originally only rendered triangles, so all the vertices:
1. Had to be transformed by the CPU (including clipping).
2. Had to be sent to the GPU memory from the main RAM.
Nowadays, all this is handled inside the GPU, so the performance is
extremely high. 3D artists usually have the wrong feeling about
polycount performance because 3D DCCs (such as Blender, Max, etc.) need
to keep geometry in CPU memory in order for it to be edited, reducing
actual performance. Truth is, a model rendered by a 3D engine is much
more optimal than how 3D DCCs display them.
On mobile devices, the story is different. PC and Console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power efficient.
To be more efficient, mobile GPUs attempt to avoid *overdraw*. This
means, the same pixel on the screen being rendered (as in, with lighting
calculation, etc.) more than once. Imagine a town with several buildings,
GPUs don't know what is visible and what is hidden until they
draw it. A house might be drawn and then another house in front of it
(rendering happened twice for the same pixel!). PC GPUs normally don't
care much about this and just throw more pixel processors to the
hardware to increase performance (but this also increases power
consumption).
On mobile, pulling more power is not an option, so a technique called
"Tile Based Rendering" is used (almost every mobile hardware uses a
variant of it), which divides the screen into a grid. Each cell keeps the
list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power
consumption, but takes a toll on vertex performance. As a result, fewer
vertices and triangles can be processed for drawing.
Generally, this is not so bad, but there is a corner case on mobile that
must be avoided, which is to have small objects with a lot of geometry
within a small portion of the screen. This forces mobile GPUs to put a
lot of strain on a single screen cell, considerably decreasing
performance (as all the other cells must wait for it to complete in
order to display the frame).
To make it short, do not worry about vertex count so much on mobile, but
avoid concentration of vertices in small parts of the screen. If, for
example, a character, NPC, vehicle, etc. is far away (so it looks tiny),
use a smaller level of detail (LOD) model instead.
An extra situation where vertex cost must be considered is objects that
have extra processing per vertex, such as:
- Skinning (skeletal animation)
- Morphs (shape keys)
- Vertex Lit Objects (common on mobile)
Texture compression
-------------------
Godot offers to compress textures of 3D models when imported (VRAM
compression). Video RAM compression is not as efficient in size as PNG
or JPG when stored, but increases performance enormously when drawing.
This is because the main goal of texture compression is bandwidth
reduction between memory and the GPU.
In 3D, the shapes of objects depend more on the geometry than the
texture, so compression is generally not noticeable. In 2D, compression
depends more on shapes inside the textures, so the artifacts resulting
from 2D compression are more noticeable.
As a warning, most Android devices do not support texture compression of
textures with transparency (only opaque), so keep this in mind.
Transparent objects
-------------------
As mentioned before, Godot sorts objects by material and shader to
improve performance. This, however, can not be done on transparent
objects. Transparent objects are rendered from back to front to make
blending with what is behind work. As a result, please try to keep
transparent objects to a minimum! If an object has a small section with
transparency, try to make that section a separate material.
Level of detail (LOD)
---------------------
As also mentioned before, using objects with fewer vertices can improve
performance in some cases. Godot has a simple system to change level
of detail,
:ref:`GeometryInstance <class_GeometryInstance>`
based objects have a visibility range that can be defined. Having
several GeometryInstance objects in different ranges works as LOD.
Use instancing (MultiMesh)
--------------------------
If several identical objects have to be drawn in the same place or
nearby, try using :ref:`MultiMesh <class_MultiMesh>`
instead. MultiMesh allows the drawing of tens of thousands of objects at
very little performance cost, making it ideal for flocks, grass,
particles, etc.
Bake lighting
-------------
Small lights are usually not a performance issue. Shadows are a little more expensive.
In general, if several lights need to affect a scene, it's ideal to bake
it (:ref:`doc_baked_lightmaps`). Baking can also improve the scene quality by
adding indirect light bounces.
If working on mobile, baking to texture is recommended, since this
method is even faster.


@@ -0,0 +1,277 @@
.. _doc_cpu_optimization:
CPU optimization
================
Measuring performance
=====================
We have to know where the "bottlenecks" are to know how to speed up our program.
Bottlenecks are the slowest parts of the program that limit the rate that
everything can progress. Focusing on bottlenecks allows us to concentrate our
efforts on optimizing the areas which will give us the greatest speed
improvement, instead of spending a lot of time optimizing functions that will
lead to small performance improvements.
For the CPU, the easiest way to identify bottlenecks is to use a profiler.
CPU profilers
=============
Profilers run alongside your program and take timing measurements to work out
what proportion of time is spent in each function.
The Godot IDE conveniently has a built-in profiler. It does not run every time
you start your project: it must be manually started and stopped. This is
because, like most profilers, recording these timing measurements can
slow down your project significantly.
After profiling, you can look back at the results for a frame.
.. figure:: img/godot_profiler.png
:alt: Screenshot of the Godot profiler
Results of a profile of one of the demo projects.
.. note:: We can see the cost of built-in processes such as physics and audio,
as well as the cost of our own scripting functions at the
bottom.
Time spent waiting for various built-in servers may not be counted in
the profilers. This is a known bug.
When a project is running slowly, you will often see an obvious function or
process taking a lot more time than others. This is your primary bottleneck, and
you can usually increase speed by optimizing this area.
For more info about using Godot's built-in profiler, see
:ref:`doc_debugger_panel`.
External profilers
~~~~~~~~~~~~~~~~~~
Although the Godot IDE profiler is very convenient and useful, sometimes you
need more power, and the ability to profile the Godot engine source code itself.
You can use a number of third party profilers to do this including
`Valgrind <https://www.valgrind.org/>`__,
`VerySleepy <http://www.codersnotes.com/sleepy/>`__,
`HotSpot <https://github.com/KDAB/hotspot>`__,
`Visual Studio <https://visualstudio.microsoft.com/>`__ and
`Intel VTune <https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html>`__.
.. note:: You will need to compile Godot from source to use a third-party profiler.
This is required to obtain debugging symbols. You can also use a debug
build, however, note that the results of profiling a debug build will
be different to a release build, because debug builds are less
optimized. Bottlenecks are often in a different place in debug builds,
so you should profile release builds whenever possible.
.. figure:: img/valgrind.png
:alt: Screenshot of Callgrind
Example results from Callgrind, which is part of Valgrind.
From the left, Callgrind is listing the percentage of time within a function and
its children (Inclusive), the percentage of time spent within the function
itself, excluding child functions (Self), the number of times the function is
called, the function name, and the file or module.
In this example, we can see nearly all time is spent under the
``Main::iteration()`` function. This is the master function in the Godot source
code that is called repeatedly. It causes frames to be drawn, physics ticks to
be simulated, and nodes and scripts to be updated. A large proportion of the
time is spent in the functions to render a canvas (66%), because this example
uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver).
This tells us that a large proportion of CPU time is being spent in the
graphics driver.
This is actually an excellent example because, in an ideal world, only a very
small proportion of time would be spent in the graphics driver. This is an
indication that there is a problem with too much communication and work being
done in the graphics API. This specific profiling led to the development of 2D
batching, which greatly speeds up 2D rendering by reducing bottlenecks in this
area.
Manually timing functions
=========================
Another handy technique, especially once you have identified the bottleneck
using a profiler, is to manually time the function or area under test.
The specifics vary depending on the language, but in GDScript, you would do
the following:
::

    var time_start = OS.get_ticks_usec()

    # The function you want to time.
    update_enemies()

    var time_end = OS.get_ticks_usec()
    print("update_enemies() took %d microseconds" % (time_end - time_start))
When manually timing functions, it is usually a good idea to run the function
many times (1,000 or more times), instead of just once (unless it is a very slow
function). The reason for doing this is that timers often have limited accuracy.
Moreover, CPUs will schedule processes in a haphazard manner. Therefore, an
average over a series of runs is more accurate than a single measurement.
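For instance, a minimal sketch of averaging many runs (``update_enemies()`` is a
placeholder for the function under test) could look like this:
::

    var iterations = 1000
    var time_start = OS.get_ticks_usec()

    for i in iterations:
        update_enemies()  # Placeholder for the function under test.

    var time_end = OS.get_ticks_usec()
    var average = (time_end - time_start) / float(iterations)
    print("update_enemies() took %f microseconds on average" % average)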
As you attempt to optimize functions, be sure to either repeatedly profile or
time them as you go. This will give you crucial feedback as to whether the
optimization is working (or not).
Caches
======
CPU caches are something else to be particularly aware of, especially when
comparing timing results of two different versions of a function. The results
can be highly dependent on whether the data is in the CPU cache or not. CPUs
don't load data directly from the system RAM, even though it's huge in
comparison to the CPU cache (several gigabytes instead of a few megabytes). This
is because system RAM is very slow to access. Instead, CPUs load data from a
smaller, faster bank of memory called cache. Loading data from cache is very
fast, but every time you try to load a memory address that is not stored in
cache, the cache must make a trip to main memory and slowly load in some data.
This delay can result in the CPU sitting around idle for a long time, and is
referred to as a "cache miss".
This means that the first time you run a function, it may run slowly because the
data is not in the CPU cache. The second and later times, it may run much faster
because the data is in the cache. Due to this, always use averages when timing,
and be aware of the effects of cache.
Understanding caching is also crucial to CPU optimization. If you have an
algorithm (routine) that loads small bits of data from randomly spread out areas
of main memory, this can result in a lot of cache misses. Much of the time, the
CPU will be waiting around for data instead of doing any work. Instead, if you
can make your data accesses localised, or even better, access memory in a linear
fashion (like a continuous list), then the cache will work optimally and the CPU
will be able to work as fast as possible.
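GDScript's own interpreter overhead will dwarf cache effects, so treat the
following only as an illustration of the two access patterns, not as a benchmark:
::

    var data = []
    data.resize(1000000)
    for i in data.size():
        data[i] = float(i)

    var total = 0.0

    # Linear access: consecutive elements, which is cache-friendly.
    for i in data.size():
        total += data[i]

    # Random access: each lookup may land far from the previous one,
    # causing many more cache misses.
    for i in data.size():
        total += data[randi() % data.size()]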
Godot usually takes care of such low-level details for you. For example, the
Server APIs make sure data is optimized for caching already for things like
rendering and physics. Still, you should be especially aware of caching when
using :ref:`GDNative <toc-tutorials-gdnative>`.
Languages
=========
Godot supports a number of different languages, and it is worth bearing in mind
that there are trade-offs involved. Some languages are designed for ease of use
at the cost of speed, and others are faster but more difficult to work with.
Built-in engine functions run at the same speed regardless of the scripting
language you choose. If your project is making a lot of calculations in its own
code, consider moving those calculations to a faster language.
GDScript
~~~~~~~~
:ref:`GDScript <toc-learn-scripting-gdscript>` is designed to be easy to use and iterate,
and is ideal for making many types of games. However, in this language, ease of
use is considered more important than performance. If you need to make heavy
calculations, consider moving some of your project to one of the other
languages.
C#
~~
:ref:`C# <toc-learn-scripting-C#>` is popular and has first-class support in Godot. It
offers a good compromise between speed and ease of use. Beware of possible
garbage collection pauses and leaks that can occur during gameplay, though. A
common approach to work around issues with garbage collection is to use *object
pooling*, which is outside the scope of this guide.
Other languages
~~~~~~~~~~~~~~~
Third parties provide support for several other languages, including `Rust
<https://github.com/godot-rust/godot-rust>`_ and `JavaScript
<https://github.com/GodotExplorer/ECMAScript>`_.
C++
~~~
Godot is written in C++. Using C++ will usually result in the fastest code.
However, on a practical level, it is the most difficult to deploy to end users'
machines on different platforms. Options for using C++ include
:ref:`GDNative <toc-tutorials-gdnative>` and
:ref:`custom modules <doc_custom_modules_in_c++>`.
Threads
=======
Consider using threads when making a lot of calculations that can run in
parallel to each other. Modern CPUs have multiple cores, each one capable of
doing a limited amount of work. By spreading work over multiple threads, you can
move further towards peak CPU efficiency.
The disadvantage of threads is that you have to be incredibly careful. As each
CPU core operates independently, they can end up trying to access the same
memory at the same time. One thread can be reading from a variable while another
is writing to it: this is called a *race condition*. Before you use threads, make sure
you understand the dangers and how to prevent these race conditions.
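A minimal sketch of guarding shared data with a Mutex (the worker method and
data are placeholders, and the Thread API shown is the Godot 3.x style):
::

    var thread = Thread.new()
    var mutex = Mutex.new()
    var results = []

    func _ready():
        thread.start(self, "_heavy_work", 1000)

    func _heavy_work(count):
        var local_results = []
        for i in count:
            local_results.append(i * i)  # Placeholder for an expensive calculation.
        # Lock before touching data that the main thread can also read.
        mutex.lock()
        results = local_results
        mutex.unlock()

    func _exit_tree():
        # Always wait for the thread to finish before the node goes away.
        thread.wait_to_finish()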
Threads can also make debugging considerably more difficult. The GDScript
debugger doesn't support setting up breakpoints in threads yet.
For more information on threads, see :ref:`doc_using_multiple_threads`.
SceneTree
=========
Although Nodes are an incredibly powerful and versatile concept, be aware that
every node has a cost. Built-in functions such as ``_process()`` and
``_physics_process()`` propagate through the tree. This housekeeping can reduce
performance when you have very large numbers of nodes (usually in the thousands).
Each node is handled individually in the Godot renderer. Therefore, a smaller
number of nodes, each containing more content, can lead to better performance.
One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
get much better performance by removing nodes from the SceneTree, rather than by
pausing or hiding them. You don't have to delete a detached node. You can, for
example, keep a reference to a node, detach it from the scene tree using
:ref:`Node.remove_child(node) <class_Node_method_remove_child>`, then reattach
it later using :ref:`Node.add_child(node) <class_Node_method_add_child>`.
This can be very useful for adding and removing areas from a game, for example.
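As a sketch (the node name here is hypothetical):
::

    onready var dungeon_area = $DungeonArea

    func unload_area():
        # The node keeps existing in memory; it simply stops being
        # processed and rendered while detached from the tree.
        remove_child(dungeon_area)

    func reload_area():
        add_child(dungeon_area)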
You can avoid the SceneTree altogether by using Server APIs. For more
information, see :ref:`doc_using_servers`.
Physics
=======
In some situations, physics can end up becoming a bottleneck. This is
particularly the case with complex worlds and large numbers of physics objects.
Here are some techniques to speed up physics:
- Try using simplified versions of your rendered geometry for collision shapes.
Often, this won't be noticeable for end users, but can greatly increase
performance.
- Try removing objects from physics when they are out of view / outside the
current area, or reusing physics objects (maybe you allow 8 monsters per area,
for example, and reuse these).
Another crucial aspect of physics is the physics tick rate. In some games, you
can greatly reduce the tick rate: instead of updating physics 60 times per
second, for example, you may update it only 30 or even 20 times per second.
This can greatly reduce the CPU load.
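The tick rate can be changed in the project settings or from code; a sketch,
assuming the Godot 3.x property name:
::

    func _ready():
        # Run physics at 30 ticks per second instead of the default 60.
        # In newer versions, this property may be named physics_ticks_per_second.
        Engine.iterations_per_second = 30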
The downside of changing physics tick rate is you can get jerky movement or
jitter when the physics update rate does not match the frames per second
rendered. Also, decreasing the physics tick rate will increase input lag.
It's recommended to stick to the default physics tick rate (60 Hz) in most games
that feature real-time player movement.
The solution to jitter is to use *fixed timestep interpolation*, which involves
smoothing the rendered positions and rotations over multiple frames to match the
physics. You can either implement this yourself or use a
`third-party addon <https://github.com/lawnjelly/smoothing-addon>`__.
Performance-wise, interpolation is a very cheap operation compared to running a
physics tick. It's orders of magnitude faster, so this can be a significant
performance win while also reducing jitter.


@@ -0,0 +1,297 @@
.. _doc_general_optimization:
General optimization tips
=========================
Introduction
~~~~~~~~~~~~
In an ideal world, computers would run at infinite speed. The only limit to
what we could achieve would be our imagination. However, in the real world, it's
all too easy to produce software that will bring even the fastest computer to
its knees.
Thus, designing games and other software is a compromise between what we would
like to be possible, and what we can realistically achieve while maintaining
good performance.
To achieve the best results, we have two approaches:
- Work faster.
- Work smarter.
And preferably, we will use a blend of the two.
Smoke and mirrors
^^^^^^^^^^^^^^^^^
Part of working smarter is recognizing that, in games, we can often get the
player to believe they're in a world that is far more complex, interactive, and
graphically exciting than it really is. A good programmer is a magician, and
should strive to learn the tricks of the trade while trying to invent new ones.
The nature of slowness
^^^^^^^^^^^^^^^^^^^^^^
To the outside observer, performance problems are often lumped together.
But in reality, there are several different kinds of performance problems:
- A slow process that occurs every frame, leading to a continuously low frame
rate.
- An intermittent process that causes "spikes" of slowness, leading to
stalls.
- A slow process that occurs outside of normal gameplay, for instance,
when loading a level.
Each of these are annoying to the user, but in different ways.
Measuring performance
=====================
Probably the most important tool for optimization is the ability to measure
performance - to identify where bottlenecks are, and to measure the success of
our attempts to speed them up.
There are several methods of measuring performance, including:
- Putting a start/stop timer around code of interest.
- Using the Godot profiler.
- Using external third-party CPU profilers.
- Using GPU profilers/debuggers such as
`NVIDIA Nsight Graphics <https://developer.nvidia.com/nsight-graphics>`__
or `apitrace <https://apitrace.github.io/>`__.
- Checking the frame rate (with V-Sync disabled).
Be very aware that the relative performance of different areas can vary on
different hardware. It's often a good idea to measure timings on more than one
device. This is especially the case if you're targeting mobile devices.
Limitations
~~~~~~~~~~~
CPU profilers are often the go-to method for measuring performance. However,
they don't always tell the whole story.
- Bottlenecks are often on the GPU, as a result of instructions given by the
CPU.
- Spikes can occur in the operating system processes (outside of Godot) as a
result of instructions used in Godot (for example, dynamic memory allocation).
- You may not always be able to profile specific devices like a mobile phone
due to the initial setup required.
- You may have to solve performance problems that occur on hardware you don't
have access to.
As a result of these limitations, you often need to use detective work to find
out where bottlenecks are.
Detective work
~~~~~~~~~~~~~~
Detective work is a crucial skill for developers (both in terms of performance,
and also in terms of bug fixing). This can include hypothesis testing, and
binary search.
Hypothesis testing
^^^^^^^^^^^^^^^^^^
Say, for example, that you believe sprites are slowing down your game.
You can test this hypothesis by:
- Measuring the performance when you add more sprites, or take some away.
This may lead to a further hypothesis: does the size of the sprite determine
the performance drop?
- You can test this by keeping everything the same, but changing the sprite
size, and measuring performance.
Binary search
^^^^^^^^^^^^^
If you know that frames are taking much longer than they should, but you're
not sure where the bottleneck lies, you could begin by commenting out
approximately half the routines that occur on a normal frame. Has the
performance improved more or less than expected?
Once you know which of the two halves contains the bottleneck, you can
repeat this process until you've pinned down the problematic area.
Profilers
=========
Profilers allow you to time your program while it is running. Profilers then
provide results telling you what percentage of time was spent in different
functions and areas, and how often functions were called.
This can be very useful both to identify bottlenecks and to measure the results
of your improvements. Sometimes, attempts to improve performance can backfire
and lead to slower performance.
**Always use profiling and timing to guide your efforts.**
For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
Principles
==========
`Donald Knuth <https://en.wikipedia.org/wiki/Donald_Knuth>`__ said:
*Programmers waste enormous amounts of time thinking about, or worrying
about, the speed of noncritical parts of their programs, and these attempts
at efficiency actually have a strong negative impact when debugging and
maintenance are considered. We should forget about small efficiencies, say
about 97% of the time: premature optimization is the root of all evil. Yet
we should not pass up our opportunities in that critical 3%.*
There are two important messages here:
- Developer time is limited. Instead of blindly trying to speed up
all aspects of a program, we should concentrate our efforts on the aspects
that really matter.
- Efforts at optimization often end up with code that is harder to read and
debug than non-optimized code. It is in our interests to limit this to areas
that will really benefit.
Just because we *can* optimize a particular bit of code, it doesn't necessarily
mean that we *should*. Knowing when and when not to optimize is a great skill to
develop.
One misleading aspect of the quote is that people tend to focus only on the
phrase *"premature optimization is the root of all evil"*. While *premature*
optimization is (by definition) undesirable, performant software is the result
of performant design.
Performant design
~~~~~~~~~~~~~~~~~
The danger with encouraging people to ignore optimization until necessary is
that it overlooks the most important time to consider performance: the design
stage, before a single line of code has been written. If the
design or algorithms of a program are inefficient, then no amount of polishing
the details later will make it run fast. It may run *faster*, but it will never
run as fast as a program designed for performance.
This tends to be far more important in game or graphics programming than in
general programming. A performant design, even without low-level optimization,
will often run many times faster than a mediocre design with low-level
optimization.
Incremental design
~~~~~~~~~~~~~~~~~~
Of course, in practice, unless you have prior knowledge, you are unlikely to
come up with the best design the first time. Instead, you'll often make a series
of versions of a particular area of code, each taking a different approach to
the problem, until you come to a satisfactory solution. It's important not to
spend too much time on the details at this stage until you have finalized the
overall design. Otherwise, much of your work will be thrown out.
It's difficult to give general guidelines for performant design because this is
so dependent on the problem. One point worth mentioning though, on the CPU side,
is that modern CPUs are nearly always limited by memory bandwidth. This has led
to a resurgence in data-oriented design, which involves designing data
structures and algorithms for *cache locality* of data and linear access, rather
than jumping around in memory.
The optimization process
~~~~~~~~~~~~~~~~~~~~~~~~
Assuming we have a reasonable design, and taking our lessons from Knuth, our
first step in optimization should be to identify the biggest bottlenecks - the
slowest functions, the low-hanging fruit.
Once we've successfully improved the speed of the slowest area, it may no
longer be the bottleneck. So we should test/profile again and find the next
bottleneck on which to focus.
The process is thus:
1. Profile / Identify bottleneck.
2. Optimize bottleneck.
3. Return to step 1.
Optimizing bottlenecks
~~~~~~~~~~~~~~~~~~~~~~
Some profilers will even tell you which parts of a function (which data accesses
or calculations) are slowing things down.
As with design, you should concentrate your efforts first on making sure the
algorithms and data structures are the best they can be. Data access should be
local (to make best use of CPU cache), and it can often be better to use compact
storage of data (again, always profile to test results). Often, you can precalculate
heavy computations ahead of time. This can be done by performing the computation
when loading a level, by loading a file containing precalculated data or simply
by storing the results of complex calculations into a script constant and
reading its value.
Once algorithms and data are good, you can often make small changes in routines
which improve performance. For instance, you can move some calculations outside
of loops or transform nested ``for`` loops into non-nested loops.
(This should be feasible if you know a 2D array's width or height in advance.)
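As a small sketch (``cells``, ``width``, ``height`` and the values are placeholders):
::

    # Before: the division is repeated on every iteration,
    # and the nested loops add per-row overhead.
    for y in height:
        for x in width:
            cells[y * width + x] = base_value / normalization

    # After: the division is hoisted out of the loop, and a single loop
    # walks the flattened array linearly.
    var scaled_value = base_value / normalization
    for i in width * height:
        cells[i] = scaled_value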
Always retest your timing/bottlenecks after making each change. Some changes
will increase speed, others may have a negative effect. Sometimes, a small
positive effect will be outweighed by the negatives of more complex code, and
you may choose to leave out that optimization.
Appendix
========
Bottleneck math
~~~~~~~~~~~~~~~
The proverb *"a chain is only as strong as its weakest link"* applies directly to
performance optimization. If your project is spending 90% of the time in
function ``A``, then optimizing ``A`` can have a massive effect on performance.
.. code-block:: none
A: 9 ms
Everything else: 1 ms
Total frame time: 10 ms
.. code-block:: none
A: 1 ms
Everything else: 1 ms
Total frame time: 2 ms
In this example, improving this bottleneck ``A`` by a factor of 9× decreases
overall frame time by 5× while increasing frames per second by 5×.
However, if something else is running slowly and also bottlenecking your
project, then the same improvement can lead to less dramatic gains:
.. code-block:: none
A: 9 ms
Everything else: 50 ms
Total frame time: 59 ms
.. code-block:: none
A: 1 ms
Everything else: 50 ms
Total frame time: 51 ms
In this example, even though we have hugely optimized function ``A``,
the actual gain in terms of frame rate is quite small.
In games, things become even more complicated because the CPU and GPU run
independently of one another. Your total frame time is determined by the slower
of the two.
.. code-block:: none
CPU: 9 ms
GPU: 50 ms
Total frame time: 50 ms
.. code-block:: none
CPU: 1 ms
GPU: 50 ms
Total frame time: 50 ms
In this example, we optimized the CPU hugely again, but the frame time didn't
improve because we are GPU-bottlenecked.


@@ -0,0 +1,280 @@
.. _doc_gpu_optimization:
GPU optimization
================
Introduction
~~~~~~~~~~~~
The demand for new graphics features and progress almost guarantees that you
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
instance in calculations inside the Godot engine to prepare objects for
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
sorts instructions to pass to the GPU, and in the transfer of these
instructions. And finally, bottlenecks also occur on the GPU itself.
Where bottlenecks occur in rendering is highly hardware-specific.
Mobile GPUs in particular may struggle with scenes that run easily on desktop.
Understanding and investigating GPU bottlenecks is slightly different from the
situation on the CPU. This is because, often, you can only change performance
indirectly by changing the instructions you give to the GPU. Also, it may be
more difficult to take measurements. In many cases, the only way of measuring
performance is by examining changes in the time spent rendering each frame.
Draw calls, state changes, and APIs
===================================
.. note:: The following section is not relevant to end-users, but is useful to
provide background information that is relevant in later sections.
Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
Vulkan). The communication and driver activity involved can be quite costly,
especially in OpenGL and OpenGL ES. If we can provide these instructions in a
way that is preferred by the driver and GPU, we can greatly increase
performance.
Nearly every API command in OpenGL requires a certain amount of validation to
make sure the GPU is in the correct state. Even seemingly simple commands can
lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
reduce these instructions to a bare minimum and group together similar objects
as much as possible so they can be rendered together, or with the minimum number
of these expensive state changes.
2D batching
~~~~~~~~~~~
In 2D, the costs of treating each item individually can be prohibitively high -
there can easily be thousands of them on the screen. This is why 2D *batching*
is used. Multiple similar items are grouped together and rendered in a batch,
via a single draw call, rather than making a separate draw call for each item.
In addition, this means state changes, material and texture changes can be kept
to a minimum.
3D batching
~~~~~~~~~~~
In 3D, we still aim to minimize draw calls and state changes. However, it can be
more difficult to batch together several objects into a single draw call. 3D
meshes tend to comprise hundreds or thousands of triangles, and combining large
meshes in real time is prohibitively expensive. The cost of joining them quickly
exceeds any benefit as the number of triangles per mesh grows. A much better
alternative is to **join meshes ahead of time** (static meshes in relation to each
other). This can either be done by artists, or programmatically within Godot.
There is also a cost to batching together objects in 3D. Several objects
rendered as one cannot be individually culled. An entire city that is off-screen
will still be rendered if it is joined to a single blade of grass that is on
screen. Thus, you should always take objects' location and culling into account
when attempting to batch 3D objects together. Despite this, the benefits of
joining static objects often outweigh other considerations, especially for large
numbers of distant or low-poly objects.
For more information on 3D specific optimizations, see
:ref:`doc_optimizing_3d_performance`.
Reuse Shaders and Materials
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Godot renderer is a little different from other renderers out there. It's designed to
minimize GPU state changes as much as possible. :ref:`StandardMaterial3D
<class_StandardMaterial3D>` does a good job at reusing materials that need similar
shaders but, if custom shaders are used, make sure to reuse them as much as
possible. Godot's priorities are:
- **Reusing Materials:** The fewer different materials in the
scene, the faster the rendering will be. If a scene has a huge amount
of objects (in the hundreds or thousands), try reusing the materials.
In the worst case, use atlases to decrease the amount of texture changes.
- **Reusing Shaders:** If materials can't be reused, at least try to
re-use shaders (or StandardMaterial3Ds with different parameters but the same
configuration).
If a scene has, for example, ``20,000`` objects, each with its own unique
material, rendering will be slow. If the same scene has ``20,000``
objects, but only uses ``100`` materials, rendering will be much faster.
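For example, a minimal sketch of assigning one shared material resource to many
meshes from code (the group name here is hypothetical):
::

    # Create or load the material once...
    var shared_material = StandardMaterial3D.new()
    shared_material.albedo_color = Color(0.8, 0.2, 0.2)

    # ...then assign the same resource to every mesh that should use it.
    for mesh_instance in get_tree().get_nodes_in_group("crates"):
        mesh_instance.material_override = shared_material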
Pixel cost versus vertex cost
=============================
You may have heard that the lower the number of polygons in a model, the faster
it will be rendered. This is *really* relative and depends on many factors.
On a modern PC and console, vertex cost is low. GPUs originally only rendered
triangles. This meant that every frame:
1. All vertices had to be transformed by the CPU (including clipping).
2. All vertices had to be sent to the GPU memory from the main RAM.
Nowadays, all this is handled inside the GPU, greatly increasing performance.
3D artists often have a misleading impression of polycount performance because 3D
DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to
be edited, reducing actual performance. Game engines rely on the GPU more, so
they can render many triangles much more efficiently.
On mobile devices, the story is different. PC and console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power efficient.
To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
when the same pixel on the screen is being rendered more than once. Imagine a
town with several buildings. GPUs don't know what is visible and what is hidden
until they draw it. For example, a house might be drawn and then another house
in front of it (which means rendering happened twice for the same pixel). PC
GPUs normally don't care much about this and just throw more pixel processors to
the hardware to increase performance (which also increases power consumption).
Using more power is not an option on mobile so mobile devices use a technique
called *tile-based rendering* which divides the screen into a grid. Each cell
keeps the list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power consumption,
but takes a toll on vertex performance. As a result, fewer vertices and
triangles can be processed for drawing.
Additionally, tile-based rendering struggles when there are small objects with a
lot of geometry within a small portion of the screen. This forces mobile GPUs to
put a lot of strain on a single screen tile, which considerably decreases
performance as all the other cells must wait for it to complete before
displaying the frame.
To summarize, don't worry about vertex count on mobile, but
**avoid concentration of vertices in small parts of the screen**.
If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
a smaller level of detail (LOD) model. Even on desktop GPUs, it's preferable to
avoid having triangles smaller than the size of a pixel on screen.
Pay attention to the additional vertex processing required when using:
- Skinning (skeletal animation)
- Morphs (shape keys)
- Vertex-lit objects (common on mobile)
Pixel/fragment shaders and fill rate
====================================
In contrast to vertex processing, the costs of fragment (per-pixel) shading have
increased dramatically over the years. Screen resolutions have increased (the
area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
screen, that is 27x the area), but also the complexity of fragment shaders has
exploded. Physically-based rendering requires complex calculations for each
fragment.
You can test whether a project is fill rate-limited quite easily. Turn off
V-Sync to prevent capping the frames per second, then compare the frames per
second when running with a large window, to running with a very small window.
You may also benefit from similarly reducing your shadow map size if using
shadows. Usually, you will find the FPS increases quite a bit using a small
window, which indicates you are to some extent fill rate-limited. On the other
hand, if there is little to no increase in FPS, then your bottleneck lies
elsewhere.
You can increase performance in a fill rate-limited project by reducing the
amount of work the GPU has to do. You can do this by simplifying the shader
(perhaps turn off expensive options if you are using a :ref:`StandardMaterial3D
<class_StandardMaterial3D>`), or reducing the number and size of textures used.
**When targeting mobile devices, consider using the simplest possible shaders
you can reasonably afford to use.**
Reading textures
~~~~~~~~~~~~~~~~
The other factor in fragment shaders is the cost of reading textures. Reading
textures is an expensive operation, especially when reading from several
textures in a single fragment shader. Also, consider that filtering may slow it
down further (trilinear filtering between mipmaps, and averaging). Reading
textures is also expensive in terms of power usage, which is a big issue on
mobile devices.
**If you use third-party shaders or write your own shaders, try to use
algorithms that require as few texture reads as possible.**
Texture compression
~~~~~~~~~~~~~~~~~~~
By default, Godot compresses textures of 3D models when imported using video RAM
(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
JPG when stored, but increases performance enormously when drawing large enough
textures.
This is because the main goal of texture compression is bandwidth reduction
between memory and the GPU.
In 3D, the shapes of objects depend more on the geometry than the texture, so
compression is generally not noticeable. In 2D, compression depends more on
shapes inside the textures, so the artifacts resulting from 2D compression are
more noticeable.
As a warning, most Android devices do not support texture compression of
textures with transparency (only opaque), so keep this in mind.
.. note::
Even in 3D, "pixel art" textures should have VRAM compression disabled as it
will negatively affect their appearance, without improving performance
significantly due to their low resolution.
Post-processing and shadows
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Post-processing effects and shadows can also be expensive in terms of fragment
shading activity. Always test the impact of these on different hardware.
**Reducing the size of shadowmaps can increase performance**, both in terms of
writing and reading the shadowmaps. On top of that, the best way to improve
performance of shadows is to turn shadows off for as many lights and objects as
possible. Smaller or distant OmniLights/SpotLights can often have their shadows
disabled with only a small visual impact.
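As a sketch, shadows can be toggled per light in the editor or from code (the
group name here is hypothetical):
::

    # Turn off shadow casting on small or distant lights.
    for light in get_tree().get_nodes_in_group("minor_lights"):
        light.shadow_enabled = false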
Transparency and blending
=========================
Transparent objects present particular problems for rendering efficiency. Opaque
objects (especially in 3D) can be essentially rendered in any order and the
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
blended objects are different. In most cases, they cannot rely on the Z-buffer
and must be rendered in "painter's order" (i.e. from back to front) to look
correct.
Transparent objects are also particularly bad for fill rate, because every item
has to be drawn even if other transparent objects will be drawn on top
later on.
Opaque objects don't have to do this. They can usually take advantage of the
Z-buffer by first writing only to the Z-buffer, then running the fragment shader
only on the "winning" fragment: the object that ends up in front at a
particular pixel.
Transparency is particularly expensive where multiple transparent objects
overlap. It is usually better to use transparent areas as small as possible to
minimize these fill rate requirements, especially on mobile, where fill rate is
very expensive. Indeed, in many situations, rendering more complex opaque
geometry can end up being faster than using transparency to "cheat".
Multi-platform advice
=====================
If you are aiming to release on multiple platforms, test *early* and test
*often* on all your platforms, especially mobile. Developing a game on desktop
but attempting to port it to mobile at the last minute is a recipe for disaster.
In general, you should design your game for the lowest common denominator, then
add optional enhancements for more powerful platforms. For example, you may want
to use the GLES2 backend for both desktop and mobile platforms where you target
both.
Mobile/tiled renderers
======================
As described above, GPUs on mobile devices work in dramatically different ways
from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
split up the screen into regular-sized tiles that fit into super fast cache
memory, which reduces the number of read/write operations to the main memory.
There are some downsides though. Tiled rendering can make certain techniques
much more complicated and expensive to perform. Tiles that rely on the results
of rendering in different tiles or on the results of earlier operations being
preserved can be very slow. Be very careful to test the performance of shaders,
viewport textures, and post-processing.

7 binary image files added (45 KiB to 177 KiB); not shown.


@@ -1,9 +1,76 @@
Optimization
=============
Introduction
------------
Godot follows a balanced performance philosophy. In the performance world,
there are always trade-offs, which consist of trading speed for usability
and flexibility. Some practical examples of this are:
- Rendering large numbers of objects efficiently is easy, but when a
large scene must be rendered, it can become inefficient. To solve this,
visibility computation must be added to the rendering. This makes rendering
less efficient, but at the same time, fewer objects are rendered. Therefore,
the overall rendering efficiency is improved.
- Configuring the properties of every material for every object that
needs to be rendered is also slow. To solve this, objects are sorted by
material to reduce the costs. At the same time, sorting has a cost.
- In 3D physics, a similar situation happens. The best algorithms to
handle large amounts of physics objects (such as SAP) are slow at
insertion/removal of objects and raycasting. Algorithms that allow faster
insertion and removal, as well as raycasting, will not be able to handle as
many active objects.
And there are many more examples of this! Game engines strive to be
general-purpose in nature. Balanced algorithms are always favored over
algorithms that might be fast in some situations and slow in others, or
algorithms that are fast but are more difficult to use.
Godot is not an exception to this. While it is designed to have backends
swappable for different algorithms, the default backends prioritize balance and
flexibility over performance.
With this clear, the aim of this tutorial section is to explain how to get the
maximum performance out of Godot. While the tutorials can be read in any order,
it is a good idea to start from :ref:`doc_general_optimization`.
Common
------
.. toctree::
:maxdepth: 1
:name: toc-learn-features-general-optimization
general_optimization
using_servers
CPU
---
.. toctree::
:maxdepth: 1
:name: toc-learn-features-cpu-optimization
cpu_optimization
GPU
---
.. toctree::
:maxdepth: 1
:name: toc-learn-features-gpu-optimization
gpu_optimization
using_multimesh
3D
--
.. toctree::
:maxdepth: 1
:name: toc-learn-features-3d-optimization
optimizing_3d_performance


@@ -0,0 +1,152 @@
.. meta::
:keywords: optimization
.. _doc_optimizing_3d_performance:
Optimizing 3D performance
=========================
Culling
=======
Godot will automatically perform view frustum culling in order to prevent
rendering objects that are outside the viewport. This works well for games that
take place in a small area. However, things can quickly become problematic in
larger levels.
Occlusion culling
~~~~~~~~~~~~~~~~~
Walking around a town, for example, you may only be able to see a few buildings
in the street you are in, as well as the sky and a few birds flying overhead. As
far as a naive renderer is concerned, however, you can still see the entire town.
It won't just render the buildings in front of you; it will also render the street
behind that, the people on that street, and the buildings behind that. You
quickly end up in situations where you are attempting to render 10× or 100× more
than what is visible.
Things aren't quite as bad as they seem, because the Z-buffer usually allows the
GPU to only fully shade the objects that are at the front. This is called *depth
prepass* and is enabled by default in Godot when using the GLES3 renderer.
However, unneeded objects are still reducing performance.
One way we can potentially reduce the amount to be rendered is to take advantage
of occlusion. As of Godot 3.2.2, there is no built-in support for occlusion in
Godot. However, with careful design you can still get many of the advantages.
For instance, in our city street scenario, you may be able to work out in advance
that you can only see two other streets, ``B`` and ``C``, from street ``A``.
Streets ``D`` to ``Z`` are hidden. In order to take advantage of occlusion, all
you have to do is work out when your viewer is in street ``A`` (perhaps using
Godot Areas), then you can hide the other streets.
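A rough sketch of this approach, using an Area with ``body_entered``/``body_exited``
signals (the node names and player group are assumptions):
::

    # Attached to the root of street A; StreetB and StreetC stay visible,
    # the others are hidden while the player is inside this street's Area.
    onready var hidden_streets = [$"../StreetD", $"../StreetE"]  # ...and so on.

    func _on_StreetAArea_body_entered(body):
        if body.is_in_group("player"):
            for street in hidden_streets:
                street.visible = false

    func _on_StreetAArea_body_exited(body):
        if body.is_in_group("player"):
            for street in hidden_streets:
                street.visible = true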
This is a manual version of what is known as a "potentially visible set". It is
a very powerful technique for speeding up rendering. You can also use it to
restrict physics or AI to the local area, and speed these up as well as
rendering.
.. note::
In some cases, you may have to adapt your level design to add more occlusion
opportunities. For example, you may have to add more walls to prevent the player
from seeing too far away, which would decrease performance due to the lost
opportunities for occlusion culling.
Other occlusion techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~
There are other occlusion techniques such as portals, automatic PVS, and
raster-based occlusion culling. Some of these may be available through add-ons
and may be available in core Godot in the future.
Transparent objects
~~~~~~~~~~~~~~~~~~~
Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
<class_Shader>` to improve performance. This, however, can not be done with
transparent objects. Transparent objects are rendered from back to front to make
blending with what is behind work. As a result,
**try to use as few transparent objects as possible**. If an object has a
small section with transparency, try to make that section a separate surface
with its own material.
For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
doc.
Level of detail (LOD)
=====================
In some situations, particularly at a distance, it can be a good idea to
**replace complex geometry with simpler versions**. The end user will probably
not be able to see much difference. Consider looking at a large number of trees
in the far distance. There are several strategies for replacing models at
varying distance. You could use lower poly models, or use transparency to
simulate more complex geometry.
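For instance, a hedged sketch of a simple distance-based swap between two
versions of a model (the node names, camera lookup, and threshold are placeholders):
::

    onready var high_detail = $TreeHighPoly
    onready var low_detail = $TreeLowPoly

    func _process(_delta):
        var camera = get_viewport().get_camera()
        var distance = global_transform.origin.distance_to(camera.global_transform.origin)
        var use_high_detail = distance < 50.0
        high_detail.visible = use_high_detail
        low_detail.visible = not use_high_detail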
Billboards and imposters
~~~~~~~~~~~~~~~~~~~~~~~~
The simplest version of using transparency to deal with LOD is billboards. For
example, you can use a single transparent quad to represent a tree at distance.
This can be very cheap to render, unless of course, there are many trees in
front of each other, in which case transparency may start eating into fill rate
(for more information on fill rate, see :ref:`doc_gpu_optimization`).
An alternative is to render not just one tree, but a number of trees together as
a group. This can be especially effective if you can see an area but cannot
physically approach it in a game.
You can make imposters by pre-rendering views of an object at different angles.
Or you can even go one step further, and periodically re-render a view of an
object onto a texture to be used as an imposter. At a distance, you need to move
the viewer a considerable distance for the angle of view to change
significantly. This can be complex to get working, but may be worth it depending
on the type of project you are making.
Use instancing (MultiMesh)
~~~~~~~~~~~~~~~~~~~~~~~~~~
If several identical objects have to be drawn in the same place or nearby, try
using :ref:`MultiMesh <class_MultiMesh>` instead. MultiMesh allows the drawing
of many thousands of objects at very little performance cost, making it ideal
for flocks, grass, particles, and anything else where you have thousands of
identical objects.
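A minimal sketch of filling a MultiMesh from code, using Godot 3.x type names
(the mesh resource path, instance count, and scattering are placeholders):
::

    onready var grass = $GrassMultiMeshInstance

    func _ready():
        var multimesh = MultiMesh.new()
        multimesh.transform_format = MultiMesh.TRANSFORM_3D
        multimesh.mesh = preload("res://grass_blade.mesh")
        multimesh.instance_count = 10000

        # One transform per blade of grass, scattered over a 100x100 area.
        for i in multimesh.instance_count:
            var position = Vector3(randf() * 100.0, 0.0, randf() * 100.0)
            multimesh.set_instance_transform(i, Transform(Basis(), position))

        grass.multimesh = multimesh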
Also see the :ref:`Using MultiMesh <doc_using_multimesh>` doc.
Bake lighting
=============
Lighting objects is one of the most costly rendering operations. Realtime
lighting, shadows (especially multiple lights), and GI are especially expensive.
They may simply be too much for lower power mobile devices to handle.
**Consider using baked lighting**, especially for mobile. This can look fantastic,
but has the downside that it will not be dynamic. Sometimes, this is a trade-off
worth making.
In general, if several lights need to affect a scene, it's best to use
:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
indirect light bounces.
Animation and skinning
======================
Animation and vertex animation such as skinning and morphing can be very
expensive on some platforms. You may need to lower the polycount considerably
for animated models or limit the number of them on screen at any one time.
Large worlds
============
If you are making large worlds, there are different considerations than what you
may be familiar with from smaller games.
Large worlds may need to be built in tiles that can be loaded on demand as you
move around the world. This can prevent memory use from getting out of hand, and
also limit the processing needed to the local area.
There may also be rendering and physics glitches due to floating point error in
large worlds. You may be able to use techniques such as orienting the world
around the player (rather than the other way around), or shifting the origin
periodically to keep things centered around ``Vector3(0, 0, 0)``.
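A rough sketch of periodic origin shifting, assuming the relevant nodes are
Spatial-based children of a single root (physics state, particles, and similar
may need extra handling):
::

    const ORIGIN_SHIFT_DISTANCE = 5000.0

    func _physics_process(_delta):
        var offset = $Player.global_transform.origin
        if offset.length() > ORIGIN_SHIFT_DISTANCE:
            # Move everything back so the player ends up near the origin again.
            for child in get_children():
                if child is Spatial:
                    child.global_translate(-offset)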