Merge pull request #4000 from clayjohn/Optimization
Initial port of optimization tutorials to master
@@ -7,7 +7,6 @@
|
||||
|
||||
introduction_to_3d
|
||||
using_transforms
|
||||
optimizing_3d_performance
|
||||
3d_rendering_limitations
|
||||
standard_material_3d
|
||||
lights_and_shadows
|
||||
|
||||
@@ -1,192 +0,0 @@
|
||||
.. meta::
|
||||
:keywords: optimization
|
||||
|
||||
.. _doc_optimizing_3d_performance:
|
||||
|
||||
Optimizing 3D performance
|
||||
=========================
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Godot follows a balanced performance philosophy. In the performance world,
|
||||
there are always trade-offs, which consist of trading speed for
|
||||
usability and flexibility. Some practical examples of this are:
|
||||
|
||||
- Rendering large numbers of objects efficiently is easy, but when a
|
||||
large scene must be rendered, it can become inefficient. To solve
|
||||
this, visibility computation must be added to the rendering, which
|
||||
makes rendering less efficient, but, at the same time, fewer objects are
|
||||
rendered, so efficiency overall improves.
|
||||
- Configuring the properties of every material for every object that
|
||||
needs to be rendered is also slow. To solve this, objects are sorted
|
||||
by material to reduce the costs, but at the same time sorting has a
|
||||
cost.
|
||||
- In 3D physics a similar situation happens. The best algorithms to
|
||||
handle large numbers of physics objects (such as sweep and prune, SAP) are slow
|
||||
at insertion/removal of objects and ray-casting. Algorithms that
|
||||
allow faster insertion and removal, as well as ray-casting, will not
|
||||
be able to handle as many active objects.
|
||||
|
||||
And there are many more examples of this! Game engines strive to be
|
||||
general purpose in nature, so balanced algorithms are always favored
|
||||
over algorithms that might be fast in some situations and slow in
|
||||
others, or algorithms that are fast but make usability more difficult.
|
||||
|
||||
Godot is not an exception and, while it is designed to have backends
|
||||
swappable for different algorithms, the default ones (or more like, the
|
||||
only ones that are there for now) prioritize balance and flexibility
|
||||
over performance.
|
||||
|
||||
With this clear, the aim of this tutorial is to explain how to get the
|
||||
maximum performance out of Godot.
|
||||
|
||||
Rendering
|
||||
~~~~~~~~~
|
||||
|
||||
3D rendering is one of the most difficult areas to get performance from,
|
||||
so this section will have a list of tips.
|
||||
|
||||
Reuse shaders and materials
|
||||
---------------------------
|
||||
|
||||
The Godot renderer works a little differently from most other renderers. It's designed
|
||||
to minimize GPU state changes as much as possible.
|
||||
:ref:`class_StandardMaterial3D`
|
||||
does a good job at reusing materials that need similar shaders but, if
|
||||
custom shaders are used, make sure to reuse them as much as possible.
|
||||
Godot's priorities will be like this:
|
||||
|
||||
- **Reusing Materials**: The fewer different materials in the
|
||||
scene, the faster the rendering will be. If a scene has a huge amount
|
||||
of objects (in the hundreds or thousands) try reusing the materials
|
||||
or in the worst case use atlases.
|
||||
- **Reusing Shaders**: If materials can't be reused, at least try to
|
||||
re-use shaders (or StandardMaterial3Ds with different parameters but the same
|
||||
configuration).
|
||||
|
||||
If a scene has, for example, 20,000 objects, each with its own material,
rendering will be slow. If the same scene has 20,000 objects but only uses
100 materials, rendering will be much faster.
|
||||
|
||||
Pixel cost vs vertex cost
|
||||
-------------------------
|
||||
|
||||
It is a common thought that the lower the number of polygons in a model, the
|
||||
faster it will be rendered. This is *really* relative and depends on
|
||||
many factors.
|
||||
|
||||
On modern PCs and consoles, vertex cost is low. GPUs
|
||||
originally only rendered triangles, so all the vertices:
|
||||
|
||||
1. Had to be transformed by the CPU (including clipping).
|
||||
|
||||
2. Had to be sent to the GPU memory from the main RAM.
|
||||
|
||||
Nowadays, all this is handled inside the GPU, so the performance is
|
||||
extremely high. 3D artists often have the wrong intuition about
polycount performance because 3D DCCs (such as Blender, Max, etc.) need
to keep geometry in CPU memory so that it can be edited, which reduces
actual performance. In practice, a 3D engine renders a model much more
efficiently than a 3D DCC displays it.
|
||||
|
||||
On mobile devices, the story is different. PC and Console GPUs are
|
||||
brute-force monsters that can pull as much electricity as they need from
|
||||
the power grid. Mobile GPUs are limited to a tiny battery, so they need
|
||||
to be a lot more power efficient.
|
||||
|
||||
To be more efficient, mobile GPUs attempt to avoid *overdraw*, which
happens when the same pixel on the screen is rendered (as in, with lighting
calculation, etc.) more than once. Imagine a town with several buildings:
GPUs don't know what is visible and what is hidden until they
|
||||
draw it. A house might be drawn and then another house in front of it
|
||||
(rendering happened twice for the same pixel!). PC GPUs normally don't
|
||||
care much about this and just throw more pixel processors at the
|
||||
hardware to increase performance (but this also increases power
|
||||
consumption).
|
||||
|
||||
On mobile, pulling more power is not an option, so a technique called
|
||||
"Tile Based Rendering" is used (almost every mobile hardware uses a
|
||||
variant of it), which divides the screen into a grid. Each cell keeps the
|
||||
list of triangles drawn to it and sorts them by depth to minimize
|
||||
*overdraw*. This technique improves performance and reduces power
|
||||
consumption, but takes a toll on vertex performance. As a result, fewer
|
||||
vertices and triangles can be processed for drawing.
|
||||
|
||||
Generally, this is not so bad, but there is a corner case on mobile that
|
||||
must be avoided, which is to have small objects with a lot of geometry
|
||||
within a small portion of the screen. This forces mobile GPUs to put a
|
||||
lot of strain on a single screen cell, considerably decreasing
|
||||
performance (as all the other cells must wait for it to complete in
|
||||
order to display the frame).
|
||||
|
||||
To make it short, do not worry about vertex count so much on mobile, but
|
||||
avoid concentration of vertices in small parts of the screen. If, for
|
||||
example, a character, NPC, vehicle, etc. is far away (so it looks tiny),
|
||||
use a lower level of detail (LOD) model instead.
|
||||
|
||||
An extra situation where vertex cost must be considered is objects that
|
||||
have extra processing per vertex, such as:
|
||||
|
||||
- Skinning (skeletal animation)
|
||||
- Morphs (shape keys)
|
||||
- Vertex Lit Objects (common on mobile)
|
||||
|
||||
Texture compression
|
||||
-------------------
|
||||
|
||||
Godot offers to compress textures of 3D models when imported (VRAM
|
||||
compression). Video RAM compression is not as efficient in size as PNG
|
||||
or JPG when stored, but increases performance enormously when drawing.
|
||||
|
||||
This is because the main goal of texture compression is bandwidth
|
||||
reduction between memory and the GPU.
|
||||
|
||||
In 3D, the shapes of objects depend more on the geometry than the
|
||||
texture, so compression is generally not noticeable. In 2D, compression
|
||||
depends more on shapes inside the textures, so the artifacts resulting
|
||||
from 2D compression are more noticeable.
|
||||
|
||||
As a warning, most Android devices do not support texture compression of
|
||||
textures with transparency (only opaque), so keep this in mind.
|
||||
|
||||
Transparent objects
|
||||
-------------------
|
||||
|
||||
As mentioned before, Godot sorts objects by material and shader to
|
||||
improve performance. This, however, cannot be done on transparent
|
||||
objects. Transparent objects are rendered from back to front to make
|
||||
blending with what is behind work. As a result, please try to keep
|
||||
transparent objects to a minimum! If an object has a small section with
|
||||
transparency, try to make that section a separate material.
|
||||
|
||||
Level of detail (LOD)
|
||||
---------------------
|
||||
|
||||
As also mentioned before, using objects with fewer vertices can improve
|
||||
performance in some cases. Godot has a simple system to change level
of detail: :ref:`GeometryInstance <class_GeometryInstance>`-based
objects have a visibility range that can be defined. Having
|
||||
several GeometryInstance objects in different ranges works as LOD.
|
||||
|
||||
Use instancing (MultiMesh)
|
||||
--------------------------
|
||||
|
||||
If several identical objects have to be drawn in the same place or
|
||||
nearby, try using :ref:`MultiMesh <class_MultiMesh>`
|
||||
instead. MultiMesh allows the drawing of tens of thousands of objects at
|
||||
very little performance cost, making it ideal for flocks, grass,
|
||||
particles, etc.
|
||||
|
||||
Bake lighting
|
||||
-------------
|
||||
|
||||
Small lights are usually not a performance issue. Shadows cost a little more.
In general, if several lights need to affect a scene, it's ideal to bake
them (:ref:`doc_baked_lightmaps`). Baking can also improve the scene quality by
|
||||
adding indirect light bounces.
|
||||
|
||||
If working on mobile, baking to texture is recommended, since this
|
||||
method is even faster.
|
||||
277
tutorials/optimization/cpu_optimization.rst
Normal file
@@ -0,0 +1,277 @@
|
||||
.. _doc_cpu_optimization:
|
||||
|
||||
CPU optimization
|
||||
================
|
||||
|
||||
Measuring performance
|
||||
=====================
|
||||
|
||||
We have to know where the "bottlenecks" are to know how to speed up our program.
|
||||
Bottlenecks are the slowest parts of the program that limit the rate at which
everything can progress. Focusing on bottlenecks allows us to concentrate our
|
||||
efforts on optimizing the areas which will give us the greatest speed
|
||||
improvement, instead of spending a lot of time optimizing functions that will
|
||||
lead to small performance improvements.
|
||||
|
||||
For the CPU, the easiest way to identify bottlenecks is to use a profiler.
|
||||
|
||||
CPU profilers
|
||||
=============
|
||||
|
||||
Profilers run alongside your program and take timing measurements to work out
|
||||
what proportion of time is spent in each function.
|
||||
|
||||
The Godot IDE conveniently has a built-in profiler. It does not run every time
|
||||
you start your project: it must be manually started and stopped. This is
|
||||
because, as with most profilers, recording these timing measurements can
|
||||
slow down your project significantly.
|
||||
|
||||
After profiling, you can look back at the results for a frame.
|
||||
|
||||
.. figure:: img/godot_profiler.png
   :alt: Screenshot of the Godot profiler

   Results of a profile of one of the demo projects.
|
||||
|
||||
.. note:: We can see the cost of built-in processes such as physics and audio,
|
||||
as well as seeing the cost of our own scripting functions at the
|
||||
bottom.
|
||||
|
||||
Time spent waiting for various built-in servers may not be counted in
|
||||
the profilers. This is a known bug.
|
||||
|
||||
When a project is running slowly, you will often see an obvious function or
|
||||
process taking a lot more time than others. This is your primary bottleneck, and
|
||||
you can usually increase speed by optimizing this area.
|
||||
|
||||
For more info about using Godot's built-in profiler, see
|
||||
:ref:`doc_debugger_panel`.
|
||||
|
||||
External profilers
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Although the Godot IDE profiler is very convenient and useful, sometimes you
|
||||
need more power, and the ability to profile the Godot engine source code itself.
|
||||
|
||||
You can use a number of third party profilers to do this including
|
||||
`Valgrind <https://www.valgrind.org/>`__,
|
||||
`VerySleepy <http://www.codersnotes.com/sleepy/>`__,
|
||||
`HotSpot <https://github.com/KDAB/hotspot>`__,
|
||||
`Visual Studio <https://visualstudio.microsoft.com/>`__ and
|
||||
`Intel VTune <https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html>`__.
|
||||
|
||||
.. note:: You will need to compile Godot from source to use a third-party profiler.
|
||||
This is required to obtain debugging symbols. You can also use a debug
|
||||
build, however, note that the results of profiling a debug build will
|
||||
be different to a release build, because debug builds are less
|
||||
optimized. Bottlenecks are often in a different place in debug builds,
|
||||
so you should profile release builds whenever possible.
|
||||
|
||||
.. figure:: img/valgrind.png
|
||||
:alt: Screenshot of Callgrind
|
||||
|
||||
Example results from Callgrind, which is part of Valgrind.
|
||||
|
||||
From the left, Callgrind is listing the percentage of time within a function and
|
||||
its children (Inclusive), the percentage of time spent within the function
|
||||
itself, excluding child functions (Self), the number of times the function is
|
||||
called, the function name, and the file or module.
|
||||
|
||||
In this example, we can see nearly all time is spent under the
|
||||
``Main::iteration()`` function. This is the master function in the Godot source
|
||||
code that is called repeatedly. It causes frames to be drawn, physics ticks to
|
||||
be simulated, and nodes and scripts to be updated. A large proportion of the
|
||||
time is spent in the functions to render a canvas (66%), because this example
|
||||
uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
|
||||
outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver).
|
||||
This tells us that a large proportion of CPU time is being spent in the
|
||||
graphics driver.
|
||||
|
||||
This is actually an excellent example because, in an ideal world, only a very
|
||||
small proportion of time would be spent in the graphics driver. This is an
|
||||
indication that there is a problem with too much communication and work being
|
||||
done in the graphics API. This specific profiling led to the development of 2D
|
||||
batching, which greatly speeds up 2D rendering by reducing bottlenecks in this
|
||||
area.
|
||||
|
||||
Manually timing functions
|
||||
=========================
|
||||
|
||||
Another handy technique, especially once you have identified the bottleneck
|
||||
using a profiler, is to manually time the function or area under test.
|
||||
The specifics vary depending on the language, but in GDScript, you would do
|
||||
the following:
|
||||
|
||||
::

    var time_start = OS.get_ticks_usec()

    # Your function you want to time
    update_enemies()

    var time_end = OS.get_ticks_usec()
    # The parentheses matter: "%" binds more tightly than "-".
    print("update_enemies() took %d microseconds" % (time_end - time_start))
|
||||
|
||||
When manually timing functions, it is usually a good idea to run the function
|
||||
many times (1,000 or more times), instead of just once (unless it is a very slow
|
||||
function). The reason for doing this is that timers often have limited accuracy.
|
||||
Moreover, operating systems schedule processes in a haphazard manner. Therefore, an
|
||||
average over a series of runs is more accurate than a single measurement.
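
As a minimal sketch of this averaging approach, the snippet below times many
calls and reports the mean. It reuses ``update_enemies()`` from the example
above as a stand-in for your own function. ``RUNS`` and
``time_average_usec()`` are hypothetical names chosen for illustration.

::

    const RUNS = 1000

    # Returns the average duration of one update_enemies() call, in microseconds.
    func time_average_usec():
        var time_start = OS.get_ticks_usec()
        for i in range(RUNS):
            update_enemies()
        var time_end = OS.get_ticks_usec()
        # Divide as floats so the fractional part of the average is kept.
        return float(time_end - time_start) / RUNS

The loop overhead is included in the measurement, so treat the result as a
relative figure for comparing two versions of a function rather than as an
absolute cost.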
|
||||
|
||||
As you attempt to optimize functions, be sure to either repeatedly profile or
|
||||
time them as you go. This will give you crucial feedback as to whether the
|
||||
optimization is working (or not).
|
||||
|
||||
Caches
|
||||
======
|
||||
|
||||
CPU caches are something else to be particularly aware of, especially when
|
||||
comparing timing results of two different versions of a function. The results
|
||||
can be highly dependent on whether the data is in the CPU cache or not. CPUs
|
||||
don't load data directly from the system RAM, even though it's huge in
|
||||
comparison to the CPU cache (several gigabytes instead of a few megabytes). This
|
||||
is because system RAM is very slow to access. Instead, CPUs load data from a
|
||||
smaller, faster bank of memory called cache. Loading data from cache is very
|
||||
fast, but every time you try to load a memory address that is not stored in
|
||||
cache, the cache must make a trip to main memory and slowly load in some data.
|
||||
This delay can result in the CPU sitting around idle for a long time, and is
|
||||
referred to as a "cache miss".
|
||||
|
||||
This means that the first time you run a function, it may run slowly because the
|
||||
data is not in the CPU cache. The second and later times, it may run much faster
|
||||
because the data is in the cache. Due to this, always use averages when timing,
|
||||
and be aware of the effects of cache.
|
||||
|
||||
Understanding caching is also crucial to CPU optimization. If you have an
|
||||
algorithm (routine) that loads small bits of data from randomly spread out areas
|
||||
of main memory, this can result in a lot of cache misses. A lot of the time, the
CPU will be waiting around for data instead of doing any work. Instead, if you
|
||||
can make your data accesses localised, or even better, access memory in a linear
|
||||
fashion (like a continuous list), then the cache will work optimally and the CPU
|
||||
will be able to work as fast as possible.
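
The difference between linear and scattered access can be sketched as follows.
This is an illustration under assumptions, not a benchmark: all function and
variable names are hypothetical, and in GDScript the interpreter overhead may
hide much of the effect, which tends to be far more visible in compiled code.

::

    # Sums values by visiting them in the order given by `order`.
    func sum_in_order(values, order):
        var total = 0
        for index in order:
            total += values[index]
        return total

    func compare_access_patterns():
        var values = range(100000)        # array of integers
        var linear_order = range(100000)  # 0, 1, 2, ... (cache-friendly)
        var random_order = range(100000)
        random_order.shuffle()            # same indices, scattered order

        var t0 = OS.get_ticks_usec()
        sum_in_order(values, linear_order)
        var t1 = OS.get_ticks_usec()
        sum_in_order(values, random_order)
        var t2 = OS.get_ticks_usec()
        print("linear: %d usec, scattered: %d usec" % [t1 - t0, t2 - t1])

Both calls do the same amount of arithmetic, so any difference in timing comes
from the order in which memory is visited.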
|
||||
|
||||
Godot usually takes care of such low-level details for you. For example, the
|
||||
Server APIs make sure data is optimized for caching already for things like
|
||||
rendering and physics. Still, you should be especially aware of caching when
|
||||
using :ref:`GDNative <toc-tutorials-gdnative>`.
|
||||
|
||||
Languages
|
||||
=========
|
||||
|
||||
Godot supports a number of different languages, and it is worth bearing in mind
|
||||
that there are trade-offs involved. Some languages are designed for ease of use
|
||||
at the cost of speed, and others are faster but more difficult to work with.
|
||||
|
||||
Built-in engine functions run at the same speed regardless of the scripting
|
||||
language you choose. If your project is making a lot of calculations in its own
|
||||
code, consider moving those calculations to a faster language.
|
||||
|
||||
GDScript
|
||||
~~~~~~~~
|
||||
|
||||
:ref:`GDScript <toc-learn-scripting-gdscript>` is designed to be easy to use and iterate,
|
||||
and is ideal for making many types of games. However, in this language, ease of
|
||||
use is considered more important than performance. If you need to make heavy
|
||||
calculations, consider moving some of your project to one of the other
|
||||
languages.
|
||||
|
||||
C#
|
||||
~~
|
||||
|
||||
:ref:`C# <toc-learn-scripting-C#>` is popular and has first-class support in Godot. It
|
||||
offers a good compromise between speed and ease of use. Beware of possible
|
||||
garbage collection pauses and leaks that can occur during gameplay, though. A
|
||||
common approach to work around issues with garbage collection is to use *object
|
||||
pooling*, which is outside the scope of this guide.
|
||||
|
||||
Other languages
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Third parties provide support for several other languages, including `Rust
|
||||
<https://github.com/godot-rust/godot-rust>`_ and `Javascript
|
||||
<https://github.com/GodotExplorer/ECMAScript>`_.
|
||||
|
||||
C++
|
||||
~~~
|
||||
|
||||
Godot is written in C++. Using C++ will usually result in the fastest code.
|
||||
However, on a practical level, it is the most difficult to deploy to end users'
|
||||
machines on different platforms. Options for using C++ include
|
||||
:ref:`GDNative <toc-tutorials-gdnative>` and
|
||||
:ref:`custom modules <doc_custom_modules_in_c++>`.
|
||||
|
||||
Threads
|
||||
=======
|
||||
|
||||
Consider using threads when making a lot of calculations that can run in
|
||||
parallel to each other. Modern CPUs have multiple cores, each one capable of
|
||||
doing a limited amount of work. By spreading work over multiple threads, you can
|
||||
move further towards peak CPU efficiency.
|
||||
|
||||
The disadvantage of threads is that you have to be incredibly careful. As each
|
||||
CPU core operates independently, they can end up trying to access the same
|
||||
memory at the same time. One thread can be reading from a variable while another
|
||||
is writing: this is called a *race condition*. Before you use threads, make sure
|
||||
you understand the dangers and how to try and prevent these race conditions.
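
One common way to prevent such a race is to guard the shared variable with a
``Mutex``. The sketch below assumes the Godot 3.x form of ``Thread.start()``,
which takes an instance, a method name, and optional userdata; newer versions
take a ``Callable`` instead. The names ``counter`` and ``_add_to_counter()``
are hypothetical.

::

    extends Node

    var counter = 0
    var mutex = Mutex.new()
    var threads = []

    func _ready():
        for i in range(4):
            var thread = Thread.new()
            # Godot 3.x signature (assumption): start(instance, method, userdata).
            thread.start(self, "_add_to_counter", 1000)
            threads.append(thread)
        # Wait for every worker before reading the shared value.
        for thread in threads:
            thread.wait_to_finish()
        print(counter)

    func _add_to_counter(times):
        for i in range(times):
            mutex.lock()
            counter += 1  # Only one thread at a time can run this line.
            mutex.unlock()

Without the ``lock()`` and ``unlock()`` calls, two threads could read the same
old value of ``counter`` and one of the increments would be lost. Locking has a
cost of its own, so keep the protected section as short as possible.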
|
||||
|
||||
Threads can also make debugging considerably more difficult. The GDScript
|
||||
debugger doesn't support setting up breakpoints in threads yet.
|
||||
|
||||
For more information on threads, see :ref:`doc_using_multiple_threads`.
|
||||
|
||||
SceneTree
|
||||
=========
|
||||
|
||||
Although Nodes are an incredibly powerful and versatile concept, be aware that
|
||||
every node has a cost. Built-in functions such as ``_process()`` and
``_physics_process()`` propagate through the tree. This housekeeping can reduce
|
||||
performance when you have very large numbers of nodes (usually in the thousands).
|
||||
|
||||
Each node is handled individually in the Godot renderer. Therefore, a smaller
|
||||
number of nodes with more in each can lead to better performance.
|
||||
|
||||
One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
|
||||
get much better performance by removing nodes from the SceneTree, rather than by
|
||||
pausing or hiding them. You don't have to delete a detached node. You can, for
|
||||
example, keep a reference to a node, detach it from the scene tree using
|
||||
:ref:`Node.remove_child(node) <class_Node_method_remove_child>`, then reattach
|
||||
it later using :ref:`Node.add_child(node) <class_Node_method_add_child>`.
|
||||
This can be very useful for adding and removing areas from a game, for example.
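
A minimal sketch of this detach and reattach pattern is shown below. The
script is assumed to be attached to the parent of the node being detached, and
``detached_area``, ``detach_area()`` and ``reattach_area()`` are hypothetical
names.

::

    extends Node

    # Reference that keeps track of the detached node.
    var detached_area = null

    # Detaches `area` (a child of this node) without freeing it.
    func detach_area(area):
        remove_child(area)
        detached_area = area

    # Reattaches the previously detached node, if there is one.
    func reattach_area():
        if detached_area != null:
            add_child(detached_area)
            detached_area = null

A detached node is not freed automatically. If you decide you no longer need
it, free it yourself (for example with ``free()``), otherwise it stays in
memory.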
|
||||
|
||||
You can avoid the SceneTree altogether by using Server APIs. For more
|
||||
information, see :ref:`doc_using_servers`.
|
||||
|
||||
Physics
|
||||
=======
|
||||
|
||||
In some situations, physics can end up becoming a bottleneck. This is
|
||||
particularly the case with complex worlds and large numbers of physics objects.
|
||||
|
||||
Here are some techniques to speed up physics:
|
||||
|
||||
- Try using simplified versions of your rendered geometry for collision shapes.
|
||||
Often, this won't be noticeable for end users, but can greatly increase
|
||||
performance.
|
||||
- Try removing objects from physics when they are out of view / outside the
|
||||
current area, or reusing physics objects (maybe you allow 8 monsters per area,
|
||||
for example, and reuse these).
|
||||
|
||||
Another crucial aspect to physics is the physics tick rate. In some games, you
|
||||
can greatly reduce the tick rate. Instead of updating physics 60 times per
second, for example, you may update it only 30 or even 20 times per second.
|
||||
This can greatly reduce the CPU load.
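
The tick rate can be changed from code. This is a minimal sketch that assumes
the Godot 3.x property name ``Engine.iterations_per_second``; the property is
named differently in other versions, so check the ``Engine`` class reference
for the version you are using.

::

    extends Node

    func _ready():
        # Run physics 30 times per second instead of the default 60.
        # Property name as in Godot 3.x (assumption).
        Engine.iterations_per_second = 30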
|
||||
|
||||
The downside of changing physics tick rate is you can get jerky movement or
|
||||
jitter when the physics update rate does not match the frames per second
|
||||
rendered. Also, decreasing the physics tick rate will increase input lag.
|
||||
It's recommended to stick to the default physics tick rate (60 Hz) in most games
|
||||
that feature real-time player movement.
|
||||
|
||||
The solution to jitter is to use *fixed timestep interpolation*, which involves
|
||||
smoothing the rendered positions and rotations over multiple frames to match the
|
||||
physics. You can either implement this yourself or use a
|
||||
`third-party addon <https://github.com/lawnjelly/smoothing-addon>`__.
|
||||
Performance-wise, interpolation is a very cheap operation compared to running a
|
||||
physics tick. It's orders of magnitude faster, so this can be a significant
|
||||
performance win while also reducing jitter.
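
If you implement interpolation yourself, the core idea can be sketched as
follows. This is a simplified illustration for 2D positions only, not the
implementation used by the addon linked above. It assumes the script is
attached to a node that holds the visuals and sits outside the physics body's
subtree. ``target`` and ``follow()`` are hypothetical names.

::

    extends Node2D

    # Physics body to follow; set it with follow().
    var target = null
    var previous_position = Vector2()
    var current_position = Vector2()

    func follow(new_target):
        target = new_target
        # Start both samples at the same place to avoid an initial glide.
        previous_position = target.global_position
        current_position = target.global_position

    func _physics_process(_delta):
        if target == null:
            return
        # Remember the last two physics positions of the target.
        previous_position = current_position
        current_position = target.global_position

    func _process(_delta):
        if target == null:
            return
        # Fraction of the current physics tick that has elapsed at render time.
        var fraction = Engine.get_physics_interpolation_fraction()
        global_position = previous_position + (current_position - previous_position) * fraction

The visual node always trails the physics body by up to one tick, which is the
price paid for smooth movement. Rotation can be handled in the same way.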
|
||||
297
tutorials/optimization/general_optimization.rst
Normal file
@@ -0,0 +1,297 @@
|
||||
.. _doc_general_optimization:
|
||||
|
||||
General optimization tips
|
||||
=========================
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
In an ideal world, computers would run at infinite speed. The only limit to
|
||||
what we could achieve would be our imagination. However, in the real world, it's
|
||||
all too easy to produce software that will bring even the fastest computer to
|
||||
its knees.
|
||||
|
||||
Thus, designing games and other software is a compromise between what we would
|
||||
like to be possible, and what we can realistically achieve while maintaining
|
||||
good performance.
|
||||
|
||||
To achieve the best results, we have two approaches:
|
||||
|
||||
- Work faster.
|
||||
- Work smarter.
|
||||
|
||||
And preferably, we will use a blend of the two.
|
||||
|
||||
Smoke and mirrors
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
Part of working smarter is recognizing that, in games, we can often get the
|
||||
player to believe they're in a world that is far more complex, interactive, and
|
||||
graphically exciting than it really is. A good programmer is a magician, and
|
||||
should strive to learn the tricks of the trade while trying to invent new ones.
|
||||
|
||||
The nature of slowness
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To the outside observer, performance problems are often lumped together.
|
||||
But in reality, there are several different kinds of performance problems:
|
||||
|
||||
- A slow process that occurs every frame, leading to a continuously low frame
|
||||
rate.
|
||||
- An intermittent process that causes "spikes" of slowness, leading to
|
||||
stalls.
|
||||
- A slow process that occurs outside of normal gameplay, for instance,
|
||||
when loading a level.
|
||||
|
||||
Each of these is annoying to the user, but in different ways.
|
||||
|
||||
Measuring performance
|
||||
=====================
|
||||
|
||||
Probably the most important tool for optimization is the ability to measure
|
||||
performance - to identify where bottlenecks are, and to measure the success of
|
||||
our attempts to speed them up.
|
||||
|
||||
There are several methods of measuring performance, including:
|
||||
|
||||
- Putting a start/stop timer around code of interest.
|
||||
- Using the Godot profiler.
|
||||
- Using external third-party CPU profilers.
|
||||
- Using GPU profilers/debuggers such as
|
||||
`NVIDIA Nsight Graphics <https://developer.nvidia.com/nsight-graphics>`__
|
||||
or `apitrace <https://apitrace.github.io/>`__.
|
||||
- Checking the frame rate (with V-Sync disabled).
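
As a small illustration of the last method, the script below prints the frame
rate once per second. It is a sketch meant to be attached to any node in the
scene, and ``elapsed`` is a hypothetical variable name.

::

    extends Node

    var elapsed = 0.0

    func _process(delta):
        elapsed += delta
        if elapsed >= 1.0:
            elapsed = 0.0
            print("FPS: ", Engine.get_frames_per_second())

With V-Sync enabled, the printed value is capped at the monitor's refresh rate,
which is why the frame rate should be checked with V-Sync disabled.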
|
||||
|
||||
Be very aware that the relative performance of different areas can vary on
|
||||
different hardware. It's often a good idea to measure timings on more than one
|
||||
device. This is especially the case if you're targeting mobile devices.
|
||||
|
||||
Limitations
|
||||
~~~~~~~~~~~
|
||||
|
||||
CPU profilers are often the go-to method for measuring performance. However,
|
||||
they don't always tell the whole story.
|
||||
|
||||
- Bottlenecks are often on the GPU, "as a result" of instructions given by the
|
||||
CPU.
|
||||
- Spikes can occur in the operating system processes (outside of Godot) "as a
|
||||
result" of instructions used in Godot (for example, dynamic memory allocation).
|
||||
- You may not always be able to profile specific devices like a mobile phone
|
||||
due to the initial setup required.
|
||||
- You may have to solve performance problems that occur on hardware you don't
|
||||
have access to.
|
||||
|
||||
As a result of these limitations, you often need to use detective work to find
|
||||
out where bottlenecks are.
|
||||
|
||||
Detective work
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
Detective work is a crucial skill for developers (both in terms of performance,
|
||||
and also in terms of bug fixing). This can include hypothesis testing, and
|
||||
binary search.
|
||||
|
||||
Hypothesis testing
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Say, for example, that you believe sprites are slowing down your game.
|
||||
You can test this hypothesis by:
|
||||
|
||||
- Measuring the performance when you add more sprites, or take some away.
|
||||
|
||||
This may lead to a further hypothesis: does the size of the sprite determine
|
||||
the performance drop?
|
||||
|
||||
- You can test this by keeping everything the same, but changing the sprite
|
||||
size, and measuring performance.
|
||||
|
||||
Binary search
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
If you know that frames are taking much longer than they should, but you're
not sure where the bottleneck lies, you could begin by commenting out
|
||||
approximately half the routines that occur on a normal frame. Has the
|
||||
performance improved more or less than expected?
|
||||
|
||||
Once you know which of the two halves contains the bottleneck, you can
|
||||
repeat this process until you've pinned down the problematic area.
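
Instead of commenting code out, you can also guard each routine with a switch.
The sketch below is purely illustrative: the four routines are hypothetical
placeholders for whatever your project does each frame.

::

    extends Node

    # Set half of these to false, measure, then narrow down further.
    var run_ai = true
    var run_pathfinding = true
    var run_particles = true
    var run_ui = true

    func _process(delta):
        if run_ai:
            update_ai(delta)
        if run_pathfinding:
            update_pathfinding(delta)
        if run_particles:
            update_particles(delta)
        if run_ui:
            update_ui(delta)

    # Placeholder routines standing in for real per-frame work.
    func update_ai(_delta):
        pass

    func update_pathfinding(_delta):
        pass

    func update_particles(_delta):
        pass

    func update_ui(_delta):
        pass

If disabling the first two routines restores the frame rate, the bottleneck is
in one of them, and the next step is to re-enable one of the two and measure
again.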
|
||||
|
||||
Profilers
|
||||
=========
|
||||
|
||||
Profilers allow you to time your program while running it. Profilers then
|
||||
provide results telling you what percentage of time was spent in different
|
||||
functions and areas, and how often functions were called.
|
||||
|
||||
This can be very useful both to identify bottlenecks and to measure the results
|
||||
of your improvements. Sometimes, attempts to improve performance can backfire
|
||||
and lead to slower performance.
|
||||
**Always use profiling and timing to guide your efforts.**
|
||||
|
||||
For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
|
||||
|
||||
Principles
|
||||
==========
|
||||
|
||||
`Donald Knuth <https://en.wikipedia.org/wiki/Donald_Knuth>`__ said:
|
||||
|
||||
*Programmers waste enormous amounts of time thinking about, or worrying
|
||||
about, the speed of noncritical parts of their programs, and these attempts
|
||||
at efficiency actually have a strong negative impact when debugging and
|
||||
maintenance are considered. We should forget about small efficiencies, say
|
||||
about 97% of the time: premature optimization is the root of all evil. Yet
|
||||
we should not pass up our opportunities in that critical 3%.*
|
||||
|
||||
The messages are very important:
|
||||
|
||||
- Developer time is limited. Instead of blindly trying to speed up
|
||||
all aspects of a program, we should concentrate our efforts on the aspects
|
||||
that really matter.
|
||||
- Efforts at optimization often end up with code that is harder to read and
|
||||
debug than non-optimized code. It is in our interests to limit this to areas
|
||||
that will really benefit.
|
||||
|
||||
Just because we *can* optimize a particular bit of code, it doesn't necessarily
|
||||
mean that we *should*. Knowing when and when not to optimize is a great skill to
|
||||
develop.
|
||||
|
||||
One misleading aspect of the quote is that people tend to focus on the subquote
|
||||
*"premature optimization is the root of all evil"*. While *premature*
|
||||
optimization is (by definition) undesirable, performant software is the result
|
||||
of performant design.
|
||||
|
||||
Performant design
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
The danger with encouraging people to ignore optimization until necessary is
|
||||
that it conveniently ignores that the most important time to consider
|
||||
performance is at the design stage, before a key has even hit a keyboard. If the
|
||||
design or algorithms of a program are inefficient, then no amount of polishing
|
||||
the details later will make it run fast. It may run *faster*, but it will never
|
||||
run as fast as a program designed for performance.
|
||||
|
||||
This tends to be far more important in game or graphics programming than in
|
||||
general programming. A performant design, even without low-level optimization,
|
||||
will often run many times faster than a mediocre design with low-level
|
||||
optimization.
|
||||
|
||||
Incremental design
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Of course, in practice, unless you have prior knowledge, you are unlikely to
|
||||
come up with the best design the first time. Instead, you'll often make a series
|
||||
of versions of a particular area of code, each taking a different approach to
|
||||
the problem, until you come to a satisfactory solution. It's important not to
|
||||
spend too much time on the details at this stage until you have finalized the
|
||||
overall design. Otherwise, much of your work will be thrown out.
|
||||
|
||||
It's difficult to give general guidelines for performant design because this is
|
||||
so dependent on the problem. One point worth mentioning though, on the CPU side,
|
||||
is that modern CPUs are nearly always limited by memory bandwidth. This has led
|
||||
to a resurgence in data-oriented design, which involves designing data
|
||||
structures and algorithms for *cache locality* of data and linear access, rather
|
||||
than jumping around in memory.
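
As a rough picture of the layout idea, the Python sketch below stores particle positions in contiguous typed arrays (a "structure of arrays") instead of a list of separate objects. All names are hypothetical, and Python hides most cache effects, so read this as an illustration of the data layout rather than a benchmark.

```python
from array import array

# "Array of structures": each particle is a separate object on the heap.
particles_aos = [{"x": float(i), "y": 0.0} for i in range(1000)]

# "Structure of arrays": each field lives in one contiguous block of floats,
# so a loop over the x values walks memory linearly.
xs = array("f", (float(i) for i in range(1000)))
ys = array("f", (0.0 for _ in range(1000)))


def sum_x_aos(particles):
    # Touches one dictionary per particle.
    return sum(p["x"] for p in particles)


def sum_x_soa(x_values):
    # Reads a single contiguous buffer from front to back.
    return sum(x_values)
```

Both functions return the same total; only the memory layout they traverse differs.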
|
||||
|
||||
The optimization process
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Assuming we have a reasonable design, and taking our lessons from Knuth, our
|
||||
first step in optimization should be to identify the biggest bottlenecks - the
|
||||
slowest functions, the low-hanging fruit.
|
||||
|
||||
Once we've successfully improved the speed of the slowest area, it may no
|
||||
longer be the bottleneck. So we should test/profile again and find the next
|
||||
bottleneck on which to focus.
|
||||
|
||||
The process is thus:
|
||||
|
||||
1. Profile / Identify bottleneck.
|
||||
2. Optimize bottleneck.
|
||||
3. Return to step 1.
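
The first step of this loop can be sketched in a few lines of Python. The stage names and workloads are invented for illustration, and a real project would normally rely on a proper profiler rather than hand-rolled timing.

```python
import time


def time_call(func, repeats=5):
    """Return the best wall-clock time (in seconds) of several runs of func."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def find_bottleneck(stages):
    """Time every stage and return (name, seconds) for the slowest one."""
    timings = {name: time_call(func) for name, func in stages.items()}
    slowest = max(timings, key=timings.get)
    return slowest, timings[slowest]


# Hypothetical per-frame stages with very different costs.
stages = {
    "ai": lambda: sum(range(1_000)),
    "physics": lambda: sum(range(200_000)),
    "audio": lambda: sum(range(100)),
}
```

Calling ``find_bottleneck(stages)`` points at the stage to optimize first. After changing it, the measurement is run again, because a different stage may now be the slowest.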
|
||||
|
||||
Optimizing bottlenecks
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Some profilers will even tell you which parts of a function (which data accesses,
|
||||
calculations) are slowing things down.
|
||||
|
||||
As with design, you should concentrate your efforts first on making sure the
|
||||
algorithms and data structures are the best they can be. Data access should be
|
||||
local (to make best use of CPU cache), and it can often be better to use compact
|
||||
storage of data (again, always profile to test results). Often, you can precalculate
|
||||
heavy computations ahead of time. This can be done by performing the computation
|
||||
when loading a level, by loading a file containing precalculated data or simply
|
||||
by storing the results of complex calculations into a script constant and
|
||||
reading its value.
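
A minimal sketch of precalculation, written in Python for illustration: a sine table is built once and then read instead of being recomputed. The table size and function name are arbitrary choices, and whether a lookup actually beats the direct computation depends on the language and hardware, so profile before adopting it.

```python
import math

TABLE_SIZE = 360  # one entry per degree; an arbitrary choice for this sketch

# Computed once, for example while a level is loading.
SINE_TABLE = [math.sin(math.radians(degrees)) for degrees in range(TABLE_SIZE)]


def fast_sin_degrees(degrees):
    """Look up sin() for a whole number of degrees instead of recomputing it."""
    return SINE_TABLE[degrees % TABLE_SIZE]
```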
|
||||
|
||||
Once algorithms and data are good, you can often make small changes in routines
|
||||
which improve performance. For instance, you can move some calculations outside
|
||||
of loops or transform nested ``for`` loops into non-nested loops.
|
||||
(This should be feasible if you know a 2D array's width or height in advance.)
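
Both changes can be shown on a small grid. This Python sketch assumes the grid is stored row by row in one flat list. The variable names are placeholders.

```python
WIDTH = 4
HEIGHT = 3
# A 2D grid stored row by row in one flat list.
grid = list(range(WIDTH * HEIGHT))
scale = 2.0
offset = 10.0


def total_nested(cells):
    total = 0.0
    for y in range(HEIGHT):
        for x in range(WIDTH):
            # scale * offset is recomputed on every iteration.
            total += cells[y * WIDTH + x] + scale * offset
    return total


def total_flat(cells):
    bias = scale * offset  # loop-invariant work moved outside the loop
    total = 0.0
    # Because WIDTH and HEIGHT are known, one loop covers every cell.
    for index in range(WIDTH * HEIGHT):
        total += cells[index] + bias
    return total
```

The two functions produce the same result, so the rewrite can be checked for correctness before its timing is measured.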
|
||||
|
||||
Always retest your timing/bottlenecks after making each change. Some changes
|
||||
will increase speed, others may have a negative effect. Sometimes, a small
|
||||
positive effect will be outweighed by the negatives of more complex code, and
|
||||
you may choose to leave out that optimization.
|
||||
|
||||
Appendix
|
||||
========
|
||||
|
||||
Bottleneck math
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
The proverb *"a chain is only as strong as its weakest link"* applies directly to
|
||||
performance optimization. If your project is spending 90% of the time in
|
||||
function ``A``, then optimizing ``A`` can have a massive effect on performance.
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 9 ms
|
||||
Everything else: 1 ms
|
||||
Total frame time: 10 ms
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 1 ms
|
||||
Everything else: 1 ms
|
||||
Total frame time: 2 ms
|
||||
|
||||
In this example, improving bottleneck ``A`` by a factor of 9 decreases
|
||||
overall frame time by 5× while increasing frames per second by 5×.
|
||||
|
||||
However, if something else is running slowly and also bottlenecking your
|
||||
project, then the same improvement can lead to less dramatic gains:
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 9 ms
|
||||
Everything else: 50 ms
|
||||
Total frame time: 59 ms
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 1 ms
|
||||
Everything else: 50 ms
|
||||
Total frame time: 51 ms
|
||||
|
||||
In this example, even though we have hugely optimized function ``A``,
|
||||
the actual gain in terms of frame rate is quite small.
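
The arithmetic behind both scenarios is easy to reproduce. The helper names in this Python sketch are made up. It assumes ``A`` and everything else run one after another.

```python
def frame_time_ms(a_ms, everything_else_ms):
    """Total frame time when function A and the rest run one after another."""
    return a_ms + everything_else_ms


def speedup(before_ms, after_ms):
    """How many times faster the frame became."""
    return before_ms / after_ms


# First scenario: A dominates the frame.
first = speedup(frame_time_ms(9, 1), frame_time_ms(1, 1))     # 10 ms -> 2 ms
# Second scenario: the same 9x improvement to A, but the rest takes 50 ms.
second = speedup(frame_time_ms(9, 50), frame_time_ms(1, 50))  # 59 ms -> 51 ms
```

Here ``first`` is 5.0, while ``second`` is only about 1.16, which corresponds to going from roughly 17 to roughly 20 frames per second.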
|
||||
|
||||
In games, things become even more complicated because the CPU and GPU run
|
||||
independently of one another. Your total frame time is determined by the slower
|
||||
of the two.
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
CPU: 9 ms
|
||||
GPU: 50 ms
|
||||
Total frame time: 50 ms
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
CPU: 1 ms
|
||||
GPU: 50 ms
|
||||
Total frame time: 50 ms
|
||||
|
||||
In this example, we optimized the CPU hugely again, but the frame time didn't
|
||||
improve because we are GPU-bottlenecked.
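
Under the simplifying assumption that the CPU and GPU overlap perfectly, the model used in this example is just a maximum. The function names in this Python sketch are illustrative.

```python
def parallel_frame_time_ms(cpu_ms, gpu_ms):
    """With CPU and GPU working in parallel, the slower one sets the frame time."""
    return max(cpu_ms, gpu_ms)


def bottleneck(cpu_ms, gpu_ms):
    """Name the side that currently limits the frame time."""
    return "CPU" if cpu_ms > gpu_ms else "GPU"
```

With these definitions, ``parallel_frame_time_ms(9, 50)`` and ``parallel_frame_time_ms(1, 50)`` both give 50, matching the two tables above.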
|
||||
280
tutorials/optimization/gpu_optimization.rst
Normal file
@@ -0,0 +1,280 @@
|
||||
.. _doc_gpu_optimization:
|
||||
|
||||
GPU optimization
|
||||
================
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
The demand for new graphics features and progress almost guarantees that you
|
||||
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
|
||||
instance in calculations inside the Godot engine to prepare objects for
|
||||
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
|
||||
sorts instructions to pass to the GPU, and in the transfer of these
|
||||
instructions. And finally, bottlenecks also occur on the GPU itself.
|
||||
|
||||
Where bottlenecks occur in rendering is highly hardware-specific.
|
||||
Mobile GPUs in particular may struggle with scenes that run easily on desktop.
|
||||
|
||||
Understanding and investigating GPU bottlenecks is slightly different to the
|
||||
situation on the CPU. This is because, often, you can only change performance
|
||||
indirectly by changing the instructions you give to the GPU. Also, it may be
|
||||
more difficult to take measurements. In many cases, the only way of measuring
|
||||
performance is by examining changes in the time spent rendering each frame.
|
||||
|
||||
Draw calls, state changes, and APIs
|
||||
===================================
|
||||
|
||||
.. note:: The following section is not relevant to end-users, but is useful to
|
||||
provide background information that is relevant in later sections.
|
||||
|
||||
Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
|
||||
Vulkan). The communication and driver activity involved can be quite costly,
|
||||
especially in OpenGL and OpenGL ES. If we can provide these instructions in a
|
||||
way that is preferred by the driver and GPU, we can greatly increase
|
||||
performance.
|
||||
|
||||
Nearly every API command in OpenGL requires a certain amount of validation to
|
||||
make sure the GPU is in the correct state. Even seemingly simple commands can
|
||||
lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
|
||||
reduce these instructions to a bare minimum and group together similar objects
|
||||
as much as possible so they can be rendered together, or with the minimum number
|
||||
of these expensive state changes.
|
||||
|
||||
2D batching
|
||||
~~~~~~~~~~~
|
||||
|
||||
In 2D, the costs of treating each item individually can be prohibitively high -
|
||||
there can easily be thousands of them on the screen. This is why 2D *batching*
|
||||
is used. Multiple similar items are grouped together and rendered in a batch,
|
||||
via a single draw call, rather than making a separate draw call for each item.
|
||||
In addition, this means state changes, material and texture changes can be kept
|
||||
to a minimum.
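
The grouping idea can be sketched in Python as follows. This is not Godot's batching implementation. The item list and texture names are invented, and the sketch only merges consecutive items so that the original draw order is preserved.

```python
from itertools import groupby

# Hypothetical draw list in drawing order: (item name, texture used).
items = [
    ("grass_1", "terrain.png"),
    ("grass_2", "terrain.png"),
    ("rock_1", "terrain.png"),
    ("player", "hero.png"),
    ("coin_1", "items.png"),
    ("coin_2", "items.png"),
]


def build_batches(draw_list):
    """Merge consecutive items that share a texture into one batch."""
    return [
        (texture, [name for name, _ in group])
        for texture, group in groupby(draw_list, key=lambda item: item[1])
    ]


batches = build_batches(items)
```

Six items end up in three batches here, so three draw calls would replace six.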
|
||||
|
||||
3D batching
|
||||
~~~~~~~~~~~
|
||||
|
||||
In 3D, we still aim to minimize draw calls and state changes. However, it can be
|
||||
more difficult to batch together several objects into a single draw call. 3D
|
||||
meshes tend to comprise hundreds or thousands of triangles, and combining large
|
||||
meshes in real-time is prohibitively expensive. The cost of joining them quickly
|
||||
exceeds any benefits as the number of triangles grows per mesh. A much better
|
||||
alternative is to **join meshes ahead of time** (static meshes in relation to each
|
||||
other). This can either be done by artists, or programmatically within Godot.
|
||||
|
||||
There is also a cost to batching together objects in 3D. Several objects
|
||||
rendered as one cannot be individually culled. An entire city that is off-screen
|
||||
will still be rendered if it is joined to a single blade of grass that is on
|
||||
screen. Thus, you should always take objects' location and culling into account
|
||||
when attempting to batch 3D objects together. Despite this, the benefits of
|
||||
joining static objects often outweigh other considerations, especially for large
|
||||
numbers of distant or low-poly objects.
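
The culling cost of joining objects can be made concrete with a toy example. In this Python sketch the view is a 2D rectangle instead of a frustum, and the objects and triangle counts are invented.

```python
def overlaps(a, b):
    """Axis-aligned overlap test for (min_x, min_y, max_x, max_y) boxes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_bounds(boxes):
    """Bounding box that encloses every box in the list."""
    return (
        min(box[0] for box in boxes),
        min(box[1] for box in boxes),
        max(box[2] for box in boxes),
        max(box[3] for box in boxes),
    )


view = (0, 0, 10, 10)  # visible region (hypothetical units)
grass = {"bounds": (1, 1, 2, 2), "triangles": 2}
city = {"bounds": (500, 500, 900, 900), "triangles": 1_000_000}

# Separate objects: each one is culled on its own.
separate = sum(o["triangles"] for o in (grass, city) if overlaps(o["bounds"], view))

# Joined mesh: one bounding box, so it is drawn in full or not at all.
joined_bounds = merge_bounds([grass["bounds"], city["bounds"]])
joined = (grass["triangles"] + city["triangles"]) if overlaps(joined_bounds, view) else 0
```

Kept separate, only the 2 grass triangles are submitted. Joined, all 1,000,002 triangles are submitted because the merged bounds touch the view.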
|
||||
|
||||
For more information on 3D-specific optimizations, see
|
||||
:ref:`doc_optimizing_3d_performance`.
|
||||
|
||||
Reuse Shaders and Materials
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The Godot renderer works a little differently from most other renderers. It's designed to
|
||||
minimize GPU state changes as much as possible. :ref:`StandardMaterial3D
|
||||
<class_StandardMaterial3D>` does a good job at reusing materials that need similar
|
||||
shaders. If custom shaders are used, make sure to reuse them as much as
|
||||
possible. Godot's priorities are:
|
||||
|
||||
- **Reusing Materials:** The fewer different materials in the
|
||||
scene, the faster the rendering will be. If a scene has a huge number
|
||||
of objects (in the hundreds or thousands), try reusing the materials.
|
||||
In the worst case, use atlases to decrease the number of texture changes.
|
||||
- **Reusing Shaders:** If materials can't be reused, at least try to
|
||||
re-use shaders (or StandardMaterial3Ds with different parameters but the same
|
||||
configuration).
|
||||
|
||||
If a scene has, for example, ``20,000`` objects with ``20,000`` different
|
||||
materials in total, rendering will be slow. If the same scene has ``20,000``
|
||||
objects, but only uses ``100`` materials, rendering will be much faster.
|
||||
|
||||
Pixel cost versus vertex cost
|
||||
=============================
|
||||
|
||||
You may have heard that the lower the number of polygons in a model, the faster
|
||||
it will be rendered. This is *really* relative and depends on many factors.
|
||||
|
||||
On modern PCs and consoles, vertex cost is low. GPUs originally only rendered
|
||||
triangles. This meant that every frame:
|
||||
|
||||
1. All vertices had to be transformed by the CPU (including clipping).
|
||||
2. All vertices had to be sent to the GPU memory from the main RAM.
|
||||
|
||||
Nowadays, all this is handled inside the GPU, greatly increasing performance.
|
||||
3D artists often have the wrong impression of polycount performance because 3D
|
||||
DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to
|
||||
be edited, reducing actual performance. Game engines rely on the GPU more, so
|
||||
they can render many triangles much more efficiently.
|
||||
|
||||
On mobile devices, the story is different. PC and console GPUs are
|
||||
brute-force monsters that can pull as much electricity as they need from
|
||||
the power grid. Mobile GPUs are limited to a tiny battery, so they need
|
||||
to be a lot more power efficient.
|
||||
|
||||
To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
|
||||
when the same pixel on the screen is being rendered more than once. Imagine a
|
||||
town with several buildings. GPUs don't know what is visible and what is hidden
|
||||
until they draw it. For example, a house might be drawn and then another house
|
||||
in front of it (which means rendering happened twice for the same pixel). PC
|
||||
GPUs normally don't care much about this and just throw more pixel processors at
the problem to increase performance (which also increases power consumption).
|
||||
|
||||
Using more power is not an option on mobile so mobile devices use a technique
|
||||
called *tile-based rendering* which divides the screen into a grid. Each cell
|
||||
keeps the list of triangles drawn to it and sorts them by depth to minimize
|
||||
*overdraw*. This technique improves performance and reduces power consumption,
|
||||
but takes a toll on vertex performance. As a result, fewer vertices and
|
||||
triangles can be processed for drawing.
|
||||
|
||||
Additionally, tile-based rendering struggles when there are small objects with a
|
||||
lot of geometry within a small portion of the screen. This forces mobile GPUs to
|
||||
put a lot of strain on a single screen tile, which considerably decreases
|
||||
performance as all the other cells must wait for it to complete before
|
||||
displaying the frame.
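
To see why concentrated geometry hurts, vertices can be binned into screen tiles and the heaviest tile inspected. The tile size in this Python sketch is an assumption (real GPUs vary), and the two vertex sets are synthetic.

```python
from collections import Counter

TILE_SIZE = 16  # pixels per tile side; real GPUs vary, this is an assumption


def busiest_tile(points, tile_size=TILE_SIZE):
    """Bin screen-space vertices into tiles and return (tile, vertex count)
    for the most heavily loaded tile."""
    load = Counter((int(x) // tile_size, int(y) // tile_size) for x, y in points)
    return load.most_common(1)[0]


# 64 vertices spread evenly over a 128x128 area: one vertex per tile.
spread_out = [(x * 16 + 8, y * 16 + 8) for x in range(8) for y in range(8)]
# The same number of vertices packed into an 8x8 pixel area.
concentrated = [(x, y) for x in range(8) for y in range(8)]
```

Both sets contain 64 vertices, but the busiest tile holds 1 vertex in the first case and all 64 in the second.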
|
||||
|
||||
To summarize, don't worry about vertex count on mobile, but
|
||||
**avoid concentration of vertices in small parts of the screen**.
|
||||
If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
|
||||
a lower level of detail (LOD) model. Even on desktop GPUs, it's preferable to
|
||||
avoid having triangles smaller than the size of a pixel on screen.
|
||||
|
||||
Pay attention to the additional vertex processing required when using:
|
||||
|
||||
- Skinning (skeletal animation)
|
||||
- Morphs (shape keys)
|
||||
- Vertex-lit objects (common on mobile)
|
||||
|
||||
Pixel/fragment shaders and fill rate
|
||||
====================================
|
||||
|
||||
In contrast to vertex processing, the costs of fragment (per-pixel) shading have
|
||||
increased dramatically over the years. Screen resolutions have increased (the
|
||||
area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
|
||||
screen, which is 27× the area), and the complexity of fragment shaders has also
|
||||
exploded. Physically-based rendering requires complex calculations for each
|
||||
fragment.
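
The pixel counts quoted above can be checked directly. The final line of this Python sketch additionally assumes a 60 FPS target and that every pixel is shaded exactly once per frame (no overdraw), which real scenes rarely achieve.

```python
uhd_pixels = 3840 * 2160  # "4K" UHD
vga_pixels = 640 * 480

ratio = uhd_pixels / vga_pixels

# Fragments shaded per second at 60 FPS with no overdraw (an idealized case).
fragments_per_second = uhd_pixels * 60
```

This gives a ratio of 27 and close to half a billion fragment shader invocations per second before any overdraw is counted.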
|
||||
|
||||
You can test whether a project is fill rate-limited quite easily. Turn off
|
||||
V-Sync to prevent capping the frames per second, then compare the frames per
|
||||
second when running with a large window, to running with a very small window.
|
||||
You may also benefit from similarly reducing your shadow map size if using
|
||||
shadows. Usually, you will find the FPS increases quite a bit using a small
|
||||
window, which indicates you are to some extent fill rate-limited. On the other
|
||||
hand, if there is little to no increase in FPS, then your bottleneck lies
|
||||
elsewhere.
|
||||
|
||||
You can increase performance in a fill rate-limited project by reducing the
|
||||
amount of work the GPU has to do. You can do this by simplifying the shader
|
||||
(perhaps turn off expensive options if you are using a :ref:`StandardMaterial3D
|
||||
<class_StandardMaterial3D>`), or reducing the number and size of textures used.
|
||||
|
||||
**When targeting mobile devices, consider using the simplest possible shaders
|
||||
you can reasonably afford to use.**
|
||||
|
||||
Reading textures
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
The other factor in fragment shaders is the cost of reading textures. Reading
|
||||
textures is an expensive operation, especially when reading from several
|
||||
textures in a single fragment shader. Also, consider that filtering may slow it
|
||||
down further (trilinear filtering between mipmaps, and averaging). Reading
|
||||
textures is also expensive in terms of power usage, which is a big issue on
|
||||
mobile devices.
|
||||
|
||||
**If you use third-party shaders or write your own shaders, try to use
|
||||
algorithms that require as few texture reads as possible.**
|
||||
|
||||
Texture compression
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
By default, Godot compresses textures of 3D models when imported using video RAM
|
||||
(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
|
||||
JPG when stored, but increases performance enormously when drawing large enough
|
||||
textures.
|
||||
|
||||
This is because the main goal of texture compression is bandwidth reduction
|
||||
between memory and the GPU.
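
A back-of-the-envelope calculation shows the scale of the saving. The 4 and 8 bits per pixel used in this Python sketch are typical rates for block-compressed formats such as BC1 and BC3. Which format Godot picks on a given platform is not covered here, and mipmaps are ignored.

```python
def texture_bytes(width, height, bits_per_pixel):
    """Size of one mip level in bytes."""
    return width * height * bits_per_pixel // 8


uncompressed = texture_bytes(2048, 2048, 32)  # RGBA8: 8 bits x 4 channels
block_4bpp = texture_bytes(2048, 2048, 4)     # a 4-bit-per-pixel block format
block_8bpp = texture_bytes(2048, 2048, 8)     # an 8-bit-per-pixel block format
```

For a 2048×2048 texture this is 16 MiB uncompressed versus 2 MiB or 4 MiB compressed, so far less data has to be moved for every texture read.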
|
||||
|
||||
In 3D, the shapes of objects depend more on the geometry than the texture, so
|
||||
compression is generally not noticeable. In 2D, compression depends more on
|
||||
shapes inside the textures, so the artifacts resulting from 2D compression are
|
||||
more noticeable.
|
||||
|
||||
As a warning, most Android devices do not support texture compression of
|
||||
textures with transparency (only opaque), so keep this in mind.
|
||||
|
||||
.. note::
|
||||
|
||||
Even in 3D, "pixel art" textures should have VRAM compression disabled as it
|
||||
will negatively affect their appearance, without improving performance
|
||||
significantly due to their low resolution.
|
||||
|
||||
|
||||
Post-processing and shadows
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Post-processing effects and shadows can also be expensive in terms of fragment
|
||||
shading activity. Always test the impact of these on different hardware.
|
||||
|
||||
**Reducing the size of shadow maps can increase performance**, both in terms of
|
||||
writing and reading the shadow maps. On top of that, the best way to improve
|
||||
performance of shadows is to turn shadows off for as many lights and objects as
|
||||
possible. Smaller or distant OmniLights/SpotLights can often have their shadows
|
||||
disabled with only a small visual impact.
|
||||
|
||||
Transparency and blending
|
||||
=========================
|
||||
|
||||
Transparent objects present particular problems for rendering efficiency. Opaque
|
||||
objects (especially in 3D) can be essentially rendered in any order and the
|
||||
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
|
||||
blended objects are different. In most cases, they cannot rely on the Z-buffer
|
||||
and must be rendered in "painter's order" (i.e. from back to front) to look
|
||||
correct.
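
Painter's order amounts to a sort by distance from the camera. This Python sketch sorts by the distance to each object's position, which is a simplification (a renderer may sort by depth along the view direction instead). The object names are invented.

```python
def distance_squared(a, b):
    """Squared distance between two points given as coordinate tuples."""
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def painters_order(objects, camera_position):
    """Return objects sorted farthest-first (back to front)."""
    return sorted(
        objects,
        key=lambda obj: distance_squared(obj["position"], camera_position),
        reverse=True,
    )


camera = (0.0, 0.0, 0.0)
glass_panes = [
    {"name": "near", "position": (0.0, 0.0, -1.0)},
    {"name": "far", "position": (0.0, 0.0, -9.0)},
    {"name": "middle", "position": (0.0, 0.0, -4.0)},
]
draw_order = [obj["name"] for obj in painters_order(glass_panes, camera)]
```

The panes are drawn far, middle, near. This sort has to be redone whenever the camera or the objects move, which is part of the cost of transparency.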
|
||||
|
||||
Transparent objects are also particularly bad for fill rate, because every item
|
||||
has to be drawn even if other transparent objects will be drawn on top
|
||||
later on.
|
||||
|
||||
Opaque objects don't have to do this. They can usually take advantage of the
|
||||
Z-buffer by first writing only to the Z-buffer, then running the
fragment shader only on the "winning" fragment, the object that is at the front at a
|
||||
particular pixel.
|
||||
|
||||
Transparency is particularly expensive where multiple transparent objects
|
||||
overlap. It is usually better to use transparent areas as small as possible to
|
||||
minimize these fill rate requirements, especially on mobile, where fill rate is
|
||||
very expensive. Indeed, in many situations, rendering more complex opaque
|
||||
geometry can end up being faster than using transparency to "cheat".
|
||||
|
||||
Multi-platform advice
|
||||
=====================
|
||||
|
||||
If you are aiming to release on multiple platforms, test *early* and test
|
||||
*often* on all your platforms, especially mobile. Developing a game on desktop
|
||||
but attempting to port it to mobile at the last minute is a recipe for disaster.
|
||||
|
||||
In general, you should design your game for the lowest common denominator, then
|
||||
add optional enhancements for more powerful platforms. For example, you may want
|
||||
to use the GLES2 backend for both desktop and mobile platforms where you target
|
||||
both.
|
||||
|
||||
Mobile/tiled renderers
|
||||
======================
|
||||
|
||||
As described above, GPUs on mobile devices work in dramatically different ways
|
||||
from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
|
||||
split up the screen into regular-sized tiles that fit into super fast cache
|
||||
memory, which reduces the number of read/write operations to the main memory.
|
||||
|
||||
There are some downsides though. Tiled rendering can make certain techniques
|
||||
much more complicated and expensive to perform. Tiles that rely on the results
|
||||
of rendering in different tiles or on the results of earlier operations being
|
||||
preserved can be very slow. Be very careful to test the performance of shaders,
|
||||
viewport textures and post-processing.
|
||||
BIN
tutorials/optimization/img/godot_profiler.png
Normal file
|
After Width: | Height: | Size: 45 KiB |
BIN
tutorials/optimization/img/lights_overlap.png
Normal file
|
After Width: | Height: | Size: 146 KiB |
BIN
tutorials/optimization/img/lights_separate.png
Normal file
|
After Width: | Height: | Size: 160 KiB |
BIN
tutorials/optimization/img/overlap1.png
Normal file
|
After Width: | Height: | Size: 96 KiB |
BIN
tutorials/optimization/img/overlap2.png
Normal file
|
After Width: | Height: | Size: 101 KiB |
BIN
tutorials/optimization/img/scissoring.png
Normal file
|
After Width: | Height: | Size: 56 KiB |
BIN
tutorials/optimization/img/valgrind.png
Normal file
|
After Width: | Height: | Size: 177 KiB |
@@ -1,9 +1,76 @@
|
||||
Optimization
|
||||
=============
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
Godot follows a balanced performance philosophy. In the performance world,
|
||||
there are always trade-offs, which consist of trading speed for usability
|
||||
and flexibility. Some practical examples of this are:
|
||||
|
||||
- Rendering large amounts of objects efficiently is easy, but when a
|
||||
large scene must be rendered, it can become inefficient. To solve this,
|
||||
visibility computation must be added to the rendering. This makes rendering
|
||||
less efficient, but at the same time, fewer objects are rendered. Therefore,
|
||||
the overall rendering efficiency is improved.
|
||||
|
||||
- Configuring the properties of every material for every object that
|
||||
needs to be rendered is also slow. To solve this, objects are sorted by
|
||||
material to reduce the costs. At the same time, sorting has a cost.
|
||||
|
||||
- In 3D physics, a similar situation happens. The best algorithms to
|
||||
handle large amounts of physics objects (such as SAP) are slow at
|
||||
insertion/removal of objects and raycasting. Algorithms that allow faster
|
||||
insertion and removal, as well as raycasting, will not be able to handle as
|
||||
many active objects.
|
||||
|
||||
And there are many more examples of this! Game engines strive to be
|
||||
general-purpose in nature. Balanced algorithms are always favored over
|
||||
algorithms that might be fast in some situations and slow in others, or
|
||||
algorithms that are fast but are more difficult to use.
|
||||
|
||||
Godot is not an exception to this. While it is designed to have backends
|
||||
swappable for different algorithms, the default backends prioritize balance and
|
||||
flexibility over performance.
|
||||
|
||||
With this clear, the aim of this tutorial section is to explain how to get the
|
||||
maximum performance out of Godot. While the tutorials can be read in any order,
|
||||
it is a good idea to start from :ref:`doc_general_optimization`.
|
||||
|
||||
Common
|
||||
------
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:name: toc-learn-features-optimization
|
||||
:name: toc-learn-features-general-optimization
|
||||
|
||||
general_optimization
|
||||
using_servers
|
||||
|
||||
CPU
|
||||
---
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:name: toc-learn-features-cpu-optimization
|
||||
|
||||
cpu_optimization
|
||||
|
||||
GPU
|
||||
---
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:name: toc-learn-features-gpu-optimization
|
||||
|
||||
gpu_optimization
|
||||
using_multimesh
|
||||
|
||||
3D
|
||||
--
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:name: toc-learn-features-3d-optimization
|
||||
|
||||
optimizing_3d_performance
|
||||
|
||||
152
tutorials/optimization/optimizing_3d_performance.rst
Normal file
@@ -0,0 +1,152 @@
|
||||
.. meta::
|
||||
:keywords: optimization
|
||||
|
||||
.. _doc_optimizing_3d_performance:
|
||||
|
||||
Optimizing 3D performance
|
||||
=========================
|
||||
|
||||
Culling
|
||||
=======
|
||||
|
||||
Godot will automatically perform view frustum culling in order to prevent
|
||||
rendering objects that are outside the viewport. This works well for games that
|
||||
take place in a small area; however, things can quickly become problematic in
|
||||
larger levels.
|
||||
|
||||
Occlusion culling
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
When walking around a town, for example, you may only be able to see a few buildings
|
||||
in the street you are in, as well as the sky and a few birds flying overhead. As
|
||||
far as a naive renderer is concerned, however, you can still see the entire town.
It won't just render the buildings in front of you; it will also render the street
behind them, the people on that street, and the buildings behind that. You
|
||||
quickly end up in situations where you are attempting to render 10× or 100× more
|
||||
than what is visible.
|
||||
|
||||
Things aren't quite as bad as they seem, because the Z-buffer usually allows the
|
||||
GPU to only fully shade the objects that are at the front. This is called *depth
|
||||
prepass* and is enabled by default in Godot when using the GLES3 renderer.
|
||||
However, unneeded objects are still reducing performance.
|
||||
|
||||
One way we can potentially reduce the amount to be rendered is to take advantage
|
||||
of occlusion. As of Godot 3.2.2, there is no built-in support for occlusion in
|
||||
Godot. However, with careful design you can still get many of the advantages.
|
||||
|
||||
For instance, in our city street scenario, you may be able to work out in advance
|
||||
that you can only see two other streets, ``B`` and ``C``, from street ``A``.
|
||||
Streets ``D`` to ``Z`` are hidden. In order to take advantage of occlusion, all
|
||||
you have to do is work out when your viewer is in street ``A`` (perhaps using
|
||||
Godot Areas), then you can hide the other streets.
|
||||
|
||||
This is a manual version of what is known as a "potentially visible set". It is
|
||||
a very powerful technique for speeding up rendering. You can also use it to
|
||||
restrict physics or AI to the local area, and speed these up as well as
|
||||
rendering.
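
A hand-authored potentially visible set can be as simple as a lookup table. The Python sketch below reuses the street names from the example, but the table contents are made up. In a Godot project the result would be applied by toggling the visibility of each street's nodes, which is omitted here.

```python
# Hand-authored visibility table: which streets can be seen from which.
# The street names follow the example above; the table itself is made up.
VISIBLE_FROM = {
    "A": {"A", "B", "C"},
    "B": {"A", "B", "D"},
}
ALL_STREETS = {"A", "B", "C", "D", "E"}


def streets_to_hide(current_street):
    """Streets that can be hidden while the viewer stands in current_street."""
    # Unknown streets fall back to showing everything, which is slow but safe.
    visible = VISIBLE_FROM.get(current_street, ALL_STREETS)
    return ALL_STREETS - visible
```

The same table could drive which streets run physics or AI, as described above.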
|
||||
|
||||
.. note::
|
||||
|
||||
In some cases, you may have to adapt your level design to add more occlusion
|
||||
opportunities. For example, you may have to add more walls to prevent the player
|
||||
from seeing too far away, which would decrease performance due to the lost
|
||||
opportunities for occlusion culling.
|
||||
|
||||
Other occlusion techniques
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
There are other occlusion techniques such as portals, automatic PVS, and
|
||||
raster-based occlusion culling. Some of these may be available through add-ons
|
||||
and may be available in core Godot in the future.
|
||||
|
||||
Transparent objects
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
|
||||
<class_Shader>` to improve performance. This, however, can not be done with
|
||||
transparent objects. Transparent objects are rendered from back to front to make
|
||||
blending with what is behind work. As a result,
|
||||
**try to use as few transparent objects as possible**. If an object has a
|
||||
small section with transparency, try to make that section a separate surface
|
||||
with its own material.
|
||||
|
||||
For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
|
||||
doc.
|
||||
|
||||
Level of detail (LOD)
|
||||
=====================
|
||||
|
||||
In some situations, particularly at a distance, it can be a good idea to
|
||||
**replace complex geometry with simpler versions**. The end user will probably
|
||||
not be able to see much difference. Consider looking at a large number of trees
|
||||
in the far distance. There are several strategies for replacing models at
|
||||
varying distance. You could use lower poly models, or use transparency to
|
||||
simulate more complex geometry.
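
One common way to choose between versions is a table of distance thresholds. The thresholds and model names in this Python sketch are placeholders, and switching purely on distance is only one possible strategy.

```python
from bisect import bisect_right

# Distance (in world units) at which each successive LOD takes over.
# Both the thresholds and the model names are placeholders.
LOD_DISTANCES = [20.0, 60.0, 150.0]
LOD_MODELS = ["tree_high", "tree_medium", "tree_low", "tree_billboard"]


def select_lod(distance):
    """Pick the model to show for an object at the given distance."""
    return LOD_MODELS[bisect_right(LOD_DISTANCES, distance)]
```

A tree 5 units away uses ``tree_high``, one 100 units away uses ``tree_low``, and anything beyond 150 units falls back to the billboard.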
|
||||
|
||||
Billboards and imposters
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The simplest version of using transparency to deal with LOD is billboards. For
|
||||
example, you can use a single transparent quad to represent a tree at distance.
|
||||
This can be very cheap to render unless, of course, there are many trees in
front of each other, in which case transparency may start eating into fill rate
|
||||
(for more information on fill rate, see :ref:`doc_gpu_optimization`).
|
||||
|
||||
An alternative is to render not just one tree, but a number of trees together as
|
||||
a group. This can be especially effective if you can see an area but cannot
|
||||
physically approach it in a game.
|
||||
|
||||
You can make imposters by pre-rendering views of an object at different angles.
|
||||
Or you can even go one step further, and periodically re-render a view of an
|
||||
object onto a texture to be used as an imposter. At a distance, you need to move
|
||||
the viewer a considerable distance for the angle of view to change
|
||||
significantly. This can be complex to get working, but may be worth it depending
|
||||
on the type of project you are making.
|
||||
|
||||
Use instancing (MultiMesh)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If several identical objects have to be drawn in the same place or nearby, try
|
||||
using :ref:`MultiMesh <class_MultiMesh>` instead. MultiMesh allows the drawing
|
||||
of many thousands of objects at very little performance cost, making it ideal
|
||||
for flocks, grass, particles, and anything else where you have thousands of
|
||||
identical objects.
|
||||
|
||||
Also see the :ref:`Using MultiMesh <doc_using_multimesh>` doc.
|
||||
|
||||
Bake lighting
|
||||
=============
|
||||
|
||||
Lighting objects is one of the most costly rendering operations. Real-time
|
||||
lighting, shadows (especially multiple lights), and GI are especially expensive.
|
||||
They may simply be too much for lower-power mobile devices to handle.
|
||||
|
||||
**Consider using baked lighting**, especially for mobile. This can look fantastic,
|
||||
but has the downside that it will not be dynamic. Sometimes, this is a trade-off
|
||||
worth making.
|
||||
|
||||
In general, if several lights need to affect a scene, it's best to use
|
||||
:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
|
||||
indirect light bounces.
|
||||
|
||||
Animation and skinning
|
||||
======================
|
||||
|
||||
Animation and vertex animation such as skinning and morphing can be very
|
||||
expensive on some platforms. You may need to lower the polycount considerably
|
||||
for animated models or limit the number of them on screen at any one time.
|
||||
|
||||
Large worlds
|
||||
============
|
||||
|
||||
If you are making large worlds, there are different considerations than what you
|
||||
may be familiar with from smaller games.
|
||||
|
||||
Large worlds may need to be built in tiles that can be loaded on demand as you
|
||||
move around the world. This can prevent memory use from getting out of hand, and
|
||||
also limit the processing needed to the local area.
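
The bookkeeping for on-demand tiles might look like the following Python sketch. The tile size, load radius and function names are assumptions, and the actual loading and freeing of scenes is left out.

```python
TILE_SIZE = 100.0  # world units per tile side (placeholder value)
LOAD_RADIUS = 1    # how many tiles to keep loaded around the player


def tile_of(x, z):
    """Grid coordinates of the tile containing world position (x, z)."""
    return (int(x // TILE_SIZE), int(z // TILE_SIZE))


def wanted_tiles(player_x, player_z):
    """Tiles that should be loaded for the player's current position."""
    cx, cz = tile_of(player_x, player_z)
    return {
        (cx + dx, cz + dz)
        for dx in range(-LOAD_RADIUS, LOAD_RADIUS + 1)
        for dz in range(-LOAD_RADIUS, LOAD_RADIUS + 1)
    }


def update_streaming(loaded, player_x, player_z):
    """Return (tiles to load, tiles to unload) given what is loaded now."""
    wanted = wanted_tiles(player_x, player_z)
    return wanted - loaded, loaded - wanted
```

When the player crosses into a neighbouring tile, only the row of tiles entering the radius is loaded and the row leaving it is unloaded, so memory use stays roughly constant.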
|
||||
|
||||
There may also be rendering and physics glitches due to floating point error in
|
||||
large worlds. You may be able to use techniques such as orienting the world
|
||||
around the player (rather than the other way around), or shifting the origin
|
||||
periodically to keep things centred around ``Vector3(0, 0, 0)``.
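
Origin shifting can be sketched as follows. The threshold is a placeholder and positions are plain tuples in this Python illustration. A real project would move its scene nodes instead.

```python
SHIFT_THRESHOLD = 1000.0  # placeholder; pick a value that suits the project


def maybe_shift_origin(player, others, world_offset):
    """If the player has strayed too far from (0, 0, 0), move everything back.

    Positions are (x, y, z) tuples. Returns the new player position, the new
    positions of the other objects, and the accumulated world offset.
    """
    if max(abs(c) for c in player) < SHIFT_THRESHOLD:
        return player, others, world_offset
    shifted = [tuple(c - p for c, p in zip(pos, player)) for pos in others]
    new_offset = tuple(o + p for o, p in zip(world_offset, player))
    return (0.0, 0.0, 0.0), shifted, new_offset
```

The accumulated offset keeps track of where the local origin sits in the full world, so absolute positions can still be recovered when needed (for saving, for instance).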
|
||||