Merge pull request #3852 from Calinou/improve-optimization-3.2

Proofread and improve the optimization guides
Rémi Verschelde
2020-09-09 15:07:13 +02:00
committed by GitHub
6 changed files with 687 additions and 639 deletions


Optimization using batching
===========================
Introduction
~~~~~~~~~~~~
Game engines have to send a set of instructions to the GPU to tell the GPU what
and where to draw. These instructions are sent using common instructions called
:abbr:`APIs (Application Programming Interfaces)`. Examples of graphics APIs are
OpenGL, OpenGL ES, and Vulkan.
Different APIs incur different costs when drawing objects. OpenGL handles a lot
of work for the user in the GPU driver at the cost of more expensive draw calls.
In the simplest case, the engine sends the GPU
one primitive at a time, telling it some information such as the texture used,
the material, the position, size, etc. then saying "Draw!" (this is called a
draw call).
While this is conceptually simple from the engine side, GPUs operate very slowly
when used in this manner. GPUs work much more efficiently if you tell them to
draw a number of similar primitives all in one draw call, which we will call a
"batch".
It turns out that they don't just work a bit faster when used in this manner;
they work a *lot* faster.
As Godot is designed to be a general-purpose engine, the primitives coming into
the Godot renderer can be in any order, sometimes similar, and sometimes
dissimilar. To match Godot's general-purpose nature with the batching
preferences of GPUs, Godot features an intermediate layer which can
automatically group together primitives wherever possible and send these batches
on to the GPU. This can give an increase in rendering performance while
requiring few (if any) changes to your Godot project.
How it works
~~~~~~~~~~~~
Instructions come into the renderer from your game in the form of a series of
items, each of which can contain one or more commands. The items correspond to
Nodes in the scene tree, and the commands correspond to primitives such as
rectangles or polygons. Some items such as TileMaps and text can contain a
large number of commands (tiles and glyphs respectively). Others, such as
sprites, may only contain a single command (a rectangle).
The batcher uses two main techniques to group together primitives:

- Consecutive items can be joined together.
- Consecutive commands within an item can be joined to form a batch.
Breaking batching
^^^^^^^^^^^^^^^^^
Batching can only take place if the items or commands are similar enough to be
rendered in one draw call. Certain changes (or techniques), by necessity, prevent
the formation of a contiguous batch; this is referred to as "breaking batching".
Batching will be broken by (amongst other things):

- Change of texture.
- Change of material.
- Change of primitive type (say, going from rectangles to lines).
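To make the batch-splitting rule concrete, here is a minimal sketch in Python. It is purely illustrative (the real batcher is C++ inside the renderer and tracks more state than three fields): a stream of draw commands splits into a new batch whenever the texture, material, or primitive type changes.

```python
from itertools import groupby

# Hypothetical simplified model of a draw command; the fields below are
# illustrative, but they are exactly the kinds of changes that break a batch.
def batch_key(cmd):
    return (cmd["texture"], cmd["material"], cmd["primitive"])

def build_batches(commands):
    """Group consecutive compatible commands into batches (one draw call each)."""
    return [list(group) for _, group in groupby(commands, key=batch_key)]

commands = [
    {"texture": "grass", "material": "m1", "primitive": "rect"},
    {"texture": "grass", "material": "m1", "primitive": "rect"},
    {"texture": "wood",  "material": "m1", "primitive": "rect"},  # texture change breaks the batch
    {"texture": "wood",  "material": "m1", "primitive": "line"},  # primitive change breaks it again
]

batches = build_batches(commands)
print(len(batches))  # 3 draw calls instead of 4
```

Note that only *consecutive* compatibility matters here, which is why render order (discussed below) is so important.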
.. note::
    For example, if you draw a series of sprites each with a different texture,
    there is no way they can be batched.
Determining the rendering order
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The question arises: if only similar items can be drawn together in a batch, why
don't we look through all the items in a scene, group together all the similar
items, and draw them together?
In 3D, this is often exactly how engines work. However, in Godot's 2D renderer,
items are drawn in "painter's order", from back to front. This ensures that
items at the front are drawn on top of earlier items when they overlap.
This also means that if we try and draw objects on a per-texture basis, then
this painter's order may break and objects will be drawn in the wrong order.
In Godot, this back-to-front order is determined by:

- The order of objects in the scene tree.
- The Z index of objects.
- The canvas layer.
- :ref:`class_YSort` nodes.
.. note::
    You can group similar objects together for easier batching. While doing so
    is not a requirement on your part, think of it as an optional approach that
    can improve performance in some cases. See the
    :ref:`doc_batching_diagnostics` section to help you make this decision.
A trick
^^^^^^^
And now, a sleight of hand. Even though the idea of painter's order is that
objects are rendered from back to front, consider 3 objects ``A``, ``B`` and
``C``, that contain 2 different textures: grass and wood.
.. image:: img/overlap1.png
In painter's order they are ordered::

    A - wood
    B - grass
    C - wood
Because of the texture changes, they can't be batched and will be rendered in 3
draw calls.
However, painter's order is only needed on the assumption that the objects will
actually overlap when drawn. An overlap test can check this, so reordering is
only performed when it cannot change the visual result. The overlap test itself
has a small cost, so you can choose the number of items to look ahead to
balance the costs and benefits in your project.
::

    A - wood
    C - wood
    B - grass
Since the texture only changes once, we can render the above in only 2 draw
calls.
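The reordering trick can be sketched as follows. This is a hypothetical simplification (the field names and the exact join policy are made up for illustration): an item may be moved earlier to sit next to an item with the same texture, but only if it overlaps none of the items it is moved past.

```python
# Illustrative sketch of item reordering. The real renderer works on
# screen-space bounding rectangles in C++; this toy version uses (x, y, w, h).

def overlaps(a, b):
    """Axis-aligned bounding box overlap test."""
    ax, ay, aw, ah = a["rect"]
    bx, by, bw, bh = b["rect"]
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def try_reorder(items):
    """Move an item up next to a matching texture if it overlaps nothing in between."""
    items = list(items)
    for i in range(2, len(items)):
        for j in range(1, i):
            same_tex = items[j - 1]["texture"] == items[i]["texture"]
            blocked = any(overlaps(items[k], items[i]) for k in range(j, i))
            if same_tex and not blocked:
                items.insert(j, items.pop(i))
                break
    return items

A = {"name": "A", "texture": "wood",  "rect": (0, 0, 10, 10)}
B = {"name": "B", "texture": "grass", "rect": (20, 0, 10, 10)}
C = {"name": "C", "texture": "wood",  "rect": (40, 0, 10, 10)}

print([it["name"] for it in try_reorder([A, B, C])])  # ['A', 'C', 'B']
```

If ``B`` and ``C`` overlapped, the move would be blocked and painter's order would be preserved.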
Lights
~~~~~~
Although the batching system's job is normally quite straightforward, it becomes
considerably more complex when 2D lights are used. This is because lights are
drawn using additional passes, one for each light affecting the primitive.
Consider 2 sprites ``A`` and ``B``, with identical texture and material. Without
lights, they would be batched together and drawn in one draw call. But with 3
lights, they would be drawn as follows, each line being a draw call:
.. image:: img/lights_overlap.png
::

    A
    A - light 1
    A - light 2
    A - light 3
    B
    B - light 1
    B - light 2
    B - light 3
That is a lot of draw calls: 8 for only 2 sprites. Now, consider we are drawing
1,000 sprites. The number of draw calls quickly becomes astronomical and
performance suffers. This is partly why lights have the potential to drastically
slow down 2D rendering.
However, if you remember our magician's trick from item reordering, it turns out
we can use the same trick to get around painter's order for lights!
If ``A`` and ``B`` are not overlapping, we can render them together in a batch,
so the drawing process is as follows:
.. image:: img/lights_separate.png
::

    AB
    AB - light 1
    AB - light 2
    AB - light 3
That is only 4 draw calls. Not bad, as that is a 2× reduction. However, consider
that in a real game, you might be drawing closer to 1,000 sprites.

- **Before:** 1000 × 4 = 4,000 draw calls.
- **After:** 1 × 4 = 4 draw calls.
That is a 1000× decrease in draw calls, and should give a huge increase in
performance.
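The arithmetic above can be written out as a quick sanity check. This is a toy model that assumes one base pass plus one pass per light for each group, and, in the batched case, that every sprite joins into a single batch:

```python
# Back-of-the-envelope draw call counts for N sprites and L lights.
# Assumes identical texture/material everywhere, and (when batched) no overlaps.
def draw_calls(n_sprites, n_lights, batched):
    groups = 1 if batched else n_sprites
    return groups * (1 + n_lights)  # one base pass + one pass per light

print(draw_calls(2, 3, batched=False))     # 8
print(draw_calls(1000, 3, batched=False))  # 4000
print(draw_calls(1000, 3, batched=True))   # 4
```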
Overlap test
^^^^^^^^^^^^
However, as with the item reordering, things are not that simple. We must first
perform the overlap test to determine whether we can join these primitives. This
overlap test has a small cost. Again, you can choose the number of primitives to
look ahead in the overlap test to balance the benefits against the cost. With
lights, the benefits usually far outweigh the costs.
Also consider that depending on the arrangement of primitives in the viewport,
the overlap test will sometimes fail (because the primitives overlap and
therefore shouldn't be joined). In practice, the decrease in draw calls may be
less dramatic than in a perfect situation with no overlapping at all. However,
performance is usually far higher than without this lighting optimization.
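The lookahead-limited overlap test can be sketched like this (illustrative only; the real test works on screen-space bounding rectangles inside the renderer, and the parameter names are made up for the example):

```python
# Primitives are joined for lighting only while none of them overlap, and the
# search stops after `lookahead` candidates to bound the cost of the test.

def rects_overlap(a, b):
    """Axis-aligned (x, y, w, h) rectangle overlap test."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def join_for_lighting(rects, lookahead):
    """Return how many leading rects can be drawn as one lit batch."""
    joined = 1
    for i in range(1, min(len(rects), lookahead)):
        # Stop as soon as a candidate overlaps anything already joined.
        if any(rects_overlap(rects[i], rects[j]) for j in range(joined)):
            break
        joined += 1
    return joined

rects = [(0, 0, 10, 10), (20, 0, 10, 10), (15, 5, 10, 10)]
print(join_for_lighting(rects, lookahead=8))  # 2: the third rect overlaps the second
```

A larger lookahead finds more join opportunities but spends more time testing; that is the trade-off the setting exposes.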
Light scissoring
~~~~~~~~~~~~~~~~
Batching can make it more difficult to cull out objects that are not affected or
partially affected by a light. This can increase the fill rate requirements
quite a bit and slow down rendering. *Fill rate* is the rate at which pixels are
colored. It is another potential bottleneck unrelated to draw calls.
In order to counter this problem (and speed up lighting in general), batching
introduces light scissoring. This enables the use of the OpenGL command
``glScissor()``, which identifies an area outside of which the GPU won't render
any pixels. We can greatly optimize fill rate by identifying the intersection
area between a light and a primitive, and limit rendering the light to
*that area only*.
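The scissor rectangle is simply the intersection of two screen-space rectangles. A minimal sketch (the actual renderer computes this in C++ before handing the result to ``glScissor()``):

```python
# Intersect the light's screen-space bounding rectangle with the primitive's;
# only the intersection needs to be rendered for that light pass.

def intersect(a, b):
    """Intersection of two (x, y, w, h) rects, or None if they don't meet."""
    x = max(a[0], b[0])
    y = max(a[1], b[1])
    w = min(a[0] + a[2], b[0] + b[2]) - x
    h = min(a[1] + a[3], b[1] + b[3]) - y
    if w <= 0 or h <= 0:
        return None
    return (x, y, w, h)

light_rect = (100, 100, 200, 200)
primitive_rect = (50, 50, 100, 100)
print(intersect(light_rect, primitive_rect))  # (100, 100, 50, 50)
```

When the intersection is much smaller than the primitive, the fill rate saved can be substantial.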
Light scissoring is controlled with the :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
project setting. This value is between 1.0 and 0.0, with 1.0 being off (no
scissoring), and 0.0 being scissoring in every circumstance. The reason for the
setting is that there may be some small cost to scissoring on some hardware.
That said, scissoring should usually result in performance gains when you're
using 2D lighting.
The relationship between the threshold and whether a scissor operation takes
place is not always straightforward. Generally, it represents the pixel area
that is potentially "saved" by a scissor operation (i.e. the fill rate saved).
At 1.0, the entire screen's pixels would need to be saved, which rarely (if
ever) happens, so it is switched off. In practice, the useful values are close
to 0.0, as only a small percentage of pixels need to be saved for the operation
to be useful.
The exact relationship is probably not necessary for users to worry about, but
is included in the appendix out of interest:
:ref:`doc_batching_light_scissoring_threshold_calculation`
.. figure:: img/scissoring.png
    :alt: Light scissoring example diagram

    Bottom right is a light; the red area is the pixels saved by the scissoring
    operation. Only the intersection needs to be rendered.
Vertex baking
~~~~~~~~~~~~~
The GPU shader receives instructions on what to draw in 2 main ways:

- Shader uniforms (e.g. modulate color, item transform).
- Vertex attributes (vertex color, local transform).
However, within a single draw call (batch), we cannot change uniforms. This
means that naively, we would not be able to batch together items or commands
that change ``final_modulate`` or an item's transform. Unfortunately, that
happens in an awful lot of cases. For instance, sprites are typically
individual nodes with their own item transform, and they may have their own
color modulate as well.
To get around this problem, the batching can "bake" some of the uniforms into
the vertex attributes.

- The item transform can be combined with the local transform and sent in a
  vertex attribute.
- The final modulate color can be combined with the vertex colors, and sent in a
  vertex attribute.
In most cases, this works fine, but this shortcut breaks down if a shader expects
these values to be available individually rather than combined. This can happen
in custom shaders.
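Conceptually, baking looks something like this. This is a hypothetical sketch (the real code handles full 2D transforms and packed vertex formats, not just a translation):

```python
# Per-item uniforms (transform, modulate color) are folded into the vertex
# data so that many items can share one draw call without uniform changes.

def bake_vertex(position, vertex_color, item_transform, final_modulate):
    # Apply the item's 2D transform (reduced to a translation for brevity)...
    tx, ty = item_transform
    baked_pos = (position[0] + tx, position[1] + ty)
    # ...and multiply the modulate color into the vertex color.
    baked_color = tuple(v * m for v, m in zip(vertex_color, final_modulate))
    return baked_pos, baked_color

pos, col = bake_vertex((10, 5), (1.0, 1.0, 1.0, 1.0), (100, 50), (1.0, 0.5, 0.5, 1.0))
print(pos, col)  # (110, 55) (1.0, 0.5, 0.5, 1.0)
```

Once baked, the shader can no longer see the original transform or modulate separately, which is exactly why the custom shader caveats below exist.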
Custom shaders
^^^^^^^^^^^^^^
As a result of the limitation described above, certain operations in custom
shaders will prevent vertex baking and therefore decrease the potential for
batching. While we are working to decrease these cases, the following caveats
currently apply:

- Reading or writing ``COLOR`` or ``MODULATE`` disables vertex color baking.
- Reading ``VERTEX`` disables vertex position baking.
Project settings
~~~~~~~~~~~~~~~~
To fine-tune batching, a number of project settings are available. You can
usually leave these at default during development, but it's a good idea to
experiment to ensure you are getting maximum performance. Spending a little time
tweaking parameters can often give considerable performance gains for very
little effort. See the on-hover tooltips in the Project Settings for more
information.
rendering/batching/options
^^^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`use_batching
  <class_ProjectSettings_property_rendering/batching/options/use_batching>` -
  Turns batching on or off.
- :ref:`use_batching_in_editor
  <class_ProjectSettings_property_rendering/batching/options/use_batching_in_editor>` -
  Turns batching on or off in the Godot editor.
  This setting doesn't affect the running project in any way.
- :ref:`single_rect_fallback
  <class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>` -
  This is a faster way of drawing unbatchable rectangles. However, it may lead
  to flicker on some hardware, so it's not recommended.
rendering/batching/parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`max_join_item_commands <class_ProjectSettings_property_rendering/batching/parameters/max_join_item_commands>` -
  One of the most important ways of achieving batching is to join suitable
  adjacent items (nodes) together. However, they can only be joined if the
  commands they contain are compatible. The system must therefore do a lookahead
  through the commands in an item to determine whether it can be joined. This
  has a small cost per command, and items with a large number of commands are
  not worth joining, so the best value may be project-dependent.
- :ref:`colored_vertex_format_threshold
  <class_ProjectSettings_property_rendering/batching/parameters/colored_vertex_format_threshold>` -
  Baking colors into vertices results in a larger vertex format. This is not
  necessarily worth doing unless there are a lot of color changes going on
  within a joined item. This parameter represents the ratio of commands
  containing color changes to the total number of commands; above this ratio,
  the renderer switches to baked colors.
- :ref:`batch_buffer_size
  <class_ProjectSettings_property_rendering/batching/parameters/batch_buffer_size>` -
  This determines the maximum size of a batch. It doesn't have a huge effect
  on performance, but can be worth decreasing for mobile if RAM is at a premium.
- :ref:`item_reordering_lookahead
  <class_ProjectSettings_property_rendering/batching/parameters/item_reordering_lookahead>` -
  Item reordering can help especially with interleaved sprites using different
  textures. The lookahead for the overlap test has a small cost, so the best
  value may change per project.
rendering/batching/lights
^^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`scissor_area_threshold
  <class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>` -
  See light scissoring above.
- :ref:`max_join_items
  <class_ProjectSettings_property_rendering/batching/lights/max_join_items>` -
  Joining items before lighting can significantly increase
  performance. This requires an overlap test, which has a small cost, so the
  costs and benefits may be project-dependent, and hence the best value to use
rendering/batching/debug
^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`flash_batching
  <class_ProjectSettings_property_rendering/batching/debug/flash_batching>` -
  This is purely a debugging feature to identify regressions between the
  batching and legacy renderer. When it is switched on, the batching and legacy
  renderer are used alternately on each frame. This will decrease performance,
  and should not be used for your final export, only for testing.
- :ref:`diagnose_frame
  <class_ProjectSettings_property_rendering/batching/debug/diagnose_frame>` -
  This will periodically print a diagnostic batching log to
  the Godot IDE / console.
rendering/batching/precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`uv_contract
  <class_ProjectSettings_property_rendering/batching/precision/uv_contract>` -
  On some hardware (notably some Android devices), there have been reports of
  tilemap tiles drawing slightly outside their UV range, leading to edge
  artifacts. As a workaround, this setting makes a small contraction in the UV
  coordinates to compensate for precision errors on devices.
- :ref:`uv_contract_amount
  <class_ProjectSettings_property_rendering/batching/precision/uv_contract_amount>` -
  Hopefully, the default amount should cure artifacts on most devices,
  but this value remains adjustable just in case.
.. _doc_batching_diagnostics:
Diagnostics
~~~~~~~~~~~
Although you can change parameters and examine the effect on frame rate, this
can feel like working blindly, with no idea of what is going on under the hood.
To help with this, batching offers a diagnostic mode, which will periodically
print out (to the IDE or console) a list of the batches that are being
processed. This can help pinpoint situations where batching isn't occurring
as intended, and help you fix these situations to get the best possible performance.
Reading a diagnostic
^^^^^^^^^^^^^^^^^^^^
.. code-block:: text

    canvas_begin FRAME 2604
    items
        joined_item 1 refs
            batch D 0-0
            batch D 0-2 n n
            batch R 0-1 [0 - 0] {255 255 255 255 }
        joined_item 1 refs
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
        joined_item 1 refs
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
    canvas_end
This is a typical diagnostic.
- **joined_item:** A joined item can contain 1 or
  more references to items (nodes). Generally, joined_items containing many
  references are preferable to many joined_items containing a single reference.
  Whether items can be joined will be determined by their contents and
  compatibility with the previous item.
- **batch R:** A batch containing rectangles. The second number is the number of
  rects. The second number in square brackets is the Godot texture ID, and the
  numbers in curly braces are the color. If the batch contains more than one rect,
  ``MULTI`` is added to the line to make it easy to identify.
  Seeing ``MULTI`` is good as it indicates successful batching.
- **batch D:** A default batch, containing everything else that is not currently
  batched.
Default batches
^^^^^^^^^^^^^^^
The second number following default batches is the number of commands in the
batch, and it is followed by a brief summary of the contents::

    l - line
    PL - polyline
    r - rect
    n - ninepatch
    PR - primitive
    p - polygon
    m - mesh
    MM - multimesh
    PA - particles
    c - circle
    t - transform
    CI - clip_ignore
You may see "dummy" default batches containing no commands; you can ignore those.
FAQ
~~~
I don't get a large performance increase when enabling batching.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Try the diagnostics to see how much batching is occurring, and whether it can
  be improved.
- Try changing batching parameters in the Project Settings.
- Consider that batching may not be your bottleneck (see bottlenecks).
I get a decrease in performance with batching.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Try the steps described above to increase the number of batching opportunities.
- Try enabling :ref:`single_rect_fallback
  <class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`.
- The single rect fallback method is the default used without batching, and it
  is approximately twice as fast. However, it can result in flickering on some
  hardware, so its use is discouraged.
- After trying the above, if your scene is still performing worse, consider
  turning off batching.
I use custom shaders and the items are not batching.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Custom shaders can be problematic for batching; see the custom shaders
  section above.
I am seeing line artifacts appear on certain hardware.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- See the :ref:`uv_contract
  <class_ProjectSettings_property_rendering/batching/precision/uv_contract>`
  project setting, which can be used to solve this problem.
I use a large number of textures, so few items are being batched.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Consider using texture atlases. As well as allowing batching, these
  reduce the need for state changes associated with changing textures.
Appendix
~~~~~~~~
.. _doc_batching_light_scissoring_threshold_calculation:
Light scissoring threshold calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The actual proportion of screen pixel area used as the threshold is the
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
value to the power of 4.
For example, on a screen size of 1920×1080, there are 2,073,600 pixels.
At a threshold of 1,000 pixels, the proportion would be::

    1000 / 2073600 = 0.00048225
    0.00048225 ^ (1/4) = 0.14819
So a :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
of ``0.15`` would be a reasonable value to try.
Going the other way, for instance with a :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
of ``0.5``::

    0.5 ^ 4 = 0.0625
    0.0625 * 2073600 = 129600 pixels
If the number of pixels saved is greater than this threshold, the scissor is
activated.
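The conversion above can be wrapped in a small helper function. The following is
a hypothetical GDScript sketch (the function name is our own, not an engine
API)::

    # Hypothetical helper: convert a desired pixel threshold into a
    # scissor_area_threshold value for a given screen resolution.
    func scissor_threshold_for_pixels(pixels, screen_width, screen_height):
        var proportion = float(pixels) / (screen_width * screen_height)
        # The setting is the proportion raised to the power of 1/4.
        return pow(proportion, 0.25)

For example, ``scissor_threshold_for_pixels(1000, 1920, 1080)`` returns
approximately ``0.148``, matching the worked example above.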
.. _doc_cpu_optimization:
CPU optimization
================
Measuring performance
=====================
To know how to speed up our program, we have to know where the "bottlenecks"
are. Bottlenecks are the slowest parts of the program that limit the rate that
everything can progress. This allows us to concentrate our efforts on optimizing
the areas which will give us the greatest speed improvement, instead of spending
a lot of time optimizing functions that will lead to small performance
CPU profilers
~~~~~~~~~~~~~
Profilers run alongside your program and take timing measurements to work out
what proportion of time is spent in each function.
The Godot IDE conveniently has a built-in profiler. It does not run every time
you start your project: it must be manually started and stopped. This is
because, like most profilers, recording these timing measurements can
slow down your project significantly.
After profiling, you can look back at the results for a frame.
.. figure:: img/godot_profiler.png
   :alt: Screenshot of the Godot profiler

   Results of a profile of one of the demo projects.
.. note:: We can see the cost of built-in processes such as physics and audio,
as well as seeing the cost of our own scripting functions at the
bottom.
Time spent waiting for various built-in servers may not be counted in
the profilers. This is a known bug.
When a project is running slowly, you will often see an obvious function or
process taking a lot more time than others. This is your primary bottleneck, and
you can usually increase speed by optimizing this area.
For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
External profilers
~~~~~~~~~~~~~~~~~~
Although the Godot IDE profiler is very convenient and useful, sometimes you
need more power, and the ability to profile the Godot engine source code itself.
You can use a number of third-party profilers to do this including
`Valgrind <https://www.valgrind.org/>`__,
`VerySleepy <http://www.codersnotes.com/sleepy/>`__,
`HotSpot <https://github.com/KDAB/hotspot>`__,
`Visual Studio <https://visualstudio.microsoft.com/>`__ and
`Intel VTune <https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html>`__.
.. note:: You will need to compile Godot from source to use a third-party profiler.
          This is required to obtain debugging symbols. You can also use a debug
          build; however, note that the results of profiling a debug build will
          be different from a release build, because debug builds are less
          optimized. Bottlenecks are often in a different place in debug builds,
          so you should profile release builds whenever possible.
.. figure:: img/valgrind.png
   :alt: Screenshot of Callgrind

   Example results from Callgrind, which is part of Valgrind.
From the left, Callgrind is listing the percentage of time within a function and
its children (Inclusive), the percentage of time spent within the function
itself, excluding child functions (Self), the number of times the function is
called, the function name, and the file or module.
In this example, we can see nearly all time is spent under the
`Main::iteration()` function. This is the master function in the Godot source
code that is called repeatedly. It causes frames to be drawn, physics ticks to
be simulated, and nodes and scripts to be updated. A large proportion of the
time is spent in the functions to render a canvas (66%), because this example
uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver).
This tells us that a large proportion of CPU time is being spent in the
graphics driver.
This is actually an excellent example because, in an ideal world, only a very
small proportion of time would be spent in the graphics driver. This is an
indication that there is a problem with too much communication and work being
done in the graphics API. This specific profiling led to the development of 2D
batching, which greatly speeds up 2D rendering by reducing bottlenecks in this
area.
Manually timing functions
=========================
Another handy technique, especially once you have identified the bottleneck
using a profiler, is to manually time the function or area under test.
The specifics vary depending on the language, but in GDScript, you would do
the following:
::
    var time_start = OS.get_ticks_usec()

    # The function you want to time.
    update_enemies()

    var time_end = OS.get_ticks_usec()
    print("update_enemies() took %d microseconds" % (time_end - time_start))
When manually timing functions, it is usually a good idea to run the function
many times (1,000 or more times), instead of just once (unless it is a very slow
function). The reason for doing this is that timers often have limited accuracy.
Moreover, CPUs will schedule processes in a haphazard manner. Therefore, an
average over a series of runs is more accurate than a single measurement.
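A hedged sketch of timing over many runs might look like this, where
``update_enemies()`` stands in for your own function::

    var time_start = OS.get_ticks_usec()

    # Run the function many times so timer inaccuracy and CPU scheduling
    # noise average out.
    for i in range(1000):
        update_enemies()

    var time_end = OS.get_ticks_usec()
    # Divide by the number of runs to get the average cost per call.
    print("update_enemies() took %f microseconds on average" % ((time_end - time_start) / 1000.0))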
As you attempt to optimize functions, be sure to either repeatedly profile or
time them as you go. This will give you crucial feedback as to whether the
optimization is working (or not).
Caches
======
CPU caches are something else to be particularly aware of, especially when
comparing timing results of two different versions of a function. The results
can be highly dependent on whether the data is in the CPU cache or not. CPUs
don't load data directly from the system RAM, even though it's huge in
comparison to the CPU cache (several gigabytes instead of a few megabytes). This
is because system RAM is very slow to access. Instead, CPUs load data from a
smaller, faster bank of memory called the cache. Loading data from the cache is
very fast, but every time you try to load a memory address that is not stored in
the cache, the cache must make a trip to main memory and slowly load in some
data. This delay can result in the CPU sitting around idle for a long time, and
is referred to as a "cache miss".
This means that the first time you run a function, it may run slowly because the
data is not in the CPU cache. The second and later times, it may run much faster
because the data is in the cache. Due to this, always use averages when timing,
and be aware of the effects of cache.
Understanding caching is also crucial to CPU optimization. If you have an
algorithm (routine) that loads small bits of data from randomly spread out areas
will be able to work as fast as possible.
Godot usually takes care of such low-level details for you. For example, the
Server APIs make sure data is optimized for caching already for things like
rendering and physics. Still, you should be especially aware of caching when
using :ref:`GDNative <toc-tutorials-gdnative>`.
Languages
=========
Godot supports a number of different languages, and it is worth bearing in mind
that there are trade-offs involved. Some languages are designed for ease of use
at the cost of speed, and others are faster but more difficult to work with.
Built-in engine functions run at the same speed regardless of the scripting
language you choose. If your project is making a lot of calculations in its own
code, consider moving those calculations to a faster language.
GDScript
~~~~~~~~
:ref:`GDScript <toc-learn-scripting-gdscript>` is designed to be easy to use and iterate,
and is ideal for making many types of games. However, in this language, ease of
use is considered more important than performance. If you need to make heavy
calculations, consider moving some of your project to one of the other
languages.
C#
~~
:ref:`C# <toc-learn-scripting-C#>` is popular and has first-class support in Godot.
It offers a good compromise between speed and ease of use. Beware of possible
garbage collection pauses and leaks that can occur during gameplay, though. A
common approach to work around issues with garbage collection is to use *object
pooling*, which is outside the scope of this guide.
Other languages
~~~~~~~~~~~~~~~
Third parties provide support for several other languages, including `Rust
C++
~~~
Godot is written in C++. Using C++ will usually result in the fastest code.
However, on a practical level, it is the most difficult to deploy to end users'
machines on different platforms. Options for using C++ include
:ref:`GDNative <toc-tutorials-gdnative>` and
:ref:`custom modules <doc_custom_modules_in_c++>`.
Threads
=======
Consider using threads when making a lot of calculations that can run in
parallel to each other. Modern CPUs have multiple cores, each one capable of
doing a limited amount of work. By spreading work over multiple threads, you can
move further towards peak CPU efficiency.
The disadvantage of threads is that you have to be incredibly careful. As each
CPU core operates independently, they can end up trying to access the same
memory at the same time. One thread can be reading a variable while another
is writing to it: this is called a *race condition*. Before you use threads,
make sure you understand the dangers and how to try and prevent these race
conditions.
Threads can also make debugging considerably more difficult. The GDScript
debugger doesn't support setting up breakpoints in threads yet.

For more information on threads, see :ref:`doc_using_multiple_threads`.
SceneTree
=========
Although Nodes are an incredibly powerful and versatile concept, be aware that
every node has a cost. Built-in functions such as ``_process()`` and
``_physics_process()`` propagate through the tree. This housekeeping can reduce
performance when you have very large numbers of nodes (usually in the thousands).
Each node is handled individually in the Godot renderer. Therefore, a smaller
number of nodes with more in each can lead to better performance.
One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
get much better performance by removing nodes from the SceneTree, rather than by
pausing or hiding them. You don't have to delete a detached node. You can, for
example, keep a reference to a node, detach it from the scene tree using
:ref:`Node.remove_child(node) <class_Node_method_remove_child>`, then reattach
it later using :ref:`Node.add_child(node) <class_Node_method_add_child>`.
This can be very useful for adding and removing areas from a game, for example.
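A hedged sketch of this detach-and-reattach pattern (the variable and method
names are our own)::

    var saved_area = null

    func unload_area(area):
        # Keep a reference so the node isn't freed, then detach it.
        # While detached, the node is neither processed nor rendered.
        saved_area = area
        remove_child(area)

    func load_area():
        if saved_area:
            add_child(saved_area)
            saved_area = null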
You can avoid the SceneTree altogether by using Server APIs. For more
information, see :ref:`doc_using_servers`.
Physics
=======
In some situations, physics can end up becoming a bottleneck. This is
particularly the case with complex worlds and large numbers of physics objects.

Here are some techniques to speed up physics:
- Try using simplified versions of your rendered geometry for collision shapes.
  Often, this won't be noticeable for end users, but can greatly increase
  performance.
- Try removing objects from physics when they are out of view / outside the
  current area, or reusing physics objects (maybe you allow 8 monsters per area,
  for example, and reuse these).
Another crucial aspect to physics is the physics tick rate. In some games, you
can greatly reduce the tick rate. For example, instead of updating physics
60 times per second, you may update them only 30 or even 20 times per second.
This can greatly reduce the CPU load.
The downside of changing physics tick rate is you can get jerky movement or
jitter when the physics update rate does not match the frames per second
rendered. Also, decreasing the physics tick rate will increase input lag.
It's recommended to stick to the default physics tick rate (60 Hz) in most games
that feature real-time player movement.
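The tick rate can be changed in the project settings, or at run time as in
this minimal sketch::

    func _ready():
        # Run physics at 30 ticks per second instead of the default 60.
        # This roughly halves the physics CPU cost, at the price of
        # choppier motion and increased input lag.
        Engine.iterations_per_second = 30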
The solution to jitter is to use *fixed timestep interpolation*, which involves
smoothing the rendered positions and rotations over multiple frames to match the
physics. You can either implement this yourself or use a
`third-party addon <https://github.com/lawnjelly/smoothing-addon>`__.
Performance-wise, interpolation is a very cheap operation compared to running a
physics tick. It's orders of magnitude faster, so this can be a significant
performance win while also reducing jitter.
General optimization tips
=========================
Introduction
~~~~~~~~~~~~
In an ideal world, computers would run at infinite speed. The only limit to
what we could achieve would be our imagination. However, in the real world, it's
all too easy to produce software that will bring even the fastest computer to
its knees.
Thus, designing games and other software is a compromise between what we would
like to be possible, and what we can realistically achieve while maintaining
good performance.
To achieve the best results, we have two approaches:
- Work faster.
- Work smarter.
And preferably, we will use a blend of the two.
Smoke and mirrors
^^^^^^^^^^^^^^^^^
Part of working smarter is recognizing that, in games, we can often get the
player to believe they're in a world that is far more complex, interactive, and
graphically exciting than it really is. A good programmer is a magician, and
should strive to learn the tricks of the trade while trying to invent new ones.
The nature of slowness
^^^^^^^^^^^^^^^^^^^^^^
To the outside observer, performance problems are often lumped together.
But in reality, there are several different kinds of performance problems:
- A slow process that occurs every frame, leading to a continuously low frame
  rate.
- An intermittent process that causes "spikes" of slowness, leading to
  stalls.
- A slow process that occurs outside of normal gameplay, for instance,
  when loading a level.
Each of these is annoying to the user, but in different ways.
our attempts to speed them up.
There are several methods of measuring performance, including:
- Putting a start/stop timer around code of interest.
- Using the Godot profiler.
- Using external third-party CPU profilers.
- Using GPU profilers/debuggers such as
  `NVIDIA Nsight Graphics <https://developer.nvidia.com/nsight-graphics>`__
  or `apitrace <https://apitrace.github.io/>`__.
- Checking the frame rate (with V-Sync disabled).
Be very aware that the relative performance of different areas can vary on
different hardware. It's often a good idea to measure timings on more than one
device. This is especially the case if you're targeting mobile devices.
Limitations
~~~~~~~~~~~
CPU profilers are often the go-to method for measuring performance. However,
they don't always tell the whole story.
- Bottlenecks are often on the GPU, "as a result" of instructions given by the
  CPU.
- Spikes can occur in the operating system processes (outside of Godot) "as a
  result" of instructions used in Godot (for example, dynamic memory allocation).
- You may not always be able to profile specific devices like a mobile phone
  due to the initial setup required.
- You may have to solve performance problems that occur on hardware you don't
  have access to.
As a result of these limitations, you often need to use detective work to find
out where bottlenecks are.
binary search.
Hypothesis testing
^^^^^^^^^^^^^^^^^^
Say, for example, that you believe sprites are slowing down your game.
You can test this hypothesis by:
- Measuring the performance when you add more sprites, or take some away.
This may lead to a further hypothesis: does the size of the sprite determine
the performance drop?
- You can test this by keeping everything the same, but changing the sprite
  size, and measuring performance.
Binary search
^^^^^^^^^^^^^
If you know that frames are taking much longer than they should, but you're
not sure where the bottleneck lies. You could begin by commenting out
approximately half the routines that occur on a normal frame. Has the
performance improved more or less than expected?
Once you know which of the two halves contains the bottleneck, you can
repeat this process until you've pinned down the problematic area.
Profilers
=========
provide results telling you what percentage of time was spent in different
functions and areas, and how often functions were called.
This can be very useful both to identify bottlenecks and to measure the results
of your improvements. Sometimes, attempts to improve performance can backfire
and lead to slower performance.

**Always use profiling and timing to guide your efforts.**
For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
Principles
==========
`Donald Knuth <https://en.wikipedia.org/wiki/Donald_Knuth>`__ said:
*Programmers waste enormous amounts of time thinking about, or worrying
about, the speed of noncritical parts of their programs, and these attempts
The messages are very important:
- Developer time is limited. Instead of blindly trying to speed up
  all aspects of a program, we should concentrate our efforts on the aspects that
  really matter.
- Efforts at optimization often end up with code that is harder to read and
debug than non-optimized code. It is in our interests to limit this to areas
that will really benefit.
Just because we *can* optimize a particular bit of code, it doesn't necessarily
mean that we *should*. Knowing when and when not to optimize is a great skill to
develop.
One misleading aspect of the quote is that people tend to focus on the subquote
*"premature optimization is the root of all evil"*. While *premature* optimization
is (by definition) undesirable, performant software is the result of performant
design.
Performant design
~~~~~~~~~~~~~~~~~
The danger with encouraging people to ignore optimization until necessary, is
that it conveniently ignores that the most important time to consider
performance is at the design stage, before a key has even hit a keyboard. If the
design or algorithms of a program are inefficient, then no amount of polishing the
details later will make it run fast. It may run *faster*, but it will never run
as fast as a program designed for performance.
This tends to be far more important in game or graphics programming than in
general programming. A performant design, even without low-level optimization,
will often run many times faster than a mediocre design with low-level
optimization.
Incremental design
~~~~~~~~~~~~~~~~~~
Of course, in practice, unless you have prior knowledge, you are unlikely to
come up with the best design the first time. Instead, you'll often make a series of
versions of a particular area of code, each taking a different approach to the
problem, until you come to a satisfactory solution. It's important not to spend
too much time on the details at this stage until you have finalized the overall
design. Otherwise, much of your work will be thrown out.
It's difficult to give general guidelines for performant design because this is
so dependent on the problem. One point worth mentioning though, on the CPU
side, is that modern CPUs are nearly always limited by memory bandwidth. This
has led to a resurgence in data-oriented design, which involves designing data
structures and algorithms for *cache locality* of data and linear access, rather than
jumping around in memory.
The optimization process
========================
Assuming we have a reasonable design, and taking our lessons from Knuth, our
first step in optimization should be to identify the biggest bottlenecks - the
slowest functions, the low-hanging fruit.
Once we've successfully improved the speed of the slowest area, it may no
longer be the bottleneck. So we should test/profile again and find the next
bottleneck on which to focus.
The process is thus:
1. Profile / Identify bottleneck.
2. Optimize bottleneck.
3. Return to step 1.
Optimizing bottlenecks
~~~~~~~~~~~~~~~~~~~~~~
@@ -214,18 +215,22 @@ Optimizing bottlenecks
Some profilers will even tell you which part of a function (which data accesses,
calculations) are slowing things down.
As with design, you should concentrate your efforts first on making sure the
algorithms and data structures are the best they can be. Data access should be
local (to make best use of CPU cache), and it can often be better to use compact
storage of data (again, always profile to test results). Often, you precalculate
heavy computations ahead of time. This can be done by performing the computation
when loading a level, by loading a file containing precalculated data, or simply
by storing the results of complex calculations into a script constant and
reading its value.
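As a small sketch of the precalculation idea, a trigonometry lookup table can be built once at "level load" time instead of recomputing the function every frame. The table size is a hypothetical choice, and the nearest-entry lookup trades accuracy for speed:

```python
import math

# Precalculate a sine lookup table once, at load time.
TABLE_SIZE = 256
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def fast_sin(angle):
    """Approximate sin(angle) with a nearest-entry table lookup."""
    index = round(angle / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return SINE_TABLE[index]
```

Always profile: on modern hardware, a table lookup is not automatically faster than the math function, so this only pays off for genuinely heavy computations.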
Once algorithms and data are good, you can often make small changes in routines
which improve performance. For instance, you can move some calculations outside
of loops or transform nested ``for`` loops into non-nested loops.
(This should be feasible if you know a 2D array's width or height in advance.)
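A minimal sketch of the nested-to-flat loop transformation, assuming a 2D grid stored in a flat list with a known width and height:

```python
WIDTH, HEIGHT = 4, 3
grid = list(range(WIDTH * HEIGHT))  # a 2D array flattened row by row

# Nested version: two loops, one index computation per cell.
total_nested = 0
for y in range(HEIGHT):
    for x in range(WIDTH):
        total_nested += grid[y * WIDTH + x]

# Non-nested version: a single loop over all cells, feasible because
# WIDTH and HEIGHT are known in advance.
total_flat = 0
for i in range(WIDTH * HEIGHT):
    total_flat += grid[i]
```

Both loops visit the same cells in the same order; the flat version simply removes per-row loop overhead and index arithmetic.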
Always retest your timing/bottlenecks after making each change. Some changes
will increase speed, others may have a negative effect. Sometimes, a small
positive effect will be outweighed by the negatives of more complex code, and
you may choose to leave out that optimization.
@@ -235,9 +240,9 @@ Appendix
Bottleneck math
~~~~~~~~~~~~~~~
The proverb *"a chain is only as strong as its weakest link"* applies directly to
performance optimization. If your project is spending 90% of the time in
function ``A``, then optimizing ``A`` can have a massive effect on performance.
.. code-block:: none

    A: 9 ms
    Everything else: 1 ms
    Total frame time: 10 ms
@@ -247,14 +252,14 @@ function 'A', then optimizing A can have a massive effect on performance.
.. code-block:: none
    A: 1 ms
    Everything else: 1 ms
    Total frame time: 2 ms
In this example, improving this bottleneck ``A`` by a factor of 9× decreases
overall frame time by 5× while increasing frames per second by 5×.
However, if something else is running slowly and also bottlenecking your
project, then the same improvement can lead to less dramatic gains:
.. code-block:: none
@@ -269,8 +274,8 @@ project, then the same improvement can lead to less dramatic gains:
    Everything else: 50 ms
    Total frame time: 51 ms
In this example, even though we have hugely optimized function ``A``,
the actual gain in terms of frame rate is quite small.
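The arithmetic behind both examples can be captured in a few lines. This is an illustrative Python sketch of the frame-time math above, not engine code:

```python
def overall_speedup(a_ms, other_ms, optimized_a_ms):
    """Overall frame-time speedup obtained by optimizing only function A."""
    return (a_ms + other_ms) / (optimized_a_ms + other_ms)

# First example: A drops from 9 ms to 1 ms next to 1 ms of other work.
isolated_case = overall_speedup(9, 1, 1)   # 10 ms -> 2 ms
# Second example: the same 9x optimization next to 50 ms of other work.
diluted_case = overall_speedup(9, 50, 1)   # 59 ms -> 51 ms
```

The same 9× improvement to ``A`` yields a 5× overall speedup in the first case, but only about 1.16× in the second.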
In games, things become even more complicated because the CPU and GPU run
independently of one another. Your total frame time is determined by the slower
@@ -288,5 +293,5 @@ of the two.
    GPU: 50 ms
    Total frame time: 50 ms
In this example, we optimized the CPU hugely again, but the frame time didn't
improve because we are GPU-bottlenecked.
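Because the CPU and GPU work in parallel, the frame time is set by whichever is slower, which this small sketch makes explicit (illustrative Python, using the numbers from the example above):

```python
def frame_time_ms(cpu_ms, gpu_ms):
    """CPU and GPU run in parallel, so the slower one sets the frame time."""
    return max(cpu_ms, gpu_ms)

before = frame_time_ms(9, 50)  # GPU-bound already
after = frame_time_ms(1, 50)   # big CPU win, identical frame time
```

This is why you should always identify whether you are CPU- or GPU-bottlenecked before optimizing: effort spent on the faster of the two is wasted.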
@@ -1,75 +1,76 @@
.. _doc_gpu_optimization:
GPU optimization
================
Introduction
~~~~~~~~~~~~
The demand for new graphics features and progress almost guarantees that you
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
instance in calculations inside the Godot engine to prepare objects for
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
sorts instructions to pass to the GPU, and in the transfer of these
instructions. And finally, bottlenecks also occur on the GPU itself.
Where bottlenecks occur in rendering is highly hardware-specific.
Mobile GPUs in particular may struggle with scenes that run easily on desktop.
Understanding and investigating GPU bottlenecks is slightly different to the
situation on the CPU. This is because, often, you can only change performance
indirectly by changing the instructions you give to the GPU. Also, it may be
more difficult to take measurements. In many cases, the only way of measuring
performance is by examining changes in the time spent rendering each frame.
Draw calls, state changes, and APIs
===================================
.. note:: The following section is not relevant to end-users, but is useful to
provide background information that is relevant in later sections.
Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
Vulkan). The communication and driver activity involved can be quite costly,
especially in OpenGL and OpenGL ES. If we can provide these instructions in a
way that is preferred by the driver and GPU, we can greatly increase
performance.
Nearly every API command in OpenGL requires a certain amount of validation to
make sure the GPU is in the correct state. Even seemingly simple commands can
lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
reduce these instructions to a bare minimum and group together similar objects
as much as possible so they can be rendered together, or with the minimum number
of these expensive state changes.
2D batching
~~~~~~~~~~~
In 2D, the costs of treating each item individually can be prohibitively high -
there can easily be thousands of them on the screen. This is why 2D *batching*
is used. Multiple similar items are grouped together and rendered in a batch,
via a single draw call, rather than making a separate draw call for each item.
In addition, this means state changes, material and texture changes can be kept
to a minimum.
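The core idea of batching can be sketched in a few lines of engine-agnostic Python. The item list and state keys below are hypothetical, but the grouping logic mirrors what a batching renderer does: consecutive items sharing the same render state collapse into one draw call:

```python
from itertools import groupby

def build_batches(items):
    """Group consecutive draw items sharing (texture, material) into batches.

    Only consecutive items merge, mirroring how a renderer cannot batch
    across an item that forces a state change in between.
    """
    batches = []
    for state, group in groupby(items, key=lambda it: (it["texture"], it["material"])):
        batches.append({"state": state, "count": len(list(group))})
    return batches

# Hypothetical scene: three sprites sharing a texture, then a different one.
items = [
    {"texture": "tiles.png", "material": "default"},
    {"texture": "tiles.png", "material": "default"},
    {"texture": "tiles.png", "material": "default"},
    {"texture": "hero.png", "material": "default"},
]
batches = build_batches(items)  # 4 potential draw calls collapse into 2
```

This also shows why draw order matters for batching: interleaving the two textures item by item would prevent any merging at all.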
For more information on 2D batching, see :ref:`doc_batching`.
3D batching
~~~~~~~~~~~
In 3D, we still aim to minimize draw calls and state changes. However, it can be
more difficult to batch together several objects into a single draw call. 3D
meshes tend to comprise hundreds or thousands of triangles, and combining large
meshes in real-time is prohibitively expensive. The cost of joining them quickly
exceeds any benefits as the number of triangles per mesh grows. A much better
alternative is to **join meshes ahead of time** (static meshes in relation to each
other). This can either be done by artists, or programmatically within Godot.
There is also a cost to batching together objects in 3D. Several objects
rendered as one cannot be individually culled. An entire city that is off-screen
will still be rendered if it is joined to a single blade of grass that is on
screen. Thus, you should always take objects' location and culling into account
when attempting to batch 3D objects together. Despite this, the benefits of
joining static objects often outweigh other considerations, especially for large
numbers of distant or low-poly objects.
For more information on 3D specific optimizations, see
:ref:`doc_optimizing_3d_performance`.
@@ -80,14 +81,14 @@ Reuse Shaders and Materials
The Godot renderer is a little different to what is out there. It's designed to
minimize GPU state changes as much as possible. :ref:`SpatialMaterial
<class_SpatialMaterial>` does a good job at reusing materials that need similar
shaders. If custom shaders are used, make sure to reuse them as much as
possible. Godot's priorities are:
- **Reusing Materials:** The fewer different materials in the
scene, the faster the rendering will be. If a scene has a huge amount
of objects (in the hundreds or thousands), try reusing the materials.
In the worst case, use atlases to decrease the amount of texture changes.
- **Reusing Shaders:** If materials can't be reused, at least try to
re-use shaders (or SpatialMaterials with different parameters but the same
configuration).
@@ -95,54 +96,55 @@ If a scene has, for example, ``20,000`` objects with ``20,000`` different
materials each, rendering will be slow. If the same scene has ``20,000``
objects, but only uses ``100`` materials, rendering will be much faster.
Pixel cost versus vertex cost
=============================
You may have heard that the lower the number of polygons in a model, the faster
it will be rendered. This is *really* relative and depends on many factors.
On a modern PC and console, vertex cost is low. GPUs originally only rendered
triangles. This meant that every frame:
1. All vertices had to be transformed by the CPU (including clipping).
2. All vertices had to be sent to the GPU memory from the main RAM.
Nowadays, all this is handled inside the GPU, greatly increasing performance.
3D artists usually have the wrong feeling about polycount performance because 3D
DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to
be edited, reducing actual performance. Game engines rely on the GPU more, so
they can render many triangles much more efficiently.
On mobile devices, the story is different. PC and console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power efficient.
To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
when the same pixel on the screen is being rendered more than once. Imagine a
town with several buildings. GPUs don't know what is visible and what is hidden
until they draw it. For example, a house might be drawn and then another house
in front of it (which means rendering happened twice for the same pixel). PC
GPUs normally don't care much about this and just throw more pixel processors to
the hardware to increase performance (which also increases power consumption).
Using more power is not an option on mobile, so mobile devices use a technique
called *tile-based rendering*, which divides the screen into a grid. Each cell
keeps the list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power consumption,
but takes a toll on vertex performance. As a result, fewer vertices and
triangles can be processed for drawing.
Additionally, tile-based rendering struggles when there are small objects with a
lot of geometry within a small portion of the screen. This forces mobile GPUs to
put a lot of strain on a single screen tile, which considerably decreases
performance as all the other cells must wait for it to complete before
displaying the frame.
To summarize, don't worry about vertex count on mobile, but
**avoid concentration of vertices in small parts of the screen**.
If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
a smaller level of detail (LOD) model. Even on desktop GPUs, it's preferable to
avoid having triangles smaller than the size of a pixel on screen.
Pay attention to the additional vertex processing required when using:
@@ -150,47 +152,53 @@ Pay attention to the additional vertex processing required when using:
- Morphs (shape keys)
- Vertex-lit objects (common on mobile)
Pixel/fragment shaders and fill rate
====================================
In contrast to vertex processing, the costs of fragment (per-pixel) shading have
increased dramatically over the years. Screen resolutions have increased (the
area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
screen, that is 27× the area), but also the complexity of fragment shaders has
exploded. Physically-based rendering requires complex calculations for each
fragment.
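The resolution figures above are easy to verify. A quick illustrative check of the pixel counts and their ratio:

```python
# Screen area comparison from the text: 4K UHD versus old VGA.
uhd_pixels = 3840 * 2160  # 4K UHD
vga_pixels = 640 * 480    # VGA
ratio = uhd_pixels / vga_pixels
```

The fragment shader runs once per covered pixel (and more with overdraw), so at 4K, the per-pixel work is multiplied 27-fold compared to VGA before shader complexity is even considered.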
You can test whether a project is fill rate-limited quite easily. Turn off
V-Sync to prevent capping the frames per second, then compare the frames per
second when running with a large window, to running with a very small window.
You may also benefit from similarly reducing your shadow map size if using
shadows. Usually, you will find the FPS increases quite a bit using a small
window, which indicates you are to some extent fill rate-limited. On the other
hand, if there is little to no increase in FPS, then your bottleneck lies
elsewhere.
You can increase performance in a fill rate-limited project by reducing the
amount of work the GPU has to do. You can do this by simplifying the shader
(perhaps turn off expensive options if you are using a :ref:`SpatialMaterial
<class_SpatialMaterial>`), or reducing the number and size of textures used.
**When targeting mobile devices, consider using the simplest possible shaders
you can reasonably afford to use.**
Reading textures
~~~~~~~~~~~~~~~~
The other factor in fragment shaders is the cost of reading textures. Reading
textures is an expensive operation, especially when reading from several
textures in a single fragment shader. Also, consider that filtering may slow it
down further (trilinear filtering between mipmaps, and averaging). Reading
textures is also expensive in terms of power usage, which is a big issue on
mobiles.
**If you use third-party shaders or write your own shaders, try to use
algorithms that require as few texture reads as possible.**
Texture compression
~~~~~~~~~~~~~~~~~~~
By default, Godot compresses textures of 3D models when imported using video RAM
(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
JPG when stored, but increases performance enormously when drawing large enough
textures.
This is because the main goal of texture compression is bandwidth reduction
between memory and the GPU.
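The bandwidth saving is easy to quantify. As an illustrative sketch (assuming uncompressed RGBA8 at 32 bits per pixel and an S3TC/DXT1-class format at 4 bits per pixel, a common desktop VRAM compression format):

```python
def texture_bytes(width, height, bits_per_pixel):
    """Raw storage for a single mip level at the given bit depth."""
    return width * height * bits_per_pixel // 8

side = 1024
uncompressed = texture_bytes(side, side, 32)  # RGBA8: 32 bits per pixel
compressed = texture_bytes(side, side, 4)     # DXT1-class: 4 bits per pixel
savings = uncompressed // compressed          # 8x less data to sample
```

Every texture sample touches 8× less memory, which is where the drawing speedup comes from; mipmaps add roughly a third more on top of both figures.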
@@ -203,61 +211,72 @@ more noticeable.
As a warning, most Android devices do not support texture compression of
textures with transparency (only opaque), so keep this in mind.
.. note::

    Even in 3D, "pixel art" textures should have VRAM compression disabled, as it
    will negatively affect their appearance without improving performance
    significantly due to their low resolution.

Post-processing and shadows
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Post-processing effects and shadows can also be expensive in terms of fragment
shading activity. Always test the impact of these on different hardware.
**Reducing the size of shadowmaps can increase performance**, both in terms of
writing and reading the shadowmaps. On top of that, the best way to improve
performance of shadows is to turn shadows off for as many lights and objects as
possible. Smaller or distant OmniLights/SpotLights can often have their shadows
disabled with only a small visual impact.
Transparency and blending
=========================
Transparent objects present particular problems for rendering efficiency. Opaque
objects (especially in 3D) can be essentially rendered in any order and the
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
blended objects are different. In most cases, they cannot rely on the Z-buffer
and must be rendered in "painter's order" (i.e. from back to front) to look
correct.
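Painter's order amounts to sorting by distance from the camera, farthest first. A minimal illustrative sketch (the object records and a camera fixed on the Z axis are assumptions for the example):

```python
def painters_order(objects, camera_z):
    """Sort transparent objects back to front by distance from the camera."""
    return sorted(objects, key=lambda obj: abs(obj["z"] - camera_z), reverse=True)

# Hypothetical objects at different depths; camera at z = 0.
objs = [
    {"name": "near", "z": 2.0},
    {"name": "far", "z": 10.0},
    {"name": "mid", "z": 5.0},
]
ordered = [o["name"] for o in painters_order(objs, camera_z=0.0)]
```

Because this sort must happen every frame as the camera moves, and because it defeats state-sorting by material, transparency carries a CPU cost on top of the fill-rate cost described below.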
Transparent objects are also particularly bad for fill rate, because every item
has to be drawn even if other transparent objects will be drawn on top
later on.
Opaque objects don't have to do this. They can usually take advantage of the
Z-buffer by writing to the Z-buffer only first, then only performing the
fragment shader on the "winning" fragment, the object that is at the front at a
particular pixel.
Transparency is particularly expensive where multiple transparent objects
overlap. It is usually better to use transparent areas as small as possible to
minimize these fill rate requirements, especially on mobile, where fill rate is
very expensive. Indeed, in many situations, rendering more complex opaque
geometry can end up being faster than using transparency to "cheat".
Multi-platform advice
=====================
If you are aiming to release on multiple platforms, test *early* and test
*often* on all your platforms, especially mobile. Developing a game on desktop
but attempting to port it to mobile at the last minute is a recipe for disaster.
In general, you should design your game for the lowest common denominator, then
add optional enhancements for more powerful platforms. For example, you may want
to use the GLES2 backend for both desktop and mobile platforms where you target
both.
Mobile/tiled renderers
======================
As described above, GPUs on mobile devices work in dramatically different ways
from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
split up the screen into regular-sized tiles that fit into super fast cache
memory, which reduces the number of read/write operations to the main memory.
There are some downsides though. Tiled rendering can make certain techniques
much more complicated and expensive to perform. Tiles that rely on the results
of rendering in different tiles or on the results of earlier operations being
preserved can be very slow. Be very careful to test the performance of shaders,
viewport textures and post processing.
@@ -4,33 +4,33 @@ Optimization
Introduction
------------
Godot follows a balanced performance philosophy. In the performance world,
there are always trade-offs, which consist of trading speed for usability
and flexibility. Some practical examples of this are:
- Rendering objects efficiently in high amounts is easy, but when a
large scene must be rendered, it can become inefficient. To solve this,
visibility computation must be added to the rendering. This makes rendering
less efficient, but at the same time, fewer objects are rendered.
Therefore, the overall rendering efficiency is improved.
- Configuring the properties of every material for every object that
needs to be rendered is also slow. To solve this, objects are sorted by
material to reduce the costs. At the same time, sorting has a cost.
- In 3D physics, a similar situation happens. The best algorithms to
handle large amounts of physics objects (such as SAP) are slow at
insertion/removal of objects and raycasting. Algorithms that allow faster
insertion and removal, as well as raycasting, will not be able to handle as
many active objects.
And there are many more examples of this! Game engines strive to be general-purpose
in nature. Balanced algorithms are always favored over algorithms
that might be fast in some situations and slow in others, or algorithms that are
fast but are more difficult to use.
Godot is not an exception to this. While it is designed to have backends swappable
for different algorithms, the default backends prioritize balance and flexibility
over performance.
With this clear, the aim of this tutorial section is to explain how to get the
@@ -22,34 +22,42 @@ in the street you are in, as well as the sky and a few birds flying overhead. As
far as a naive renderer is concerned however, you can still see the entire town.
It won't just render the buildings in front of you, it will render the street
behind that, with the people on that street, the buildings behind that. You
quickly end up in situations where you are attempting to render 10× or 100× more
than what is visible.
Things aren't quite as bad as they seem, because the Z-buffer usually allows the
GPU to only fully shade the objects that are at the front. This is called *depth
prepass* and is enabled by default in Godot when using the GLES3 renderer.
However, unneeded objects are still reducing performance.
One way we can potentially reduce the amount to be rendered is to take advantage
of occlusion. As of Godot 3.2.2, there is no built-in support for occlusion in
Godot. However, with careful design, you can still get many of the advantages.
For instance, in our city street scenario, you may be able to work out in advance
that you can only see two other streets, ``B`` and ``C``, from street ``A``.
Streets ``D`` to ``Z`` are hidden. In order to take advantage of occlusion, all
you have to do is work out when your viewer is in street ``A`` (perhaps using
Godot Areas), then you can hide the other streets.
This is a manual version of what is known as a "potentially visible set". It is
a very powerful technique for speeding up rendering. You can also use it to
restrict physics or AI to the local area, and speed these up as well as
rendering.
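The hand-authored PVS described above boils down to a lookup table from each area to the set of areas visible from it. A minimal illustrative Python sketch (the street names and visibility sets are hypothetical; in Godot, the hiding step would toggle node visibility when an Area is entered):

```python
# Hand-authored potentially visible set: for each street,
# the set of streets that can be seen from it (including itself).
PVS = {
    "A": {"A", "B", "C"},
    "B": {"A", "B"},
    "C": {"A", "C"},
}

def streets_to_hide(current_street, all_streets):
    """Everything not in the current street's visible set gets hidden."""
    return sorted(all_streets - PVS[current_street])

hidden = streets_to_hide("A", {"A", "B", "C", "D", "E"})
```

The same table can gate physics and AI updates, not just rendering, which is where much of the technique's power comes from.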
.. note::

    In some cases, you may have to adapt your level design to add more occlusion
    opportunities. For example, you may have to add more walls to prevent the
    player from seeing too far away, which would decrease performance due to the
    lost opportunities for occlusion culling.
Other occlusion techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~
There are other occlusion techniques such as portals, automatic PVS, and
raster-based occlusion culling. Some of these may be available through add-ons
and may be available in core Godot in the future.
Transparent objects
~~~~~~~~~~~~~~~~~~~
@@ -57,9 +65,10 @@ Transparent objects
Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
<class_Shader>` to improve performance. This, however, can not be done with
transparent objects. Transparent objects are rendered from back to front to make
blending with what is behind work. As a result, try to use as few transparent
objects as possible. If an object has a small section with transparency, try to
make that section a separate surface with its own Material.
blending with what is behind work. As a result,
**try to use as few transparent objects as possible**. If an object has a
small section with transparency, try to make that section a separate surface
with its own material.
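For instance, a minimal sketch of assigning a transparent material to just one
surface of a mesh (the surface index and colors are assumptions; in a real
project, you would pick the surface that actually needs transparency):

.. code-block:: gdscript

    # Only the "window" surface of this mesh uses a transparent material;
    # the other surfaces keep their opaque materials.
    extends MeshInstance

    func _ready():
        var window_material = SpatialMaterial.new()
        # Enabling transparency puts this surface on the transparent
        # (back-to-front sorted) rendering path.
        window_material.flags_transparent = true
        window_material.albedo_color = Color(0.8, 0.9, 1.0, 0.4)
        # Assumes surface 1 is the window surface of the mesh.
        set_surface_material(1, window_material)

This keeps the opaque parts of the object on the fast rendering path while only
the transparent surface pays the sorting and blending cost.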
For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
doc.
Level of detail (LOD)
=====================
In some situations, particularly at a distance, it can be a good idea to
**replace complex geometry with simpler versions**. The end user will probably
not be able to see much difference. Consider looking at a large number of trees
in the far distance. There are several strategies for replacing models at
varying distances. You could use lower-poly models or use transparency to
simulate more complex geometry.
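A simple manual LOD swap might look like this (the node names ``HighPolyMesh``
and ``LowPolyMesh`` and the switch distance are hypothetical):

.. code-block:: gdscript

    # Manual level of detail: show the high-poly mesh up close and the
    # low-poly mesh beyond a threshold distance from the camera.
    extends Spatial

    const LOD_DISTANCE = 50.0

    onready var high_poly = $HighPolyMesh
    onready var low_poly = $LowPolyMesh

    func _process(_delta):
        var camera = get_viewport().get_camera()
        if camera == null:
            return
        var distance = global_transform.origin.distance_to(
                camera.global_transform.origin)
        high_poly.visible = distance < LOD_DISTANCE
        low_poly.visible = distance >= LOD_DISTANCE

In practice, you would avoid doing this check every frame for every object
(for example, by spreading the checks across frames or using a timer).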
Billboards and imposters
~~~~~~~~~~~~~~~~~~~~~~~~
Lighting objects is one of the most costly rendering operations. Realtime
lighting, shadows (especially multiple lights), and GI are especially expensive.
They may simply be too much for lower power mobile devices to handle.
**Consider using baked lighting**, especially for mobile. This can look fantastic,
but has the downside that it will not be dynamic. Sometimes, this is a trade-off
worth making.
In general, if several lights need to affect a scene, it's best to use
:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
indirect light bounces.
Animation and skinning
======================
Animation and vertex animation such as skinning and morphing can be very
expensive on some platforms. You may need to lower the polycount considerably
for animated models or limit the number of them on screen at any one time.
Large worlds
============
Large worlds may need to be built in tiles that can be loaded on demand as you
move around the world. This can prevent memory use from getting out of hand, and
also limit the processing needed to the local area.
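A sketch of on-demand tile loading (the ``res://tiles/tile_X_Y.tscn`` naming
scheme, tile size, and load radius are all assumptions for illustration):

.. code-block:: gdscript

    # Keep only the tiles around the player loaded.
    extends Spatial

    const TILE_SIZE = 100.0
    const LOAD_RADIUS = 1  # Load a 3x3 block of tiles around the player.

    var loaded_tiles = {}

    func update_tiles(player_position):
        var center_x = int(floor(player_position.x / TILE_SIZE))
        var center_z = int(floor(player_position.z / TILE_SIZE))
        var wanted = {}
        for x in range(center_x - LOAD_RADIUS, center_x + LOAD_RADIUS + 1):
            for z in range(center_z - LOAD_RADIUS, center_z + LOAD_RADIUS + 1):
                wanted[Vector2(x, z)] = true
        # Unload tiles that are now too far away.
        for key in loaded_tiles.keys():
            if not wanted.has(key):
                loaded_tiles[key].queue_free()
                loaded_tiles.erase(key)
        # Load newly needed tiles.
        for key in wanted.keys():
            if not loaded_tiles.has(key):
                var scene = load("res://tiles/tile_%d_%d.tscn" % [int(key.x), int(key.y)])
                var tile = scene.instance()
                tile.translation = Vector3(key.x * TILE_SIZE, 0, key.y * TILE_SIZE)
                add_child(tile)
                loaded_tiles[key] = tile

For large tiles, you would likely use background loading
(``ResourceInteractiveLoader`` or a loader thread) instead of a blocking
``load()`` call to avoid stuttering.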
There may also be rendering and physics glitches due to floating point error in
large worlds. You may be able to use techniques such as orienting the world
around the player (rather than the other way around), or shifting the origin
periodically to keep things centred around ``Vector3(0, 0, 0)``.
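Origin shifting can be sketched like this (the ``Player`` node path and the
threshold distance are assumptions; a real implementation would also need to
handle physics state and particle systems carefully):

.. code-block:: gdscript

    # Origin shifting: when the player drifts too far from the origin,
    # shift the whole world so the player is back near Vector3(0, 0, 0).
    extends Spatial

    const SHIFT_THRESHOLD = 5000.0

    onready var player = $Player

    func _physics_process(_delta):
        var offset = player.global_transform.origin
        if offset.length() > SHIFT_THRESHOLD:
            # Move every top-level Spatial (including the player) back by
            # the offset, preserving all relative positions.
            for child in get_children():
                if child is Spatial:
                    child.global_translate(-offset)

Because everything moves by the same amount, nothing appears to change from the
player's point of view, but coordinates stay small enough to keep floating
point precision high.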