mirror of
https://github.com/godotengine/godot-docs.git
synced 2026-01-07 02:12:07 +03:00
Merge pull request #3852 from Calinou/improve-optimization-3.2
Proofread and improve the optimization guides
@@ -6,10 +6,10 @@ Optimization using batching
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Game engines have to send a set of instructions to the GPU to tell the GPU what
and where to draw. These instructions are sent through common interfaces called
:abbr:`APIs (Application Programming Interfaces)`. Examples of graphics APIs are
OpenGL, OpenGL ES, and Vulkan.
|
||||
|
||||
Different APIs incur different costs when drawing objects. OpenGL handles a lot
|
||||
of work for the user in the GPU driver at the cost of more expensive draw calls.
|
||||
@@ -29,21 +29,21 @@ one primitive at a time, telling it some information such as the texture used,
|
||||
the material, the position, size, etc. then saying "Draw!" (this is called a
|
||||
draw call).
|
||||
|
||||
It turns out that while this is conceptually simple from the engine side, GPUs
|
||||
operate very slowly when used in this manner. GPUs work much more efficiently
|
||||
if, instead of telling them to draw a single primitive, you tell them to draw a
|
||||
number of similar primitives all in one draw call, which we will call a "batch".
|
||||
While this is conceptually simple from the engine side, GPUs operate very slowly
|
||||
when used in this manner. GPUs work much more efficiently if you tell them to
|
||||
draw a number of similar primitives all in one draw call, which we will call a
|
||||
"batch".
|
||||
|
||||
And it turns out that they don't just work a bit faster when used in this
|
||||
manner, they work a *lot* faster.
|
||||
It turns out that they don't just work a bit faster when used in this manner;
|
||||
they work a *lot* faster.
|
||||
|
||||
As Godot is designed to be a general purpose engine, the primitives coming into
|
||||
As Godot is designed to be a general-purpose engine, the primitives coming into
|
||||
the Godot renderer can be in any order, sometimes similar, and sometimes
|
||||
dissimilar. In order to match the general purpose nature of Godot with the
|
||||
batching preferences of GPUs, Godot features an intermediate layer which can
|
||||
automatically group together primitives wherever possible, and send these
|
||||
batches on to the GPU. This can give an increase in rendering performance while
|
||||
requiring few, if any, changes to your Godot project.
|
||||
dissimilar. To match Godot's general-purpose nature with the batching
|
||||
preferences of GPUs, Godot features an intermediate layer which can
|
||||
automatically group together primitives wherever possible and send these batches
|
||||
on to the GPU. This can give an increase in rendering performance while
|
||||
requiring few (if any) changes to your Godot project.
|
||||
|
||||
How it works
|
||||
~~~~~~~~~~~~
|
||||
@@ -51,78 +51,77 @@ How it works
|
||||
Instructions come into the renderer from your game in the form of a series of
|
||||
items, each of which can contain one or more commands. The items correspond to
|
||||
Nodes in the scene tree, and the commands correspond to primitives such as
|
||||
rectangles or polygons. Some items, such as tilemaps, and text, can contain a
|
||||
large number of commands (tiles and letters respectively). Others, such as
|
||||
sprites, may only contain a single command (rectangle).
|
||||
rectangles or polygons. Some items such as TileMaps and text can contain a
|
||||
large number of commands (tiles and glyphs respectively). Others, such as
|
||||
sprites, may only contain a single command (a rectangle).
|
||||
|
||||
The batcher uses two main techniques to group together primitives:
|
||||
|
||||
* Consecutive items can be joined together
|
||||
* Consecutive commands within an item can be joined to form a batch
|
||||
- Consecutive items can be joined together.
|
||||
- Consecutive commands within an item can be joined to form a batch.
|
||||
|
||||
Breaking batching
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
Batching can only take place if the items or commands are similar enough to be
|
||||
rendered in one draw call. Certain changes (or techniques), by necessity, prevent
|
||||
the formation of a contiguous batch. This is referred to as "breaking batching".

Batching will be broken by (among other things):

- Change of texture.
- Change of material.
- Change of primitive type (say, going from rectangles to lines).
|
||||
|
||||
.. note::
|
||||
|
||||
If for example, you draw a series of sprites each with a different texture,
|
||||
there is no way they can be batched.
|
||||
For example, if you draw a series of sprites each with a different texture,
|
||||
there is no way they can be batched.
|
||||
|
||||
Render order
|
||||
^^^^^^^^^^^^
|
||||
Determining the rendering order
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The question arises, if only similar items can be drawn together in a batch, why
|
||||
don't we look through all the items in a scene, group together all the similar
|
||||
items, and draw them together?
|
||||
|
||||
In 3D, this is often exactly how engines work. However, in Godot 2D, items are
|
||||
drawn in 'painter's order', from back to front. This ensures that items at the
|
||||
front are drawn on top of earlier items, when they overlap.
|
||||
In 3D, this is often exactly how engines work. However, in Godot's 2D renderer,
|
||||
items are drawn in "painter's order", from back to front. This ensures that
|
||||
items at the front are drawn on top of earlier items when they overlap.
|
||||
|
||||
This also means that if we try to draw objects on a per-texture basis, then
this painter's order may break and objects will be drawn in the wrong order.
|
||||
|
||||
In Godot this back to front order is determined by:
|
||||
* The order of objects in the scene tree
|
||||
* The Z index of objects
|
||||
* The canvas layer
|
||||
* Y sort nodes
|
||||
In Godot, this back-to-front order is determined by:
|
||||
|
||||
- The order of objects in the scene tree.
|
||||
- The Z index of objects.
|
||||
- The canvas layer.
|
||||
- :ref:`class_YSort` nodes.
|
||||
|
||||
.. note::
|
||||
|
||||
You can group similar objects together for easier batching. While doing so
|
||||
is not a requirement on your part, think of it as an optional approach that
|
||||
can improve performance in some cases. See the diagnostics section in order
|
||||
to help you make this decision.
|
||||
You can group similar objects together for easier batching. While doing so
|
||||
is not a requirement on your part, think of it as an optional approach that
|
||||
can improve performance in some cases. See the
|
||||
:ref:`doc_batching_diagnostics` section to help you make this decision.
|
||||
|
||||
A trick
|
||||
^^^^^^^
|
||||
|
||||
And now a sleight of hand. Although the idea of painter's order is that objects
|
||||
are rendered from back to front, consider 3 objects A, B and C, that contain 2
|
||||
different textures, grass and wood.
|
||||
And now, a sleight of hand. Even though the idea of painter's order is that
|
||||
objects are rendered from back to front, consider 3 objects ``A``, ``B`` and
|
||||
``C``, that contain 2 different textures: grass and wood.
|
||||
|
||||
.. image:: img/overlap1.png
|
||||
|
||||
In painter's order they are ordered::

    A - wood
    B - grass
    C - wood
|
||||
|
||||
Because the texture changes, they cannot be batched, and will be rendered in 3
|
||||
Because of the texture changes, they can't be batched and will be rendered in 3
|
||||
draw calls.
|
||||
|
||||
However, painter's order is only needed on the assumption that they will be
|
||||
@@ -145,62 +144,62 @@ balance the costs and benefits in your project.
|
||||
|
||||
::

    A - wood
    C - wood
    B - grass
|
||||
|
||||
Because the texture only changes once, we can render the above in only 2
|
||||
draw calls.
|
||||
Since the texture only changes once, we can render the above in only 2 draw
|
||||
calls.
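The effect of reordering can be sketched in a few lines of Python. This is only an illustration, not Godot's actual batcher: ``count_draw_calls`` is a hypothetical helper that starts a new draw call whenever the texture changes between consecutive items, and it assumes the reordering is legal (the moved items don't overlap).

```python
def count_draw_calls(items):
    """Count draw calls, assuming one call per run of consecutive items
    that share a texture (hypothetical model, not engine code)."""
    calls = 0
    previous_texture = None
    for _name, texture in items:
        if texture != previous_texture:
            calls += 1
            previous_texture = texture
    return calls


painter_order = [("A", "wood"), ("B", "grass"), ("C", "wood")]
reordered = [("A", "wood"), ("C", "wood"), ("B", "grass")]

print(count_draw_calls(painter_order))  # 3
print(count_draw_calls(reordered))      # 2
```

Only the order of the list changes between the two calls, which is why item order alone can decide how many batches are formed.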
|
||||
|
||||
Lights
|
||||
~~~~~~
|
||||
|
||||
Although the job for the batching system is normally quite straightforward, it
|
||||
becomes considerably more complex when 2D lights are used, because lights are
|
||||
drawn using extra passes, one for each light affecting the primitive. Consider 2
|
||||
sprites A and B, with identical texture and material. Without lights they would
|
||||
be batched together and drawn in one draw call. But with 3 lights, they would be
|
||||
drawn as follows, each line a draw call:
|
||||
Although the batching system's job is normally quite straightforward, it becomes
|
||||
considerably more complex when 2D lights are used. This is because lights are
|
||||
drawn using additional passes, one for each light affecting the primitive.
|
||||
Consider 2 sprites ``A`` and ``B``, with identical texture and material. Without
|
||||
lights, they would be batched together and drawn in one draw call. But with 3
|
||||
lights, they would be drawn as follows, each line being a draw call:
|
||||
|
||||
.. image:: img/lights_overlap.png
|
||||
|
||||
::

    A
    A - light 1
    A - light 2
    A - light 3
    B
    B - light 1
    B - light 2
    B - light 3
|
||||
|
||||
That is a lot of draw calls, 8 for only 2 sprites. Now consider we are drawing
|
||||
1000 sprites, the number of draw calls quickly becomes astronomical, and
|
||||
That is a lot of draw calls: 8 for only 2 sprites. Now, consider we are drawing
|
||||
1,000 sprites. The number of draw calls quickly becomes astronomical and
|
||||
performance suffers. This is partly why lights have the potential to drastically
|
||||
slow down 2D.
|
||||
slow down 2D rendering.
|
||||
|
||||
However, if you remember our magician's trick from item reordering, it turns out
|
||||
we can use the same trick to get around painter's order for lights!
|
||||
|
||||
If A and B are not overlapping, we can render them together in a batch, so the
|
||||
draw process is as follows:
|
||||
If ``A`` and ``B`` are not overlapping, we can render them together in a batch,
|
||||
so the drawing process is as follows:
|
||||
|
||||
.. image:: img/lights_separate.png
|
||||
|
||||
::

    AB
    AB - light 1
    AB - light 2
    AB - light 3
|
||||
|
||||
|
||||
That is 4 draw calls. Not bad, that is a 50% improvement. However consider that
|
||||
in a real game, you might be drawing closer to 1000 sprites.
|
||||
That is only 4 draw calls. Not bad, as that is a 2× reduction. However, consider
|
||||
that in a real game, you might be drawing closer to 1,000 sprites.
|
||||
|
||||
- Before: 1000 * 4 = 4000 draw calls.
|
||||
- After: 1 * 4 = 4 draw calls.
|
||||
- **Before:** 1000 × 4 = 4,000 draw calls.
|
||||
- **After:** 1 × 4 = 4 draw calls.
|
||||
|
||||
That is a 1000× decrease in draw calls, and should give a huge increase in
|
||||
performance.
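The arithmetic above follows from a simple model: each batch is drawn once unlit, then once more for every light that affects it. A minimal sketch of that model, where ``lit_draw_calls`` is a name made up for this illustration:

```python
def lit_draw_calls(batch_count, light_count):
    """Each batch is drawn once unlit, then once per light (model from the text)."""
    return batch_count * (1 + light_count)


before = lit_draw_calls(batch_count=1000, light_count=3)  # no items joined
after = lit_draw_calls(batch_count=1, light_count=3)      # all sprites joined

print(before, after, before // after)  # 4000 4 1000
```

In a real scene, not every light affects every sprite and not every sprite can be joined, so actual numbers will fall somewhere between these two extremes.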
|
||||
@@ -208,158 +207,163 @@ performance.
|
||||
Overlap test
|
||||
^^^^^^^^^^^^
|
||||
|
||||
However, as with the item reordering, things are not that simple. We must first
perform the overlap test to determine whether we can join these primitives. This
overlap test has a small cost. Again, you can choose the number of primitives to
look ahead in the overlap test to balance the benefits against the cost. With
lights, the benefits usually far outweigh the costs.
|
||||
|
||||
Also consider that depending on the arrangement of primitives in the viewport,
|
||||
the overlap test will sometimes fail (because the primitives overlap and thus
|
||||
should not be joined). So in practice the decrease in draw calls may be less
|
||||
dramatic than the perfect situation of no overlap. However performance is
|
||||
usually far higher than without this lighting optimization.
|
||||
the overlap test will sometimes fail (because the primitives overlap and
|
||||
therefore shouldn't be joined). In practice, the decrease in draw calls may be
|
||||
less dramatic than in a perfect situation with no overlapping at all. However,
|
||||
performance is usually far higher than without this lighting optimization.
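To make the cost/benefit trade-off concrete, here is a rough sketch of an overlap test with a bounded lookahead. It is an assumption-laden illustration rather than the engine's algorithm: items are reduced to a texture name plus an axis-aligned rectangle, "same texture" stands in for "compatible enough to batch", and ``find_joinable`` is a hypothetical helper.

```python
def rects_overlap(a, b):
    """Axis-aligned overlap test for rects given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def find_joinable(items, start, lookahead):
    """Return the index of the first later item that could be drawn right after
    items[start]: same texture, and not overlapping anything it would skip past.
    Only the next `lookahead` items are examined."""
    texture, _rect = items[start]
    end = min(len(items), start + 1 + lookahead)
    for j in range(start + 1, end):
        cand_texture, cand_rect = items[j]
        if cand_texture != texture:
            continue
        skipped = items[start + 1:j]
        if not any(rects_overlap(cand_rect, r) for _t, r in skipped):
            return j
    return None


items = [
    ("wood", (0, 0, 10, 10)),
    ("grass", (20, 0, 10, 10)),
    ("wood", (40, 0, 10, 10)),  # does not touch the grass item
]
print(find_joinable(items, 0, lookahead=2))  # 2
print(find_joinable(items, 0, lookahead=1))  # None
```

A larger lookahead finds more joinable items but performs more rectangle comparisons per item, which is the balance the lookahead settings let you tune.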
|
||||
|
||||
Light Scissoring
|
||||
Light scissoring
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
Batching can make it more difficult to cull out objects that are not affected or
|
||||
partially affected by a light. This can increase the fill rate requirements
|
||||
quite a bit, and slow rendering. Fill rate is the rate at which pixels are
|
||||
colored, it is another potential bottleneck unrelated to draw calls.
|
||||
quite a bit and slow down rendering. *Fill rate* is the rate at which pixels are
|
||||
colored. It is another potential bottleneck unrelated to draw calls.
|
||||
|
||||
To counter this problem (and speed up lighting in general), batching
introduces light scissoring. This enables the use of the OpenGL command
``glScissor()``, which identifies an area outside of which the GPU won't render
any pixels. We can greatly optimize fill rate by identifying the intersection
area between a light and a primitive, and limiting rendering of the light to
*that area only*.
|
||||
|
||||
Light scissoring is controlled with the :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
project setting. This value is between 1.0 and 0.0, with 1.0 being off (no
|
||||
scissoring), and 0.0 being scissoring in every circumstance. The reason for the
|
||||
setting is that there may be some small cost to scissoring on some hardware.
|
||||
Generally though, when you are using lighting, it should result in some
|
||||
performance gains.
|
||||
That said, scissoring should usually result in performance gains when you're
|
||||
using 2D lighting.
|
||||
|
||||
The relationship between the threshold and whether a scissor operation takes
|
||||
place is not altogether straight forward, but generally it represents the pixel
|
||||
area that is potentially 'saved' by a scissor operation (i.e. the fill rate
|
||||
saved). At 1.0, the entire screens pixels would need to be saved, which rarely
|
||||
if ever happens, so it is switched off. In practice the useful values are
|
||||
bunched towards zero, as only a small percentage of pixels need to be saved for
|
||||
the operation to be useful.
|
||||
place is not always straightforward. Generally, it represents the pixel area
|
||||
that is potentially "saved" by a scissor operation (i.e. the fill rate saved).
|
||||
At 1.0, the entire screen's pixels would need to be saved, which rarely (if
|
||||
ever) happens, so it is switched off. In practice, the useful values are close
|
||||
to 0.0, as only a small percentage of pixels need to be saved for the operation
|
||||
to be useful.
|
||||
|
||||
The exact relationship is probably not necessary for users to worry about, but
|
||||
out of interest is included in the appendix.
|
||||
is included in the appendix out of interest:
|
||||
:ref:`doc_batching_light_scissoring_threshold_calculation`
|
||||
|
||||
.. figure:: img/scissoring.png
   :alt: Light scissoring example diagram

   The light is at the bottom right, and the red area shows the pixels saved by
   the scissoring operation. Only the intersection needs to be rendered.
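The geometry behind the diagram can be sketched as a rectangle intersection. The helpers below (``intersect`` and ``pixels_saved``) are hypothetical and treat both the light and the primitive as axis-aligned rectangles; how the saved area is compared against ``scissor_area_threshold`` is deliberately not modeled here.

```python
def intersect(a, b):
    """Intersection of two rects (x, y, width, height), or None if disjoint."""
    x1 = max(a[0], b[0])
    y1 = max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2 - x1, y2 - y1)


def pixels_saved(primitive, light):
    """Pixels of the primitive that lie outside the light and can be skipped."""
    area = primitive[2] * primitive[3]
    overlap = intersect(primitive, light)
    lit = overlap[2] * overlap[3] if overlap else 0
    return area - lit


primitive = (0, 0, 400, 300)
light = (300, 200, 200, 200)
print(intersect(primitive, light))     # (300, 200, 100, 100)
print(pixels_saved(primitive, light))  # 110000
```

The intersection rectangle is the kind of area that would be passed to a scissor call, so the light pass only touches pixels inside it.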
|
||||
|
||||
Vertex baking
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
The GPU shader receives instructions on what to draw in 2 main ways:
|
||||
|
||||
* Shader uniforms (e.g. modulate color, item transform)
|
||||
* Vertex attributes (vertex color, local transform)
|
||||
- Shader uniforms (e.g. modulate color, item transform).
|
||||
- Vertex attributes (vertex color, local transform).
|
||||
|
||||
However, within a single draw call (batch) we cannot change uniforms. This means
|
||||
that naively, we would not be able to batch together items or commands that
|
||||
change final_modulate, or item transform. Unfortunately that is an awful lot of
|
||||
cases. Sprites for instance typically are individual nodes with their own item
|
||||
transform, and they may have their own color modulate.
|
||||
However, within a single draw call (batch), we cannot change uniforms. This
|
||||
means that naively, we would not be able to batch together items or commands
|
||||
that change ``final_modulate`` or an item's transform. Unfortunately, that
|
||||
happens in an awful lot of cases. For instance, sprites are typically
|
||||
individual nodes with their own item transform, and they may have their own
|
||||
color modulate as well.
|
||||
|
||||
To get around this problem, the batching can "bake" some of the uniforms into
|
||||
the vertex attributes.
|
||||
|
||||
- The item transform can be combined with the local transform and sent in a
  vertex attribute.
- The final modulate color can be combined with the vertex colors and sent in a
  vertex attribute.
|
||||
|
||||
In most cases this works fine, but this shortcut breaks down if a shader expects
|
||||
these values to be available individually, rather than combined. This can happen
|
||||
In most cases, this works fine, but this shortcut breaks down if a shader expects
|
||||
these values to be available individually rather than combined. This can happen
|
||||
in custom shaders.
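As a rough picture of what "baking" means, the sketch below folds a per-item transform and modulate color into each vertex on the CPU, so that no per-item uniform is needed afterwards. It is a simplified assumption, not the renderer's code: transforms are plain 2×3 affine matrices, colors are RGBA float tuples, and ``bake_vertices`` is a hypothetical name.

```python
def apply_transform(transform, point):
    """Apply a 2D affine transform ((a, b, tx), (c, d, ty)) to a point (x, y)."""
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def bake_vertices(vertices, item_transform, modulate):
    """Fold the per-item transform and modulate color into each vertex.
    vertices: list of (position, color), with color as (r, g, b, a) floats."""
    baked = []
    for position, color in vertices:
        baked_position = apply_transform(item_transform, position)
        baked_color = tuple(c * m for c, m in zip(color, modulate))
        baked.append((baked_position, baked_color))
    return baked


translate = ((1.0, 0.0, 100.0), (0.0, 1.0, 50.0))  # move by (100, 50)
half_red = (0.5, 1.0, 1.0, 1.0)                    # modulate: halve the red channel
white = (1.0, 1.0, 1.0, 1.0)
quad = [((0.0, 0.0), white), ((16.0, 0.0), white)]

print(bake_vertices(quad, translate, half_red))
```

After baking, a shader only ever sees the combined position and color, which is why a custom shader that needs the original values separately prevents this optimization.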
|
||||
|
||||
Custom Shaders
|
||||
Custom shaders
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
As a result certain operations in custom shaders will prevent baking, and thus
|
||||
decrease the potential for batching. While we are working to decrease these
|
||||
cases, currently the following conditions apply:
|
||||
As a result of the limitation described above, certain operations in custom
|
||||
shaders will prevent vertex baking and therefore decrease the potential for
|
||||
batching. While we are working to decrease these cases, the following caveats
|
||||
currently apply:
|
||||
|
||||
* Reading or writing ``COLOR`` or ``MODULATE`` - disables vertex color baking
|
||||
* Reading ``VERTEX`` - disables vertex position baking
|
||||
- Reading or writing ``COLOR`` or ``MODULATE`` disables vertex color baking.
|
||||
- Reading ``VERTEX`` disables vertex position baking.
|
||||
|
||||
Project Settings
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
In order to fine tune batching, a number of project settings are available. You
|
||||
can usually leave these at default during development, but it is a good idea to
|
||||
To fine-tune batching, a number of project settings are available. You can
|
||||
usually leave these at default during development, but it's a good idea to
|
||||
experiment to ensure you are getting maximum performance. Spending a little time
|
||||
tweaking parameters can often give considerable performance gain, for very
|
||||
little effort. See the tooltips in the project settings for more info.
|
||||
tweaking parameters can often give considerable performance gains for very
|
||||
little effort. See the on-hover tooltips in the Project Settings for more
|
||||
information.
|
||||
|
||||
rendering/batching/options
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`use_batching
|
||||
- :ref:`use_batching
|
||||
<class_ProjectSettings_property_rendering/batching/options/use_batching>` -
|
||||
Turns batching on and off
|
||||
Turns batching on or off.
|
||||
|
||||
* :ref:`use_batching_in_editor
|
||||
- :ref:`use_batching_in_editor
|
||||
<class_ProjectSettings_property_rendering/batching/options/use_batching_in_editor>`
|
||||
Turns batching on or off in the Godot editor.
|
||||
This setting doesn't affect the running project in any way.
|
||||
|
||||
* :ref:`single_rect_fallback
|
||||
<class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
|
||||
- This is a faster way of drawing unbatchable rectangles, however it may lead
|
||||
to flicker on some hardware so is not recommended
|
||||
- :ref:`single_rect_fallback
|
||||
<class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>` -
|
||||
This is a faster way of drawing unbatchable rectangles. However, it may lead
|
||||
to flicker on some hardware so it's not recommended.
|
||||
|
||||
rendering/batching/parameters
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
- :ref:`max_join_item_commands <class_ProjectSettings_property_rendering/batching/parameters/max_join_item_commands>` -
  One of the most important ways of achieving batching is to join suitable
  adjacent items (nodes) together. However, they can only be joined if the
  commands they contain are compatible. The system must therefore do a lookahead
  through the commands in an item to determine whether it can be joined. This
  has a small cost per command, and items with a large number of commands are
  not worth joining, so the best value may be project-dependent.
|
||||
|
||||
* :ref:`colored_vertex_format_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/parameters/colored_vertex_format_threshold>` - Baking colors into
|
||||
vertices results in a
|
||||
larger vertex format. This is not necessarily worth doing unless there are a
|
||||
lot of color changes going on within a joined item. This parameter represents
|
||||
the proportion of commands containing color changes / the total commands,
|
||||
above which it switches to baked colors.
|
||||
- :ref:`colored_vertex_format_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/parameters/colored_vertex_format_threshold>` -
|
||||
Baking colors into vertices results in a larger vertex format. This is not
|
||||
necessarily worth doing unless there are a lot of color changes going on
|
||||
within a joined item. This parameter represents the proportion of commands
|
||||
containing color changes / the total commands, above which it switches to
|
||||
baked colors.
|
||||
|
||||
- :ref:`batch_buffer_size
  <class_ProjectSettings_property_rendering/batching/parameters/batch_buffer_size>` -
  This determines the maximum size of a batch. It doesn't have a huge effect
  on performance but can be worth decreasing for mobile if RAM is at a premium.
|
||||
|
||||
* :ref:`item_reordering_lookahead
|
||||
<class_ProjectSettings_property_rendering/batching/parameters/item_reordering_lookahead>`
|
||||
- Item reordering can help especially with
|
||||
interleaved sprites using different textures. The lookahead for the overlap
|
||||
test has a small cost, so the best value may change per project.
|
||||
- :ref:`item_reordering_lookahead
|
||||
<class_ProjectSettings_property_rendering/batching/parameters/item_reordering_lookahead>` -
|
||||
Item reordering can help especially with interleaved sprites using different
|
||||
textures. The lookahead for the overlap test has a small cost, so the best
|
||||
value may change per project.
|
||||
|
||||
rendering/batching/lights
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
- See light scissoring.
|
||||
- :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>` -
|
||||
See light scissoring.
|
||||
|
||||
* :ref:`max_join_items
|
||||
<class_ProjectSettings_property_rendering/batching/lights/max_join_items>` -
|
||||
- :ref:`max_join_items
|
||||
<class_ProjectSettings_property_rendering/batching/lights/max_join_items>` -
|
||||
Joining items before lighting can significantly increase
|
||||
performance. This requires an overlap test, which has a small cost, so the
|
||||
costs and benefits may be project dependent, and hence the best value to use
|
||||
@@ -368,22 +372,22 @@ rendering/batching/lights
|
||||
rendering/batching/debug
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`flash_batching
|
||||
<class_ProjectSettings_property_rendering/batching/debug/flash_batching>` -
|
||||
- :ref:`flash_batching
|
||||
<class_ProjectSettings_property_rendering/batching/debug/flash_batching>` -
|
||||
This is purely a debugging feature to identify regressions between the
|
||||
batching and legacy renderer. When it is switched on, the batching and legacy
|
||||
renderer are used alternately on each frame. This will decrease performance,
|
||||
and should not be used for your final export, only for testing.
|
||||
|
||||
* :ref:`diagnose_frame
|
||||
<class_ProjectSettings_property_rendering/batching/debug/diagnose_frame>` -
|
||||
- :ref:`diagnose_frame
|
||||
<class_ProjectSettings_property_rendering/batching/debug/diagnose_frame>` -
|
||||
This will periodically print a diagnostic batching log to
|
||||
the Godot IDE / console.
|
||||
|
||||
rendering/batching/precision
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`uv_contract
|
||||
- :ref:`uv_contract
|
||||
<class_ProjectSettings_property_rendering/batching/precision/uv_contract>` -
|
||||
On some hardware (notably some Android devices) there have been reports of
|
||||
tilemap tiles drawing slightly outside their UV range, leading to edge
|
||||
@@ -391,10 +395,12 @@ rendering/batching/precision
|
||||
contract. This makes a small contraction in the UV coordinates to compensate
|
||||
for precision errors on devices.
|
||||
|
||||
* :ref:`uv_contract_amount
|
||||
<class_ProjectSettings_property_rendering/batching/precision/uv_contract_amount>`
|
||||
- Hopefully the default amount should cure artifacts on most devices, but just
|
||||
in case, this value is editable.
|
||||
- :ref:`uv_contract_amount
|
||||
<class_ProjectSettings_property_rendering/batching/precision/uv_contract_amount>` -
|
||||
Hopefully, the default amount should cure artifacts on most devices,
|
||||
but this value remains adjustable just in case.
|
||||
|
||||
.. _doc_batching_diagnostics:
|
||||
|
||||
Diagnostics
|
||||
~~~~~~~~~~~
|
||||
@@ -403,120 +409,117 @@ Although you can change parameters and examine the effect on frame rate, this
|
||||
can feel like working blindly, with no idea of what is going on under the hood.
|
||||
To help with this, batching offers a diagnostic mode, which will periodically
|
||||
print out (to the IDE or console) a list of the batches that are being
|
||||
processed. This can help pin point situations where batching is not occurring as
|
||||
intended, and help you to fix them, in order to get the best possible
|
||||
performance.
|
||||
processed. This can help pinpoint situations where batching isn't occurring
|
||||
as intended, and help you fix these situations to get the best possible performance.
|
||||
|
||||
Reading a diagnostic
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
    canvas_begin FRAME 2604
    items
    joined_item 1 refs
    batch D 0-0
    batch D 0-2 n n
    batch R 0-1 [0 - 0] {255 255 255 255 }
    joined_item 1 refs
    batch D 0-0
    batch R 0-1 [0 - 146] {255 255 255 255 }
    batch D 0-0
    batch R 0-1 [0 - 146] {255 255 255 255 }
    joined_item 1 refs
    batch D 0-0
    batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
    batch D 0-0
    batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
    batch D 0-0
    batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
    canvas_end
|
||||
|
||||
|
||||
This is a typical diagnostic.
|
||||
|
||||
- **joined_item:** A joined item can contain 1 or
  more references to items (nodes). Generally, joined_items containing many
  references are preferable to many joined_items containing a single reference.
  Whether items can be joined will be determined by their contents and
  compatibility with the previous item.
- **batch R:** A batch containing rectangles. The second number is the number of
  rects. The second number in square brackets is the Godot texture ID, and the
  numbers in curly braces are the color. If the batch contains more than one rect,
  ``MULTI`` is added to the line to make it easy to identify.
  Seeing ``MULTI`` is good as it indicates successful batching.
- **batch D:** A default batch, containing everything else that is not currently
  batched.
|
||||
|
||||
Default Batches
|
||||
Default batches
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
The second number following default batches is the number of commands in the
|
||||
batch, and it is followed by a brief summary of the contents:
|
||||
batch, and it is followed by a brief summary of the contents::
|
||||
|
||||
::
|
||||
l - line
|
||||
PL - polyline
|
||||
r - rect
|
||||
n - ninepatch
|
||||
PR - primitive
|
||||
p - polygon
|
||||
m - mesh
|
||||
MM - multimesh
|
||||
PA - particles
|
||||
c - circle
|
||||
t - transform
|
||||
CI - clip_ignore
|
||||
|
||||
l - line
|
||||
PL - polyline
|
||||
r - rect
|
||||
n - ninepatch
|
||||
PR - primitive
|
||||
p - polygon
|
||||
m - mesh
|
||||
MM - multimesh
|
||||
PA - particles
|
||||
c - circle
|
||||
t - transform
|
||||
CI - clip_ignore
|
||||
You may see "dummy" default batches containing no commands; you can ignore those.
|
||||
|
||||
You may see "dummy" default batches containing no commands, you can ignore
|
||||
these.
|
||||
Frequently asked questions
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
FAQ
|
||||
~~~
|
||||
I don't get a large performance increase when enabling batching.
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
I don't get a large performance increase from switching on batching
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Try the diagnostics, see how much batching is occurring, and whether it can be
|
||||
- Try the diagnostics, see how much batching is occurring, and whether it can be
|
||||
improved
|
||||
* Try changing parameters
|
||||
* Consider that batching may not be your bottleneck (see bottlenecks)
|
||||
- Try changing batching parameters in the Project Settings.
|
||||
- Consider that batching may not be your bottleneck (see bottlenecks).
|
||||
|
||||
I get a decrease in performance with batching
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
I get a decrease in performance with batching.
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Try steps to increase batching given above
|
||||
* Try switching :ref:`single_rect_fallback
|
||||
<class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
|
||||
to on
|
||||
* The single rect fallback method is the default used without batching, and it
|
||||
is approximately twice as fast, however it can result in flicker on some
|
||||
hardware, so its use is discouraged
|
||||
* After trying the above, if your scene is still performing worse, consider
|
||||
- Try the steps described above to increase the number of batching opportunities.
|
||||
- Try enabling :ref:`single_rect_fallback
|
||||
<class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`.
|
||||
- The single rect fallback method is the default used without batching, and it
|
||||
is approximately twice as fast. However, it can result in flickering on some
|
||||
hardware, so its use is discouraged.
|
||||
- After trying the above, if your scene is still performing worse, consider
|
||||
turning off batching.
|
||||
|
||||
I use custom shaders and the items are not batching
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
I use custom shaders and the items are not batching.
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Custom shaders can be problematic for batching, see the custom shaders section
|
||||
- Custom shaders can be problematic for batching. See the custom shaders section.
|
||||
|
||||
I am seeing line artifacts appear on certain hardware
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
I am seeing line artifacts appear on certain hardware.
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* See the :ref:`uv_contract
|
||||
- See the :ref:`uv_contract
|
||||
<class_ProjectSettings_property_rendering/batching/precision/uv_contract>`
|
||||
project setting which can be used to solve this problem.
|
||||
|
||||
I use a large number of textures, so few items are being batched
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
I use a large number of textures, so few items are being batched.
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Consider the use of texture atlases. As well as allowing batching, these
|
||||
reduce the need for state changes associated with changing texture.
|
||||
- Consider using texture atlases. As well as allowing batching, these
|
||||
reduce the need for state changes associated with changing textures.
|
||||
|
||||
Appendix
|
||||
~~~~~~~~
|
||||
|
||||
.. _doc_batching_light_scissoring_threshold_calculation:
|
||||
|
||||
Light scissoring threshold calculation
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
@@ -525,29 +528,23 @@ The actual proportion of screen pixel area used as the threshold is the
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
value to the power of 4.
|
||||
|
||||
For example, on a screen size ``1920 x 1080`` there are ``2,073,600`` pixels.
|
||||
For example, on a screen size of 1920×1080, there are 2,073,600 pixels.
|
||||
|
||||
At a threshold of ``1000`` pixels, the proportion would be:
|
||||
At a threshold of 1,000 pixels, the proportion would be::
|
||||
|
||||
::
|
||||
|
||||
1000 / 2073600 = 0.00048225
|
||||
0.00048225 ^ 0.25 = 0.14819
|
||||
|
||||
.. note:: The power of 0.25 is the opposite of power of 4).
|
||||
1000 / 2073600 = 0.00048225
|
||||
0.00048225 ^ (1/4) = 0.14819
|
||||
|
||||
So a :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
of 0.15 would be a reasonable value to try.
|
||||
of ``0.15`` would be a reasonable value to try.
|
||||
|
||||
Going the other way, for instance with a :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
of ``0.5``:
|
||||
of ``0.5``::
|
||||
|
||||
::
|
||||
0.5 ^ 4 = 0.0625
|
||||
0.0625 * 2073600 = 129600 pixels
|
||||
|
||||
0.5 ^ 4 = 0.0625
|
||||
0.0625 * 2073600 = 129600 pixels
|
||||
|
||||
If the number of pixels saved is more than this threshold, the scissor is
|
||||
If the number of pixels saved is greater than this threshold, the scissor is
|
||||
activated.
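
The two conversions above can also be written as a short GDScript sketch. The
helper names below are hypothetical and are not part of the Godot API. They
only restate the arithmetic shown in this appendix::

    # Hypothetical helpers (not engine functions) that restate the formulas above.

    # Converts a pixel count into a scissor_area_threshold value.
    func pixels_to_scissor_area_threshold(pixels, screen_size):
        var proportion = float(pixels) / (screen_size.x * screen_size.y)
        # Raising to the power of 1/4 undoes the "power of 4" applied by the engine.
        return pow(proportion, 0.25)

    # Converts a scissor_area_threshold value back into a pixel count.
    func scissor_area_threshold_to_pixels(threshold, screen_size):
        return pow(threshold, 4.0) * screen_size.x * screen_size.y

Calling ``pixels_to_scissor_area_threshold(1000, Vector2(1920, 1080))`` should
give roughly the ``0.148`` value worked out above.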
|
||||
|
||||
@@ -1,13 +1,13 @@
|
||||
.. _doc_cpu_optimization:
|
||||
|
||||
CPU Optimizations
|
||||
=================
|
||||
CPU optimization
|
||||
================
|
||||
|
||||
Measuring performance
|
||||
=====================
|
||||
|
||||
To know how to speed up our program, we have to know where the "bottlenecks"
|
||||
are. Bottlenecks are the slowest parts of the program that limit the rate that
|
||||
are. Bottlenecks are the slowest parts of the program that limit the rate that
|
||||
everything can progress. This allows us to concentrate our efforts on optimizing
|
||||
the areas which will give us the greatest speed improvement, instead of spending
|
||||
a lot of time optimizing functions that will lead to small performance
|
||||
@@ -21,27 +21,31 @@ CPU profilers
|
||||
Profilers run alongside your program and take timing measurements to work out
|
||||
what proportion of time is spent in each function.
|
||||
|
||||
The Godot IDE conveniently has a built in profiler. It does not run every time
|
||||
you start your project, and must be manually started and stopped. This is
|
||||
because, in common with most profilers, recording these timing measurements can
|
||||
The Godot IDE conveniently has a built-in profiler. It does not run every time
|
||||
you start your project: it must be manually started and stopped. This is
|
||||
because, like most profilers, recording these timing measurements can
|
||||
slow down your project significantly.
|
||||
|
||||
After profiling, you can look back at the results for a frame.
|
||||
|
||||
.. image:: img/godot_profiler.png
|
||||
.. figure:: img/godot_profiler.png
|
||||
.. figure:: img/godot_profiler.png
|
||||
:alt: Screenshot of the Godot profiler
|
||||
|
||||
`These are the results of a profile of one of the demo projects.`
|
||||
Results of a profile of one of the demo projects.
|
||||
|
||||
.. note:: We can see the cost of built-in processes such as physics and audio,
|
||||
as well as seeing the cost of our own scripting functions at the
|
||||
bottom.
|
||||
|
||||
Time spent waiting for various built-in servers may not be counted in
|
||||
the profilers. This is a known bug.
|
||||
|
||||
When a project is running slowly, you will often see an obvious function or
|
||||
process taking a lot more time than others. This is your primary bottleneck, and
|
||||
you can usually increase speed by optimizing this area.
|
||||
|
||||
For more info about using the profiler within Godot see
|
||||
:ref:`doc_debugger_panel`.
|
||||
For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
|
||||
|
||||
External profilers
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
@@ -49,70 +53,70 @@ External profilers
|
||||
Although the Godot IDE profiler is very convenient and useful, sometimes you
|
||||
need more power, and the ability to profile the Godot engine source code itself.
|
||||
|
||||
You can use a number of third party profilers to do this including Valgrind,
|
||||
VerySleepy, Visual Studio and Intel VTune.
|
||||
You can use a number of third-party profilers to do this, including
|
||||
`Valgrind <https://www.valgrind.org/>`__,
|
||||
`VerySleepy <http://www.codersnotes.com/sleepy/>`__,
|
||||
`HotSpot <https://github.com/KDAB/hotspot>`__,
|
||||
`Visual Studio <https://visualstudio.microsoft.com/>`__ and
|
||||
`Intel VTune <https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html>`__.
|
||||
|
||||
.. note:: You may need to compile Godot from source in order to use a third
|
||||
party profiler so that you have program database information
|
||||
available. You can also use a debug build, however, note that the
|
||||
results of profiling a debug build will be different to a release
|
||||
build, because debug builds are less optimized. Bottlenecks are often
|
||||
in a different place in debug builds, so you should profile release
|
||||
builds wherever possible.
|
||||
.. note:: You will need to compile Godot from source to use a third-party profiler.
|
||||
This is required to obtain debugging symbols. You can also use a debug
|
||||
build. However, note that the results of profiling a debug build will
|
||||
be different to a release build, because debug builds are less
|
||||
optimized. Bottlenecks are often in a different place in debug builds,
|
||||
so you should profile release builds whenever possible.
|
||||
|
||||
.. image:: img/valgrind.png
|
||||
.. figure:: img/valgrind.png
|
||||
:alt: Screenshot of Callgrind
|
||||
|
||||
`These are example results from Callgrind, part of Valgrind, on Linux.`
|
||||
Example results from Callgrind, which is part of Valgrind.
|
||||
|
||||
From the left, Callgrind is listing the percentage of time within a function and
|
||||
its children (Inclusive), the percentage of time spent within the function
|
||||
itself, excluding child functions (Self), the number of times the function is
|
||||
called, the function name, and the file or module.
|
||||
|
||||
In this example we can see nearly all time is spent under the
|
||||
`Main::iteration()` function, this is the master function in the Godot source
|
||||
code that is called repeatedly, and causes frames to be drawn, physics ticks to
|
||||
In this example, we can see nearly all time is spent under the
|
||||
``Main::iteration()`` function. This is the master function in the Godot source
|
||||
code that is called repeatedly. It causes frames to be drawn, physics ticks to
|
||||
be simulated, and nodes and scripts to be updated. A large proportion of the
|
||||
time is spent in the functions to render a canvas (66%), because this example
|
||||
uses a 2d benchmark. Below this we see that almost 50% of the time is spent
|
||||
outside Godot code in `libglapi`, and `i965_dri` (the graphics driver). This
|
||||
tells us the a large proportion of CPU time is being spent in the graphics
|
||||
driver.
|
||||
uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
|
||||
outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver).
|
||||
This tells us that a large proportion of CPU time is being spent in the
|
||||
graphics driver.
|
||||
|
||||
This is actually an excellent example because in an ideal world, only a very
|
||||
small proportion of time would be spent in the graphics driver, and this is an
|
||||
This is actually an excellent example because, in an ideal world, only a very
|
||||
small proportion of time would be spent in the graphics driver. This is an
|
||||
indication that there is a problem with too much communication and work being
|
||||
done in the graphics API. This profiling lead to the development of 2d batching,
|
||||
which greatly speeds up 2d by reducing bottlenecks in this area.
|
||||
done in the graphics API. This specific profiling led to the development of 2D
|
||||
batching, which greatly speeds up 2D rendering by reducing bottlenecks in this
|
||||
area.
|
||||
|
||||
Manually timing functions
|
||||
=========================
|
||||
|
||||
Another handy technique, especially once you have identified the bottleneck
|
||||
using a profiler, is to manually time the function or area under test. The
|
||||
specifics vary according to language, but in GDScript, you would do the
|
||||
following:
|
||||
using a profiler, is to manually time the function or area under test.
|
||||
The specifics vary depending on the language, but in GDScript, you would do
|
||||
the following:
|
||||
|
||||
::
|
||||
|
||||
var time_start = OS.get_system_time_msecs()
|
||||
|
||||
var time_start = OS.get_ticks_usec()
|
||||
|
||||
# Your function you want to time
|
||||
update_enemies()
|
||||
|
||||
var time_end = OS.get_system_time_msecs()
|
||||
print("Function took: " + str(time_end - time_start))
|
||||
|
||||
|
||||
You may want to consider using other functions for time if another time unit is
|
||||
more suitable, for example :ref:`OS.get_system_time_secs
|
||||
<class_OS_method_get_system_time_secs>` if the function will take many seconds.
|
||||
var time_end = OS.get_ticks_usec()
|
||||
print("update_enemies() took %d microseconds" % (time_end - time_start))
|
||||
|
||||
When manually timing functions, it is usually a good idea to run the function
|
||||
many times (say ``1000`` or more times), instead of just once (unless it is a
|
||||
very slow function). A large part of the reason for this is that timers often
|
||||
have limited accuracy, and CPUs will schedule processes in a haphazard manner,
|
||||
so an average over a series of runs is more accurate than a single measurement.
|
||||
many times (1,000 or more times), instead of just once (unless it is a very slow
|
||||
function). The reason for doing this is that timers often have limited accuracy.
|
||||
Moreover, CPUs will schedule processes in a haphazard manner. Therefore, an
|
||||
average over a series of runs is more accurate than a single measurement.
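
The averaging approach described above could look like the following sketch.
It assumes the Godot 3.x ``OS.get_ticks_usec()`` method used earlier in this
section. ``RUNS`` and the empty ``update_enemies()`` body are placeholders for
illustration only::

    extends Node

    # Number of repetitions to average over (illustrative value).
    const RUNS = 1000

    func _ready():
        var time_start = OS.get_ticks_usec()

        for i in range(RUNS):
            update_enemies()

        var total_usec = OS.get_ticks_usec() - time_start
        # Divide as floats so that sub-microsecond averages are not truncated.
        print("update_enemies() took %.2f microseconds on average" % (float(total_usec) / RUNS))

    func update_enemies():
        # Placeholder for the function you want to time.
        pass

Note that the loop itself adds a small overhead to the measurement, so very
cheap functions will appear slightly slower than they really are.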
|
||||
|
||||
As you attempt to optimize functions, be sure to either repeatedly profile or
|
||||
time them as you go. This will give you crucial feedback as to whether the
|
||||
@@ -121,21 +125,22 @@ optimization is working (or not).
|
||||
Caches
|
||||
======
|
||||
|
||||
Something else to be particularly aware of, especially when comparing timing
|
||||
results of two different versions of a function, is that the results can be
|
||||
highly dependent on whether the data is in the CPU cache or not. CPUs don't load
|
||||
data directly from main memory, because although main memory can be huge (many
|
||||
GBs), it is very slow to access. Instead CPUs load data from a smaller, higher
|
||||
speed bank of memory, called cache. Loading data from cache is super fast, but
|
||||
every time you try and load a memory address that is not stored in cache, the
|
||||
cache must make a trip to main memory and slowly load in some data. This delay
|
||||
can result in the CPU sitting around idle for a long time, and is referred to as
|
||||
a "cache miss".
|
||||
CPU caches are something else to be particularly aware of, especially when
|
||||
comparing timing results of two different versions of a function. The results
|
||||
can be highly dependent on whether the data is in the CPU cache or not. CPUs
|
||||
don't load data directly from the system RAM, even though it's huge in
|
||||
comparison to the CPU cache (several gigabytes instead of a few megabytes). This
|
||||
is because system RAM is very slow to access. Instead, CPUs load data from a
|
||||
smaller, faster bank of memory called cache. Loading data from cache is very
|
||||
fast, but every time you try and load a memory address that is not stored in
|
||||
cache, the cache must make a trip to main memory and slowly load in some data.
|
||||
This delay can result in the CPU sitting around idle for a long time, and is
|
||||
referred to as a "cache miss".
|
||||
|
||||
This means that the first time you run a function, it may run slowly, because
|
||||
the data is not in cache. The second and later times, it may run much faster
|
||||
because the data is in cache. So always use averages when timing, and be aware
|
||||
of the effects of cache.
|
||||
This means that the first time you run a function, it may run slowly because the
|
||||
data is not in the CPU cache. The second and later times, it may run much faster
|
||||
because the data is in the cache. Due to this, always use averages when timing,
|
||||
and be aware of the effects of cache.
|
||||
|
||||
Understanding caching is also crucial to CPU optimization. If you have an
|
||||
algorithm (routine) that loads small bits of data from randomly spread out areas
|
||||
@@ -147,16 +152,15 @@ will be able to work as fast as possible.
|
||||
|
||||
Godot usually takes care of such low-level details for you. For example, the
|
||||
Server APIs make sure data is optimized for caching already for things like
|
||||
rendering and physics. But you should be especially aware of caching when using
|
||||
GDNative.
|
||||
rendering and physics. Still, you should be especially aware of caching when
|
||||
using :ref:`GDNative <toc-tutorials-gdnative>`.
|
||||
|
||||
Languages
|
||||
=========
|
||||
|
||||
Godot supports a number of different languages, and it is worth bearing in mind
|
||||
that there are trade-offs involved - some languages are designed for ease of
|
||||
use, at the cost of speed, and others are faster but more difficult to work
|
||||
with.
|
||||
that there are trade-offs involved. Some languages are designed for ease of use
|
||||
at the cost of speed, and others are faster but more difficult to work with.
|
||||
|
||||
Built-in engine functions run at the same speed regardless of the scripting
|
||||
language you choose. If your project is making a lot of calculations in its own
|
||||
@@ -165,16 +169,20 @@ code, consider moving those calculations to a faster language.
|
||||
GDScript
|
||||
~~~~~~~~
|
||||
|
||||
GDScript is designed to be easy to use and iterate, and is ideal for making many
|
||||
types of games. However, ease of use is considered more important than
|
||||
performance, so if you need to make heavy calculations, consider moving some of
|
||||
your project to one of the other languages.
|
||||
:ref:`GDScript <toc-learn-scripting-gdscript>` is designed to be easy to use and iterate,
|
||||
and is ideal for making many types of games. However, in this language, ease of
|
||||
use is considered more important than performance. If you need to make heavy
|
||||
calculations, consider moving some of your project to one of the other
|
||||
languages.
|
||||
|
||||
C#
|
||||
~~
|
||||
|
||||
C# is popular and has first class support in Godot. It offers a good compromise
|
||||
between speed and ease of use.
|
||||
:ref:`C# <toc-learn-scripting-C#>` is popular and has first-class support in Godot. It
|
||||
offers a good compromise between speed and ease of use. Beware of possible
|
||||
garbage collection pauses and leaks that can occur during gameplay, though. A
|
||||
common approach to work around issues with garbage collection is to use *object
|
||||
pooling*, which is outside the scope of this guide.
|
||||
|
||||
Other languages
|
||||
~~~~~~~~~~~~~~~
|
||||
@@ -186,44 +194,49 @@ Third parties provide support for several other languages, including `Rust
|
||||
C++
|
||||
~~~
|
||||
|
||||
Godot is written in C++. Using C++ will usually result in the fastest code,
|
||||
however, on a practical level, it is the most difficult to deploy to end users'
|
||||
machines on different platforms. Options for using C++ include GDNative, and
|
||||
custom modules.
|
||||
Godot is written in C++. Using C++ will usually result in the fastest code.
|
||||
However, on a practical level, it is the most difficult to deploy to end users'
|
||||
machines on different platforms. Options for using C++ include
|
||||
:ref:`GDNative <toc-tutorials-gdnative>` and
|
||||
:ref:`custom modules <doc_custom_modules_in_c++>`.
|
||||
|
||||
Threads
|
||||
=======
|
||||
|
||||
Consider using threads when making a lot of calculations that can run parallel
|
||||
to one another. Modern CPUs have multiple cores, each one capable of doing a
|
||||
limited amount of work. By spreading work over multiple threads you can move
|
||||
further towards peak CPU efficiency.
|
||||
Consider using threads when making a lot of calculations that can run in
|
||||
parallel to each other. Modern CPUs have multiple cores, each one capable of
|
||||
doing a limited amount of work. By spreading work over multiple threads, you can
|
||||
move further towards peak CPU efficiency.
|
||||
|
||||
The disadvantage of threads is that you have to be incredibly careful. As each
|
||||
CPU core operates independently, they can end up trying to access the same
|
||||
memory at the same time. One thread can be reading from a variable while another
|
||||
is writing. Before you use threads make sure you understand the dangers and how
|
||||
to try and prevent these race conditions.
|
||||
is writing: this is called a *race condition*. Before you use threads, make sure
|
||||
you understand the dangers and how to try and prevent these race conditions.
|
||||
|
||||
For more information on threads see :ref:`doc_using_multiple_threads`.
|
||||
Threads can also make debugging considerably more difficult. The GDScript
|
||||
debugger doesn't support setting up breakpoints in threads yet.
|
||||
|
||||
For more information on threads, see :ref:`doc_using_multiple_threads`.
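
As a minimal sketch of guarding shared data against a race condition, the
example below protects a counter with a ``Mutex``. It assumes the Godot 3.x
``Thread.start(instance, method, userdata)`` signature. The ``counter``
variable, the ``_increment()`` method and the thread count are illustrative
only::

    extends Node

    var counter = 0
    var mutex = Mutex.new()
    var threads = []

    func _ready():
        # Start several threads that all modify the same variable.
        for i in range(4):
            var thread = Thread.new()
            thread.start(self, "_increment", 1000)
            threads.append(thread)

        # Always wait for threads to finish before they are freed.
        for thread in threads:
            thread.wait_to_finish()

        print("Final counter value: ", counter)

    func _increment(amount):
        for i in range(amount):
            # Only one thread at a time may run the code between lock() and unlock().
            mutex.lock()
            counter += 1
            mutex.unlock()

Without the ``lock()``/``unlock()`` pair, increments from different threads
could overwrite each other. Locking on every iteration is slow, so real code
usually does as much work as possible outside the locked section.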
|
||||
|
||||
SceneTree
|
||||
=========
|
||||
|
||||
Although Nodes are an incredibly powerful and versatile concept, be aware that
|
||||
every node has a cost. Built in functions such as `_process()` and
|
||||
every node has a cost. Built-in functions such as ``_process()`` and
|
||||
``_physics_process()`` propagate through the tree. This housekeeping can reduce
|
||||
performance when you have very large numbers of nodes.
|
||||
performance when you have very large numbers of nodes (usually in the thousands).
|
||||
|
||||
Each node is handled individually in the Godot renderer so sometimes a smaller
|
||||
Each node is handled individually in the Godot renderer. Therefore, a smaller
|
||||
number of nodes with more in each can lead to better performance.
|
||||
|
||||
One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
|
||||
get much better performance by removing nodes from the SceneTree, rather than
|
||||
by pausing or hiding them. You don't have to delete a detached node. You
|
||||
can for example, keep a reference to a node, detach it from the scene tree, then
|
||||
reattach it later. This can be very useful for adding and removing areas from a
|
||||
game for example.
|
||||
get much better performance by removing nodes from the SceneTree, rather than by
|
||||
pausing or hiding them. You don't have to delete a detached node. You can, for
|
||||
example, keep a reference to a node, detach it from the scene tree using
|
||||
:ref:`Node.remove_child(node) <class_Node_method_remove_child>`, then reattach
|
||||
it later using :ref:`Node.add_child(node) <class_Node_method_add_child>`.
|
||||
This can be very useful for adding and removing areas from a game, for example.
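
A minimal sketch of this detach/reattach pattern is shown below. The
``detached_area`` variable and both function names are hypothetical, and the
sketch assumes the area is a direct child of the node running the script::

    extends Node

    # Keeps the detached node alive while it is outside the scene tree.
    var detached_area = null

    func detach_area(area):
        # The node stops being processed and rendered, but it is not freed.
        remove_child(area)
        detached_area = area

    func reattach_area():
        if detached_area != null:
            add_child(detached_area)
            detached_area = null

Nodes that are outside the scene tree are not freed automatically. If a
detached node is never reattached, free it manually (for example with
``queue_free()``) to avoid leaking it.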
|
||||
|
||||
You can avoid the SceneTree altogether by using Server APIs. For more
|
||||
information, see :ref:`doc_using_servers`.
|
||||
@@ -231,28 +244,33 @@ information, see :ref:`doc_using_servers`.
|
||||
Physics
|
||||
=======
|
||||
|
||||
In some situations physics can end up becoming a bottleneck, particularly with
|
||||
complex worlds, and large numbers of physics objects.
|
||||
In some situations, physics can end up becoming a bottleneck. This is
|
||||
particularly the case with complex worlds and large numbers of physics objects.
|
||||
|
||||
Some techniques to speed up physics:
|
||||
Here are some techniques to speed up physics:
|
||||
|
||||
* Try using simplified versions of your rendered geometry for physics. Often
|
||||
this won't be noticeable for end users, but can greatly increase performance.
|
||||
* Try removing objects from physics when they are out of view / outside the
|
||||
- Try using simplified versions of your rendered geometry for collision shapes.
|
||||
Often, this won't be noticeable for end users, but can greatly increase
|
||||
performance.
|
||||
- Try removing objects from physics when they are out of view / outside the
|
||||
current area, or reusing physics objects (maybe you allow 8 monsters per area,
|
||||
for example, and reuse these).
|
||||
|
||||
Another crucial aspect to physics is the physics tick rate. In some games you
|
||||
Another crucial aspect to physics is the physics tick rate. In some games, you
|
||||
can greatly reduce the tick rate, and instead of, for example, updating physics
|
||||
60 times per second, you may update it at 20, or even 10 ticks per second. This
|
||||
can greatly reduce the CPU load.
|
||||
60 times per second, you may update them only 30 or even 20 times per second.
|
||||
This can greatly reduce the CPU load.
|
||||
|
||||
The downside of changing physics tick rate is you can get jerky movement or
|
||||
jitter when the physics update rate does not match the frames rendered.
|
||||
jitter when the physics update rate does not match the frames per second
|
||||
rendered. Also, decreasing the physics tick rate will increase input lag.
|
||||
It's recommended to stick to the default physics tick rate (60 Hz) in most games
|
||||
that feature real-time player movement.
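
If you do want to experiment with a lower tick rate, the sketch below changes
it at runtime. It assumes the Godot 3.x ``Engine.iterations_per_second``
property, and the value of ``30`` is only an example. The same value can be
set project-wide with the ``physics/common/physics_fps`` project setting::

    extends Node

    func _ready():
        # Run 30 physics ticks per second instead of the default 60.
        Engine.iterations_per_second = 30

Measure the result with the profiler before and after the change, and check
for the jitter and input lag described above.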
|
||||
|
||||
The solution to this problem is 'fixed timestep interpolation', which involves
|
||||
The solution to jitter is to use *fixed timestep interpolation*, which involves
|
||||
smoothing the rendered positions and rotations over multiple frames to match the
|
||||
physics. You can either implement this yourself or use a third-party addon.
|
||||
Interpolation is a very cheap operation, performance wise, compared to running a
|
||||
physics tick, orders of magnitude faster, so this can be a significant win, as
|
||||
well as reducing jitter.
|
||||
physics. You can either implement this yourself or use a
|
||||
`third-party addon <https://github.com/lawnjelly/smoothing-addon>`__.
|
||||
Performance-wise, interpolation is a very cheap operation compared to running a
|
||||
physics tick. It's orders of magnitude faster, so this can be a significant
|
||||
performance win while also reducing jitter.
|
||||
|
||||
@@ -6,42 +6,42 @@ General optimization tips
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
In an ideal world, computers would run at infinite speed, and the only limit to
|
||||
what we could achieve would be our imagination. In the real world, however, it
|
||||
is all too easy to produce software that will bring even the fastest computer to
|
||||
In an ideal world, computers would run at infinite speed. The only limit to
|
||||
what we could achieve would be our imagination. However, in the real world, it's
|
||||
all too easy to produce software that will bring even the fastest computer to
|
||||
its knees.
|
||||
|
||||
Designing games and other software is thus a compromise between what we would
|
||||
Thus, designing games and other software is a compromise between what we would
|
||||
like to be possible, and what we can realistically achieve while maintaining
|
||||
good performance.
|
||||
|
||||
To achieve the best results, we have two approaches:
|
||||
* Work faster
|
||||
* Work smarter
|
||||
|
||||
- Work faster.
|
||||
- Work smarter.
|
||||
|
||||
And preferably, we will use a blend of the two.
|
||||
|
||||
Smoke and Mirrors
|
||||
Smoke and mirrors
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
Part of working smarter is recognizing that, especially in games, we can often
|
||||
get the player to believe they are in a world that is far more complex,
|
||||
interactive, and graphically exciting than it really is. A good programmer is a
|
||||
magician, and should strive to learn the tricks of the trade, and try to invent
|
||||
new ones.
|
||||
Part of working smarter is recognizing that, in games, we can often get the
|
||||
player to believe they're in a world that is far more complex, interactive, and
|
||||
graphically exciting than it really is. A good programmer is a magician, and
|
||||
should strive to learn the tricks of the trade while trying to invent new ones.
|
||||
|
||||
The nature of slowness
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To the outside observer, performance problems are often lumped together. But in
|
||||
reality, there are several different kinds of performance problem:
|
||||
To the outside observer, performance problems are often lumped together.
|
||||
But in reality, there are several different kinds of performance problems:
|
||||
|
||||
* A slow process that occurs every frame, leading to a continuously low frame
|
||||
rate
|
||||
* An intermittent process that causes 'spikes' of slowness, leading to
|
||||
stalls
|
||||
* A slow process that occurs outside of normal gameplay, for instance, on
|
||||
level load
|
||||
- A slow process that occurs every frame, leading to a continuously low frame
|
||||
rate.
|
||||
- An intermittent process that causes "spikes" of slowness, leading to
|
||||
stalls.
|
||||
- A slow process that occurs outside of normal gameplay, for instance,
|
||||
when loading a level.
|
||||
|
||||
Each of these is annoying to the user, but in different ways.
|
||||
|
||||
@@ -54,30 +54,32 @@ our attempts to speed them up.
|
||||
|
||||
There are several methods of measuring performance, including:
|
||||
|
||||
* Putting a start / stop timer around code of interest
|
||||
* Using the Godot profiler
|
||||
* Using external third party profilers
|
||||
* Using GPU profilers / debuggers
|
||||
* Checking the frame rate (with vsync disabled)
|
||||
- Putting a start/stop timer around code of interest.
|
||||
- Using the Godot profiler.
|
||||
- Using external third-party CPU profilers.
|
||||
- Using GPU profilers/debuggers such as
|
||||
`NVIDIA Nsight Graphics <https://developer.nvidia.com/nsight-graphics>`__
|
||||
or `apitrace <https://apitrace.github.io/>`__.
|
||||
- Checking the frame rate (with V-Sync disabled).
|
||||
|
||||
Be very aware that the relative performance of different areas can vary on
|
||||
different hardware. Often it is a good idea to make timings on more than one
|
||||
device, especially including mobile as well as desktop, if you are targeting
|
||||
mobile.
|
||||
different hardware. It's often a good idea to measure timings on more than one
|
||||
device. This is especially the case if you're targeting mobile devices.
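
As a rough illustration of the first method in the list above, a start/stop timer can be sketched as follows. This is a language-agnostic sketch written in Python rather than Godot-specific code; ``time_call`` and ``code_of_interest`` are hypothetical names chosen for the example.

```python
import time


def time_call(func, *args, repeats=5):
    """Return the shortest wall-clock duration (in seconds) of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()  # start the timer
        func(*args)
        elapsed = time.perf_counter() - start  # stop the timer
        best = min(best, elapsed)
    return best


def code_of_interest(count):
    # Placeholder workload standing in for the code being measured.
    return sum(i * i for i in range(count))


best_seconds = time_call(code_of_interest, 100_000)
print(f"Best of 5 runs: {best_seconds * 1000:.3f} ms")
```

Taking the best of several runs reduces noise from other processes. The absolute numbers only matter relative to other measurements made on the same device.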
|
||||
|
||||
Limitations
|
||||
~~~~~~~~~~~
|
||||
|
||||
CPU Profilers are often the 'go to' method for measuring performance, however
|
||||
CPU profilers are often the go-to method for measuring performance. However,
|
||||
they don't always tell the whole story.
|
||||
|
||||
- Bottlenecks are often on the GPU, `as a result` of instructions given by the
|
||||
CPU
|
||||
- Spikes can occur in the Operating System processes (outside of Godot) `as a
|
||||
result` of instructions used in Godot (for example dynamic memory allocation)
|
||||
- You may not be able to profile e.g. a mobile phone
|
||||
- Bottlenecks are often on the GPU, "as a result" of instructions given by the
|
||||
CPU.
|
||||
- Spikes can occur in the operating system processes (outside of Godot) "as a
|
||||
result" of instructions used in Godot (for example, dynamic memory allocation).
|
||||
- You may not always be able to profile specific devices like a mobile phone
|
||||
due to the initial setup required.
|
||||
- You may have to solve performance problems that occur on hardware you don't
|
||||
have access to
|
||||
have access to.
|
||||
|
||||
As a result of these limitations, you often need to use detective work to find
|
||||
out where bottlenecks are.
|
||||
@@ -92,27 +94,27 @@ binary search.
|
||||
Hypothesis testing
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Say for example you believe that sprites are slowing down your game. You can
|
||||
test this hypothesis for example by:
|
||||
Say, for example, that you believe sprites are slowing down your game.
|
||||
You can test this hypothesis by:
|
||||
|
||||
* Measuring the performance when you add more sprites, or take some away.
|
||||
- Measuring the performance when you add more sprites, or take some away.
|
||||
|
||||
This may lead to a further hypothesis - does the size of the sprite determine
|
||||
This may lead to a further hypothesis: does the size of the sprite determine
|
||||
the performance drop?
|
||||
|
||||
* You can test this by keeping everything the same, but changing the sprite
|
||||
size, and measuring performance
|
||||
- You can test this by keeping everything the same, but changing the sprite
|
||||
size, and measuring performance.
|
||||
|
||||
Binary search
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
Say you know that frames are taking much longer than they should, but you are
|
||||
If you know that frames are taking much longer than they should, but you're
|
||||
not sure where the bottleneck lies, you could begin by commenting out
|
||||
approximately half the routines that occur on a normal frame. Has the
|
||||
performance improved more or less than expected?
|
||||
|
||||
Once you know which of the two halves contains the bottleneck, you can then
|
||||
repeat this process, until you have pinned down the problematic area.
|
||||
Once you know which of the two halves contains the bottleneck, you can
|
||||
repeat this process until you've pinned down the problematic area.
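
The halving procedure can be sketched as below. To keep the example deterministic, it uses made-up per-routine costs instead of real measurements, and it assumes a single routine dominates the frame time. The routine names and the ``find_bottleneck`` helper are hypothetical.

```python
def frame_time(routines, costs_ms):
    """Simulated frame time: the summed cost of the routines left enabled."""
    return sum(costs_ms[name] for name in routines)


def find_bottleneck(routines, costs_ms):
    """Keep the slower half of the routines until only one remains."""
    candidates = list(routines)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first, second = candidates[:middle], candidates[middle:]
        # In a real project, each half would be measured by disabling the other.
        if frame_time(first, costs_ms) >= frame_time(second, costs_ms):
            candidates = first
        else:
            candidates = second
    return candidates[0]


# Assumed costs in milliseconds, for illustration only.
costs_ms = {
    "input": 0.2,
    "ai": 1.0,
    "physics": 2.5,
    "particles": 14.0,
    "ui": 0.6,
    "audio": 0.3,
}
slowest = find_bottleneck(list(costs_ms), costs_ms)
print(slowest)
```

With six routines, the search needs only three rounds of measurement instead of testing each routine separately.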
|
||||
|
||||
Profilers
|
||||
=========
|
||||
@@ -122,17 +124,16 @@ provide results telling you what percentage of time was spent in different
|
||||
functions and areas, and how often functions were called.
|
||||
|
||||
This can be very useful both to identify bottlenecks and to measure the results
|
||||
of your improvements. Sometimes attempts to improve performance can backfire and
|
||||
lead to slower performance, so always use profiling and timing to guide your
|
||||
efforts.
|
||||
of your improvements. Sometimes, attempts to improve performance can backfire
|
||||
and lead to slower performance.
|
||||
**Always use profiling and timing to guide your efforts.**
|
||||
|
||||
For more info about using the profiler within Godot see
|
||||
:ref:`doc_debugger_panel`.
|
||||
For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
|
||||
|
||||
Principles
|
||||
==========
|
||||
|
||||
Donald Knuth:
|
||||
`Donald Knuth <https://en.wikipedia.org/wiki/Donald_Knuth>`__ said:
|
||||
|
||||
*Programmers waste enormous amounts of time thinking about, or worrying
|
||||
about, the speed of noncritical parts of their programs, and these attempts
|
||||
@@ -143,19 +144,19 @@ Donald Knuth:
|
||||
|
||||
The messages are very important:
|
||||
|
||||
* Programmer / Developer time is limited. Instead of blindly trying to speed up
|
||||
all aspects of a program we should concentrate our efforts on the aspects that
|
||||
- Developer time is limited. Instead of blindly trying to speed up
|
||||
all aspects of a program, we should concentrate our efforts on the aspects that
|
||||
really matter.
|
||||
* Efforts at optimization often end up with code that is harder to read and
|
||||
- Efforts at optimization often end up with code that is harder to read and
|
||||
debug than non-optimized code. It is in our interests to limit this to areas
|
||||
that will really benefit.
|
||||
|
||||
Just because we `can` optimize a particular bit of code, it doesn't necessarily
|
||||
mean that we should. Knowing when, and when not to optimize is a great skill to
|
||||
Just because we *can* optimize a particular bit of code, it doesn't necessarily
|
||||
mean that we *should*. Knowing when and when not to optimize is a great skill to
|
||||
develop.
|
||||
|
||||
One misleading aspect of the quote is that people tend to focus on the subquote
|
||||
"premature optimization is the root of all evil". While `premature` optimization
|
||||
*"premature optimization is the root of all evil"*. While *premature* optimization
|
||||
is (by definition) undesirable, performant software is the result of performant
|
||||
design.
|
||||
|
||||
@@ -165,30 +166,30 @@ Performant design
|
||||
The danger with encouraging people to ignore optimization until necessary, is
|
||||
that it conveniently ignores that the most important time to consider
|
||||
performance is at the design stage, before a key has even hit a keyboard. If the
|
||||
design / algorithms of a program are inefficient, then no amount of polishing the
|
||||
details later will make it run fast. It may run `faster`, but it will never run
|
||||
design or algorithms of a program are inefficient, then no amount of polishing the
|
||||
details later will make it run fast. It may run *faster*, but it will never run
|
||||
as fast as a program designed for performance.
|
||||
|
||||
This tends to be far more important in game / graphics programming than in
|
||||
general programming. A performant design, even without low level optimization,
|
||||
will often run many times faster than a mediocre design with low level
|
||||
This tends to be far more important in game or graphics programming than in
|
||||
general programming. A performant design, even without low-level optimization,
|
||||
will often run many times faster than a mediocre design with low-level
|
||||
optimization.
|
||||
|
||||
Incremental design
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Of course, in practice, unless you have prior knowledge, you are unlikely to
|
||||
come up with the best design first time. So you will often make a series of
|
||||
come up with the best design the first time. Instead, you'll often make a series of
|
||||
versions of a particular area of code, each taking a different approach to the
|
||||
problem, until you come to a satisfactory solution. It is important not to spend
|
||||
problem, until you come to a satisfactory solution. It's important not to spend
|
||||
too much time on the details at this stage until you have finalized the overall
|
||||
design, otherwise much of your work will be thrown out.
|
||||
design. Otherwise, much of your work will be thrown out.
|
||||
|
||||
It is difficult to give general guidelines for performant design because this is
|
||||
It's difficult to give general guidelines for performant design because this is
|
||||
so dependent on the problem. One point worth mentioning though, on the CPU
|
||||
side, is that modern CPUs are nearly always limited by memory bandwidth. This
|
||||
has led to a resurgence in data orientated design, which involves designing data
|
||||
structures and algorithms for locality of data and linear access, rather than
|
||||
has led to a resurgence in data-oriented design, which involves designing data
|
||||
structures and algorithms for *cache locality* of data and linear access, rather than
|
||||
jumping around in memory.
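
The difference in data layout can be sketched as follows. This is only an illustration of the two layouts, written in Python under the assumption of a simple particle update; Python itself will not show the cache benefit that a compiled language would, and the ``Particle`` class and both functions are hypothetical.

```python
from array import array


class Particle:
    """One object per particle: each particle's fields live in a separate object."""

    def __init__(self, x, velocity):
        self.x = x
        self.velocity = velocity


def step_objects(particles, delta):
    # Every iteration follows a reference to a separately allocated object.
    for particle in particles:
        particle.x += particle.velocity * delta


def step_arrays(xs, velocities, delta):
    # Both arrays are contiguous and are walked linearly from start to end.
    for i in range(len(xs)):
        xs[i] += velocities[i] * delta


particles = [Particle(float(i), 2.0) for i in range(4)]
xs = array("d", [float(i) for i in range(4)])
velocities = array("d", [2.0] * 4)

step_objects(particles, 0.5)
step_arrays(xs, velocities, 0.5)
```

Both versions compute the same positions. The second keeps each field in one contiguous block, which is the layout data-oriented design favors.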
|
||||
|
||||
The optimization process
|
||||
@@ -196,17 +197,17 @@ The optimization process
|
||||
|
||||
Assuming we have a reasonable design, and taking our lessons from Knuth, our
|
||||
first step in optimization should be to identify the biggest bottlenecks - the
|
||||
slowest functions, the low hanging fruit.
|
||||
slowest functions, the low-hanging fruit.
|
||||
|
||||
Once we have successfully improved the speed of the slowest area, it may no
|
||||
longer be the bottleneck. So we should test / profile again, and find the next
|
||||
Once we've successfully improved the speed of the slowest area, it may no
|
||||
longer be the bottleneck. So we should test/profile again and find the next
|
||||
bottleneck on which to focus.
|
||||
|
||||
The process is thus:
|
||||
|
||||
1. Profile / Identify bottleneck
|
||||
2. Optimize bottleneck
|
||||
3. Return to step 1
|
||||
1. Profile / Identify bottleneck.
|
||||
2. Optimize bottleneck.
|
||||
3. Return to step 1.
|
||||
|
||||
Optimizing bottlenecks
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
@@ -214,18 +215,22 @@ Optimizing bottlenecks
|
||||
Some profilers will even tell you which part of a function (which data accesses,
|
||||
calculations) are slowing things down.
|
||||
|
||||
As with design you should concentrate your efforts first on making sure the
|
||||
As with design, you should concentrate your efforts first on making sure the
|
||||
algorithms and data structures are the best they can be. Data access should be
|
||||
local (to make best use of CPU cache), and it can often be better to use compact
|
||||
storage of data (again, always profile to test results). Often you precalculate
|
||||
heavy computations ahead of time (e.g. at level load, or loading precalculated
|
||||
data files).
|
||||
storage of data (again, always profile to test results). Often, you precalculate
|
||||
heavy computations ahead of time. This can be done by performing the computation
|
||||
when loading a level, by loading a file containing precalculated data or simply
|
||||
by storing the results of complex calculations into a script constant and
|
||||
reading its value.
|
||||
|
||||
Once algorithms and data are good, you can often make small changes in routines
|
||||
which improve performance, things like moving calculations outside of loops.
|
||||
which improve performance. For instance, you can move some calculations outside
|
||||
of loops or transform nested ``for`` loops into non-nested loops.
|
||||
(This should be feasible if you know a 2D array's width or height in advance.)
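
Both changes can be sketched on a small example. The code below is a sketch in Python, assuming a 2D grid stored as a flat list with a known width and height; the function names are hypothetical.

```python
import math


def weighted_sum_nested(grid, width, height, angle):
    total = 0.0
    for y in range(height):
        for x in range(width):
            # math.cos(angle) never changes, but is recomputed on every iteration.
            total += grid[y * width + x] * math.cos(angle)
    return total


def weighted_sum_flat(grid, width, height, angle):
    scale = math.cos(angle)  # calculation moved outside of the loop
    total = 0.0
    # A single loop over all cells replaces the two nested loops.
    for index in range(width * height):
        total += grid[index] * scale
    return total


grid = [float(i) for i in range(12)]
nested_result = weighted_sum_nested(grid, 4, 3, 0.5)
flat_result = weighted_sum_flat(grid, 4, 3, 0.5)
```

The two functions return the same value. Whether the second is measurably faster depends on the language and hardware, so it should still be profiled.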
|
||||
|
||||
Always retest your timing / bottlenecks after making each change. Some changes
|
||||
will increase speed, others may have a negative effect. Sometimes a small
|
||||
Always retest your timing/bottlenecks after making each change. Some changes
|
||||
will increase speed, others may have a negative effect. Sometimes, a small
|
||||
positive effect will be outweighed by the negatives of more complex code, and
|
||||
you may choose to leave out that optimization.
|
||||
|
||||
@@ -235,9 +240,9 @@ Appendix
|
||||
Bottleneck math
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
The proverb "a chain is only as strong as its weakest link" applies directly to
|
||||
The proverb *"a chain is only as strong as its weakest link"* applies directly to
|
||||
performance optimization. If your project is spending 90% of the time in
|
||||
function 'A', then optimizing A can have a massive effect on performance.
|
||||
function ``A``, then optimizing ``A`` can have a massive effect on performance.
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
@@ -247,14 +252,14 @@ function 'A', then optimizing A can have a massive effect on performance.
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 1 ms
|
||||
Everything else: 1ms
|
||||
A: 1 ms
|
||||
Everything else: 1 ms
|
||||
Total frame time: 2 ms
|
||||
|
||||
So in this example improving this bottleneck A by a factor of 9x, decreases
|
||||
overall frame time by 5x, and increases frames per second by 5x.
|
||||
In this example, improving this bottleneck ``A`` by a factor of 9× decreases
|
||||
overall frame time by 5× while increasing frames per second by 5×.
|
||||
|
||||
If however, something else is running slowly and also bottlenecking your
|
||||
However, if something else is running slowly and also bottlenecking your
|
||||
project, then the same improvement can lead to less dramatic gains:
|
||||
|
||||
.. code-block:: none
|
||||
@@ -269,8 +274,8 @@ project, then the same improvement can lead to less dramatic gains:
|
||||
Everything else: 50 ms
|
||||
Total frame time: 51 ms
|
||||
|
||||
So in this example, even though we have hugely optimized functionality A, the
|
||||
actual gain in terms of frame rate is quite small.
|
||||
In this example, even though we have hugely optimized function ``A``,
|
||||
the actual gain in terms of frame rate is quite small.
|
||||
|
||||
In games, things become even more complicated because the CPU and GPU run
|
||||
independently of one another. Your total frame time is determined by the slower
|
||||
@@ -288,5 +293,5 @@ of the two.
|
||||
GPU: 50 ms
|
||||
Total frame time: 50 ms
|
||||
|
||||
In this example, we optimized the CPU hugely again, but the frame time did not
|
||||
improve, because we are GPU-bottlenecked.
|
||||
In this example, we optimized the CPU hugely again, but the frame time didn't
|
||||
improve because we are GPU-bottlenecked.
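
The arithmetic behind these examples can be checked with a few lines of code. This is a sketch in Python with hypothetical helper names. The 9 ms starting cost of ``A`` follows from the 9× improvement described above, while the 9 ms and 1 ms CPU times in the last case are assumed for illustration.

```python
def serial_frame_ms(a_ms, rest_ms):
    """Frame time when function A and everything else run one after the other."""
    return a_ms + rest_ms


def parallel_frame_ms(cpu_ms, gpu_ms):
    """Frame time when the CPU and GPU work independently: the slower one wins."""
    return max(cpu_ms, gpu_ms)


def frames_per_second(frame_ms):
    return 1000.0 / frame_ms


# A goes from 9 ms to 1 ms, everything else takes 1 ms.
big_gain = serial_frame_ms(9, 1) / serial_frame_ms(1, 1)

# Same improvement to A, but everything else takes 50 ms.
small_gain = serial_frame_ms(9, 50) / serial_frame_ms(1, 50)

# CPU time drops from 9 ms to 1 ms (assumed values), GPU stays at 50 ms.
gpu_bound_gain = parallel_frame_ms(9, 50) / parallel_frame_ms(1, 50)

print(big_gain, round(small_gain, 2), gpu_bound_gain)
```

The same 9× improvement to ``A`` yields a 5× gain in the first case, roughly 1.16× in the second, and no gain at all when the GPU is the bottleneck.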
|
||||
|
||||
@@ -1,75 +1,76 @@
|
||||
.. _doc_gpu_optimization:
|
||||
|
||||
GPU Optimizations
|
||||
=================
|
||||
GPU optimization
|
||||
================
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
The demand for new graphics features and progress almost guarantees that you
|
||||
will encounter graphics bottlenecks. Some of these can be CPU side, for instance
|
||||
in calculations inside the Godot engine to prepare objects for rendering.
|
||||
Bottlenecks can also occur on the CPU in the graphics driver, which sorts
|
||||
instructions to pass to the GPU, and in the transfer of these instructions. And
|
||||
finally bottlenecks also occur on the GPU itself.
|
||||
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
|
||||
instance in calculations inside the Godot engine to prepare objects for
|
||||
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
|
||||
sorts instructions to pass to the GPU, and in the transfer of these
|
||||
instructions. And finally, bottlenecks also occur on the GPU itself.
|
||||
|
||||
Where bottlenecks occur in rendering is highly hardware specific. Mobile GPUs in
|
||||
particular may struggle with scenes that run easily on desktop.
|
||||
Where bottlenecks occur in rendering is highly hardware-specific.
|
||||
Mobile GPUs in particular may struggle with scenes that run easily on desktop.
|
||||
|
||||
Understanding and investigating GPU bottlenecks is slightly different to the
|
||||
situation on the CPU, because often you can only change performance indirectly,
|
||||
by changing the instructions you give to the GPU, and it may be more difficult
|
||||
to take measurements. Often the only way of measuring performance is by
|
||||
examining changes in frame rate.
|
||||
situation on the CPU. This is because, often, you can only change performance
|
||||
indirectly by changing the instructions you give to the GPU. Also, it may be
|
||||
more difficult to take measurements. In many cases, the only way of measuring
|
||||
performance is by examining changes in the time spent rendering each frame.
|
||||
|
||||
Drawcalls, state changes, and APIs
|
||||
==================================
|
||||
Draw calls, state changes, and APIs
|
||||
===================================
|
||||
|
||||
.. note:: The following section is not relevant to end-users, but is useful to
|
||||
provide background information that is relevant in later sections.
|
||||
|
||||
Godot sends instructions to the GPU via a graphics API (OpenGL, GLES2, GLES3,
|
||||
Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
|
||||
Vulkan). The communication and driver activity involved can be quite costly,
|
||||
especially in OpenGL. If we can provide these instructions in a way that is
|
||||
preferred by the driver and GPU, we can greatly increase performance.
|
||||
especially in OpenGL and OpenGL ES. If we can provide these instructions in a
|
||||
way that is preferred by the driver and GPU, we can greatly increase
|
||||
performance.
|
||||
|
||||
Nearly every API command in OpenGL requires a certain amount of validation, to
|
||||
Nearly every API command in OpenGL requires a certain amount of validation to
|
||||
make sure the GPU is in the correct state. Even seemingly simple commands can
|
||||
lead to a flurry of behind the scenes housekeeping. Therefore the name of the
|
||||
game is reduce these instructions to a bare minimum, and group together similar
|
||||
objects as much as possible so they can be rendered together, or with the
|
||||
minimum number of these expensive state changes.
|
||||
lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
|
||||
reduce these instructions to a bare minimum and group together similar objects
|
||||
as much as possible so they can be rendered together, or with the minimum number
|
||||
of these expensive state changes.
|
||||
|
||||
2D batching
|
||||
~~~~~~~~~~~
|
||||
|
||||
In 2d, the costs of treating each item individually can be prohibitively high -
|
||||
there can easily be thousands on screen. This is why 2d batching is used -
|
||||
multiple similar items are grouped together and rendered in a batch, via a
|
||||
single drawcall, rather than making a separate drawcall for each item. In
|
||||
addition this means that state changes, material and texture changes can be kept
|
||||
In 2D, the costs of treating each item individually can be prohibitively high -
|
||||
there can easily be thousands of them on the screen. This is why 2D *batching*
|
||||
is used. Multiple similar items are grouped together and rendered in a batch,
|
||||
via a single draw call, rather than making a separate draw call for each item.
|
||||
In addition, this means state changes, material and texture changes can be kept
|
||||
to a minimum.
|
||||
|
||||
For more information on 2D batching see :ref:`doc_batching`.
|
||||
For more information on 2D batching, see :ref:`doc_batching`.
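
The core idea of grouping similar items can be sketched as follows. This is a deliberately simplified model in Python, not Godot's actual batching code: it assumes that items can be joined only when they are adjacent in draw order and share a texture. The item and texture names are made up.

```python
from itertools import groupby


def build_batches(items):
    """Group consecutive items sharing a texture; each group is one draw call."""
    return [
        (texture, [name for name, _ in group])
        for texture, group in groupby(items, key=lambda item: item[1])
    ]


# (item name, texture) pairs in draw order.
items = [
    ("coin_1", "coins.png"),
    ("coin_2", "coins.png"),
    ("player", "hero.png"),
    ("coin_3", "coins.png"),
]
batches = build_batches(items)
print(len(items), "items ->", len(batches), "draw calls")
```

Because draw order is preserved, the third coin cannot join the first two: the player drawn in between breaks the batch.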
|
||||
|
||||
3D batching
|
||||
~~~~~~~~~~~
|
||||
|
||||
In 3d, we still aim to minimize draw calls and state changes, however, it can be
|
||||
more difficult to batch together several objects into a single draw call. 3d
|
||||
In 3D, we still aim to minimize draw calls and state changes. However, it can be
|
||||
more difficult to batch together several objects into a single draw call. 3D
|
||||
meshes tend to comprise hundreds or thousands of triangles, and combining large
|
||||
meshes at runtime is prohibitively expensive. The costs of joining them quickly
|
||||
meshes in real time is prohibitively expensive. The cost of joining them quickly
|
||||
exceeds any benefits as the number of triangles grows per mesh. A much better
|
||||
alternative is to join meshes ahead of time (static meshes in relation to each
|
||||
alternative is to **join meshes ahead of time** (static meshes in relation to each
|
||||
other). This can either be done by artists, or programmatically within Godot.
|
||||
|
||||
There is also a cost to batching together objects in 3d. Several objects
|
||||
rendered as one cannot be individually culled. An entire city that is off screen
|
||||
There is also a cost to batching together objects in 3D. Several objects
|
||||
rendered as one cannot be individually culled. An entire city that is off-screen
|
||||
will still be rendered if it is joined to a single blade of grass that is on
|
||||
screen. So attempting to batch together 3d objects should take account of their
|
||||
location and effect on culling. Despite this, the benefits of joining static
|
||||
objects often outweigh other considerations, especially for large numbers of low
|
||||
poly objects.
|
||||
screen. Thus, you should always take objects' location and culling into account
|
||||
when attempting to batch 3D objects together. Despite this, the benefits of
|
||||
joining static objects often outweigh other considerations, especially for large
|
||||
numbers of distant or low-poly objects.
|
||||
|
||||
For more information on 3D specific optimizations, see
|
||||
:ref:`doc_optimizing_3d_performance`.
|
||||
@@ -80,14 +81,14 @@ Reuse Shaders and Materials
|
||||
The Godot renderer is a little different to what is out there. It's designed to
|
||||
minimize GPU state changes as much as possible. :ref:`SpatialMaterial
|
||||
<class_SpatialMaterial>` does a good job at reusing materials that need similar
|
||||
shaders but, if custom shaders are used, make sure to reuse them as much as
|
||||
shaders. If custom shaders are used, make sure to reuse them as much as
|
||||
possible. Godot's priorities are:
|
||||
|
||||
- **Reusing Materials**: The fewer different materials in the
|
||||
- **Reusing Materials:** The fewer different materials in the
|
||||
scene, the faster the rendering will be. If a scene has a huge number
|
||||
of objects (in the hundreds or thousands) try reusing the materials
|
||||
or in the worst case use atlases.
|
||||
- **Reusing Shaders**: If materials can't be reused, at least try to
|
||||
of objects (in the hundreds or thousands), try reusing the materials.
|
||||
In the worst case, use atlases to decrease the number of texture changes.
|
||||
- **Reusing Shaders:** If materials can't be reused, at least try to
|
||||
reuse shaders (or SpatialMaterials with different parameters but the same
|
||||
configuration).
|
||||
|
||||
@@ -95,54 +96,55 @@ If a scene has, for example, ``20,000`` objects with ``20,000`` different
|
||||
materials each, rendering will be slow. If the same scene has ``20,000``
|
||||
objects, but only uses ``100`` materials, rendering will be much faster.
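
To see why fewer materials help, we can count how often the renderer would have to switch materials while walking a draw list. This is a simplified sketch in Python, assuming that a switch happens whenever two consecutive objects use different materials; it does not describe Godot's internal sorting.

```python
def count_material_changes(draw_list):
    """Count how many times the active material changes while drawing in order."""
    changes = 0
    current = None
    for material in draw_list:
        if material != current:
            changes += 1
            current = material
    return changes


# 20,000 objects cycling through 100 material IDs.
draw_list = [index % 100 for index in range(20_000)]

unsorted_changes = count_material_changes(draw_list)
sorted_changes = count_material_changes(sorted(draw_list))
print(unsorted_changes, sorted_changes)
```

With only 100 materials, sorting the objects by material reduces the number of switches from one per object to one per material. With 20,000 different materials, no ordering could bring the count below 20,000.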
|
||||
|
||||
Pixel cost vs vertex cost
|
||||
=========================
|
||||
Pixel cost versus vertex cost
|
||||
=============================
|
||||
|
||||
You may have heard that the lower the number of polygons in a model, the faster
|
||||
it will be rendered. This is *really* relative and depends on many factors.
|
||||
|
||||
On a modern PC and console, vertex cost is low. GPUs originally only rendered
|
||||
triangles, so every frame all the vertices:
|
||||
triangles. This meant that every frame:
|
||||
|
||||
1. Had to be transformed by the CPU (including clipping).
|
||||
1. All vertices had to be transformed by the CPU (including clipping).
|
||||
2. All vertices had to be sent to the GPU memory from the main RAM.
|
||||
|
||||
2. Had to be sent to the GPU memory from the main RAM.
|
||||
Nowadays, all this is handled inside the GPU, greatly increasing performance.
|
||||
3D artists usually have the wrong feeling about polycount performance because 3D
|
||||
DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to
|
||||
be edited, reducing actual performance. Game engines rely on the GPU more, so
|
||||
they can render many triangles much more efficiently.
|
||||
|
||||
Now all this is handled inside the GPU, so the performance is much higher. 3D
|
||||
artists usually have the wrong feeling about polycount performance because 3D
|
||||
DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory in order
|
||||
for it to be edited, reducing actual performance. Game engines rely on the GPU
|
||||
more so they can render many triangles much more efficiently.
|
||||
|
||||
On mobile devices, the story is different. PC and Console GPUs are
|
||||
On mobile devices, the story is different. PC and console GPUs are
|
||||
brute-force monsters that can pull as much electricity as they need from
|
||||
the power grid. Mobile GPUs are limited to a tiny battery, so they need
|
||||
to be a lot more power efficient.
|
||||
|
||||
To be more efficient, mobile GPUs attempt to avoid *overdraw*. This means, the
|
||||
same pixel on the screen being rendered more than once. Imagine a town with
|
||||
several buildings, GPUs don't know what is visible and what is hidden until they
|
||||
draw it. A house might be drawn and then another house in front of it (rendering
|
||||
happened twice for the same pixel!). PC GPUs normally don't care much about this
|
||||
and just throw more pixel processors to the hardware to increase performance
|
||||
(but this also increases power consumption).
|
||||
To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
|
||||
when the same pixel on the screen is being rendered more than once. Imagine a
|
||||
town with several buildings. GPUs don't know what is visible and what is hidden
|
||||
until they draw it. For example, a house might be drawn and then another house
|
||||
in front of it (which means rendering happened twice for the same pixel). PC
|
||||
GPUs normally don't care much about this and just throw more pixel processors at
|
||||
the hardware to increase performance (which also increases power consumption).
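
The two-house example can be modeled with a toy depth buffer. The sketch below is written in Python and assumes that a fragment is only shaded when it passes a depth test, which is a simplification of real hardware; the pixel ranges and depths are made up.

```python
def shaded_fragments(surfaces, pixel_count):
    """Count shaded fragments for surfaces drawn in the given order.

    Each surface is (depth, covered_pixels); a smaller depth is closer.
    """
    depth_buffer = [float("inf")] * pixel_count
    shaded = 0
    for depth, pixels in surfaces:
        for pixel in pixels:
            if depth < depth_buffer[pixel]:  # fragment is in front of what is stored
                depth_buffer[pixel] = depth
                shaded += 1
    return shaded


far_house = (10.0, range(0, 8))    # covers pixels 0 to 7
near_house = (5.0, range(4, 12))   # covers pixels 4 to 11, in front

back_to_front = shaded_fragments([far_house, near_house], 16)
front_to_back = shaded_fragments([near_house, far_house], 16)
print(back_to_front, front_to_back)
```

Only 12 pixels are visible in the final image. Drawing the far house first shades 16 fragments, so 4 pixels were rendered twice.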
|
||||
|
||||
Using more power is not an option on mobile so mobile devices use a technique
|
||||
called "Tile Based Rendering" which divides the screen into a grid. Each cell
|
||||
called *tile-based rendering* which divides the screen into a grid. Each cell
|
||||
keeps the list of triangles drawn to it and sorts them by depth to minimize
|
||||
*overdraw*. This technique improves performance and reduces power consumption,
|
||||
but takes a toll on vertex performance. As a result, fewer vertices and
|
||||
triangles can be processed for drawing.
|
||||
|
||||
Additionally, Tile Based Rendering struggles when there are small objects with a
|
||||
Additionally, tile-based rendering struggles when there are small objects with a
|
||||
lot of geometry within a small portion of the screen. This forces mobile GPUs to
|
||||
put a lot of strain on a single screen tile which considerably decreases
|
||||
performance as all the other cells must wait for it to complete in order to
|
||||
display the frame.
|
||||
put a lot of strain on a single screen tile, which considerably decreases
|
||||
performance as all the other cells must wait for it to complete before
|
||||
displaying the frame.
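
The uneven load can be illustrated by binning triangle bounding boxes into screen tiles. This is a rough sketch in Python, assuming a 64×64 pixel screen split into 16×16 pixel tiles; real tile-based GPUs are considerably more sophisticated, and ``bin_triangles`` is a hypothetical helper.

```python
def bin_triangles(bounding_boxes, tile_size, tiles_x, tiles_y):
    """Return {(tile_x, tile_y): number of triangles overlapping that tile}."""
    counts = {}
    for min_x, min_y, max_x, max_y in bounding_boxes:
        first_x = max(0, int(min_x) // tile_size)
        last_x = min(tiles_x - 1, int(max_x) // tile_size)
        first_y = max(0, int(min_y) // tile_size)
        last_y = min(tiles_y - 1, int(max_y) // tile_size)
        for tile_y in range(first_y, last_y + 1):
            for tile_x in range(first_x, last_x + 1):
                key = (tile_x, tile_y)
                counts[key] = counts.get(key, 0) + 1
    return counts


# One large triangle covering the screen, plus a tiny, very detailed object.
background = [(0, 0, 63, 63)]
detailed_object = [(20, 20, 22, 22)] * 500

counts = bin_triangles(background + detailed_object, 16, 4, 4)
busiest_tile = max(counts, key=counts.get)
print(busiest_tile, counts[busiest_tile])
```

One tile ends up with 501 triangles while the other 15 hold a single triangle each, so the frame has to wait for that one tile.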
|
||||
|
||||
In summary, do not worry about vertex count on mobile, but avoid concentration
|
||||
of vertices in small parts of the screen. If a character, NPC, vehicle, etc. is
|
||||
far away (so it looks tiny), use a smaller level of detail (LOD) model.
|
||||
To summarize, don't worry about vertex count on mobile, but
|
||||
**avoid concentration of vertices in small parts of the screen**.
|
||||
If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
|
||||
a smaller level of detail (LOD) model. Even on desktop GPUs, it's preferable to
|
||||
avoid having triangles smaller than the size of a pixel on screen.
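
Choosing a model by distance can be sketched as below. This is a minimal sketch in Python; the distance thresholds are arbitrary values chosen for the example and ``pick_lod`` is a hypothetical helper, not a Godot function.

```python
def pick_lod(distance, thresholds):
    """Return the LOD index for a distance (0 is the most detailed model).

    thresholds lists, in ascending order, the distances at which the next
    lower-detail model takes over.
    """
    for index, limit in enumerate(thresholds):
        if distance < limit:
            return index
    return len(thresholds)


thresholds = [10.0, 30.0, 80.0]  # assumed switch distances
for distance in (5.0, 50.0, 200.0):
    print(distance, "-> LOD", pick_lod(distance, thresholds))
```

Suitable thresholds depend on the size of the object on screen, so they are usually tuned per model.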
|
||||
|
||||
Pay attention to the additional vertex processing required when using:
|
||||
|
||||
@@ -150,47 +152,53 @@ Pay attention to the additional vertex processing required when using:
|
||||
- Morphs (shape keys)
|
||||
- Vertex-lit objects (common on mobile)
|
||||
|
||||
Pixel / fragment shaders - fill rate
|
||||
Pixel/fragment shaders and fill rate
|
||||
====================================
|
||||
|
||||
In contrast to vertex processing, the costs of fragment shading has increased
|
||||
dramatically over the years. Screen resolutions have increased (the area of a 4K
|
||||
screen is ``8,294,400`` pixels, versus ``307,200`` for an old ``640x480`` VGA
|
||||
In contrast to vertex processing, the costs of fragment (per-pixel) shading have
|
||||
increased dramatically over the years. Screen resolutions have increased (the
|
||||
area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
|
||||
screen, that is 27× the area), but also the complexity of fragment shaders has
|
||||
exploded. Physically based rendering requires complex calculations for each
|
||||
exploded. Physically-based rendering requires complex calculations for each
|
||||
fragment.
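
The pixel counts quoted above can be verified directly. The short Python calculation below assumes that "4K" means a 3840×2160 screen.

```python
uhd_pixels = 3840 * 2160  # 4K UHD resolution (assumed)
vga_pixels = 640 * 480    # VGA resolution

ratio = uhd_pixels / vga_pixels
print(uhd_pixels, vga_pixels, ratio)
```

Every one of those pixels runs the fragment shader at least once per frame, which is why shader complexity and resolution multiply each other's cost.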
|
||||
|
||||
You can test whether a project is fill rate limited quite easily. Turn off vsync
|
||||
to prevent capping the frames per second, then compare the frames per second
|
||||
when running with a large window, to running with a postage stamp sized window
|
||||
(you may also benefit from similarly reducing your shadow map size if using
|
||||
shadows). Usually you will find the fps increases quite a bit using a small
|
||||
window, which indicates you are to some extent fill rate limited. If on the
|
||||
other hand there is little to no increase in fps, then your bottleneck lies
|
||||
You can test whether a project is fill rate-limited quite easily. Turn off
|
||||
V-Sync to prevent capping the frames per second, then compare the frames per
|
||||
second when running with a large window, to running with a very small window.
|
||||
You may also benefit from similarly reducing your shadow map size if using
|
||||
shadows. Usually, you will find the FPS increases quite a bit using a small
|
||||
window, which indicates you are to some extent fill rate-limited. On the other
|
||||
hand, if there is little to no increase in FPS, then your bottleneck lies
|
||||
elsewhere.
|
||||
|
||||
You can increase performance in a fill rate limited project by reducing the
|
||||
You can increase performance in a fill rate-limited project by reducing the
|
||||
amount of work the GPU has to do. You can do this by simplifying the shader
|
||||
(perhaps turn off expensive options if you are using a :ref:`SpatialMaterial
|
||||
<class_SpatialMaterial>`), or reducing the number and size of textures used.
|
||||
|
||||
Consider shipping simpler shaders for mobile.
|
||||
**When targeting mobile devices, consider using the simplest possible shaders you
|
||||
can reasonably afford to use.**
|
||||
|
||||
Reading textures
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
The other factor in fragment shaders is the cost of reading textures. Reading
|
||||
textures is an expensive operation (especially reading from several in a single
|
||||
fragment shader), and also consider the filtering may add expense to this
|
||||
(trilinear filtering between mipmaps, and averaging). Reading textures is also
|
||||
expensive in power terms, which is a big issue on mobiles.
|
||||
textures is an expensive operation, especially when reading from several
|
||||
textures in a single fragment shader. Also, consider that filtering may slow it
|
||||
down further (trilinear filtering between mipmaps, and averaging). Reading
|
||||
textures is also expensive in terms of power usage, which is a big issue on
|
||||
mobiles.
|
||||
|
||||
**If you use third-party shaders or write your own shaders, try to use
|
||||
algorithms that require as few texture reads as possible.**
|
||||
|
||||
Texture compression
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Godot compresses textures of 3D models when imported (VRAM compression) by
|
||||
default. Video RAM compression is not as efficient in size as PNG or JPG when
|
||||
stored, but increases performance enormously when drawing.
|
||||
By default, Godot compresses textures of 3D models when imported using video RAM
|
||||
(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
|
||||
JPG when stored, but increases performance enormously when drawing large enough
|
||||
textures.
|
||||
|
||||
This is because the main goal of texture compression is bandwidth reduction
|
||||
between memory and the GPU.
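
To make the bandwidth argument concrete, the following engine-agnostic Python sketch compares the size of one uncompressed texture with the same texture in a 4 bits-per-pixel block format such as S3TC DXT1. The helper name ``texture_bytes`` and the 2048×2048 size are illustrative assumptions, not part of Godot's API, and mipmaps are ignored for simplicity.

```python
def texture_bytes(width, height, bits_per_pixel):
    """Return the size in bytes of one mip level at the given bit depth."""
    return width * height * bits_per_pixel // 8


# Uncompressed RGBA8 stores 32 bits per pixel.
uncompressed = texture_bytes(2048, 2048, 32)
# S3TC DXT1 stores each 4x4 pixel block in 8 bytes, i.e. 4 bits per pixel.
compressed = texture_bytes(2048, 2048, 4)

print(uncompressed // 1024**2, "MiB uncompressed")
print(compressed // 1024**2, "MiB at 4 bits per pixel")
print("ratio:", uncompressed // compressed)
```

Every texel the GPU samples has to be fetched from memory, so a smaller in-memory representation means less data moved per frame.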
|
||||
@@ -203,61 +211,72 @@ more noticeable.
|
||||
As a warning, most Android devices do not support texture compression of
|
||||
textures with transparency (only opaque), so keep this in mind.
|
||||
|
||||
Post processing / shadows
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
.. note::
|
||||
|
||||
Post processing effects and shadows can also be expensive in terms of fragment
|
||||
Even in 3D, "pixel art" textures should have VRAM compression disabled as it
|
||||
will negatively affect their appearance, without improving performance
|
||||
significantly due to their low resolution.
|
||||
|
||||
|
||||
Post-processing and shadows
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Post-processing effects and shadows can also be expensive in terms of fragment
|
||||
shading activity. Always test the impact of these on different hardware.
|
||||
|
||||
Reducing the size of shadow maps can increase performance, both in terms of
|
||||
writing, and reading the maps.
|
||||
**Reducing the size of shadowmaps can increase performance**, both in terms of
|
||||
writing and reading the shadowmaps. On top of that, the best way to improve
|
||||
performance of shadows is to turn shadows off for as many lights and objects as
|
||||
possible. Smaller or distant OmniLights/SpotLights can often have their shadows
|
||||
disabled with only a small visual impact.
|
||||
|
||||
Transparency / blending
|
||||
=======================
|
||||
Transparency and blending
|
||||
=========================
|
||||
|
||||
Transparent items present particular problems for rendering efficiency. Opaque
|
||||
items (especially in 3d) can be essentially rendered in any order and the
|
||||
Transparent objects present particular problems for rendering efficiency. Opaque
|
||||
objects (especially in 3D) can be essentially rendered in any order and the
|
||||
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
|
||||
blended objects are different - in most cases they cannot rely on the Z-buffer
|
||||
and must be rendered in "painter's order" (i.e. from back to front) in order to
|
||||
look correct.
|
||||
blended objects are different. In most cases, they cannot rely on the Z-buffer
|
||||
and must be rendered in "painter's order" (i.e. from back to front) to look
|
||||
correct.
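
As a rough illustration of painter's order, the Python sketch below separates opaque from transparent objects and sorts only the transparent ones from back to front. The ``Drawable`` class and its fields are hypothetical and only stand in for whatever the renderer tracks internally; this is not how Godot implements its sorting.

```python
from dataclasses import dataclass


@dataclass
class Drawable:
    name: str
    depth: float  # Distance from the camera (larger means further away).
    transparent: bool


def draw_order(drawables):
    """Return opaque objects first, then transparent ones back to front."""
    # Opaque objects can keep any order: the Z-buffer resolves visibility.
    opaque = [d for d in drawables if not d.transparent]
    # Transparent objects must be blended over what is behind them,
    # so the furthest one is drawn first.
    transparent = sorted(
        (d for d in drawables if d.transparent),
        key=lambda d: d.depth,
        reverse=True,
    )
    return opaque + transparent


scene = [
    Drawable("glass_near", 2.0, True),
    Drawable("wall", 5.0, False),
    Drawable("glass_far", 8.0, True),
]
order = [d.name for d in draw_order(scene)]
print(order)
```

Because the transparent list has to be ordered by depth, it cannot also be grouped by material, which is one reason transparent objects are more costly to draw.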
|
||||
|
||||
The transparent items are also particularly bad for fill rate, because every
|
||||
item has to be drawn, even if later transparent items will be drawn on top.
|
||||
Transparent objects are also particularly bad for fill rate, because every item
|
||||
has to be drawn even if other transparent objects will be drawn on top
|
||||
later on.
|
||||
|
||||
Opaque items don't have to do this. They can usually take advantage of the
|
||||
Opaque objects don't have to do this. They can usually take advantage of the
|
||||
Z-buffer by writing to the Z-buffer only first, then only performing the
|
||||
fragment shader on the 'winning' fragment, the item that is at the front at a
|
||||
fragment shader on the "winning" fragment, the object that is at the front at a
|
||||
particular pixel.
|
||||
|
||||
Transparency is particularly expensive where multiple transparent items overlap.
|
||||
It is usually better to use as small a transparent area as possible in order to
|
||||
Transparency is particularly expensive where multiple transparent objects
|
||||
overlap. It is usually better to use transparent areas as small as possible to
|
||||
minimize these fill rate requirements, especially on mobile, where fill rate is
|
||||
very expensive. Indeed, in many situations, rendering more complex opaque
|
||||
geometry can end up being faster than using transparency to "cheat".
|
||||
|
||||
Multi-Platform Advice
|
||||
Multi-platform advice
|
||||
=====================
|
||||
|
||||
If you are aiming to release on multiple platforms, test `early` and test
|
||||
`often` on all your platforms, especially mobile. Developing a game on desktop
|
||||
but attempting to port to mobile at the last minute is a recipe for disaster.
|
||||
If you are aiming to release on multiple platforms, test *early* and test
|
||||
*often* on all your platforms, especially mobile. Developing a game on desktop
|
||||
but attempting to port it to mobile at the last minute is a recipe for disaster.
|
||||
|
||||
In general you should design your game for the lowest common denominator, then
|
||||
In general, you should design your game for the lowest common denominator, then
|
||||
add optional enhancements for more powerful platforms. For example, you may want
|
||||
to use the GLES2 backend for both desktop and mobile platforms where you target
|
||||
both.
|
||||
|
||||
Mobile / tile renderers
|
||||
=======================
|
||||
Mobile/tiled renderers
|
||||
======================
|
||||
|
||||
GPUs on mobile devices work in dramatically different ways from GPUs on desktop.
|
||||
Most mobile devices use tile renderers. Tile renderers split up the screen into
|
||||
regular sized tiles that fit into super fast cache memory, and reduce the reads
|
||||
and writes to main memory.
|
||||
As described above, GPUs on mobile devices work in dramatically different ways
|
||||
from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
|
||||
split up the screen into regular-sized tiles that fit into super fast cache
|
||||
memory, which reduces the number of read/write operations to the main memory.
|
||||
|
||||
There are some downsides though, it can make certain techniques much more
|
||||
complicated and expensive to perform. Tiles that rely on the results of
|
||||
rendering in different tiles or on the results of earlier operations being
|
||||
There are some downsides though. Tiled rendering can make certain techniques
|
||||
much more complicated and expensive to perform. Tiles that rely on the results
|
||||
of rendering in different tiles or on the results of earlier operations being
|
||||
preserved can be very slow. Be very careful to test the performance of shaders,
|
||||
viewport textures and post-processing.
|
||||
|
||||
@@ -4,33 +4,33 @@ Optimization
|
||||
Introduction
|
||||
------------
|
||||
|
||||
Godot follows a balanced performance philosophy. In the performance world, there
|
||||
are always trade-offs, which consist of trading speed for usability and
|
||||
flexibility. Some practical examples of this are:
|
||||
Godot follows a balanced performance philosophy. In the performance world,
|
||||
there are always trade-offs, which consist of trading speed for usability
|
||||
and flexibility. Some practical examples of this are:
|
||||
|
||||
- Rendering objects efficiently in high amounts is easy, but when a
|
||||
large scene must be rendered, it can become inefficient. To solve this,
|
||||
visibility computation must be added to the rendering, which makes rendering
|
||||
less efficient, but, at the same time, fewer objects are rendered, so
|
||||
efficiency overall improves.
|
||||
visibility computation must be added to the rendering. This makes rendering
|
||||
less efficient, but at the same time, fewer objects are rendered.
|
||||
Therefore, the overall rendering efficiency is improved.
|
||||
|
||||
- Configuring the properties of every material for every object that
|
||||
needs to be rendered is also slow. To solve this, objects are sorted by
|
||||
material to reduce the costs, but at the same time sorting has a cost.
|
||||
material to reduce the costs. At the same time, sorting has a cost.
|
||||
|
||||
- In 3D physics a similar situation happens. The best algorithms to
|
||||
- In 3D physics, a similar situation happens. The best algorithms to
|
||||
handle large amounts of physics objects (such as SAP) are slow at
|
||||
insertion/removal of objects and ray-casting. Algorithms that allow faster
|
||||
insertion and removal, as well as ray-casting, will not be able to handle as
|
||||
insertion/removal of objects and raycasting. Algorithms that allow faster
|
||||
insertion and removal, as well as raycasting, will not be able to handle as
|
||||
many active objects.
|
||||
|
||||
And there are many more examples of this! Game engines strive to be general
|
||||
purpose in nature, so balanced algorithms are always favored over algorithms
|
||||
that might be fast in some situations and slow in others or algorithms that are
|
||||
fast but make usability more difficult.
|
||||
And there are many more examples of this! Game engines strive to be general-purpose
|
||||
in nature. Balanced algorithms are always favored over algorithms
|
||||
that might be fast in some situations and slow in others, or algorithms that are
|
||||
fast but are more difficult to use.
|
||||
|
||||
Godot is not an exception and, while it is designed to have backends swappable
|
||||
for different algorithms, the default ones prioritize balance and flexibility
|
||||
Godot is not an exception to this. While it is designed to have backends swappable
|
||||
for different algorithms, the default backends prioritize balance and flexibility
|
||||
over performance.
|
||||
|
||||
With this clear, the aim of this tutorial section is to explain how to get the
|
||||
|
||||
@@ -22,34 +22,42 @@ in the street you are in, as well as the sky and a few birds flying overhead. As
|
||||
far as a naive renderer is concerned however, you can still see the entire town.
|
||||
It won't just render the buildings in front of you, it will render the street
|
||||
behind that, with the people on that street, the buildings behind that. You
|
||||
quickly end up in situations where you are attempting to render 10x, or 100x
|
||||
more than what is visible.
|
||||
quickly end up in situations where you are attempting to render 10× or 100× more
|
||||
than what is visible.
|
||||
|
||||
Things aren't quite as bad as they seem, because the Z-buffer usually allows the
|
||||
GPU to only fully shade the objects that are at the front. However, unneeded
|
||||
objects are still reducing performance.
|
||||
GPU to only fully shade the objects that are at the front. This is called *depth
|
||||
prepass* and is enabled by default in Godot when using the GLES3 renderer.
|
||||
However, unneeded objects are still reducing performance.
|
||||
|
||||
One way we can potentially reduce the amount to be rendered is to take advantage
|
||||
of occlusion. As of version 3.2.2 there is no built in support for occlusion in
|
||||
Godot, however with careful design you can still get many of the advantages.
|
||||
of occlusion. As of Godot 3.2.2, there is no built-in support for occlusion in
|
||||
Godot. However, with careful design you can still get many of the advantages.
|
||||
|
||||
For instance in our city street scenario, you may be able to work out in advance
|
||||
For instance, in our city street scenario, you may be able to work out in advance
|
||||
that you can only see two other streets, ``B`` and ``C``, from street ``A``.
|
||||
Streets ``D`` to ``Z`` are hidden. In order to take advantage of occlusion, all
|
||||
you have to do is work out when your viewer is in street ``A`` (perhaps using
|
||||
Godot Areas), then you can hide the other streets.
|
||||
|
||||
This is a manual version of what is known as a 'potentially visible set'. It is
|
||||
This is a manual version of what is known as a "potentially visible set". It is
|
||||
a very powerful technique for speeding up rendering. You can also use it to
|
||||
restrict physics or AI to the local area, and speed these up as well as
|
||||
rendering.
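
The street example can be sketched as a hand-written lookup table. The Python below is an engine-agnostic illustration: the table contents and the function name ``streets_to_show`` are assumptions made for this example. In a real project, the result would be used to show or hide the corresponding nodes.

```python
# Hypothetical table worked out in advance: for each street the viewer can
# stand in, the set of streets that can be seen from there.
VISIBLE_FROM = {
    "A": {"A", "B", "C"},
    "B": {"A", "B", "D"},
}


def streets_to_show(current_street, all_streets):
    """Map every street name to True (show) or False (hide)."""
    # If the current street has no entry, fall back to showing everything,
    # which is slow but never hides something that should be visible.
    visible = VISIBLE_FROM.get(current_street, set(all_streets))
    return {street: street in visible for street in all_streets}


all_streets = ["A", "B", "C", "D", "E"]
print(streets_to_show("A", all_streets))
```

The same table can be consulted to decide which streets need their physics or AI updated.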
|
||||
|
||||
.. note::
|
||||
|
||||
In some cases, you may have to adapt your level design to add more occlusion
|
||||
opportunities. For example, you may have to add more walls to prevent the player
|
||||
from seeing too far away, which would decrease performance due to the lost
|
||||
opportunities for occlusion culling.
|
||||
|
||||
Other occlusion techniques
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
There are other occlusion techniques such as portals, automatic PVS, and raster
|
||||
based occlusion culling. Some of these may be available through addons, and may
|
||||
be available in core Godot in the future.
|
||||
There are other occlusion techniques such as portals, automatic PVS, and
|
||||
raster-based occlusion culling. Some of these may be available through add-ons
|
||||
and may be available in core Godot in the future.
|
||||
|
||||
Transparent objects
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
@@ -57,9 +65,10 @@ Transparent objects
|
||||
Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
|
||||
<class_Shader>` to improve performance. This, however, can not be done with
|
||||
transparent objects. Transparent objects are rendered from back to front to make
|
||||
blending with what is behind work. As a result, try to use as few transparent
|
||||
objects as possible. If an object has a small section with transparency, try to
|
||||
make that section a separate surface with its own Material.
|
||||
blending with what is behind work. As a result,
|
||||
**try to use as few transparent objects as possible**. If an object has a
|
||||
small section with transparency, try to make that section a separate surface
|
||||
with its own material.
|
||||
|
||||
For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
|
||||
doc.
|
||||
@@ -67,12 +76,12 @@ doc.
|
||||
Level of detail (LOD)
|
||||
=====================
|
||||
|
||||
In some situations, particularly at a distance, it can be a good idea to replace
|
||||
complex geometry with simpler versions - the end user will probably not be able
|
||||
to see much difference. Consider looking at a large number of trees in the far
|
||||
distance. There are several strategies for replacing models at varying distance.
|
||||
You could use lower poly models, or use transparency to simulate more complex
|
||||
geometry.
|
||||
In some situations, particularly at a distance, it can be a good idea to
|
||||
**replace complex geometry with simpler versions**. The end user will probably
|
||||
not be able to see much difference. Consider looking at a large number of trees
|
||||
in the far distance. There are several strategies for replacing models at
|
||||
varying distance. You could use lower poly models, or use transparency to
|
||||
simulate more complex geometry.
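
One simple strategy is to pick a model based on its distance to the camera. The Python sketch below shows the selection logic only. The threshold values and the name ``pick_lod`` are illustrative assumptions and would need tuning per project.

```python
# Hypothetical switch distances in world units: below 20 use the detailed
# model (level 0), below 60 use the medium model (level 1), and beyond that
# use the simplest one (level 2).
LOD_DISTANCES = [20.0, 60.0]


def pick_lod(distance, thresholds=LOD_DISTANCES):
    """Return the index of the LOD level to use at the given distance."""
    for level, limit in enumerate(thresholds):
        if distance < limit:
            return level
    # Further away than every threshold: use the lowest-detail level.
    return len(thresholds)


for distance in (5.0, 35.0, 100.0):
    print(distance, "->", pick_lod(distance))
```

Running the check every few frames rather than every frame is usually enough, as the distance to far-away objects changes slowly.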
|
||||
|
||||
Billboards and imposters
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
@@ -112,19 +121,19 @@ Lighting objects is one of the most costly rendering operations. Realtime
|
||||
lighting, shadows (especially multiple lights), and GI are especially expensive.
|
||||
They may simply be too much for lower power mobile devices to handle.
|
||||
|
||||
Consider using baked lighting, especially for mobile. This can look fantastic,
|
||||
but has the downside that it will not be dynamic. Sometimes this is a trade off
|
||||
**Consider using baked lighting**, especially for mobile. This can look fantastic,
|
||||
but has the downside that it will not be dynamic. Sometimes, this is a trade-off
|
||||
worth making.
|
||||
|
||||
In general, if several lights need to affect a scene, it's best to use
|
||||
:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
|
||||
indirect light bounces.
|
||||
|
||||
Animation / Skinning
|
||||
====================
|
||||
Animation and skinning
|
||||
======================
|
||||
|
||||
Animation and particularly vertex animation such as skinning and morphing can be
|
||||
very expensive on some platforms. You may need to lower poly count considerably
|
||||
Animation and vertex animation such as skinning and morphing can be very
|
||||
expensive on some platforms. You may need to lower the polycount considerably
|
||||
for animated models or limit the number of them on screen at any one time.
|
||||
|
||||
Large worlds
|
||||
@@ -137,7 +146,7 @@ Large worlds may need to be built in tiles that can be loaded on demand as you
|
||||
move around the world. This can prevent memory use from getting out of hand, and
|
||||
also limit the processing needed to the local area.
|
||||
|
||||
There may be glitches due to floating point error in large worlds. You may be
|
||||
able to use techniques such as orienting the world around the player (rather
|
||||
than the other way around), or shifting the origin periodically to keep things
|
||||
centred around (0, 0, 0).
|
||||
There may also be rendering and physics glitches due to floating point error in
|
||||
large worlds. You may be able to use techniques such as orienting the world
|
||||
around the player (rather than the other way around), or shifting the origin
|
||||
periodically to keep things centred around ``Vector3(0, 0, 0)``.
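
The Python sketch below illustrates why shifting the origin helps, assuming positions are stored as 32-bit floats. The helpers ``to_float32`` and ``shift_origin`` are hypothetical names for this example and are not engine functions.

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def shift_origin(player, objects):
    """Move the player to (0, 0, 0) and offset every object by the same amount."""
    px, py, pz = player
    shifted = [(x - px, y - py, z - pz) for (x, y, z) in objects]
    return (0.0, 0.0, 0.0), shifted


# A 1 mm step is lost entirely 100 km from the origin...
far = to_float32(100000.0 + 0.001)
# ...but is represented accurately close to the origin.
near = to_float32(1.0 + 0.001)
print(far, near)

player, objects = shift_origin((100000.0, 0.0, 0.0), [(100002.0, 0.0, 5.0)])
print(player, objects)
```

After the shift, the object sits 2 units from the player at small coordinates, where 32-bit floats have far more precision available.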
|
||||
|
||||