Overhaul optimization tutorials
Co-authored-by: lawnjelly <lawnjelly@gmail.com>
@@ -7,7 +7,6 @@
|
||||
|
||||
introduction_to_3d
|
||||
using_transforms
|
||||
optimizing_3d_performance
|
||||
3d_rendering_limitations
|
||||
spatial_material
|
||||
lights_and_shadows
|
||||
|
||||
@@ -1,192 +0,0 @@
|
||||
.. meta::
|
||||
:keywords: optimization
|
||||
|
||||
.. _doc_optimizing_3d_performance:
|
||||
|
||||
Optimizing 3D performance
|
||||
=========================
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Godot follows a balanced performance philosophy. In the performance world,
|
||||
there are always trade-offs, which consist of trading speed for
|
||||
usability and flexibility. Some practical examples of this are:
|
||||
|
||||
- Rendering objects efficiently in high amounts is easy, but when a
|
||||
large scene must be rendered, it can become inefficient. To solve
|
||||
this, visibility computation must be added to the rendering, which
|
||||
makes rendering less efficient, but, at the same time, fewer objects are
|
||||
rendered, so efficiency overall improves.
|
||||
- Configuring the properties of every material for every object that
|
||||
needs to be rendered is also slow. To solve this, objects are sorted
|
||||
by material to reduce the costs, but at the same time sorting has a
|
||||
cost.
|
||||
- In 3D physics a similar situation happens. The best algorithms to
|
||||
handle large amounts of physics objects (such as SAP) are slow
|
||||
at insertion/removal of objects and ray-casting. Algorithms that
|
||||
allow faster insertion and removal, as well as ray-casting, will not
|
||||
be able to handle as many active objects.
|
||||
|
||||
And there are many more examples of this! Game engines strive to be
|
||||
general purpose in nature, so balanced algorithms are always favored
|
||||
over algorithms that might be fast in some situations and slow in
|
||||
others, or algorithms that are fast but make usability more difficult.
|
||||
|
||||
Godot is not an exception and, while it is designed to have backends
|
||||
swappable for different algorithms, the default ones (or more like, the
|
||||
only ones that are there for now) prioritize balance and flexibility
|
||||
over performance.
|
||||
|
||||
With this clear, the aim of this tutorial is to explain how to get the
|
||||
maximum performance out of Godot.
|
||||
|
||||
Rendering
|
||||
~~~~~~~~~
|
||||
|
||||
3D rendering is one of the most difficult areas to get performance from,
|
||||
so this section will have a list of tips.
|
||||
|
||||
Reuse shaders and materials
|
||||
---------------------------
|
||||
|
||||
The Godot renderer is a little different to what is out there. It's designed
|
||||
to minimize GPU state changes as much as possible.
|
||||
:ref:`class_SpatialMaterial`
|
||||
does a good job at reusing materials that need similar shaders but, if
|
||||
custom shaders are used, make sure to reuse them as much as possible.
|
||||
Godot's priorities will be like this:
|
||||
|
||||
- **Reusing Materials**: The fewer different materials in the
|
||||
scene, the faster the rendering will be. If a scene has a huge amount
|
||||
of objects (in the hundreds or thousands) try reusing the materials
|
||||
or in the worst case use atlases.
|
||||
- **Reusing Shaders**: If materials can't be reused, at least try to
|
||||
re-use shaders (or SpatialMaterials with different parameters but the same
|
||||
configuration).
|
||||
|
||||
If a scene has, for example, 20,000 objects with 20,000 different
materials, rendering will be slow. If the same scene has
20,000 objects, but only uses 100 materials, rendering will be much
faster.
|
||||
|
||||
Pixel cost vs vertex cost
|
||||
-------------------------
|
||||
|
||||
It is a common thought that the lower the number of polygons in a model, the
|
||||
faster it will be rendered. This is *really* relative and depends on
|
||||
many factors.
|
||||
|
||||
On modern PCs and consoles, vertex cost is low. GPUs
|
||||
originally only rendered triangles, so all the vertices:
|
||||
|
||||
1. Had to be transformed by the CPU (including clipping).
|
||||
|
||||
2. Had to be sent to the GPU memory from the main RAM.
|
||||
|
||||
Nowadays, all this is handled inside the GPU, so the performance is
|
||||
extremely high. 3D artists usually have the wrong feeling about
|
||||
polycount performance because 3D DCCs (such as Blender, Max, etc.) need
|
||||
to keep geometry in CPU memory in order for it to be edited, reducing
|
||||
actual performance. Truth is, a model rendered by a 3D engine is much
|
||||
more optimal than how 3D DCCs display them.
|
||||
|
||||
On mobile devices, the story is different. PC and Console GPUs are
|
||||
brute-force monsters that can pull as much electricity as they need from
|
||||
the power grid. Mobile GPUs are limited to a tiny battery, so they need
|
||||
to be a lot more power efficient.
|
||||
|
||||
To be more efficient, mobile GPUs attempt to avoid *overdraw*. This
|
||||
means, the same pixel on the screen being rendered (as in, with lighting
|
||||
calculation, etc.) more than once. Imagine a town with several buildings,
|
||||
GPUs don't know what is visible and what is hidden until they
|
||||
draw it. A house might be drawn and then another house in front of it
|
||||
(rendering happened twice for the same pixel!). PC GPUs normally don't
|
||||
care much about this and just throw more pixel processors to the
|
||||
hardware to increase performance (but this also increases power
|
||||
consumption).
|
||||
|
||||
On mobile, pulling more power is not an option, so a technique called
|
||||
"Tile Based Rendering" is used (almost all mobile hardware uses a
|
||||
variant of it), which divides the screen into a grid. Each cell keeps the
|
||||
list of triangles drawn to it and sorts them by depth to minimize
|
||||
*overdraw*. This technique improves performance and reduces power
|
||||
consumption, but takes a toll on vertex performance. As a result, fewer
|
||||
vertices and triangles can be processed for drawing.
|
||||
|
||||
Generally, this is not so bad, but there is a corner case on mobile that
|
||||
must be avoided, which is to have small objects with a lot of geometry
|
||||
within a small portion of the screen. This forces mobile GPUs to put a
|
||||
lot of strain on a single screen cell, considerably decreasing
|
||||
performance (as all the other cells must wait for it to complete in
|
||||
order to display the frame).
|
||||
|
||||
To make it short, do not worry about vertex count so much on mobile, but
|
||||
avoid concentration of vertices in small parts of the screen. If, for
|
||||
example, a character, NPC, vehicle, etc. is far away (so it looks tiny),
|
||||
use a smaller level of detail (LOD) model instead.
|
||||
|
||||
An extra situation where vertex cost must be considered is objects that
|
||||
have extra processing per vertex, such as:
|
||||
|
||||
- Skinning (skeletal animation)
|
||||
- Morphs (shape keys)
|
||||
- Vertex Lit Objects (common on mobile)
|
||||
|
||||
Texture compression
|
||||
-------------------
|
||||
|
||||
Godot offers to compress textures of 3D models when imported (VRAM
|
||||
compression). Video RAM compression is not as efficient in size as PNG
|
||||
or JPG when stored, but increases performance enormously when drawing.
|
||||
|
||||
This is because the main goal of texture compression is bandwidth
|
||||
reduction between memory and the GPU.
|
||||
|
||||
In 3D, the shapes of objects depend more on the geometry than the
|
||||
texture, so compression is generally not noticeable. In 2D, compression
|
||||
depends more on shapes inside the textures, so the artifacts resulting
|
||||
from 2D compression are more noticeable.
|
||||
|
||||
As a warning, most Android devices do not support texture compression of
|
||||
textures with transparency (only opaque), so keep this in mind.
|
||||
|
||||
Transparent objects
|
||||
-------------------
|
||||
|
||||
As mentioned before, Godot sorts objects by material and shader to
|
||||
improve performance. This, however, cannot be done on transparent
|
||||
objects. Transparent objects are rendered from back to front to make
|
||||
blending with what is behind work. As a result, please try to keep
|
||||
transparent objects to a minimum! If an object has a small section with
|
||||
transparency, try to make that section a separate material.
|
||||
|
||||
Level of detail (LOD)
|
||||
---------------------
|
||||
|
||||
As also mentioned before, using objects with fewer vertices can improve
|
||||
performance in some cases. Godot has a simple system to change level
|
||||
of detail:
|
||||
:ref:`GeometryInstance <class_GeometryInstance>`
|
||||
based objects have a visibility range that can be defined. Having
|
||||
several GeometryInstance objects in different ranges works as LOD.
|
||||
|
||||
Use instancing (MultiMesh)
|
||||
--------------------------
|
||||
|
||||
If several identical objects have to be drawn in the same place or
|
||||
nearby, try using :ref:`MultiMesh <class_MultiMesh>`
|
||||
instead. MultiMesh allows the drawing of tens of thousands of objects at
|
||||
very little performance cost, making it ideal for flocks, grass,
|
||||
particles, etc.
|
||||
|
||||
Bake lighting
|
||||
-------------
|
||||
|
||||
Small lights are usually not a performance issue. Shadows cost a little more.
|
||||
In general, if several lights need to affect a scene, it's ideal to bake
|
||||
it (:ref:`doc_baked_lightmaps`). Baking can also improve the scene quality by
|
||||
adding indirect light bounces.
|
||||
|
||||
If working on mobile, baking to texture is recommended, since this
|
||||
method is even faster.
|
||||
549
tutorials/optimization/batching.rst
Normal file
@@ -0,0 +1,549 @@
|
||||
.. _doc_batching:
|
||||
|
||||
Optimization using batching
|
||||
===========================
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Game engines have to send a set of instructions to the GPU in order to tell the
GPU what and where to draw. These instructions are sent through graphics
APIs (Application Programming Interfaces), examples of which are OpenGL,
OpenGL ES, and Vulkan.
|
||||
|
||||
Different APIs incur different costs when drawing objects. OpenGL handles a lot
|
||||
of work for the user in the GPU driver at the cost of more expensive draw calls.
|
||||
As a result, applications can often be sped up by reducing the number of draw
|
||||
calls.
|
||||
|
||||
Draw calls
|
||||
^^^^^^^^^^
|
||||
|
||||
In 2D, we need to tell the GPU to render a series of primitives (rectangles,
|
||||
lines, polygons, etc.). The most obvious technique is to tell the GPU to render
|
||||
one primitive at a time, telling it some information such as the texture used,
|
||||
the material, the position, size, etc. then saying "Draw!" (this is called a
|
||||
draw call).
|
||||
|
||||
It turns out that while this is conceptually simple from the engine side, GPUs
|
||||
operate very slowly when used in this manner. GPUs work much more efficiently
|
||||
if, instead of telling them to draw a single primitive, you tell them to draw a
|
||||
number of similar primitives all in one draw call, which we will call a "batch".
|
||||
|
||||
And it turns out that they don't just work a bit faster when used in this
|
||||
manner; they work a *lot* faster.
|
||||
|
||||
As Godot is designed to be a general purpose engine, the primitives coming into
|
||||
the Godot renderer can be in any order, sometimes similar, and sometimes
|
||||
dissimilar. In order to match the general purpose nature of Godot with the
|
||||
batching preferences of GPUs, Godot features an intermediate layer which can
|
||||
automatically group together primitives wherever possible, and send these
|
||||
batches on to the GPU. This can give an increase in rendering performance while
|
||||
requiring few, if any, changes to your Godot project.
|
||||
|
||||
How it works
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Instructions come into the renderer from your game in the form of a series of
|
||||
items, each of which can contain one or more commands. The items correspond to
|
||||
Nodes in the scene tree, and the commands correspond to primitives such as
|
||||
rectangles or polygons. Some items, such as tilemaps, and text, can contain a
|
||||
large number of commands (tiles and letters respectively). Others, such as
|
||||
sprites, may only contain a single command (rectangle).
|
||||
|
||||
The batcher uses two main techniques to group together primitives:
|
||||
|
||||
* Consecutive items can be joined together
|
||||
* Consecutive commands within an item can be joined to form a batch
|
||||
|
||||
Breaking batching
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
Batching can only take place if the items or commands are similar enough to be
|
||||
rendered in one draw call. Certain changes (or techniques), by necessity, prevent
|
||||
the formation of a contiguous batch; this is referred to as 'breaking batching'.
|
||||
|
||||
Batching will be broken by (amongst other things):
|
||||
* Change of texture
|
||||
* Change of material
|
||||
* Change of primitive type (say going from rectangles to lines)
|
||||
|
||||
.. note::
|
||||
|
||||
If for example, you draw a series of sprites each with a different texture,
|
||||
there is no way they can be batched.
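The grouping rule above can be illustrated with a minimal Python sketch. The ``Command`` class and its fields are hypothetical stand-ins, not engine types, and the real batcher checks more conditions than the three listed here. The sketch only shows that *consecutive* commands sharing texture, material and primitive type end up in one batch, and that any change starts a new one.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Command:
    """Hypothetical stand-in for one 2D draw command (not an engine type)."""
    texture: str
    material: str
    primitive: str  # e.g. "rect", "line", "polygon"


def batch_key(command):
    # A change in any of these three fields breaks the batch.
    return (command.texture, command.material, command.primitive)


def build_batches(commands):
    """Group consecutive commands with an identical key into batches."""
    return [list(group) for _, group in groupby(commands, key=batch_key)]


commands = [
    Command("grass", "default", "rect"),
    Command("grass", "default", "rect"),
    Command("wood", "default", "rect"),   # texture change
    Command("wood", "default", "line"),   # primitive type change
    Command("grass", "default", "rect"),  # texture and primitive type change
]
batches = build_batches(commands)
batch_sizes = [len(batch) for batch in batches]
```

Only the first two commands share a batch here. The last command matches the first two, but it is not adjacent to them, so it cannot join their batch.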
|
||||
|
||||
Render order
|
||||
^^^^^^^^^^^^
|
||||
|
||||
The question arises, if only similar items can be drawn together in a batch, why
|
||||
don't we look through all the items in a scene, group together all the similar
|
||||
items, and draw them together?
|
||||
|
||||
In 3D, this is often exactly how engines work. However, in Godot 2D, items are
|
||||
drawn in 'painter's order', from back to front. This ensures that items at the
|
||||
front are drawn on top of earlier items, when they overlap.
|
||||
|
||||
This also means that if we try and draw objects in order of, for example,
|
||||
texture, then this painter's order may break and objects will be drawn in the
|
||||
wrong order.
|
||||
|
||||
In Godot this back to front order is determined by:
|
||||
* The order of objects in the scene tree
|
||||
* The Z index of objects
|
||||
* The canvas layer
|
||||
* Y sort nodes
|
||||
|
||||
.. note::
|
||||
|
||||
You can group similar objects together for easier batching. While doing so
|
||||
is not a requirement on your part, think of it as an optional approach that
|
||||
can improve performance in some cases. See the diagnostics section in order
|
||||
to help you make this decision.
|
||||
|
||||
A trick
|
||||
^^^^^^^
|
||||
|
||||
And now a sleight of hand. Although the idea of painter's order is that objects
|
||||
are rendered from back to front, consider 3 objects A, B and C, that contain 2
|
||||
different textures, grass and wood.
|
||||
|
||||
.. image:: img/overlap1.png
|
||||
|
||||
In painter's order they are ordered:
|
||||
|
||||
::
|
||||
|
||||
A - wood
|
||||
B - grass
|
||||
C - wood
|
||||
|
||||
Because the texture changes, they cannot be batched, and will be rendered in 3
|
||||
draw calls.
|
||||
|
||||
However, painter's order is only needed on the assumption that they will be
|
||||
drawn *on top* of each other. If we relax that assumption, i.e. if none of these
|
||||
3 objects are overlapping, there is *no need* to preserve painter's order. The
|
||||
rendered result will be the same. What if we could take advantage of this?
|
||||
|
||||
Item reordering
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
.. image:: img/overlap2.png
|
||||
|
||||
It turns out that we can reorder items. However, we can only do this if the
|
||||
items satisfy the conditions of an overlap test, to ensure that the end result
|
||||
will be the same as if they were not reordered. The overlap test is very cheap
|
||||
in performance terms, but not absolutely free, so there is a slight cost to
|
||||
looking ahead to decide whether items can be reordered. The number of items to
|
||||
look ahead for reordering can be set in project settings (see below), in order to
|
||||
balance the costs and benefits in your project.
|
||||
|
||||
::
|
||||
|
||||
A - wood
|
||||
C - wood
|
||||
B - grass
|
||||
|
||||
Because the texture only changes once, we can render the above in only 2
|
||||
draw calls.
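As a rough illustration of the idea, the following Python sketch reorders items using a rectangle overlap test and a limited lookahead. The ``Item`` class, the ``reorder`` function and the rectangle layout are assumptions made for this example. The engine's actual implementation is not shown in this document and will differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical item: a name, a texture and a bounding rect (x, y, w, h)."""
    name: str
    texture: str
    rect: tuple


def overlaps(a, b):
    ax, ay, aw, ah = a.rect
    bx, by, bw, bh = b.rect
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def reorder(items, lookahead):
    """Pull a later item with a matching texture forward when that is safe."""
    items = list(items)
    for i in range(len(items) - 1):
        current = items[i]
        if items[i + 1].texture == current.texture:
            continue  # already adjacent to a matching item
        stop = min(len(items), i + 2 + lookahead)
        for j in range(i + 2, stop):
            candidate = items[j]
            if candidate.texture != current.texture:
                continue
            skipped = items[i + 1:j]
            # Safe only if the candidate overlaps none of the items it jumps over.
            if not any(overlaps(candidate, other) for other in skipped):
                items.insert(i + 1, items.pop(j))
                break
    return items


def count_draw_calls(items):
    """One draw call per run of consecutive items sharing a texture."""
    return sum(
        1 for k, item in enumerate(items)
        if k == 0 or item.texture != items[k - 1].texture
    )


separate = [
    Item("A", "wood", (0, 0, 10, 10)),
    Item("B", "grass", (20, 0, 10, 10)),
    Item("C", "wood", (40, 0, 10, 10)),
]
reordered = reorder(separate, lookahead=1)
```

With the three non-overlapping items above, the order becomes A, C, B and the draw call count drops from 3 to 2. If C overlapped B, the order would be left unchanged.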
|
||||
|
||||
Lights
|
||||
~~~~~~
|
||||
|
||||
Although the job for the batching system is normally quite straightforward, it
|
||||
becomes considerably more complex when 2D lights are used, because lights are
|
||||
drawn using extra passes, one for each light affecting the primitive. Consider 2
|
||||
sprites A and B, with identical texture and material. Without lights they would
|
||||
be batched together and drawn in one draw call. But with 3 lights, they would be
|
||||
drawn as follows, each line a draw call:
|
||||
|
||||
.. image:: img/lights_overlap.png
|
||||
|
||||
::
|
||||
|
||||
A
|
||||
A - light 1
|
||||
A - light 2
|
||||
A - light 3
|
||||
B
|
||||
B - light 1
|
||||
B - light 2
|
||||
B - light 3
|
||||
|
||||
That is a lot of draw calls: 8 for only 2 sprites. Now consider that we are
drawing 1000 sprites: the number of draw calls quickly becomes astronomical, and
performance suffers. This is partly why lights have the potential to drastically
slow down 2D.
|
||||
|
||||
However, if you remember our magician's trick from item reordering, it turns out
|
||||
we can use the same trick to get around painter's order for lights!
|
||||
|
||||
If A and B are not overlapping, we can render them together in a batch, so the
|
||||
draw process is as follows:
|
||||
|
||||
.. image:: img/lights_separate.png
|
||||
|
||||
::
|
||||
|
||||
AB
|
||||
AB - light 1
|
||||
AB - light 2
|
||||
AB - light 3
|
||||
|
||||
|
||||
That is 4 draw calls. Not bad: that is a 50% improvement. However, consider that
|
||||
in a real game, you might be drawing closer to 1000 sprites.
|
||||
|
||||
- Before: 1000 * 4 = 4000 draw calls.
|
||||
- After: 1 * 4 = 4 draw calls.
|
||||
|
||||
That is a 1000x decrease in draw calls, and should give a huge increase in
|
||||
performance.
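The arithmetic above follows a simple pattern: each group that is drawn separately needs one base pass plus one pass per light. A small Python sketch of that count, where ``light_draw_calls`` is a name invented for this example:

```python
def light_draw_calls(num_groups, num_lights):
    """Draw calls for groups drawn separately: one base pass plus one per light."""
    return num_groups * (1 + num_lights)


# Two sprites and three lights, unjoined and joined.
two_sprites_unjoined = light_draw_calls(2, 3)
two_sprites_joined = light_draw_calls(1, 3)

# 1000 sprites and three lights, unjoined and (ideally) joined into one group.
before = light_draw_calls(1000, 3)
after = light_draw_calls(1, 3)
```

This assumes the ideal case where every sprite can be joined into a single group. In practice the count falls somewhere between the two extremes.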
|
||||
|
||||
Overlap test
|
||||
^^^^^^^^^^^^
|
||||
|
||||
However, as with item reordering, things are not that simple: we must first
|
||||
perform the overlap test to determine whether we can join these primitives, and
|
||||
the overlap test has a small cost. So again you can choose the number of
|
||||
primitives to look ahead in the overlap test to balance the benefits against the
|
||||
cost. Usually with lights the benefits far outweigh the costs.
|
||||
|
||||
Also consider that depending on the arrangement of primitives in the viewport,
|
||||
the overlap test will sometimes fail (because the primitives overlap and thus
|
||||
should not be joined). So in practice the decrease in draw calls may be less
|
||||
dramatic than the perfect situation of no overlap. However, performance is
|
||||
usually far higher than without this lighting optimization.
|
||||
|
||||
Light Scissoring
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
Batching can make it more difficult to cull out objects that are not affected or
|
||||
partially affected by a light. This can increase the fill rate requirements
|
||||
quite a bit, and slow rendering. Fill rate is the rate at which pixels are
|
||||
colored; it is another potential bottleneck unrelated to draw calls.
|
||||
|
||||
In order to counter this problem (and also speed up lighting in general),
|
||||
batching introduces light scissoring. This enables the use of the OpenGL command
|
||||
``glScissor()``, which identifies an area, outside of which, the GPU will not
|
||||
render any pixels. We can thus greatly optimize fill rate by identifying the
|
||||
intersection area between a light and a primitive, and limit rendering the light
|
||||
to *that area only*.
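To make the idea concrete, here is a minimal Python sketch that computes the intersection of a light's rectangle and a primitive's rectangle, and the number of pixels that would not need to be lit. The function names and the ``(x, y, width, height)`` layout are assumptions for this example. It also assumes that the "saved" area is the part of the primitive outside the intersection, and it does not call any OpenGL function.

```python
def intersect(a, b):
    """Intersection of two (x, y, width, height) rects, or None if disjoint."""
    left = max(a[0], b[0])
    top = max(a[1], b[1])
    right = min(a[0] + a[2], b[0] + b[2])
    bottom = min(a[1] + a[3], b[1] + b[3])
    if right <= left or bottom <= top:
        return None
    return (left, top, right - left, bottom - top)


def pixels_saved(primitive, light):
    """Pixels of the primitive that fall outside the light (assumed definition)."""
    area = primitive[2] * primitive[3]
    overlap = intersect(primitive, light)
    if overlap is None:
        return area
    return area - overlap[2] * overlap[3]


primitive = (0, 0, 400, 300)
light = (300, 200, 400, 400)  # touches only the bottom-right corner
scissor_rect = intersect(primitive, light)
saved = pixels_saved(primitive, light)
```

In this example only a 100 x 100 pixel corner of the primitive needs the light pass, so most of the primitive's area is skipped.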
|
||||
|
||||
Light scissoring is controlled with the :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
project setting. This value is between 1.0 and 0.0, with 1.0 being off (no
|
||||
scissoring), and 0.0 being scissoring in every circumstance. The reason for the
|
||||
setting is that there may be some small cost to scissoring on some hardware.
|
||||
Generally though, when you are using lighting, it should result in some
|
||||
performance gains.
|
||||
|
||||
The relationship between the threshold and whether a scissor operation takes
|
||||
place is not altogether straightforward, but generally it represents the pixel
|
||||
area that is potentially 'saved' by a scissor operation (i.e. the fill rate
|
||||
saved). At 1.0, the entire screen's pixels would need to be saved, which rarely
|
||||
if ever happens, so it is switched off. In practice the useful values are
|
||||
bunched towards zero, as only a small percentage of pixels need to be saved for
|
||||
the operation to be useful.
|
||||
|
||||
The exact relationship is probably not necessary for users to worry about, but
|
||||
out of interest is included in the appendix.
|
||||
|
||||
.. image:: img/scissoring.png
|
||||
|
||||
*Bottom right is a light, the red area is the pixels saved by the scissoring
|
||||
operation. Only the intersection needs to be rendered.*
|
||||
|
||||
Vertex baking
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
The GPU shader receives instructions on what to draw in 2 main ways:
|
||||
|
||||
* Shader uniforms (e.g. modulate color, item transform)
|
||||
* Vertex attributes (vertex color, local transform)
|
||||
|
||||
However, within a single draw call (batch) we cannot change uniforms. This means
|
||||
that naively, we would not be able to batch together items or commands that
|
||||
change final_modulate, or item transform. Unfortunately that is an awful lot of
|
||||
cases. Sprites for instance typically are individual nodes with their own item
|
||||
transform, and they may have their own color modulate.
|
||||
|
||||
To get around this problem, the batching can "bake" some of the uniforms into
|
||||
the vertex attributes.
|
||||
|
||||
* The item transform can be combined with the local transform and sent in a
|
||||
vertex attribute.
|
||||
|
||||
* The final modulate color can be combined with the vertex colors, and sent in a
|
||||
vertex attribute.
|
||||
|
||||
In most cases this works fine, but this shortcut breaks down if a shader expects
|
||||
these values to be available individually, rather than combined. This can happen
|
||||
in custom shaders.
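The following Python sketch illustrates what baking means here. The data layout (a transform as x axis, y axis and origin, and colors as float tuples) and all function names are assumptions chosen for the example, not the engine's internal structures. Each item's transform is applied to its vertex positions and its modulate color is multiplied into its vertex colors, so that several items can share one vertex list.

```python
# Hypothetical data layout: a transform is (x_axis, y_axis, origin),
# and a color is an (r, g, b, a) tuple of floats in the range 0..1.

def xform(transform, vertex):
    """Apply a 2D affine transform to a single (x, y) vertex."""
    (xx, xy), (yx, yy), (ox, oy) = transform
    x, y = vertex
    return (xx * x + yx * y + ox, xy * x + yy * y + oy)


def multiply_colors(a, b):
    """Component-wise color multiplication."""
    return tuple(ca * cb for ca, cb in zip(a, b))


def bake_items(items):
    """Flatten items into one vertex list with transform and modulate baked in."""
    baked = []
    for item in items:
        for position, color in item["vertices"]:
            baked.append((
                xform(item["transform"], position),
                multiply_colors(color, item["modulate"]),
            ))
    return baked


WHITE = (1.0, 1.0, 1.0, 1.0)
QUAD = [((0, 0), WHITE), ((1, 0), WHITE), ((1, 1), WHITE), ((0, 1), WHITE)]

items = [
    {"vertices": QUAD, "transform": ((1, 0), (0, 1), (10, 0)),
     "modulate": (1.0, 0.0, 0.0, 1.0)},
    {"vertices": QUAD, "transform": ((1, 0), (0, 1), (20, 5)),
     "modulate": (1.0, 1.0, 1.0, 0.5)},
]
vertex_buffer = bake_items(items)
```

After baking, the per-item transform and modulate are no longer available as separate values. This is why a shader that needs them individually prevents this shortcut.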
|
||||
|
||||
Custom Shaders
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
As a result certain operations in custom shaders will prevent baking, and thus
|
||||
decrease the potential for batching. While we are working to decrease these
|
||||
cases, currently the following conditions apply:
|
||||
|
||||
* Reading or writing ``COLOR`` or ``MODULATE`` - disables vertex color baking
|
||||
* Reading ``VERTEX`` - disables vertex position baking
|
||||
|
||||
Project Settings
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
In order to fine tune batching, a number of project settings are available. You
|
||||
can usually leave these at default during development, but it is a good idea to
|
||||
experiment to ensure you are getting maximum performance. Spending a little time
|
||||
tweaking parameters can often give considerable performance gain, for very
|
||||
little effort. See the tooltips in the project settings for more info.
|
||||
|
||||
rendering/batching/options
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`use_batching
|
||||
<class_ProjectSettings_property_rendering/batching/options/use_batching>` -
|
||||
Turns batching on and off
|
||||
|
||||
* :ref:`use_batching_in_editor
|
||||
<class_ProjectSettings_property_rendering/batching/options/use_batching_in_editor>`
|
||||
|
||||
* :ref:`single_rect_fallback
|
||||
<class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
|
||||
- This is a faster way of drawing unbatchable rectangles. However, it may lead
to flicker on some hardware, so it is not recommended
|
||||
|
||||
rendering/batching/parameters
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`max_join_item_commands <class_ProjectSettings_property_rendering/batching/parameters/max_join_item_commands>` -
|
||||
One of the most important ways of achieving
|
||||
batching is to join suitable adjacent items (nodes) together. However, they can
|
||||
only be joined if the commands they contain are compatible. The system must
|
||||
therefore do a lookahead through the commands in an item to determine whether
|
||||
it can be joined. This has a small cost per command, and items with a large
|
||||
number of commands are not worth joining, so the best value may be project
|
||||
dependent.
|
||||
|
||||
* :ref:`colored_vertex_format_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/parameters/colored_vertex_format_threshold>` - Baking colors into
|
||||
vertices results in a
|
||||
larger vertex format. This is not necessarily worth doing unless there are a
|
||||
lot of color changes going on within a joined item. This parameter represents
|
||||
the proportion of commands containing color changes / the total commands,
|
||||
above which it switches to baked colors.
|
||||
|
||||
* :ref:`batch_buffer_size
|
||||
<class_ProjectSettings_property_rendering/batching/parameters/batch_buffer_size>`
|
||||
- This determines the maximum size of a batch. It doesn't have a huge effect
|
||||
on performance but can be worth decreasing for mobile if RAM is at a premium.
|
||||
|
||||
* :ref:`item_reordering_lookahead
|
||||
<class_ProjectSettings_property_rendering/batching/parameters/item_reordering_lookahead>`
|
||||
- Item reordering can help especially with
|
||||
interleaved sprites using different textures. The lookahead for the overlap
|
||||
test has a small cost, so the best value may change per project.
|
||||
|
||||
rendering/batching/lights
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
- See light scissoring.
|
||||
|
||||
* :ref:`max_join_items
|
||||
<class_ProjectSettings_property_rendering/batching/lights/max_join_items>` -
|
||||
Joining items before lighting can significantly increase
|
||||
performance. This requires an overlap test, which has a small cost, so the
best value to use here may be project dependent.
|
||||
|
||||
rendering/batching/debug
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`flash_batching
|
||||
<class_ProjectSettings_property_rendering/batching/debug/flash_batching>` -
|
||||
This is purely a debugging feature to identify regressions between the
|
||||
batching and legacy renderer. When it is switched on, the batching and legacy
|
||||
renderer are used alternately on each frame. This will decrease performance,
|
||||
and should not be used for your final export, only for testing.
|
||||
|
||||
* :ref:`diagnose_frame
|
||||
<class_ProjectSettings_property_rendering/batching/debug/diagnose_frame>` -
|
||||
This will periodically print a diagnostic batching log to
|
||||
the Godot IDE / console.
|
||||
|
||||
rendering/batching/precision
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* :ref:`uv_contract
|
||||
<class_ProjectSettings_property_rendering/batching/precision/uv_contract>` -
|
||||
On some hardware (notably some Android devices) there have been reports of
|
||||
tilemap tiles drawing slightly outside their UV range, leading to edge
|
||||
artifacts such as lines around tiles. If you see this problem, try enabling uv
|
||||
contract. This makes a small contraction in the UV coordinates to compensate
|
||||
for precision errors on devices.
|
||||
|
||||
* :ref:`uv_contract_amount
|
||||
<class_ProjectSettings_property_rendering/batching/precision/uv_contract_amount>`
|
||||
- Hopefully the default amount should cure artifacts on most devices, but just
|
||||
in case, this value is editable.
|
||||
|
||||
Diagnostics
|
||||
~~~~~~~~~~~
|
||||
|
||||
Although you can change parameters and examine the effect on frame rate, this
|
||||
can feel like working blindly, with no idea of what is going on under the hood.
|
||||
To help with this, batching offers a diagnostic mode, which will periodically
|
||||
print out (to the IDE or console) a list of the batches that are being
|
||||
processed. This can help pinpoint situations where batching is not occurring as
|
||||
intended, and help you to fix them, in order to get the best possible
|
||||
performance.
|
||||
|
||||
Reading a diagnostic
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
canvas_begin FRAME 2604
|
||||
items
|
||||
joined_item 1 refs
|
||||
batch D 0-0
|
||||
batch D 0-2 n n
|
||||
batch R 0-1 [0 - 0] {255 255 255 255 }
|
||||
joined_item 1 refs
|
||||
batch D 0-0
|
||||
batch R 0-1 [0 - 146] {255 255 255 255 }
|
||||
batch D 0-0
|
||||
batch R 0-1 [0 - 146] {255 255 255 255 }
|
||||
joined_item 1 refs
|
||||
batch D 0-0
|
||||
batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
|
||||
batch D 0-0
|
||||
batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
|
||||
batch D 0-0
|
||||
batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
|
||||
canvas_end
|
||||
|
||||
|
||||
This is a typical diagnostic.
|
||||
|
||||
* **joined_item** - A joined item can contain 1 or
|
||||
more references to items (nodes). Generally joined_items containing many
|
||||
references are preferable to many joined_items containing a single reference.
|
||||
Whether items can be joined will be determined by their contents and
|
||||
compatibility with the previous item.
|
||||
* **batch R** - a batch containing rectangles. The second number is the number of
|
||||
rects. The second number in square brackets is the Godot texture ID, and the
|
||||
numbers in curly braces are the color. If the batch contains more than one rect,
|
||||
MULTI is added to the line to make it easy to identify. Seeing MULTI is good,
|
||||
because this indicates successful batching.
|
||||
* **batch D** - a default batch, containing everything else that is not currently
|
||||
batched.
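When a diagnostic log is long, a short script can summarize it. The Python sketch below is a hypothetical helper, not part of Godot. It assumes the line format matches the sample above (``joined_item``, ``batch R``, ``batch D`` and a trailing ``MULTI``), which may differ between engine versions.

```python
LOG = """\
canvas_begin FRAME 2604
items
joined_item 1 refs
batch D 0-0
batch D 0-2 n n
batch R 0-1 [0 - 0] {255 255 255 255 }
joined_item 1 refs
batch D 0-0
batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
canvas_end
"""


def summarize(log):
    """Count joined items and batches in a diagnostic log (assumed format)."""
    summary = {
        "joined_items": 0,
        "default_batches": 0,
        "rect_batches": 0,
        "multi_batches": 0,
        "rects": 0,
    }
    for line in log.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "joined_item":
            summary["joined_items"] += 1
        elif parts[:2] == ["batch", "D"]:
            summary["default_batches"] += 1
        elif parts[:2] == ["batch", "R"]:
            summary["rect_batches"] += 1
            # In "0-2560", the second number is the number of rects.
            summary["rects"] += int(parts[2].split("-")[1])
            if parts[-1] == "MULTI":
                summary["multi_batches"] += 1
    return summary


summary = summarize(LOG)
```

A high share of ``MULTI`` batches relative to all rect batches suggests that batching is working well for the scene.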
|
||||
|
||||
Default Batches
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
The second number following default batches is the number of commands in the
|
||||
batch, and it is followed by a brief summary of the contents:
|
||||
|
||||
::
|
||||
|
||||
l - line
|
||||
PL - polyline
|
||||
r - rect
|
||||
n - ninepatch
|
||||
PR - primitive
|
||||
p - polygon
|
||||
m - mesh
|
||||
MM - multimesh
|
||||
PA - particles
|
||||
c - circle
|
||||
t - transform
|
||||
CI - clip_ignore
|
||||
|
||||
You may see "dummy" default batches containing no commands; you can ignore
|
||||
these.
|
||||
|
||||
FAQ
|
||||
~~~
|
||||
|
||||
I don't get a large performance increase from switching on batching
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Try the diagnostics, see how much batching is occurring, and whether it can be
|
||||
improved
|
||||
* Try changing parameters
|
||||
* Consider that batching may not be your bottleneck (see bottlenecks)
|
||||
|
||||
I get a decrease in performance with batching
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Try steps to increase batching given above
|
||||
* Try switching :ref:`single_rect_fallback
|
||||
<class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
|
||||
to on
|
||||
* The single rect fallback method is the default used without batching, and it
|
||||
is approximately twice as fast. However, it can result in flicker on some
|
||||
hardware, so its use is discouraged
|
||||
* After trying the above, if your scene is still performing worse, consider
|
||||
turning off batching.
|
||||
|
||||
I use custom shaders and the items are not batching
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Custom shaders can be problematic for batching, see the custom shaders section
|
||||
|
||||
I am seeing line artifacts appear on certain hardware
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* See the :ref:`uv_contract
|
||||
<class_ProjectSettings_property_rendering/batching/precision/uv_contract>`
|
||||
project setting which can be used to solve this problem.
|
||||
|
||||
I use a large number of textures, so few items are being batched
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
* Consider the use of texture atlases. As well as allowing batching, these
|
||||
reduce the need for state changes associated with changing texture.
|
||||
|
||||
Appendix
|
||||
~~~~~~~~
|
||||
|
||||
Light scissoring threshold calculation
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The actual proportion of screen pixel area used as the threshold is the
|
||||
:ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
value to the power of 4.
|
||||
|
||||
For example, on a screen size ``1920 x 1080`` there are ``2,073,600`` pixels.
|
||||
|
||||
At a threshold of ``1000`` pixels, the proportion would be:
|
||||
|
||||
::
|
||||
|
||||
1000 / 2073600 = 0.00048225
|
||||
0.00048225 ^ 0.25 = 0.14819
|
||||
|
||||
.. note:: Raising to the power of 0.25 is the inverse of raising to the power of 4.
|
||||
|
||||
So a :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
of 0.15 would be a reasonable value to try.
|
||||
|
||||
Going the other way, for instance with a :ref:`scissor_area_threshold
|
||||
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
|
||||
of ``0.5``:
|
||||
|
||||
::
|
||||
|
||||
0.5 ^ 4 = 0.0625
|
||||
0.0625 * 2073600 = 129600 pixels
|
||||
|
||||
If the number of pixels saved is more than this threshold, the scissor is
|
||||
activated.
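The two conversions above can be sketched in GDScript as follows. This is a minimal illustration of the arithmetic only; the function names are hypothetical and are not part of the Godot API.

```gdscript
extends Node

# Hypothetical helper: converts a pixel-count threshold into a
# scissor_area_threshold value (the fourth root of the screen proportion).
func threshold_from_pixels(pixels: float, screen_width: int, screen_height: int) -> float:
    var proportion := pixels / float(screen_width * screen_height)
    return pow(proportion, 0.25)

# Hypothetical helper: converts a scissor_area_threshold value back into
# the number of pixels it represents on a given screen size.
func pixels_from_threshold(threshold: float, screen_width: int, screen_height: int) -> float:
    return pow(threshold, 4.0) * float(screen_width * screen_height)

func _ready() -> void:
    print(threshold_from_pixels(1000.0, 1920, 1080))  # roughly 0.148
    print(pixels_from_threshold(0.5, 1920, 1080))     # about 129600 pixels
```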
|
||||
258
tutorials/optimization/cpu_optimization.rst
Normal file
@@ -0,0 +1,258 @@
|
||||
.. _doc_cpu_optimization:
|
||||
|
||||
CPU Optimizations
|
||||
=================
|
||||
|
||||
Measuring performance
|
||||
=====================
|
||||
|
||||
To know how to speed up our program, we have to know where the "bottlenecks"
|
||||
are. Bottlenecks are the slowest parts of the program that limit the rate at which
|
||||
everything can progress. This allows us to concentrate our efforts on optimizing
|
||||
the areas which will give us the greatest speed improvement, instead of spending
|
||||
a lot of time optimizing functions that will lead to small performance
|
||||
improvements.
|
||||
|
||||
For the CPU, the easiest way to identify bottlenecks is to use a profiler.
|
||||
|
||||
CPU profilers
|
||||
=============
|
||||
|
||||
Profilers run alongside your program and take timing measurements to work out
|
||||
what proportion of time is spent in each function.
|
||||
|
||||
The Godot IDE conveniently has a built-in profiler. It does not run every time
|
||||
you start your project, and must be manually started and stopped. This is
|
||||
because, in common with most profilers, recording these timing measurements can
|
||||
slow down your project significantly.
|
||||
|
||||
After profiling, you can look back at the results for a frame.
|
||||
|
||||
.. image:: img/godot_profiler.png
|
||||
|
||||
`These are the results of a profile of one of the demo projects.`
|
||||
|
||||
.. note:: We can see the cost of built-in processes such as physics and audio,
|
||||
as well as seeing the cost of our own scripting functions at the
|
||||
bottom.
|
||||
|
||||
When a project is running slowly, you will often see an obvious function or
|
||||
process taking a lot more time than others. This is your primary bottleneck, and
|
||||
you can usually increase speed by optimizing this area.
|
||||
|
||||
For more info about using the profiler within Godot see
|
||||
:ref:`doc_debugger_panel`.
|
||||
|
||||
External profilers
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Although the Godot IDE profiler is very convenient and useful, sometimes you
|
||||
need more power, and the ability to profile the Godot engine source code itself.
|
||||
|
||||
You can use a number of third-party profilers to do this, including Valgrind,
|
||||
VerySleepy, Visual Studio and Intel VTune.
|
||||
|
||||
.. note:: You may need to compile Godot from source in order to use a third
|
||||
party profiler so that you have program database information
|
||||
available. You can also use a debug build, however, note that the
|
||||
results of profiling a debug build will be different to a release
|
||||
build, because debug builds are less optimized. Bottlenecks are often
|
||||
in a different place in debug builds, so you should profile release
|
||||
builds wherever possible.
|
||||
|
||||
.. image:: img/valgrind.png
|
||||
|
||||
`These are example results from Callgrind, part of Valgrind, on Linux.`
|
||||
|
||||
From the left, Callgrind is listing the percentage of time within a function and
|
||||
its children (Inclusive), the percentage of time spent within the function
|
||||
itself, excluding child functions (Self), the number of times the function is
|
||||
called, the function name, and the file or module.
|
||||
|
||||
In this example we can see nearly all time is spent under the
|
||||
`Main::iteration()` function. This is the master function in the Godot source
|
||||
code that is called repeatedly, and causes frames to be drawn, physics ticks to
|
||||
be simulated, and nodes and scripts to be updated. A large proportion of the
|
||||
time is spent in the functions to render a canvas (66%), because this example
|
||||
uses a 2d benchmark. Below this we see that almost 50% of the time is spent
|
||||
outside Godot code in `libglapi`, and `i965_dri` (the graphics driver). This
|
||||
tells us that a large proportion of CPU time is being spent in the graphics
|
||||
driver.
|
||||
|
||||
This is actually an excellent example because in an ideal world, only a very
|
||||
small proportion of time would be spent in the graphics driver, and this is an
|
||||
indication that there is a problem with too much communication and work being
|
||||
done in the graphics API. This profiling led to the development of 2d batching,
|
||||
which greatly speeds up 2d by reducing bottlenecks in this area.
|
||||
|
||||
Manually timing functions
|
||||
=========================
|
||||
|
||||
Another handy technique, especially once you have identified the bottleneck
|
||||
using a profiler, is to manually time the function or area under test. The
|
||||
specifics vary according to language, but in GDScript, you would do the
|
||||
following:
|
||||
|
||||
::
|
||||
|
||||
var time_start = OS.get_system_time_msecs()
|
||||
|
||||
# Your function you want to time
|
||||
update_enemies()
|
||||
|
||||
var time_end = OS.get_system_time_msecs()
|
||||
print("Function took: " + str(time_end - time_start) + " ms")
|
||||
|
||||
|
||||
You may want to consider using other functions for time if another time unit is
|
||||
more suitable, for example :ref:`OS.get_system_time_secs
|
||||
<class_OS_method_get_system_time_secs>` if the function will take many seconds.
|
||||
|
||||
When manually timing functions, it is usually a good idea to run the function
|
||||
many times (say ``1000`` or more times), instead of just once (unless it is a
|
||||
very slow function). A large part of the reason for this is that timers often
|
||||
have limited accuracy, and CPUs will schedule processes in a haphazard manner,
|
||||
so an average over a series of runs is more accurate than a single measurement.
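As a rough sketch of this averaging approach, the script below times many calls and divides by the run count. It assumes Godot 3.x and uses ``OS.get_ticks_usec()`` (microsecond ticks since engine start) instead of the millisecond system clock shown earlier. The body of ``update_enemies()`` is a hypothetical stand-in for your own function.

```gdscript
extends Node

const RUNS = 1000

# Hypothetical stand-in for the function being measured.
func update_enemies() -> void:
    var total := 0
    for i in range(100):
        total += i

func _ready() -> void:
    var time_start = OS.get_ticks_usec()
    for i in range(RUNS):
        update_enemies()
    var time_end = OS.get_ticks_usec()

    # Dividing by the number of runs smooths out timer resolution
    # and scheduling noise.
    var average_usec = float(time_end - time_start) / RUNS
    print("Average time per call: " + str(average_usec) + " usec")
```

The loop overhead is included in the measurement, so the result is best used for comparing two versions of a function rather than as an absolute figure.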
|
||||
|
||||
As you attempt to optimize functions, be sure to either repeatedly profile or
|
||||
time them as you go. This will give you crucial feedback as to whether the
|
||||
optimization is working (or not).
|
||||
|
||||
Caches
|
||||
======
|
||||
|
||||
Something else to be particularly aware of, especially when comparing timing
|
||||
results of two different versions of a function, is that the results can be
|
||||
highly dependent on whether the data is in the CPU cache or not. CPUs don't load
|
||||
data directly from main memory, because although main memory can be huge (many
|
||||
GBs), it is very slow to access. Instead CPUs load data from a smaller, higher
|
||||
speed bank of memory, called cache. Loading data from cache is super fast, but
|
||||
every time you try and load a memory address that is not stored in cache, the
|
||||
cache must make a trip to main memory and slowly load in some data. This delay
|
||||
can result in the CPU sitting around idle for a long time, and is referred to as
|
||||
a "cache miss".
|
||||
|
||||
This means that the first time you run a function, it may run slowly, because
|
||||
the data is not in cache. The second and later times, it may run much faster
|
||||
because the data is in cache. So always use averages when timing, and be aware
|
||||
of the effects of cache.
|
||||
|
||||
Understanding caching is also crucial to CPU optimization. If you have an
|
||||
algorithm (routine) that loads small bits of data from randomly spread out areas
|
||||
of main memory, this can result in a lot of cache misses. A lot of the time, the
|
||||
CPU will be waiting around for data instead of doing any work. Instead, if you
|
||||
can make your data accesses localized, or even better, access memory in a linear
|
||||
fashion (like a continuous list), then the cache will work optimally and the CPU
|
||||
will be able to work as fast as possible.
|
||||
|
||||
Godot usually takes care of such low-level details for you. For example, the
|
||||
Server APIs make sure data is optimized for caching already for things like
|
||||
rendering and physics. But you should be especially aware of caching when using
|
||||
GDNative.
|
||||
|
||||
Languages
|
||||
=========
|
||||
|
||||
Godot supports a number of different languages, and it is worth bearing in mind
|
||||
that there are trade-offs involved - some languages are designed for ease of
|
||||
use, at the cost of speed, and others are faster but more difficult to work
|
||||
with.
|
||||
|
||||
Built-in engine functions run at the same speed regardless of the scripting
|
||||
language you choose. If your project is making a lot of calculations in its own
|
||||
code, consider moving those calculations to a faster language.
|
||||
|
||||
GDScript
|
||||
~~~~~~~~
|
||||
|
||||
GDScript is designed to be easy to use and iterate, and is ideal for making many
|
||||
types of games. However, ease of use is considered more important than
|
||||
performance, so if you need to make heavy calculations, consider moving some of
|
||||
your project to one of the other languages.
|
||||
|
||||
C#
|
||||
~~
|
||||
|
||||
C# is popular and has first class support in Godot. It offers a good compromise
|
||||
between speed and ease of use.
|
||||
|
||||
Other languages
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Third parties provide support for several other languages, including `Rust
|
||||
<https://github.com/godot-rust/godot-rust>`_ and `Javascript
|
||||
<https://github.com/GodotExplorer/ECMAScript>`_.
|
||||
|
||||
C++
|
||||
~~~
|
||||
|
||||
Godot is written in C++. Using C++ will usually result in the fastest code,
|
||||
however, on a practical level, it is the most difficult to deploy to end users'
|
||||
machines on different platforms. Options for using C++ include GDNative, and
|
||||
custom modules.
|
||||
|
||||
Threads
|
||||
=======
|
||||
|
||||
Consider using threads when making a lot of calculations that can run parallel
|
||||
to one another. Modern CPUs have multiple cores, each one capable of doing a
|
||||
limited amount of work. By spreading work over multiple threads you can move
|
||||
further towards peak CPU efficiency.
|
||||
|
||||
The disadvantage of threads is that you have to be incredibly careful. As each
|
||||
CPU core operates independently, they can end up trying to access the same
|
||||
memory at the same time. One thread can be reading from a variable while another
|
||||
is writing. Before you use threads make sure you understand the dangers and how
|
||||
to try and prevent these race conditions.
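A minimal sketch of moving a calculation onto a worker thread is given here, assuming the Godot 3.x ``Thread`` and ``Mutex`` classes. The method name ``_sum_range`` and the variable ``shared_total`` are hypothetical, chosen only for illustration.

```gdscript
extends Node

var thread := Thread.new()
var mutex := Mutex.new()
var shared_total := 0

# Runs on the worker thread. "count" is the userdata passed to start().
func _sum_range(count: int) -> int:
    var local_total := 0
    for i in range(count):
        local_total += i
    # Lock before touching data that the main thread may also access.
    mutex.lock()
    shared_total += local_total
    mutex.unlock()
    return local_total

func _ready() -> void:
    thread.start(self, "_sum_range", 1000)
    # The main thread is free to do other work here.
    var result = thread.wait_to_finish()
    print("Worker result: " + str(result))
```

Doing the heavy work on a local variable and locking only for the final write keeps the time spent holding the mutex short.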
|
||||
|
||||
For more information on threads see :ref:`doc_using_multiple_threads`.
|
||||
|
||||
SceneTree
|
||||
=========
|
||||
|
||||
Although Nodes are an incredibly powerful and versatile concept, be aware that
|
||||
every node has a cost. Built-in functions such as `_process()` and
|
||||
`_physics_process()` propagate through the tree. This housekeeping can reduce
|
||||
performance when you have very large numbers of nodes.
|
||||
|
||||
Each node is handled individually in the Godot renderer so sometimes a smaller
|
||||
number of nodes with more in each can lead to better performance.
|
||||
|
||||
One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
|
||||
get much better performance by removing nodes from the SceneTree, rather than
|
||||
by pausing or hiding them. You don't have to delete a detached node. You
|
||||
can, for example, keep a reference to a node, detach it from the scene tree, then
|
||||
reattach it later. This can be very useful for adding and removing areas from a
|
||||
game for example.
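Detaching and reattaching might look like the sketch below, assuming Godot 3.x. The ``area`` node and the function names are hypothetical.

```gdscript
extends Node

# Hypothetical node holding one area of the game.
var area: Node = null

func _ready() -> void:
    area = Node.new()
    area.name = "ForestArea"
    add_child(area)

func detach_area() -> void:
    # The node stays alive because we still hold a reference in "area".
    if area.get_parent() == self:
        remove_child(area)

func attach_area() -> void:
    if area.get_parent() == null:
        add_child(area)

func discard_area() -> void:
    # A detached node is not freed along with the tree, so free it
    # explicitly once it is no longer needed.
    if area != null and area.get_parent() == null:
        area.free()
        area = null
```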
|
||||
|
||||
You can avoid the SceneTree altogether by using Server APIs. For more
|
||||
information, see :ref:`doc_using_servers`.
|
||||
|
||||
Physics
|
||||
=======
|
||||
|
||||
In some situations physics can end up becoming a bottleneck, particularly with
|
||||
complex worlds, and large numbers of physics objects.
|
||||
|
||||
Some techniques to speed up physics:
|
||||
|
||||
* Try using simplified versions of your rendered geometry for physics. Often
|
||||
this won't be noticeable for end users, but can greatly increase performance.
|
||||
* Try removing objects from physics when they are out of view / outside the
|
||||
current area, or reusing physics objects (maybe you allow 8 monsters per area,
|
||||
for example, and reuse these).
|
||||
|
||||
Another crucial aspect to physics is the physics tick rate. In some games you
|
||||
can greatly reduce the tick rate, and instead of, for example, updating physics
|
||||
60 times per second, you may update it at 20, or even 10 ticks per second. This
|
||||
can greatly reduce the CPU load.
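As a small sketch, assuming Godot 3.x, the tick rate can be changed at runtime through the ``Engine`` singleton. The same value is also exposed as the ``physics/common/physics_fps`` project setting.

```gdscript
extends Node

func _ready() -> void:
    # Run physics at 20 ticks per second instead of the default 60.
    Engine.iterations_per_second = 20
```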
|
||||
|
||||
The downside of changing physics tick rate is you can get jerky movement or
|
||||
jitter when the physics update rate does not match the frames rendered.
|
||||
|
||||
The solution to this problem is 'fixed timestep interpolation', which involves
|
||||
smoothing the rendered positions and rotations over multiple frames to match the
|
||||
physics. You can either implement this yourself or use a third-party addon.
|
||||
Interpolation is a very cheap operation, performance wise, compared to running a
|
||||
physics tick, orders of magnitude faster, so this can be a significant win, as
|
||||
well as reducing jitter.
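A simplified sketch of the idea follows. It assumes a Godot version that provides ``Engine.get_physics_interpolation_fraction()`` (3.2 or later), interpolates position only, and assumes the script is attached to a visual node that is not a child of the physics body. The ``target_path`` property is hypothetical.

```gdscript
extends Spatial

# Hypothetical setup: "target_path" points to the physics body whose
# motion should be smoothed.
export(NodePath) var target_path
onready var target := get_node(target_path) as Spatial

var previous_origin := Vector3()
var current_origin := Vector3()

func _ready() -> void:
    previous_origin = target.global_transform.origin
    current_origin = previous_origin

func _physics_process(_delta: float) -> void:
    # Remember the last two positions seen on physics ticks.
    previous_origin = current_origin
    current_origin = target.global_transform.origin

func _process(_delta: float) -> void:
    # Blend between the two positions according to how far the current
    # rendered frame is between physics ticks.
    var fraction := Engine.get_physics_interpolation_fraction()
    var new_transform := global_transform
    new_transform.origin = previous_origin.linear_interpolate(current_origin, fraction)
    global_transform = new_transform
```

A complete solution would also interpolate rotation and handle teleports, which is one reason an existing addon may be preferable.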
|
||||
291
tutorials/optimization/general_optimization.rst
Normal file
@@ -0,0 +1,291 @@
|
||||
.. _doc_general_optimization:
|
||||
|
||||
General optimization tips
|
||||
=========================
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
In an ideal world, computers would run at infinite speed, and the only limit to
|
||||
what we could achieve would be our imagination. In the real world, however, it
|
||||
is all too easy to produce software that will bring even the fastest computer to
|
||||
its knees.
|
||||
|
||||
Designing games and other software is thus a compromise between what we would
|
||||
like to be possible, and what we can realistically achieve while maintaining
|
||||
good performance.
|
||||
|
||||
To achieve the best results, we have two approaches:
|
||||
* Work faster
|
||||
* Work smarter
|
||||
|
||||
And preferably, we will use a blend of the two.
|
||||
|
||||
Smoke and Mirrors
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
Part of working smarter is recognizing that, especially in games, we can often
|
||||
get the player to believe they are in a world that is far more complex,
|
||||
interactive, and graphically exciting than it really is. A good programmer is a
|
||||
magician, and should strive to learn the tricks of the trade, and try to invent
|
||||
new ones.
|
||||
|
||||
The nature of slowness
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To the outside observer, performance problems are often lumped together. But in
|
||||
reality, there are several different kinds of performance problem:
|
||||
|
||||
* A slow process that occurs every frame, leading to a continuously low frame
|
||||
rate
|
||||
* An intermittent process that causes 'spikes' of slowness, leading to
|
||||
stalls
|
||||
* A slow process that occurs outside of normal gameplay, for instance, on
|
||||
level load
|
||||
|
||||
Each of these is annoying to the user, but in different ways.
|
||||
|
||||
Measuring Performance
|
||||
=====================
|
||||
|
||||
Probably the most important tool for optimization is the ability to measure
|
||||
performance - to identify where bottlenecks are, and to measure the success of
|
||||
our attempts to speed them up.
|
||||
|
||||
There are several methods of measuring performance, including:
|
||||
* Putting a start / stop timer around code of interest
|
||||
* Using the Godot profiler
|
||||
* Using external third party profilers
|
||||
* Using GPU profilers / debuggers
|
||||
* Checking the frame rate (with vsync disabled)
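
The last method in the list can be sketched as follows, assuming Godot 3.x, where vsync can be toggled through ``OS.vsync_enabled``.

```gdscript
extends Node

func _ready() -> void:
    # With vsync on, the frame rate is capped at the display refresh rate,
    # which hides how much headroom there is.
    OS.vsync_enabled = false

func _process(_delta: float) -> void:
    # Print every 60th frame to avoid flooding the output.
    if Engine.get_frames_drawn() % 60 == 0:
        print("FPS: " + str(Engine.get_frames_per_second()))
```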
|
||||
|
||||
Be very aware that the relative performance of different areas can vary on
|
||||
different hardware. Often it is a good idea to make timings on more than one
|
||||
device, especially including mobile as well as desktop, if you are targeting
|
||||
mobile.
|
||||
|
||||
Limitations
|
||||
~~~~~~~~~~~
|
||||
|
||||
CPU Profilers are often the 'go to' method for measuring performance, however
|
||||
they don't always tell the whole story.
|
||||
|
||||
- Bottlenecks are often on the GPU, `as a result` of instructions given by the
|
||||
CPU
|
||||
- Spikes can occur in the Operating System processes (outside of Godot) `as a
|
||||
result` of instructions used in Godot (for example dynamic memory allocation)
|
||||
- You may not be able to profile e.g. a mobile phone
|
||||
- You may have to solve performance problems that occur on hardware you don't
|
||||
have access to
|
||||
|
||||
As a result of these limitations, you often need to use detective work to find
|
||||
out where bottlenecks are.
|
||||
|
||||
Detective work
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
Detective work is a crucial skill for developers (both in terms of performance,
|
||||
and also in terms of bug fixing). This can include hypothesis testing, and
|
||||
binary search.
|
||||
|
||||
Hypothesis testing
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Say for example you believe that sprites are slowing down your game. You can
|
||||
test this hypothesis for example by:
|
||||
|
||||
* Measuring the performance when you add more sprites, or take some away.
|
||||
|
||||
This may lead to a further hypothesis - does the size of the sprite determine
|
||||
the performance drop?
|
||||
|
||||
* You can test this by keeping everything the same, but changing the sprite
|
||||
size, and measuring performance
|
||||
|
||||
Binary search
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
Say you know that frames are taking much longer than they should, but you are
|
||||
not sure where the bottleneck lies. You could begin by commenting out
|
||||
approximately half the routines that occur on a normal frame. Has the
|
||||
performance improved more or less than expected?
|
||||
|
||||
Once you know which of the two halves contains the bottleneck, you can then
|
||||
repeat this process, until you have pinned down the problematic area.
|
||||
|
||||
Profilers
|
||||
=========
|
||||
|
||||
Profilers allow you to time your program while running it. Profilers then
|
||||
provide results telling you what percentage of time was spent in different
|
||||
functions and areas, and how often functions were called.
|
||||
|
||||
This can be very useful both to identify bottlenecks and to measure the results
|
||||
of your improvements. Sometimes attempts to improve performance can backfire and
|
||||
lead to slower performance, so always use profiling and timing to guide your
|
||||
efforts.
|
||||
|
||||
For more info about using the profiler within Godot see
|
||||
:ref:`doc_debugger_panel`.
|
||||
|
||||
Principles
|
||||
==========
|
||||
|
||||
Donald Knuth:
|
||||
|
||||
*Programmers waste enormous amounts of time thinking about, or worrying
|
||||
about, the speed of noncritical parts of their programs, and these attempts
|
||||
at efficiency actually have a strong negative impact when debugging and
|
||||
maintenance are considered. We should forget about small efficiencies, say
|
||||
about 97% of the time: premature optimization is the root of all evil. Yet
|
||||
we should not pass up our opportunities in that critical 3%.*
|
||||
|
||||
The messages are very important:
|
||||
|
||||
* Programmer / Developer time is limited. Instead of blindly trying to speed up
|
||||
all aspects of a program we should concentrate our efforts on the aspects that
|
||||
really matter.
|
||||
* Efforts at optimization often end up with code that is harder to read and
|
||||
debug than non-optimized code. It is in our interests to limit this to areas
|
||||
that will really benefit.
|
||||
|
||||
Just because we `can` optimize a particular bit of code, it doesn't necessarily
|
||||
mean that we should. Knowing when, and when not, to optimize is a great skill to
|
||||
develop.
|
||||
|
||||
One misleading aspect of the quote is that people tend to focus on the subquote
|
||||
"premature optimization is the root of all evil". While `premature` optimization
|
||||
is (by definition) undesirable, performant software is the result of performant
|
||||
design.
|
||||
|
||||
Performant design
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
The danger with encouraging people to ignore optimization until necessary, is
|
||||
that it conveniently ignores that the most important time to consider
|
||||
performance is at the design stage, before a key has even hit a keyboard. If the
|
||||
design / algorithms of a program are inefficient, then no amount of polishing the
|
||||
details later will make it run fast. It may run `faster`, but it will never run
|
||||
as fast as a program designed for performance.
|
||||
|
||||
This tends to be far more important in game / graphics programming than in
|
||||
general programming. A performant design, even without low level optimization,
|
||||
will often run many times faster than a mediocre design with low level
|
||||
optimization.
|
||||
|
||||
Incremental design
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Of course, in practice, unless you have prior knowledge, you are unlikely to
|
||||
come up with the best design first time. So you will often make a series of
|
||||
versions of a particular area of code, each taking a different approach to the
|
||||
problem, until you come to a satisfactory solution. It is important not to spend
|
||||
too much time on the details at this stage until you have finalized the overall
|
||||
design, otherwise much of your work will be thrown out.
|
||||
|
||||
It is difficult to give general guidelines for performant design because this is
|
||||
so dependent on the problem. One point worth mentioning though, on the CPU
|
||||
side, is that modern CPUs are nearly always limited by memory bandwidth. This
|
||||
has led to a resurgence in data-oriented design, which involves designing data
|
||||
structures and algorithms for locality of data and linear access, rather than
|
||||
jumping around in memory.
|
||||
|
||||
The optimization process
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Assuming we have a reasonable design, and taking our lessons from Knuth, our
|
||||
first step in optimization should be to identify the biggest bottlenecks - the
|
||||
slowest functions, the low hanging fruit.
|
||||
|
||||
Once we have successfully improved the speed of the slowest area, it may no
|
||||
longer be the bottleneck. So we should test / profile again, and find the next
|
||||
bottleneck on which to focus.
|
||||
|
||||
The process is thus:
|
||||
|
||||
1. Profile / Identify bottleneck
|
||||
2. Optimize bottleneck
|
||||
3. Return to step 1
|
||||
|
||||
Optimizing bottlenecks
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Some profilers will even tell you which parts of a function (which data accesses,
|
||||
calculations) are slowing things down.
|
||||
|
||||
As with design you should concentrate your efforts first on making sure the
|
||||
algorithms and data structures are the best they can be. Data access should be
|
||||
local (to make best use of CPU cache), and it can often be better to use compact
|
||||
storage of data (again, always profile to test results). Often you can precalculate
|
||||
heavy computations ahead of time (e.g. at level load, or loading precalculated
|
||||
data files).
|
||||
|
||||
Once algorithms and data are good, you can often make small changes in routines
|
||||
which improve performance, things like moving calculations outside of loops.
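A small illustration of moving a calculation outside a loop is given here. Both functions are hypothetical examples and return the same result; only the second computes the loop-invariant factor once.

```gdscript
extends Node

# Before: the scale factor is recomputed on every iteration.
func scale_slow(values: Array, a: float, b: float) -> Array:
    var result := []
    for v in values:
        result.append(v * sqrt(a * a + b * b))
    return result

# After: the loop-invariant factor is computed once, before the loop.
func scale_fast(values: Array, a: float, b: float) -> Array:
    var factor := sqrt(a * a + b * b)
    var result := []
    for v in values:
        result.append(v * factor)
    return result
```

Whether a change like this is measurable depends on the loop size, so it is worth timing both versions.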
|
||||
|
||||
Always retest your timing / bottlenecks after making each change. Some changes
|
||||
will increase speed, others may have a negative effect. Sometimes a small
|
||||
positive effect will be outweighed by the negatives of more complex code, and
|
||||
you may choose to leave out that optimization.
|
||||
|
||||
Appendix
|
||||
========
|
||||
|
||||
Bottleneck math
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
The proverb "a chain is only as strong as its weakest link" applies directly to
|
||||
performance optimization. If your project is spending 90% of the time in
|
||||
function 'A', then optimizing A can have a massive effect on performance.
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 9 ms
|
||||
Everything else: 1 ms
|
||||
Total frame time: 10 ms
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 1 ms
|
||||
Everything else: 1 ms
|
||||
Total frame time: 2 ms
|
||||
|
||||
So in this example, improving bottleneck A by a factor of 9x decreases
|
||||
overall frame time by 5x, and increases frames per second by 5x.
|
||||
|
||||
If however, something else is running slowly and also bottlenecking your
|
||||
project, then the same improvement can lead to less dramatic gains:
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 9 ms
|
||||
Everything else: 50 ms
|
||||
Total frame time: 59 ms
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
A: 1 ms
|
||||
Everything else: 50 ms
|
||||
Total frame time: 51 ms
|
||||
|
||||
So in this example, even though we have hugely optimized functionality A, the
|
||||
actual gain in terms of frame rate is quite small.
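The arithmetic behind both examples can be sketched with a hypothetical helper function that divides the old total frame time by the new one.

```gdscript
extends Node

# Hypothetical helper: overall speedup when only part "A" of the frame
# is optimized and everything else stays the same.
func overall_speedup(a_before_ms: float, a_after_ms: float, rest_ms: float) -> float:
    return (a_before_ms + rest_ms) / (a_after_ms + rest_ms)

func _ready() -> void:
    print(overall_speedup(9.0, 1.0, 1.0))   # 10 ms -> 2 ms, a 5x speedup
    print(overall_speedup(9.0, 1.0, 50.0))  # 59 ms -> 51 ms, about 1.16x
```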
|
||||
|
||||
In games, things become even more complicated because the CPU and GPU run
|
||||
independently of one another. Your total frame time is determined by the slower
|
||||
of the two.
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
CPU: 9 ms
|
||||
GPU: 50 ms
|
||||
Total frame time: 50 ms
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
CPU: 1 ms
|
||||
GPU: 50 ms
|
||||
Total frame time: 50 ms
|
||||
|
||||
In this example, we optimized the CPU hugely again, but the frame time did not
|
||||
improve, because we are GPU-bottlenecked.
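Under the simplified model used here, where the CPU and GPU work fully in parallel, frame time is the larger of the two values. The helper name is hypothetical.

```gdscript
extends Node

# Simplified model: the slower of the CPU and GPU sets the frame time.
func frame_time_ms(cpu_ms: float, gpu_ms: float) -> float:
    return max(cpu_ms, gpu_ms)

func _ready() -> void:
    print(frame_time_ms(9.0, 50.0))  # GPU-bound: 50 ms
    print(frame_time_ms(1.0, 50.0))  # still 50 ms after optimizing the CPU
```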
|
||||
263
tutorials/optimization/gpu_optimization.rst
Normal file
@@ -0,0 +1,263 @@
|
||||
.. _doc_gpu_optimization:
|
||||
|
||||
GPU Optimizations
|
||||
=================
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
The demand for new graphics features and progress almost guarantees that you
|
||||
will encounter graphics bottlenecks. Some of these can be CPU side, for instance
|
||||
in calculations inside the Godot engine to prepare objects for rendering.
|
||||
Bottlenecks can also occur on the CPU in the graphics driver, which sorts
|
||||
instructions to pass to the GPU, and in the transfer of these instructions. And
|
||||
finally bottlenecks also occur on the GPU itself.
|
||||
|
||||
Where bottlenecks occur in rendering is highly hardware specific. Mobile GPUs in
|
||||
particular may struggle with scenes that run easily on desktop.
|
||||
|
||||
Understanding and investigating GPU bottlenecks is slightly different to the
|
||||
situation on the CPU, because often you can only change performance indirectly,
|
||||
by changing the instructions you give to the GPU, and it may be more difficult
|
||||
to take measurements. Often the only way of measuring performance is by
|
||||
examining changes in frame rate.
|
||||
|
||||
Drawcalls, state changes, and APIs
|
||||
==================================
|
||||
|
||||
.. note:: The following section is not relevant to end-users, but is useful to
|
||||
provide background information that is relevant in later sections.
|
||||
|
||||
Godot sends instructions to the GPU via a graphics API (OpenGL, GLES2, GLES3,
|
||||
Vulkan). The communication and driver activity involved can be quite costly,
|
||||
especially in OpenGL. If we can provide these instructions in a way that is
|
||||
preferred by the driver and GPU, we can greatly increase performance.
|
||||
|
||||
Nearly every API command in OpenGL requires a certain amount of validation, to
|
||||
make sure the GPU is in the correct state. Even seemingly simple commands can
|
||||
lead to a flurry of behind the scenes housekeeping. Therefore the name of the
|
||||
game is to reduce these instructions to a bare minimum, and group together similar
|
||||
objects as much as possible so they can be rendered together, or with the
|
||||
minimum number of these expensive state changes.
|
||||
|
||||
2D batching
|
||||
~~~~~~~~~~~
|
||||
|
||||
In 2d, the costs of treating each item individually can be prohibitively high -
|
||||
there can easily be thousands on screen. This is why 2d batching is used -
|
||||
multiple similar items are grouped together and rendered in a batch, via a
|
||||
single drawcall, rather than making a separate drawcall for each item. In
|
||||
addition this means that state changes, material and texture changes can be kept
|
||||
to a minimum.
|
||||
|
||||
For more information on 2D batching see :ref:`doc_batching`.
|
||||
|
||||
3D batching
|
||||
~~~~~~~~~~~
|
||||
|
||||
In 3d, we still aim to minimize draw calls and state changes, however, it can be
|
||||
more difficult to batch together several objects into a single draw call. 3d
|
||||
meshes tend to comprise hundreds or thousands of triangles, and combining large
|
||||
meshes at runtime is prohibitively expensive. The cost of joining them quickly
|
||||
exceeds any benefits as the number of triangles grows per mesh. A much better
|
||||
alternative is to join meshes ahead of time (static meshes in relation to each
|
||||
other). This can either be done by artists, or programmatically within Godot.
|
||||
|
||||
There is also a cost to batching together objects in 3d. Several objects
|
||||
rendered as one cannot be individually culled. An entire city that is off screen
|
||||
will still be rendered if it is joined to a single blade of grass that is on
|
||||
screen. So attempting to batch together 3d objects should take account of their
|
||||
location and effect on culling. Despite this, the benefits of joining static
|
||||
objects often outweigh other considerations, especially for large numbers of low
|
||||
poly objects.
|
||||
|
||||
For more information on 3D specific optimizations, see
|
||||
:ref:`doc_optimizing_3d_performance`.
|
||||
|
||||
Reuse Shaders and Materials
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The Godot renderer is a little different from what is out there. It's designed to
minimize GPU state changes as much as possible. :ref:`SpatialMaterial
<class_SpatialMaterial>` does a good job of reusing materials that need similar
shaders but, if custom shaders are used, make sure to reuse them as much as
possible. Godot's priorities are:
|
||||
|
||||
- **Reusing Materials**: The fewer different materials in the
  scene, the faster the rendering will be. If a scene has a huge number
  of objects (in the hundreds or thousands), try reusing the materials
  or, in the worst case, use atlases.
- **Reusing Shaders**: If materials can't be reused, at least try to
  reuse shaders (or SpatialMaterials with different parameters but the same
  configuration).
|
||||
|
||||
If a scene has, for example, ``20,000`` objects that each use a different
material (``20,000`` materials in total), rendering will be slow. If the same
scene has ``20,000`` objects, but only uses ``100`` materials, rendering will be
much faster.
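
To get a feel for the numbers, the sketch below counts how often the bound
material would change while drawing ``20,000`` objects that share ``100``
materials, first in arbitrary order and then sorted by material. It is plain
Python for illustration only: the integer material ids and the
``count_material_switches`` helper are assumptions, and a real renderer sorts on
more than just the material.

```python
import random

def count_material_switches(draw_order):
    """Count how often the bound material changes while walking the draw order."""
    switches = 0
    current = None
    for material in draw_order:
        if material != current:
            switches += 1
            current = material
    return switches

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# 20,000 objects sharing 100 hypothetical material ids.
objects = [rng.randrange(100) for _ in range(20_000)]

unsorted_switches = count_material_switches(objects)
sorted_switches = count_material_switches(sorted(objects))
```

After sorting, the number of switches can never exceed the number of distinct
materials, which is why keeping that number low pays off.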
|
||||
|
||||
Pixel cost vs vertex cost
|
||||
=========================
|
||||
|
||||
You may have heard that the lower the number of polygons in a model, the faster
|
||||
it will be rendered. This is *really* relative and depends on many factors.
|
||||
|
||||
On modern PCs and consoles, vertex cost is low. GPUs originally only rendered
|
||||
triangles, so every frame all the vertices:
|
||||
|
||||
1. Had to be transformed by the CPU (including clipping).
|
||||
|
||||
2. Had to be sent to the GPU memory from the main RAM.
|
||||
|
||||
Now all this is handled inside the GPU, so performance is much higher. 3D
artists often have a misleading impression of polycount performance because 3D
DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory in order
for it to be edited, which reduces actual performance. Game engines rely more on
the GPU, so they can render many triangles much more efficiently.
|
||||
|
||||
On mobile devices, the story is different. PC and Console GPUs are
|
||||
brute-force monsters that can pull as much electricity as they need from
|
||||
the power grid. Mobile GPUs are limited to a tiny battery, so they need
|
||||
to be a lot more power efficient.
|
||||
|
||||
To be more efficient, mobile GPUs attempt to avoid *overdraw*: the same pixel on
the screen being rendered more than once. Imagine a town with several buildings.
GPUs don't know what is visible and what is hidden until they draw it. A house
might be drawn and then another house in front of it (rendering happened twice
for the same pixel!). PC GPUs normally don't care much about this and just throw
more pixel processors at the problem to increase performance (but this also
increases power consumption).
|
||||
|
||||
Using more power is not an option on mobile, so mobile devices use a technique
|
||||
called "Tile Based Rendering" which divides the screen into a grid. Each cell
|
||||
keeps the list of triangles drawn to it and sorts them by depth to minimize
|
||||
*overdraw*. This technique improves performance and reduces power consumption,
|
||||
but takes a toll on vertex performance. As a result, fewer vertices and
|
||||
triangles can be processed for drawing.
|
||||
|
||||
Additionally, Tile Based Rendering struggles when there are small objects with a
|
||||
lot of geometry within a small portion of the screen. This forces mobile GPUs to
|
||||
put a lot of strain on a single screen tile which considerably decreases
|
||||
performance as all the other cells must wait for it to complete in order to
|
||||
display the frame.
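
The effect of concentrated geometry can be illustrated by binning triangles into
screen tiles and looking at the busiest tile. This Python sketch is a toy model
rather than a description of any particular GPU: the ``32`` pixel tile size, the
use of triangle centroids, and the two synthetic point sets are all assumptions.

```python
from collections import Counter

TILE_SIZE = 32  # pixels; an assumed tile size for illustration only

def triangles_per_tile(centroids):
    """Bin triangle centroids (x, y in pixels) into screen tiles."""
    return Counter((int(x) // TILE_SIZE, int(y) // TILE_SIZE) for x, y in centroids)

# 1,024 triangles spread evenly across a 320x320 pixel area...
spread = [(x * 10 + 5, y * 10 + 5) for x in range(32) for y in range(32)]
# ...versus the same number of triangles from a tiny, distant, high-poly model.
clustered = [(100 + i % 10, 100 + i // 100) for i in range(len(spread))]

busiest_spread = max(triangles_per_tile(spread).values())
busiest_clustered = max(triangles_per_tile(clustered).values())
```

With the same total triangle count, the clustered case puts every triangle into
a single tile, while the spread case leaves each tile with only a handful.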
|
||||
|
||||
In summary, do not worry too much about overall vertex count on mobile, but
avoid concentrating vertices in small parts of the screen. If a character, NPC,
vehicle, etc. is far away (so it looks tiny), use a lower level of detail (LOD)
model.
|
||||
|
||||
Pay attention to the additional vertex processing required when using:
|
||||
|
||||
- Skinning (skeletal animation)
|
||||
- Morphs (shape keys)
|
||||
- Vertex-lit objects (common on mobile)
|
||||
|
||||
Pixel / fragment shaders - fill rate
|
||||
====================================
|
||||
|
||||
In contrast to vertex processing, the cost of fragment shading has increased
dramatically over the years. Screen resolutions have increased (the area of a 4K
screen is ``8,294,400`` pixels, versus ``307,200`` for an old ``640x480`` VGA
screen, which is 27x the area), and the complexity of fragment shaders has also
exploded. Physically based rendering requires complex calculations for each
fragment.
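
The figures above can be checked with a few lines of arithmetic. The Python
below assumes a 4K screen means ``3840x2160`` and, for the per-second estimate,
that every pixel is shaded exactly once per frame at 60 frames per second; real
scenes with overdraw will shade more.

```python
def pixel_count(width, height):
    """Number of pixels in one full-screen pass, ignoring overdraw."""
    return width * height

vga = pixel_count(640, 480)
uhd_4k = pixel_count(3840, 2160)
ratio = uhd_4k / vga

# Fragments shaded per second at 60 FPS, assuming every pixel is shaded once.
fragments_per_second_4k = uhd_4k * 60
```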
|
||||
|
||||
You can test whether a project is fill rate limited quite easily. Turn off vsync
|
||||
to prevent capping the frames per second, then compare the frames per second
|
||||
when running with a large window, to running with a postage stamp sized window
|
||||
(you may also benefit from similarly reducing your shadow map size if using
|
||||
shadows). Usually you will find the fps increases quite a bit using a small
|
||||
window, which indicates you are to some extent fill rate limited. If on the
|
||||
other hand there is little to no increase in fps, then your bottleneck lies
|
||||
elsewhere.
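
If you note down the two frame rates, the comparison can be reduced to a simple
ratio. The helper names and the ``1.2`` threshold in this Python sketch are
arbitrary choices for illustration, not values taken from Godot; the frame rates
are assumed to have been measured by hand with vsync off.

```python
def fill_rate_speedup(fps_large_window, fps_small_window):
    """Ratio of FPS in a tiny window to FPS in a large window."""
    if fps_large_window <= 0:
        raise ValueError("fps_large_window must be positive")
    return fps_small_window / fps_large_window

def looks_fill_rate_limited(fps_large_window, fps_small_window, threshold=1.2):
    """True if shrinking the window sped things up by at least the threshold."""
    # threshold is an arbitrary cut-off chosen for this sketch, not a Godot value.
    return fill_rate_speedup(fps_large_window, fps_small_window) >= threshold
```

For example, going from 60 FPS in a large window to 240 FPS in a tiny one gives
a ratio of 4, which points strongly at fill rate, while 60 FPS versus 62 FPS
suggests the bottleneck is elsewhere.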
|
||||
|
||||
You can increase performance in a fill rate limited project by reducing the
|
||||
amount of work the GPU has to do. You can do this by simplifying the shader
|
||||
(perhaps turn off expensive options if you are using a :ref:`SpatialMaterial
|
||||
<class_SpatialMaterial>`), or reducing the number and size of textures used.
|
||||
|
||||
Consider shipping simpler shaders for mobile.
|
||||
|
||||
Reading textures
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
The other factor in fragment shaders is the cost of reading textures. Reading
textures is an expensive operation (especially reading from several in a single
fragment shader), and filtering may add to the expense (trilinear filtering
between mipmaps, and averaging). Reading textures is also expensive in terms of
power, which is a big issue on mobile devices.
|
||||
|
||||
Texture compression
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Godot compresses textures of 3D models when imported (VRAM compression) by
|
||||
default. Video RAM compression is not as efficient in size as PNG or JPG when
|
||||
stored, but increases performance enormously when drawing.
|
||||
|
||||
This is because the main goal of texture compression is bandwidth reduction
|
||||
between memory and the GPU.
|
||||
|
||||
In 3D, the shapes of objects depend more on the geometry than the texture, so
compression is generally not noticeable. In 2D, the shapes seen on screen come
mostly from the textures themselves, so compression artifacts are more
noticeable.
|
||||
|
||||
As a warning, most Android devices do not support texture compression of
|
||||
textures with transparency (only opaque), so keep this in mind.
|
||||
|
||||
Post processing / shadows
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Post processing effects and shadows can also be expensive in terms of fragment
|
||||
shading activity. Always test the impact of these on different hardware.
|
||||
|
||||
Reducing the size of shadow maps can increase performance, both in terms of
|
||||
writing, and reading the maps.
|
||||
|
||||
Transparency / blending
|
||||
=======================
|
||||
|
||||
Transparent items present particular problems for rendering efficiency. Opaque
items (especially in 3D) can essentially be rendered in any order, and the
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
blended objects are different. In most cases, they cannot rely on the Z-buffer
and must be rendered in "painter's order" (i.e. from back to front) in order to
look correct.
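
Painter's order amounts to sorting by distance from the viewer, farthest first.
The Python sketch below shows the idea under simplifying assumptions: each item
is a hypothetical dictionary with a single ``position``, distance is measured
from the camera position rather than along the view direction, and items do not
intersect each other.

```python
def painters_order(items, camera_position):
    """Sort items back to front (farthest from the camera first)."""
    def distance_squared(item):
        # Squared distance preserves the ordering and avoids a square root.
        return sum((p - c) ** 2 for p, c in zip(item["position"], camera_position))
    return sorted(items, key=distance_squared, reverse=True)

camera = (0.0, 0.0, 0.0)
transparent_items = [
    {"name": "window", "position": (0.0, 0.0, -2.0)},
    {"name": "smoke", "position": (0.0, 0.0, -10.0)},
    {"name": "glass", "position": (0.0, 0.0, -5.0)},
]
draw_order = [item["name"] for item in painters_order(transparent_items, camera)]
```

Because this ordering depends on the camera, it has to be recomputed as the view
changes and cannot be combined with sorting by material.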
|
||||
|
||||
The transparent items are also particularly bad for fill rate, because every
|
||||
item has to be drawn, even if later transparent items will be drawn on top.
|
||||
|
||||
Opaque items don't have to do this. They can usually take advantage of the
Z-buffer by first writing only to the Z-buffer, then running the fragment shader
only on the 'winning' fragment: the item that is at the front at a particular
pixel.
|
||||
|
||||
Transparency is particularly expensive where multiple transparent items overlap.
|
||||
It is usually better to use as small a transparent area as possible in order to
|
||||
minimize these fill rate requirements, especially on mobile, where fill rate is
|
||||
very expensive. Indeed, in many situations, rendering more complex opaque
|
||||
geometry can end up being faster than using transparency to "cheat".
|
||||
|
||||
Multi-Platform Advice
|
||||
=====================
|
||||
|
||||
If you are aiming to release on multiple platforms, test *early* and test
*often* on all your platforms, especially mobile. Developing a game on desktop
but attempting to port to mobile at the last minute is a recipe for disaster.
|
||||
|
||||
In general you should design your game for the lowest common denominator, then
|
||||
add optional enhancements for more powerful platforms. For example, you may want
|
||||
to use the GLES2 backend for both desktop and mobile platforms where you target
|
||||
both.
|
||||
|
||||
Mobile / tile renderers
|
||||
=======================
|
||||
|
||||
GPUs on mobile devices work in dramatically different ways from GPUs on desktop.
|
||||
Most mobile devices use tile renderers. Tile renderers split up the screen into
|
||||
regular sized tiles that fit into super fast cache memory, and reduce the reads
|
||||
and writes to main memory.
|
||||
|
||||
There are some downsides though. Tile rendering can make certain techniques much
more complicated and expensive to perform. Tiles that rely on the results of
rendering in different tiles or on the results of earlier operations being
preserved can be very slow. Be very careful to test the performance of shaders,
viewport textures and post processing.
|
||||
BIN
tutorials/optimization/img/godot_profiler.png
Normal file
|
After Width: | Height: | Size: 45 KiB |
BIN
tutorials/optimization/img/lights_overlap.png
Normal file
|
After Width: | Height: | Size: 146 KiB |
BIN
tutorials/optimization/img/lights_separate.png
Normal file
|
After Width: | Height: | Size: 160 KiB |
BIN
tutorials/optimization/img/overlap1.png
Normal file
|
After Width: | Height: | Size: 96 KiB |
BIN
tutorials/optimization/img/overlap2.png
Normal file
|
After Width: | Height: | Size: 101 KiB |
BIN
tutorials/optimization/img/scissoring.png
Normal file
|
After Width: | Height: | Size: 56 KiB |
BIN
tutorials/optimization/img/valgrind.png
Normal file
|
After Width: | Height: | Size: 177 KiB |
@@ -1,9 +1,75 @@
|
||||
Optimization
|
||||
=============
|
||||
|
||||
Introduction
|
||||
~~~~~~~~~~~~
|
||||
|
||||
Godot follows a balanced performance philosophy. In the performance world, there
|
||||
are always trade-offs, which consist of trading speed for usability and
|
||||
flexibility. Some practical examples of this are:
|
||||
|
||||
- Rendering objects efficiently in high amounts is easy, but when a
|
||||
large scene must be rendered, it can become inefficient. To solve this,
|
||||
visibility computation must be added to the rendering, which makes rendering
|
||||
less efficient, but, at the same time, fewer objects are rendered, so
|
||||
efficiency overall improves.
|
||||
|
||||
- Configuring the properties of every material for every object that
|
||||
needs to be rendered is also slow. To solve this, objects are sorted by
|
||||
material to reduce the costs, but at the same time sorting has a cost.
|
||||
|
||||
- In 3D physics a similar situation happens. The best algorithms to
|
||||
handle large amounts of physics objects (such as SAP) are slow at
|
||||
insertion/removal of objects and ray-casting. Algorithms that allow faster
|
||||
insertion and removal, as well as ray-casting, will not be able to handle as
|
||||
many active objects.
|
||||
|
||||
And there are many more examples of this! Game engines strive to be general
|
||||
purpose in nature, so balanced algorithms are always favored over algorithms
|
||||
that might be fast in some situations and slow in others, or algorithms that are
|
||||
fast but make usability more difficult.
|
||||
|
||||
Godot is not an exception and, while it is designed to have backends swappable
|
||||
for different algorithms, the default ones prioritize balance and flexibility
|
||||
over performance.
|
||||
|
||||
With this clear, the aim of this tutorial section is to explain how to get the
|
||||
maximum performance out of Godot. While the tutorials can be read in any order,
|
||||
it is a good idea to start from :ref:`doc_general_optimization`.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:name: toc-learn-features-optimization
|
||||
:caption: Common
|
||||
:name: toc-learn-features-general-optimization
|
||||
|
||||
general_optimization
|
||||
using_servers
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: CPU
|
||||
:name: toc-learn-features-cpu-optimization
|
||||
|
||||
cpu_optimization
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: GPU
|
||||
:name: toc-learn-features-gpu-optimization
|
||||
|
||||
gpu_optimization
|
||||
using_multimesh
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: 2D
|
||||
:name: toc-learn-features-2d-optimization
|
||||
|
||||
batching
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:caption: 3D
|
||||
:name: toc-learn-features-3d-optimization
|
||||
|
||||
optimizing_3d_performance
|
||||
|
||||
143
tutorials/optimization/optimizing_3d_performance.rst
Normal file
@@ -0,0 +1,143 @@
|
||||
.. meta::
|
||||
:keywords: optimization
|
||||
|
||||
.. _doc_optimizing_3d_performance:
|
||||
|
||||
Optimizing 3D performance
|
||||
=========================
|
||||
|
||||
Culling
|
||||
=======
|
||||
|
||||
Godot will automatically perform view frustum culling in order to prevent
|
||||
rendering objects that are outside the viewport. This works well for games that
|
||||
take place in a small area, however things can quickly become problematic in
|
||||
larger levels.
|
||||
|
||||
Occlusion culling
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
When walking around a town, for example, you may only be able to see a few
buildings in the street you are in, as well as the sky and a few birds flying
overhead. As far as a naive renderer is concerned, however, you can still see
the entire town. It won't just render the buildings in front of you; it will
also render the street behind them, the people on that street, and the buildings
behind that. You quickly end up in situations where you are attempting to render
10x or 100x more than what is visible.
|
||||
|
||||
Things aren't quite as bad as they seem, because the Z-buffer usually allows the
|
||||
GPU to only fully shade the objects that are at the front. However, unneeded
|
||||
objects are still reducing performance.
|
||||
|
||||
One way we can potentially reduce the amount to be rendered is to take advantage
of occlusion. As of version 3.2.2, there is no built-in support for occlusion in
Godot. However, with careful design you can still get many of the advantages.
|
||||
|
||||
For instance, in our city street scenario, you may be able to work out in advance
|
||||
that you can only see two other streets, ``B`` and ``C``, from street ``A``.
|
||||
Streets ``D`` to ``Z`` are hidden. In order to take advantage of occlusion, all
|
||||
you have to do is work out when your viewer is in street ``A`` (perhaps using
|
||||
Godot Areas), then you can hide the other streets.
|
||||
|
||||
This is a manual version of what is known as a 'potentially visible set'. It is
|
||||
a very powerful technique for speeding up rendering. You can also use it to
|
||||
restrict physics or AI to the local area, and speed these up as well as
|
||||
rendering.
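
A hand-built potentially visible set can be as simple as a lookup table. The
Python sketch below models the street example; the ``VISIBLE_FROM`` table, the
street names and the ``streets_to_hide`` helper are hypothetical, and in a real
project the result would be applied by hiding or showing the nodes for each
street.

```python
# Hypothetical, hand-authored table: which streets can be seen from each street.
VISIBLE_FROM = {
    "A": {"A", "B", "C"},
    "B": {"A", "B", "D"},
}

def streets_to_hide(current_street, all_streets):
    """Streets that can be hidden while the viewer is in current_street."""
    # Unknown streets fall back to "everything visible", which is always safe.
    visible = VISIBLE_FROM.get(current_street, set(all_streets))
    return set(all_streets) - visible

all_streets = [chr(code) for code in range(ord("A"), ord("Z") + 1)]
hidden = streets_to_hide("A", all_streets)
```

The fallback matters: if the table is incomplete, showing too much only costs
performance, whereas hiding too much produces visible errors.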
|
||||
|
||||
Other occlusion techniques
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
There are other occlusion techniques such as portals, automatic PVS, and
raster-based occlusion culling. Some of these may be available through addons,
and may become available in core Godot in the future.
|
||||
|
||||
Transparent objects
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
|
||||
<class_Shader>` to improve performance. This, however, cannot be done with
|
||||
transparent objects. Transparent objects are rendered from back to front to make
|
||||
blending with what is behind work. As a result, try to use as few transparent
|
||||
objects as possible. If an object has a small section with transparency, try to
|
||||
make that section a separate surface with its own Material.
|
||||
|
||||
For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
|
||||
doc.
|
||||
|
||||
Level of detail (LOD)
|
||||
=====================
|
||||
|
||||
In some situations, particularly at a distance, it can be a good idea to replace
|
||||
complex geometry with simpler versions - the end user will probably not be able
|
||||
to see much difference. Consider looking at a large number of trees in the far
|
||||
distance. There are several strategies for replacing models at varying distance.
|
||||
You could use lower poly models, or use transparency to simulate more complex
|
||||
geometry.
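
Choosing which version of a model to show usually comes down to comparing the
distance to the camera against a few thresholds. The following Python sketch
shows one way to do that; the distances, the LOD names and the ``select_lod``
helper are assumptions for illustration and would need tuning per project.

```python
from bisect import bisect_right

# Assumed switch distances in world units; tune per project.
LOD_DISTANCES = [20.0, 60.0, 150.0]
LOD_NAMES = ["high_poly", "medium_poly", "low_poly", "billboard"]

def select_lod(distance_to_camera):
    """Pick the LOD whose distance band contains distance_to_camera."""
    return LOD_NAMES[bisect_right(LOD_DISTANCES, distance_to_camera)]
```

In practice you may want some hysteresis around each threshold so that a model
sitting right on a boundary does not flicker between two versions.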
|
||||
|
||||
Billboards and imposters
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The simplest version of using transparency to deal with LOD is billboards. For
example, you can use a single transparent quad to represent a tree at distance.
This can be very cheap to render, unless, of course, there are many trees in
front of each other, in which case transparency may start eating into fill rate
(for more information on fill rate, see :ref:`doc_gpu_optimization`).
|
||||
|
||||
An alternative is to render not just one tree, but a number of trees together as
|
||||
a group. This can be especially effective if you can see an area but cannot
|
||||
physically approach it in a game.
|
||||
|
||||
You can make imposters by pre-rendering views of an object at different angles.
|
||||
Or you can even go one step further, and periodically re-render a view of an
|
||||
object onto a texture to be used as an imposter. At a distance, you need to move
|
||||
the viewer a considerable distance for the angle of view to change
|
||||
significantly. This can be complex to get working, but may be worth it depending
|
||||
on the type of project you are making.
|
||||
|
||||
Use instancing (MultiMesh)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If several identical objects have to be drawn in the same place or nearby, try
|
||||
using :ref:`MultiMesh <class_MultiMesh>` instead. MultiMesh allows the drawing
|
||||
of many thousands of objects at very little performance cost, making it ideal
|
||||
for flocks, grass, particles, and anything else where you have thousands of
|
||||
identical objects.
|
||||
|
||||
Also see the :ref:`Using MultiMesh <doc_using_multimesh>` doc.
|
||||
|
||||
Bake lighting
|
||||
=============
|
||||
|
||||
Lighting objects is one of the most costly rendering operations. Realtime
|
||||
lighting, shadows (especially multiple lights), and GI are especially expensive.
|
||||
They may simply be too much for lower power mobile devices to handle.
|
||||
|
||||
Consider using baked lighting, especially for mobile. This can look fantastic,
|
||||
but has the downside that it will not be dynamic. Sometimes this is a trade-off
|
||||
worth making.
|
||||
|
||||
In general, if several lights need to affect a scene, it's best to use
|
||||
:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
|
||||
indirect light bounces.
|
||||
|
||||
Animation / Skinning
|
||||
====================
|
||||
|
||||
Animation and particularly vertex animation such as skinning and morphing can be
|
||||
very expensive on some platforms. You may need to lower poly count considerably
|
||||
for animated models or limit the number of them on screen at any one time.
|
||||
|
||||
Large worlds
|
||||
============
|
||||
|
||||
If you are making large worlds, there are different considerations than what you
|
||||
may be familiar with from smaller games.
|
||||
|
||||
Large worlds may need to be built in tiles that can be loaded on demand as you
|
||||
move around the world. This can prevent memory use from getting out of hand, and
|
||||
also limit the processing needed to the local area.
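
The bookkeeping for loading tiles on demand can be sketched as set arithmetic on
tile coordinates. This Python example is a minimal model, not a streaming
system: the tile size, the load radius and both helper functions are assumed
names and values, and the actual loading and freeing of scenes is left out.

```python
TILE_SIZE = 256.0  # world units per tile; an assumed value
LOAD_RADIUS = 1    # load the player's tile plus this many tiles in each direction

def tiles_needed(player_x, player_z):
    """Set of (tile_x, tile_z) coordinates that should be loaded."""
    center_x = int(player_x // TILE_SIZE)
    center_z = int(player_z // TILE_SIZE)
    return {
        (center_x + dx, center_z + dz)
        for dx in range(-LOAD_RADIUS, LOAD_RADIUS + 1)
        for dz in range(-LOAD_RADIUS, LOAD_RADIUS + 1)
    }

def update_loaded(loaded, player_x, player_z):
    """Return (tiles_to_load, tiles_to_unload) for the new player position."""
    needed = tiles_needed(player_x, player_z)
    return needed - loaded, loaded - needed
```

When the player crosses into a neighbouring tile, only one row or column of
tiles changes, so the work per crossing stays small.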
|
||||
|
||||
There may be glitches due to floating point error in large worlds. You may be
|
||||
able to use techniques such as orienting the world around the player (rather
|
||||
than the other way around), or shifting the origin periodically to keep things
|
||||
centred around (0, 0, 0).
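
The loss of precision far from the origin can be demonstrated directly. The
Python sketch below assumes positions are stored as 32-bit floats and uses the
standard ``struct`` module to round values to that precision; the coordinates
and the ``to_float32`` helper are chosen purely for illustration.

```python
import struct

def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

# One hundredth of a unit of movement, one million units from the origin.
far_before = to_float32(1_000_000.0)
far_after = to_float32(1_000_000.0 + 0.01)
movement_lost = far_after == far_before

# After shifting the origin so the player is near (0, 0, 0), the same
# movement is representable with only a tiny rounding error.
near_after = to_float32(0.0 + 0.01)
near_error = abs(near_after - 0.01)
```

At a distance of one million units, neighbouring 32-bit values are ``0.0625``
apart, so small movements are rounded away entirely. Near the origin, the same
movement survives almost unchanged.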
|
||||