Overhaul optimization tutorials

Co-authored-by: lawnjelly <lawnjelly@gmail.com>
This commit is contained in:
clayjohn
2020-07-11 15:47:51 -07:00
parent 859d322e96
commit 8c02d179a5
15 changed files with 1571 additions and 194 deletions


@@ -7,7 +7,6 @@
introduction_to_3d
using_transforms
optimizing_3d_performance
3d_rendering_limitations
spatial_material
lights_and_shadows


@@ -1,192 +0,0 @@
.. meta::
:keywords: optimization
.. _doc_optimizing_3d_performance:
Optimizing 3D performance
=========================
Introduction
~~~~~~~~~~~~
Godot follows a balanced performance philosophy. In the performance world,
there are always trade-offs, which consist of trading speed for
usability and flexibility. Some practical examples of this are:
- Rendering large numbers of objects efficiently is easy, but when a
large scene must be rendered, it can become inefficient. To solve
this, visibility computation must be added to the rendering, which
makes rendering less efficient, but, at the same time, fewer objects are
rendered, so efficiency overall improves.
- Configuring the properties of every material for every object that
needs to be rendered is also slow. To solve this, objects are sorted
by material to reduce the costs, but at the same time sorting has a
cost.
- In 3D physics a similar situation happens. The best algorithms to
handle large amounts of physics objects (such as SAP) are slow
at insertion/removal of objects and ray-casting. Algorithms that
allow faster insertion and removal, as well as ray-casting, will not
be able to handle as many active objects.
And there are many more examples of this! Game engines strive to be
general purpose in nature, so balanced algorithms are always favored
over algorithms that might be fast in some situations and slow in
others, or algorithms that are fast but make usability more difficult.
Godot is no exception: while it is designed to allow swappable backends
for different algorithms, the default ones (or, more accurately, the
only ones there for now) prioritize balance and flexibility over
performance.
With this in mind, the aim of this tutorial is to explain how to get the
maximum performance out of Godot.
Rendering
~~~~~~~~~
3D rendering is one of the most difficult areas to get performance from,
so this section will have a list of tips.
Reuse shaders and materials
---------------------------
The Godot renderer is a little different from what is out there. It's designed
to minimize GPU state changes as much as possible.
:ref:`class_SpatialMaterial`
does a good job at reusing materials that need similar shaders, but if
custom shaders are used, make sure to reuse them as much as possible.
Godot's priorities are as follows:
- **Reusing Materials**: The fewer different materials in the
scene, the faster the rendering will be. If a scene has a huge amount
of objects (in the hundreds or thousands), try reusing the materials,
or in the worst case, use atlases.
- **Reusing Shaders**: If materials can't be reused, at least try to
re-use shaders (or SpatialMaterials with different parameters but the same
configuration).
If a scene has, for example, 20,000 objects each with its own
material, rendering will be slow. If the same scene has 20,000
objects, but only uses 100 materials, rendering will be blazingly
fast.
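As an illustrative sketch of material reuse in GDScript (the node and
resource names here are hypothetical), a single material resource can be
shared across many meshes:

::

    # Load one material resource and share it across many MeshInstances.
    var shared_material = preload("res://materials/rock.tres")

    func _ready():
        for child in $Rocks.get_children():
            if child is MeshInstance:
                # Every rock now uses the same material, so the renderer
                # can sort and draw them with fewer state changes.
                child.material_override = shared_material

Since resources are shared by reference in Godot, assigning the same
material to many objects costs almost nothing.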
Pixel cost vs vertex cost
-------------------------
It is commonly thought that the lower the number of polygons in a model, the
faster it will be rendered. This is *really* relative and depends on
many factors.
On a modern PC and console, vertex cost is low. GPUs
originally only rendered triangles, so all the vertices:
1. Had to be transformed by the CPU (including clipping).
2. Had to be sent to the GPU memory from the main RAM.
Nowadays, all this is handled inside the GPU, so the performance is
extremely high. 3D artists often have a mistaken intuition about
polycount performance because 3D DCCs (such as Blender, Max, etc.) need
to keep geometry in CPU memory in order for it to be edited, reducing
actual performance. The truth is, a model rendered by a 3D engine is
displayed much more efficiently than in the 3D DCCs themselves.
On mobile devices, the story is different. PC and Console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power efficient.
To be more efficient, mobile GPUs attempt to avoid *overdraw*: the same
pixel on the screen being rendered (as in, with lighting calculation,
etc.) more than once. Imagine a town with several buildings. GPUs don't
know what is visible and what is hidden until they draw it. A house
might be drawn, and then another house in front of it (so rendering
happened twice for the same pixel!). PC GPUs normally don't
care much about this and just throw more pixel processors to the
hardware to increase performance (but this also increases power
consumption).
On mobile, pulling more power is not an option, so a technique called
"Tile Based Rendering" is used (almost all mobile GPUs use a variant
of it), which divides the screen into a grid. Each cell keeps the
list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power
consumption, but takes a toll on vertex performance. As a result, fewer
vertices and triangles can be processed for drawing.
Generally, this is not so bad, but there is a corner case on mobile that
must be avoided, which is to have small objects with a lot of geometry
within a small portion of the screen. This forces mobile GPUs to put a
lot of strain on a single screen cell, considerably decreasing
performance (as all the other cells must wait for it to complete in
order to display the frame).
To make it short, do not worry about vertex count so much on mobile, but
avoid concentration of vertices in small parts of the screen. If, for
example, a character, NPC, vehicle, etc. is far away (so it looks tiny),
use a smaller level of detail (LOD) model instead.
An extra situation where vertex cost must be considered is objects that
have extra processing per vertex, such as:
- Skinning (skeletal animation)
- Morphs (shape keys)
- Vertex Lit Objects (common on mobile)
Texture compression
-------------------
Godot can compress the textures of 3D models when they are imported
(VRAM compression). Video RAM compression is not as efficient in size
as PNG or JPG when stored, but increases performance enormously when drawing.
This is because the main goal of texture compression is bandwidth
reduction between memory and the GPU.
In 3D, the shapes of objects depend more on the geometry than the
texture, so compression is generally not noticeable. In 2D, compression
depends more on shapes inside the textures, so the artifacts resulting
from 2D compression are more noticeable.
As a warning, most Android devices do not support compression of
textures with transparency (only opaque textures), so keep this in mind.
Transparent objects
-------------------
As mentioned before, Godot sorts objects by material and shader to
improve performance. This, however, cannot be done on transparent
objects. Transparent objects are rendered from back to front to make
blending with what is behind work. As a result, please try to keep
transparent objects to a minimum! If an object has a small section with
transparency, try to make that section a separate material.
Level of detail (LOD)
---------------------
As also mentioned before, using objects with fewer vertices can improve
performance in some cases. Godot has a simple system to change level of
detail: :ref:`GeometryInstance <class_GeometryInstance>`-based objects
have a visibility range that can be defined. Placing several
GeometryInstance objects with different ranges works as LOD.
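A minimal sketch of this in GDScript, assuming two MeshInstance nodes
(hypothetical names) and the ``lod_min_distance`` / ``lod_max_distance``
properties of GeometryInstance:

::

    func _ready():
        # Detailed model is visible up close only.
        $TreeHighPoly.lod_min_distance = 0.0
        $TreeHighPoly.lod_max_distance = 30.0
        # Simplified model takes over at a distance.
        $TreeLowPoly.lod_min_distance = 30.0
        $TreeLowPoly.lod_max_distance = 200.0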
Use instancing (MultiMesh)
--------------------------
If several identical objects have to be drawn in the same place or
nearby, try using :ref:`MultiMesh <class_MultiMesh>`
instead. MultiMesh allows the drawing of tens of thousands of objects at
very little performance cost, making it ideal for flocks, grass,
particles, etc.
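A minimal GDScript sketch of MultiMesh usage (the mesh resource path and
node name are hypothetical):

::

    func _ready():
        var mm = MultiMesh.new()
        mm.mesh = preload("res://grass_blade.mesh")
        mm.transform_format = MultiMesh.TRANSFORM_3D
        # Setting instance_count allocates the instances.
        mm.instance_count = 10000
        for i in range(mm.instance_count):
            # Scatter each blade of grass at a random position.
            var t = Transform(Basis(), Vector3(randf() * 100.0, 0.0, randf() * 100.0))
            mm.set_instance_transform(i, t)
        $MultiMeshInstance.multimesh = mm

All 10,000 blades can then be drawn in a single draw call.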
Bake lighting
-------------
Small lights are usually not a performance issue; shadows are a little
more expensive. In general, if several lights need to affect a scene,
it's ideal to bake the lighting (:ref:`doc_baked_lightmaps`). Baking can
also improve the scene quality by adding indirect light bounces.
If working on mobile, baking to texture is recommended, since this
method is even faster.


@@ -0,0 +1,549 @@
.. _doc_batching:
Optimization using batching
===========================
Introduction
~~~~~~~~~~~~
Game engines have to send a set of instructions to the GPU in order to tell the
GPU what and where to draw. These instructions are sent through common
interfaces, called APIs (Application Programming Interfaces). Examples of
graphics APIs are OpenGL, OpenGL ES, and Vulkan.
Different APIs incur different costs when drawing objects. OpenGL handles a lot
of work for the user in the GPU driver at the cost of more expensive draw calls.
As a result, applications can often be sped up by reducing the number of draw
calls.
Draw calls
^^^^^^^^^^
In 2D, we need to tell the GPU to render a series of primitives (rectangles,
lines, polygons, etc.). The most obvious technique is to tell the GPU to render
one primitive at a time, telling it some information such as the texture used,
the material, the position, size, etc., then saying "Draw!" (this is called a
draw call).
It turns out that while this is conceptually simple from the engine side, GPUs
operate very slowly when used in this manner. GPUs work much more efficiently
if, instead of telling them to draw a single primitive, you tell them to draw a
number of similar primitives all in one draw call, which we will call a "batch".
And it turns out that they don't just work a bit faster when used in this
manner, they work a *lot* faster.
As Godot is designed to be a general purpose engine, the primitives coming into
the Godot renderer can be in any order, sometimes similar, and sometimes
dissimilar. In order to match the general purpose nature of Godot with the
batching preferences of GPUs, Godot features an intermediate layer which can
automatically group together primitives wherever possible, and send these
batches on to the GPU. This can give an increase in rendering performance while
requiring few, if any, changes to your Godot project.
How it works
~~~~~~~~~~~~
Instructions come into the renderer from your game in the form of a series of
items, each of which can contain one or more commands. The items correspond to
Nodes in the scene tree, and the commands correspond to primitives such as
rectangles or polygons. Some items, such as tilemaps and text, can contain a
large number of commands (tiles and letters respectively). Others, such as
sprites, may only contain a single command (a rectangle).
The batcher uses two main techniques to group together primitives:
* Consecutive items can be joined together
* Consecutive commands within an item can be joined to form a batch
Breaking batching
^^^^^^^^^^^^^^^^^
Batching can only take place if the items or commands are similar enough to be
rendered in one draw call. Certain changes (or techniques), by necessity, prevent
the formation of a contiguous batch; this is referred to as 'breaking batching'.
Batching will be broken by (amongst other things):
* Change of texture
* Change of material
* Change of primitive type (say going from rectangles to lines)
.. note::
If for example, you draw a series of sprites each with a different texture,
there is no way they can be batched.
Render order
^^^^^^^^^^^^
The question arises, if only similar items can be drawn together in a batch, why
don't we look through all the items in a scene, group together all the similar
items, and draw them together?
In 3D, this is often exactly how engines work. However, in Godot 2D, items are
drawn in 'painter's order', from back to front. This ensures that items at the
front are drawn on top of earlier items, when they overlap.
This also means that if we try to draw objects in order of, for example,
texture, then this painter's order may break and objects will be drawn in the
wrong order.
In Godot this back to front order is determined by:
* The order of objects in the scene tree
* The Z index of objects
* The canvas layer
* Y sort nodes
.. note::
You can group similar objects together for easier batching. While doing so
is not a requirement on your part, think of it as an optional approach that
can improve performance in some cases. See the diagnostics section in order
to help you make this decision.
A trick
^^^^^^^
And now a sleight of hand. Although the idea of painter's order is that objects
are rendered from back to front, consider 3 objects A, B, and C that use 2
different textures: grass and wood.
.. image:: img/overlap1.png
In painter's order they are ordered:
::

    A - wood
    B - grass
    C - wood
Because the texture changes, they cannot be batched, and will be rendered in 3
draw calls.
However, painter's order is only needed on the assumption that they will be
drawn *on top* of each other. If we relax that assumption, i.e. if none of these
3 objects are overlapping, there is *no need* to preserve painter's order. The
rendered result will be the same. What if we could take advantage of this?
Item reordering
^^^^^^^^^^^^^^^
.. image:: img/overlap2.png
It turns out that we can reorder items. However, we can only do this if the
items satisfy the conditions of an overlap test, to ensure that the end result
will be the same as if they were not reordered. The overlap test is very cheap
in performance terms, but not absolutely free, so there is a slight cost to
looking ahead to decide whether items can be reordered. The number of items to
lookahead for reordering can be set in project settings (see below), in order to
balance the costs and benefits in your project.
::

    A - wood
    C - wood
    B - grass
Because the texture only changes once, we can render the above in only 2
draw calls.
Lights
~~~~~~
Although the job for the batching system is normally quite straightforward, it
becomes considerably more complex when 2D lights are used, because lights are
drawn using extra passes, one for each light affecting the primitive. Consider 2
sprites A and B, with identical texture and material. Without lights they would
be batched together and drawn in one draw call. But with 3 lights, they would be
drawn as follows, each line a draw call:
.. image:: img/lights_overlap.png
::

    A
    A - light 1
    A - light 2
    A - light 3
    B
    B - light 1
    B - light 2
    B - light 3
That is a lot of draw calls, 8 for only 2 sprites. Now consider we are drawing
1000 sprites, the number of draw calls quickly becomes astronomical, and
performance suffers. This is partly why lights have the potential to drastically
slow down 2D.
However, if you remember our magician's trick from item reordering, it turns out
we can use the same trick to get around painter's order for lights!
If A and B are not overlapping, we can render them together in a batch, so the
draw process is as follows:
.. image:: img/lights_separate.png
::

    AB
    AB - light 1
    AB - light 2
    AB - light 3
That is 4 draw calls. Not bad: a 50% improvement. However, consider that
in a real game, you might be drawing closer to 1000 sprites.
- Before: 1000 * 4 = 4000 draw calls.
- After: 1 * 4 = 4 draw calls.
That is a 1000x decrease in draw calls, which should give a huge increase in
performance.
Overlap test
^^^^^^^^^^^^
However, as with item reordering, things are not that simple: we must first
perform the overlap test to determine whether we can join these primitives, and
the overlap test has a small cost. So again you can choose the number of
primitives to lookahead in the overlap test to balance the benefits against the
cost. Usually with lights, the benefits far outweigh the costs.
Also consider that depending on the arrangement of primitives in the viewport,
the overlap test will sometimes fail (because the primitives overlap and thus
should not be joined). So in practice the decrease in draw calls may be less
dramatic than the perfect situation of no overlap. However, performance is
usually far higher than without this lighting optimization.
Light Scissoring
~~~~~~~~~~~~~~~~
Batching can make it more difficult to cull out objects that are not affected,
or only partially affected, by a light. This can increase the fill rate
requirements quite a bit and slow rendering. Fill rate is the rate at which
pixels are colored; it is another potential bottleneck unrelated to draw calls.
In order to counter this problem (and also to speed up lighting in general),
batching introduces light scissoring. This enables the use of the OpenGL command
``glScissor()``, which identifies an area outside of which the GPU will not
render any pixels. We can thus greatly optimize fill rate by identifying the
intersection area between a light and a primitive, and limit rendering the light
to *that area only*.
Light scissoring is controlled with the :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
project setting. This value is between 1.0 and 0.0, with 1.0 being off (no
scissoring), and 0.0 being scissoring in every circumstance. The reason for the
setting is that there may be some small cost to scissoring on some hardware.
Generally though, when you are using lighting, it should result in some
performance gains.
The relationship between the threshold and whether a scissor operation takes
place is not altogether straightforward, but generally it represents the pixel
area that is potentially 'saved' by a scissor operation (i.e. the fill rate
saved). At 1.0, the entire screen's pixels would need to be saved, which rarely
if ever happens, so it is switched off. In practice, the useful values are
bunched towards zero, as only a small percentage of pixels need to be saved for
the operation to be useful.
The exact relationship is probably not something users need to worry about, but
it is included in the appendix out of interest.
.. image:: img/scissoring.png
*Bottom right is a light, the red area is the pixels saved by the scissoring
operation. Only the intersection needs to be rendered.*
Vertex baking
~~~~~~~~~~~~~
The GPU shader receives instructions on what to draw in 2 main ways:
* Shader uniforms (e.g. modulate color, item transform)
* Vertex attributes (vertex color, local transform)
However, within a single draw call (batch) we cannot change uniforms. This means
that naively, we would not be able to batch together items or commands that
change ``final_modulate`` or the item transform. Unfortunately, that is an awful
lot of cases. Sprites, for instance, are typically individual nodes with their
own item transform, and they may have their own color modulate as well.
To get around this problem, the batching can "bake" some of the uniforms into
the vertex attributes.
* The item transform can be combined with the local transform and sent in a
vertex attribute.
* The final modulate color can be combined with the vertex colors, and sent in a
vertex attribute.
In most cases this works fine, but this shortcut breaks down if a shader expects
these values to be available individually, rather than combined. This can happen
in custom shaders.
Custom Shaders
^^^^^^^^^^^^^^
As a result certain operations in custom shaders will prevent baking, and thus
decrease the potential for batching. While we are working to decrease these
cases, currently the following conditions apply:
* Reading or writing ``COLOR`` or ``MODULATE`` - disables vertex color baking
* Reading ``VERTEX`` - disables vertex position baking
Project Settings
~~~~~~~~~~~~~~~~
In order to fine-tune batching, a number of project settings are available. You
can usually leave these at default during development, but it is a good idea to
experiment to ensure you are getting maximum performance. Spending a little time
tweaking parameters can often give considerable performance gain, for very
little effort. See the tooltips in the project settings for more info.
rendering/batching/options
^^^^^^^^^^^^^^^^^^^^^^^^^^
* :ref:`use_batching
<class_ProjectSettings_property_rendering/batching/options/use_batching>` -
Turns batching on and off
* :ref:`use_batching_in_editor
<class_ProjectSettings_property_rendering/batching/options/use_batching_in_editor>`
* :ref:`single_rect_fallback
<class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
- This is a faster way of drawing unbatchable rectangles; however, it may lead
to flicker on some hardware, so it is not recommended
rendering/batching/parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* :ref:`max_join_item_commands <class_ProjectSettings_property_rendering/batching/parameters/max_join_item_commands>` -
One of the most important ways of achieving
batching is to join suitable adjacent items (nodes) together, however they can
only be joined if the commands they contain are compatible. The system must
therefore do a lookahead through the commands in an item to determine whether
it can be joined. This has a small cost per command, and items with a large
number of commands are not worth joining, so the best value may be project
dependent.
* :ref:`colored_vertex_format_threshold
<class_ProjectSettings_property_rendering/batching/parameters/colored_vertex_format_threshold>` - Baking colors into
vertices results in a
larger vertex format. This is not necessarily worth doing unless there are a
lot of color changes going on within a joined item. This parameter represents
the proportion of commands containing color changes / the total commands,
above which it switches to baked colors.
* :ref:`batch_buffer_size
<class_ProjectSettings_property_rendering/batching/parameters/batch_buffer_size>`
- This determines the maximum size of a batch, it doesn't have a huge effect
on performance but can be worth decreasing for mobile if RAM is at a premium.
* :ref:`item_reordering_lookahead
<class_ProjectSettings_property_rendering/batching/parameters/item_reordering_lookahead>`
- Item reordering can help especially with
interleaved sprites using different textures. The lookahead for the overlap
test has a small cost, so the best value may change per project.
rendering/batching/lights
^^^^^^^^^^^^^^^^^^^^^^^^^
* :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
- See light scissoring.
* :ref:`max_join_items
<class_ProjectSettings_property_rendering/batching/lights/max_join_items>` -
Joining items before lighting can significantly increase performance.
This requires an overlap test, which has a small cost, so the costs and
benefits are project dependent, as is the best value to use here.
rendering/batching/debug
^^^^^^^^^^^^^^^^^^^^^^^^
* :ref:`flash_batching
<class_ProjectSettings_property_rendering/batching/debug/flash_batching>` -
This is purely a debugging feature to identify regressions between the
batching and legacy renderer. When it is switched on, the batching and legacy
renderer are used alternately on each frame. This will decrease performance,
and should not be used for your final export, only for testing.
* :ref:`diagnose_frame
<class_ProjectSettings_property_rendering/batching/debug/diagnose_frame>` -
This will periodically print a diagnostic batching log to
the Godot IDE / console.
rendering/batching/precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* :ref:`uv_contract
<class_ProjectSettings_property_rendering/batching/precision/uv_contract>` -
On some hardware (notably some Android devices) there have been reports of
tilemap tiles drawing slightly outside their UV range, leading to edge
artifacts such as lines around tiles. If you see this problem, try enabling uv
contract. This makes a small contraction in the UV coordinates to compensate
for precision errors on devices.
* :ref:`uv_contract_amount
<class_ProjectSettings_property_rendering/batching/precision/uv_contract_amount>`
- Hopefully the default amount should cure artifacts on most devices, but just
in case, this value is editable.
Diagnostics
~~~~~~~~~~~
Although you can change parameters and examine the effect on frame rate, this
can feel like working blindly, with no idea of what is going on under the hood.
To help with this, batching offers a diagnostic mode, which will periodically
print out (to the IDE or console) a list of the batches that are being
processed. This can help pinpoint situations where batching is not occurring as
intended, and help you to fix them, in order to get the best possible
performance.
Reading a diagnostic
^^^^^^^^^^^^^^^^^^^^
.. code-block:: cpp
    canvas_begin FRAME 2604
    items
        joined_item 1 refs
            batch D 0-0
            batch D 0-2 n n
            batch R 0-1 [0 - 0] {255 255 255 255 }
        joined_item 1 refs
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
        joined_item 1 refs
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
    canvas_end
This is a typical diagnostic.
* **joined_item** - A joined item can contain one or
more references to items (nodes). Generally, joined_items containing many
references are preferable to many joined_items each containing a single
reference. Whether items can be joined is determined by their contents and
compatibility with the previous item.
* **batch R** - a batch containing rectangles. The second number is the number
of rects. The number in square brackets is the Godot texture ID, and the
numbers in curly braces are the color. If the batch contains more than one rect,
MULTI is added to the line to make it easy to identify. Seeing MULTI is good,
because this indicates successful batching.
* **batch D** - a default batch, containing everything else that is not currently
batched.
Default Batches
^^^^^^^^^^^^^^^
The second number following default batches is the number of commands in the
batch, and it is followed by a brief summary of the contents:
::

    l - line
    PL - polyline
    r - rect
    n - ninepatch
    PR - primitive
    p - polygon
    m - mesh
    MM - multimesh
    PA - particles
    c - circle
    t - transform
    CI - clip_ignore
You may see "dummy" default batches containing no commands; you can ignore
these.
FAQ
~~~
I don't get a large performance increase from switching on batching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Try the diagnostics, see how much batching is occurring, and whether it can be
improved
* Try changing parameters
* Consider that batching may not be your bottleneck (see bottlenecks)
I get a decrease in performance with batching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Try steps to increase batching given above
* Try switching :ref:`single_rect_fallback
<class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
to on
* The single rect fallback method is the default used without batching, and it
is approximately twice as fast; however, it can result in flicker on some
hardware, so its use is discouraged
* After trying the above, if your scene is still performing worse, consider
turning off batching.
I use custom shaders and the items are not batching
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Custom shaders can be problematic for batching, see the custom shaders section
I am seeing line artifacts appear on certain hardware
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* See the :ref:`uv_contract
<class_ProjectSettings_property_rendering/batching/precision/uv_contract>`
project setting which can be used to solve this problem.
I use a large number of textures, so few items are being batched
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Consider the use of texture atlases. As well as allowing batching, these
reduce the need for state changes associated with changing texture.
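A minimal sketch in GDScript, using Godot's ``AtlasTexture`` to carve
sub-textures out of one shared atlas image (the file path and region are
hypothetical):

::

    func _ready():
        var sub = AtlasTexture.new()
        sub.atlas = preload("res://sprites_atlas.png")
        # The region selects one sprite within the shared atlas.
        sub.region = Rect2(0, 0, 64, 64)
        $Sprite.texture = sub

Because every AtlasTexture built this way refers to the same underlying
GPU texture, sprites using them can potentially be batched together.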
Appendix
~~~~~~~~
Light scissoring threshold calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The actual proportion of screen pixel area used as the threshold is the
:ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
value to the power of 4.
For example, on a screen size ``1920 x 1080`` there are ``2,073,600`` pixels.
At a threshold of ``1000`` pixels, the proportion would be:
::

    1000 / 2073600 = 0.00048225
    0.00048225 ^ 0.25 = 0.14819
.. note:: The power of 0.25 is the inverse of the power of 4.
So a :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
of 0.15 would be a reasonable value to try.
Going the other way, for instance with a :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
of ``0.5``:
::

    0.5 ^ 4 = 0.0625
    0.0625 * 2073600 = 129600 pixels
If the number of pixels saved is more than this threshold, the scissor is
activated.
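This conversion can be sketched as a small GDScript helper (the function
name is hypothetical):

::

    # Convert a desired pixel saving into a scissor_area_threshold value.
    func threshold_for_pixels(pixels_saved, screen_width, screen_height):
        var proportion = float(pixels_saved) / (screen_width * screen_height)
        return pow(proportion, 0.25)

For a ``1920 x 1080`` screen, ``threshold_for_pixels(1000, 1920, 1080)``
returns roughly ``0.148``, matching the worked example above.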


@@ -0,0 +1,258 @@
.. _doc_cpu_optimization:
CPU Optimizations
=================
Measuring performance
=====================
To know how to speed up our program, we have to know where the "bottlenecks"
are. Bottlenecks are the slowest parts of the program that limit the rate at
which everything can progress. This allows us to concentrate our efforts on
optimizing the areas which will give us the greatest speed improvement, instead
of spending a lot of time optimizing functions that will lead to only small
performance improvements.
For the CPU, the easiest way to identify bottlenecks is to use a profiler.
CPU profilers
=============
Profilers run alongside your program and take timing measurements to work out
what proportion of time is spent in each function.
The Godot IDE conveniently has a built-in profiler. It does not run every time
you start your project, and must be manually started and stopped. This is
because, in common with most profilers, recording these timing measurements can
slow down your project significantly.
After profiling, you can look back at the results for a frame.
.. image:: img/godot_profiler.png
`These are the results of a profile of one of the demo projects.`
.. note:: We can see the cost of built-in processes such as physics and audio,
as well as seeing the cost of our own scripting functions at the
bottom.
When a project is running slowly, you will often see an obvious function or
process taking a lot more time than others. This is your primary bottleneck, and
you can usually increase speed by optimizing this area.
For more info about using the profiler within Godot see
:ref:`doc_debugger_panel`.
External profilers
~~~~~~~~~~~~~~~~~~
Although the Godot IDE profiler is very convenient and useful, sometimes you
need more power, or the ability to profile the Godot engine source code itself.
You can use a number of third-party profilers to do this, including Valgrind,
VerySleepy, Visual Studio, and Intel VTune.
.. note:: You may need to compile Godot from source in order to use a third
party profiler so that you have program database information
available. You can also use a debug build; however, note that the
results of profiling a debug build will differ from those of a release
build, because debug builds are less optimized. Bottlenecks are often
in a different place in debug builds, so you should profile release
builds wherever possible.
.. image:: img/valgrind.png
`These are example results from Callgrind, part of Valgrind, on Linux.`
From the left, Callgrind is listing the percentage of time within a function and
its children (Inclusive), the percentage of time spent within the function
itself, excluding child functions (Self), the number of times the function is
called, the function name, and the file or module.
In this example, we can see nearly all the time is spent under the
``Main::iteration()`` function. This is the master function in the Godot source
code that is called repeatedly, and that causes frames to be drawn, physics
ticks to be simulated, and nodes and scripts to be updated. A large proportion
of the time is spent in the functions to render a canvas (66%), because this
example uses a 2D benchmark. Below this, we see that almost 50% of the time is
spent outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics
driver). This tells us that a large proportion of CPU time is being spent in
the graphics driver.
This is actually an excellent example because, in an ideal world, only a very
small proportion of time would be spent in the graphics driver. It is an
indication that there was a problem with too much communication and work being
done in the graphics API. This profiling led to the development of 2d batching,
which greatly speeds up 2d by reducing bottlenecks in this area.
Manually timing functions
=========================
Another handy technique, especially once you have identified the bottleneck
using a profiler, is to manually time the function or area under test. The
specifics vary according to language, but in GDScript, you would do the
following:
::

    var time_start = OS.get_system_time_msecs()

    # Your function you want to time
    update_enemies()

    var time_end = OS.get_system_time_msecs()
    print("Function took: " + str(time_end - time_start))
If another time unit is more suitable, you can use a different function, for
example :ref:`OS.get_system_time_secs <class_OS_method_get_system_time_secs>`
if the function under test will take many seconds.
When manually timing functions, it is usually a good idea to run the function
many times (say ``1000`` or more times), instead of just once (unless it is a
very slow function). A large part of the reason for this is that timers often
have limited accuracy, and CPUs will schedule processes in a haphazard manner,
so an average over a series of runs is more accurate than a single measurement.
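For example, a rough sketch in GDScript of averaging over many runs (here,
`update_enemies()` stands in for whatever function you want to test):

::

    func time_average():
        var runs = 1000
        var time_start = OS.get_system_time_msecs()
        for i in range(runs):
            update_enemies()
        var time_end = OS.get_system_time_msecs()
        # Average cost of a single call, in milliseconds.
        print("Average: " + str((time_end - time_start) / float(runs)) + " ms")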
As you attempt to optimize functions, be sure to either repeatedly profile or
time them as you go. This will give you crucial feedback as to whether the
optimization is working (or not).
Caches
======
Something else to be particularly aware of, especially when comparing timing
results of two different versions of a function, is that the results can be
highly dependent on whether the data is in the CPU cache or not. CPUs don't load
data directly from main memory, because although main memory can be huge (many
gigabytes), it is very slow to access. Instead, CPUs load data from a smaller,
higher speed bank of memory called cache. Loading data from cache is super
fast, but every time you try to load a memory address that is not stored in
cache, the cache must make a trip to main memory and slowly load in some data.
This delay can result in the CPU sitting around idle for a long time, and is
referred to as a "cache miss".
This means that the first time you run a function, it may run slowly, because
the data is not in cache. The second and later times, it may run much faster
because the data is in cache. So always use averages when timing, and be aware
of the effects of cache.
Understanding caching is also crucial to CPU optimization. If you have an
algorithm (routine) that loads small bits of data from randomly spread out
areas of main memory, this can result in a lot of cache misses, and much of the
time the CPU will be waiting around for data instead of doing any work.
Instead, if you can make your data accesses localised, or even better, access
memory in a linear fashion (like a continuous list), then the cache will work
optimally and the CPU will be able to work as fast as possible.
Godot usually takes care of such low-level details for you. For example, the
Server APIs make sure data is optimized for caching already for things like
rendering and physics. But you should be especially aware of caching when using
GDNative.
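As a rough illustration in GDScript: pool arrays (such as ``PoolRealArray``)
store their elements in a single, tightly packed block of memory, whereas a
generic ``Array`` stores boxed Variants. Iterating linearly over a pool array
is therefore friendlier to the CPU cache. The function names here are made up
for the sketch:

::

    var heights = PoolRealArray()

    func _ready():
        # Contiguous storage: 1024 floats packed together in memory.
        heights.resize(1024)

    func sum_heights():
        # Linear access pattern - the cache can prefetch effectively.
        var total = 0.0
        for h in heights:
            total += h
        return total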
Languages
=========
Godot supports a number of different languages, and it is worth bearing in mind
that there are trade-offs involved - some languages are designed for ease of
use, at the cost of speed, and others are faster but more difficult to work
with.
Built-in engine functions run at the same speed regardless of the scripting
language you choose. If your project is making a lot of calculations in its own
code, consider moving those calculations to a faster language.
GDScript
~~~~~~~~
GDScript is designed to be easy to use and iterate, and is ideal for making many
types of games. However, ease of use is considered more important than
performance, so if you need to make heavy calculations, consider moving some of
your project to one of the other languages.
C#
~~
C# is popular and has first class support in Godot. It offers a good compromise
between speed and ease of use.
Other languages
~~~~~~~~~~~~~~~
Third parties provide support for several other languages, including `Rust
<https://github.com/godot-rust/godot-rust>`_ and `JavaScript
<https://github.com/GodotExplorer/ECMAScript>`_.
C++
~~~
Godot is written in C++. Using C++ will usually result in the fastest code.
However, on a practical level, it is the most difficult to deploy to end users'
machines on different platforms. Options for using C++ include GDNative and
custom modules.
Threads
=======
Consider using threads when making a lot of calculations that can run parallel
to one another. Modern CPUs have multiple cores, each one capable of doing a
limited amount of work. By spreading work over multiple threads you can move
further towards peak CPU efficiency.
The disadvantage of threads is that you have to be incredibly careful. As each
CPU core operates independently, they can end up trying to access the same
memory at the same time. One thread can be reading from a variable while
another is writing to it. Before you use threads, make sure you understand the
dangers and how to try and prevent these race conditions.
For more information on threads see :ref:`doc_using_multiple_threads`.
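A minimal sketch (not a full recipe) of a worker thread in GDScript, using a
``Mutex`` to guard shared data:

::

    var thread = Thread.new()
    var mutex = Mutex.new()
    var results = []

    func start_work():
        thread.start(self, "_heavy_calculation")

    func _heavy_calculation(_userdata):
        var local_result = 0
        for i in range(1000000):
            local_result += i
        # Guard the shared array so two threads never write at once.
        mutex.lock()
        results.append(local_result)
        mutex.unlock()

    func _exit_tree():
        # Always join the thread before the node is freed.
        if thread.is_active():
            thread.wait_to_finish()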
SceneTree
=========
Although Nodes are an incredibly powerful and versatile concept, be aware that
every node has a cost. Built-in functions such as `_process()` and
`_physics_process()` propagate through the tree. This housekeeping can reduce
performance when you have very large numbers of nodes.
Each node is handled individually in the Godot renderer, so sometimes a smaller
number of nodes, with more in each, can lead to better performance.
One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
get much better performance by removing nodes from the SceneTree, rather than
by pausing or hiding them. You don't have to delete a detached node. You can,
for example, keep a reference to a node, detach it from the scene tree, then
reattach it later. This can be very useful for adding and removing areas from
a game.
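A sketch of this detach / reattach pattern in GDScript (the variable and
function names are made up for illustration):

::

    var _saved_area = null

    func detach_area(area):
        # Keep a reference so the node is not lost when detached.
        # A node removed from the tree is no longer processed or drawn,
        # but it is NOT freed.
        _saved_area = area
        remove_child(area)

    func reattach_area():
        if _saved_area:
            add_child(_saved_area)
            _saved_area = null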
You can avoid the SceneTree altogether by using Server APIs. For more
information, see :ref:`doc_using_servers`.
Physics
=======
In some situations physics can end up becoming a bottleneck, particularly with
complex worlds, and large numbers of physics objects.
Some techniques to speed up physics:
* Try using simplified versions of your rendered geometry for physics. Often
  this won't be noticeable for end users, but can greatly increase performance.
* Try removing objects from physics when they are out of view / outside the
  current area, or reusing physics objects (maybe you allow 8 monsters per
  area, for example, and reuse these).
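As a sketch of the first technique, you can give a physics body a cheap
primitive shape in place of detailed triangle mesh collision (the sizes and
names here are hypothetical):

::

    func use_simple_collision(body):
        # A box roughly enclosing the model is far cheaper to test
        # against than a concave triangle mesh.
        var shape = BoxShape.new()
        shape.extents = Vector3(1, 2, 1)

        var collision_shape = CollisionShape.new()
        collision_shape.shape = shape
        body.add_child(collision_shape)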
Another crucial aspect to physics is the physics tick rate. In some games you
can greatly reduce the tick rate - instead of updating physics 60 times per
second, you may update it only 20, or even 10 times per second. This can
greatly reduce the CPU load.
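The physics tick rate can be set in the project settings
(``physics/common/physics_fps``), or changed at runtime:

::

    func _ready():
        # Halve the default tick rate of 60 to reduce physics CPU load.
        Engine.iterations_per_second = 30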
The downside of changing the physics tick rate is that you can get jerky
movement or jitter when the physics update rate does not match the rendered
frame rate.
The solution to this problem is 'fixed timestep interpolation', which involves
smoothing the rendered positions and rotations over multiple frames to match the
physics. You can either implement this yourself or use a third-party addon.
Performance-wise, interpolation is a very cheap operation compared to running a
physics tick (orders of magnitude faster), so this can be a significant win, as
well as reducing jitter.


@@ -0,0 +1,291 @@
.. _doc_general_optimization:
General optimization tips
=========================
Introduction
~~~~~~~~~~~~
In an ideal world, computers would run at infinite speed, and the only limit to
what we could achieve would be our imagination. In the real world, however, it
is all too easy to produce software that will bring even the fastest computer to
its knees.
Designing games and other software is thus a compromise between what we would
like to be possible, and what we can realistically achieve while maintaining
good performance.
To achieve the best results, we have two approaches:
* Work faster
* Work smarter
And preferably, we will use a blend of the two.
Smoke and Mirrors
^^^^^^^^^^^^^^^^^
Part of working smarter is recognizing that, especially in games, we can often
get the player to believe they are in a world that is far more complex,
interactive, and graphically exciting than it really is. A good programmer is a
magician, and should strive to learn the tricks of the trade, and try to invent
new ones.
The nature of slowness
^^^^^^^^^^^^^^^^^^^^^^
To the outside observer, performance problems are often lumped together. But in
reality, there are several different kinds of performance problem:
* A slow process that occurs every frame, leading to a continuously low frame
  rate
* An intermittent process that causes 'spikes' of slowness, leading to stalls
* A slow process that occurs outside of normal gameplay, for instance, on
  level load

Each of these is annoying to the user, but in different ways.
Measuring Performance
=====================
Probably the most important tool for optimization is the ability to measure
performance - to identify where bottlenecks are, and to measure the success of
our attempts to speed them up.
There are several methods of measuring performance, including:

* Putting a start / stop timer around code of interest
* Using the Godot profiler
* Using external third-party profilers
* Using GPU profilers / debuggers
* Checking the frame rate (with vsync disabled)
Be very aware that the relative performance of different areas can vary on
different hardware. Often it is a good idea to make timings on more than one
device, especially including mobile as well as desktop, if you are targeting
mobile.
Limitations
~~~~~~~~~~~
CPU profilers are often the 'go to' method for measuring performance. However,
they don't always tell the whole story.

- Bottlenecks are often on the GPU, `as a result` of instructions given by the
  CPU
- Spikes can occur in the operating system processes (outside of Godot) `as a
  result` of instructions used in Godot (for example dynamic memory allocation)
- You may not be able to profile e.g. a mobile phone
- You may have to solve performance problems that occur on hardware you don't
  have access to
As a result of these limitations, you often need to use detective work to find
out where bottlenecks are.
Detective work
~~~~~~~~~~~~~~
Detective work is a crucial skill for developers (both in terms of performance,
and also in terms of bug fixing). This can include hypothesis testing, and
binary search.
Hypothesis testing
^^^^^^^^^^^^^^^^^^
Say, for example, you believe that sprites are slowing down your game. You can
test this hypothesis by:

* Measuring the performance when you add more sprites, or take some away.

This may lead to a further hypothesis: does the size of the sprite determine
the performance drop?

* You can test this by keeping everything the same, but changing the sprite
  size, and measuring performance.
Binary search
^^^^^^^^^^^^^
Say you know that frames are taking much longer than they should, but you are
not sure where the bottleneck lies. You could begin by commenting out
approximately half the routines that occur on a normal frame. Has the
performance improved more or less than expected?
Once you know which of the two halves contains the bottleneck, you can then
repeat this process, until you have pinned down the problematic area.
Profilers
=========
Profilers allow you to time your program while running it. Profilers then
provide results telling you what percentage of time was spent in different
functions and areas, and how often functions were called.
This can be very useful both to identify bottlenecks and to measure the results
of your improvements. Sometimes attempts to improve performance can backfire and
lead to slower performance, so always use profiling and timing to guide your
efforts.
For more info about using the profiler within Godot see
:ref:`doc_debugger_panel`.
Principles
==========
Donald Knuth:

    *Programmers waste enormous amounts of time thinking about, or worrying
    about, the speed of noncritical parts of their programs, and these attempts
    at efficiency actually have a strong negative impact when debugging and
    maintenance are considered. We should forget about small efficiencies, say
    about 97% of the time: premature optimization is the root of all evil. Yet
    we should not pass up our opportunities in that critical 3%.*
The messages here are very important:

* Programmer / developer time is limited. Instead of blindly trying to speed
  up all aspects of a program, we should concentrate our efforts on the
  aspects that really matter.
* Efforts at optimization often end up with code that is harder to read and
  debug than non-optimized code. It is in our interests to limit this to areas
  that will really benefit.
Just because we `can` optimize a particular bit of code, it doesn't necessarily
mean that we should. Knowing when, and when not to optimize is a great skill to
develop.
One misleading aspect of the quote is that people tend to focus on the subquote
"premature optimization is the root of all evil". While `premature` optimization
is (by definition) undesirable, performant software is the result of performant
design.
Performant design
~~~~~~~~~~~~~~~~~
The danger with encouraging people to ignore optimization until necessary is
that it conveniently ignores that the most important time to consider
performance is at the design stage, before a key has even hit a keyboard. If
the design or algorithms of a program are inefficient, then no amount of
polishing the details later will make it run fast. It may run `faster`, but it
will never run as fast as a program designed for performance.
This tends to be far more important in game / graphics programming than in
general programming. A performant design, even without low level optimization,
will often run many times faster than a mediocre design with low level
optimization.
Incremental design
~~~~~~~~~~~~~~~~~~
Of course, in practice, unless you have prior knowledge, you are unlikely to
come up with the best design first time. So you will often make a series of
versions of a particular area of code, each taking a different approach to the
problem, until you come to a satisfactory solution. It is important not to spend
too much time on the details at this stage until you have finalized the overall
design, otherwise much of your work will be thrown out.
It is difficult to give general guidelines for performant design because this
is so dependent on the problem. One point worth mentioning though, on the CPU
side, is that modern CPUs are nearly always limited by memory bandwidth. This
has led to a resurgence in data-oriented design, which involves designing data
structures and algorithms for locality of data and linear access, rather than
jumping around in memory.
The optimization process
~~~~~~~~~~~~~~~~~~~~~~~~
Assuming we have a reasonable design, and taking our lessons from Knuth, our
first step in optimization should be to identify the biggest bottlenecks - the
slowest functions, the low hanging fruit.
Once we have successfully improved the speed of the slowest area, it may no
longer be the bottleneck. So we should test / profile again, and find the next
bottleneck on which to focus.
The process is thus:
1. Profile / Identify bottleneck
2. Optimize bottleneck
3. Return to step 1
Optimizing bottlenecks
~~~~~~~~~~~~~~~~~~~~~~
Some profilers will even tell you which parts of a function (which data
accesses or calculations) are slowing things down.
As with design, you should concentrate your efforts first on making sure the
algorithms and data structures are the best they can be. Data access should be
local (to make best use of CPU cache), and it can often be better to use
compact storage of data (again, always profile to test results). Often, you can
precalculate heavy computations ahead of time (e.g. at level load, or by
loading precalculated data files).
Once algorithms and data are good, you can often make small changes in routines
which improve performance, things like moving calculations outside of loops.
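For example, hoisting a constant calculation out of a loop (the function and
variable names here are invented for illustration):

::

    # Slower: "scale / 100.0" is recalculated on every iteration.
    func rescale_slow(values, scale):
        for i in range(values.size()):
            values[i] *= scale / 100.0

    # Faster: the constant factor is calculated once, outside the loop.
    func rescale_fast(values, scale):
        var factor = scale / 100.0
        for i in range(values.size()):
            values[i] *= factor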
Always retest your timing / bottlenecks after making each change. Some changes
will increase speed, others may have a negative effect. Sometimes a small
positive effect will be outweighed by the negatives of more complex code, and
you may choose to leave out that optimization.
Appendix
========
Bottleneck math
~~~~~~~~~~~~~~~
The proverb "a chain is only as strong as its weakest link" applies directly to
performance optimization. If your project is spending 90% of the time in
function 'A', then optimizing A can have a massive effect on performance.
.. code-block:: none

    A: 9 ms
    Everything else: 1 ms
    Total frame time: 10 ms

.. code-block:: none

    A: 1 ms
    Everything else: 1 ms
    Total frame time: 2 ms

So in this example, improving this bottleneck A by a factor of 9x decreases
overall frame time by 5x, and increases frames per second by 5x.
If however, something else is running slowly and also bottlenecking your
project, then the same improvement can lead to less dramatic gains:
.. code-block:: none

    A: 9 ms
    Everything else: 50 ms
    Total frame time: 59 ms

.. code-block:: none

    A: 1 ms
    Everything else: 50 ms
    Total frame time: 51 ms
So in this example, even though we have hugely optimized functionality A, the
actual gain in terms of frame rate is quite small.
In games, things become even more complicated because the CPU and GPU run
independently of one another. Your total frame time is determined by the slower
of the two.
.. code-block:: none

    CPU: 9 ms
    GPU: 50 ms
    Total frame time: 50 ms

.. code-block:: none

    CPU: 1 ms
    GPU: 50 ms
    Total frame time: 50 ms
In this example, we optimized the CPU hugely again, but the frame time did not
improve, because we are GPU-bottlenecked.


@@ -0,0 +1,263 @@
.. _doc_gpu_optimization:
GPU Optimizations
=================
Introduction
~~~~~~~~~~~~
The demand for new graphics features and progress almost guarantees that you
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
instance in calculations inside the Godot engine to prepare objects for
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
sorts instructions to pass to the GPU, and in the transfer of these
instructions. Finally, bottlenecks can also occur on the GPU itself.
Where bottlenecks occur in rendering is highly hardware specific. Mobile GPUs in
particular may struggle with scenes that run easily on desktop.
Understanding and investigating GPU bottlenecks is slightly different to the
situation on the CPU, because often you can only change performance indirectly,
by changing the instructions you give to the GPU, and it may be more difficult
to take measurements. Often the only way of measuring performance is by
examining changes in frame rate.
Drawcalls, state changes, and APIs
==================================
.. note:: The following section is not relevant to end-users, but is useful to
provide background information that is relevant in later sections.
Godot sends instructions to the GPU via a graphics API (OpenGL, GLES2, GLES3,
Vulkan). The communication and driver activity involved can be quite costly,
especially in OpenGL. If we can provide these instructions in a way that is
preferred by the driver and GPU, we can greatly increase performance.
Nearly every API command in OpenGL requires a certain amount of validation to
make sure the GPU is in the correct state. Even seemingly simple commands can
lead to a flurry of behind-the-scenes housekeeping. Therefore, the name of the
game is to reduce these instructions to a bare minimum and to group similar
objects together as much as possible, so they can be rendered together, or with
the minimum number of these expensive state changes.
2D batching
~~~~~~~~~~~
In 2d, the costs of treating each item individually can be prohibitively high -
there can easily be thousands of items on screen. This is why 2d batching is
used: multiple similar items are grouped together and rendered in a batch, via
a single drawcall, rather than making a separate drawcall for each item. In
addition, this means state changes, material changes and texture changes can be
kept to a minimum.
For more information on 2D batching see :ref:`doc_batching`.
3D batching
~~~~~~~~~~~
In 3d, we still aim to minimize draw calls and state changes. However, it can
be more difficult to batch together several objects into a single draw call. 3d
meshes tend to comprise hundreds or thousands of triangles, and combining large
meshes at runtime is prohibitively expensive. The cost of joining them quickly
exceeds any benefit as the number of triangles per mesh grows. A much better
alternative is to join meshes ahead of time (meshes that are static in relation
to each other). This can either be done by artists, or programmatically within
Godot.
There is also a cost to batching together objects in 3d. Several objects
rendered as one cannot be individually culled. An entire city that is off
screen will still be rendered if it is joined to a single blade of grass that
is on screen. So attempting to batch together 3d objects should take into
account their location and the effect on culling. Despite this, the benefits of
joining static objects often outweigh other considerations, especially for
large numbers of low poly objects.
For more information on 3D specific optimizations, see
:ref:`doc_optimizing_3d_performance`.
Reuse Shaders and Materials
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Godot renderer is a little different to what is out there. It's designed to
minimize GPU state changes as much as possible. :ref:`SpatialMaterial
<class_SpatialMaterial>` does a good job at reusing materials that need similar
shaders but, if custom shaders are used, make sure to reuse them as much as
possible. Godot's priorities are:
- **Reusing Materials**: The fewer different materials in the scene, the
  faster the rendering will be. If a scene has a huge amount of objects (in
  the hundreds or thousands), try reusing the materials, or in the worst case,
  use atlases.
- **Reusing Shaders**: If materials can't be reused, at least try to reuse
  shaders (or SpatialMaterials with different parameters but the same
  configuration).
If a scene has, for example, ``20,000`` objects with ``20,000`` different
materials each, rendering will be slow. If the same scene has ``20,000``
objects, but only uses ``100`` materials, rendering will be much faster.
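A minimal sketch of material reuse in GDScript - one shared
:ref:`SpatialMaterial <class_SpatialMaterial>` assigned to many MeshInstances
(the setup is hypothetical):

::

    func share_material(mesh_instances):
        # One material instance shared by every mesh means the renderer
        # can group them and change GPU state only once.
        var shared = SpatialMaterial.new()
        shared.albedo_color = Color(0.8, 0.2, 0.2)

        for mi in mesh_instances:
            mi.material_override = shared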
Pixel cost vs vertex cost
=========================
You may have heard that the lower the number of polygons in a model, the faster
it will be rendered. This is *really* relative and depends on many factors.
On a modern PC and console, vertex cost is low. GPUs originally only rendered
triangles, so every frame:

1. All vertices had to be transformed by the CPU (including clipping).
2. All vertices had to be sent to the GPU memory from the main RAM.
Now, all this is handled inside the GPU, so performance is much higher. 3D
artists usually get a misleading impression of polycount performance because 3D
DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory in order
for it to be edited, which reduces actual performance. Game engines rely on the
GPU more, so they can render many triangles much more efficiently.
On mobile devices, the story is different. PC and Console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power efficient.
To be more efficient, mobile GPUs attempt to avoid *overdraw*: the same pixel
on the screen being rendered more than once. Imagine a town with several
buildings. GPUs don't know what is visible and what is hidden until they draw
it. For example, a house might be drawn, and then another house in front of it
(rendering happened twice for the same pixel!). PC GPUs normally don't care
much about this and just add more pixel processors to the hardware to increase
performance (which also increases power consumption).
Using more power is not an option on mobile, so mobile devices use a technique
called "Tile Based Rendering" which divides the screen into a grid. Each cell
keeps the list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power consumption,
but takes a toll on vertex performance. As a result, fewer vertices and
triangles can be processed for drawing.
Additionally, Tile Based Rendering struggles when there are small objects with a
lot of geometry within a small portion of the screen. This forces mobile GPUs to
put a lot of strain on a single screen tile which considerably decreases
performance as all the other cells must wait for it to complete in order to
display the frame.
In summary, do not worry about vertex count on mobile, but avoid concentration
of vertices in small parts of the screen. If a character, NPC, vehicle, etc. is
far away (so it looks tiny), use a smaller level of detail (LOD) model.
Pay attention to the additional vertex processing required when using:
- Skinning (skeletal animation)
- Morphs (shape keys)
- Vertex-lit objects (common on mobile)
Pixel / fragment shaders - fill rate
====================================
In contrast to vertex processing, the cost of fragment shading has increased
dramatically over the years. Screen resolutions have increased (the area of a
4K screen is ``8,294,400`` pixels, versus ``307,200`` for an old ``640x480``
VGA screen - that is 27x the area), but the complexity of fragment shaders has
also exploded. Physically based rendering requires complex calculations for
each fragment.
You can test whether a project is fill rate limited quite easily. Turn off
vsync to prevent capping the frames per second, then compare the frames per
second when running with a large window to the frames per second with a postage
stamp sized window (you may also benefit from similarly reducing your shadow
map size if using shadows). Usually, you will find the fps increases quite a
bit when using a small window, which indicates you are to some extent fill rate
limited. If, on the other hand, there is little to no increase in fps, then
your bottleneck lies elsewhere.
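A quick sketch of this test in GDScript (the window size is an arbitrary
example):

::

    func _ready():
        # Turn off vsync so the frame rate is not capped at the
        # monitor refresh rate.
        OS.vsync_enabled = false

        # Shrink the window; if fps rises sharply, the project is
        # likely fill rate limited.
        OS.window_size = Vector2(320, 180)

    func _process(_delta):
        print(Engine.get_frames_per_second())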
You can increase performance in a fill rate limited project by reducing the
amount of work the GPU has to do. You can do this by simplifying the shader
(perhaps turn off expensive options if you are using a :ref:`SpatialMaterial
<class_SpatialMaterial>`), or reducing the number and size of textures used.
Consider shipping simpler shaders for mobile.
Reading textures
~~~~~~~~~~~~~~~~
The other factor in fragment shaders is the cost of reading textures. Reading
textures is an expensive operation (especially reading from several in a single
fragment shader), and filtering may add further expense (trilinear filtering
between mipmaps, and averaging). Reading textures is also expensive in power
terms, which is a big issue on mobile.
Texture compression
~~~~~~~~~~~~~~~~~~~
Godot compresses textures of 3D models when imported (VRAM compression) by
default. Video RAM compression is not as efficient in size as PNG or JPG when
stored, but increases performance enormously when drawing.
This is because the main goal of texture compression is bandwidth reduction
between memory and the GPU.
In 3D, the shapes of objects depend more on the geometry than the texture, so
compression is generally not noticeable. In 2D, compression depends more on
shapes inside the textures, so the artifacts resulting from 2D compression are
more noticeable.
As a warning, most Android devices do not support texture compression of
textures with transparency (only opaque), so keep this in mind.
Post processing / shadows
~~~~~~~~~~~~~~~~~~~~~~~~~
Post processing effects and shadows can also be expensive in terms of fragment
shading activity. Always test the impact of these on different hardware.
Reducing the size of shadow maps can increase performance, both in terms of
writing, and reading the maps.
Transparency / blending
=======================
Transparent items present particular problems for rendering efficiency. Opaque
items (especially in 3d) can be essentially rendered in any order, and the
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
blended objects are different - in most cases, they cannot rely on the Z-buffer
and must be rendered in "painter's order" (i.e. from back to front) to look
correct.
Transparent items are also particularly bad for fill rate, because every item
has to be drawn, even if other transparent items will be drawn on top of it
later. Opaque items can avoid this. They can usually take advantage of the
Z-buffer by writing to the Z-buffer only first, then only running the fragment
shader on the 'winning' fragment - the item that is at the front at a
particular pixel.
Transparency is particularly expensive where multiple transparent items overlap.
It is usually better to use as small a transparent area as possible in order to
minimize these fill rate requirements, especially on mobile, where fill rate is
very expensive. Indeed, in many situations, rendering more complex opaque
geometry can end up being faster than using transparency to "cheat".
Multi-Platform Advice
=====================
If you are aiming to release on multiple platforms, test `early` and test
`often` on all your platforms, especially mobile. Developing a game on desktop
but attempting to port to mobile at the last minute is a recipe for disaster.
In general you should design your game for the lowest common denominator, then
add optional enhancements for more powerful platforms. For example, you may want
to use the GLES2 backend for both desktop and mobile platforms where you target
both.
Mobile / tile renderers
=======================
GPUs on mobile devices work in dramatically different ways from GPUs on desktop.
Most mobile devices use tile renderers. Tile renderers split up the screen into
regular sized tiles that fit into super fast cache memory, and reduce the reads
and writes to main memory.
There are some downsides though, as tile renderers can make certain techniques
much more complicated and expensive to perform. Tiles that rely on the results
of rendering in different tiles, or on the results of earlier operations being
preserved, can be very slow. Be very careful to test the performance of
shaders, viewport textures and post processing.



@@ -1,9 +1,75 @@
Optimization
=============
Introduction
~~~~~~~~~~~~
Godot follows a balanced performance philosophy. In the performance world, there
are always trade-offs, which consist of trading speed for usability and
flexibility. Some practical examples of this are:
- Rendering objects efficiently in high amounts is easy, but when a
large scene must be rendered, it can become inefficient. To solve this,
visibility computation must be added to the rendering, which makes rendering
less efficient, but, at the same time, fewer objects are rendered, so
efficiency overall improves.
- Configuring the properties of every material for every object that
needs to be rendered is also slow. To solve this, objects are sorted by
material to reduce the costs, but at the same time sorting has a cost.
- In 3D physics a similar situation happens. The best algorithms to
handle large amounts of physics objects (such as SAP) are slow at
insertion/removal of objects and ray-casting. Algorithms that allow faster
insertion and removal, as well as ray-casting, will not be able to handle as
many active objects.
And there are many more examples of this! Game engines strive to be general
purpose in nature, so balanced algorithms are always favored over algorithms
that might be fast in some situations and slow in others, or algorithms that
are fast but make usability more difficult.
Godot is not an exception and, while it is designed to have backends swappable
for different algorithms, the default ones prioritize balance and flexibility
over performance.
With this clear, the aim of this tutorial section is to explain how to get the
maximum performance out of Godot. While the tutorials can be read in any order,
it is a good idea to start from :ref:`doc_general_optimization`.
.. toctree::
   :maxdepth: 1
   :caption: Common
   :name: toc-learn-features-general-optimization
general_optimization
using_servers
.. toctree::
:maxdepth: 1
:caption: CPU
:name: toc-learn-features-cpu-optimization
cpu_optimization
.. toctree::
:maxdepth: 1
:caption: GPU
:name: toc-learn-features-gpu-optimization
gpu_optimization
using_multimesh
.. toctree::
:maxdepth: 1
:caption: 2D
:name: toc-learn-features-2d-optimization
batching
.. toctree::
:maxdepth: 1
:caption: 3D
:name: toc-learn-features-3d-optimization
optimizing_3d_performance

View File

@@ -0,0 +1,143 @@
.. meta::
:keywords: optimization
.. _doc_optimizing_3d_performance:
Optimizing 3D performance
=========================
Culling
=======
Godot will automatically perform view frustum culling to avoid rendering
objects that are outside the viewport. This works well for games that take
place in a small area; however, things can quickly become problematic in
larger levels.
Occlusion culling
~~~~~~~~~~~~~~~~~
Walking around a town, for example, you may only be able to see a few buildings
in the street you are in, as well as the sky and a few birds flying overhead.
As far as a naive renderer is concerned, however, you can still see the entire
town. It won't just render the buildings in front of you; it will render the
street behind them, the people on that street, and the buildings behind those.
You quickly end up in situations where you are attempting to render 10x or
100x more than what is visible.
Things aren't quite as bad as they seem, because the Z-buffer usually allows the
GPU to only fully shade the objects that are at the front. However, unneeded
objects are still reducing performance.
One way we can potentially reduce the amount to be rendered is to take
advantage of occlusion. As of version 3.2.2, there is no built-in support for
occlusion in Godot; however, with careful design you can still get many of the
advantages.
For instance, in our city street scenario, you may be able to work out in
advance that you can only see two other streets, ``B`` and ``C``, from street
``A``. Streets ``D`` to ``Z`` are hidden. To take advantage of occlusion, all
you have to do is work out when your viewer is in street ``A`` (perhaps using
Godot Areas), then hide the other streets.
This is a manual version of what is known as a 'potentially visible set' (PVS).
It is a very powerful technique for speeding up rendering. You can also use it
to restrict physics or AI to the local area, speeding those up as well as
rendering.
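The manual PVS described above boils down to a lookup table from the viewer's
current zone to the set of zones that might be visible from it. A minimal,
engine-agnostic Python sketch (the zone names are hypothetical):

```python
# A hand-built "potentially visible set": for each zone the viewer can
# stand in, list every zone that might be visible from it.
POTENTIALLY_VISIBLE = {
    "street_a": {"street_a", "street_b", "street_c"},
    "street_b": {"street_b", "street_a"},
    "street_c": {"street_c", "street_a"},
    "street_d": {"street_d"},
}

def zones_to_hide(current_zone, all_zones):
    """Return the zones that can safely be hidden (and have their
    physics/AI paused) while the viewer is in current_zone."""
    # Unknown zone: be conservative and keep everything visible.
    visible = POTENTIALLY_VISIBLE.get(current_zone, set(all_zones))
    return set(all_zones) - visible
```

In Godot, you would typically trigger this lookup from an Area's
``body_entered`` signal and toggle each zone's ``visible`` property
accordingly.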
Other occlusion techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~
There are other occlusion techniques such as portals, automatic PVS, and
raster-based occlusion culling. Some of these may be available through add-ons,
and some may be available in core Godot in the future.
Transparent objects
~~~~~~~~~~~~~~~~~~~
Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
<class_Shader>` to improve performance. This, however, cannot be done with
transparent objects. Transparent objects are rendered from back to front to
make blending with what is behind them work. As a result, try to use as few
transparent objects as possible. If an object has a small section with
transparency, try to make that section a separate surface with its own
Material.
For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
doc.
Level of detail (LOD)
=====================
In some situations, particularly at a distance, it can be a good idea to
replace complex geometry with simpler versions, as the end user will probably
not be able to see much difference. Consider looking at a large number of
trees in the far distance. There are several strategies for replacing models
at varying distances. You could use lower-poly models, or use transparency to
simulate more complex geometry.
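The selection logic behind LOD is usually just a distance threshold per detail
level. A minimal sketch in plain Python, with made-up thresholds and model
names:

```python
# Detail levels ordered from nearest/most detailed to farthest/cheapest.
# Each entry: (maximum distance at which this level is used, model name).
LOD_LEVELS = [
    (50.0, "tree_high_poly"),
    (200.0, "tree_low_poly"),
    (1000.0, "tree_billboard"),
]

def pick_lod(distance_to_camera):
    """Return the model variant to draw at this distance, or None to cull."""
    for max_distance, model in LOD_LEVELS:
        if distance_to_camera <= max_distance:
            return model
    return None  # Beyond the last level: don't draw the object at all.
```

In practice you would run this check periodically rather than every frame, and
add some hysteresis around each threshold to avoid models popping back and
forth at the boundary.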
Billboards and imposters
~~~~~~~~~~~~~~~~~~~~~~~~
The simplest version of using transparency to deal with LOD is billboards. For
example, you can use a single transparent quad to represent a tree at a
distance. This can be very cheap to render, unless of course there are many
trees in front of each other, in which case transparency may start eating into
fill rate (for more information on fill rate, see :ref:`doc_gpu_optimization`).
An alternative is to render not just one tree, but a number of trees together as
a group. This can be especially effective if you can see an area but cannot
physically approach it in a game.
You can make imposters by pre-rendering views of an object at different angles.
Or you can go one step further and periodically re-render a view of an object
onto a texture to be used as an imposter. At a distance, the viewer needs to
move a considerable distance for the angle of view to change significantly.
This can be complex to get working, but may be worth it depending on the type
of project you are making.
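One common rule for deciding when to re-render an imposter is to store the
view direction used for the last render, and refresh once the current
direction has rotated past an angular threshold. A sketch of that test,
assuming normalized 3-component direction vectors (the 5-degree threshold is
illustrative):

```python
import math

def view_angle_changed(last_dir, current_dir, threshold_deg=5.0):
    """Return True when the normalized view direction has rotated more
    than threshold_deg since the imposter was last rendered."""
    # Angle between two unit vectors via their dot product.
    dot = sum(a * b for a, b in zip(last_dir, current_dir))
    dot = max(-1.0, min(1.0, dot))  # Guard against float error before acos.
    return math.degrees(math.acos(dot)) > threshold_deg
```

Distant objects pass this test rarely, so their imposters are re-rendered only
once in a while, spreading the cost over many frames.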
Use instancing (MultiMesh)
~~~~~~~~~~~~~~~~~~~~~~~~~~
If several identical objects have to be drawn in the same place or nearby, try
using :ref:`MultiMesh <class_MultiMesh>` instead. MultiMesh allows the drawing
of many thousands of objects at very little performance cost, making it ideal
for flocks, grass, particles, and anything else where you have thousands of
identical objects.
Also see the :ref:`Using MultiMesh <doc_using_multimesh>` doc.
Bake lighting
=============
Lighting objects is one of the most costly rendering operations. Real-time
lighting, shadows (especially with multiple lights), and global illumination
are especially expensive. They may simply be too much for lower-power mobile
devices to handle.

Consider using baked lighting, especially for mobile. This can look fantastic,
but has the downside that it will not be dynamic. Sometimes this is a trade-off
worth making.
In general, if several lights need to affect a scene, it's best to use
:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
indirect light bounces.
Animation / Skinning
====================
Animation, and particularly vertex animation such as skinning and morphing,
can be very expensive on some platforms. You may need to lower the polygon
count considerably for animated models, or limit the number of them on screen
at any one time.
Large worlds
============
If you are making large worlds, there are different considerations than what you
may be familiar with from smaller games.
Large worlds may need to be built in tiles that can be loaded on demand as you
move around the world. This can prevent memory use from getting out of hand, and
also limit the processing needed to the local area.
There may be glitches due to floating-point error in large worlds. You may be
able to use techniques such as orienting the world around the player (rather
than the other way around), or shifting the origin periodically to keep things
centered around (0, 0, 0).
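Origin shifting can be sketched as: whenever the player drifts too far from
the origin, subtract the player's position from everything, including the
player. A minimal Python sketch (the threshold and flat position lists are
illustrative, not an engine API):

```python
SHIFT_DISTANCE = 10_000.0  # Re-center once the player is this far out.

def maybe_shift_origin(player_pos, object_positions):
    """Return (new_player_pos, new_object_positions), re-centered on the
    player when it has wandered too far from (0, 0, 0)."""
    if max(abs(c) for c in player_pos) < SHIFT_DISTANCE:
        return player_pos, object_positions  # Still close enough: no shift.
    shift = player_pos
    # Subtract the shift from every object so relative positions are kept,
    # while absolute coordinates stay small and precise.
    shifted = [tuple(o - s for o, s in zip(pos, shift))
               for pos in object_positions]
    return (0.0, 0.0, 0.0), shifted
```

Because only relative positions matter to the renderer and physics, this keeps
all coordinates near the origin, where floating-point precision is highest.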