Merge pull request #3852 from Calinou/improve-optimization-3.2

Proofread and improve the optimization guides
Rémi Verschelde
2020-09-09 15:07:13 +02:00
committed by GitHub
6 changed files with 687 additions and 639 deletions


Optimization using batching
===========================
Introduction
~~~~~~~~~~~~
Game engines have to send a set of instructions to the GPU to tell the GPU what
and where to draw. These instructions are sent using common instructions called
:abbr:`APIs (Application Programming Interfaces)`. Examples of graphics APIs are
OpenGL, OpenGL ES, and Vulkan.
Different APIs incur different costs when drawing objects. OpenGL handles a lot
of work for the user in the GPU driver at the cost of more expensive draw calls.
In the simplest case, the engine sends the GPU
one primitive at a time, telling it some information such as the texture used,
the material, the position, size, etc. then saying "Draw!" (this is called a
draw call).
While this is conceptually simple from the engine side, GPUs operate very slowly
when used in this manner. GPUs work much more efficiently if you tell them to
draw a number of similar primitives all in one draw call, which we will call a
"batch".
It turns out that they don't just work a bit faster when used in this manner;
they work a *lot* faster.
As Godot is designed to be a general-purpose engine, the primitives coming into
the Godot renderer can be in any order, sometimes similar, and sometimes
dissimilar. To match Godot's general-purpose nature with the batching
preferences of GPUs, Godot features an intermediate layer which can
automatically group together primitives wherever possible and send these batches
on to the GPU. This can give an increase in rendering performance while
requiring few (if any) changes to your Godot project.
How it works
~~~~~~~~~~~~
Instructions come into the renderer from your game in the form of a series of
items, each of which can contain one or more commands. The items correspond to
Nodes in the scene tree, and the commands correspond to primitives such as
rectangles or polygons. Some items such as TileMaps and text can contain a
large number of commands (tiles and glyphs respectively). Others, such as
sprites, may only contain a single command (a rectangle).
The batcher uses two main techniques to group together primitives:

- Consecutive items can be joined together.
- Consecutive commands within an item can be joined to form a batch.
Breaking batching
^^^^^^^^^^^^^^^^^
Batching can only take place if the items or commands are similar enough to be
rendered in one draw call. Certain changes (or techniques), by necessity, prevent
the formation of a contiguous batch; this is referred to as "breaking batching".
Batching will be broken by (amongst other things):

- Change of texture.
- Change of material.
- Change of primitive type (say, going from rectangles to lines).
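To make the batch-splitting rule concrete, here is a minimal sketch in Python. It is purely illustrative (the real batcher is C++ inside the renderer and tracks more state than three fields): a stream of draw commands splits into a new batch whenever the texture, material, or primitive type changes.

```python
from itertools import groupby

# Hypothetical simplified model of a draw command; the fields below are
# illustrative, but they are exactly the kinds of changes that break a batch.
def batch_key(cmd):
    return (cmd["texture"], cmd["material"], cmd["primitive"])

def build_batches(commands):
    """Group consecutive compatible commands into batches (one draw call each)."""
    return [list(group) for _, group in groupby(commands, key=batch_key)]

commands = [
    {"texture": "grass", "material": "m1", "primitive": "rect"},
    {"texture": "grass", "material": "m1", "primitive": "rect"},
    {"texture": "wood",  "material": "m1", "primitive": "rect"},  # texture change breaks the batch
    {"texture": "wood",  "material": "m1", "primitive": "line"},  # primitive change breaks it again
]

batches = build_batches(commands)
print(len(batches))  # 3 draw calls instead of 4
```

Note that only *consecutive* compatibility matters here, which is why render order (discussed below) is so important.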
.. note::
    For example, if you draw a series of sprites each with a different texture,
    there is no way they can be batched.
Determining the rendering order
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The question arises: if only similar items can be drawn together in a batch, why
don't we look through all the items in a scene, group together all the similar
items, and draw them together?
In 3D, this is often exactly how engines work. However, in Godot's 2D renderer,
items are drawn in "painter's order", from back to front. This ensures that
items at the front are drawn on top of earlier items when they overlap.
This also means that if we try and draw objects on a per-texture basis, then
this painter's order may break and objects will be drawn in the wrong order.
In Godot, this back-to-front order is determined by:

- The order of objects in the scene tree.
- The Z index of objects.
- The canvas layer.
- :ref:`class_YSort` nodes.
.. note::
    You can group similar objects together for easier batching. While doing so
    is not a requirement on your part, think of it as an optional approach that
    can improve performance in some cases. See the
    :ref:`doc_batching_diagnostics` section to help you make this decision.
A trick
^^^^^^^
And now, a sleight of hand. Even though the idea of painter's order is that
objects are rendered from back to front, consider 3 objects ``A``, ``B`` and
``C``, that contain 2 different textures: grass and wood.
.. image:: img/overlap1.png
In painter's order they are ordered::

    A - wood
    B - grass
    C - wood
Because of the texture changes, they can't be batched and will be rendered in 3
draw calls.
However, painter's order is only needed on the assumption that the objects will
actually overlap when drawn. An overlap test can check this, so reordering is
only performed when it cannot change the visual result. The overlap test itself
has a small cost, so you can choose the number of items to look ahead to
balance the costs and benefits in your project.
::

    A - wood
    C - wood
    B - grass
Since the texture only changes once, we can render the above in only 2 draw
calls.
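The reordering trick can be sketched as follows. This is a hypothetical simplification (the field names and the exact join policy are made up for illustration): an item may be moved earlier to sit next to an item with the same texture, but only if it overlaps none of the items it is moved past.

```python
# Illustrative sketch of item reordering. The real renderer works on
# screen-space bounding rectangles in C++; this toy version uses (x, y, w, h).

def overlaps(a, b):
    """Axis-aligned bounding box overlap test."""
    ax, ay, aw, ah = a["rect"]
    bx, by, bw, bh = b["rect"]
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def try_reorder(items):
    """Move an item up next to a matching texture if it overlaps nothing in between."""
    items = list(items)
    for i in range(2, len(items)):
        for j in range(1, i):
            same_tex = items[j - 1]["texture"] == items[i]["texture"]
            blocked = any(overlaps(items[k], items[i]) for k in range(j, i))
            if same_tex and not blocked:
                items.insert(j, items.pop(i))
                break
    return items

A = {"name": "A", "texture": "wood",  "rect": (0, 0, 10, 10)}
B = {"name": "B", "texture": "grass", "rect": (20, 0, 10, 10)}
C = {"name": "C", "texture": "wood",  "rect": (40, 0, 10, 10)}

print([it["name"] for it in try_reorder([A, B, C])])  # ['A', 'C', 'B']
```

If ``B`` and ``C`` overlapped, the move would be blocked and painter's order would be preserved.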
Lights
~~~~~~
Although the batching system's job is normally quite straightforward, it becomes
considerably more complex when 2D lights are used. This is because lights are
drawn using additional passes, one for each light affecting the primitive.
Consider 2 sprites ``A`` and ``B``, with identical texture and material. Without
lights, they would be batched together and drawn in one draw call. But with 3
lights, they would be drawn as follows, each line being a draw call:
.. image:: img/lights_overlap.png
::

    A
    A - light 1
    A - light 2
    A - light 3
    B
    B - light 1
    B - light 2
    B - light 3
That is a lot of draw calls: 8 for only 2 sprites. Now, consider we are drawing
1,000 sprites. The number of draw calls quickly becomes astronomical and
performance suffers. This is partly why lights have the potential to drastically
slow down 2D rendering.
However, if you remember our magician's trick from item reordering, it turns out
we can use the same trick to get around painter's order for lights!
If ``A`` and ``B`` are not overlapping, we can render them together in a batch,
so the drawing process is as follows:
.. image:: img/lights_separate.png
::

    AB
    AB - light 1
    AB - light 2
    AB - light 3
That is only 4 draw calls. Not bad, as that is a 2× reduction. However, consider
that in a real game, you might be drawing closer to 1,000 sprites.

- **Before:** 1000 × 4 = 4,000 draw calls.
- **After:** 1 × 4 = 4 draw calls.
That is a 1000× decrease in draw calls, and should give a huge increase in
performance.
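The arithmetic above can be written out as a quick sanity check. This is a toy model that assumes one base pass plus one pass per light for each group, and, in the batched case, that every sprite joins into a single batch:

```python
# Back-of-the-envelope draw call counts for N sprites and L lights.
# Assumes identical texture/material everywhere, and (when batched) no overlaps.
def draw_calls(n_sprites, n_lights, batched):
    groups = 1 if batched else n_sprites
    return groups * (1 + n_lights)  # one base pass + one pass per light

print(draw_calls(2, 3, batched=False))     # 8
print(draw_calls(1000, 3, batched=False))  # 4000
print(draw_calls(1000, 3, batched=True))   # 4
```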
Overlap test
^^^^^^^^^^^^
However, as with the item reordering, things are not that simple. We must first
perform the overlap test to determine whether we can join these primitives. This
overlap test has a small cost. Again, you can choose the number of primitives to
look ahead in the overlap test to balance the benefits against the cost. With
lights, the benefits usually far outweigh the costs.
Also consider that depending on the arrangement of primitives in the viewport,
the overlap test will sometimes fail (because the primitives overlap and
therefore shouldn't be joined). In practice, the decrease in draw calls may be
less dramatic than in a perfect situation with no overlapping at all. However,
performance is usually far higher than without this lighting optimization.
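The lookahead-limited overlap test can be sketched like this (illustrative only; the real test works on screen-space bounding rectangles inside the renderer, and the parameter names are made up for the example):

```python
# Primitives are joined for lighting only while none of them overlap, and the
# search stops after `lookahead` candidates to bound the cost of the test.

def rects_overlap(a, b):
    """Axis-aligned (x, y, w, h) rectangle overlap test."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def join_for_lighting(rects, lookahead):
    """Return how many leading rects can be drawn as one lit batch."""
    joined = 1
    for i in range(1, min(len(rects), lookahead)):
        # Stop as soon as a candidate overlaps anything already joined.
        if any(rects_overlap(rects[i], rects[j]) for j in range(joined)):
            break
        joined += 1
    return joined

rects = [(0, 0, 10, 10), (20, 0, 10, 10), (15, 5, 10, 10)]
print(join_for_lighting(rects, lookahead=8))  # 2: the third rect overlaps the second
```

A larger lookahead finds more join opportunities but spends more time testing; that is the trade-off the setting exposes.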
Light scissoring
~~~~~~~~~~~~~~~~
Batching can make it more difficult to cull out objects that are not affected or
partially affected by a light. This can increase the fill rate requirements
quite a bit and slow down rendering. *Fill rate* is the rate at which pixels are
colored. It is another potential bottleneck unrelated to draw calls.
In order to counter this problem (and speed up lighting in general), batching
introduces light scissoring. This enables the use of the OpenGL command
``glScissor()``, which identifies an area outside of which the GPU won't render
any pixels. We can greatly optimize fill rate by identifying the intersection
area between a light and a primitive, and limit rendering the light to
*that area only*.
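The scissor rectangle is simply the intersection of two screen-space rectangles. A minimal sketch (the actual renderer computes this in C++ before handing the result to ``glScissor()``):

```python
# Intersect the light's screen-space bounding rectangle with the primitive's;
# only the intersection needs to be rendered for that light pass.

def intersect(a, b):
    """Intersection of two (x, y, w, h) rects, or None if they don't meet."""
    x = max(a[0], b[0])
    y = max(a[1], b[1])
    w = min(a[0] + a[2], b[0] + b[2]) - x
    h = min(a[1] + a[3], b[1] + b[3]) - y
    if w <= 0 or h <= 0:
        return None
    return (x, y, w, h)

light_rect = (100, 100, 200, 200)
primitive_rect = (50, 50, 100, 100)
print(intersect(light_rect, primitive_rect))  # (100, 100, 50, 50)
```

When the intersection is much smaller than the primitive, the fill rate saved can be substantial.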
Light scissoring is controlled with the :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
project setting. This value is between 1.0 and 0.0, with 1.0 being off (no
scissoring), and 0.0 being scissoring in every circumstance. The reason for the
setting is that there may be some small cost to scissoring on some hardware.
That said, scissoring should usually result in performance gains when you're
using 2D lighting.
The relationship between the threshold and whether a scissor operation takes
place is not always straightforward. Generally, it represents the pixel area
that is potentially "saved" by a scissor operation (i.e. the fill rate saved).
At 1.0, the entire screen's pixels would need to be saved, which rarely (if
ever) happens, so it is switched off. In practice, the useful values are close
to 0.0, as only a small percentage of pixels need to be saved for the operation
to be useful.
The exact relationship is probably not necessary for users to worry about, but
is included in the appendix out of interest:
:ref:`doc_batching_light_scissoring_threshold_calculation`
.. figure:: img/scissoring.png
    :alt: Light scissoring example diagram

    Bottom right is a light; the red area is the pixels saved by the scissoring
    operation. Only the intersection needs to be rendered.
Vertex baking
~~~~~~~~~~~~~
The GPU shader receives instructions on what to draw in 2 main ways:

- Shader uniforms (e.g. modulate color, item transform).
- Vertex attributes (vertex color, local transform).
However, within a single draw call (batch), we cannot change uniforms. This
means that naively, we would not be able to batch together items or commands
that change ``final_modulate`` or an item's transform. Unfortunately, that
happens in an awful lot of cases. For instance, sprites are typically
individual nodes with their own item transform, and they may have their own
color modulate as well.
To get around this problem, the batching can "bake" some of the uniforms into
the vertex attributes.

- The item transform can be combined with the local transform and sent in a
  vertex attribute.
- The final modulate color can be combined with the vertex colors, and sent in a
  vertex attribute.
In most cases, this works fine, but this shortcut breaks down if a shader expects
these values to be available individually rather than combined. This can happen
in custom shaders.
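Conceptually, baking looks something like this. This is a hypothetical sketch (the real code handles full 2D transforms and packed vertex formats, not just a translation):

```python
# Per-item uniforms (transform, modulate color) are folded into the vertex
# data so that many items can share one draw call without uniform changes.

def bake_vertex(position, vertex_color, item_transform, final_modulate):
    # Apply the item's 2D transform (reduced to a translation for brevity)...
    tx, ty = item_transform
    baked_pos = (position[0] + tx, position[1] + ty)
    # ...and multiply the modulate color into the vertex color.
    baked_color = tuple(v * m for v, m in zip(vertex_color, final_modulate))
    return baked_pos, baked_color

pos, col = bake_vertex((10, 5), (1.0, 1.0, 1.0, 1.0), (100, 50), (1.0, 0.5, 0.5, 1.0))
print(pos, col)  # (110, 55) (1.0, 0.5, 0.5, 1.0)
```

Once baked, the shader can no longer see the original transform or modulate separately, which is exactly why the custom shader caveats below exist.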
Custom shaders
^^^^^^^^^^^^^^
As a result of the limitation described above, certain operations in custom
shaders will prevent vertex baking and therefore decrease the potential for
batching. While we are working to decrease these cases, the following caveats
currently apply:

- Reading or writing ``COLOR`` or ``MODULATE`` disables vertex color baking.
- Reading ``VERTEX`` disables vertex position baking.
Project settings
~~~~~~~~~~~~~~~~
To fine-tune batching, a number of project settings are available. You can
usually leave these at default during development, but it's a good idea to
experiment to ensure you are getting maximum performance. Spending a little time
tweaking parameters can often give considerable performance gains for very
little effort. See the on-hover tooltips in the Project Settings for more
information.
rendering/batching/options
^^^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`use_batching
  <class_ProjectSettings_property_rendering/batching/options/use_batching>` -
  Turns batching on or off.
- :ref:`use_batching_in_editor
  <class_ProjectSettings_property_rendering/batching/options/use_batching_in_editor>` -
  Turns batching on or off in the Godot editor.
  This setting doesn't affect the running project in any way.
- :ref:`single_rect_fallback
  <class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>` -
  This is a faster way of drawing unbatchable rectangles. However, it may lead
  to flicker on some hardware, so it's not recommended.
rendering/batching/parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`max_join_item_commands <class_ProjectSettings_property_rendering/batching/parameters/max_join_item_commands>` -
  One of the most important ways of achieving batching is to join suitable
  adjacent items (nodes) together. However, they can only be joined if the
  commands they contain are compatible. The system must therefore do a lookahead
  through the commands in an item to determine whether it can be joined. This
  has a small cost per command, and items with a large number of commands are
  not worth joining, so the best value may be project-dependent.
- :ref:`colored_vertex_format_threshold
  <class_ProjectSettings_property_rendering/batching/parameters/colored_vertex_format_threshold>` -
  Baking colors into vertices results in a larger vertex format. This is not
  necessarily worth doing unless there are a lot of color changes going on
  within a joined item. This parameter represents the ratio of commands
  containing color changes to the total number of commands; above this ratio,
  the renderer switches to baked colors.
- :ref:`batch_buffer_size
  <class_ProjectSettings_property_rendering/batching/parameters/batch_buffer_size>` -
  This determines the maximum size of a batch. It doesn't have a huge effect
  on performance, but can be worth decreasing for mobile if RAM is at a premium.
- :ref:`item_reordering_lookahead
  <class_ProjectSettings_property_rendering/batching/parameters/item_reordering_lookahead>` -
  Item reordering can help especially with interleaved sprites using different
  textures. The lookahead for the overlap test has a small cost, so the best
  value may change per project.
rendering/batching/lights
^^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`scissor_area_threshold
  <class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>` -
  See light scissoring above.
- :ref:`max_join_items
  <class_ProjectSettings_property_rendering/batching/lights/max_join_items>` -
  Joining items before lighting can significantly increase
  performance. This requires an overlap test, which has a small cost, so the
  costs and benefits may be project-dependent, and hence the best value to use
rendering/batching/debug
^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`flash_batching
  <class_ProjectSettings_property_rendering/batching/debug/flash_batching>` -
  This is purely a debugging feature to identify regressions between the
  batching and legacy renderer. When it is switched on, the batching and legacy
  renderer are used alternately on each frame. This will decrease performance,
  and should not be used for your final export, only for testing.
- :ref:`diagnose_frame
  <class_ProjectSettings_property_rendering/batching/debug/diagnose_frame>` -
  This will periodically print a diagnostic batching log to
  the Godot IDE / console.
rendering/batching/precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- :ref:`uv_contract
  <class_ProjectSettings_property_rendering/batching/precision/uv_contract>` -
  On some hardware (notably some Android devices), there have been reports of
  tilemap tiles drawing slightly outside their UV range, leading to edge
  artifacts. As a workaround, this setting makes a small contraction in the UV
  coordinates to compensate for precision errors on devices.
- :ref:`uv_contract_amount
  <class_ProjectSettings_property_rendering/batching/precision/uv_contract_amount>` -
  Hopefully, the default amount should cure artifacts on most devices,
  but this value remains adjustable just in case.
.. _doc_batching_diagnostics:
Diagnostics
~~~~~~~~~~~
Although you can change parameters and examine the effect on frame rate, this
can feel like working blindly, with no idea of what is going on under the hood.
To help with this, batching offers a diagnostic mode, which will periodically
print out (to the IDE or console) a list of the batches that are being
processed. This can help pinpoint situations where batching isn't occurring
as intended, and help you fix these situations to get the best possible performance.
Reading a diagnostic
^^^^^^^^^^^^^^^^^^^^
.. code-block:: text

    canvas_begin FRAME 2604
    items
        joined_item 1 refs
            batch D 0-0
            batch D 0-2 n n
            batch R 0-1 [0 - 0] {255 255 255 255 }
        joined_item 1 refs
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
        joined_item 1 refs
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
    canvas_end
This is a typical diagnostic.
- **joined_item:** A joined item can contain 1 or
  more references to items (nodes). Generally, joined_items containing many
  references are preferable to many joined_items containing a single reference.
  Whether items can be joined will be determined by their contents and
  compatibility with the previous item.
- **batch R:** A batch containing rectangles. The second number is the number of
  rects. The second number in square brackets is the Godot texture ID, and the
  numbers in curly braces are the color. If the batch contains more than one rect,
  ``MULTI`` is added to the line to make it easy to identify.
  Seeing ``MULTI`` is good as it indicates successful batching.
- **batch D:** A default batch, containing everything else that is not currently
  batched.
Default batches
^^^^^^^^^^^^^^^
The second number following default batches is the number of commands in the
batch, and it is followed by a brief summary of the contents::

    l - line
    PL - polyline
    r - rect
    n - ninepatch
    PR - primitive
    p - polygon
    m - mesh
    MM - multimesh
    PA - particles
    c - circle
    t - transform
    CI - clip_ignore
You may see "dummy" default batches containing no commands; you can ignore those.
FAQ
~~~
I don't get a large performance increase when enabling batching.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Try the diagnostics to see how much batching is occurring, and whether it can
  be improved.
- Try changing batching parameters in the Project Settings.
- Consider that batching may not be your bottleneck (see bottlenecks).
I get a decrease in performance with batching.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Try the steps described above to increase the number of batching opportunities.
- Try enabling :ref:`single_rect_fallback
  <class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`.
- The single rect fallback method is the default used without batching, and it
  is approximately twice as fast. However, it can result in flickering on some
  hardware, so its use is discouraged.
- After trying the above, if your scene is still performing worse, consider
  turning off batching.
I use custom shaders and the items are not batching.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Custom shaders can be problematic for batching; see the custom shaders
  section above.
I am seeing line artifacts appear on certain hardware.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- See the :ref:`uv_contract
  <class_ProjectSettings_property_rendering/batching/precision/uv_contract>`
  project setting, which can be used to solve this problem.
I use a large number of textures, so few items are being batched.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Consider using texture atlases. As well as allowing batching, these
  reduce the need for state changes associated with changing textures.
Appendix
~~~~~~~~
.. _doc_batching_light_scissoring_threshold_calculation:
Light scissoring threshold calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The actual proportion of screen pixel area used as the threshold is the
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
value to the power of 4.
For example, on a screen size of 1920×1080, there are 2,073,600 pixels.
At a threshold of 1,000 pixels, the proportion would be::

    1000 / 2073600 = 0.00048225
    0.00048225 ^ (1/4) = 0.14819
So a :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
of ``0.15`` would be a reasonable value to try.
Going the other way, for instance with a :ref:`scissor_area_threshold
<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
of ``0.5``::

    0.5 ^ 4 = 0.0625
    0.0625 * 2073600 = 129600 pixels
If the number of pixels saved is greater than this threshold, the scissor is
activated.
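The conversion above can be wrapped in a small helper function. The following is
a hypothetical GDScript sketch (the function name is our own, not an engine
API)::

    # Hypothetical helper: convert a desired pixel threshold into a
    # scissor_area_threshold value for a given screen resolution.
    func scissor_threshold_for_pixels(pixels, screen_width, screen_height):
        var proportion = float(pixels) / (screen_width * screen_height)
        # The setting is the proportion raised to the power of 1/4.
        return pow(proportion, 0.25)

For example, ``scissor_threshold_for_pixels(1000, 1920, 1080)`` returns
approximately ``0.148``, matching the worked example above.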
.. _doc_cpu_optimization:
CPU optimization
================
Measuring performance
=====================
To know how to speed up our program, we have to know where the "bottlenecks"
are. Bottlenecks are the slowest parts of the program that limit the rate that
everything can progress. This allows us to concentrate our efforts on optimizing
the areas which will give us the greatest speed improvement, instead of spending
a lot of time optimizing functions that will lead to small performance
CPU profilers
~~~~~~~~~~~~~
Profilers run alongside your program and take timing measurements to work out
what proportion of time is spent in each function.
The Godot IDE conveniently has a built-in profiler. It does not run every time
you start your project: it must be manually started and stopped. This is
because, like most profilers, recording these timing measurements can
slow down your project significantly.
After profiling, you can look back at the results for a frame.
.. figure:: img/godot_profiler.png
   :alt: Screenshot of the Godot profiler

   Results of a profile of one of the demo projects.
.. note:: We can see the cost of built-in processes such as physics and audio,
as well as seeing the cost of our own scripting functions at the
bottom.
Time spent waiting for various built-in servers may not be counted in
the profilers. This is a known bug.
When a project is running slowly, you will often see an obvious function or
process taking a lot more time than others. This is your primary bottleneck, and
you can usually increase speed by optimizing this area.
For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
External profilers
~~~~~~~~~~~~~~~~~~
Although the Godot IDE profiler is very convenient and useful, sometimes you
need more power, and the ability to profile the Godot engine source code itself.
You can use a number of third-party profilers to do this including
`Valgrind <https://www.valgrind.org/>`__,
`VerySleepy <http://www.codersnotes.com/sleepy/>`__,
`HotSpot <https://github.com/KDAB/hotspot>`__,
`Visual Studio <https://visualstudio.microsoft.com/>`__ and
`Intel VTune <https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html>`__.
.. note:: You will need to compile Godot from source to use a third-party profiler.
          This is required to obtain debugging symbols. You can also use a debug
          build; however, note that the results of profiling a debug build will
          be different from a release build, because debug builds are less
          optimized. Bottlenecks are often in a different place in debug builds,
          so you should profile release builds whenever possible.
.. figure:: img/valgrind.png
   :alt: Screenshot of Callgrind

   Example results from Callgrind, which is part of Valgrind.
From the left, Callgrind is listing the percentage of time within a function and
its children (Inclusive), the percentage of time spent within the function
itself, excluding child functions (Self), the number of times the function is
called, the function name, and the file or module.
In this example, we can see nearly all time is spent under the
`Main::iteration()` function. This is the master function in the Godot source
code that is called repeatedly. It causes frames to be drawn, physics ticks to
be simulated, and nodes and scripts to be updated. A large proportion of the
time is spent in the functions to render a canvas (66%), because this example
uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver).
This tells us that a large proportion of CPU time is being spent in the
graphics driver.
This is actually an excellent example because, in an ideal world, only a very
small proportion of time would be spent in the graphics driver. This is an
indication that there is a problem with too much communication and work being
done in the graphics API. This specific profiling led to the development of 2D
batching, which greatly speeds up 2D rendering by reducing bottlenecks in this
area.
Manually timing functions
=========================
Another handy technique, especially once you have identified the bottleneck
using a profiler, is to manually time the function or area under test.
The specifics vary depending on the language, but in GDScript, you would do
the following:
::
    var time_start = OS.get_ticks_usec()

    # The function you want to time.
    update_enemies()

    var time_end = OS.get_ticks_usec()
    print("update_enemies() took %d microseconds" % (time_end - time_start))
When manually timing functions, it is usually a good idea to run the function
many times (1,000 or more times), instead of just once (unless it is a very slow
function). The reason for doing this is that timers often have limited accuracy.
Moreover, CPUs will schedule processes in a haphazard manner. Therefore, an
average over a series of runs is more accurate than a single measurement.
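A hedged sketch of timing over many runs might look like this, where
``update_enemies()`` stands in for your own function::

    var time_start = OS.get_ticks_usec()

    # Run the function many times so timer inaccuracy and CPU scheduling
    # noise average out.
    for i in range(1000):
        update_enemies()

    var time_end = OS.get_ticks_usec()
    # Divide by the number of runs to get the average cost per call.
    print("update_enemies() took %f microseconds on average" % ((time_end - time_start) / 1000.0))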
As you attempt to optimize functions, be sure to either repeatedly profile or
time them as you go. This will give you crucial feedback as to whether the
optimization is working (or not).
Caches
======
CPU caches are something else to be particularly aware of, especially when
comparing timing results of two different versions of a function. The results
can be highly dependent on whether the data is in the CPU cache or not. CPUs
don't load data directly from the system RAM, even though it's huge in
comparison to the CPU cache (several gigabytes instead of a few megabytes). This
is because system RAM is very slow to access. Instead, CPUs load data from a
smaller, faster bank of memory called the cache. Loading data from the cache is
very fast, but every time you try to load a memory address that is not stored in
the cache, the cache must make a trip to main memory and slowly load in some
data. This delay can result in the CPU sitting around idle for a long time, and
is referred to as a "cache miss".
This means that the first time you run a function, it may run slowly because the
data is not in the CPU cache. The second and later times, it may run much faster
because the data is in the cache. Due to this, always use averages when timing,
and be aware of the effects of cache.
Understanding caching is also crucial to CPU optimization. If you have an
algorithm (routine) that loads small bits of data from randomly spread out areas
will be able to work as fast as possible.
Godot usually takes care of such low-level details for you. For example, the
Server APIs make sure data is optimized for caching already for things like
rendering and physics. Still, you should be especially aware of caching when
using :ref:`GDNative <toc-tutorials-gdnative>`.
Languages
=========
Godot supports a number of different languages, and it is worth bearing in mind
that there are trade-offs involved. Some languages are designed for ease of use
at the cost of speed, and others are faster but more difficult to work with.
Built-in engine functions run at the same speed regardless of the scripting
language you choose. If your project is making a lot of calculations in its own
code, consider moving those calculations to a faster language.
GDScript
~~~~~~~~
:ref:`GDScript <toc-learn-scripting-gdscript>` is designed to be easy to use and iterate,
and is ideal for making many types of games. However, in this language, ease of
use is considered more important than performance. If you need to make heavy
calculations, consider moving some of your project to one of the other
languages.
C#
~~
:ref:`C# <toc-learn-scripting-C#>` is popular and has first-class support in Godot.
It offers a good compromise between speed and ease of use. Beware of possible
garbage collection pauses and leaks that can occur during gameplay, though. A
common approach to work around issues with garbage collection is to use *object
pooling*, which is outside the scope of this guide.
Other languages
~~~~~~~~~~~~~~~
Third parties provide support for several other languages, including `Rust
C++
~~~
Godot is written in C++. Using C++ will usually result in the fastest code.
However, on a practical level, it is the most difficult to deploy to end users'
machines on different platforms. Options for using C++ include
:ref:`GDNative <toc-tutorials-gdnative>` and
:ref:`custom modules <doc_custom_modules_in_c++>`.
Threads
=======
Consider using threads when making a lot of calculations that can run in
parallel to each other. Modern CPUs have multiple cores, each one capable of
doing a limited amount of work. By spreading work over multiple threads, you can
move further towards peak CPU efficiency.
The disadvantage of threads is that you have to be incredibly careful. As each
CPU core operates independently, they can end up trying to access the same
memory at the same time. One thread can be reading a variable while another
is writing to it: this is called a *race condition*. Before you use threads,
make sure you understand the dangers and how to try and prevent these race
conditions.
Threads can also make debugging considerably more difficult. The GDScript
debugger doesn't support setting up breakpoints in threads yet.

For more information on threads, see :ref:`doc_using_multiple_threads`.
SceneTree
=========
Although Nodes are an incredibly powerful and versatile concept, be aware that
every node has a cost. Built-in functions such as ``_process()`` and
``_physics_process()`` propagate through the tree. This housekeeping can reduce
performance when you have very large numbers of nodes (usually in the thousands).
Each node is handled individually in the Godot renderer. Therefore, a smaller
number of nodes with more in each can lead to better performance.
One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
get much better performance by removing nodes from the SceneTree, rather than by
pausing or hiding them. You don't have to delete a detached node. You can, for
example, keep a reference to a node, detach it from the scene tree using
:ref:`Node.remove_child(node) <class_Node_method_remove_child>`, then reattach
it later using :ref:`Node.add_child(node) <class_Node_method_add_child>`.
This can be very useful for adding and removing areas from a game, for example.
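A hedged sketch of this detach-and-reattach pattern (the variable and method
names are our own)::

    var saved_area = null

    func unload_area(area):
        # Keep a reference so the node isn't freed, then detach it.
        # While detached, the node is neither processed nor rendered.
        saved_area = area
        remove_child(area)

    func load_area():
        if saved_area:
            add_child(saved_area)
            saved_area = null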
You can avoid the SceneTree altogether by using Server APIs. For more
information, see :ref:`doc_using_servers`.
Physics
=======
In some situations, physics can end up becoming a bottleneck. This is
particularly the case with complex worlds and large numbers of physics objects.

Here are some techniques to speed up physics:
- Try using simplified versions of your rendered geometry for collision shapes.
  Often, this won't be noticeable for end users, but can greatly increase
  performance.
- Try removing objects from physics when they are out of view / outside the
  current area, or reusing physics objects (maybe you allow 8 monsters per area,
  for example, and reuse these).
Another crucial aspect to physics is the physics tick rate. In some games, you
can greatly reduce the tick rate. For example, instead of updating physics
60 times per second, you may update them only 30 or even 20 times per second.
This can greatly reduce the CPU load.
The downside of changing physics tick rate is you can get jerky movement or
jitter when the physics update rate does not match the frames per second
rendered. Also, decreasing the physics tick rate will increase input lag.
It's recommended to stick to the default physics tick rate (60 Hz) in most games
that feature real-time player movement.
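The tick rate can be changed in the project settings, or at run time as in
this minimal sketch::

    func _ready():
        # Run physics at 30 ticks per second instead of the default 60.
        # This roughly halves the physics CPU cost, at the price of
        # choppier motion and increased input lag.
        Engine.iterations_per_second = 30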
The solution to jitter is to use *fixed timestep interpolation*, which involves
smoothing the rendered positions and rotations over multiple frames to match the
physics. You can either implement this yourself or use a
`third-party addon <https://github.com/lawnjelly/smoothing-addon>`__.
Performance-wise, interpolation is a very cheap operation compared to running a
physics tick. It's orders of magnitude faster, so this can be a significant
performance win while also reducing jitter.
General optimization tips
=========================
Introduction
~~~~~~~~~~~~
In an ideal world, computers would run at infinite speed. The only limit to
what we could achieve would be our imagination. However, in the real world, it's
all too easy to produce software that will bring even the fastest computer to
its knees.
Thus, designing games and other software is a compromise between what we would
like to be possible, and what we can realistically achieve while maintaining
good performance.
To achieve the best results, we have two approaches:
- Work faster.
- Work smarter.
And preferably, we will use a blend of the two.
Smoke and mirrors
^^^^^^^^^^^^^^^^^
Part of working smarter is recognizing that, in games, we can often get the
player to believe they're in a world that is far more complex, interactive, and
graphically exciting than it really is. A good programmer is a magician, and
should strive to learn the tricks of the trade while trying to invent new ones.
The nature of slowness
^^^^^^^^^^^^^^^^^^^^^^
To the outside observer, performance problems are often lumped together.
But in reality, there are several different kinds of performance problems:
- A slow process that occurs every frame, leading to a continuously low frame
  rate.
- An intermittent process that causes "spikes" of slowness, leading to
  stalls.
- A slow process that occurs outside of normal gameplay, for instance,
  when loading a level.
Each of these is annoying to the user, but in different ways.
our attempts to speed them up.
There are several methods of measuring performance, including:
- Putting a start/stop timer around code of interest.
- Using the Godot profiler.
- Using external third-party CPU profilers.
- Using GPU profilers/debuggers such as
  `NVIDIA Nsight Graphics <https://developer.nvidia.com/nsight-graphics>`__
  or `apitrace <https://apitrace.github.io/>`__.
- Checking the frame rate (with V-Sync disabled).
Be very aware that the relative performance of different areas can vary on
different hardware. It's often a good idea to measure timings on more than one
device. This is especially the case if you're targeting mobile devices.
Limitations
~~~~~~~~~~~
CPU profilers are often the go-to method for measuring performance. However,
they don't always tell the whole story.
- Bottlenecks are often on the GPU, "as a result" of instructions given by the
  CPU.
- Spikes can occur in the operating system processes (outside of Godot) "as a
  result" of instructions used in Godot (for example, dynamic memory allocation).
- You may not always be able to profile specific devices like a mobile phone
  due to the initial setup required.
- You may have to solve performance problems that occur on hardware you don't
  have access to.
As a result of these limitations, you often need to use detective work to find
out where bottlenecks are.
binary search.
Hypothesis testing
^^^^^^^^^^^^^^^^^^
Say, for example, that you believe sprites are slowing down your game.
You can test this hypothesis by:
- Measuring the performance when you add more sprites, or take some away.
This may lead to a further hypothesis: does the size of the sprite determine
the performance drop?
- You can test this by keeping everything the same, but changing the sprite
  size, and measuring performance.
Binary search
^^^^^^^^^^^^^
If you know that frames are taking much longer than they should, but you're
not sure where the bottleneck lies. You could begin by commenting out
approximately half the routines that occur on a normal frame. Has the
performance improved more or less than expected?
Once you know which of the two halves contains the bottleneck, you can
repeat this process until you've pinned down the problematic area.
Profilers
=========
provide results telling you what percentage of time was spent in different
functions and areas, and how often functions were called.
This can be very useful both to identify bottlenecks and to measure the results
of your improvements. Sometimes, attempts to improve performance can backfire
and lead to slower performance.

**Always use profiling and timing to guide your efforts.**
For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
Principles
==========
`Donald Knuth <https://en.wikipedia.org/wiki/Donald_Knuth>`__ said:
*Programmers waste enormous amounts of time thinking about, or worrying
about, the speed of noncritical parts of their programs, and these attempts
The messages are very important:
- Developer time is limited. Instead of blindly trying to speed up
  all aspects of a program, we should concentrate our efforts on the aspects that
  really matter.
- Efforts at optimization often end up with code that is harder to read and
debug than non-optimized code. It is in our interests to limit this to areas
that will really benefit.
Just because we *can* optimize a particular bit of code, it doesn't necessarily
mean that we *should*. Knowing when and when not to optimize is a great skill to
develop.
One misleading aspect of the quote is that people tend to focus on the subquote
*"premature optimization is the root of all evil"*. While *premature* optimization
is (by definition) undesirable, performant software is the result of performant
design.
Performant design
~~~~~~~~~~~~~~~~~
The danger with encouraging people to ignore optimization until necessary, is
that it conveniently ignores that the most important time to consider
performance is at the design stage, before a key has even hit a keyboard. If the
design or algorithms of a program are inefficient, then no amount of polishing the
details later will make it run fast. It may run *faster*, but it will never run
as fast as a program designed for performance.
This tends to be far more important in game or graphics programming than in
general programming. A performant design, even without low-level optimization,
will often run many times faster than a mediocre design with low-level
optimization.
Incremental design
~~~~~~~~~~~~~~~~~~
Of course, in practice, unless you have prior knowledge, you are unlikely to
come up with the best design the first time. Instead, you'll often make a series of
versions of a particular area of code, each taking a different approach to the
problem, until you come to a satisfactory solution. It's important not to spend
too much time on the details at this stage until you have finalized the overall
design. Otherwise, much of your work will be thrown out.
It's difficult to give general guidelines for performant design because this is
so dependent on the problem. One point worth mentioning though, on the CPU
side, is that modern CPUs are nearly always limited by memory bandwidth. This
has led to a resurgence in data-oriented design, which involves designing data
structures and algorithms for *cache locality* of data and linear access, rather than
jumping around in memory.
The optimization process
========================
Assuming we have a reasonable design, and taking our lessons from Knuth, our
first step in optimization should be to identify the biggest bottlenecks - the
slowest functions, the low-hanging fruit.
Once we've successfully improved the speed of the slowest area, it may no
longer be the bottleneck. So we should test/profile again and find the next
bottleneck on which to focus.
The process is thus:
1. Profile / Identify bottleneck.
2. Optimize bottleneck.
3. Return to step 1.
Optimizing bottlenecks
~~~~~~~~~~~~~~~~~~~~~~
@@ -214,18 +215,22 @@ Optimizing bottlenecks
Some profilers will even tell you which part of a function (which data accesses,
calculations) are slowing things down.
As with design, you should concentrate your efforts first on making sure the
algorithms and data structures are the best they can be. Data access should be
local (to make best use of CPU cache), and it can often be better to use compact
storage of data (again, always profile to test results). Often, you precalculate
heavy computations ahead of time. This can be done by performing the computation
when loading a level, by loading a file containing precalculated data, or simply
by storing the results of complex calculations into a script constant and
reading its value.
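As a small sketch of the precalculation idea, a trigonometry lookup table can be built once at "level load" time instead of recomputing the function every frame. The table size is a hypothetical choice, and the nearest-entry lookup trades accuracy for speed:

```python
import math

# Precalculate a sine lookup table once, at load time.
TABLE_SIZE = 256
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def fast_sin(angle):
    """Approximate sin(angle) with a nearest-entry table lookup."""
    index = round(angle / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return SINE_TABLE[index]
```

Always profile: on modern hardware, a table lookup is not automatically faster than the math function, so this only pays off for genuinely heavy computations.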
Once algorithms and data are good, you can often make small changes in routines
which improve performance. For instance, you can move some calculations outside
of loops or transform nested ``for`` loops into non-nested loops.
(This should be feasible if you know a 2D array's width or height in advance.)
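A minimal sketch of the nested-to-flat loop transformation, assuming a 2D grid stored in a flat list with a known width and height:

```python
WIDTH, HEIGHT = 4, 3
grid = list(range(WIDTH * HEIGHT))  # a 2D array flattened row by row

# Nested version: two loops, one index computation per cell.
total_nested = 0
for y in range(HEIGHT):
    for x in range(WIDTH):
        total_nested += grid[y * WIDTH + x]

# Non-nested version: a single loop over all cells, feasible because
# WIDTH and HEIGHT are known in advance.
total_flat = 0
for i in range(WIDTH * HEIGHT):
    total_flat += grid[i]
```

Both loops visit the same cells in the same order; the flat version simply removes per-row loop overhead and index arithmetic.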
Always retest your timing/bottlenecks after making each change. Some changes
will increase speed, others may have a negative effect. Sometimes, a small
positive effect will be outweighed by the negatives of more complex code, and
you may choose to leave out that optimization.
@@ -235,9 +240,9 @@ Appendix
Bottleneck math
~~~~~~~~~~~~~~~
The proverb *"a chain is only as strong as its weakest link"* applies directly to
performance optimization. If your project is spending 90% of the time in
function ``A``, then optimizing ``A`` can have a massive effect on performance.
.. code-block:: none

    A: 9 ms
    Everything else: 1 ms
    Total frame time: 10 ms
@@ -247,14 +252,14 @@ function 'A', then optimizing A can have a massive effect on performance.
.. code-block:: none
    A: 1 ms
    Everything else: 1 ms
    Total frame time: 2 ms
In this example, improving this bottleneck ``A`` by a factor of 9× decreases
overall frame time by 5× while increasing frames per second by 5×.
However, if something else is running slowly and also bottlenecking your
project, then the same improvement can lead to less dramatic gains:
.. code-block:: none
@@ -269,8 +274,8 @@ project, then the same improvement can lead to less dramatic gains:
    Everything else: 50 ms
    Total frame time: 51 ms
In this example, even though we have hugely optimized function ``A``,
the actual gain in terms of frame rate is quite small.
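The arithmetic behind both examples can be captured in a few lines. This is an illustrative Python sketch of the frame-time math above, not engine code:

```python
def overall_speedup(a_ms, other_ms, optimized_a_ms):
    """Overall frame-time speedup obtained by optimizing only function A."""
    return (a_ms + other_ms) / (optimized_a_ms + other_ms)

# First example: A drops from 9 ms to 1 ms next to 1 ms of other work.
isolated_case = overall_speedup(9, 1, 1)   # 10 ms -> 2 ms
# Second example: the same 9x optimization next to 50 ms of other work.
diluted_case = overall_speedup(9, 50, 1)   # 59 ms -> 51 ms
```

The same 9× improvement to ``A`` yields a 5× overall speedup in the first case, but only about 1.16× in the second.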
In games, things become even more complicated because the CPU and GPU run
independently of one another. Your total frame time is determined by the slower
@@ -288,5 +293,5 @@ of the two.
    GPU: 50 ms
    Total frame time: 50 ms
In this example, we optimized the CPU hugely again, but the frame time didn't
improve because we are GPU-bottlenecked.
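Because the CPU and GPU work in parallel, the frame time is set by whichever is slower, which this small sketch makes explicit (illustrative Python, using the numbers from the example above):

```python
def frame_time_ms(cpu_ms, gpu_ms):
    """CPU and GPU run in parallel, so the slower one sets the frame time."""
    return max(cpu_ms, gpu_ms)

before = frame_time_ms(9, 50)  # GPU-bound already
after = frame_time_ms(1, 50)   # big CPU win, identical frame time
```

This is why you should always identify whether you are CPU- or GPU-bottlenecked before optimizing: effort spent on the faster of the two is wasted.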
@@ -1,75 +1,76 @@
.. _doc_gpu_optimization:
GPU optimization
================
Introduction
~~~~~~~~~~~~
The demand for new graphics features and progress almost guarantees that you
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
instance in calculations inside the Godot engine to prepare objects for
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
sorts instructions to pass to the GPU, and in the transfer of these
instructions. And finally, bottlenecks also occur on the GPU itself.
Where bottlenecks occur in rendering is highly hardware-specific.
Mobile GPUs in particular may struggle with scenes that run easily on desktop.
Understanding and investigating GPU bottlenecks is slightly different to the
situation on the CPU. This is because, often, you can only change performance
indirectly by changing the instructions you give to the GPU. Also, it may be
more difficult to take measurements. In many cases, the only way of measuring
performance is by examining changes in the time spent rendering each frame.
Draw calls, state changes, and APIs
===================================
.. note:: The following section is not relevant to end-users, but is useful to
provide background information that is relevant in later sections.
Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
Vulkan). The communication and driver activity involved can be quite costly,
especially in OpenGL and OpenGL ES. If we can provide these instructions in a
way that is preferred by the driver and GPU, we can greatly increase
performance.
Nearly every API command in OpenGL requires a certain amount of validation to
make sure the GPU is in the correct state. Even seemingly simple commands can
lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
reduce these instructions to a bare minimum and group together similar objects
as much as possible so they can be rendered together, or with the minimum number
of these expensive state changes.
2D batching
~~~~~~~~~~~
In 2D, the costs of treating each item individually can be prohibitively high -
there can easily be thousands of them on the screen. This is why 2D *batching*
is used. Multiple similar items are grouped together and rendered in a batch,
via a single draw call, rather than making a separate draw call for each item.
In addition, this means state changes, material and texture changes can be kept
to a minimum.
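The core idea of batching can be sketched in a few lines of engine-agnostic Python. The item list and state keys below are hypothetical, but the grouping logic mirrors what a batching renderer does: consecutive items sharing the same render state collapse into one draw call:

```python
from itertools import groupby

def build_batches(items):
    """Group consecutive draw items sharing (texture, material) into batches.

    Only consecutive items merge, mirroring how a renderer cannot batch
    across an item that forces a state change in between.
    """
    batches = []
    for state, group in groupby(items, key=lambda it: (it["texture"], it["material"])):
        batches.append({"state": state, "count": len(list(group))})
    return batches

# Hypothetical scene: three sprites sharing a texture, then a different one.
items = [
    {"texture": "tiles.png", "material": "default"},
    {"texture": "tiles.png", "material": "default"},
    {"texture": "tiles.png", "material": "default"},
    {"texture": "hero.png", "material": "default"},
]
batches = build_batches(items)  # 4 potential draw calls collapse into 2
```

This also shows why draw order matters for batching: interleaving the two textures item by item would prevent any merging at all.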
For more information on 2D batching, see :ref:`doc_batching`.
3D batching
~~~~~~~~~~~
In 3D, we still aim to minimize draw calls and state changes. However, it can be
more difficult to batch together several objects into a single draw call. 3D
meshes tend to comprise hundreds or thousands of triangles, and combining large
meshes in real-time is prohibitively expensive. The cost of joining them quickly
exceeds any benefits as the number of triangles per mesh grows. A much better
alternative is to **join meshes ahead of time** (static meshes in relation to each
other). This can either be done by artists, or programmatically within Godot.
There is also a cost to batching together objects in 3D. Several objects
rendered as one cannot be individually culled. An entire city that is off-screen
will still be rendered if it is joined to a single blade of grass that is on
screen. Thus, you should always take objects' location and culling into account
when attempting to batch 3D objects together. Despite this, the benefits of
joining static objects often outweigh other considerations, especially for large
numbers of distant or low-poly objects.
For more information on 3D specific optimizations, see
:ref:`doc_optimizing_3d_performance`.
@@ -80,14 +81,14 @@ Reuse Shaders and Materials
The Godot renderer is a little different to what is out there. It's designed to
minimize GPU state changes as much as possible. :ref:`SpatialMaterial
<class_SpatialMaterial>` does a good job at reusing materials that need similar
shaders. If custom shaders are used, make sure to reuse them as much as
possible. Godot's priorities are:
- **Reusing Materials:** The fewer different materials in the
scene, the faster the rendering will be. If a scene has a huge amount
of objects (in the hundreds or thousands), try reusing the materials.
In the worst case, use atlases to decrease the amount of texture changes.
- **Reusing Shaders:** If materials can't be reused, at least try to
re-use shaders (or SpatialMaterials with different parameters but the same
configuration).
@@ -95,54 +96,55 @@ If a scene has, for example, ``20,000`` objects with ``20,000`` different
materials each, rendering will be slow. If the same scene has ``20,000``
objects, but only uses ``100`` materials, rendering will be much faster.
Pixel cost versus vertex cost
=============================
You may have heard that the lower the number of polygons in a model, the faster
it will be rendered. This is *really* relative and depends on many factors.
On a modern PC and console, vertex cost is low. GPUs originally only rendered
triangles. This meant that every frame:
1. All vertices had to be transformed by the CPU (including clipping).
2. All vertices had to be sent to the GPU memory from the main RAM.
Nowadays, all this is handled inside the GPU, greatly increasing performance.
3D artists usually have the wrong feeling about polycount performance because 3D
DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to
be edited, reducing actual performance. Game engines rely on the GPU more, so
they can render many triangles much more efficiently.
On mobile devices, the story is different. PC and console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power efficient.
To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
when the same pixel on the screen is being rendered more than once. Imagine a
town with several buildings. GPUs don't know what is visible and what is hidden
until they draw it. For example, a house might be drawn and then another house
in front of it (which means rendering happened twice for the same pixel). PC
GPUs normally don't care much about this and just throw more pixel processors to
the hardware to increase performance (which also increases power consumption).
Using more power is not an option on mobile, so mobile devices use a technique
called *tile-based rendering*, which divides the screen into a grid. Each cell
keeps the list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power consumption,
but takes a toll on vertex performance. As a result, fewer vertices and
triangles can be processed for drawing.
Additionally, tile-based rendering struggles when there are small objects with a
lot of geometry within a small portion of the screen. This forces mobile GPUs to
put a lot of strain on a single screen tile, which considerably decreases
performance as all the other cells must wait for it to complete before
displaying the frame.
To summarize, don't worry about vertex count on mobile, but
**avoid concentration of vertices in small parts of the screen**.
If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
a smaller level of detail (LOD) model. Even on desktop GPUs, it's preferable to
avoid having triangles smaller than the size of a pixel on screen.
Pay attention to the additional vertex processing required when using:
@@ -150,47 +152,53 @@ Pay attention to the additional vertex processing required when using:
- Morphs (shape keys)
- Vertex-lit objects (common on mobile)
Pixel/fragment shaders and fill rate
====================================
In contrast to vertex processing, the costs of fragment (per-pixel) shading have
increased dramatically over the years. Screen resolutions have increased (the
area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
screen, that is 27× the area), but also the complexity of fragment shaders has
exploded. Physically-based rendering requires complex calculations for each
fragment.
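The resolution figures above are easy to verify. A quick illustrative check of the pixel counts and their ratio:

```python
# Screen area comparison from the text: 4K UHD versus old VGA.
uhd_pixels = 3840 * 2160  # 4K UHD
vga_pixels = 640 * 480    # VGA
ratio = uhd_pixels / vga_pixels
```

The fragment shader runs once per covered pixel (and more with overdraw), so at 4K, the per-pixel work is multiplied 27-fold compared to VGA before shader complexity is even considered.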
You can test whether a project is fill rate-limited quite easily. Turn off
V-Sync to prevent capping the frames per second, then compare the frames per
second when running with a large window, to running with a very small window.
You may also benefit from similarly reducing your shadow map size if using
shadows. Usually, you will find the FPS increases quite a bit using a small
window, which indicates you are to some extent fill rate-limited. On the other
hand, if there is little to no increase in FPS, then your bottleneck lies
elsewhere.
You can increase performance in a fill rate-limited project by reducing the
amount of work the GPU has to do. You can do this by simplifying the shader
(perhaps turn off expensive options if you are using a :ref:`SpatialMaterial
<class_SpatialMaterial>`), or reducing the number and size of textures used.
**When targeting mobile devices, consider using the simplest possible shaders
you can reasonably afford to use.**
Reading textures
~~~~~~~~~~~~~~~~
The other factor in fragment shaders is the cost of reading textures. Reading
textures is an expensive operation, especially when reading from several
textures in a single fragment shader. Also, consider that filtering may slow it
down further (trilinear filtering between mipmaps, and averaging). Reading
textures is also expensive in terms of power usage, which is a big issue on
mobiles.
**If you use third-party shaders or write your own shaders, try to use
algorithms that require as few texture reads as possible.**
Texture compression
~~~~~~~~~~~~~~~~~~~
By default, Godot compresses textures of 3D models when imported using video RAM
(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
JPG when stored, but increases performance enormously when drawing large enough
textures.
This is because the main goal of texture compression is bandwidth reduction
between memory and the GPU.
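The bandwidth saving is easy to quantify. As an illustrative sketch (assuming uncompressed RGBA8 at 32 bits per pixel and an S3TC/DXT1-class format at 4 bits per pixel, a common desktop VRAM compression format):

```python
def texture_bytes(width, height, bits_per_pixel):
    """Raw storage for a single mip level at the given bit depth."""
    return width * height * bits_per_pixel // 8

side = 1024
uncompressed = texture_bytes(side, side, 32)  # RGBA8: 32 bits per pixel
compressed = texture_bytes(side, side, 4)     # DXT1-class: 4 bits per pixel
savings = uncompressed // compressed          # 8x less data to sample
```

Every texture sample touches 8× less memory, which is where the drawing speedup comes from; mipmaps add roughly a third more on top of both figures.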
@@ -203,61 +211,72 @@ more noticeable.
As a warning, most Android devices do not support texture compression of
textures with transparency (only opaque), so keep this in mind.
.. note::

    Even in 3D, "pixel art" textures should have VRAM compression disabled, as it
    will negatively affect their appearance without improving performance
    significantly due to their low resolution.

Post-processing and shadows
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Post-processing effects and shadows can also be expensive in terms of fragment
shading activity. Always test the impact of these on different hardware.
**Reducing the size of shadowmaps can increase performance**, both in terms of
writing and reading the shadowmaps. On top of that, the best way to improve
performance of shadows is to turn shadows off for as many lights and objects as
possible. Smaller or distant OmniLights/SpotLights can often have their shadows
disabled with only a small visual impact.
Transparency and blending
=========================
Transparent objects present particular problems for rendering efficiency. Opaque
objects (especially in 3D) can be essentially rendered in any order and the
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
blended objects are different. In most cases, they cannot rely on the Z-buffer
and must be rendered in "painter's order" (i.e. from back to front) to look
correct.
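Painter's order amounts to sorting by distance from the camera, farthest first. A minimal illustrative sketch (the object records and a camera fixed on the Z axis are assumptions for the example):

```python
def painters_order(objects, camera_z):
    """Sort transparent objects back to front by distance from the camera."""
    return sorted(objects, key=lambda obj: abs(obj["z"] - camera_z), reverse=True)

# Hypothetical objects at different depths; camera at z = 0.
objs = [
    {"name": "near", "z": 2.0},
    {"name": "far", "z": 10.0},
    {"name": "mid", "z": 5.0},
]
ordered = [o["name"] for o in painters_order(objs, camera_z=0.0)]
```

Because this sort must happen every frame as the camera moves, and because it defeats state-sorting by material, transparency carries a CPU cost on top of the fill-rate cost described below.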
Transparent objects are also particularly bad for fill rate, because every item
has to be drawn even if other transparent objects will be drawn on top
later on.
Opaque objects don't have to do this. They can usually take advantage of the
Z-buffer by writing to the Z-buffer only first, then only performing the
fragment shader on the "winning" fragment, the object that is at the front at a
particular pixel.
Transparency is particularly expensive where multiple transparent objects
overlap. It is usually better to use transparent areas as small as possible to
minimize these fill rate requirements, especially on mobile, where fill rate is
very expensive. Indeed, in many situations, rendering more complex opaque
geometry can end up being faster than using transparency to "cheat".
Multi-platform advice
=====================
If you are aiming to release on multiple platforms, test *early* and test
*often* on all your platforms, especially mobile. Developing a game on desktop
but attempting to port it to mobile at the last minute is a recipe for disaster.
In general, you should design your game for the lowest common denominator, then
add optional enhancements for more powerful platforms. For example, you may want
to use the GLES2 backend for both desktop and mobile platforms where you target
both.
Mobile/tiled renderers
======================
As described above, GPUs on mobile devices work in dramatically different ways
from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
split up the screen into regular-sized tiles that fit into super fast cache
memory, which reduces the number of read/write operations to the main memory.
There are some downsides though. Tiled rendering can make certain techniques
much more complicated and expensive to perform. Tiles that rely on the results
of rendering in different tiles or on the results of earlier operations being
preserved can be very slow. Be very careful to test the performance of shaders,
viewport textures and post processing.
@@ -4,33 +4,33 @@ Optimization
Introduction
------------
Godot follows a balanced performance philosophy. In the performance world,
there are always trade-offs, which consist of trading speed for usability
and flexibility. Some practical examples of this are:
- Rendering objects efficiently in high amounts is easy, but when a
large scene must be rendered, it can become inefficient. To solve this,
visibility computation must be added to the rendering. This makes rendering
less efficient, but at the same time, fewer objects are rendered.
Therefore, the overall rendering efficiency is improved.
- Configuring the properties of every material for every object that
needs to be rendered is also slow. To solve this, objects are sorted by
material to reduce the costs. At the same time, sorting has a cost.
- In 3D physics, a similar situation happens. The best algorithms to
handle large amounts of physics objects (such as SAP) are slow at
insertion/removal of objects and raycasting. Algorithms that allow faster
insertion and removal, as well as raycasting, will not be able to handle as
many active objects.
And there are many more examples of this! Game engines strive to be general-purpose
in nature. Balanced algorithms are always favored over algorithms
that might be fast in some situations and slow in others, or algorithms that are
fast but are more difficult to use.
Godot is not an exception to this. While it is designed to have backends swappable
for different algorithms, the default backends prioritize balance and flexibility
over performance.
With this clear, the aim of this tutorial section is to explain how to get the
@@ -22,34 +22,42 @@ in the street you are in, as well as the sky and a few birds flying overhead. As
far as a naive renderer is concerned however, you can still see the entire town.
It won't just render the buildings in front of you, it will render the street
behind that, with the people on that street, the buildings behind that. You
quickly end up in situations where you are attempting to render 10× or 100× more
than what is visible.
Things aren't quite as bad as they seem, because the Z-buffer usually allows the
GPU to only fully shade the objects that are at the front. This is called *depth
prepass* and is enabled by default in Godot when using the GLES3 renderer.
However, unneeded objects are still reducing performance.
One way we can potentially reduce the amount to be rendered is to take advantage
of occlusion. As of Godot 3.2.2, there is no built-in support for occlusion in
Godot. However, with careful design, you can still get many of the advantages.
For instance, in our city street scenario, you may be able to work out in advance
that you can only see two other streets, ``B`` and ``C``, from street ``A``.
Streets ``D`` to ``Z`` are hidden. In order to take advantage of occlusion, all
you have to do is work out when your viewer is in street ``A`` (perhaps using
Godot Areas), then you can hide the other streets.
This is a manual version of what is known as a "potentially visible set". It is
a very powerful technique for speeding up rendering. You can also use it to
restrict physics or AI to the local area, and speed these up as well as
rendering.
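The hand-authored PVS described above boils down to a lookup table from each area to the set of areas visible from it. A minimal illustrative Python sketch (the street names and visibility sets are hypothetical; in Godot, the hiding step would toggle node visibility when an Area is entered):

```python
# Hand-authored potentially visible set: for each street,
# the set of streets that can be seen from it (including itself).
PVS = {
    "A": {"A", "B", "C"},
    "B": {"A", "B"},
    "C": {"A", "C"},
}

def streets_to_hide(current_street, all_streets):
    """Everything not in the current street's visible set gets hidden."""
    return sorted(all_streets - PVS[current_street])

hidden = streets_to_hide("A", {"A", "B", "C", "D", "E"})
```

The same table can gate physics and AI updates, not just rendering, which is where much of the technique's power comes from.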
.. note::

    In some cases, you may have to adapt your level design to add more occlusion
    opportunities. For example, you may have to add more walls to prevent the
    player from seeing too far away, which would decrease performance due to the
    lost opportunities for occlusion culling.
Other occlusion techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~
There are other occlusion techniques such as portals, automatic PVS, and
raster-based occlusion culling. Some of these may be available through add-ons
and may be available in core Godot in the future.
Transparent objects
~~~~~~~~~~~~~~~~~~~
@@ -57,9 +65,10 @@ Transparent objects
Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
<class_Shader>` to improve performance. This, however, can not be done with
transparent objects. Transparent objects are rendered from back to front to make
blending with what is behind work. As a result, try to use as few transparent
objects as possible. If an object has a small section with transparency, try to
make that section a separate surface with its own Material.
blending with what is behind work. As a result,
**try to use as few transparent objects as possible**. If an object has a
small section with transparency, try to make that section a separate surface
with its own material.
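For instance, a minimal sketch of assigning a transparent material to just one
surface of a mesh (the surface index and colors are assumptions; in a real
project, you would pick the surface that actually needs transparency):

.. code-block:: gdscript

    # Only the "window" surface of this mesh uses a transparent material;
    # the other surfaces keep their opaque materials.
    extends MeshInstance

    func _ready():
        var window_material = SpatialMaterial.new()
        # Enabling transparency puts this surface on the transparent
        # (back-to-front sorted) rendering path.
        window_material.flags_transparent = true
        window_material.albedo_color = Color(0.8, 0.9, 1.0, 0.4)
        # Assumes surface 1 is the window surface of the mesh.
        set_surface_material(1, window_material)

This keeps the opaque parts of the object on the fast rendering path while only
the transparent surface pays the sorting and blending cost.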
For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
doc.
Level of detail (LOD)
=====================
In some situations, particularly at a distance, it can be a good idea to
**replace complex geometry with simpler versions**. The end user will probably
not be able to see much difference. Consider looking at a large number of trees
in the far distance. There are several strategies for replacing models at
varying distances. You could use lower-poly models or use transparency to
simulate more complex geometry.
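A simple manual LOD swap might look like this (the node names ``HighPolyMesh``
and ``LowPolyMesh`` and the switch distance are hypothetical):

.. code-block:: gdscript

    # Manual level of detail: show the high-poly mesh up close and the
    # low-poly mesh beyond a threshold distance from the camera.
    extends Spatial

    const LOD_DISTANCE = 50.0

    onready var high_poly = $HighPolyMesh
    onready var low_poly = $LowPolyMesh

    func _process(_delta):
        var camera = get_viewport().get_camera()
        if camera == null:
            return
        var distance = global_transform.origin.distance_to(
                camera.global_transform.origin)
        high_poly.visible = distance < LOD_DISTANCE
        low_poly.visible = distance >= LOD_DISTANCE

In practice, you would avoid doing this check every frame for every object
(for example, by spreading the checks across frames or using a timer).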
Billboards and imposters
~~~~~~~~~~~~~~~~~~~~~~~~
Lighting objects is one of the most costly rendering operations. Realtime
lighting, shadows (especially multiple lights), and GI are especially expensive.
They may simply be too much for lower power mobile devices to handle.
**Consider using baked lighting**, especially for mobile. This can look fantastic,
but has the downside that it will not be dynamic. Sometimes, this is a trade-off
worth making.
In general, if several lights need to affect a scene, it's best to use
:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
indirect light bounces.
Animation and skinning
======================
Animation and vertex animation such as skinning and morphing can be very
expensive on some platforms. You may need to lower the polycount considerably
for animated models or limit the number of them on screen at any one time.
Large worlds
============
Large worlds may need to be built in tiles that can be loaded on demand as you
move around the world. This can prevent memory use from getting out of hand, and
also limit the processing needed to the local area.
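A sketch of on-demand tile loading (the ``res://tiles/tile_X_Y.tscn`` naming
scheme, tile size, and load radius are all assumptions for illustration):

.. code-block:: gdscript

    # Keep only the tiles around the player loaded.
    extends Spatial

    const TILE_SIZE = 100.0
    const LOAD_RADIUS = 1  # Load a 3x3 block of tiles around the player.

    var loaded_tiles = {}

    func update_tiles(player_position):
        var center_x = int(floor(player_position.x / TILE_SIZE))
        var center_z = int(floor(player_position.z / TILE_SIZE))
        var wanted = {}
        for x in range(center_x - LOAD_RADIUS, center_x + LOAD_RADIUS + 1):
            for z in range(center_z - LOAD_RADIUS, center_z + LOAD_RADIUS + 1):
                wanted[Vector2(x, z)] = true
        # Unload tiles that are now too far away.
        for key in loaded_tiles.keys():
            if not wanted.has(key):
                loaded_tiles[key].queue_free()
                loaded_tiles.erase(key)
        # Load newly needed tiles.
        for key in wanted.keys():
            if not loaded_tiles.has(key):
                var scene = load("res://tiles/tile_%d_%d.tscn" % [int(key.x), int(key.y)])
                var tile = scene.instance()
                tile.translation = Vector3(key.x * TILE_SIZE, 0, key.y * TILE_SIZE)
                add_child(tile)
                loaded_tiles[key] = tile

For large tiles, you would likely use background loading
(``ResourceInteractiveLoader`` or a loader thread) instead of a blocking
``load()`` call to avoid stuttering.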
There may also be rendering and physics glitches due to floating point error in
large worlds. You may be able to use techniques such as orienting the world
around the player (rather than the other way around), or shifting the origin
periodically to keep things centred around ``Vector3(0, 0, 0)``.
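Origin shifting can be sketched like this (the ``Player`` node path and the
threshold distance are assumptions; a real implementation would also need to
handle physics state and particle systems carefully):

.. code-block:: gdscript

    # Origin shifting: when the player drifts too far from the origin,
    # shift the whole world so the player is back near Vector3(0, 0, 0).
    extends Spatial

    const SHIFT_THRESHOLD = 5000.0

    onready var player = $Player

    func _physics_process(_delta):
        var offset = player.global_transform.origin
        if offset.length() > SHIFT_THRESHOLD:
            # Move every top-level Spatial (including the player) back by
            # the offset, preserving all relative positions.
            for child in get_children():
                if child is Spatial:
                    child.global_translate(-offset)

Because everything moves by the same amount, nothing appears to change from the
player's point of view, but coordinates stay small enough to keep floating
point precision high.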