Further Voxel World Optimisations
This article explains how we optimised chunk mesh generation down to 0.48ms per chunk.
This article is a continuation from the original article Voxel World Optimisations, which details the initial methods used to increase chunk initialisation time in our destructive first person shooter Sector's Edge.
In the previous article, the benchmark for initialising one chunk was 0.89 milliseconds. This article describes the changes made to reduce this time down to 0.48ms (54% faster).
If any terms used in this article are unfamiliar, please see the original article.
The full source code is available here.
For clarity, i, j and k
refer to chunk-relative positions (range 0-31)
and x, y and z
refer to global map positions (range 0-511)
Constants
These constants are referenced multiple times throughout the article:
const int EMPTY = 0;
const int CHUNK_SIZE = 32;
const int CHUNK_SIZE_SQUARED = 1024;
const int CHUNK_SIZE_CUBED = 32768;
const int CHUNK_SIZE_MINUS_ONE = 31;
const int CHUNK_SIZE_SHIFTED = 32 << 6;
GenerateMesh
The GenerateMesh
function iterates over each block in the chunk, combining faces across multiple blocks where possible to reduce the triangle count of the mesh. The first downfall of the original function is that all 32768 blocks in the data array are checked, when most may be empty.
To reduce the amount of blocks checked, we store two heightmaps for the bottom- and top-most non-empty blocks in each column in the chunk. These arrays are updated each time a block is added or removed from the chunk:
byte[] MinY = new byte[CHUNK_SIZE_SQUARED];
byte[] MaxY = new byte[CHUNK_SIZE_SQUARED];
To make use of this heightmap, we restructure the loop as follows:
int access;
// Z axis
for (int k = 0; k < CHUNK_SIZE; k++)
{
// X axis
for (int i = 0; i < CHUNK_SIZE; i++)
{
int j = MinY[i + k * CHUNK_SIZE_SQUARED];
int topJ = MaxY[i + k * CHUNK_SIZE_SQUARED];
// Y axis
for (; j < topJ; j++)
{
access = i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;
ref Block b = ref data[access];
if (b.kind == EMPTY)
continue;
CreateRun(ref b, i, j, k, chunkHelper, access);
}
}
// Extend the array if it is nearly full
if (vertexBuffer.used > vertexBuffer.data.Length - 2048)
vertexBuffer.Extend(2048);
}```
As the inner-most loop now iterates over the Y axis, we should restructure the way that we store block positions in data[]
so that we can access the next highest block in the column by incrementing access
by 1:
// Old
int access = i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;
// New
int access = j + i * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;
The next issue is that i * CHUNK_SIZE
, k * CHUNK_SIZE_SQUARED
and other values within CreateRun()
are redundantly calculated many times. Therefore where possible, any value that is calculated multiple times should be stored in a variable in the outer loop and referenced later on. This reduces the amount of multiplications from 65536 down to 1056, and the amount of additions from 197664 down to 135232 (worst case).
The new loop structure is as follows:
// Precalculate the map-relative Y position of the chunk in the map
int chunkY = chunkPosY * CHUNK_SIZE;
// Allocate variables on the stack
int access, heightMapAccess, iCS, kCS2, i1, k1, j, topJ;
bool minXEdge, maxXEdge, minZEdge, maxZEdge;
k1 = 1;
for (int k = 0; k < CHUNK_SIZE; k++, k1++)
{
// Calculate this once, rather than multiple times in the inner loop
kCS2 = k * CHUNK_SIZE_SQUARED;
i1 = 1;
heightMapAccess = k * CHUNK_SIZE;
// Is the current run on the Z- or Z+ edge of the chunk
minZEdge = k == 0;
maxZEdge = k == CHUNK_SIZE_MINUS_ONE;
for (int i = 0; i < CHUNK_SIZE; i++; i1++)
{
// Determine where to start the innermost loop
j = MinY[heightMapAccess];
topJ = MaxY[heightMapAccess];
heightMapAccess++;
// Calculate this once, rather than multiple times in the inner loop
iCS = i * CHUNK_SIZE;
// Calculate access here and increment it each time in the innermost loop
access = kCS2 + iCS + j;
// Is the current run on the X- or X+ edge of the chunk
minX = i == 0;
maxX = i == CHUNK_SIZE_MINUS_ONE;
// X and Z runs search upwards to create runs, so start at the bottom.
for (; j < topJ; j++, access++)
{
ref Block b = ref data[access];
if (b.kind != EMPTY)
{
CreateRun(ref b, i, j,
k << 12,
i1,
k1 << 12,
j + chunkY, access,
minX, maxX,
j == 0, j == CHUNK_SIZE_MINUS_ONE,
minZ, maxZ,
iCS, kCS2);
}
}
// Extend the array if it is nearly full
if (vertexBuffer.used > vertexBuffer.data.Length - 2048)
vertexBuffer.Extend(2048);
}
}
CreateRun
As the new CreateRun()
function has many changes, we will split it into sections. The first difference is the new parameters:
// New CreateRun function
void CreateRun(ref Block b, int i, int j,
int k,
int i1,
int k1,
int y, int access,
bool minX, bool maxX,
bool minY, bool maxY,
bool minZ, bool maxZ,
int iCS, int kCS2)
{
...
}
int i, int j
- chunk-relative position of the blockk
- chunk-relative position of the block shifted left 12 bitsint i1
- precalculatedi + 1
int k1
- precalculated(k + 1) << 12
int y
- map-relative y position of the blockint access
- current position in the data[] arraybool minX, bool maxX
- if the current run is on the X edge of the chunkbool minY, bool maxY
- if the current run is on the Y edge of the chunkbool minZ, bool maxZ
- if the current run is on the Z edge of the chunkint iCS
- precalculatedi * CHUNK_SIZE
int kCS2
- precalculatedk * CHUNK_SIZE_SQUARED
Precalculating
At the start of the function, we can see the following changes:
- The
BlockVertex
helper class now has aint[] indexToTextureShifted
array, which stores the texture values shifted left 18 bits. Since the texture and health values are the same for each face of the block, they can be bit-shifted and combined here rather than in theAppendQuad()
function - The
length
variable has been replaced withrunFinish
, which stores the chunk-relative position of the end of the run
int textureHealth16 = BlockVertex.indexToTextureShifted[b.index] | ((b.health / 16) << 23);
int runFinish;
int accessIncremented = access + 1;
int chunkAccess;
int j1 = j + 1;
int jS = j << 6;
int jS1 = j1 << 6;
Combining Faces
The next part combines faces upwards along the negative X face, with the following changes:
kCS2
is passed toVisibleFaceXN()
so that the function doesn't have to calculatek * CHUNK_SIZE_SQUARED
each timechunkAccess
has the default value ofaccessIncremented
, which is equivalent toj + i * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED + 1
. This value is then incremented at the end of the loop- Rather than creating an extra variable
q
for managing the loop, we can use therunFinish
variable. By initialising this variable toj1 << 6
and incrementing it by 64 (1 << 6 == 64
), we can save an extra addition and avoid bit-shifts throughout the entireCreateRun()
function AppendQuad()
has been split into three functionsAppendQuadX()
,AppendQuadY()
andAppendQuadZ()
, which are specifically for generating X, Y and Z facesInt3
structs are no longer passed toAppendQuad()
, therefore removing unnecessary memory allocation- The
FaceType
enum has been replaced withFaceTypeShifted
, which stores the same values shifted left 27 bits
// Left (X-)
if (!chunkHelper.visitXN[access] && VisibleFaceXN(j, access, minX, kCS2))
{
chunkHelper.visitXN[access] = true;
chunkAccess = accessIncremented;
for (runFinish = jS1; length < Constants.ChunkSizeShifted; length += 64)
{
if (DifferentBlock(chunkAccess, ref b))
break;
chunkHelper.visitXN[chunkAccess++] = true;
}
// k1 and k are already shifted
BlockVertex.AppendQuadX(vertexBuffer, i, jS, length, k1, k, (int)FaceTypeShifted.xn, textureHealth16);
}
AppendQuad
The new AppendQuadX
function contains no bit-shifts and fewer or operations:
// Stores the values of the old blockToTexture[] shifted left 18 bits
public static byte[] blockToTextureShifted;
public static void AppendQuadX(BlockVertexBuffer buffer, int x,
int jBottom, int jTop,
int kLeft, int kRight,
int normal, int textureHealth)
{
// Combine data that is common for each face
var shared = x |
textureHealth |
normal;
// Combine the shared data with the Y and Z position of the vertex
buffer.data[buffer.used] = jTop | kRight | shared;
buffer.data[buffer.used + 1] = buffer.data[buffer.used + 4] = jBottom | kRight | shared;
buffer.data[buffer.used + 2] = buffer.data[buffer.used + 3] = jTop | kLeft | shared;
buffer.data[buffer.used + 5] = jBottom | kKeft | shared;
buffer.used += 6;
}
Visible Face Checking
The new VisibleFaceXN
function uses the precalculated values min
and kCS2
and prevents excess faces being produced on the edge of the map:
protected bool VisibleFaceXN(int j, int k, int access, bool min, int kCS2)
{
// Access directly from a neighbouring chunk
if (min)
{
// If on the edge of the map, don't produce faces
if (chunkPosX == 0)
return false;
if (cXN == null)
return true;
return cXN.data[31 * CHUNK_SIZE + j + kCS2].kind == EMPTY;
}
// The block to the left can be accessed by subtracting CHUNK_SIZE
return data[access - CHUNK_SIZE].kind == EMPTY;
}```
Full Source Code
The full source code including OpenGL data buffering and shaders is available here.
See the code in action in our latest video: