voxel Featured

Further Voxel World Optimisations

This article explains how we optimised chunk mesh generation down to 0.48ms per chunk.

Mitchell Robinson

Jan 11, 2020 • 5 min read

This article is a continuation from the original article Voxel World Optimisations, which details the initial methods used to increase chunk initialisation time in our destructive first person shooter Sector's Edge.

In the previous article, the benchmark for initialising one chunk was 0.89 milliseconds. This article describes the changes made to reduce this time down to 0.48ms (54% faster).

If any terms used in this article are unfamiliar, please see the original article.

The full source code is available here.

For clarity, i, j and k refer to chunk-relative positions (range 0-31)
and x, y and z refer to global map positions (range 0-511)

Constants

These constants are referenced multiple times throughout the article:

const int EMPTY = 0;
const int CHUNK_SIZE = 32;
const int CHUNK_SIZE_SQUARED = 1024;
const int CHUNK_SIZE_CUBED = 32768;
const int CHUNK_SIZE_MINUS_ONE = 31;
const int CHUNK_SIZE_SHIFTED = 32 << 6;

GenerateMesh

The GenerateMesh function iterates over each block in the chunk, combining faces across multiple blocks where possible to reduce the triangle count of the mesh. The first downfall of the original function is that all 32768 blocks in the data array are checked, when most may be empty.

To reduce the amount of blocks checked, we store two heightmaps for the bottom- and top-most non-empty blocks in each column in the chunk. These arrays are updated each time a block is added or removed from the chunk:

byte[] MinY = new byte[CHUNK_SIZE_SQUARED];
byte[] MaxY = new byte[CHUNK_SIZE_SQUARED];

To make use of this heightmap, we restructure the loop as follows:

int access;

// Z axis
for (int k = 0; k < CHUNK_SIZE; k++)
{
    // X axis
    for (int i = 0; i < CHUNK_SIZE; i++)
    {
        int j =    MinY[i + k * CHUNK_SIZE_SQUARED];
        int topJ = MaxY[i + k * CHUNK_SIZE_SQUARED];
        
        // Y axis
        for (; j < topJ; j++)
        {
            access = i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;
            ref Block b = ref data[access];

            if (b.kind == EMPTY)
                continue;

            CreateRun(ref b, i, j, k, chunkHelper, access);
        }
    }

    // Extend the array if it is nearly full
    if (vertexBuffer.used > vertexBuffer.data.Length - 2048)
        vertexBuffer.Extend(2048);

}```

As the inner-most loop now iterates over the Y axis, we should restructure the way that we store block positions in data[] so that we can access the next highest block in the column by incrementing access by 1:

// Old
int access = i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;

// New
int access = j + i * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;

The next issue is that i * CHUNK_SIZE, k * CHUNK_SIZE_SQUARED and other values within CreateRun() are redundantly calculated many times. Therefore where possible, any value that is calculated multiple times should be stored in a variable in the outer loop and referenced later on. This reduces the amount of multiplications from 65536 down to 1056, and the amount of additions from 197664 down to 135232 (worst case).

The new loop structure is as follows:

// Precalculate the map-relative Y position of the chunk in the map
int chunkY = chunkPosY * CHUNK_SIZE;

// Allocate variables on the stack
int access, heightMapAccess, iCS, kCS2, i1, k1, j, topJ;
bool minXEdge, maxXEdge, minZEdge, maxZEdge;

k1 = 1;

for (int k = 0; k < CHUNK_SIZE; k++, k1++)
{
    // Calculate this once, rather than multiple times in the inner loop
    kCS2 = k * CHUNK_SIZE_SQUARED;

    i1 = 1;
    heightMapAccess = k * CHUNK_SIZE;
    
    // Is the current run on the Z- or Z+ edge of the chunk
    minZEdge = k == 0;
    maxZEdge = k == CHUNK_SIZE_MINUS_ONE;

    for (int i = 0; i < CHUNK_SIZE; i++; i1++)
    {
        // Determine where to start the innermost loop
        j = MinY[heightMapAccess];
        topJ = MaxY[heightMapAccess];
        heightMapAccess++;
        
        // Calculate this once, rather than multiple times in the inner loop
        iCS = i * CHUNK_SIZE;
        
        // Calculate access here and increment it each time in the innermost loop
        access = kCS2 + iCS + j;

        // Is the current run on the X- or X+ edge of the chunk
        minX = i == 0;
        maxX = i == CHUNK_SIZE_MINUS_ONE;

        // X and Z runs search upwards to create runs, so start at the bottom.
        for (; j < topJ; j++, access++)
        {
            ref Block b = ref data[access];

            if (b.kind != EMPTY)
            {
                CreateRun(ref b, i, j,
                          k << 12,
                          i1,
                          k1 << 12,
                          j + chunkY, access,
                          minX, maxX,
                          j == 0, j == CHUNK_SIZE_MINUS_ONE,
                          minZ, maxZ,
                          iCS, kCS2);
            }
        }

        // Extend the array if it is nearly full
        if (vertexBuffer.used > vertexBuffer.data.Length - 2048)
            vertexBuffer.Extend(2048);
    }
}

CreateRun

As the new CreateRun() function has many changes, we will split it into sections. The first difference is the new parameters:

// New CreateRun function
void CreateRun(ref Block b, int i, int j,
               int k,
               int i1,
               int k1,
               int y, int access,
               bool minX, bool maxX,
               bool minY, bool maxY,
               bool minZ, bool maxZ,
               int iCS, int kCS2)
{
    ...
}

int i, int j - chunk-relative position of the block
k - chunk-relative position of the block shifted left 12 bits
int i1 - precalculated i + 1
int k1 - precalculated (k + 1) << 12
int y - map-relative y position of the block
int access - current position in the data[] array
bool minX, bool maxX - if the current run is on the X edge of the chunk
bool minY, bool maxY - if the current run is on the Y edge of the chunk
bool minZ, bool maxZ - if the current run is on the Z edge of the chunk
int iCS - precalculated i * CHUNK_SIZE
int kCS2 - precalculated k * CHUNK_SIZE_SQUARED

Precalculating

At the start of the function, we can see the following changes:

The BlockVertex helper class now has a int[] indexToTextureShifted array, which stores the texture values shifted left 18 bits. Since the texture and health values are the same for each face of the block, they can be bit-shifted and combined here rather than in the AppendQuad() function
The length variable has been replaced with runFinish, which stores the chunk-relative position of the end of the run

int textureHealth16 = BlockVertex.indexToTextureShifted[b.index] | ((b.health / 16) << 23);
int runFinish;
int accessIncremented = access + 1;
int chunkAccess;
int j1 = j + 1;
int jS = j << 6;
int jS1 = j1 << 6;

Combining Faces

The next part combines faces upwards along the negative X face, with the following changes:

kCS2 is passed to VisibleFaceXN() so that the function doesn't have to calculate k * CHUNK_SIZE_SQUARED each time
chunkAccess has the default value of accessIncremented, which is equivalent to
j + i * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED + 1. This value is then incremented at the end of the loop
Rather than creating an extra variable q for managing the loop, we can use the runFinish variable. By initialising this variable to j1 << 6 and incrementing it by 64 (1 << 6 == 64), we can save an extra addition and avoid bit-shifts throughout the entire CreateRun() function
AppendQuad() has been split into three functions AppendQuadX(), AppendQuadY() and AppendQuadZ(), which are specifically for generating X, Y and Z faces
Int3 structs are no longer passed to AppendQuad(), therefore removing unnecessary memory allocation
The FaceType enum has been replaced with FaceTypeShifted, which stores the same values shifted left 27 bits

// Left (X-)
if (!chunkHelper.visitXN[access] && VisibleFaceXN(j, access, minX, kCS2))
{
    chunkHelper.visitXN[access] = true;
    chunkAccess = accessIncremented;

    for (runFinish = jS1; length < Constants.ChunkSizeShifted; length += 64)
    {
        if (DifferentBlock(chunkAccess, ref b))
            break;

        chunkHelper.visitXN[chunkAccess++] = true;
    }

    // k1 and k are already shifted
    BlockVertex.AppendQuadX(vertexBuffer, i, jS, length, k1, k, (int)FaceTypeShifted.xn, textureHealth16);
}

AppendQuad

The new AppendQuadX function contains no bit-shifts and fewer or operations:

// Stores the values of the old blockToTexture[] shifted left 18 bits
public static byte[] blockToTextureShifted;
        
public static void AppendQuadX(BlockVertexBuffer buffer, int x,
                               int jBottom, int jTop,
                               int kLeft, int kRight, 
                               int normal, int textureHealth)
{
    // Combine data that is common for each face
    var shared = x             | 
                 textureHealth |
                 normal;

    // Combine the shared data with the Y and Z position of the vertex
    buffer.data[buffer.used] = jTop | kRight | shared;
    buffer.data[buffer.used + 1] = buffer.data[buffer.used + 4] = jBottom | kRight | shared;
    buffer.data[buffer.used + 2] = buffer.data[buffer.used + 3] = jTop | kLeft | shared;
    buffer.data[buffer.used + 5] = jBottom | kKeft | shared;

    buffer.used += 6;
}

Visible Face Checking

The new VisibleFaceXN function uses the precalculated values min and kCS2 and prevents excess faces being produced on the edge of the map:

protected bool VisibleFaceXN(int j, int k, int access, bool min, int kCS2)
{
    // Access directly from a neighbouring chunk
    if (min)
    {
        // If on the edge of the map, don't produce faces
        if (chunkPosX == 0)
            return false;

        if (cXN == null)
            return true;

        return cXN.data[31 * CHUNK_SIZE + j + kCS2].kind == EMPTY;
    }

    // The block to the left can be accessed by subtracting CHUNK_SIZE
    return data[access - CHUNK_SIZE].kind == EMPTY;
}```

Full Source Code

The full source code including OpenGL data buffering and shaders is available here.

See the code in action in our latest video: