Voxel World Optimisations

This article details the voxel mesh generation technique used in our game Sector's Edge.

Sector's Edge is a first-person shooter that takes place in a voxel environment, where the world is composed of millions of destructible cubes. With 16 players per match modifying the world, chunk meshes must be regenerated nearly every frame. This is a CPU-intensive process, and this article dives into the optimisations that make it possible.

Our two goals for this engine are:

  • fast mesh regeneration balanced with low triangle count
  • fast vertex processing on the GPU

The second topic will be available in a separate article soon.

The full source code for this article is available here on GitHub.

Update: a new article has been posted that describes the further optimisations made to reduce chunk mesh initialisation time down to 0.48ms.

Map Structure

The map is divided into chunks of 32 x 32 x 32 blocks. Each chunk has its own mesh - a list of triangles - which must be regenerated each time a block is added, removed or damaged. Block data is stored in the following structure:

public class Map
{
    Chunk[,,] chunks;
}

public class Chunk
{
    const int CHUNK_SIZE = 32;
    const int CHUNK_SIZE_SQUARED = 1024;
    const int CHUNK_SIZE_CUBED = 32768;
    Block[] data = new Block[CHUNK_SIZE_CUBED];
}

public struct Block
{
    byte kind;  // Block type, e.g. empty, dirt, metal, etc.
    byte health;
}
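
Block data is stored flat rather than as a 3-D array, indexed as i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED throughout the article. As a standalone sketch (these helper names are ours, not engine code), the conversion works in both directions:

```csharp
const int CHUNK_SIZE = 32;
const int CHUNK_SIZE_SQUARED = CHUNK_SIZE * CHUNK_SIZE;

// Flatten chunk-relative coordinates (0-31 each) into a single array index
static int ToIndex(int i, int j, int k)
    => i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;

// Recover (i, j, k) from a flat index
static (int i, int j, int k) FromIndex(int access)
    => (access % CHUNK_SIZE,
        access / CHUNK_SIZE % CHUNK_SIZE,
        access / CHUNK_SIZE_SQUARED);
```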

Greedy meshing is a popular algorithm for producing a mesh with a low triangle count; however, it requires many block comparisons and in our case could not regenerate chunks fast enough. In our approach, blocks are instead combined into runs along the X and Y axes:

Actual image (left), underlying mesh (right)

To reduce triangle count, a run will not be split if it is covered by a block. Runs are only split if the next block has a different texture or is damaged:

Actual image (top), underlying mesh (bottom)

This allows chunk meshes to be regenerated with fewer block comparisons while still combining faces to reduce the triangle count.

Mesh Generation

The full Chunk class has the following structure:

public class Chunk
{
    const int CHUNK_SIZE = 32;
    const int CHUNK_SIZE_SQUARED = 1024;
    const int CHUNK_SIZE_CUBED = 32768;
    const int EMPTY = 0;
    
    Block[] data = new Block[CHUNK_SIZE_CUBED];
    
    // Global map position of this chunk
    public int chunkX, chunkY, chunkZ;
    
    // Index of this chunk within map.chunks
    public int chunkPosX, chunkPosY, chunkPosZ;
    
    // Each chunk has a reference to the map it is located in, so it can access
    // blocks in other chunks
    Map map;
    
    // Stores vertex data and manages buffering to the GPU
    BlockVertexBuffer vertexBuffer;
    
    // References to neighbouring chunks
    Chunk cXN, cXP, cYN, cYP, cZN, cZP;
    
    public Chunk(int i, int j, int k, Map map)
    {
        chunkPosX = i;
        chunkPosY = j;
        chunkPosZ = k;
        chunkX = i * CHUNK_SIZE;
        chunkY = j * CHUNK_SIZE;
        chunkZ = k * CHUNK_SIZE;
        this.map = map;
    }
    
    public void GenerateMesh(...) ...
    void CreateRuns(...) ...
}

For clarity, i, j and k refer to chunk-relative positions (range 0-31)
and x, y and z refer to global map positions (range 0-511)
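
Since CHUNK_SIZE is 32, a global coordinate splits into a chunk index and a chunk-relative position with one shift and one mask, which is how the map code accesses blocks later in this article. A minimal standalone illustration (the helper name is ours):

```csharp
const int SHIFT = 5;    // log2(CHUNK_SIZE)
const int MASK = 0x1f;  // CHUNK_SIZE - 1

// Split a global map coordinate into a chunk index and a chunk-relative position
static (int chunk, int local) Split(int x) => (x >> SHIFT, x & MASK);
```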

To start, GenerateMesh gets references to chunk neighbours for quick edge-case block comparisons. It then iterates over each block in the chunk and creates runs.

public void GenerateMesh(ChunkHelper chunkHelper)
{
    // X- neighbouring chunk
    cXN = chunkPosX > 0 ? map.chunks[chunkPosX - 1, chunkPosY, chunkPosZ] : null;

    // X+ neighbouring chunk
    cXP = chunkPosX < Constants.ChunkXAmount - 1 ? map.chunks[chunkPosX + 1, chunkPosY, chunkPosZ] : null;

    // Y- neighbouring chunk
    cYN = chunkPosY > 0 ? map.chunks[chunkPosX, chunkPosY - 1, chunkPosZ] : null;

    // Y+ neighbouring chunk
    cYP = chunkPosY < Constants.ChunkYAmount - 1 ? map.chunks[chunkPosX, chunkPosY + 1, chunkPosZ] : null;

    // Z- neighbouring chunk
    cZN = chunkPosZ > 0 ? map.chunks[chunkPosX, chunkPosY, chunkPosZ - 1] : null;
    
    // Z+ neighbouring chunk
    cZP = chunkPosZ < Constants.ChunkZAmount - 1 ? map.chunks[chunkPosX, chunkPosY, chunkPosZ + 1] : null;
    
    int access;
    // Y axis - start from the bottom and search up
    for (int j = 0; j < CHUNK_SIZE; j++)
    {
        // Z axis
        for (int k = 0; k < CHUNK_SIZE; k++)
        {
            // X axis
            for (int i = 0; i < CHUNK_SIZE; i++)
            {
                access = i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;
                ref Block b = ref data[access];

                if (b.kind == EMPTY)
                    continue;

                CreateRun(ref b, i, j, k, chunkHelper, access);
            }

            // Extend the array if it is nearly full
            if (vertexBuffer.used > vertexBuffer.data.Length - 2048)
                vertexBuffer.Extend(2048);
        }
    }
}

The ChunkHelper class is used to record which faces have been merged, to prevent creating duplicate faces. Each chunk mesh is regenerated on a different thread and one ChunkHelper instance is allocated to each thread.

public class ChunkHelper
{
    public bool[] visitedXN = new bool[CHUNK_SIZE_CUBED];
    public bool[] visitedXP = new bool[CHUNK_SIZE_CUBED];
    public bool[] visitedZN = new bool[CHUNK_SIZE_CUBED];
    public bool[] visitedZP = new bool[CHUNK_SIZE_CUBED];
    public bool[] visitedYN = new bool[CHUNK_SIZE_CUBED];
    public bool[] visitedYP = new bool[CHUNK_SIZE_CUBED];
}

Separate runs are created for each block face orientation (X-, X+, Z-, Z+, Y-, Y+). For each block in the combined face, a flag is set in the respective array in ChunkHelper to indicate it has been merged.

void CreateRun(ref Block b, int i, int j, int k, ChunkHelper chunkHelper, int access)
{
    // Precalculate variables
    int i1 = i + 1;
    int j1 = j + 1;
    int k1 = k + 1;
    byte health16 = (byte)(b.health / 16);  // Quantise health to 4 bits for vertex packing
    
    int length = 0;
    int chunkAccess = 0;

    // Left face (X-)
    if (!chunkHelper.visitedXN[access] && VisibleFaceXN(i - 1, j, k))
    {
        // Search upwards to determine run length
        for (int q = j; q < CHUNK_SIZE; q++)
        {
            // Pre-calculate the array lookup as it is used twice
            chunkAccess = i + q * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;
            
            // If we reach a different block or an empty block, end the run
            if (DifferentBlock(chunkAccess, b))
                break;

            // Store that we have visited this block
            chunkHelper.visitedXN[chunkAccess] = true;

            length++;
        }


        if (length > 0)
        {
            // Create a quad and write it directly to the vertex buffer
            BlockVertex.AppendQuad(vertexBuffer, new Int3(i, length + j, k1),
                                                new Int3(i, length + j, k),
                                                new Int3(i, j,          k1),
                                                new Int3(i, j,          k),
                                                (byte)FaceType.xn, b.kind, health16);

            vertexBuffer.used += 6;
        }
    }
    
    // Same algorithm for right (X+)
    if (!chunkHelper.visitedXP[access] && VisibleFaceXP(i1, j, k))
    {
        ...
    }

    // Same algorithm for back (Z-)
    ...
    
    // Same algorithm for front (Z+)
    ...
    
    // Same algorithm for bottom (Y-)
    ...

    // Same algorithm for top (Y+)
    ...
}
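
The run search can be illustrated in isolation. This standalone sketch (simplified types, not engine code) scans a 1-D column of (kind, health) pairs and counts consecutive matching blocks, mirroring the X- loop above:

```csharp
// Count how many consecutive blocks in a column match the starting block.
// A run ends at the first block with a different kind or damage level.
static int RunLength((byte kind, byte health)[] column, int start)
{
    int length = 0;
    for (int q = start; q < column.Length; q++)
    {
        if (column[q].kind != column[start].kind ||
            column[q].health != column[start].health)
            break;

        length++;
    }
    return length;
}
```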

The VisibleFace functions check whether the neighbouring block is empty, and therefore whether the face is visible. Most comparisons are within the working chunk, so we access block data directly within the chunk where possible.

bool VisibleFaceXN(int i, int j, int k)
{
    // Access directly from a neighbouring chunk
    if (i < 0)
    {
        if (cXN == null)
            return true;

        return cXN.data[31 + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED].kind == EMPTY;
    }

    return data[i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED].kind == EMPTY;
}

bool VisibleFaceXP(int i, int j, int k)
{
    if (i >= CHUNK_SIZE)
    {
        if (cXP == null)
            return true;

        return cXP.data[0 + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED].kind == EMPTY;
    }

    return data[i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED].kind == EMPTY;
}

DifferentBlock is used to determine whether a block is different to the current run. Runs do not span across multiple chunks and therefore we do not need bounds checks when accessing the block.

protected bool DifferentBlock(int access, Block current)
{
    // Using a ref variable here increased performance by ~4%
    ref var b = ref data[access];
    return b.kind != current.kind || b.health != current.health;
}

The AppendQuad function creates 6 vertices (two triangles) and writes them directly to the vertex buffer.

public static byte[] blockToTexture;
        
public static void AppendQuad(BlockVertexBuffer buffer, Int3 tl, Int3 tr, Int3 bl, 
                              Int3 br, byte normal, byte kind, byte health)
{
    // Each block kind maps to a texture unit on the GPU
    byte texture = blockToTexture[kind];

    // Each vertex is packed into one uint to reduce GPU memory overhead
    // 18 bits for position XYZ
    // 5 bits for texture unit
    // 4 bits for health
    // 3 bits for normal
    
    // Each vertex in a quad shares the same texture unit, health and normal value
    uint shared = (uint)((texture & 31) << 18 |
                         (health  & 15) << 23 |
                         (normal  & 7)  << 27);

    // Top left vertex
    uint tlv = CombinePosition(tl, shared);
    
    // Top right vertex
    uint trv = CombinePosition(tr, shared);
    
    // Bottom left vertex
    uint blv = CombinePosition(bl, shared);
    
    // Bottom right vertex
    uint brv = CombinePosition(br, shared);

    // Store each vertex directly into the buffer
    buffer.data[buffer.used]     = tlv;
    buffer.data[buffer.used + 1] = blv;
    buffer.data[buffer.used + 2] = trv;
    buffer.data[buffer.used + 3] = trv;
    buffer.data[buffer.used + 4] = blv;
    buffer.data[buffer.used + 5] = brv;
}

// Combine position data with the shared uint
static uint CombinePosition(Int3 pos, uint shared)
{
    return (uint)(shared | 
                 ((uint)pos.X & 63) |
                 ((uint)pos.Y & 63) << 6 |
                 ((uint)pos.Z & 63) << 12);
}
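
To sanity-check the bit layout, a packed vertex can be unpacked again. This helper is ours for illustration only; in the engine, we assume this decoding happens on the GPU when the vertex is read:

```csharp
// Bit layout per vertex (matching AppendQuad and CombinePosition above):
// bits 0-17 position (6 bits per axis), 18-22 texture, 23-26 health, 27-29 normal
static (int x, int y, int z, int tex, int health, int normal) Unpack(uint v)
    => ((int)(v & 63),
        (int)((v >> 6) & 63),
        (int)((v >> 12) & 63),
        (int)((v >> 18) & 31),
        (int)((v >> 23) & 15),
        (int)((v >> 27) & 7));
```

Each position component gets 6 bits because a run's far corner can sit at coordinate 32, one past the 0-31 block range.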

Benchmarking

To test our optimisations, the same map was initialised three times in a row and the time taken to run GenerateMesh for each chunk was recorded. This was done single-threaded to ensure we were only measuring our function, not task scheduling.

Our initial benchmark was:
807 chunks initialised in 4158ms
1 chunk initialised in 5.15ms (average)

After optimisations:
807 chunks initialised in 722ms
1 chunk initialised in 0.89ms (average)

How this 5.7x increase in speed was achieved is outlined below.

Optimisations

Inlining (3% speed increase)

Apart from GenerateMesh, every method is inlined using this MethodImpl attribute:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
void CreateRun(...) ...

Pre-Calculating and Ref Locals (5% speed increase)

Every multiplication, addition and memory lookup affects performance, so we pre-calculate the common operations.

For example at the start of the CreateRun function:

void CreateRun(ref Block b, int i, int j, int k, ChunkHelper chunkHelper, int access)
{
    int i1 = i + 1;
    int j1 = j + 1;
    int k1 = k + 1;
    byte health16 = (byte)(b.health / 16);
    
    ...
}

In the BlockVertex class, we reduced 24 bitwise operations down to 19:

// Old - 24 bitwise operations per quad
static uint toInt(Int3 pos, byte texID, byte health, byte normal)
{
    return (uint)((pos.X  & 63) |
                  (pos.Y  & 63) << 6 |
                  (pos.Z  & 63) << 12 |
                  (texID  & 31) << 18 |
                  (health & 15) << 23 |
                  (normal & 7) << 27);
}

// New - 19 bitwise operations per quad
public static void AppendQuad(...)
{
    var shared = (uint)((t      & 31) << 18 |
                        (health & 15) << 23 |
                        (normal & 7)  << 27);
                        
    uint tlv = CombinePosition(tl, shared);
    uint trv = CombinePosition(tr, shared);
    uint blv = CombinePosition(bl, shared);
    uint brv = CombinePosition(br, shared);
    
    ...                        
}

static uint CombinePosition(Int3 pos, uint shared)
{
    return (uint)(shared | 
                  ((uint)pos.X & 63) |
                  ((uint)pos.Y & 63) << 6 |
                  ((uint)pos.Z & 63) << 12);
}

Ref locals are used where possible to avoid copying structs onto the stack and avoid multiple array lookups:

protected bool DifferentBlock(int i, int j, int k, Block compare)
{
    ref var b = ref data[i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED];
    return b.kind != compare.kind || b.health != compare.health;
}

Neighbouring Chunks (10% speed increase)

By storing references to neighbouring chunks at the start of GenerateMesh, we saved 32 × 32 = 1024 expensive Map.IsNoBlock function calls per face of the chunk.

// Old
bool VisibleFace(int i, int j, int k)
{
    // Access from global map coordinates
    if (i < 0 || j < 0 || k < 0 ||
        i >= CHUNK_SIZE || j >= CHUNK_SIZE || k >= CHUNK_SIZE)
    {
        return map.IsNoBlock(i + chunkX, j + chunkY, k + chunkZ);
    }

    return data[i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED].kind == EMPTY;
}

// New
bool VisibleFaceXN(int i, int j, int k)
{
    // Access directly from neighbouring chunk
    if (i < 0)
    {
        if (cXN == null)
            return true;

        return cXN.data[31 + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED].kind == EMPTY;
    }

    return data[i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED].kind == EMPTY;
}

// Same for VisibleFaceXP, VisibleFaceZN, ...
...

Single-Dimensional Arrays (20% speed increase)

Changing the block data array in the Chunk class from a multi-dimensional array to a single-dimensional array provided a noticeable improvement in performance:

// Old
Block[,,] data;
Block b = data[i,j,k];

// New
Block[] data;
Block b = data[i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED];

The same was done with the ChunkHelper arrays:

// Old
bool[,,] visitedXN = new bool[CHUNK_SIZE, CHUNK_SIZE, CHUNK_SIZE];
bool b = visitedXN[i,j,k];

// New
bool[] visitedXN = new bool[CHUNK_SIZE_CUBED];
bool b = visitedXN[i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED];

This performance increase comes from the fact that CHUNK_SIZE is a power of two, so the C# compiler can simplify the multiplications to bit shifts:

// Old
data[i, j, k];

// Underlying array access produced by the compiler, which cannot be
// simplified because the dimensions are not compile-time constants
data[i + j * data.GetLength(0) + k * data.GetLength(0) * data.GetLength(1)];

// New
data[i + j * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED];

// Simplified array access produced by the compiler
data[i + (j << 5) + (k << 10)];
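
This equivalence is easy to verify in a standalone sketch (illustrative helpers, not engine code):

```csharp
// Multiplication form, as written in the source
static int MulIndex(int i, int j, int k) => i + j * 32 + k * 1024;

// Shift form the compiler can reduce it to, since 32 and 1024 are powers of two
static int ShiftIndex(int i, int j, int k) => i + (j << 5) + (k << 10);
```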

At times, we are also able to pre-calculate the single-dimensional array lookup, as seen in the CreateRun function:

for (int q = j; q < CHUNK_SIZE; q++)
{
    // Pre-calculate the array lookup as it is used twice
    chunkAccess = i + q * CHUNK_SIZE + k * CHUNK_SIZE_SQUARED;

    if (DifferentBlock(chunkAccess, b))
        break;

    chunkHelper.visitedXN[chunkAccess] = true;
    
    ...
}

Garbage Collection (10% speed increase)

Each array in ChunkHelper must be reset to false before being re-used by another chunk. Initially we re-allocated new boolean arrays each time, however this put excessive pressure on the garbage collector. Array.Clear is a much better solution and quickly resets each array back to false:

// Old
public void Reset()
{
    visitedXN = new bool[CHUNK_SIZE_CUBED];
    
    // Same for X+, Y-, Y+, Z- and Z+
    ...
}

// New
public void Reset()
{
    Array.Clear(visitedXN, 0, CHUNK_SIZE_CUBED);
    
    // Same for X+, Y-, Y+, Z- and Z+
    ...
}

Memory (20% speed increase)

Initially, AppendQuad returned an array of 6 uints, which was then copied to a temporary buffer stored in the ChunkHelper. In the GenerateMesh function we would then check whether this buffer had exceeded a certain size and copy it to the vertex buffer. This involved a lot of Array.Copy calls.

Instead, AppendQuad is now passed the vertex buffer as a parameter and writes to it directly. This removes many expensive Array.Copy calls, however we need to ensure we do not overflow the vertex buffer:

public void GenerateMesh(...)
{
    ...    
    // Extend the array if it is nearly full
    if (vertexBuffer.used > vertexBuffer.data.Length - 2048)
        vertexBuffer.Extend(2048);        
    ...
}
    
// Function inside BlockVertexBuffer:
class BlockVertexBuffer
{
    ...    
    public void Extend(int amount)
    {
        uint[] newData = new uint[data.Length + amount];
        Array.Copy(data, newData, data.Length);
        data = newData;
    }    
    ...
}
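
Extend above grows the buffer by a fixed amount. A common alternative (hypothetical here, not what the engine uses) is geometric growth, which keeps the total copying cost amortised O(1) per element across many extensions:

```csharp
using System;

// Hypothetical geometric-growth variant of Extend: doubling the capacity
// means each element is copied a constant number of times on average,
// at the cost of potentially over-allocating memory
static uint[] Extend(uint[] data, int minExtra)
{
    int newSize = Math.Max(data.Length * 2, data.Length + minExtra);
    uint[] newData = new uint[newSize];
    Array.Copy(data, newData, data.Length);
    return newData;
}
```

For this engine the fixed 2048 increment is a reasonable trade-off, since quad counts per chunk are bounded and predictable.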

Block Access (20% speed increase)

Initially, blocks were only accessed in the map class by their global map positions. We reduced our chunk initialisation time by accessing blocks directly within the chunk where possible.

IsNoBlock is still used and has been optimised using bit shifts and bit masks:

public class Map
{
    Chunk[,,] chunks;
    const byte SHIFT = 5;
    const int MASK = 0x1f;
    const int MAP_SIZE_X = 512;
    const int MAP_SIZE_Y = 160;
    const int MAP_SIZE_Z = 512;
    
    // Returns true if there is no block at this global map position
    public bool IsNoBlock(int x, int y, int z)
    {
        // If it is at the edge of the map, return true
        if (x < 0 || x >= MAP_SIZE_X ||
            y < 0 || y >= MAP_SIZE_Y ||
            z < 0 || z >= MAP_SIZE_Z)
            return true;

        // Chunk accessed quickly using bitwise shifts
        var c = chunks[x >> SHIFT, y >> SHIFT, z >> SHIFT];

        // To lower memory usage, a chunk is null if it has no blocks
        if (c == null)
            return true;

        // Chunk data accessed quickly using bit masks        
        return c.data[(x & MASK) + (y & MASK) * CHUNK_SIZE + (z & MASK) * CHUNK_SIZE_SQUARED].kind == EMPTY;
    }
}

In Practice

This engine was created from scratch for our first-person shooter Sector's Edge. We chose not to use an existing game engine because we wanted to fine-tune every aspect of the rendering pipeline. This allows the game to run at well over 60 frames per second, even on older hardware.

The full source code for this article is available here on GitHub.

If you have any questions, feel free to reach out to me on our Discord server.

I have also written a breakdown of each stage in the rendering and post-processing pipeline, which is available here.