optimisations Featured

Particle Optimisations

This article explains the optimisations made to the particle system in Sector's Edge that allow tens of thousands of particles to be ray-marched, buffered and rendered each frame.

Mitchell Robinson

Feb 21, 2020 • 10 min read

This article explains the optimisations made to the particle system in Sector's Edge that allow tens of thousands of particles to be ray-marched, buffered and rendered each frame.

Particles in Sector's Edge are small cubes that are created from explosions, block destruction, projectile trails, player damage, rain and other causes. The main features of the particle system are:

Updating particles across multiple threads
Ray-marching a voxel map
Writing directly to GPU memory across multiple threads
Fast matrix calculations
Fast creation and destruction of particles

The full source code for this article is available here on GitHub.

Terminology

The following terms are referenced often throughout the article:

Active Particle - a particle that is rendered and collides with the map
Particle lifetime - how many milliseconds a particle will be visible for
Particle decay - the process of a particle's lifetime reducing to zero over time
Decayed Particle - a particle that has completely decayed and moved to temporary storage for later re-use

Particle Storage Structure

Each thread separately manages its own linked list of particles. As hundreds of particles are created and destroyed each frame, using a linked list allows this to happen quickly.

When a particle is created it is in the active state, meaning it should be rendered and collide with the map. Once a particle's lifetime reaches zero, it has decayed and is moved to a separate linked list. When creating a new active particle, we re-use a decayed particle. This puts less pressure on the C# garbage collector.

const int MAX_THREAD_COUNT = 128;

// For clarity, any variable that is used by multiple threads is prefixed with t_
int t_ActiveParticleCount = new int[MAX_THREAD_COUNT];
Particle[] t_ActiveParticles = new Particle[MAX_THREAD_COUNT];
Particle[] t_DecayedParticles = new Particle[MAX_THREAD_COUNT];

Note the first particle in each linked list will not be rendered and is only used for managing the linked list.

The particle class has the following structure:

public class Particle
{
    // Base particle values
    public Vector3 position;
    public Vector3 velocity;
    public Vector3 rotation;
    public float scale;
    public uint colour;
    
    // As an optimisation, the transform matrix can be precalculated and stored here
    // (explained in the Matrix section below)
    public Matrix4F transform;
    
    // When this value reaches 0, the particle has decayed
    public float lifeTime;

    // The linked list
    public Particle Next;

    public void AddNext(Particle p)
    {
        if (Next == null)
        {
            // Start the linked list
            Next = p;
        }
        else
        {
            // Insert the particle at the start of the linked list
            p.Next = Next;
            Next = p;
        }
    }

    // Remove the particle at the start of the linked list
    public void PopNext()
    {
        Next = Next.Next;
    }
}

We use the ThreadPool to determine how many threads are available and therefore how many linked lists we should use:

void GetThreadCount()
{
    ThreadPool.GetMinThreads(out int count, out _);
    THREAD_COUNT = count;
    THREAD_COUNT_MINUS_ONE = count - 1;
}

int THREAD_COUNT;
int THREAD_COUNT_MINUS_ONE;

Particles must be distributed evenly across all threads, so we use the CurrentThreadAddIndex variable to keep track of which linked list the next particle should be inserted into:

int _ctai;
public int CurrentThreadAddIndex
{
    get
    {
        if (_ctai == THREAD_COUNT_MINUS_ONE)
            _ctai = 0;
        else
            _ctai++;

        return _ctai;
    }
}

When creating a new particle we will attempt to re-use a decayed particle to save time allocating memory:

protected void AddParticle(in Vector3 position, in Vector3 velocity, in float scale, in uint colour, in float lifeTime)
{
    int index = CurrentThreadAddIndex;

    var p = t_DecayedParticles[index].Next;

    // If there are no decayed particles available, allocate a new one
    if (p == null)
    {
        p = new Particle()
        {
            position = position,
            velocity = velocity,
            scale = scale,
            colour = colour,
            lifeTime = lifeTime,
        };
    }
    else
    {
        // Modify an existing decayed particle
        p.position = position;
        p.velocity = velocity;
        p.scale = scale;
        p.colour = colour;
        p.lifeTime = lifeTime;

        // Remove the particle from the decayed linked list
        t_DecayedParticles[index].PopNext();

        // Disconnect the particle from the decayed linked list
        p.Next = null;
    }

    // Add the particle to the current active linked list
    t_ActiveParticles[index].AddNext(p);

    // Keep track of how many particles are in this active linked list
    // so that we can allocate the correct buffer size on the GPU
    t_ActiveParticleCount[index]++;
}

Calculating and Buffering Particle Data

Vertex Data

As all particles share the same cube shape, we can render them using instancing with OpenGL. The base particle vertex has the following structure:

byte PositionX;
byte PositionY;
byte PositionZ;
byte Normal;

Two instance buffers are also used to store the colour (uint) and transformation (mat4) of each particle. The particle's colour is determined when it is created, but its transformation must be recalculated each frame based on its position, rotation and scale.

Calculating Particle Transformations

To calculate the particle's transformation, we have to create a position, rotationX, rotationZ and scale matrix and combine them together. Multiplying four matrices together requires 192 multiplications and 144 additions, however by combining these matrices on paper we have created our own 'Particle Matrix' that requires less operations:

public static Matrix4F ParticleMatrix(in float scale, in float rotationX, in float rotationZ, in Vector3 position)
{
    // Access sin/cos values from an array rather than with Math.Sin(...)
    GetSinAndCosCheap(rotationX, out float sinX, out float cosX);
    GetSinAndCosCheap(rotationZ, out float sinZ, out float cosZ);

    Matrix4F m = Matrix4F.Identity;

    m.M11 = scale * cosZ;
    m.M12 = scale * sinZ;

    m.M21 = -scale * cosX * sinZ;
    m.M22 = scale * cosX * cosZ;
    m.M23 = scale * sinX;

    m.M31 = scale * sinX * sinZ;
    m.M32 = -scale * sinX * cosZ;
    m.M33 = scale * cosX;

    m.M41 = (float)position.X;
    m.M42 = (float)position.Y;
    m.M43 = (float)position.Z;

    return m;
}

This particle matrix requires 14 multiplications and runs 5.83x faster than the original matrix multiplications. The two methods were benchmarked by calculating the final transformation matrix for 524288 random scale, rotationX, rotationZ and position values:

	Matrix Multiplication	Particle Matrix
Average	266.9ms	45.8ms
Worst	507.3ms	112.9ms
Best	251.0ms	41.9ms

Sin and Cos Precalculation

Accessing precalculated sine wave values from an array was found to be 2.07x faster than using Math.Sin(...):

// Decreasing this value will increase the accuracy of the precalculated sine wave
private const float epsilon = 0.001f;

private static float[] sinWave;

// Precalculate the sin wave
public static void InitialiseTrigonometry()
{
    int elements = (int)(Math.PI * 2 / epsilon) + 1;
    sinWave = new float[elements];

    int i = 0;
    for (double a = 0; a <= Math.PI * 2; a += epsilon)
    {
        sinWave[i] = (float)Math.Sin(a);
        i++;
    }
}

Both sine and cosine values can be accessed from this array. Cosine values are accessed from this array by shifting the array access to the right:

private const float minusHalfPI = (float)-Math.PI / 2;
private const float halfPI = (float)Math.PI / 2;
private const float doublePI = (float)Math.PI * 2;

// 1571 = PI / 2 / 0.0001
// 6283 = PI * 2 / 0.0001 (number of elements in the array)
public static void GetSinAndCosCheap(float t, out float sin, out float cos)
{
    if (t < minusHalfPI)
    {
        int access = (int)(-t % doublePI / epsilon);
        sin = -sinWave[access];
        cos = -sinWave[(access + 1571) % 6283];
    }
    else if (t < 0)
    {
        sin = -sinWave[(int)(-t % doublePI / epsilon)];
        cos = sinWave[(int)((t + halfPI) % doublePI / epsilon)];
    }
    else
    {
        int access = (int)(t % doublePI / epsilon);
        sin = sinWave[access];
        cos = sinWave[(access + 1571) % 6283];
    }
}

The two methods were benchmarked by calculating the sine and cosine values for 524288 random numbers in the range -50 to 50:

	Math.Sin/Cos	Array
Average	59.5ms	28.6ms
Worst	116.3ms	70.0ms
Best	49.3ms	21.7ms

Buffering Particle Data

Every frame the uint and Matrix4F instance data must be buffered to the GPU for rendering. Originally, this data was stored in memory and sent to the GPU with glSubBufferData(...), however this was found to be 3.3x slower than writing directly to shared GPU memory with glMapBuffer(...). The difference in speed comes from the fact that the original method involved writing to memory twice:

writing individual values to an array - which has inherit bounds checks
copying the entire array to shared memory with glSubBufferData(...)

The new method involves getting a pointer to shared memory with glMapBuffer(...) and writing directly to that pointer. The pointer can be shared by multiple threads as they are each writing to a different section of GPU memory.

// Get pointers to both VBO's data
ParticleColourBuffer.Bind();
var cPtrBase = Gl.MapBuffer(BufferTarget.ArrayBuffer, BufferAccess.ReadWrite);
ParticleMatrixBuffer.Bind();
var mPtrBase = Gl.MapBuffer(BufferTarget.ArrayBuffer, BufferAccess.ReadWrite);

unsafe
{
    // Convert IntPtr to actual pointers
    uint* cPtr = (uint*)cPtrBase.ToPointer();
    Matrix4F* mPtr = (Matrix4F*)mPtrBase.ToPointer();

    // Example write
    cPtr[0] = 100;
    mPtr[3] = ModelHelper.ParticleMatrix(...);
}

// Finish writing to both VBO's
ParticleColourBuffer.Bind();
Gl.UnmapBuffer(BufferTarget.ArrayBuffer);
ParticleMatrixBuffer.Bind();
Gl.UnmapBuffer(BufferTarget.ArrayBuffer);

Updating Particles

Checking collision detection with the map, removing decayed particles and updating particle transformation matrices is handled across multiple threads with Tasks. As each thread is writing the colours and transformations of each particle directly to GPU memory, we need to first calculate the pointer offset for each thread.

void UpdateParticles(double elapsed)
{
    int[] pointerOffsets = new int[THREAD_COUNT];
    int particleCount = 0;

    // Calculate the VBO write offset for each thread
    for (int i = 0; i < THREAD_COUNT; i++)
    {
        pointerOffsets[i] = particleCount;
        particleCount += t_ActiveParticleCount[i];
    }

    if (particleCount == 0)
        return;

    // Check there is enough memory allocated for the VBO
    ParticleMatrixBuffer.Expand(particleCount);
    ParticleColourBuffer.Expand(particleCount);

    var tasks = new List<Task>();
    
    ParticleColourBuffer.Bind();
    var cPtrBase = Gl.MapBuffer(BufferTarget.ArrayBuffer, BufferAccess.ReadWrite);
    ParticleMatrixBuffer.Bind();
    var mPtrBase = Gl.MapBuffer(BufferTarget.ArrayBuffer, BufferAccess.ReadWrite);

    unsafe
    {
        uint* cPtr = (uint*)cPtrBase.ToPointer();
        Matrix4F* mPtr = (Matrix4F*)mPtrBase.ToPointer();

        for (int i = 0; i < THREAD_COUNT; i++)
        {
            if (t_ActiveParticles[i].Next != null)
            {
                int index = i;
                int pointerOffset = pointerOffsets[i];
                
                tasks.Add(Task.Run(() => UpdateParticleRange(cPtr, mPtr, index, pointerOffset, elapsed)));
            }
        }
    }

    ParticleColourBuffer.Bind();
    Gl.UnmapBuffer(BufferTarget.ArrayBuffer);
    ParticleMatrixBuffer.Bind();
    Gl.UnmapBuffer(BufferTarget.ArrayBuffer);

    // Wait for all particles to be updated
    Task.WaitAll(tasks.ToArray());
}

We use ray marching to handle particle collision with our voxel map. The voxel ray marching algorithm we use is based on this paper by John Amanatides and Andrew Woo and has been optimised by keeping block lookups within the current working chunk. This algorithm is quite detailed and is explained in our other blog post here.

The UpdateParticleRange(...) function has four stages:

Precalculate common variables
Handle decayed particles
Update particles + ray-marching
Write particle data to the GPU

The first stage is fairly straightforward and involves precalculating common variables and array accesses that will be used when updating each particle:

unsafe void UpdateParticleRange(uint* cPtr, Matrix4F* mPtr, int index, int bufferOffset, double elapsed)
{
    Particle p = t_ActiveParticles[index];
    Particle q = p.Next;
            
    // Precalculate variables
    double thisGravity = Constants.Gravity * elapsed;
    double scaleMultiplier = Math.Pow(0.9, elapsed / 16d);
    float elapsedF = (float)elapsed;
    float elapsed1500 = elapsedF / 1500;

    // Allocate variables
    bool hit = false;
    Axis axis = Axis.None;
            
    // Dereference array accesses
    ref int particleCount = ref t_ActiveParticleCount[index];
    var temporaryParticles = t_DecayedParticles[index];

    while (q != null)
    {
        // Update particles
        ...
    }
}

To show that a particle is decaying, it will shrink during the last 1.5s of its lifetime. Once it is virtually invisible, we will move it to the decayed linked list. As this particle was included in the total active particle count at the time we were calculating pointer offsets in UpdateParticles(...), we must write Matrix4F.Hidden to the instance buffer to prevent the particle from rendering using a transform value from the last frame.

unsafe void UpdateParticleRange(uint* cPtr, Matrix4F* mPtr, int index, int bufferOffset, double elapsed)
{
    // Part 1
    ...
    
    while (q != null)
    {
        bool shrinking = false;

        // Particles shrink during the last 1.5s of their lifetime
        if (q.lifeTime < 1500)
        {
            q.scale -= elapsed1500;

            // If the particle is virtually invisible, remove it
            if (q.scale < 0.03)
            {
                // Hide the particle
                mPtr[bufferOffset++] = Matrix4F.Hidden;

                // Remove the particle from the active linked list
                p.Next = q.Next;
                q.Next = null;

                // Add the particle to temporary storage
                temporaryParticles.AddNext(q);
                particleCount--;

                // Get the next active particle
                q = p.Next;
                continue;
            }

            shrinking = true;
        }

        // Handle active particles
        ...
    }
}

Handling active particles involves ray-marching the particle and updating its position, velocity, rotation and lifetime.

unsafe void UpdateParticleRange(uint* cPtr, Matrix4F* mPtr, int index, int bufferOffset, double elapsed)
{
    // Part 1
    ...
    
    while (q != null)
    {
        // Part 2
        ...
        
        var elapsedVelocity = q.velocity * elapsed;
        var mag = elapsedVelocity.Magnitude;

        // Only raymarch if the particle is moving
        if (mag > 0.000001)
            map.RayMarch(q.position, elapsedVelocity, mag, ref hit, ref axis);
        else
            hit = false;

        if (hit)
        {
            // Reflect and dampen the particle's velocity based on which axis it collided on
            q.velocity *= Constants.AxisToBounce[(int)axis];
        }
        else if (!map.NoBlock((int)q.position.X, (int)q.position.Y, (int)q.position.Z))
        {
            // Slide to a stop along the ground
            q.velocity.X *= Constants.ParticleSlideDampening;
            q.velocity.Z *= Constants.ParticleSlideDampening;
        }
        else
        {
            q.velocity.Y += thisGravity;
        }

        // Recalculate elapsedVelocity
        elapsedVelocity = q.velocity * elapsed;

        q.rotation -= elapsedVelocity;
        q.lifeTime -= elapsedF;
        
        // Part 4
        ...
    }
}

As an optimisation, particle transformations are only calculated if the particle is shrinking or moving. Quite often particles are at rest on the ground before shrinking, so this optimisation saves us some time.

unsafe void UpdateParticleRange(uint* cPtr, Matrix4F* mPtr, int index, int bufferOffset, double elapsed)
{
    // Part 1
    ...
    
    while (q != null)
    {
        // Part 2 + 3
        ...
        
        // Only update the transform if the particle is shrinking or moving
        if (shrinking || elapsedVelocity.Magnitude > 0.000001)
        {
            q.position += elapsedVelocity;
            q.transform = ModelHelper.ParticleMatrix(q.scale, (float)q.rotation.X, (float)q.rotation.Z, q.position);
        }

        // Write the colour and transformation matrix directly to shared memory
        cPtr[bufferOffset] = q.colour;
        mPtr[bufferOffset++] = q.transform;

        // Advance the linked list
        q = q.Next;
        p = p.Next;
    }
}

Rendering Particles

Since all data has now been buffered to the GPU, rendering particles is quite straightforward:

protected void RenderParticles()
{
    // Each particle VBO and instance VBO shares the same VAO
    // Bind the VAO
    Gl.BindVertexArray(ParticleMatrixBuffer.arrayHandle);

    // Draw the particles
    Gl.DrawArraysInstanced(PrimitiveType.Triangles, 0, 36, TotalParticleCount());

    // Unbind the VAO
    Gl.BindVertexArray(0);
}

The vertex shader is as follows:

#version 330

uniform mat4 mvp;

layout (location = 0) in vec4 aPosition;
layout (location = 1) in vec4 aColour;
layout (location = 2) in mat4 aTransform;

flat out vec4 colour;

void main()
{
    gl_Position = mvp * aTransform * vec4(aPosition.xyz, 1.0);

    colour = aColour;

    int normal = int(aPosition.w);
	
    // Slightly darken the sides of the particle
    if (normal > 3)
        colour *= 0.8;
    else if (normal > 1)
        colour *= 0.9;
}

In this case, the normal is an index that represents the Y+, Y-, X+, X-, Z+ and Z- unit vectors. For advanced lighting, the vec3 normal can be determined using either branching or an array access, however const vec3 arrays have poor performance on some GPUs.

int n = int(aPosition.w);
vec3 normal;

// Option 1 - branching
if (n == 0)
    normal = vec3(0, 1, 0);
else if (n == 1)
    normal = vec3(0, -1, 0);
else if (n == 2)
    normal = vec3(1, 0, 0);
else if (n == 3)
    normal = vec3(-1, 0, 0);
else if (n == 4)
    normal = vec3(0, 0, 1);
else
    normal = vec3(0, 0, -1);
   
// Option 2 - array access
const vec3 NORMALS = ( vec3(0,1,0), vec3(0,-1,0), vec3(1,0,0), vec3(-1,0,0), vec3(0,0,1), vec3(0,0,-1) );
normal = NORMALS[n];

// Rotate the normal
vec3 normal = normalize(mat3(transpose(inverse(aTransform))) * N);

The fragment shader then writes the particle colour to the framebuffer:

#version 330

out vec4 gColor;

flat in vec4 colour;

void main()
{
    gColor = colour;
}

In Practice

On a Ryzen 5 1600 CPU and GTX 960 GPU, the game can handle 33,000 particles before dropping below 60FPS.