
Introduction

GPU-based Particle System

A GPU-based particle system supporting 1M real-time particles with force-field interactions, collision detection, and bloom post-processing, built on DX11 instancing and a Compute Shader architecture.

Platform: PC
Engine: Custom Engine
Duration: 2025.12 - 2026.2
Role: Sole Developer

Key Features

  • GPU-driven particle simulation via DirectX 11 Compute Shaders — emit, update, and cull entirely on the GPU with no per-particle CPU readback

  • Ping-pong AliveList architecture for lock-free alive/dead particle routing across frames

  • Single-dispatch multi-emitter emit via prefix sum and binary search, allowing all active emitters to emit in one Compute pass

  • Definition / Instance separation — shared behavior templates instanced at runtime, with dirty-flag-driven partial uploads to minimize GPU bandwidth

  • Multiple force types — Point Force with distance-based falloff, Vortex Force for tangential rotation, Linear Force for constant acceleration, Curl Noise for divergence-free turbulence via numerical curl of a 3D Perlin potential field, and Drag for velocity damping

  • Collision detection via MRT collision texture — per-pixel material encoding enables constant-time collision queries at any scene complexity, with five configurable response types (None, Bounce, Die, Stick, Slow)

  • FloatCurve / ColorCurve system driving size, rotation speed, color, and force strength over particle lifetime

  • Texture2DArray with automatic POT normalization and per-texture UV scale, supporting variable-size textures in a single draw call

  • DrawIndexedIndirect rendering — particle count never touches the CPU between simulation and draw

  • Bloom post-processing via dual render target output and multi-pass Gaussian blur

Particle Simulation Pipeline

ResetCounters is the first GPU operation of every frame. It copies Counters[3] (aliveCountAfterSim), written by the previous frame's Update pass, into Counters[0] (aliveCount), then clears Counters[3] to zero. This gives the Update shader the valid length of AliveListIn before any reads begin.

Phase 1 — Update Emitters · CPU
  • Game logic updates the transform and state of all active emitters

  • The particle manager calculates how many particles each instance should emit this frame and accumulates the result into emissionAccumulator

  • Any modified definitions are uploaded to the GPU on demand via dirty flags

  • All instance runtime states are uploaded to the GPU in full
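The emissionAccumulator step in Phase 1 can be sketched on the CPU as follows. This is a minimal illustration, not the engine's code: the EmitterInstance struct, the emissionRate field, and the ComputeEmitCount helper are assumed names, and only emissionAccumulator comes from the description above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical CPU-side emitter state; only emissionAccumulator is named in the text.
struct EmitterInstance {
    float emissionRate = 0.0f;        // particles per second (assumed field)
    float emissionAccumulator = 0.0f; // fractional particles carried between frames
};

// Returns the whole number of particles to emit this frame and keeps the
// fractional remainder in the accumulator for the next frame.
inline std::uint32_t ComputeEmitCount(EmitterInstance& inst, float deltaSeconds) {
    assert(deltaSeconds >= 0.0f && inst.emissionRate >= 0.0f);
    inst.emissionAccumulator += inst.emissionRate * deltaSeconds;
    const float whole = std::floor(inst.emissionAccumulator);
    inst.emissionAccumulator -= whole;
    return static_cast<std::uint32_t>(whole);
}
```

Carrying the remainder forward keeps the long-run emission rate correct even when the per-frame count is not a whole number, for example 1.5 particles per frame alternating between 1 and 2.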

Phase 2 — Emit Particles · Compute Shader
  • Counters[3] (aliveCountAfterSim) is copied into Counters[0] (aliveCount), and Counters[3] is cleared to zero

  • The CPU iterates over all active instances, calculates the total emit count for this frame, and builds per-instance emission start offsets (prefix sum) into EmitOffsetsBuffer, then triggers a single Dispatch

  • Each thread performs a binary search on EmitOffsetsBuffer to locate its owning instance

  • A free slot is atomically popped from the DeadList, the particle is initialized from the corresponding Definition and Instance data, and the slot index is appended to AliveListIn

Phase 3 — Update · Compute Shader
  • Dispatched at maxParticles granularity: each thread reads a particle index from AliveListIn and returns early if its thread index is greater than or equal to Counters[0]

  • Valid threads run the full physics simulation and appearance update

  • Dead particles return their slot to the DeadList (Counters[1]++); surviving particles append their index to AliveListOut (Counters[3]++)

  • pingPongIndex ^= 1 swaps the roles of AliveListIn and AliveListOut for the next frame — no buffer clear required
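Phases 2 and 3 can be modeled on the CPU to make the counter flow easier to follow. The sketch below is a single-threaded stand-in under stated assumptions: the ParticlePool struct and its method names are hypothetical, a bare lifetime array replaces ParticleBuffer, and every counter change shown here would be an atomic operation in the real compute shaders.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Single-threaded CPU model of the GPU bookkeeping (hypothetical names).
struct ParticlePool {
    std::vector<float> lifetime;              // stand-in for ParticleBuffer
    std::vector<std::uint32_t> deadList;      // stack of free slots
    std::vector<std::uint32_t> aliveList[2];  // ping-pong pair
    std::uint32_t counters[4] = {0, 0, 0, 0}; // [0] alive, [1] dead, [2] reserved, [3] aliveAfterSim
    std::uint32_t pingPongIndex = 0;

    explicit ParticlePool(std::uint32_t maxParticles)
        : lifetime(maxParticles, 0.0f), deadList(maxParticles) {
        std::iota(deadList.begin(), deadList.end(), 0u); // all slots start free
        aliveList[0].resize(maxParticles);
        aliveList[1].resize(maxParticles);
        counters[1] = maxParticles;
    }

    void ResetCounters() { counters[0] = counters[3]; counters[3] = 0; }

    void Emit(std::uint32_t count, float life) {
        std::vector<std::uint32_t>& in = aliveList[pingPongIndex];
        for (std::uint32_t i = 0; i < count && counters[1] > 0; ++i) {
            const std::uint32_t slot = deadList[--counters[1]]; // pop a free slot
            assert(counters[0] < in.size());
            lifetime[slot] = life;
            in[counters[0]++] = slot;                           // append to AliveListIn
        }
    }

    void Update(float dt) {
        std::vector<std::uint32_t>& in = aliveList[pingPongIndex];
        std::vector<std::uint32_t>& out = aliveList[pingPongIndex ^ 1];
        for (std::uint32_t t = 0; t < counters[0]; ++t) {
            const std::uint32_t slot = in[t];
            lifetime[slot] -= dt;
            if (lifetime[slot] <= 0.0f) deadList[counters[1]++] = slot; // recycle the slot
            else out[counters[3]++] = slot;                             // survivor goes to AliveListOut
        }
        pingPongIndex ^= 1; // swap roles; stale entries are ignored because the counters bound every read
    }
};
```

The lists never need clearing because each pass only reads the first Counters[0] entries and overwrites the output list from index zero.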

Phase 4 — Prepare Render Data · Compute Shader
  • The WriteIndirectArgs shader reads Counters[3] and writes IndexCount = aliveCountAfterSim × 6 into IndirectArgsBuffer

  • The PopulateVBO shader iterates over AliveListOut and writes 4 vertices per surviving particle into the VBO

Phase 5 — Render · GPU Draw Call
  • DrawIndexedIndirect is issued using the parameters in IndirectArgsBuffer — the CPU never reads back or passes particle counts

Phase 6 — Post Processing · Compute Shader
  • Bloom is applied to the render output

  • The final result is blitted to screen

GPU Buffer Architecture

All simulation state in Nova2D lives entirely on the GPU. Every buffer is allocated once at startup — at runtime, the CPU never touches particle slot management directly. Its only roles are uploading constant parameters and triggering compute dispatches.

Buffers are organized into four functional groups:

1 Particle Pool
  • ParticleBuffer — RWStructuredBuffer<GPUParticle2D>, size maxParticles. Holds the full runtime data for every particle: position, velocity, remaining lifetime, color, size, rotation, behavior flags, owning instance ID, texture index, and more. Every slot is allocated upfront and persists regardless of whether the particle is alive or dead.

  • DeadListBuffer — RWStructuredBuffer<uint>, size maxParticles. An index array functioning as a concurrent stack of free slots, with Counters[1] acting as the stack pointer. Pre-filled at startup with [0, 1, 2, ..., N-1] — all slots begin as free.

  • CounterBuffer — RWStructuredBuffer<uint>, 4 elements. Maintains global bookkeeping via atomic operations:

    • [0] aliveCount — valid length of AliveListIn for the current frame, carried over from the previous frame's aliveCountAfterSim by ResetCounters

    • [1] deadCount — current stack depth of the DeadList; serves as both the free-slot count and the stack pointer for push/pop

    • [2] emitCount — a reserved field intended for tracking the number of particles emitted this frame. Currently not written or read by any shader, but available for CPU-side readback and statistics.

    • [3] aliveCountAfterSim — number of particles still alive after the Update pass; drives PopulateVBO dispatch count and IndirectArgs
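The DeadList stack behavior can be approximated on the CPU with std::atomic standing in for the HLSL interlocked intrinsics. This is only a sketch: the DeadList class and its Pop, Push, and FreeCount methods are hypothetical, and it assumes that pops (Emit) and pushes (Update) happen in separate passes, so the two never race on the same entries.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// CPU approximation of DeadListBuffer plus Counters[1] (hypothetical class).
class DeadList {
public:
    explicit DeadList(std::uint32_t maxParticles)
        : slots_(maxParticles), count_(maxParticles) {
        assert(maxParticles > 0);
        std::iota(slots_.begin(), slots_.end(), 0u); // [0, 1, ..., N-1], all free
    }

    // Emit pass: claim one free slot, or return false when the pool is exhausted.
    bool Pop(std::uint32_t& slot) {
        std::uint32_t old = count_.load();
        while (old != 0 && !count_.compare_exchange_weak(old, old - 1)) {
            // 'old' is refreshed on failure; retry until the decrement lands or the stack is empty.
        }
        if (old == 0) return false;
        slot = slots_[old - 1];
        return true;
    }

    // Update pass: return a dead particle's slot to the stack.
    void Push(std::uint32_t slot) {
        const std::uint32_t index = count_.fetch_add(1);
        assert(index < slots_.size());
        slots_[index] = slot;
    }

    std::uint32_t FreeCount() const { return count_.load(); }

private:
    std::vector<std::uint32_t> slots_;
    std::atomic<std::uint32_t> count_;
};
```

Mixing Pop and Push inside one pass would not be safe with this scheme, since a pop could read an entry whose push has reserved the index but not yet written the slot.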

2 Ping-Pong AliveList
  • AliveList A / AliveList B — each a RWStructuredBuffer<uint>, size maxParticles. Store the indices of live particles into ParticleBuffer.

    • Each frame, the two lists swap read/write roles via pingPongIndex ^= 1. At frame start, AliveListIn already holds all particles that survived the previous frame. The Emit shader then appends newly spawned particle indices into AliveListIn as well. The Update shader reads the full contents of AliveListIn — last frame's survivors plus this frame's newly emitted particles — simulates each one, and writes the indices of those that are still alive into AliveListOut. No buffer clears are ever needed.

3 Emitter Description Buffers
  • EmitterDefinitionBuffer — StructuredBuffer<Nova2DEmitterDefinitionGPU>, capacity 64. Stores the behavioral definition for each emitter type, covering the Emission, Motion, Appearance, and Collision modules along with FloatCurve data. Uploaded to the GPU only when marked dirty on the CPU side.

  • EmitterInstanceBuffer — StructuredBuffer<Nova2DEmitterInstanceGPU>, capacity 256. Stores per-instance runtime state: world position, rotation, scale, emissionAccumulator, elapsedTime, isActive, killFlag, and more. Updated by the CPU each frame and uploaded in full.

  • EmitOffsetsBuffer — StructuredBuffer<uint>, capacity 256. Holds the prefix-sum emission start offset for each active instance in the current frame's total emit sequence. Computed on the CPU before dispatch. The Emit shader uses binary search on this buffer to map each thread to its owning instance, allowing a single dispatch to serve all active emitters simultaneously.

4 Rendering Infrastructure
  • PopulatedVBO — a UAV-enabled RWByteAddressBuffer, written each frame by the PopulateVBO compute shader. Each particle contributes 4 vertices.

  • IBO — a static index buffer, pre-generated at startup with the quad index pattern for every possible particle slot (6 indices per particle). Never modified at runtime.

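  • IndirectArgsBuffer — written by the WriteIndirectArgs shader immediately after the Update pass, encoding the 5 arguments required by DrawIndexedIndirect (index count, instance count, start index, base vertex, start instance). The CPU issues the draw call but never supplies or reads back a particle count.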
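The contents of IndirectArgsBuffer can be pictured as a plain struct. The sketch below assumes the draw maps to D3D11's DrawIndexedInstancedIndirect, which reads five 32-bit values in this order. The struct name, the WriteIndirectArgs function signature, and the instance count of 1 are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Five 32-bit arguments in the order an indexed, instanced indirect draw reads them.
struct DrawIndexedIndirectArgs {
    std::uint32_t indexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startIndexLocation;
    std::int32_t  baseVertexLocation;
    std::uint32_t startInstanceLocation;
};
static_assert(sizeof(DrawIndexedIndirectArgs) == 20, "five tightly packed 32-bit values");

// CPU stand-in for the WriteIndirectArgs shader: 6 indices (two triangles) per live particle.
inline DrawIndexedIndirectArgs WriteIndirectArgs(std::uint32_t aliveCountAfterSim,
                                                 std::uint32_t maxParticles) {
    assert(aliveCountAfterSim <= maxParticles);
    return DrawIndexedIndirectArgs{aliveCountAfterSim * 6u, 1u, 0u, 0, 0u}; // instanceCount = 1 is assumed
}
```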

Emitter Architecture — Definition & Instance

Nova2D splits emitter data into two layers: Definition describes emitter behavior, Instance describes emitter runtime state. The two are stored separately and uploaded independently.

Separation of Responsibilities

A Definition is a read-only behavioral template covering four modules:

  • Emission Module: spawn shape, rate, lifetime range, initial velocity range, and related parameters

  • Motion Module: gravity, drag, point attraction/repulsion, vortex field, Curl Noise, and other force types

  • Appearance Module: texture index, Sprite Sheet configuration, and FloatCurves for color, size, and rotation over lifetime

  • Collision Module: collision texture sampling rules and response types (Bounce / Die / Stick / Slow)

An Instance is a per-emitter runtime snapshot, updated by the CPU and uploaded every frame:

  • World position, rotation, scale

  • emissionAccumulator: fractional particle count accumulated across frames

  • elapsedTime: total time the emitter has been running, used for Burst mode timing

  • isActive / killFlag: controls emitter activation and destruction

Why Separate

A single Definition can be referenced by multiple Instances. This avoids duplicating the same large Definition data for every Instance that uses it. On the GPU side, each particle only needs to store an instanceID. The Update shader resolves the full behavioral data through two levels of indirection: g_Instances[p.instanceID].m_definitionIndex → g_Definitions[definitionIndex].

Definitions use a dirty flag mechanism and are only uploaded when modified on the CPU side. Instances are uploaded in full every frame, since fields like position, accumulator, and elapsedTime change almost every frame, making partial updates impractical.
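A compact CPU-side sketch of the two ideas above, the two-level lookup and the dirty-flag upload, might look like this. The struct layouts, the gravity payload, and both helper functions are hypothetical. Only instanceID and m_definitionIndex are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct EmitterDefinition { float gravity = 0.0f; bool dirty = true; }; // gravity is a placeholder payload
struct EmitterInstance   { std::uint32_t m_definitionIndex = 0; };
struct Particle          { std::uint32_t instanceID = 0; };

// Two levels of indirection: particle -> instance -> definition.
inline const EmitterDefinition& ResolveDefinition(const Particle& p,
                                                  const std::vector<EmitterInstance>& instances,
                                                  const std::vector<EmitterDefinition>& definitions) {
    assert(p.instanceID < instances.size());
    const std::uint32_t definitionIndex = instances[p.instanceID].m_definitionIndex;
    assert(definitionIndex < definitions.size());
    return definitions[definitionIndex];
}

// Returns how many definitions would be uploaded this frame and clears their flags.
inline std::size_t UploadDirtyDefinitions(std::vector<EmitterDefinition>& definitions) {
    std::size_t uploaded = 0;
    for (EmitterDefinition& def : definitions) {
        if (!def.dirty) continue;
        // A real engine would copy 'def' into EmitterDefinitionBuffer here.
        def.dirty = false;
        ++uploaded;
    }
    return uploaded;
}
```

In a steady-state frame where no definition was edited, the upload loop touches nothing, which is where the bandwidth saving comes from.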

EmitOffsetsBuffer and Single-Dispatch Emit

At emit time, multiple instances may need to spawn particles in the same frame. The naive approach would issue one Dispatch per instance, producing many small GPU calls. Nova2D instead merges all instance emit work into a single Dispatch.

Before dispatching, the CPU iterates over all active instances, computes each instance's emit count for this frame, and writes the prefix sum of those counts into EmitOffsetsBuffer.

Each thread in the Emit shader takes its global thread index and performs a binary search on EmitOffsetsBuffer to determine which instance it belongs to, then reads the correct Definition and Instance data to initialize its particle. Regardless of how many active emitters exist in the scene, the Emit phase always requires exactly one Dispatch.
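The offset construction and the per-thread search can be sketched as follows. This is a CPU illustration under assumptions: BuildEmitOffsets and FindOwningInstance are hypothetical names, and the offsets are taken to be an exclusive prefix sum, so entry i holds the first global thread index owned by instance i.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum: offsets[i] is the first thread index belonging to instance i.
inline std::vector<std::uint32_t> BuildEmitOffsets(const std::vector<std::uint32_t>& emitCounts,
                                                   std::uint32_t& totalEmit) {
    std::vector<std::uint32_t> offsets(emitCounts.size());
    totalEmit = 0;
    for (std::size_t i = 0; i < emitCounts.size(); ++i) {
        offsets[i] = totalEmit;
        totalEmit += emitCounts[i];
    }
    return offsets;
}

// Binary search for the last instance whose start offset is <= threadIndex.
// Instances that emit nothing share an offset with their successor and are skipped.
inline std::uint32_t FindOwningInstance(const std::vector<std::uint32_t>& offsets,
                                        std::uint32_t threadIndex) {
    assert(!offsets.empty());
    std::uint32_t lo = 0;
    std::uint32_t hi = static_cast<std::uint32_t>(offsets.size()) - 1;
    while (lo < hi) {
        const std::uint32_t mid = (lo + hi + 1) / 2; // round up so the range always shrinks
        if (offsets[mid] <= threadIndex) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}
```

With emit counts {2, 0, 3} the offsets are {0, 2, 2}, so threads 0 and 1 map to instance 0 and threads 2 to 4 map to instance 2. A thread whose index is at or beyond the total emit count would return early before searching.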

Update Shader — Physics & Appearance

Curl Noise

Mapping Perlin noise samples directly to particle velocity directions is a common approach, but it has a fundamental flaw: when the gradient of a scalar Perlin noise field is used as a force field, the divergence of the resulting vector field is not guaranteed to be zero. A non-zero divergence means the field contains sources (points where force diverges outward) and sinks (points where force converges inward). In practice, particles drain away from the sources and pile up at the sinks, so the flow clumps instead of swirling.

Mathematical Foundation

The core idea behind Curl Noise is to construct a vector field whose divergence is guaranteed to be zero. From the vector calculus identity:

div(curl(F)) = 0

The divergence of the curl of any twice-differentiable vector field is always zero. Taking the curl of such a field therefore produces a new field that is mathematically guaranteed to have no sources or sinks: particles can only rotate and flow through it.

The approach constructs a potential field F from three independent scalar Perlin noise fields:

F = ( N0(x,y,z), N1(x,y,z), N2(x,y,z) )

The curl of F is then:

curl(F) = ( dN2/dy - dN1/dz, dN0/dz - dN2/dx, dN1/dx - dN0/dy )

The resulting curl(F) is used as the velocity field direction, and its divergence is identically zero.

Numerical Implementation: Central Differences

Partial derivatives are approximated numerically in the shader using central differences:

dF/dx ≈ (F(x + δ) - F(x - δ)) / (2δ)
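The central-difference curl can be sketched in C++ as below. Several things here are assumptions: the Potential function uses three phase-shifted sine waves as a stand-in for the three Perlin channels N0, N1, N2, double precision is used so the divergence check is not drowned in rounding noise (a shader would use float), and the Divergence helper exists only to verify the result.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Placeholder potential F = (N0, N1, N2); a real shader would sample Perlin noise here.
inline Vec3 Potential(const Vec3& p) {
    return { std::sin(1.3 * p.x + 0.7 * p.y + 2.1 * p.z),
             std::sin(0.9 * p.x + 1.9 * p.y + 0.4 * p.z + 1.0),
             std::sin(1.7 * p.x + 0.5 * p.y + 1.1 * p.z + 2.0) };
}

// curl(F) with every partial derivative approximated by a central difference.
inline Vec3 CurlNoise(const Vec3& p, double delta) {
    assert(delta > 0.0);
    const double inv = 1.0 / (2.0 * delta);
    const Vec3 xp = Potential({p.x + delta, p.y, p.z});
    const Vec3 xm = Potential({p.x - delta, p.y, p.z});
    const Vec3 yp = Potential({p.x, p.y + delta, p.z});
    const Vec3 ym = Potential({p.x, p.y - delta, p.z});
    const Vec3 zp = Potential({p.x, p.y, p.z + delta});
    const Vec3 zm = Potential({p.x, p.y, p.z - delta});
    const double dN0dy = (yp.x - ym.x) * inv, dN0dz = (zp.x - zm.x) * inv;
    const double dN1dx = (xp.y - xm.y) * inv, dN1dz = (zp.y - zm.y) * inv;
    const double dN2dx = (xp.z - xm.z) * inv, dN2dy = (yp.z - ym.z) * inv;
    return { dN2dy - dN1dz, dN0dz - dN2dx, dN1dx - dN0dy };
}

// Numerical divergence of the curl field, used only to check that it vanishes.
inline double Divergence(const Vec3& p, double delta) {
    const double inv = 1.0 / (2.0 * delta);
    return (CurlNoise({p.x + delta, p.y, p.z}, delta).x - CurlNoise({p.x - delta, p.y, p.z}, delta).x) * inv
         + (CurlNoise({p.x, p.y + delta, p.z}, delta).y - CurlNoise({p.x, p.y - delta, p.z}, delta).y) * inv
         + (CurlNoise({p.x, p.y, p.z + delta}, delta).z - CurlNoise({p.x, p.y, p.z - delta}, delta).z) * inv;
}
```

Each curl evaluation costs six potential samples, which is the main per-particle price of this force.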

Collision Detection

Nova2D's collision detection is driven by a real-time collision texture. During the scene render pass, MRT writes each object's physical material properties into a dedicated collision render target, with different materials encoded as distinct colors. The Update shader samples this texture every frame to determine whether a particle has collided with scene geometry and which response to apply. Because the collision texture is regenerated each frame, collision boundaries update dynamically with the scene.

 

Texture sampling on the GPU is a constant-time operation regardless of scene complexity, making the per-particle collision cost uniform and predictable at any scale.
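The constant-time lookup can be illustrated with a CPU stand-in for the collision render target. This sketch makes several assumptions that the description does not confirm: each texel is stored directly as a response code rather than a material color, particle positions are already converted to normalized UV coordinates, and the CollisionTexture type and Sample method are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class CollisionResponse : std::uint8_t { None, Bounce, Die, Stick, Slow };

// CPU stand-in for the collision render target, one response code per texel (assumed encoding).
struct CollisionTexture {
    std::uint32_t width;
    std::uint32_t height;
    std::vector<CollisionResponse> texels;

    CollisionTexture(std::uint32_t w, std::uint32_t h)
        : width(w), height(h),
          texels(static_cast<std::size_t>(w) * h, CollisionResponse::None) {
        assert(w > 0 && h > 0);
    }

    // Point-samples the texel under (u, v); positions outside [0, 1) never collide.
    CollisionResponse Sample(float u, float v) const {
        if (u < 0.0f || u >= 1.0f || v < 0.0f || v >= 1.0f) return CollisionResponse::None;
        const std::uint32_t x = std::min(static_cast<std::uint32_t>(u * width), width - 1);
        const std::uint32_t y = std::min(static_cast<std::uint32_t>(v * height), height - 1);
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};
```

The lookup cost is one index computation and one read, independent of how many objects were drawn into the texture.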

Over Lifetime Graph

FloatCurve is the universal evaluation mechanism driving particle properties over lifetime. The lifetime ratio t = 1 - (lifetime / maxLifetime) runs from 0 (spawn) to 1 (death). Size, Rotation Speed, and Curl Noise Strength are all evaluated via EvaluateFloatCurve, which performs piecewise linear interpolation across keyframes at t. Curve data is stored in the DefinitionBuffer and distinguished by the m_type field. The Update shader iterates over the m_curves array to locate the curve matching the required type.
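Piecewise linear evaluation over keyframes can be sketched as follows. The CurveKey layout and the LifetimeRatio helper are assumptions, and this version takes a single curve's keys directly rather than searching an m_curves array by type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct CurveKey { float time; float value; }; // time in [0, 1], sorted ascending (assumed layout)

// t = 0 at spawn, t = 1 at death.
inline float LifetimeRatio(float lifetime, float maxLifetime) {
    assert(maxLifetime > 0.0f);
    return 1.0f - lifetime / maxLifetime;
}

// Linear interpolation between the two keys that bracket t; clamps outside the key range.
inline float EvaluateFloatCurve(const std::vector<CurveKey>& keys, float t) {
    assert(!keys.empty());
    if (t <= keys.front().time) return keys.front().value;
    if (t >= keys.back().time) return keys.back().value;
    for (std::size_t i = 1; i < keys.size(); ++i) {
        if (t <= keys[i].time) {
            const CurveKey& a = keys[i - 1];
            const CurveKey& b = keys[i];
            const float span = b.time - a.time;
            const float s = span > 0.0f ? (t - a.time) / span : 0.0f;
            return a.value + (b.value - a.value) * s;
        }
    }
    return keys.back().value;
}
```

For a size curve with keys (0, 0), (0.5, 2), (1, 0), a particle with a quarter of its life remaining has t = 0.75 and evaluates to 1.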

Rendering

PopulateVBO

Once Update completes, the surviving particle indices are in AliveListOut. The PopulateVBO shader iterates over AliveListOut and writes 4 vertices per particle into the VBO. Each vertex carries the full data required by the Pixel Shader.

The VBO is declared as RWByteAddressBuffer rather than a typed buffer — this allows the Compute shader to write arbitrary byte offsets directly, giving full control over the vertex layout without being constrained by structured buffer element alignment.
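Writing vertices at raw byte offsets can be mimicked on the CPU with a byte vector and memcpy. The 24-byte ParticleVertex layout below is purely hypothetical, since the actual vertex format is not listed here, and StoreQuad and LoadVertex are illustrative helpers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical vertex layout; the real per-vertex data is not specified in the text.
struct ParticleVertex {
    float position[2];
    float uv[2];
    std::uint32_t colorRGBA8;
    std::uint32_t textureIndex;
};
static_assert(sizeof(ParticleVertex) == 24, "layout is expected to be tightly packed");

constexpr std::size_t kVerticesPerParticle = 4;

// Writes one quad at the byte offset derived from its position in AliveListOut.
inline void StoreQuad(std::vector<std::uint8_t>& vbo, std::size_t quadIndex,
                      const std::array<ParticleVertex, 4>& corners) {
    const std::size_t quadBytes = kVerticesPerParticle * sizeof(ParticleVertex);
    const std::size_t byteOffset = quadIndex * quadBytes;
    assert(byteOffset + quadBytes <= vbo.size());
    std::memcpy(vbo.data() + byteOffset, corners.data(), quadBytes);
}

// Reads a vertex back the way the input assembler would address it.
inline ParticleVertex LoadVertex(const std::vector<std::uint8_t>& vbo, std::size_t vertexIndex) {
    const std::size_t byteOffset = vertexIndex * sizeof(ParticleVertex);
    assert(byteOffset + sizeof(ParticleVertex) <= vbo.size());
    ParticleVertex v;
    std::memcpy(&v, vbo.data() + byteOffset, sizeof(ParticleVertex));
    return v;
}
```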

Texture2DArray

DX11's Texture2DArray requires all slices to share identical dimensions. Nova2D resolves this by normalizing all registered textures to a common size at build time.
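One way the normalization could work is sketched below. It assumes the common slice size is the largest power-of-two size needed by any registered texture, and that each texture is padded rather than stretched, so a per-texture UV scale maps the quad's UVs onto the occupied region. All type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct TextureSize { std::uint32_t width; std::uint32_t height; };
struct UVScale { float u; float v; };

// Smallest power of two that is >= v.
inline std::uint32_t NextPowerOfTwo(std::uint32_t v) {
    assert(v > 0 && v <= 0x80000000u);
    std::uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

// Slice size shared by the whole array: the per-axis maximum of each texture's POT size.
inline TextureSize CommonSliceSize(const std::vector<TextureSize>& textures) {
    TextureSize common{1, 1};
    for (const TextureSize& t : textures) {
        common.width = std::max(common.width, NextPowerOfTwo(t.width));
        common.height = std::max(common.height, NextPowerOfTwo(t.height));
    }
    return common;
}

// Fraction of the slice that the original texture occupies (assumes padding, not stretching).
inline UVScale ComputeUVScale(TextureSize texture, TextureSize slice) {
    assert(texture.width <= slice.width && texture.height <= slice.height);
    return { static_cast<float>(texture.width) / static_cast<float>(slice.width),
             static_cast<float>(texture.height) / static_cast<float>(slice.height) };
}
```

With a 100×60 and a 256×300 texture, the shared slice would be 256×512 and the smaller texture would sample only the first 0.390625 × 0.1171875 of its slice.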

DrawIndexedIndirect

The IBO is built once at startup with 6 indices per particle slot (two triangles) and never modified. Before the draw call, the WriteIndirectArgs shader reads Counters[3] and writes IndexCount = aliveCountAfterSim × 6 into IndirectArgsBuffer.
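Generating the static index buffer is a small startup loop. The sketch below assumes a 0-1-2 / 2-1-3 corner ordering for the two triangles. The real winding order is not stated, and BuildQuadIndexBuffer is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds 6 indices per particle slot, referencing that slot's 4 consecutive vertices.
inline std::vector<std::uint32_t> BuildQuadIndexBuffer(std::uint32_t maxParticles) {
    assert(maxParticles <= 0xFFFFFFFFu / 6u); // keep index values within 32 bits
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(maxParticles) * 6);
    for (std::uint32_t i = 0; i < maxParticles; ++i) {
        const std::uint32_t base = i * 4;
        // Two triangles per quad; the corner order is an assumption.
        const std::uint32_t quad[6] = {base, base + 1, base + 2, base + 2, base + 1, base + 3};
        indices.insert(indices.end(), quad, quad + 6);
    }
    return indices;
}
```

Because quad k always uses vertices 4k to 4k+3, drawing the first aliveCountAfterSim × 6 indices covers exactly the quads that PopulateVBO wrote this frame.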

DrawIndexedIndirect consumes IndirectArgsBuffer directly, so the CPU never reads back or forwards particle counts. The Pixel Shader writes to two render targets simultaneously, providing the dual render target output that the bloom pass consumes.

Performance
