Summary
Implement a compute shader that reads chunk block data from the GPU storage buffer (#389) and produces vertex data directly on the GPU. This eliminates the CPU meshing bottleneck entirely — no worker thread meshing, no vertex upload, no staging buffer. The GPU builds meshes and immediately draws them.
Depends on: #389 (GPU block data buffer)
This is the capstone rendering optimization. Combined with MDI (#371), GPU culling (#379), and occlusion culling (#387), the entire rendering pipeline becomes GPU-driven.
Current Meshing Pipeline
- Worker thread reads Chunk.blocks on CPU
- Greedy mesher processes 16 subchunks, merges adjacent faces
- Produces []Vertex arrays (solid/cutout/fluid)
- Main thread uploads to GPU via staging buffer
- Vertex data lives in the megabuffer, drawn via drawOffset()
Bottleneck: CPU meshing takes ~2-5 ms per chunk. With 1000+ chunks loading, that adds up to 2-5 s of CPU meshing time, so the mesh queue is always behind the generation queue.
Target: Compute Meshing
Overview
[GPU Block Buffer] → [Compute Mesh Shader] → [Vertex Output Buffer] → [Draw]
↑
[Indirect Draw Commands] ←─┘
- Compute shader reads blocks from GpuBlockBuffer for one chunk
- For each block: check 6 face neighbors, generate visible faces
- Merge adjacent same-block-type faces (greedy merge)
- Write vertices to output buffer
- Write DrawIndirectCommand for draw dispatch
Compute Shader Design
// mesh.comp
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;
layout(binding = 0) readonly buffer BlockData { uint blocks[]; }; // chunk blocks
layout(binding = 1) writeonly buffer VertexOutput { Vertex vertices[]; };
layout(binding = 2) buffer DrawCommand { DrawIndirectCommand cmd; };
layout(binding = 3) readonly buffer NeighborData { uint neighbors[]; }; // 4 neighbor chunks
// Each workgroup processes one horizontal slice (16x16 blocks at one Y level)
// Within slice: each thread processes one block (one x/z position in the 16x16 slice)
// For each block: check 6 faces, emit visible quad vertices
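The thread-to-block mapping described in these comments can be modeled on the CPU. The sketch below is only an illustration: the y-major layout in `block_index` and both function names are assumptions, since the actual layout of GpuBlockBuffer is defined in #389 and not here.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE 16u  /* blocks per side of a slice, matching local_size 16x16 */

/* Hypothetical layout: y-major, then z, then x. If GpuBlockBuffer uses a
 * different order, only this function would need to change. */
static uint32_t block_index(uint32_t x, uint32_t y, uint32_t z) {
    assert(x < CHUNK_SIZE && z < CHUNK_SIZE);
    return (y * CHUNK_SIZE + z) * CHUNK_SIZE + x;
}

/* Models what one shader invocation would compute:
 * local_x/local_y stand in for gl_LocalInvocationID.xy and
 * workgroup_id for the workgroup index that selects the Y level. */
static uint32_t invocation_block_index(uint32_t local_x, uint32_t local_y,
                                       uint32_t workgroup_id) {
    return block_index(local_x, workgroup_id, local_y);
}
```

With this mapping, a full-height chunk of 16 subchunks would need 256 workgroups, one per Y level.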
Greedy Merge on GPU
- Per-slice face mask: uint face_mask[16][16], 1 bit per face per block
- Workgroup shared memory for the current slice
- Reduction: merge adjacent same-type faces in shared memory
- Output: variable-length vertex stream per workgroup
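As a rough illustration of the per-block mask entry, the following CPU-side sketch packs six visibility bits into one `uint`. The bit order in `enum Face` and the helper names are hypothetical; the design above only fixes "1 bit per face per block".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignment for the six face directions. */
enum Face { FACE_NEG_X, FACE_POS_X, FACE_NEG_Y, FACE_POS_Y,
            FACE_NEG_Z, FACE_POS_Z, FACE_COUNT };

/* Returns a 6-bit mask: bit f is set when face f of a solid block borders
 * a non-solid neighbor. neighbor_solid[f] is nonzero when the neighbor in
 * direction f is solid. Non-solid blocks produce an empty mask. */
static uint32_t face_bits(int self_solid, const int neighbor_solid[FACE_COUNT]) {
    uint32_t bits = 0;
    if (!self_solid) return 0;
    for (int f = 0; f < FACE_COUNT; f++)
        if (!neighbor_solid[f]) bits |= 1u << f;
    return bits;
}

/* Number of visible faces encoded in one mask entry (population count). */
static uint32_t face_count(uint32_t bits) {
    uint32_t n = 0;
    assert((bits >> FACE_COUNT) == 0);  /* only the low six bits are valid */
    for (; bits; bits &= bits - 1) n++;
    return n;
}
```

Summing `face_count` over a slice gives the number of quads a workgroup would emit before any merging, which is also the amount of output space it needs to reserve.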
Output Management
- Allocate output buffer slots atomically: atomicAdd(vertex_counter, count)
- Each workgroup reserves space, writes vertices, updates draw command
- Pipeline barrier between compute dispatch and graphics draw
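The reservation step can be modeled with C11 atomics, where `atomic_fetch_add` returns the previous counter value just as GLSL `atomicAdd` does. This is a sketch under assumptions: `VertexAllocator`, `reserve_vertices`, and `ALLOC_FAILED` are illustrative names, and the capacity check is one possible overflow policy, not something the design above specifies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint vertex_counter;  /* mirrors the shader-side vertex_counter */
    uint32_t capacity;           /* total vertices the output buffer can hold */
} VertexAllocator;

#define ALLOC_FAILED UINT32_MAX

/* Reserves `count` consecutive vertex slots and returns the first index,
 * or ALLOC_FAILED when the reservation would run past the buffer end.
 * Note that the counter is advanced even when the reservation fails. */
static uint32_t reserve_vertices(VertexAllocator *a, uint32_t count) {
    assert(count <= a->capacity);
    uint32_t first = atomic_fetch_add(&a->vertex_counter, count);
    if (first > a->capacity - count) return ALLOC_FAILED;
    return first;
}
```

Because each workgroup gets a disjoint range, no further synchronization is needed between workgroups while they write vertices; only the final counter value has to be visible to the draw.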
Implementation Plan
Step 1: Basic face culling compute shader
- No greedy merge initially — just emit 1 quad per visible face
- Verify correctness: same visual output as CPU mesher
- Performance measurement: compare CPU vs GPU mesh time
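One cheap correctness check for this step is to compare the GPU's emitted quad count against a CPU reference. The sketch below counts visible faces for a single 16x16x16 volume; it assumes a flat y-major array, treats any nonzero block as solid, and treats everything outside the volume as air. All three are simplifications made for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { S = 16 };  /* subchunk edge length */

/* Out-of-range coordinates count as air (assumption; Step 4 replaces this
 * with real neighbor data). */
static int solid_at(const uint8_t *blocks, int x, int y, int z) {
    if (x < 0 || y < 0 || z < 0 || x >= S || y >= S || z >= S) return 0;
    return blocks[(y * S + z) * S + x] != 0;
}

/* Counts faces of solid blocks that border a non-solid cell: one quad each. */
static uint32_t count_visible_faces(const uint8_t *blocks) {
    static const int dir[6][3] = {
        {-1, 0, 0}, {1, 0, 0}, {0, -1, 0}, {0, 1, 0}, {0, 0, -1}, {0, 0, 1}
    };
    uint32_t faces = 0;
    assert(blocks);
    for (int y = 0; y < S; y++)
        for (int z = 0; z < S; z++)
            for (int x = 0; x < S; x++) {
                if (!solid_at(blocks, x, y, z)) continue;
                for (int d = 0; d < 6; d++)
                    if (!solid_at(blocks, x + dir[d][0], y + dir[d][1], z + dir[d][2]))
                        faces++;
            }
    return faces;
}
```

A matching count does not prove the vertices are identical, but a mismatch is an early and inexpensive signal that the shader's neighbor checks are wrong.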
Step 2: Greedy merge in shared memory
- Face mask generation in shared memory
- Row-by-row greedy merge within each slice
- Column merge across slices
- This is the hard part — may need multiple passes within the workgroup
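For reference while developing the shared-memory version, here is a serial sketch of the row-then-column merge for one face direction in one slice. It is the ordinary CPU greedy merge, not the parallel algorithm, and `Quad`, `merge_slice`, and the flat `faces` layout are names and choices made up for this example.

```c
#include <assert.h>
#include <stdint.h>

enum { N = 16 };  /* slice edge length */

typedef struct { uint8_t x, z, w, h; uint16_t type; } Quad;

/* faces[z * N + x] holds the block type whose face is visible, 0 for none.
 * Writes merged rectangles to out (room for N * N) and returns their count. */
static int merge_slice(const uint16_t *faces, Quad *out) {
    uint8_t used[N * N] = {0};
    int count = 0;
    assert(faces && out);
    for (int z = 0; z < N; z++) {
        for (int x = 0; x < N; x++) {
            uint16_t t = faces[z * N + x];
            if (t == 0 || used[z * N + x]) continue;
            /* Row pass: extend along x while the type matches. */
            int w = 1;
            while (x + w < N && faces[z * N + x + w] == t && !used[z * N + x + w]) w++;
            /* Column pass: extend along z while the whole next row matches. */
            int h = 1;
            while (z + h < N) {
                int row_ok = 1;
                for (int i = 0; i < w; i++) {
                    int idx = (z + h) * N + x + i;
                    if (faces[idx] != t || used[idx]) { row_ok = 0; break; }
                }
                if (!row_ok) break;
                h++;
            }
            for (int dz = 0; dz < h; dz++)
                for (int dx = 0; dx < w; dx++) used[(z + dz) * N + x + dx] = 1;
            out[count++] = (Quad){ (uint8_t)x, (uint8_t)z, (uint8_t)w, (uint8_t)h, t };
        }
    }
    return count;
}
```

The GPU version may legitimately produce a different set of rectangles, so a comparison against this reference should check covered area and block type per cell rather than exact quad lists.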
Step 3: Cutout and fluid passes
- Separate dispatch for cutout blocks (alpha-tested) and fluid blocks
- Or: single dispatch with pass tag per vertex
- Draw commands for each pass written to separate buffers
Step 4: Neighbor data
- Block data for 4 cardinal neighbors needed for boundary faces
- Already uploaded in GpuBlockBuffer; only the slot index mapping is needed
- Pass neighbor slot indices as push constants or uniform
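A possible shape for the slot mapping, sketched on the CPU side: the struct below is a hypothetical push-constant payload (20 bytes, well under Vulkan's guaranteed 128-byte minimum for push constants), and `resolve_slot` shows how a coordinate one block outside the chunk could be redirected to the owning neighbor. The field names, the neighbor ordering, and the `NO_SLOT` sentinel are assumptions.

```c
#include <assert.h>
#include <stdint.h>

enum { S = 16 };  /* chunk edge length in x and z */
enum { NEIGHBOR_NEG_X, NEIGHBOR_POS_X, NEIGHBOR_NEG_Z, NEIGHBOR_POS_Z, NEIGHBOR_COUNT };

#define NO_SLOT UINT32_MAX  /* stored when a neighbor chunk is not resident */

/* Hypothetical push-constant payload: own slot plus four cardinal neighbors. */
typedef struct {
    uint32_t self_slot;
    uint32_t neighbor_slot[NEIGHBOR_COUNT];
} MeshPushConstants;

/* Maps an (x, z) coordinate that may lie one block outside the chunk to the
 * slot that owns it, and wraps x/z into that chunk's local range.
 * Diagonal lookups are rejected: face neighbors never need them. */
static uint32_t resolve_slot(const MeshPushConstants *pc, int *x, int *z) {
    assert(*x >= -1 && *x <= S && *z >= -1 && *z <= S);
    assert(!((*x < 0 || *x >= S) && (*z < 0 || *z >= S)));
    if (*x < 0)  { *x += S; return pc->neighbor_slot[NEIGHBOR_NEG_X]; }
    if (*x >= S) { *x -= S; return pc->neighbor_slot[NEIGHBOR_POS_X]; }
    if (*z < 0)  { *z += S; return pc->neighbor_slot[NEIGHBOR_NEG_Z]; }
    if (*z >= S) { *z -= S; return pc->neighbor_slot[NEIGHBOR_POS_Z]; }
    return pc->self_slot;
}
```

How a `NO_SLOT` result should be treated (emit the boundary face, or skip it until the neighbor loads) is a policy decision this sketch leaves open.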
Step 5: Integration with render graph
RenderGraph: add MeshBuildPass before OpaquePass
- Dispatch compute for all chunks that need remeshing
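- Pipeline barrier: COMPUTE_SHADER → DRAW_INDIRECT and VERTEX_INPUT, since the draw reads the generated commands as indirect parameters and the vertices as vertex attributes (VERTEX_SHADER is the right destination only if the vertex shader fetches vertices from a storage buffer)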
- OpaquePass draws from the compute-generated vertex buffer
Step 6: CPU fallback
- Keep CPU mesher for devices without adequate compute capability
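- Runtime detection: check maxComputeWorkGroupSize, maxComputeWorkGroupInvocations, and maxComputeSharedMemorySize (the 16x16 workgroup needs 256 invocations, above the 128 that Vulkan guarantees)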
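The runtime check could look roughly like the following. To keep the sketch compilable without Vulkan headers, `ComputeLimits` is a hand-written stand-in that copies three fields of `VkPhysicalDeviceLimits`; `supports_gpu_meshing` and the `shared_bytes_needed` parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the relevant fields of VkPhysicalDeviceLimits. */
typedef struct {
    uint32_t maxComputeWorkGroupSize[3];
    uint32_t maxComputeWorkGroupInvocations;
    uint32_t maxComputeSharedMemorySize;  /* bytes */
} ComputeLimits;

/* Returns true when the device can run a 16x16x1 workgroup with the given
 * amount of shared memory; otherwise the CPU mesher should be used. */
static bool supports_gpu_meshing(const ComputeLimits *l, uint32_t shared_bytes_needed) {
    const uint32_t local[3] = {16, 16, 1};  /* local_size from mesh.comp */
    assert(l);
    for (int i = 0; i < 3; i++)
        if (l->maxComputeWorkGroupSize[i] < local[i]) return false;
    if (l->maxComputeWorkGroupInvocations < local[0] * local[1] * local[2]) return false;
    return l->maxComputeSharedMemorySize >= shared_bytes_needed;
}
```

For example, a `uint face_mask[16][16]` array alone needs 1024 bytes of shared memory. A device that only meets the Vulkan minimums (128 invocations, 16384 bytes) would fail this check on the invocation count, not on shared memory.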
Files to Create
assets/shaders/vulkan/mesh.comp — compute meshing shader
assets/shaders/vulkan/mesh.comp.spv
src/world/gpu_mesher.zig — dispatch management, output buffer lifecycle
Files to Modify
src/engine/graphics/render_graph.zig — add MeshBuildPass
src/world/world_renderer.zig — integrate GPU mesher dispatch
src/world/world_streamer.zig — trigger remesh via GPU instead of CPU job
src/engine/graphics/vulkan/pipeline_manager.zig — compute mesh pipeline
build.zig — glslangValidator check
Testing
Risks
- Greedy merge on GPU is complex — shared memory management, synchronization within workgroups
- May need iterative approach: start with face culling only, add greedy merge incrementally
- Output buffer sizing is tricky — need atomic allocation with worst-case bounds
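To put numbers on the sizing risk, the sketch below computes an unmerged worst-case bound. It assumes a 3D checkerboard fill (half the blocks solid, each exposing all six faces) is the worst case for plain face culling; the vertex size and vertices-per-face values are parameters because the Vertex layout is not given here, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Faces emitted by a checkerboard fill of a w x h x d volume:
 * ceil(blocks / 2) solid blocks with six visible faces each. */
static uint64_t worst_case_faces(uint32_t w, uint32_t h, uint32_t d) {
    uint64_t blocks = (uint64_t)w * h * d;
    return (blocks + 1) / 2 * 6;
}

/* Output buffer bytes needed for that many faces without merging. */
static uint64_t worst_case_bytes(uint32_t w, uint32_t h, uint32_t d,
                                 uint32_t verts_per_face, uint32_t vertex_size) {
    assert(verts_per_face > 0 && vertex_size > 0);
    return worst_case_faces(w, h, d) * verts_per_face * vertex_size;
}
```

Under these assumptions a 16x16x16 subchunk tops out at 12288 faces and a full 16x256x16 chunk at 196608, so reserving the worst case for every chunk is likely too expensive and an overflow path for the atomic allocator is worth planning for.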
Roadmap: docs/PERFORMANCE_ROADMAP.md — Batch 6, Issue 4A-2