3D Gaussian splatting renderer and viewer in WebGPU.
Author: Lu M.
Tested System:
- Windows 11 Home
- AMD Ryzen 7 5800HS @ 3.20GHz, 16GB RAM
- NVIDIA GeForce RTX 3060 Laptop GPU 6GB (Compute Capability 8.6)
- Brave 1.83.118 (Official Build) (64-bit), Chromium: 141.0.7390.108
Gaussian splatting is a point-based rendering technique where each 3D point is rasterized as a screen-space Gaussian instead of a single pixel. Each splat carries a position, covariance, color, and opacity. When splats overlap, their contributions are composited to reconstruct a smooth, continuous appearance.
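Concretely, in the standard 3DGS formulation a splat with screen-space center $\mu_i$, projected 2D covariance $\Sigma'_i$, color $c_i$, and opacity $o_i$ contributes

$$
\alpha_i(x) = o_i \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i'^{-1} (x-\mu_i)\right), \qquad
C(x) = \sum_{i=1}^{N} c_i\,\alpha_i(x) \prod_{j<i} \bigl(1-\alpha_j(x)\bigr),
$$

with splats indexed front-to-back along the viewing ray.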
This implementation follows three main stages, each parallelized on the GPU via WebGPU: preprocess, sort, and render.

Preprocess:
- Input: a 3D point cloud loaded from a PLY file.
- Transform points into view, clip, and NDC coordinates.
- Compute per-point splat parameters: screen-space position, 3D covariance, and projected 2D covariance (see the projection sketch after this list).
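A minimal WGSL sketch of the 2D-covariance step (the EWA-style projection $\Sigma' = J W \Sigma W^\top J^\top$); the function and parameter names here are illustrative, not the repository's actual code:

```wgsl
// Project a world-space 3D covariance to a screen-space 2D covariance.
// Assumptions: `view` is the rotational 3x3 part of the view matrix,
// `mean_view` the splat center in view space, `focal` in pixels.
fn project_covariance(mean_view: vec3f, cov3d: mat3x3f, view: mat3x3f, focal: vec2f) -> vec3f {
    let inv_z = 1.0 / mean_view.z;
    // Jacobian of the perspective projection, linearized at the splat center
    // (WGSL matrix constructors take column vectors).
    let J = mat3x3f(
        vec3f(focal.x * inv_z, 0.0, 0.0),
        vec3f(0.0, focal.y * inv_z, 0.0),
        vec3f(-focal.x * mean_view.x * inv_z * inv_z,
              -focal.y * mean_view.y * inv_z * inv_z,
              0.0)
    );
    let T = J * view;
    let cov2d = T * cov3d * transpose(T);
    // Only the symmetric upper-left 2x2 block matters: return (a, b, c)
    // of [[a, b], [b, c]].
    return vec3f(cov2d[0][0], cov2d[0][1], cov2d[1][1]);
}
```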
Sort:
- Purpose: sorting points back-to-front is required for correct alpha blending and improves memory coherence.
- Implementation: a GPU radix sort over depth keys (a key-construction sketch follows).
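As a sketch of how a float depth can feed a radix sort, the usual bit trick maps IEEE-754 floats to unsigned keys with the same ordering (the function name is illustrative):

```wgsl
// Map an f32 depth to a u32 that radix-sorts in the same order.
// Positive floats: set the sign bit; negative floats: flip all bits.
fn depth_to_sortable_bits(depth: f32) -> u32 {
    let bits = bitcast<u32>(depth);
    return select(~bits, bits | 0x80000000u, (bits & 0x80000000u) == 0u);
}
```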
Two renderers are provided:
- Point-cloud renderer: renders raw points with a simple size and per-point color. This uses a minimal vertex/fragment pipeline.
- Gaussian splat renderer: for each splat, a screen-space quad is rasterized; the fragment shader evaluates a Gaussian weight from the precomputed 2D covariance and outputs the splat's color scaled by opacity (see the fragment-shader sketch below).
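A minimal sketch of the per-fragment evaluation, assuming the vertex stage passes the splat color, the conic (inverse 2D covariance), and the pixel offset from the splat center; the location layout is illustrative:

```wgsl
@fragment
fn fs_main(
    @location(0) color: vec4f, // splat color (rgb) and opacity (a)
    @location(1) conic: vec3f, // inverse 2D covariance (a, b, c)
    @location(2) d: vec2f      // offset from splat center, in pixels
) -> @location(0) vec4f {
    // Gaussian falloff: exp(-0.5 * d^T * Sigma^-1 * d).
    let power = -0.5 * (conic.x * d.x * d.x + conic.z * d.y * d.y)
                - conic.y * d.x * d.y;
    if (power > 0.0) { discard; }
    let alpha = min(0.99, color.a * exp(power));
    if (alpha < 1.0 / 255.0) { discard; }
    // Premultiplied alpha for back-to-front "over" blending.
    return vec4f(color.rgb * alpha, alpha);
}
```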
Purpose: improve rendering performance on large point clouds by organizing Gaussian splats into screen-space tiles and sorting per tile, reducing memory bandwidth and improving cache locality.
Implementation: a compute-based pipeline (sketched after this list) that:
- For each Gaussian, determines which tiles it overlaps using its 2D projection and splat radius.
- Generates a key-value pair `(tile_id | depth, gaussian_id)` for each Gaussian-tile intersection, using atomic operations to avoid write conflicts.
- Sorts these pairs globally using radix sort.
- Identifies start/end ranges for each tile via a second compute pass.
- During rendering, each tile reads its own contiguous, depth-ordered slice of the sorted pairs, so tiles can be rasterized in any order.
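A sketch of the key-generation step; the buffer names, pair encoding, and `depth_bits` (e.g. the sortable bits from the earlier depth-key trick) are assumptions, not the repository's exact layout:

```wgsl
@group(0) @binding(0) var<storage, read_write> pair_count: atomic<u32>;
@group(0) @binding(1) var<storage, read_write> keys: array<vec2u>;  // (tile_id, depth bits)
@group(0) @binding(2) var<storage, read_write> values: array<u32>;  // gaussian_id

// Emit one (tile_id | depth, gaussian_id) pair per tile the splat's
// screen-space extent overlaps; atomicAdd reserves a unique output slot,
// so parallel invocations never write to the same index.
fn emit_pairs(gaussian_id: u32, tile_min: vec2u, tile_max: vec2u,
              depth_bits: u32, tiles_x: u32) {
    for (var ty = tile_min.y; ty <= tile_max.y; ty++) {
        for (var tx = tile_min.x; tx <= tile_max.x; tx++) {
            let slot = atomicAdd(&pair_count, 1u);
            keys[slot] = vec2u(ty * tiles_x + tx, depth_bits);
            values[slot] = gaussian_id;
        }
    }
}
```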
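And a sketch of the second pass, which finds each tile's contiguous range in the sorted pair list (again with illustrative names): a range starts wherever a pair's tile id differs from its predecessor's, and ends one past the last pair with that tile id.

```wgsl
@group(0) @binding(0) var<storage, read> sorted_keys: array<vec2u>;        // (tile_id, depth bits)
@group(0) @binding(1) var<storage, read_write> tile_ranges: array<vec2u>;  // (start, end) per tile

@compute @workgroup_size(256)
fn identify_ranges(@builtin(global_invocation_id) gid: vec3u) {
    let i = gid.x;
    let n = arrayLength(&sorted_keys);
    if (i >= n) { return; }
    let tile = sorted_keys[i].x;
    if (i == 0u || sorted_keys[i - 1u].x != tile) {
        tile_ranges[tile].x = i;        // first pair belonging to `tile`
    }
    if (i + 1u == n || sorted_keys[i + 1u].x != tile) {
        tile_ranges[tile].y = i + 1u;   // one past the last pair for `tile`
    }
}
```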
Trade-offs: This approach reduces per-frame fragment shader overdraw in theory but adds preprocessing overhead. For scenes with significant depth complexity and overlapping splats per tile, this can improve performance. For scenes with sparse or well-distributed splats, the overhead may outweigh the benefits.
| Point Cloud Renderer | Gaussian Splatting |
|---|---|
| ![]() | ![]() |
| 272,965 points — point renderer | 272,965 points — gaussian splatting |
Observation: for the bonsai dataset (272,965 points), both renderers saturate the display on the test laptop, hitting the monitor's 144 Hz cap.
| Point Cloud Renderer | Gaussian Splatting |
|---|---|
| ![]() | ![]() |
| 1,063,091 points — point renderer | 1,063,091 points — gaussian splatting |
Measured performance (personal laptop):
- Bonsai (272,965 points): both renderers hit the display cap at 144 Hz.
- Bicycle (1,063,091 points): point-cloud renderer ≈ 100 FPS; Gaussian splatting renderer ≈ 50 FPS; Gaussian splatting with tile-based depth sorting ≈ 40 FPS.
Discussion: the Gaussian splatting renderer rasterizes a screen-space quad per point and evaluates a Gaussian per-fragment. For dense clouds (the bicycle set) this produces significantly more fragment shader work and overdraw compared to rendering simple points. For the bonsai dataset the total fragment workload is low enough that both appear similarly fast.
The tile-based depth sorting variant produces no visible difference in rendered quality compared to the base Gaussian splatting renderer. However, it runs approximately 20% slower (40 FPS vs 50 FPS on bicycle) despite the theoretical benefits of improved memory coherence and reduced overdraw. This performance regression occurs because:
- Preprocessing overhead dominates: the tile-based approach adds significant compute cost in key generation, atomic allocations, sorting the larger tile-pair dataset, and range identification. These stages are necessary every frame and account for substantial GPU time.
- Tile-sorting benefits don't materialize at this scale: while tile-based sorting theoretically improves cache locality, the actual savings in fragment shader evaluation are minimal. The 1M point cloud is already sparse enough per tile that depth coherence doesn't yield meaningful performance gains on modern GPUs with efficient caching.
- Atomic operation contention: multiple Gaussians writing to the same tile allocator creates atomic contention, serializing work that could otherwise be parallelized.
- Small tile sizes: to fit within WebGPU buffer limits (max_tile_pairs ≈ 16M), tile sizes must be modest, spreading splats across many tiles and reducing the per-tile depth-sorting benefit.
Note: these are preliminary numbers from a single machine. More comprehensive benchmarking is required (different GPUs, tile-based profiling, memory bandwidth counters, and varying splat sizes) to draw robust conclusions.
I implemented an incorrect depth calculation and improper sort-buffer usage, which caused the rendered Gaussian quads to refresh with black square artifacts that obscured the view.

| Bonsai | Bicycle |
|---|---|
| ![]() | ![]() |
- Download Node.js

- Clone the repository

  ```sh
  git clone https://github.com/lu-m-dev/WebGPU-gaussian-splatting.git
  ```

- Install dependencies

  ```sh
  cd WebGPU-gaussian-splatting
  npm install
  ```

- Launch the dev server

  ```sh
  npm run dev
  ```

  Optionally, build a static site into `dist/`:

  ```sh
  npm run build
  ```
- Vite
- tweakpane
- stats.js
- wgpu-matrix
- Special Thanks to: Shrek Shao (Google WebGPU team) & Differential Gaussian Renderer







