2 changes: 2 additions & 0 deletions README.adoc
@@ -154,6 +154,8 @@ The Vulkan Guide content is also viewable from https://docs.vulkan.org/guide/lat

=== xref:{chapters}dynamic_state_map.adoc[Dynamic State Map]

== xref:{chapters}compute_shaders.adoc[Compute Shaders]

== xref:{chapters}subgroups.adoc[Subgroups]

* `VK_EXT_subgroup_size_control`, `VK_KHR_shader_subgroup_extended_types`, `VK_EXT_shader_subgroup_ballot`, `VK_EXT_shader_subgroup_vote`
1 change: 1 addition & 0 deletions antora/modules/ROOT/nav.adoc
@@ -60,6 +60,7 @@
** xref:{chapters}robustness.adoc[]
** xref:{chapters}dynamic_state.adoc[]
*** xref:{chapters}dynamic_state_map.adoc[]
** xref:{chapters}compute_shaders.adoc[]
** xref:{chapters}subgroups.adoc[]
** xref:{chapters}shader_memory_layout.adoc[]
** xref:{chapters}atomics.adoc[]
169 changes: 169 additions & 0 deletions chapters/compute_shaders.adoc
@@ -0,0 +1,169 @@
// Copyright 2026 The Khronos Group, Inc.
// SPDX-License-Identifier: CC-BY-4.0

// Required for both single-page and combined guide xrefs to work
ifndef::chapters[:chapters:]
ifndef::images[:images: images/]

// the [] in the URL messes up asciidoc
:max-compute-workgroup-size-link: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupSize[0]&platform=all
:max-compute-workgroup-count-link: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupCount[0]&platform=all

[[compute-shaders]]
= Compute Shaders

This chapter is **not** a "how to use compute shaders" tutorial; there are plenty of resources online covering GPGPU and compute.

What this chapter covers instead are the "Vulkan-isms", terminology, and other details associated with compute shaders.

There is also a xref:{chapters}decoder_ring.adoc[Decoder Ring] created to help people transition from other APIs that use different terminology.

[NOTE]
====
If you want to play around with a simple compute example, we suggest taking a look at the link:https://github.com/charles-lunarg/vk-bootstrap/blob/main/example/simple_compute.cpp[vk-bootstrap sample].
====

== Coming from Vulkan Graphics

For those who are already familiar with graphics in Vulkan, compute is a simple transition. Basically everything is the same, except:

- Call `vkCmdDispatch` instead of `vkCmdDraw`
- Use `vkCreateComputePipelines` instead of `vkCreateGraphicsPipelines`
- Make sure your `VkQueue` xref:{chapters}queues.adoc[supports] `VK_QUEUE_COMPUTE_BIT`
- When binding descriptors and pipelines to your command buffer, make sure to use `VK_PIPELINE_BIND_POINT_COMPUTE`
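Putting those differences together, recording a compute dispatch might look like the following sketch (assuming `cmd`, `compute_pipeline`, `pipeline_layout`, and `descriptor_set` were created elsewhere):

[source,c]
----
// Bind with the COMPUTE bind point rather than GRAPHICS.
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, compute_pipeline);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE,
                        pipeline_layout, 0, 1, &descriptor_set, 0, NULL);

// vkCmdDraw's counterpart: launch a 64 x 1 x 1 grid of workgroups.
vkCmdDispatch(cmd, 64, 1, 1);
----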

== SPIR-V Terminology

The smallest unit of work is called an `invocation`. It is a single "thread" or "lane" of work.

Multiple `invocations` are organized into `subgroups`, where `invocations` within a `subgroup` can synchronize and share data with each other efficiently. (See the xref:{chapters}subgroups.adoc[subgroup chapter] for more.)

Next are `workgroups`, the smallest unit of work an application can dispatch. A `workgroup` is a collection of `invocations` that execute the same shader.

[NOTE]
====
While slightly annoying, the Vulkan spec uses `WorkGroup` while the SPIR-V spec spells it `Workgroup`. The difference has no significance, other than being a potential source of typos when going between the two.
====

=== Workgroup Size

Setting the `workgroup` size can be done in three ways:

1. Using the `WorkgroupSize` built-in (link:https://godbolt.org/z/ees83eT7x[example])
2. Using the `LocalSize` execution mode (link:https://godbolt.org/z/3zn1Preb8[example])
3. Using the `LocalSizeId` execution mode (link:https://godbolt.org/z/dP7daqTas[example])

A few important things to note:

- The `WorkgroupSize` decoration will take precedence over any `LocalSize` or `LocalSizeId` in the same module.
- `LocalSizeId` was added in the extension `VK_KHR_maintenance4` (made core in Vulkan 1.3) to allow the ability to use specialization constants to set the size.
- There is a `maxComputeWorkGroupSize` limit on how large the `X`, `Y`, and `Z` sizes can each be. Most implementations {max-compute-workgroup-size-link}[support around 1024 for each dimension].
- There is a `maxComputeWorkGroupInvocations` limit on how large the product `X * Y * Z` can be. Most implementations link:https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations&platform=all[support around 1024].
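
For example, in GLSL the specialization-constant form (option 3) looks like the following sketch (the `constant_id` values 0 through 2 are arbitrary choices):

[source,glsl]
----
#version 450

// Compiles to the LocalSizeId execution mode; the actual sizes are
// supplied at pipeline creation through VkSpecializationInfo.
// A fixed size (option 2) would instead be: layout(local_size_x = 8) in;
layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z_id = 2) in;

void main() {}
----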

=== Local and Global Workgroups

When `vkCmdDispatch` is called, it sets the number of workgroups to dispatch. This produces a `global workgroup` space that the GPU will work on. Each single workgroup is a `local workgroup`. An `invocation` within a `local workgroup` can share data with other members of the `local workgroup` through shared variables as well as issue memory and control flow barriers to synchronize with other members of the `local workgroup`.

[NOTE]
====
There is a `maxComputeWorkGroupCount` limit on the number of workgroups per dispatch. link:{max-compute-workgroup-count-link}[Some hardware] supports only 64k per dimension, but newer hardware is basically unlimited here.
====
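
When mapping a problem, such as an image, onto the global workgroup space, the workgroup count is usually a rounded-up division of the problem size by the workgroup size. A minimal host-side sketch (the helper name `group_count` is our own, not a Vulkan API):

[source,c]
----
#include <stdint.h>

// Round up so that every work item is covered by at least one invocation.
// Any extra invocations in the last workgroup must be masked off in the shader.
static uint32_t group_count(uint32_t work_items, uint32_t local_size)
{
    return (work_items + local_size - 1) / local_size;
}
----

With an 8x8x1 workgroup size, a 1920x1080 image would then be dispatched as `vkCmdDispatch(cmd, group_count(1920, 8), group_count(1080, 8), 1)`.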

=== Dispatching size from a buffer

`vkCmdDispatchIndirect` (and the newer `vkCmdDispatchIndirect2KHR`) allows the dispatch size to be read from a buffer. This means the GPU itself can set the number of workgroups to dispatch.

[source,c]
----
// Some earlier dispatch (or draw) writes the VkDispatchIndirectCommand
// into my_buffer on the GPU.
vkCmdDispatch(command_buffer, x, y, z);

// Make the shader writes visible to the indirect command read.
VkMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT,
};
vkCmdPipelineBarrier(command_buffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // src stage
                     VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,  // dst stage
                     0, 1, &barrier, 0, NULL, 0, NULL);

// Reads the VkDispatchIndirectCommand in the buffer to set the
// number of local workgroups.
vkCmdDispatchIndirect(command_buffer, my_buffer, 0);
----

== Shared memory

Within a single `local workgroup`, "shared memory" can be used. In SPIR-V this is referenced with the `Workgroup` storage class.

Shared memory is essentially the "L1 cache you can control" in your compute shader and an important part of any performant shader.

There is a `maxComputeSharedMemorySize` limit (link:https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeSharedMemorySize&platform=all[mainly around 32k bytes]) that needs to be accounted for.
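
Before looking at what can go wrong, here is a sketch of shared memory used for cross-invocation communication: a workgroup-wide sum reduction (the buffer binding and the size of 64 are illustrative choices):

[source,glsl]
----
#version 450

layout(local_size_x = 64) in;

layout(set = 0, binding = 0) buffer Data { float values[]; };

shared float temp[64];

void main() {
    uint tid = gl_LocalInvocationIndex;
    temp[tid] = values[gl_GlobalInvocationID.x];

    // Butterfly reduction: each pass adds in a partner's value,
    // halving the stride every iteration.
    for (uint s = 64u / 2u; s > 0u; s >>= 1u) {
        barrier(); // wait for all writes to temp
        float other = temp[tid ^ s];
        barrier(); // wait for all reads before overwriting
        temp[tid] += other;
    }

    // Every invocation now holds the workgroup's total in temp[tid].
    if (tid == 0u) {
        values[gl_WorkGroupID.x] = temp[0];
    }
}
----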

=== Shared Memory Race Conditions

It is very easy to have race conditions when using shared memory.

The classic example is when multiple invocations initialize something to the same value.

[source,glsl]
----
shared uint my_var;

void main() {
    // All the invocations in the workgroup are going to try to
    // write to the same memory location.
    my_var = 0; // RACE CONDITION
}
----

If you are asking "why?", the "technically correct" answer is "because the link:https://docs.vulkan.org/spec/latest/appendices/memorymodel.html[memory model] says so".

When you do a weak store to a memory location, that invocation "owns" that memory location until synchronization occurs. The compiler **can** use that information and choose to reuse that location as temporary storage for another value.

Luckily, the fix is simple: use atomics.

[source,glsl]
----
#extension GL_KHR_memory_scope_semantics : enable

shared uint my_var;

void main() {
    // A relaxed atomic store is not a data race, even though every
    // invocation writes the same value.
    atomicStore(my_var, 0u, gl_ScopeWorkgroup, 0, 0);
}
----

Another option is to use an `OpControlBarrier` with `Workgroup` scope (link:https://godbolt.org/z/WcsvjYfPx[see online]).

[source,glsl]
----
layout(local_size_x = 32) in; // 32x1x1 workgroup
shared uint my_var[32];       // one slot for each invocation

void main() {
    my_var[gl_LocalInvocationIndex] = 0;
    barrier(); // will generate an OpControlBarrier for you
    uint x = my_var[gl_LocalInvocationIndex ^ 1];
}
----

==== Detecting shared memory data races

Luckily this problem can be caught automatically using the link:https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/docs/gpu_validation.md[GPU-AV] feature in Vulkan Validation Layers!

As of March 2026 (TODO - Add SDK version when released in May), GPU-AV will attempt to detect these races for you. There are a link:https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/layers/gpuav/shaders/instrumentation/shared_memory_data_race.comp#L47[few limitations], but we highly suggest trying it out if you are seeing strange issues around your shared memory accesses.

=== Explicit Layout of shared memory

The xref:{chapters}extensions/shader_features.adoc#VK_KHR_workgroup_memory_explicit_layout[VK_KHR_workgroup_memory_explicit_layout] extension was added to allow link:https://github.com/KhronosGroup/SPIRV-Guide/blob/main/chapters/explicit_layout.md[explicit layout] of shared memory.

== Finding the invocation in your shader

There are many SPIR-V built-in values that can be used to identify the current invocation in your shader.

The following built-ins are well defined in the link:https://docs.vulkan.org/spec/latest/chapters/interfaces.html#interfaces-builtin-variables[builtin chapter] of the Vulkan spec.

- `GlobalInvocationId`
- `LocalInvocationId`
- `LocalInvocationIndex`
- `NumSubgroups`
- `NumWorkgroups`
- `SubgroupId`
- `WorkgroupId`

For those who want a more "hands on" example, link:https://godbolt.org/z/qhPrE6o5b[the following GLSL] demonstrates using most of these built-ins.
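
These built-ins are related by simple arithmetic: `GlobalInvocationId` is `WorkgroupId * WorkgroupSize + LocalInvocationId`, and `LocalInvocationIndex` is the flattening of `LocalInvocationId` with `X` varying fastest. A small host-side sketch of these two relationships (the struct and function names are our own):

[source,c]
----
#include <stdint.h>

typedef struct { uint32_t x, y, z; } uvec3;

// GlobalInvocationId = WorkgroupId * WorkgroupSize + LocalInvocationId
static uvec3 global_invocation_id(uvec3 wg_id, uvec3 wg_size, uvec3 local_id)
{
    uvec3 id = { wg_id.x * wg_size.x + local_id.x,
                 wg_id.y * wg_size.y + local_id.y,
                 wg_id.z * wg_size.z + local_id.z };
    return id;
}

// LocalInvocationIndex flattens LocalInvocationId with X varying fastest.
static uint32_t local_invocation_index(uvec3 wg_size, uvec3 local_id)
{
    return local_id.z * wg_size.x * wg_size.y
         + local_id.y * wg_size.x
         + local_id.x;
}
----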
