diff --git a/README.adoc b/README.adoc
index 72d6f4d..7ab42d5 100644
--- a/README.adoc
+++ b/README.adoc
@@ -154,6 +154,8 @@ The Vulkan Guide content is also viewable from https://docs.vulkan.org/guide/lat
 === xref:{chapters}dynamic_state_map.adoc[Dynamic State Map]
 
+== xref:{chapters}compute_shaders.adoc[Compute Shaders]
+
 == xref:{chapters}subgroups.adoc[Subgroups]
 
 * `VK_EXT_subgroup_size_control`, `VK_KHR_shader_subgroup_extended_types`, `VK_EXT_shader_subgroup_ballot`, `VK_EXT_shader_subgroup_vote`
diff --git a/antora/modules/ROOT/nav.adoc b/antora/modules/ROOT/nav.adoc
index 6fd940a..c5662fe 100644
--- a/antora/modules/ROOT/nav.adoc
+++ b/antora/modules/ROOT/nav.adoc
@@ -60,6 +60,7 @@
 ** xref:{chapters}robustness.adoc[]
 ** xref:{chapters}dynamic_state.adoc[]
 *** xref:{chapters}dynamic_state_map.adoc[]
+** xref:{chapters}compute_shaders.adoc[]
 ** xref:{chapters}subgroups.adoc[]
 ** xref:{chapters}shader_memory_layout.adoc[]
 ** xref:{chapters}atomics.adoc[]
diff --git a/chapters/compute_shaders.adoc b/chapters/compute_shaders.adoc
new file mode 100644
index 0000000..718b397
--- /dev/null
+++ b/chapters/compute_shaders.adoc
@@ -0,0 +1,169 @@
+// Copyright 2026 The Khronos Group, Inc.
+// SPDX-License-Identifier: CC-BY-4.0
+
+// Required for both single-page and combined guide xrefs to work
+ifndef::chapters[:chapters:]
+ifndef::images[:images: images/]
+
+// the [] in the URL messes up asciidoc
+:max-compute-workgroup-size-link: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupSize[0]&platform=all
+:max-compute-workgroup-count-link: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupCount[0]&platform=all
+
+[[compute-shaders]]
+= Compute Shaders
+
+This chapter is *not* a "how to use compute shaders" tutorial; there are plenty of resources online around GPGPU and compute.
+
+Instead, this chapter covers the "Vulkan-isms", terminology, and limits associated with compute shaders.
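+Throughout the chapter we will refer to various pieces of a compute shader. As a point of reference, here is a minimal GLSL compute shader (a sketch; the workgroup size and buffer binding are arbitrary example choices, not requirements):
+
+[source,glsl]
+----
+#version 450
+// 64x1x1 local workgroup size (an arbitrary example value)
+layout(local_size_x = 64) in;
+
+// A storage buffer to write results into (binding 0 is just an example)
+layout(set = 0, binding = 0) buffer Data {
+    uint values[];
+};
+
+void main() {
+    // Each invocation writes to its own element
+    values[gl_GlobalInvocationID.x] = gl_GlobalInvocationID.x;
+}
+----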
+
+There is also a xref:{chapters}decoder_ring.adoc[Decoder Ring] created to help people transition from other APIs that use different terminology.
+
+[NOTE]
+====
+If you want to play around with a simple compute example, we suggest taking a look at the link:https://github.com/charles-lunarg/vk-bootstrap/blob/main/example/simple_compute.cpp[vk-bootstrap sample].
+====
+
+== Coming from Vulkan Graphics
+
+For those who are already familiar with graphics in Vulkan, compute will be a simple transition. Basically everything is the same except:
+
+- Call `vkCmdDispatch` instead of `vkCmdDraw`
+- Use `vkCreateComputePipelines` instead of `vkCreateGraphicsPipelines`
+- Make sure your `VkQueue` xref:{chapters}queues.adoc[supports] `VK_QUEUE_COMPUTE_BIT`
+- When binding descriptors and pipelines to your command buffer, make sure to use `VK_PIPELINE_BIND_POINT_COMPUTE`
+
+== SPIR-V Terminology
+
+The smallest unit of work is called an `invocation`. It is a single "thread" or "lane" of work.
+
+Multiple `invocations` are organized into `subgroups`, where `invocations` within a `subgroup` can synchronize and share data with each other efficiently. (See the xref:{chapters}subgroups.adoc[subgroup chapter] for more.)
+
+Next we have the `workgroup`, the smallest unit of work that an application can dispatch. A `workgroup` is a collection of `invocations` that execute the same shader.
+
+[NOTE]
+====
+While slightly annoying, the Vulkan spec spells it `WorkGroup` while the SPIR-V spec spells it `Workgroup`. The difference has no significance, other than being a potential source of typos when moving between the two.
+====
+
+=== Workgroup Size
+
+Setting the `workgroup` size can be done in 3 ways:
+
+1. Using the `WorkgroupSize` built-in (link:https://godbolt.org/z/ees83eT7x[example])
+2. Using the `LocalSize` execution mode (link:https://godbolt.org/z/3zn1Preb8[example])
+3. Using the `LocalSizeId` execution mode (link:https://godbolt.org/z/dP7daqTas[example])
+
+A few important things to note:
+
+- The `WorkgroupSize` decoration will take precedence over any `LocalSize` or `LocalSizeId` in the same module.
+- `LocalSizeId` was added in the `VK_KHR_maintenance4` extension (made core in Vulkan 1.3) to allow specialization constants to set the size.
+- There is a `maxComputeWorkGroupSize` limit on how large the `X`, `Y`, and `Z` dimensions can each be. Most implementations {max-compute-workgroup-size-link}[support around 1024 for each dimension].
+- There is a `maxComputeWorkGroupInvocations` limit on how large the product of `X` * `Y` * `Z` can be. Most implementations link:https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeWorkGroupInvocations&platform=all[support around 1024].
+
+=== Local and Global Workgroups
+
+When `vkCmdDispatch` is called, it sets the number of workgroups to dispatch. This produces a `global workgroup` space that the GPU will work on. Each single workgroup is a `local workgroup`. An `invocation` within a `local workgroup` can share data with the other members of that `local workgroup` through shared variables, and can issue memory and control flow barriers to synchronize with them.
+
+[NOTE]
+====
+There is also a `maxComputeWorkGroupCount` limit; link:{max-compute-workgroup-count-link}[some hardware] supports only 64k workgroups per dimension, but newer hardware is basically unlimited here.
+====
+
+=== Dispatching size from a buffer
+
+`vkCmdDispatchIndirect` (and the newer `vkCmdDispatchIndirect2KHR`) allows the dispatch size to be read from a buffer. This means the GPU itself can set the number of workgroups to dispatch.
+
+[source,cpp]
+----
+// Some previous dispatch (or draw) writes the VkDispatchIndirectCommand into my_buffer
+vkCmdDispatch(command_buffer, x, y, z);
+
+// Make the compute shader writes visible to the indirect command read
+VkMemoryBarrier barrier{VK_STRUCTURE_TYPE_MEMORY_BARRIER};
+barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
+barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
+vkCmdPipelineBarrier(command_buffer,
+                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // src stage
+                     VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,  // dst stage
+                     0, 1, &barrier, 0, nullptr, 0, nullptr);
+
+// Reads the VkDispatchIndirectCommand in the buffer to set the number of local workgroups
+vkCmdDispatchIndirect(command_buffer, my_buffer, 0);
+----
+
+== Shared memory
+
+Within a single `local workgroup`, "shared memory" can be used. In SPIR-V this is referenced with the `Workgroup` storage class.
+
+Shared memory is essentially the "L1 cache you can control" in your compute shader and an important part of any performant shader.
+
+There is a `maxComputeSharedMemorySize` limit (link:https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeSharedMemorySize&platform=all[commonly around 32 KB]) that needs to be accounted for.
+
+=== Shared Memory Race Conditions
+
+It is very easy to introduce race conditions when using shared memory.
+
+The classic example is when multiple invocations initialize something to the same value.
+
+[source,glsl]
+----
+shared uint my_var;
+void main() {
+    // All the invocations in the workgroup are going to try to write to the same memory.
+    // RACE CONDITION
+    my_var = 0;
+}
+----
+
+If you are asking "why?", the "technically correct" answer is "because the link:https://docs.vulkan.org/spec/latest/appendices/memorymodel.html[memory model] says so".
+
+When you do a weak store to a memory location, that invocation "owns" that memory location until synchronization occurs. The compiler *can* use that information and choose to reuse that location as temporary storage for another value.
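+To make this concrete, a compiler could legally transform the racy snippet into something like the following. This is purely illustrative pseudo-GLSL (the spilled temporary `scratch_value` is hypothetical), not what any particular compiler actually emits:
+
+[source,glsl]
+----
+shared uint my_var;
+void main() {
+    // The compiler assumes this invocation exclusively owns my_var until
+    // the next synchronization, so it may spill an unrelated temporary there...
+    my_var = scratch_value;
+    // ... read it back later ...
+    // ... and only write the "real" value at the end. Another invocation
+    // reading my_var in between would see the temporary, not 0.
+    my_var = 0;
+}
+----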
+
+Luckily the fix is simple: make sure to use atomics.
+
+[source,glsl]
+----
+#extension GL_KHR_memory_scope_semantics : enable
+
+shared uint my_var;
+void main() {
+    atomicStore(my_var, 0u, gl_ScopeWorkgroup, 0, 0);
+}
+----
+
+Another option is to use an `OpControlBarrier` with `Workgroup` scope (link:https://godbolt.org/z/WcsvjYfPx[see online]).
+
+[source,glsl]
+----
+layout(local_size_x = 32) in; // 32x1x1 workgroup
+shared uint my_var[32];       // one slot for each invocation
+
+void main() {
+    my_var[gl_LocalInvocationIndex] = 0;
+    barrier(); // will generate an OpControlBarrier for you
+    uint x = my_var[gl_LocalInvocationIndex ^ 1];
+}
+----
+
+==== Detecting shared memory data races
+
+Luckily this problem can be caught automatically using the link:https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/docs/gpu_validation.md[GPU-AV] feature in the Vulkan Validation Layers!
+
+As of March 2026 (TODO - Add SDK version when released in May), GPU-AV will attempt to detect these races for you. There are a link:https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/layers/gpuav/shaders/instrumentation/shared_memory_data_race.comp#L47[few limitations], but we highly suggest trying it out if you are seeing strange issues around your shared memory accesses.
+
+=== Explicit Layout of shared memory
+
+The xref:{chapters}extensions/shader_features.adoc#VK_KHR_workgroup_memory_explicit_layout[VK_KHR_workgroup_memory_explicit_layout] extension was added to allow link:https://github.com/KhronosGroup/SPIRV-Guide/blob/main/chapters/explicit_layout.md[explicit layout] of shared memory.
+
+== Finding the invocation in your shader
+
+There are many SPIR-V built-in values that can be used to find an invocation's position in your shader.
+
+The following built-ins are well defined in the link:https://docs.vulkan.org/spec/latest/chapters/interfaces.html#interfaces-builtin-variables[builtin chapter] of the Vulkan spec.
+
+- `GlobalInvocationId`
+- `LocalInvocationId`
+- `LocalInvocationIndex`
+- `NumSubgroups`
+- `NumWorkgroups`
+- `SubgroupId`
+- `WorkgroupId`
+
+For those who want a more "hands on" example, link:https://godbolt.org/z/qhPrE6o5b[the following GLSL] demonstrates using most of these built-ins.
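+As a sketch of how these relate to each other, their GLSL equivalents obey the following identities from the GLSL spec (the workgroup size here is an arbitrary example):
+
+[source,glsl]
+----
+#version 450
+layout(local_size_x = 8, local_size_y = 8) in;
+
+void main() {
+    // The global ID is derived from the workgroup ID and the invocation's
+    // position within its local workgroup:
+    uvec3 global = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID;
+    // global == gl_GlobalInvocationID
+
+    // gl_LocalInvocationIndex is gl_LocalInvocationID flattened to a scalar:
+    uint index = gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
+                 gl_LocalInvocationID.y * gl_WorkGroupSize.x +
+                 gl_LocalInvocationID.x;
+    // index == gl_LocalInvocationIndex
+}
+----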