diff --git a/antora/modules/ROOT/nav.adoc b/antora/modules/ROOT/nav.adoc index 89c2dc6f..e660ca81 100644 --- a/antora/modules/ROOT/nav.adoc +++ b/antora/modules/ROOT/nav.adoc @@ -61,6 +61,28 @@ ** xref:courses/18_Ray_tracing/05_Shadow_transparency.adoc[Shadow transparency] ** xref:courses/18_Ray_tracing/06_Reflections.adoc[Reflections] ** xref:courses/18_Ray_tracing/07_Conclusion.adoc[Conclusion] +* Synchronization 2 +** xref:Synchronization/introduction.adoc[Introduction] +** Anatomy of a Dependency +*** xref:Synchronization/Anatomy_of_a_Dependency/01_introduction.adoc[Introduction] +** Pipeline Barriers and Transitions +*** xref:Synchronization/Pipeline_Barriers_Transitions/01_introduction.adoc[Introduction] +** Timeline Semaphores: The Master Clock +*** xref:Synchronization/Timeline_Semaphores/01_introduction.adoc[Introduction] +** Frame-in-Flight Architecture +*** xref:Synchronization/Frame_in_Flight/01_introduction.adoc[Introduction] +** Asynchronous Compute & Execution Overlap +*** xref:Synchronization/Async_Compute_Overlap/01_introduction.adoc[Introduction] +** Transfer Queues & Asset Streaming Sync +*** xref:Synchronization/Transfer_Queues_Streaming/01_introduction.adoc[Introduction] +** Synchronization in Dynamic Rendering +*** xref:Synchronization/Dynamic_Rendering_Sync/01_introduction.adoc[Introduction] +** Host Image Copies & Memory Mapped Sync +*** xref:Synchronization/Host_Image_Copies_Memory_Sync/01_introduction.adoc[Introduction] +** Debugging with Synchronization Validation +*** xref:Synchronization/Synchronization_Validation/01_introduction.adoc[Introduction] +** Profiling, Batching, and Optimization +*** xref:Synchronization/Profiling_Optimization/01_introduction.adoc[Introduction] * xref:90_FAQ.adoc[FAQ] * link:https://github.com/KhronosGroup/Vulkan-Tutorial[GitHub Repository, window=_blank] diff --git a/en/Synchronization/Anatomy_of_a_Dependency/01_introduction.adoc b/en/Synchronization/Anatomy_of_a_Dependency/01_introduction.adoc new file mode 
100644 index 00000000..dac2b081 --- /dev/null +++ b/en/Synchronization/Anatomy_of_a_Dependency/01_introduction.adoc @@ -0,0 +1,25 @@ +:pp: {plus}{plus} += Anatomy of a Dependency: Introduction + +== Overview + +Every Vulkan operation, from a simple color clear to a complex ray-traced reflections pass, lives and breathes by the dependencies we define. In this chapter, we take a deep dive into the core mechanics of how data actually moves through the Vulkan pipeline and why synchronization is about much more than just "setting a bitmask." + +image::/images/rendering_pipeline_flowchart.png[Rendering Pipeline Flowchart, width=600, alt="Flowchart showing the stages of a modern Vulkan rendering pipeline"] + +To truly master synchronization, we first need to break down what happens when the GPU processes your commands. We often talk about the GPU as a "massive parallel processor," but what does that mean for data integrity? We'll start by deconstructing the fundamental differences between **Execution Dependencies** (the "when" of GPU work) and **Memory Dependencies** (the "where" and "visibility" of data). + +=== What You'll Learn in This Chapter + +This chapter is designed to move you from "making it work" to "knowing why it works." We'll explore: + +* **The Hardware Perspective**: Understanding why execution barriers alone are not enough to prevent data corruption on modern, multi-cache GPUs. +* **Execution vs. Memory Dependencies**: Learning how to distinguish between stopping a stage and ensuring its data is actually readable by the next one. +* **The Synchronization 2 Advantage**: Why the new `vk::DependencyInfo` and `vkCmdPipelineBarrier2` are more than just a syntax cleanup—they are a fundamental shift in how we express intent to the driver. +* **Surgical Precision with Pipeline Stages**: Mastering `vk::PipelineStageFlagBits2` and `vk::AccessFlagBits2` to target specific hardware units, ensuring maximum GPU occupancy by avoiding unnecessary pipeline bubbles. 
+ +By the end of this chapter, you’ll have a clear understanding of the "handshake" that must occur between any two pieces of GPU work. This foundation is crucial for everything that follows, from simple image layout transitions to complex asynchronous compute architectures. + +== Navigation + +Previous: xref:Synchronization/introduction.adoc[Introduction] | Next: xref:Synchronization/Anatomy_of_a_Dependency/02_execution_vs_memory.adoc[Execution vs. Memory Dependencies] diff --git a/en/Synchronization/Anatomy_of_a_Dependency/02_execution_vs_memory.adoc b/en/Synchronization/Anatomy_of_a_Dependency/02_execution_vs_memory.adoc new file mode 100644 index 00000000..c4821714 --- /dev/null +++ b/en/Synchronization/Anatomy_of_a_Dependency/02_execution_vs_memory.adoc @@ -0,0 +1,61 @@ +:pp: {plus}{plus} += Execution vs. Memory Dependencies + +== Introduction + +To understand why synchronization is so critical, we first need to look at what's happening under the hood when a GPU processes your work. Unlike a CPU, which generally executes instructions in a linear, predictable fashion, the GPU is a massive, highly-parallel array of specialized hardware units. When you submit a command buffer, the GPU doesn't just start at the top and finish at the bottom; it distributes tasks across various stages of its pipeline—geometry, rasterization, fragment shading, and more—often all at once. + +This parallelism is what makes Vulkan powerful, but it's also where the danger lies. If you want a fragment shader to read data that was just written by a compute shader, you must define exactly how that dependency works. In Vulkan, this is split into two distinct concepts: **Execution Dependencies** and **Memory Dependencies**. + +=== The "When": Execution Dependencies + +An **Execution Dependency** is the simplest form of synchronization. It answers the question: "When can this work start?" + +Imagine you have two commands: Command A and Command B. 
An execution dependency from A to B simply tells the GPU: "Don't start the specified pipeline stages of Command B until the specified pipeline stages of Command A have finished." + +This sounds straightforward, but here's the catch: on modern hardware, Command A finishing its work is *not* the same thing as its data being ready for Command B. Execution is just the trigger; memory is the substance. + +=== Architectural Realities: Caches and Memory Types + +Vulkan memory isn't just one big bucket where you store textures and buffers. Depending on your hardware, it's a complex landscape of different physical locations and access speeds. To sync effectively, you need to know what you're syncing against. + +On a **Discrete GPU**, you have dedicated Video RAM (VRAM) that is physically separate from your system's RAM. Moving data between these two is the job of the **DMA (Direct Memory Access)** engine—a specialized unit that can copy data across the PCI Express bus without bothering the main shader cores. When you upload a texture, you're often syncing the DMA engine with the Graphics pipeline. + +On the other hand, many laptops and mobile devices use **Unified Memory Architecture (UMA)**, where the CPU and GPU share the same physical RAM sticks. While this sounds like it should make things easier, it actually adds a hidden layer of complexity: **Caches**. Even if they share the RAM, the CPU has its own L1/L2/L3 caches, and the GPU has its own L1/L2 caches. If the GPU writes data to a shared buffer, that data might stay in the GPU's L2 cache and never actually reach the physical RAM. When the CPU tries to read it, it will see the old, stale value from the RAM or its own cache. + +In Vulkan, we categorize these behaviors into three primary memory types: + +* **Device Local**: This is memory that is "fastest" for the GPU to access. On a discrete card, this is the VRAM. On UMA, it's just a portion of the shared RAM. 
+* **Host Visible**: This memory can be "mapped" into your c{pp} application's address space, allowing the CPU to read and write to it directly. +* **Host Coherent**: A special type of Host Visible memory where the hardware automatically ensures that CPU and GPU see the same data without you needing to manually flush caches (though you still need an execution dependency to ensure the write has *finished*!). + +=== The "Where": Memory Dependencies + +This is where many Vulkan developers get caught. Even if Command A has finished, its output might still be sitting in a local L1 cache on a specific shader core, or it might be in a shared L2 cache that hasn't been written back to the main pool. If Command B—perhaps running on a completely different part of the GPU or even the CPU—tries to read that data from main memory before it has been "made available," it will read stale data. + +This is why we say execution is not enough. You can tell the hardware "Wait for the Compute Shader to finish before starting the Fragment Shader," and the hardware will happily oblige. But the Fragment Shader will then go to read the texture and find the old data because the Compute Shader's writes are still trapped in a local cache somewhere. + +A **Memory Dependency** ensures that data is properly moved between caches and main memory so it can be safely read. This involves two critical steps: + +1. **Availability**: This operation "flushes" the data from the source's local caches so that it is visible to a shared memory pool (like L2 cache or main memory). +2. **Visibility**: This operation "invalidates" the local caches of the destination stage, forcing it to read the fresh data from the shared memory pool rather than using whatever stale bits it might already have. + +Without both an execution dependency AND a memory dependency, you are living in a world of **hazards**. 
The most common is the "Read-After-Write" (RAW) hazard, where your fragment shader reads a texture before the compute shader has finished writing to it, resulting in the flickering and corruption artifacts that are so common in early Vulkan implementations. + +=== The Practical Handshake + +Think of it as a professional handshake. An execution dependency is the two people agreeing to meet. A memory dependency is one person actually handing the document to the other and the other person making sure they are looking at the new document, not their old notes. + +In Synchronization 2, we define this handshake using `vk::PipelineStageFlagBits2` and `vk::AccessFlagBits2`. The stage flags define the *when* (the execution dependency), and the access flags define the *how* (the memory dependency). By pairing these correctly, you ensure that your data is not only processed in the right order but is also actually there when you go to look for it. + +== Simple Engine Implementation: Caches and Safety + +In `Simple Engine`, we handle these architectural realities through our `MemoryPool` class (`memory_pool.cpp`). When we allocate memory for a buffer or image, we specify the `vk::MemoryPropertyFlags` to decide its role. For example, our `UniformBuffer` objects are typically allocated as `HostVisible | HostCoherent`. This means the CPU can write to them and they are automatically visible to the GPU without a manual `flushMappedMemoryRanges` call. + +However, just because they are **coherent** doesn't mean we can ignore execution dependencies! Even in `Simple Engine`, if the CPU updates a `HostCoherent` uniform buffer while the GPU is in the middle of a fragment shader reading from it, we will encounter a **data race**. This is why we still use `inFlightFences` and semaphores to ensure the GPU has finished using a frame's resources before the CPU starts modifying them for the next frame. + +For our textures and vertex buffers, we use `DeviceLocal` memory for maximum performance.
Because these are not host-coherent, we must use `vk::DependencyInfo` and `vk::ImageMemoryBarrier2` to explicitly manage the "Availability" and "Visibility" handshakes. This ensures that after a `vkCmdCopyBufferToImage` command, the data is properly flushed from the transfer unit's caches and invalidated for the fragment shader's caches. + +== Navigation + +Previous: xref:Synchronization/Anatomy_of_a_Dependency/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Anatomy_of_a_Dependency/03_sync2_advantage.adoc[The Synchronization 2 Advantage] diff --git a/en/Synchronization/Anatomy_of_a_Dependency/03_sync2_advantage.adoc b/en/Synchronization/Anatomy_of_a_Dependency/03_sync2_advantage.adoc new file mode 100644 index 00000000..7562d613 --- /dev/null +++ b/en/Synchronization/Anatomy_of_a_Dependency/03_sync2_advantage.adoc @@ -0,0 +1,195 @@ +:pp: {plus}{plus} += The Synchronization 2 Advantage + +== Introduction + +In the early days of Vulkan 1.0, defining dependencies was a fragmented and often frustrating process. You had to juggle multiple structures like `VkMemoryBarrier`, `VkBufferMemoryBarrier`, and `VkImageMemoryBarrier`. These structures weren't just numerous; they were also functionally separate, which meant the logic of your synchronization was spread across several different parts of your code. + +**Synchronization 2** (VK_KHR_synchronization2), which is now core in Vulkan 1.3, fundamentally changes this. It unifies these disparate barriers into a single, cohesive structure: `vk::DependencyInfo`. + +== The Fragmented Past: Why Legacy Was Hard + +To appreciate the advantage of Synchronization 2, we have to look at what we're leaving behind. In the legacy Vulkan 1.0 API, a pipeline barrier was a single function call that took three separate arrays of barriers: global, buffer, and image. 
+ +[,c++] +---- +// Legacy Vulkan 1.0 (Still works, but we don't like it) +vkCmdPipelineBarrier( + commandBuffer, + srcStageMask, // Global stage mask for ALL barriers + dstStageMask, // Global stage mask for ALL barriers + dependencyFlags, + memoryBarrierCount, pMemoryBarriers, + bufferMemoryBarrierCount, pBufferMemoryBarriers, + imageMemoryBarrierCount, pImageMemoryBarriers +); +---- + +Notice the problem? The `srcStageMask` and `dstStageMask` were passed as arguments to the *function*, not as part of the individual barrier structures. This meant that if you had two different image transitions in the same call—say, one from `Transfer` to `Fragment Shader` and another from `Compute` to `Vertex Shader`—you had to combine all those stages into a single, broad mask. + +This led to "over-synchronization." By merging the stages at the function level, you were inadvertently telling the GPU to wait for *all* the source stages to finish before *any* of the destination stages could start. You were creating a bottleneck where one didn't need to exist. + +image::/images/sync2_problem_over_sync.svg[Legacy Synchronization Log Jam, width=600, align="center"] + +== The "Chain of Intent": Unification with vk::DependencyInfo + +Synchronization 2 solves this by moving the stage masks into the barrier structures themselves. In our engine, we use `vk::DependencyInfo`, which acts as a container for all our synchronization needs. + +When we unify synchronization, we aren't just cleaning up the syntax. We are grouping the entire "intent" of the dependency in one place. With `vk::DependencyInfo`, each individual `vk::ImageMemoryBarrier2` or `vk::BufferMemoryBarrier2` contains its own `srcStageMask` and `dstStageMask`. 
+ +[,c++] +---- +// Synchronization 2 (The Modern Way) +vk::ImageMemoryBarrier2 imageBarrier{ + .srcStageMask = vk::PipelineStageFlagBits2::eColorAttachmentOutput, + .srcAccessMask = vk::AccessFlagBits2::eColorAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead, + // ... layout transitions, image handles ... +}; + +vk::DependencyInfo dependencyInfo{ + .imageMemoryBarrierCount = 1, + .pImageMemoryBarriers = &imageBarrier +}; + +commandBuffer.pipelineBarrier2(dependencyInfo); +---- + +image::/images/sync2_solution_granular.svg[Synchronization 2 Granular Control, width=600, align="center"] + +This is a massive win for human readability. When you look at that `imageBarrier` block, you see a complete "handshake." You know exactly what work is finishing (`src`) and exactly what work is waiting (`dst`). There's no need to hunt through function arguments or global variables to find the other half of the dependency. + +== Granular Control with 64-bit Masks + +Another technical reason for the switch was simple math. The original Vulkan 1.0 flags were 32-bit bitmasks. As Vulkan evolved, we added Ray Tracing, Mesh Shading, Video Encoding, and more. We were literally running out of bits. + +With `vk::PipelineStageFlagBits2`, we've moved to 64-bit masks. This gives us the headroom to target specific hardware units with surgical precision. + +=== The Power of "None" +In legacy Vulkan, if you didn't need a memory dependency (just an execution one), you often had to pass `0` or use confusing flags like `BOTTOM_OF_PIPE`. In Sync 2, we have an explicit `vk::PipelineStageFlagBits2::eNone` and `vk::AccessFlagBits2::eNone`. + +If you're doing a layout transition that doesn't require a memory flush (very rare, but possible), or if you just want to be absolutely clear that a certain barrier has no effect on a specific stage, `eNone` is your best friend. It makes the code self-documenting. 
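To make the 64-bit point concrete, here is a small standalone C{pp} sketch. It deliberately does *not* use the Vulkan headers; the enum and its bit positions are hypothetical stand-ins for `vk::PipelineStageFlagBits2`, not the spec's real values. It demonstrates the two ideas above: bits past position 31 simply cannot exist in a 32-bit mask, and an explicit `eNone` composes as a harmless identity instead of an abused `TOP_OF_PIPE`:

[,c++]
----
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for vk::PipelineStageFlagBits2 (the real spec
// values differ); the point is the 64-bit storage, not the numbers.
enum class Stage2 : uint64_t {
    eNone           = 0,           // explicit "no stages"
    eFragmentShader = 1ull << 7,
    eComputeShader  = 1ull << 11,
    eRayTracing     = 1ull << 40,  // impossible to represent in 32 bits
};

constexpr uint64_t operator|(Stage2 a, Stage2 b) {
    return static_cast<uint64_t>(a) | static_cast<uint64_t>(b);
}

int main() {
    // Bit 40 overflows any 32-bit mask: this is why Vulkan 1.0 ran out of room.
    assert(static_cast<uint64_t>(Stage2::eRayTracing) > 0xFFFFFFFFull);

    // eNone is the identity under |, so "no dependency" is stated explicitly
    // instead of being smuggled in through TOP_OF_PIPE / BOTTOM_OF_PIPE.
    assert((Stage2::eNone | Stage2::eComputeShader)
        == static_cast<uint64_t>(Stage2::eComputeShader));
    return 0;
}
----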
+ +== Sync 2 as a Mental Model + +Think of Synchronization 2 not as a new API, but as a better way to talk to the GPU. In the old system, you were shouting broad commands at the hardware: "EVERYONE STOP UNTIL THE RENDERING IS DONE!" + +In Synchronization 2, you're having a more nuanced conversation: "Hey, Color Attachment Output, once you're done writing this specific image, let the Fragment Shader know it's safe to start reading it." + +This "human-to-human" level of clarity is why we've built our entire engine around these structures. It reduces the cognitive load on you, the developer, and it gives the driver the exact information it needs to keep the GPU's "pipeline" full of work. + +== Putting it Together in the Engine + +In a real-world engine, synchronization isn't just about single transitions; it's about orchestrating the entire flow of data between passes. A perfect example of the Synchronization 2 "win" can be found in our `Renderer` class, specifically during a complex operation like a reflection pass. + +When rendering reflections, we often need to transition multiple resources at once—for example, a color buffer and a depth buffer. In the legacy API, these would be forced into a "log jam" where both would have to wait for the union of all stages. With Synchronization 2, we can batch them while maintaining their unique requirements. 
+ +[,c++] +---- +void Renderer::renderReflectionPass(vk::raii::CommandBuffer& cmd) { + // Transition reflection color to COLOR_ATTACHMENT_OPTIMAL + vk::ImageMemoryBarrier2 toColor{ + .srcStageMask = vk::PipelineStageFlagBits2::eColorAttachmentOutput, + .srcAccessMask = vk::AccessFlagBits2::eColorAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eColorAttachmentOutput, + .dstAccessMask = vk::AccessFlagBits2::eColorAttachmentWrite, + .oldLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .newLayout = vk::ImageLayout::eColorAttachmentOptimal, + .image = *reflectionColor, + .subresourceRange = { vk::ImageAspectFlagBits::eColor, 0, 1, 0, 1 } + }; + + // Transition reflection depth to DEPTH_ATTACHMENT_OPTIMAL + vk::ImageMemoryBarrier2 toDepth{ + .srcStageMask = vk::PipelineStageFlagBits2::eEarlyFragmentTests, + .srcAccessMask = vk::AccessFlagBits2::eDepthStencilAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eEarlyFragmentTests, + .dstAccessMask = vk::AccessFlagBits2::eDepthStencilAttachmentWrite, + .oldLayout = vk::ImageLayout::eUndefined, + .newLayout = vk::ImageLayout::eDepthAttachmentOptimal, + .image = *reflectionDepth, + .subresourceRange = { vk::ImageAspectFlagBits::eDepth, 0, 1, 0, 1 } + }; + + // The Win: Batching unique intents into a single call + std::array barriers{ toColor, toDepth }; + vk::DependencyInfo dependencyInfo{ + .imageMemoryBarrierCount = static_cast<uint32_t>(barriers.size()), + .pImageMemoryBarriers = barriers.data() + }; + + // One call, but the driver knows 'toColor' only cares about + // ColorAttachmentOutput, while 'toDepth' only cares about EarlyFragmentTests. + cmd.pipelineBarrier2(dependencyInfo); + + // Now we can safely begin our reflection rendering + // ... +} +---- + +To see the "win" here, let's look at the legacy alternative for that same `renderReflectionPass` operation.
You would have been forced to combine your stage masks at the function level: + +[,c++] +---- +// Legacy Vulkan 1.0 equivalent - The "Log Jam" +vk::ImageMemoryBarrier legacyToColor{ + .srcAccessMask = vk::AccessFlagBits::eColorAttachmentWrite, + .dstAccessMask = vk::AccessFlagBits::eColorAttachmentWrite, + .oldLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .newLayout = vk::ImageLayout::eColorAttachmentOptimal, + .image = *reflectionColor, + .subresourceRange = { vk::ImageAspectFlagBits::eColor, 0, 1, 0, 1 } +}; + +vk::ImageMemoryBarrier legacyToDepth{ + .srcAccessMask = vk::AccessFlagBits::eDepthStencilAttachmentWrite, + .dstAccessMask = vk::AccessFlagBits::eDepthStencilAttachmentWrite, + .oldLayout = vk::ImageLayout::eUndefined, + .newLayout = vk::ImageLayout::eDepthAttachmentOptimal, + .image = *reflectionDepth, + .subresourceRange = { vk::ImageAspectFlagBits::eDepth, 0, 1, 0, 1 } +}; + +std::array legacyBarriers{ legacyToColor, legacyToDepth }; + +// NOTICE: The stage masks are passed to the function, not the barriers! +cmd.pipelineBarrier( + vk::PipelineStageFlagBits::eColorAttachmentOutput | vk::PipelineStageFlagBits::eEarlyFragmentTests, // srcStageMask (Union) + vk::PipelineStageFlagBits::eColorAttachmentOutput | vk::PipelineStageFlagBits::eEarlyFragmentTests, // dstStageMask (Union) + {}, // dependencyFlags + nullptr, nullptr, legacyBarriers +); +---- + +In this legacy version, you are forced to pass a single `srcStageMask` that is the union of `eColorAttachmentOutput` and `eEarlyFragmentTests`. This means the GPU would have to wait for *both* the color writes and the depth tests of all previous work to finish before it could even *start* transitioning either image. + +With Synchronization 2, if the depth tests finish early, the driver can begin the `toDepth` transition immediately, even if the color hardware is still busy. This keeps the GPU's "log jam" from forming, allowing different parts of the hardware to work at their own pace. 
+ +This code typically lives inside your frame recording logic, often in a dedicated pass function like `renderReflectionPass`, just before calling `beginRendering`. By placing the synchronization logic right where the resource is needed, and grouping related transitions into a single `vk::DependencyInfo`, you create a "Chain of Intent" that is both easy for you to read and optimal for the hardware to execute. + +== Implementation: Modernizing Simple Engine + +While `Simple Engine`'s renderer has been largely modernized to use the `pipelineBarrier2` calls we've discussed, the codebase still contains "legacy islands" that we are in the process of refactoring. A prime example is the `PhysicsSystem` (`physics_system.cpp`), which still uses the old-style `pipelineBarrier` for synchronizing its compute dispatches. + +If you look into `PhysicsSystem::SimulatePhysicsOnGPU`, you'll see transitions that look like this: + +[,c++] +---- +// Legacy synchronization in PhysicsSystem +vulkanResources.commandBuffer.pipelineBarrier( + vk::PipelineStageFlagBits::eComputeShader, + vk::PipelineStageFlagBits::eComputeShader, + {}, + vk::MemoryBarrier(vk::AccessFlagBits::eShaderWrite, vk::AccessFlagBits::eShaderRead), + nullptr, nullptr +); +---- + +In the upcoming "Synchronization Upgrade" branch of `Simple Engine`, we will replace these with the cleaner `vk::DependencyInfo` and `vk::BufferMemoryBarrier2`. This will allow us to move away from global memory barriers and target the specific physics buffers, reducing the performance penalty on architectures with complex cache hierarchies. + +In the next section, we'll dive deeper into how to pick the right stages and flags to squeeze every last drop of performance out of the hardware. + +== Navigation + +Previous: xref:Synchronization/Anatomy_of_a_Dependency/02_execution_vs_memory.adoc[Execution vs. 
Memory Dependencies] | Next: xref:Synchronization/Anatomy_of_a_Dependency/04_refined_pipeline_stages.adoc[Refined Pipeline Stages] diff --git a/en/Synchronization/Anatomy_of_a_Dependency/04_refined_pipeline_stages.adoc b/en/Synchronization/Anatomy_of_a_Dependency/04_refined_pipeline_stages.adoc new file mode 100644 index 00000000..96cde598 --- /dev/null +++ b/en/Synchronization/Anatomy_of_a_Dependency/04_refined_pipeline_stages.adoc @@ -0,0 +1,59 @@ +:pp: {plus}{plus} += Refined Pipeline Stages: Precision is Performance + +== Introduction + +In the previous sections, we saw how Synchronization 2 unifies the API. But the real performance gains come from how you use it. Mastering `vk::PipelineStageFlagBits2` and `vk::AccessFlagBits2` is about precision. In legacy Vulkan, many developers fell into the trap of using `eAllCommands` (or worse, `eTopOfPipe` and `eBottomOfPipe`) as a catch-all solution. While this "works" in the sense that it prevents data corruption, it’s the digital equivalent of stopping every car in the city just so one pedestrian can cross the street. + +=== The Pipeline Bubble + +When you use an overly broad stage mask, you create what’s known as a **Pipeline Bubble**. Modern GPUs are designed to keep as many specialized hardware units—the rasterizers, the compute cores, the fixed-function blit engines—busy as possible. If you tell the GPU to wait at `eAllCommands`, you are essentially draining the entire pipeline. The GPU must wait until every previous operation is completely finished before it can start even the smallest part of the next operation. + +image::/images/vulkan_pipeline_block_diagram.png[Vulkan Pipeline Block Diagram, width=800, alt="A block diagram of the Vulkan graphics pipeline showing its various stages"] + +With Synchronization 2, we can be far more surgical. 
If you're only interested in ensuring that a compute shader has finished writing to a storage buffer before a fragment shader reads it, you can target `eComputeShader` and `eFragmentShader` specifically. This allows other parts of the GPU, like the geometry engine or the rasterizer, to keep working on independent tasks. + +=== Choosing the Right Stage + +Picking the right stage mask requires a solid understanding of where your data is coming from and where it's going. Here are a few common patterns we use in our engine: + +* **Render to Texture**: If you're transitioning a color attachment so it can be sampled in a later pass, your source stage should be `eColorAttachmentOutput`. +* **Compute Post-Processing**: When a compute shader finishes a pass that will be used by the fragment shader, use `eComputeShader` as the source and `eFragmentShader` as the destination. +* **Transfer to Graphics**: When you've finished uploading a buffer or image using a transfer queue, the source stage is `eTransfer`. + +=== The Power of Access Flags + +Stage flags tell the GPU *when* to wait, but **Access Flags** tell it *why*. They control the cache flushes and invalidations we discussed in the "Execution vs. Memory" section. + +Pairing a stage with the correct access flag is vital. For example, if you're reading a storage buffer in a compute shader, you need `eShaderRead` or `eShaderStorageRead`. If you're writing to it, you need `eShaderWrite` or `eShaderStorageWrite`. Being specific here allows the hardware to perform only the necessary cache operations, which can significantly reduce the overhead of the barrier itself. + +=== Conclusion + +As we move forward into the more complex parts of this series—like asynchronous compute and asset streaming—keep this "precision-first" mindset. Every bit you set in a barrier is a hint to the hardware. The more accurate your hints, the smoother your frame rates will be. 
+ +== Simple Engine: Targeting the Right Units + +In `Simple Engine`, we apply this precision-first approach in our `Renderer::Render` loop. For example, when transitioning our depth buffer from a "Depth-Only" pass (like our shadow map generation or depth pre-pass) to a "Depth-Test" pass (like our main opaque pass), we use: + +[,c++] +---- +// Depth transition in Renderer::Render +vk::ImageMemoryBarrier2 depthToRead2{ + .srcStageMask = vk::PipelineStageFlagBits2::eLateFragmentTests, + .srcAccessMask = vk::AccessFlagBits2::eDepthStencilAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eEarlyFragmentTests, + .dstAccessMask = vk::AccessFlagBits2::eDepthStencilAttachmentRead, + .oldLayout = vk::ImageLayout::eDepthAttachmentOptimal, + .newLayout = vk::ImageLayout::eDepthAttachmentOptimal, + // ... + .image = *depthImage, +}; +---- + +By specifying `eLateFragmentTests` and `eEarlyFragmentTests`, we tell the GPU that it only needs to wait for the fixed-function depth units to finish writing before it can start reading for the next pass. The vertex shaders for the next pass can actually start running and even begin processing their geometry while the previous pass's depth writes are still being finalized. This overlap is what prevents the "Pipeline Bubble" and keeps our frame rates high even in complex scenes. + +Next, we'll take these foundational concepts and apply them to the most common synchronization task in Vulkan: the image layout transition. 
+ +== Navigation + +Previous: xref:Synchronization/Anatomy_of_a_Dependency/03_sync2_advantage.adoc[The Synchronization 2 Advantage] | Next: xref:Synchronization/Pipeline_Barriers_Transitions/01_introduction.adoc[Pipeline Barriers and Transitions - Introduction] diff --git a/en/Synchronization/Async_Compute_Overlap/01_introduction.adoc b/en/Synchronization/Async_Compute_Overlap/01_introduction.adoc new file mode 100644 index 00000000..6808d50f --- /dev/null +++ b/en/Synchronization/Async_Compute_Overlap/01_introduction.adoc @@ -0,0 +1,28 @@ +:pp: {plus}{plus} += Asynchronous Compute & Execution Overlap: Parallelizing the GPU + +== Introduction + +In many rendering architectures, work is submitted as a linear sequence of events. We draw the shadows, then the geometry, then we run a compute-based post-processing pass. This "serial" approach is easy to understand, but it often leaves significant portions of the GPU hardware idle. Modern GPUs are composed of multiple independent units—graphics pipelines, compute units, and transfer engines—that can, and should, work simultaneously. + +**Asynchronous Compute** is the practice of running compute workloads (like physics, occlusion culling, or post-processing) on a dedicated compute queue while the main graphics queue is busy with its own work. When done correctly, this can lead to massive performance gains by effectively filling the "holes" in the GPU's execution timeline. + +== The "Bubble" Problem + +The primary enemy of high performance is the **Pipeline Stall**, often called a "bubble." This happens when one part of the GPU has finished its work but cannot start the next task because it's waiting for a dependency that hasn't been satisfied. If your barriers are too conservative—for example, if you tell the GPU to wait for "All Commands" to finish before starting a compute pass—you are essentially forcing the hardware into a serial mode, even if the compute work could have started much earlier. 
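A back-of-the-envelope model makes the cost of a bubble concrete. The millisecond figures below are invented for illustration (real numbers come from a profiler), but the arithmetic is exactly what overlap buys you: serial execution adds the two workloads together, while overlapped execution costs only the longer of the two:

[,c++]
----
#include <algorithm>
#include <cassert>

// Toy frame-time model. An over-broad barrier serializes the two queues,
// so their costs add; true overlap costs only the longer of the two.
constexpr double serialMs(double graphicsMs, double computeMs)  { return graphicsMs + computeMs; }
constexpr double overlapMs(double graphicsMs, double computeMs) { return std::max(graphicsMs, computeMs); }

int main() {
    // Hypothetical: a 4 ms shadow/geometry pass alongside 3 ms of async compute.
    assert(serialMs(4.0, 3.0) == 7.0);   // the bubble: compute waits for graphics
    assert(overlapMs(4.0, 3.0) == 4.0);  // overlapped: the compute work is "free"
    return 0;
}
----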
+ +== Architecting for Overlap + +To achieve true execution overlap, we need to move beyond simple "top-of-pipe" to "bottom-of-pipe" dependencies. We need to architect our `vk::DependencyInfo` and our **Timeline Semaphores** to express the exact moment data is ready. + +In this chapter, we will explore: + +1. **Maximizing Throughput**: How to identify workloads that are good candidates for overlap and how to structure your submissions to keep the GPU occupancy as high as possible. +2. **Async Post-Processing**: We'll implement a common real-world pattern: running compute-based bloom or tonemapping concurrent with the subsequent frame's shadow or geometry pass. +3. **Eliminating the Stalls**: We'll learn how to use hardware profilers and synchronization validation to find those elusive "bubbles" and refine our stage masks to eliminate them. + +By the end of this chapter, you'll be able to move your engine from a serial sequence to a parallel execution model, ensuring that no hardware unit is left sitting idle. + +== Navigation + +Previous: xref:Synchronization/Frame_in_Flight/03_resource_lifetimes.adoc[Resource Lifetimes] | Next: xref:Synchronization/Async_Compute_Overlap/02_maximizing_throughput.adoc[Maximizing Throughput] diff --git a/en/Synchronization/Async_Compute_Overlap/02_maximizing_throughput.adoc b/en/Synchronization/Async_Compute_Overlap/02_maximizing_throughput.adoc new file mode 100644 index 00000000..ca5254a2 --- /dev/null +++ b/en/Synchronization/Async_Compute_Overlap/02_maximizing_throughput.adoc @@ -0,0 +1,68 @@ +:pp: {plus}{plus} += Maximizing Throughput: Identifying Overlap Candidates + +== Finding the "Holes" in the GPU + +To maximize GPU throughput, we need to think beyond the simple linear execution of our command buffers. We want to find workloads that are **latency-bound** (spending a lot of time waiting for memory or fixed-function units) and pair them with workloads that are **compute-bound** (using the GPU's arithmetic units heavily). 
+ +A classic example of this is the **Shadow Pass**. While the GPU is busy doing vertex processing and rasterizing depth-only geometry for shadows, many of the compute and shading units are sitting idle. This is a perfect "hole" that can be filled with an asynchronous compute task, such as a physics simulation or an occlusion culling pass. + +== The Simple Engine Case Study: Physics and Audio Compute + +In our `Simple Engine`, we have two major systems that are prime candidates for asynchronous compute: the **Physics System** (`physics_system.cpp`) and the **Audio HRTF System** (`audio_system.cpp`). + +The `PhysicsSystem` performs complex simulation tasks like integration and collision detection using GPU-accelerated compute shaders (`shaders/physics.slang`). Similarly, the `AudioSystem` uses a compute shader (`shaders/hrtf.slang`) to process audio spatialization (Head-Related Transfer Function) on the GPU. + +Currently, both systems follow a **sequential, blocking** pattern. For example, the physics simulation is submitted to the GPU, and the CPU immediately stalls at a fence: + +[,c++] +---- +// Sequential Physics Dispatch (Current Engine) +physicsSystem->Update(deltaTime); // Internally calls SimulatePhysicsOnGPU + +// Inside PhysicsSystem::SimulatePhysicsOnGPU: +// 1. Submit compute commands to computeQueue +// 2. ReadbackGPUPhysicsData: blocks on a fence (CPU STALL!) +---- + +This CPU-side stall is a missed opportunity for overlap. To maximize throughput, we can re-architect this flow to be asynchronous by utilizing the engine's dedicated **Compute Queue** (obtained via `renderer->GetComputeQueue()`). By submitting these tasks early in the frame and only synchronizing when the data is strictly necessary, we can keep both the graphics and compute hardware units fully occupied. + +Beyond physics and audio, the engine's **Forward+ Rendering** path (see `ForwardPlus_Rendering.adoc`) is another prime candidate for overlap. 
The Forward+ compute pass (`forward_plus_cull.slang`) builds light lists for each tile on the screen. While this compute pass *does* require the depth buffer from the current frame to perform effective Z-culling, it doesn't need to wait for the entire geometry pass to finish. + +If we use **Timeline Semaphores**, we can tell the compute queue to wait only until the **Depth Pre-pass** is complete. While the graphics queue continues with the main **Opaque Geometry** rendering, the compute queue can simultaneously be culling lights for those same pixels, perfectly overlapping the compute-heavy light assignment with the raster-heavy geometry processing. + +== The Dependency Architecture + +The key to allowing these workloads to overlap is the way we architect our dependencies. If we use a single, global timeline for everything, we might inadvertently create a bottleneck. Instead, we should use multiple timeline semaphores—one for each major "engine" of the GPU—and have them coordinate only when strictly necessary. + +For example, your graphics queue could signal a "Geometry Complete" value on its own timeline. Your compute queue could wait for that value before starting its work, while simultaneously continuing with other tasks that don't depend on the geometry. + +[,c++] +---- +// Compute queue waiting for graphics geometry completion +auto computeWaitInfo = vk::SemaphoreSubmitInfo{ + .semaphore = *graphicsTimeline, + .value = geometryFrameValue, + .stageMask = vk::PipelineStageFlagBits2::eComputeShader +}; + +auto computeSubmit = vk::SubmitInfo2{ + .waitSemaphoreInfoCount = 1, + .pWaitSemaphoreInfos = &computeWaitInfo, + // ... +}; + +computeQueue.submit2(computeSubmit); +---- + +== Submitting for Overlap + +Simply having multiple queues isn't enough. You also need to submit your work in a way that the hardware can actually parallelize. On most modern hardware, this means submitting your "background" compute work to a dedicated asynchronous compute queue. 
This queue has its own command processor and can feed tasks to the compute units independently of the main graphics queue. + +By decoupling the submission of your compute work from your main graphics loop, you allow the driver to schedule them concurrently. If the graphics queue is momentarily stalled (e.g., waiting for the display or a cache flush), the compute queue can step in and keep the hardware busy. + +In the next section, we'll see a concrete implementation of this pattern: async post-processing. + +== Navigation + +Previous: xref:Synchronization/Async_Compute_Overlap/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Async_Compute_Overlap/03_async_post_processing.adoc[Async Post-Processing] diff --git a/en/Synchronization/Async_Compute_Overlap/03_async_post_processing.adoc b/en/Synchronization/Async_Compute_Overlap/03_async_post_processing.adoc new file mode 100644 index 00000000..b6d1b872 --- /dev/null +++ b/en/Synchronization/Async_Compute_Overlap/03_async_post_processing.adoc @@ -0,0 +1,70 @@ +:pp: {plus}{plus} += Async Post-Processing: Parallelizing Frame End and Start + +== A Real-World Use Case + +One of the most effective ways to use asynchronous compute is to run your post-processing pass (which is usually compute-bound) while the graphics unit is busy with the shadow or geometry pass of the *next* frame. This is a powerful pattern because post-processing typically happens at the very end of the frame, when the graphics units have finished their work. Instead of making the next frame wait for post-processing to complete, we move it to a dedicated compute queue. + +== Implementing the Overlap + +The implementation involves two different queues: a **Graphics Queue** for your geometry and shadow work, and an **Asynchronous Compute Queue** for your post-processing work (e.g., bloom, tonemapping, or temporal anti-aliasing). + +1. 
**Main Render Pass (Graphics Queue)**: Once your main rendering is complete, signal a "Graphics Complete" value on your graphics timeline. +2. **Post-Processing Pass (Compute Queue)**: The compute queue waits for the "Graphics Complete" value. It then performs the post-processing work and signals a "Post-Processing Complete" value on its own compute timeline. +3. **Frame Submission (CPU)**: The CPU can start recording and submitting the *next* frame to the graphics queue as soon as the previous frame's geometry is submitted. It doesn't need to wait for the post-processing to finish. + +== Synchronization 2 Example + +Using `vk::DependencyInfo` and `vk::SubmitInfo2`, this coordination is clear and precise. + +[,c++] +---- +// Compute Submit: wait for frame N graphics to finish, then run post-processing +auto computeWaitInfo = vk::SemaphoreSubmitInfo{ + .semaphore = *graphicsTimeline, + .value = frameN_graphics_finished, + .stageMask = vk::PipelineStageFlagBits2::eComputeShader +}; + +auto computeSignalInfo = vk::SemaphoreSubmitInfo{ + .semaphore = *computeTimeline, + .value = frameN_postprocessing_finished, + .stageMask = vk::PipelineStageFlagBits2::eComputeShader +}; + +auto computeSubmit = vk::SubmitInfo2{ + .waitSemaphoreInfoCount = 1, + .pWaitSemaphoreInfos = &computeWaitInfo, + .signalSemaphoreInfoCount = 1, + .pSignalSemaphoreInfos = &computeSignalInfo, + .commandBufferInfoCount = 1, + .pCommandBufferInfos = &postProcessCmdInfo +}; + +computeQueue.submit2(computeSubmit); +---- + +== Handling the Present + +The final step is the **Present** operation. On the CPU side, you must ensure that you don't present the final image until both the graphics and compute work for that frame are complete. This is usually handled by having the present operation wait for the "Post-Processing Complete" value. + +This pattern ensures that the graphics units are always fed with new work, while the compute units handle the final look of each frame. 
It's a key strategy for maximizing your engine's frame rate and keeping your GPU occupancy as high as possible. + +== Implementing in Simple Engine + +In `Simple Engine`, we will apply this async post-processing pattern to our **PBR Tonemapping** pass. Currently, the tonemapping is done at the end of `Renderer::Render` on the graphics queue. We will move this logic to a dedicated `postProcessComputePipeline` that runs on the `computeQueue`. + +To implement this: + +1. **Add Compute Pass**: We'll update our `Renderer` to record the tonemapping compute shader (`shaders/tonemap.slang`) into a separate compute command buffer. +2. **Wait for Graphics**: This compute command buffer will wait for the main rendering timeline to reach the `GeometryFinished` value. +3. **Signal for Present**: Once the tonemapping is complete, it will signal a `PostProcessFinished` value. +4. **Update Submit**: We'll update our final `vk::SubmitInfo2` for the frame so that the present operation waits for this `PostProcessFinished` value on the compute timeline. + +By moving tonemapping to the compute queue, we can start the **next frame's shadow pass** on the graphics queue while the current frame is still being tonemapped. This overlaps the raster-heavy shadow pass with the compute-heavy tonemapping pass, significantly improving our overall frame throughput. + +In the final section of this chapter, we'll look at how to identify and eliminate the "bubbles" that can occur if your synchronization is too conservative. 
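One practical wrinkle with this final step: the presentation engine cannot wait on a timeline semaphore directly, because `vkQueuePresentKHR` only accepts binary semaphores. The usual bridge is a small, command-buffer-free submit that waits on the timeline value and signals a binary semaphore for the present. A sketch, reusing the `computeTimeline` and `frameN_postprocessing_finished` names from above, plus a hypothetical binary `presentReady` semaphore and swapchain state:

[,c++]
----
// Bridge submit: no command buffers, just "when the timeline reaches the
// post-processing value, signal the binary presentReady semaphore".
auto bridgeWait = vk::SemaphoreSubmitInfo{
    .semaphore = *computeTimeline,
    .value = frameN_postprocessing_finished,
    .stageMask = vk::PipelineStageFlagBits2::eAllCommands
};

auto bridgeSignal = vk::SemaphoreSubmitInfo{
    .semaphore = *presentReady // binary semaphore; .value is ignored
};

auto bridgeSubmit = vk::SubmitInfo2{
    .waitSemaphoreInfoCount = 1,
    .pWaitSemaphoreInfos = &bridgeWait,
    .signalSemaphoreInfoCount = 1,
    .pSignalSemaphoreInfos = &bridgeSignal
};

computeQueue.submit2(bridgeSubmit);

// The present itself waits on the binary semaphore as usual.
auto presentInfo = vk::PresentInfoKHR{
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &(*presentReady),
    .swapchainCount = 1,
    .pSwapchains = &(*swapChain),
    .pImageIndices = &imageIndex
};
auto result = presentQueue.presentKHR(presentInfo);
----

Because the bridge submit contains no commands, its cost is negligible; it simply translates a timeline value into the binary signal the presentation engine requires.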
+ +== Navigation + +Previous: xref:Synchronization/Async_Compute_Overlap/02_maximizing_throughput.adoc[Maximizing Throughput] | Next: xref:Synchronization/Async_Compute_Overlap/04_bubble_problem.adoc[The Bubble Problem] diff --git a/en/Synchronization/Async_Compute_Overlap/04_bubble_problem.adoc b/en/Synchronization/Async_Compute_Overlap/04_bubble_problem.adoc new file mode 100644 index 00000000..85e52097 --- /dev/null +++ b/en/Synchronization/Async_Compute_Overlap/04_bubble_problem.adoc @@ -0,0 +1,30 @@ +:pp: {plus}{plus} += The Bubble Problem: Finding and Fixing Stalls + +== Identifying the Bubble + +A "bubble" in the GPU timeline is a period where some units are idle because they are waiting for a dependency to be satisfied. These can be hard to find just by looking at your code. You might *think* you've enabled overlap, but if your stage masks are too broad, the GPU might still be stalling. + +To find these, we use hardware profilers like **NVIDIA Nsight Graphics**, **AMD Radeon GPU Profiler**, or even the **LunarG Synchronization Validation** layer. In a profiler, a bubble looks like a gap in the timeline where the Graphics or Compute rows are empty while the other is busy. + +image::/images/vulkan_simplified_pipeline.svg[Vulkan Simplified Pipeline, width=400, alt="Simplified diagram of the Vulkan pipeline used to illustrate where bubbles can occur"] + +== Common Causes of Bubbles + +1. **Overly Conservative Stage Masks**: If you use `vk::PipelineStageFlagBits2::eAllCommands` for every barrier, the GPU will flush everything and wait for it to be idle before starting the next task. This is the most common cause of bubbles. Always use the most specific stage mask possible. +2. **Sequential Submission**: Even if you have two queues, if your CPU code waits for one to finish before submitting to the other, you've created a bubble on the CPU side. Use the **Wait-Before-Signal** pattern and multiple submission threads where appropriate. +3. 
**Dependency Chains**: A chain of small dependencies can sometimes be more expensive than one slightly broader barrier. If you have five compute passes that all wait for each other, each one introduces a small stall. Sometimes batching these into a single compute submission is better. + +== Fixing the Stall + +Once you've found a bubble, the fix is usually to refine your `vk::DependencyInfo`. + +- **Refine Stage Masks**: Check if you can move your `srcStageMask` later in the pipeline or your `dstStageMask` earlier. For example, can your compute work start as soon as `eVertexShader` is done, instead of waiting for `eFragmentShader`? +- **Use Memory Barriers Wisely**: Sometimes a global memory barrier is better than several image barriers if it allows more work to start sooner. +- **Increase Concurrency**: If your profiler shows that the compute units are under-utilized, can you move more work (like occlusion culling) from graphics to compute? + +By systematically finding and eliminating these bubbles, you move from a renderer that "just works" to one that is truly professional-grade. In the next chapter, we'll see how these same principles apply to one of the most common background tasks in modern games: asset streaming. + +== Navigation + +Previous: xref:Synchronization/Async_Compute_Overlap/03_async_post_processing.adoc[Async Post-Processing] | Next: xref:Synchronization/Transfer_Queues_Streaming/01_introduction.adoc[Transfer Queues & Asset Streaming Sync] diff --git a/en/Synchronization/Dynamic_Rendering_Sync/01_introduction.adoc b/en/Synchronization/Dynamic_Rendering_Sync/01_introduction.adoc new file mode 100644 index 00000000..97df7e1a --- /dev/null +++ b/en/Synchronization/Dynamic_Rendering_Sync/01_introduction.adoc @@ -0,0 +1,27 @@ +:pp: {plus}{plus} += Synchronization in Dynamic Rendering: A Pass-less World + +== Introduction + +For much of its early history, Vulkan synchronization was tied heavily to the concept of **Render Passes** and **Subpasses**. 
While this was designed to help mobile GPUs optimize on-tile memory usage, it was often confusing for developers and led to overly complex code. The "Subpass Dependency" was the primary way to synchronize data between different stages of a render pass, but it felt like a legacy structure that didn't always match the way modern engines work. + +With **Dynamic Rendering** (promoted to core in Vulkan 1.3), the API has moved away from these rigid structures. There are no more `VkRenderPass` or `VkFramebuffer` objects to manage. Instead, you simply call `beginRendering` and `endRendering`. This change has made Vulkan much easier to use, but it has also shifted the responsibility for synchronization entirely to us. + +== The Explicit Era + +In a world without subpass dependencies, every synchronization point must be explicit. If you want to use the output of one draw call as the input for another within the same rendering block, you can no longer rely on the render pass to handle the transition for you. You must use the **Synchronization 2** barriers we learned about in Chapter 3. + +This shift is actually a major advantage. It provides far more clarity and control. You know exactly where your transitions are happening because you recorded them yourself. It also makes it much easier to integrate with modern engine architectures where rendering passes are fluid and often determined at runtime. + +== What We'll Explore + +In this chapter, we'll dive into how synchronization works in this modern, pass-less landscape. We'll explore: + +1. **Subpass Replacement**: How to use explicit barriers to coordinate synchronization between rendering attachments, replacing the legacy `VkSubpassDependency` structures. +2. **Local Read Sync**: We'll look at one of the most exciting features of **Vulkan 1.4**: `VK_KHR_dynamic_rendering_local_read`.
This allows you to perform on-tile operations (like reading from a depth buffer in a fragment shader) with the same performance as legacy subpasses but with the simplicity of dynamic rendering. + +By the end of this chapter, you'll be able to confidently architect a high-performance renderer using the latest Vulkan features, ensuring that your synchronization is as streamlined and efficient as your rendering code. + +== Navigation + +Previous: xref:Synchronization/Transfer_Queues_Streaming/03_staging_sync.adoc[Staging Synchronization] | Next: xref:Synchronization/Dynamic_Rendering_Sync/02_subpass_replacement.adoc[Subpass Replacement] diff --git a/en/Synchronization/Dynamic_Rendering_Sync/02_subpass_replacement.adoc b/en/Synchronization/Dynamic_Rendering_Sync/02_subpass_replacement.adoc new file mode 100644 index 00000000..985c566e --- /dev/null +++ b/en/Synchronization/Dynamic_Rendering_Sync/02_subpass_replacement.adoc @@ -0,0 +1,63 @@ +:pp: {plus}{plus} += Subpass Replacement: Syncing Without the Pass + +== The End of the Subpass Dependency + +In the legacy Vulkan "Render Pass" system, you defined your dependencies upfront. If you wanted to use a G-Buffer pass and then a lighting pass, you'd create a subpass dependency that specified how data was transitioned and synchronized. This was often confusing because it separated the synchronization from the actual commands that were using it. + +With **Dynamic Rendering**, we replace these dependencies with **Synchronization 2** barriers that we record directly between our draw calls. This approach is far more intuitive. If your second draw call needs to read from the output of the first, you record a barrier in between. + +== A Concrete Example + +Imagine you're building a G-Buffer. You have a "Depth-Only" pass to pre-populate the depth buffer, followed by a "Main Pass" that reads from that depth buffer for early-Z testing. + +[,c++] +---- +// 1. Depth Pre-Pass +commandBuffer.beginRendering(depthPrePassInfo); +// ... 
record depth draw calls ... +commandBuffer.endRendering(); + +// 2. Synchronization Barrier +auto depthBarrier = vk::ImageMemoryBarrier2{ + // Depth writes and reads can occur in both the early and late fragment test stages + .srcStageMask = vk::PipelineStageFlagBits2::eEarlyFragmentTests | vk::PipelineStageFlagBits2::eLateFragmentTests, + .srcAccessMask = vk::AccessFlagBits2::eDepthStencilAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eEarlyFragmentTests | vk::PipelineStageFlagBits2::eLateFragmentTests, + .dstAccessMask = vk::AccessFlagBits2::eDepthStencilAttachmentRead, + .oldLayout = vk::ImageLayout::eDepthAttachmentOptimal, + .newLayout = vk::ImageLayout::eDepthAttachmentOptimal, // No layout change needed + .image = depthBuffer.image(), + .subresourceRange = subresourceRange +}; + +commandBuffer.pipelineBarrier2(vk::DependencyInfo{.imageMemoryBarrierCount = 1, .pImageMemoryBarriers = &depthBarrier}); + +// 3. Main Pass +commandBuffer.beginRendering(mainPassInfo); +// ... record main draw calls ... +commandBuffer.endRendering(); +---- + +== Why This is Better + +- **Clarity**: You can see exactly what is being synchronized and why, right there in your command buffer. +- **Flexibility**: You can decide on the synchronization at runtime, making it much easier to build a flexible rendering graph. +- **Modernity**: It matches the way other modern APIs, like DirectX 12, handle synchronization, making your engine code more portable. + +By using explicit barriers, you move away from the "black box" of the legacy render pass system and toward a clear, surgical synchronization architecture. In the next section, we'll see how **Vulkan 1.4** takes this even further by allowing for efficient on-tile read operations. + +== Simple Engine: Dynamic Rendering Sync + +In `Simple Engine`, we use this explicit synchronization between our **Opaque Pre-Pass** and our **Main Pass**. Because we don't have a traditional render pass to handle these transitions, we record our own `vk::ImageMemoryBarrier2` to ensure the depth buffer is properly flushed and invalidated. + +Specifically, in `Renderer::Render`, you'll find the following sequence: + +1.
**Depth Pre-Pass**: We call `commandBuffer.beginRendering` for the depth pre-pass. +2. **Barrier**: After `endRendering`, we record a `depthToRead2` barrier. This barrier synchronizes the `eLateFragmentTests` (the depth writes) with the `eEarlyFragmentTests` (the depth reads) of the next pass. +3. **Main Opaque Pass**: We then call `beginRendering` again for our main opaque color pass, which now has safe access to the pre-filled depth buffer. + +This explicit approach is what allowed us to easily add **Forward+ Lighting** to `Simple Engine`. Since we already had the depth buffer synchronized, adding the light culling compute pass between the pre-pass and the main pass was a straightforward matter of adding one more barrier, without having to re-architect a complex legacy render pass. + +== Navigation + +Previous: xref:Synchronization/Dynamic_Rendering_Sync/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Dynamic_Rendering_Sync/03_local_read_sync.adoc[Local Read Sync] diff --git a/en/Synchronization/Dynamic_Rendering_Sync/03_local_read_sync.adoc b/en/Synchronization/Dynamic_Rendering_Sync/03_local_read_sync.adoc new file mode 100644 index 00000000..5a7bbf28 --- /dev/null +++ b/en/Synchronization/Dynamic_Rendering_Sync/03_local_read_sync.adoc @@ -0,0 +1,67 @@ +:pp: {plus}{plus} += Local Read Sync: On-Tile Efficiency with Dynamic Rendering + +== The Best of Both Worlds + +In the legacy render pass system, we used subpasses to perform efficient on-tile read operations. This allowed the GPU to read from a color or depth attachment directly from its on-chip memory (the tile cache), avoiding expensive trips to main memory. This was a critical optimization for mobile and tiled-rendering GPUs. + +With the introduction of **Vulkan 1.4**, this same efficiency is now available in **Dynamic Rendering** through the `VK_KHR_dynamic_rendering_local_read` feature. This gives us the simplicity of a "pass-less" world with the performance of a subpass-based world. 
+ +== Implementing the Local Read + +The implementation involves two parts: a specialized barrier and a specific rendering setup. When you use a local read, you tell the GPU: "I want to read from an attachment, but I promise the read will only occur at the same pixel (x, y) location as the current write." This allows the hardware to keep the data on-tile. + +[,c++] +---- +// 1. Define the Dependency +auto localReadBarrier = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eColorAttachmentOutput, + .srcAccessMask = vk::AccessFlagBits2::eColorAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eInputAttachmentRead, + .oldLayout = vk::ImageLayout::eRenderingLocalRead, + .newLayout = vk::ImageLayout::eRenderingLocalRead, + .image = gBufferAttachment.image(), + .subresourceRange = subresourceRange +}; + +commandBuffer.pipelineBarrier2(vk::DependencyInfo{.imageMemoryBarrierCount = 1, .pImageMemoryBarriers = &localReadBarrier}); + +// 2. Perform the Rendering +// The input-attachment index mapping is not part of vk::RenderingInfo. Chain +// vk::RenderingInputAttachmentIndexInfo into vk::GraphicsPipelineCreateInfo when +// creating the pipeline, or set it dynamically inside the render pass instance: +auto localReadInfo = vk::RenderingInputAttachmentIndexInfo{ + .colorAttachmentCount = 1, + .pColorAttachmentInputIndices = &colorIndex +}; + +auto renderingInfo = vk::RenderingInfo{ + // ... render area, layer count, color attachments ... +}; + +commandBuffer.beginRendering(renderingInfo); +commandBuffer.setRenderingInputAttachmentIndices(localReadInfo); +// ... record your on-tile reads in your Slang shader ... +commandBuffer.endRendering(); +---- + +== Slang Integration + +In your Slang shader, you use the standard input attachment syntax. The Slang compiler will correctly target the SPIR-V instructions required for local read access. This ensures that your shader code remains clean and portable across different hardware. + +[,c++] +---- +// Slang snippet +[[vk::input_attachment_index(0)]] +SubpassInput gBufferInput; + +float4 main(float2 uv : TEXCOORD0) : SV_Target { + float4 data = gBufferInput.SubpassLoad(); + // ...
+} +---- + +By mastering local read synchronization, you can build a modern deferred renderer that is every bit as efficient as a legacy subpass-based renderer, but with the flexibility and clarity of modern Vulkan. In the next chapter, we'll see how these principles apply to the direct CPU-to-GPU data movements in **Host Image Copies**. + +== Navigation + +Previous: xref:Synchronization/Dynamic_Rendering_Sync/02_subpass_replacement.adoc[Subpass Replacement] | Next: xref:Synchronization/Host_Image_Copies_Memory_Sync/01_introduction.adoc[Host Image Copies & Memory Mapped Sync] diff --git a/en/Synchronization/Frame_in_Flight/01_introduction.adoc b/en/Synchronization/Frame_in_Flight/01_introduction.adoc new file mode 100644 index 00000000..9f49d999 --- /dev/null +++ b/en/Synchronization/Frame_in_Flight/01_introduction.adoc @@ -0,0 +1,30 @@ +:pp: {plus}{plus} += Frame-in-Flight Architecture: The Heartbeat of Your Engine + +== Introduction + +In the early days of graphics programming, we often thought of rendering as a linear sequence: the CPU records some commands, the GPU executes them, and then the CPU waits for the GPU to finish before starting the next frame. This is simple, but it’s also incredibly slow. While the GPU is rendering, the CPU is sitting idle, and while the CPU is recording the next frame, the GPU is waiting for work. + +To achieve high performance, we need to overlap these two processes. This is what we call **Frame-in-Flight Architecture**. We want to have multiple frames being processed simultaneously—for example, the CPU might be recording frame 3, while the GPU is still rendering frame 2, and the display is currently showing frame 1. + +== The Synchronization Challenge + +Managing multiple concurrent frames is arguably the most complex synchronization challenge in a Vulkan engine. You have to ensure that: + +1. **Data Integrity**: You don't overwrite a uniform buffer that the GPU is currently reading for a previous frame. +2. 
**Resource Lifetimes**: You don't destroy a texture or a command buffer until you are absolutely certain the GPU has finished using it. +3. **Forward Progress**: You don't submit so many frames that you introduce massive input lag or run out of memory. + +In the legacy Vulkan 1.0 world, this was handled using a complex array of fences and binary semaphores for each frame in flight. This led to "sync-heavy" code that was difficult to scale and easy to break. + +== The Timeline Advantage + +By using **Timeline Semaphores** as our foundation, we can drastically simplify this architecture. Instead of managing a separate fence for every frame, we use a single monotonic counter that represents the "completed frame index." + +In this chapter, we are going to rebuild the main engine loop to handle an arbitrary number of frames in flight. We'll explore how to use the timeline to coordinate between the CPU and GPU, and how to implement a robust resource management system that uses the timeline to determine exactly when it's safe to destroy or reuse our Vulkan objects. + +Let's begin by looking at how to rebuild the heartbeat of our engine: the main render loop. + +== Navigation + +Previous: xref:Synchronization/Timeline_Semaphores/04_wait_before_signal.adoc[Wait-Before-Signal Submission] | Next: xref:Synchronization/Frame_in_Flight/02_managing_concurrent_frames.adoc[Managing Concurrent Frames] diff --git a/en/Synchronization/Frame_in_Flight/02_managing_concurrent_frames.adoc b/en/Synchronization/Frame_in_Flight/02_managing_concurrent_frames.adoc new file mode 100644 index 00000000..44904fba --- /dev/null +++ b/en/Synchronization/Frame_in_Flight/02_managing_concurrent_frames.adoc @@ -0,0 +1,106 @@ +:pp: {plus}{plus} += Managing Concurrent Frames: Rebuilding the Main Loop + +== The Goal: Overlap Without Chaos + +The purpose of a frame-in-flight system is simple: keep the GPU busy while the CPU prepares future frames. 
The trick is doing that without corrupting data or introducing unbounded latency. With timeline semaphores, we can express this cleanly using a single, monotonic value that represents "frame N is complete." + +== A Practical Structure + +We'll use a ring of per-frame data (command buffers, descriptor sets, transient buffers). Each frame has an associated timeline value that marks when it's safe to reuse those resources. + +[,c++] +---- +struct FrameContext { + vk::raii::CommandPool pool{nullptr}; + vk::raii::CommandBuffer cmd{nullptr}; + vk::raii::Fence fence{nullptr}; // optional if you only use timeline waits on CPU + uint64_t retireValue = 0; // timeline value when this frame finishes +}; + +std::array<FrameContext, MaxFramesInFlight> frames; + +vk::raii::Semaphore timeline = createTimelineSemaphore(device, /*initial=*/0); +uint64_t nextSubmitValue = 1; // monotonically increasing +---- + +== The Main Loop With Timeline Gating + +On each frame, choose the next `FrameContext` in the ring. Before you touch any of its resources, make sure the global timeline has advanced beyond the value at which those resources were last used. + +[,c++] +---- +FrameContext& fc = frames[currentFrameIndex]; + +// Wait until GPU has reached the value when this frame's resources were last retired +if (fc.retireValue != 0) { + auto waitInfo = vk::SemaphoreWaitInfo{ + .semaphoreCount = 1, + .pSemaphores = &(*timeline), + .pValues = &fc.retireValue + }; + device.waitSemaphores(waitInfo, /*timeoutNs=*/UINT64_C(1'000'000'000)); // 1s timeout +} + +// Record & submit this frame +recordCommands(fc.cmd /*, ...
*/); + +// Define the value that represents "this frame complete" +const uint64_t frameComplete = nextSubmitValue++; + +vk::SemaphoreSubmitInfo signalInfo{ + .semaphore = *timeline, + .value = frameComplete, + .stageMask = vk::PipelineStageFlagBits2::eAllCommands +}; + +vk::CommandBufferSubmitInfo cmdInfo{ .commandBuffer = *fc.cmd }; + +vk::SubmitInfo2 submit{ + .commandBufferInfoCount = 1, + .pCommandBufferInfos = &cmdInfo, + .signalSemaphoreInfoCount= 1, + .pSignalSemaphoreInfos = &signalInfo +}; + +graphicsQueue.submit2(submit); + +// Tag this frame's resources with the value at which they're safe to reuse +fc.retireValue = frameComplete; +---- + +== CPU Throttle Without Fences + +To limit latency (e.g., only 2–3 frames in flight), wait for the value that corresponds to "the oldest in-flight frame has finished" before starting a new one. No per-frame fences necessary. + +[,c++] +---- +const uint64_t minAllowedValue = frameCompleteValueFor(currentFrameIndex - (MaxFramesInFlight - 1)); +if (minAllowedValue) { + auto waitInfo = vk::SemaphoreWaitInfo{ + .semaphoreCount = 1, + .pSemaphores = &(*timeline), + .pValues = &minAllowedValue + }; + device.waitSemaphores(waitInfo, UINT64_MAX); +} +---- + +This approach centralizes flow control around a single, debuggable counter. In the next section, we'll use the same counter to make precise, low-overhead decisions about resource destruction and reuse. + +== How to implement this in Simple Engine + +To implement this in `Simple Engine`, we will refactor the `Renderer::Render` method. Currently, it relies on `inFlightFences[currentFrame]` to stall the CPU. We will replace this with a single `Renderer::frameTimeline` semaphore. + +The new logic will look like this: + +1. **Calculate Retire Value**: Instead of `waitForFences`, we will calculate the `retireValue` for the current frame slot. This is simply the timeline value assigned to this slot the last time it was submitted (e.g., `frameTimelineValue[currentFrame]`). +2. 
**Wait on Timeline**: We'll call `device.waitSemaphores` to wait for that `retireValue`. This ensures the GPU is finished with the resources (command buffers, descriptor sets) associated with this frame slot. +3. **Submit with Signal**: When we call `queue.submit2`, we'll include a `vk::SemaphoreSubmitInfo` that signals our `frameTimeline` with a new, incremented value. +4. **Update Frame Slot**: We'll store this new signal value in `frameTimelineValue[currentFrame]` so we can wait for it the next time this slot comes around in the ring. + +This refactor will allow us to remove the `inFlightFences` array entirely, simplifying our resource management and making it easier to integrate other asynchronous systems into the same "Master Clock." + +== Navigation + +Previous: xref:Synchronization/Frame_in_Flight/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Frame_in_Flight/03_resource_lifetimes.adoc[Resource Lifetimes] diff --git a/en/Synchronization/Frame_in_Flight/03_resource_lifetimes.adoc b/en/Synchronization/Frame_in_Flight/03_resource_lifetimes.adoc new file mode 100644 index 00000000..3993accc --- /dev/null +++ b/en/Synchronization/Frame_in_Flight/03_resource_lifetimes.adoc @@ -0,0 +1,61 @@ +:pp: {plus}{plus} += Resource Lifetimes: Safe Reuse Without deviceWaitIdle() + +== Tagging and Reclamation + +One of the biggest challenges in Vulkan is knowing when it's safe to reuse or destroy a resource. With timeline semaphores, we treat destruction and reuse as a function of the global counter: a resource becomes eligible for reclamation when the counter exceeds the value at which it was last used. + +We maintain a small allocator or freelist for transient resources (command buffers, staging buffers, descriptor sets). Each allocation is tagged with a `retireValue`. 
+ +[,c++] +---- +struct TrackedResource { + ResourceHandle handle{}; // your wrapper around vk objects + uint64_t retireValue = 0; // timeline value when last submitted use completes +}; + +void destroyWhenSafe(TrackedResource res) { + deferredDeletes.push_back(res); +} + +void gc(vk::raii::Device const& device, vk::raii::Semaphore const& timeline) { + const uint64_t now = device.getSemaphoreCounterValue(*timeline); + auto it = std::remove_if(deferredDeletes.begin(), deferredDeletes.end(), [&](TrackedResource const& r){ + if (now >= r.retireValue) { destroy(r.handle); return true; } + return false; + }); + deferredDeletes.erase(it, deferredDeletes.end()); +} +---- + +== Integrating With Submissions + +Whenever you submit work that references a resource, tag it with the same value you signal on the timeline for that submission. + +[,c++] +---- +const uint64_t submissionValue = nextSubmitValue++; +submitCommands(cmd, /*signals*/ submissionValue); + +TrackedResource tex = createTexture(/*...*/); +tex.retireValue = submissionValue; // safe to reuse/destroy once reached +---- + +This pattern scales to complex graphs. You can attach `retireValue`s to entire resource sets created for a frame, or to individual allocations in sub-systems like upload managers. + +== Simple Engine: Garbage Collection + +In `Simple Engine`, we currently handle deferred resource destruction using a simple "frames since destroy" counter in our `pendingASDeletions` queue (found in `renderer_rendering.cpp`). This system waits for a fixed number of frames (`MAX_FRAMES_IN_FLIGHT + 1`) before deleting an acceleration structure. While safe, it is imprecise and can lead to resources staying in memory longer than necessary if the GPU is running fast. + +By moving to a timeline-based **Garbage Collection (GC)** system, we can be much more efficient. 
We will tag each `pendingASDeletion` (and any other transient resource, like our staging buffers) with the exact `frameTimelineValue` at which it was last used. Our `Renderer::ProcessDeferredDeletions` function will then query the current `frameTimeline` value. If the GPU has already reached or passed the tagged value, we can delete the resource immediately. This ensures that memory is reclaimed as soon as the GPU is done with it, regardless of the current frame rate or CPU/GPU load. + +== Pitfalls and Best Practices + +- Don't leak values: keep `nextSubmitValue` monotonic but bounded in meaning (e.g., encode frame and pass indices) to aid debugging. +- Batch deletions in `gc()` to avoid per-frame spikes. +- Avoid mixing fences and timeline for the same lifetime decision to prevent contradictory states. +- For external queues/devices (e.g., interop), convert their completion signals into your timeline domain where possible. + +== Navigation + +Previous: xref:Synchronization/Frame_in_Flight/02_managing_concurrent_frames.adoc[Managing Concurrent Frames] | Next: xref:Synchronization/Async_Compute_Overlap/01_introduction.adoc[Asynchronous Compute & Execution Overlap - Introduction] diff --git a/en/Synchronization/Host_Image_Copies_Memory_Sync/01_introduction.adoc b/en/Synchronization/Host_Image_Copies_Memory_Sync/01_introduction.adoc new file mode 100644 index 00000000..50e7a8b2 --- /dev/null +++ b/en/Synchronization/Host_Image_Copies_Memory_Sync/01_introduction.adoc @@ -0,0 +1,28 @@ +:pp: {plus}{plus} += Host Image Copies & Memory Mapped Sync: Direct Access + +== Introduction + +For most of Vulkan's history, if you wanted to move data into an image, you had to follow a very specific ritual: create a staging buffer, map it, write your data, record a `copyBufferToImage` command, and then submit that command buffer to a queue. 
While this is efficient for large, asynchronous uploads, it's a lot of overhead for simple, direct updates—like updating a small UI texture or a single mip level. + +With the arrival of **Vulkan 1.4**, we have a powerful new tool: **Host Image Copies** (`VK_EXT_host_image_copy`). This feature allows the CPU to copy data directly into a GPU-optimal image without recording or submitting a single command buffer. It's the most direct way to move data between CPU and GPU memory. + +== The Synchronization Challenge + +While Host Image Copies simplify the "how" of moving data, they don't exempt us from the "when." Because we are moving data directly on the host (CPU), we must be extremely careful to ensure that the GPU isn't trying to use that same image while we are writing to it. + +This introduces a different kind of synchronization. We aren't just syncing two GPU queues; we are syncing the **Host** with the **Device**. We need to ensure that our host writes are **visible** to the GPU, and that any previous GPU work is **available** before we start our host copy. + +== What We'll Explore + +In this chapter, we'll dive into the world of host-side synchronization. We'll explore: + +1. **Direct CPU-to-Image Access**: How to utilize the new Vulkan 1.4 Host Image Copy features to move data efficiently without command buffer overhead. +2. **Visibility and Flushes**: Mastering `vk::MemoryBarrier2` specifically for host-mapped memory. We'll learn how to ensure data coherency across the bus, ensuring that the bytes we write on the CPU are exactly what the GPU sees. +3. **Host-Device Handshakes**: Coordinating with fences and timeline semaphores to ensure that our host-side copies never collide with active GPU rendering. + +By the end of this chapter, you'll have a complete understanding of how to manage direct memory access in modern Vulkan, providing you with a faster, more flexible way to keep your assets updated. 
+ +== Navigation + +Previous: xref:Synchronization/Dynamic_Rendering_Sync/03_local_read_sync.adoc[Local Read Sync] | Next: xref:Synchronization/Host_Image_Copies_Memory_Sync/02_cpu_to_image_access.adoc[Direct CPU-to-Image Access] diff --git a/en/Synchronization/Host_Image_Copies_Memory_Sync/02_cpu_to_image_access.adoc b/en/Synchronization/Host_Image_Copies_Memory_Sync/02_cpu_to_image_access.adoc new file mode 100644 index 00000000..e994714a --- /dev/null +++ b/en/Synchronization/Host_Image_Copies_Memory_Sync/02_cpu_to_image_access.adoc @@ -0,0 +1,56 @@ +:pp: {plus}{plus} += Direct CPU-to-Image Access: Utilizing Host Image Copies + +== Moving Data Directly on the Host + +One of the most powerful features in **Vulkan 1.4** is the ability to move data directly on the host (CPU). This is handled by the `vk::Device::copyMemoryToImageEXT` function (or its equivalent in your RAII wrapper). This function takes raw CPU memory and copies it directly into a GPU-optimal image. + +This is a major productivity boost. You no longer have to manage staging buffers, command pools, or submission queues for simple, direct image updates. It's the most direct way to move data from a CPU-side resource, like a dynamic texture or a screenshot buffer, into a GPU-side image. + +== Implementing the Host Copy + +To use this feature, you first need to check for support for the `VK_EXT_host_image_copy` extension (now part of Vulkan 1.4). Once confirmed, you can perform a copy like this: + +[,c++] +---- +// 1. 
Prepare the copy info +auto copyInfo = vk::MemoryToImageCopyEXT{ + .pHostPointer = cpuData, + .memoryRowLength = width, + .memoryImageHeight = height, + .imageSubresource = { .aspectMask = vk::ImageAspectFlagBits::eColor, .mipLevel = 0, .baseArrayLayer = 0, .layerCount = 1 }, + .imageExtent = { .width = width, .height = height, .depth = 1 } +}; + +auto hostCopyInfo = vk::CopyMemoryToImageInfoEXT{ + .dstImage = *gpuImage, + .dstImageLayout = vk::ImageLayout::eGeneral, + .regionCount = 1, + .pRegions = &copyInfo +}; + +// 2. Perform the copy directly on the CPU +device.copyMemoryToImageEXT(hostCopyInfo); +---- + +== Use Cases and Advantages + +Host Image Copies are ideal for scenarios where you need to update a small amount of data quickly and don't want to wait for the GPU's command processor. + +- **Dynamic Textures**: Updating UI elements, font atlases, or small dynamic textures. +- **Screenshots**: Copying a GPU-optimal image back to the CPU for saving to disk (using the inverse `copyImageToMemoryEXT`). +- **Debugging**: Quickly inspecting the contents of a GPU-side resource from your CPU code. + +The primary advantage is **lower latency**. Because you aren't recording and submitting a command buffer, you eliminate all the driver and hardware overhead associated with submission. The CPU simply moves the bytes, and they are available on the GPU immediately. + +== Potential in Simple Engine + +In `Simple Engine`, we can use Host Image Copies to optimize our **Screenshot** system. Currently, taking a screenshot involves a multi-step process of recording a command buffer to copy the swapchain image to a staging buffer, submitting that command buffer, and then waiting for a fence on the CPU. This is slow and can cause a noticeable hitch in the frame rate. + +By moving to `copyImageToMemoryEXT` (available in Vulkan 1.4), we can perform the screenshot copy directly on the CPU. 
Once the frame is finished and the swapchain image is in the `ePresentSrcKHR` layout, we can call `copyImageToMemoryEXT` from our main thread. This moves the pixels directly from the GPU's memory into our CPU-side screenshot buffer, completely bypassing the command submission and fence-wait cycle. This results in a much smoother user experience and a cleaner, more direct implementation of the screenshot feature in our engine. + +In the next section, we'll see how to handle the synchronization required to ensure that these bytes are visible to the GPU and that we don't create any host-device hazards. + +== Navigation + +Previous: xref:Synchronization/Host_Image_Copies_Memory_Sync/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Host_Image_Copies_Memory_Sync/03_visibility_flushes.adoc[Visibility & Flushes] diff --git a/en/Synchronization/Host_Image_Copies_Memory_Sync/03_visibility_flushes.adoc b/en/Synchronization/Host_Image_Copies_Memory_Sync/03_visibility_flushes.adoc new file mode 100644 index 00000000..b28626f1 --- /dev/null +++ b/en/Synchronization/Host_Image_Copies_Memory_Sync/03_visibility_flushes.adoc @@ -0,0 +1,58 @@ +:pp: {plus}{plus} += Visibility & Flushes: Mastering Coherency + +== Understanding Host-Device Synchronization + +When you use **Host Image Copies**, you are essentially performing a direct memory copy between the CPU and GPU. This is highly efficient, but it introduces a new kind of synchronization challenge. We must ensure that the data we write on the host (CPU) is **visible** to the GPU before it starts using it, and that any previous GPU work is **available** before we start our host copy. + +In the world of **Synchronization 2**, we use `vk::MemoryBarrier2` to express this. We are no longer syncing two different GPU stages; we are syncing the host and the device. + +== The Host-to-Device Dependency + +The most common case is a **Host-to-Device** dependency. 
You write some data on the CPU and then want the GPU to read it in a shader. To do this, you use a barrier with `srcStageMask = vk::PipelineStageFlagBits2::eHost` and `dstStageMask` set to the shader stage where the image will be read. + +[,c++] +---- +auto hostToDeviceBarrier = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eHost, + .srcAccessMask = vk::AccessFlagBits2::eHostWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead, + .oldLayout = vk::ImageLayout::eGeneral, + .newLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .image = gpuImage.image(), + .subresourceRange = subresourceRange +}; + +commandBuffer.pipelineBarrier2(vk::DependencyInfo{.imageMemoryBarrierCount = 1, .pImageMemoryBarriers = &hostToDeviceBarrier}); +---- + +The `eHost` stage mask is a special flag that tells the GPU: "This data was updated on the CPU. Please ensure that all CPU writes are visible before the fragment shader starts its read." + +== The Device-to-Host Dependency + +The inverse case is a **Device-to-Host** dependency—for example, when you take a screenshot. You must ensure that the GPU has finished its rendering before the CPU starts the host copy. To do this, you record a barrier with the appropriate GPU stages as the source and `eHost` as the destination. 
+ +[,c++] +---- +auto deviceToHostBarrier = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eColorAttachmentOutput, + .srcAccessMask = vk::AccessFlagBits2::eColorAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eHost, + .dstAccessMask = vk::AccessFlagBits2::eHostRead, + .oldLayout = vk::ImageLayout::eColorAttachmentOptimal, + .newLayout = vk::ImageLayout::eGeneral, + .image = gpuImage.image(), + .subresourceRange = subresourceRange +}; + +commandBuffer.pipelineBarrier2(vk::DependencyInfo{.imageMemoryBarrierCount = 1, .pImageMemoryBarriers = &deviceToHostBarrier}); +---- + +In addition to the barrier, you must also use a **Fence** or a **Timeline Semaphore** on the CPU side to ensure that the command buffer containing the barrier has actually finished executing on the GPU before you attempt to call `device.copyImageToMemoryEXT`. + +By mastering these host-device handshakes, you can build a renderer that is both extremely fast and perfectly robust, giving you a powerful new tool for managing your engine's memory. In the final chapters of this series, we'll see how to debug and optimize these complex synchronization patterns using the latest Vulkan tools. 
+ +== Navigation + +Previous: xref:Synchronization/Host_Image_Copies_Memory_Sync/02_cpu_to_image_access.adoc[Direct CPU-to-Image Access] | Next: xref:Synchronization/Synchronization_Validation/01_introduction.adoc[Debugging with Synchronization Validation] diff --git a/en/Synchronization/Pipeline_Barriers_Transitions/01_introduction.adoc b/en/Synchronization/Pipeline_Barriers_Transitions/01_introduction.adoc new file mode 100644 index 00000000..9830a0cf --- /dev/null +++ b/en/Synchronization/Pipeline_Barriers_Transitions/01_introduction.adoc @@ -0,0 +1,18 @@ +:pp: {plus}{plus} += Pipeline Barriers and Layout Transitions: The Core Loop + +== Introduction + +If the previous chapter was about understanding the theoretical "handshake" between GPU stages, this chapter is where we get our hands dirty with the actual implementation. In the modern Vulkan 1.3+ landscape, the `vk::ImageMemoryBarrier2` is the most common tool in our synchronization toolbox. It's how we transition images between layouts, ensure data is visible across different hardware caches, and manage the complex state changes required for high-performance rendering. + +We often think of an image as just a grid of pixels, but to the GPU, it's a sophisticated resource that can be optimized for different types of access. A layout that's great for writing as a color attachment might be terrible for sampling in a fragment shader. Managing these transitions efficiently—and only when strictly necessary—is what separates a stuttering renderer from a smooth, 60 FPS experience. + +In this chapter, we're going to dive deep into the mechanics of these barriers. We'll start with the anatomy of the image barrier itself, specifically within the context of **Dynamic Rendering**, which has largely replaced the legacy "Render Pass" system. We'll then tackle one of the most misunderstood topics in Vulkan: **Queue Family Ownership**. 
This is the explicit "hand-off" required when you want to move a resource, like a texture or a buffer, between the Graphics, Compute, and Transfer queues of your engine. + +Finally, we'll look at the performance implications of our choices. Vulkan gives us the option of using **Global Memory Barriers** or more specific, resource-bound barriers. We'll learn how to determine which one to use and when, so we can give the driver exactly the right amount of information to keep the hardware running at full tilt without introducing unnecessary stalls. + +Let's begin by looking at the workhorse of modern Vulkan synchronization: the Image Memory Barrier 2. + +== Navigation + +Previous: xref:Synchronization/Anatomy_of_a_Dependency/04_refined_pipeline_stages.adoc[Refined Pipeline Stages] | Next: xref:Synchronization/Pipeline_Barriers_Transitions/02_image_barrier.adoc[The Image Barrier] diff --git a/en/Synchronization/Pipeline_Barriers_Transitions/02_image_barrier.adoc b/en/Synchronization/Pipeline_Barriers_Transitions/02_image_barrier.adoc new file mode 100644 index 00000000..ef4fa249 --- /dev/null +++ b/en/Synchronization/Pipeline_Barriers_Transitions/02_image_barrier.adoc @@ -0,0 +1,120 @@ +:pp: {plus}{plus} += The Image Barrier: Implementing vk::ImageMemoryBarrier2 + +== The Core Mechanism + +In the world of modern Vulkan, the image memory barrier is the definitive tool for managing how resources flow through the pipeline. While the theory of synchronization is about "execution" and "visibility," the image barrier adds a third, equally critical component: **Layout Transitions**. Unlike a buffer, which is just a linear strip of memory, an image has a layout that determines how its texels are organized. + +If we want to write to an image as a color attachment, the GPU hardware expects it to be in `VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL`. If we later want to sample that same image in a shader, it must be transitioned to `VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL`. 
This is not just a driver-side "flag"—on some hardware, this transition might trigger a physical reorganization of the data or a cache flush. + +image::/images/image_barrier_anatomy.svg[Anatomy of an Image Barrier] + +When we talk about "physical reorganization," we're referring to how different hardware units see the same bits. For instance, a Rasterizer might use a specialized tiled compression format (like Delta Color Compression) to save bandwidth. However, a Compute shader sampling that same image might not understand that compression. The layout transition ensures the data is "decompressed" or moved into a format that the next stage can consume. + +== The Anatomy of an Image Barrier + +When we record a pipeline barrier, we are essentially defining a "gate" that the GPU must pass through. Let’s look at how we construct this using the RAII-style Vulkan-Hpp wrappers we use in our engine: + +[,c++] +---- +auto imageBarrier = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eColorAttachmentOutput, + .srcAccessMask = vk::AccessFlagBits2::eColorAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead, + .oldLayout = vk::ImageLayout::eColorAttachmentOptimal, + .newLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .image = renderTarget.image(), + .subresourceRange = { + .aspectMask = vk::ImageAspectFlagBits::eColor, + .baseMipLevel = 0, + .levelCount = 1, + .baseArrayLayer = 0, + .layerCount = 1 + } +}; + +auto dependencyInfo = vk::DependencyInfo{ + .imageMemoryBarrierCount = 1, + .pImageMemoryBarriers = &imageBarrier +}; + +commandBuffer.pipelineBarrier2(dependencyInfo); +---- + +In this example, we're transitioning a color attachment so it can be sampled by a subsequent fragment shader. 
The `srcStageMask` tells the GPU "wait for the color attachment output stage of previous commands to finish," while the `srcAccessMask` specifies that we are specifically waiting for the memory *writes* from that stage to be complete. On the other side of the gate, the `dstStageMask` and `dstAccessMask` ensure that the fragment shader stage will wait to start its read operations until the layout transition and cache flushes are finished. + +== The Power of Layout Discard + +One of the most common performance optimizations in Vulkan is the use of `vk::ImageLayout::eUndefined` as the `oldLayout`. When we set the old layout to undefined, we are telling the driver: "I don't care about what was in this image before." + +This is incredibly powerful. If the driver knows the previous content is garbage, it can skip the expensive work of preserving data during a layout transition. For example, if you're about to clear an image and use it as a fresh color attachment, transitioning from `eUndefined` to `eColorAttachmentOptimal` is significantly faster than transitioning from `eShaderReadOnlyOptimal` (which might require a "resolve" or "decompression" of the previous frame's data). + +== Subresource Ranges and Aspect Masks + +Vulkan doesn't just let us synchronize an entire image; it gives us surgical control over specific parts of it via the `subresourceRange`. This is vital for complex effects: + +* **Mipmap Generation**: We can transition mip level 0 to `eTransferSrcOptimal` and level 1 to `eTransferDstOptimal` to perform a blit, then transition them back. +* **Aspect Masks**: For depth-stencil formats, we might only want to transition the `eDepth` aspect while leaving `eStencil` alone (or vice versa). +* **Layered Rendering**: In VR or cubemap rendering, we can transition individual array layers independently to allow different parts of the GPU to work on different views simultaneously. 
+ +== Synchronization in Dynamic Rendering + +One of the major shifts in modern Vulkan is the move toward **Dynamic Rendering** (introduced in Vulkan 1.3). In the old "Render Pass" system, transitions were often hidden within subpass dependencies or the render pass definition itself. This was confusing and led to over-synchronization. + +With dynamic rendering, the responsibility for transitions falls squarely on us. We perform our transitions *outside* the render pass instance: after one pass's `endRendering` and before the next pass's `beginRendering`, since pipeline barriers generally cannot be recorded inside a dynamic rendering scope. This might feel like more work, but it provides far more clarity. We know exactly where the transition is happening because we recorded it explicitly. It also makes it much easier to integrate with modern engine architectures where rendering passes are more fluid and less rigid than the legacy system. + +== Putting it Together in the Engine + +In a real-world engine, you rarely emit just one barrier. You batch them. Here is how our `Renderer` might handle a common "Post-Process" sequence where we transition both the scene color and the depth buffer (for depth-of-field) before the final UI pass: + +[,c++] +---- +std::array<vk::ImageMemoryBarrier2, 2> barriers; + +// Transition Scene Color from Attachment to Shader Read +barriers[0] = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eColorAttachmentOutput, + .srcAccessMask = vk::AccessFlagBits2::eColorAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead, + .oldLayout = vk::ImageLayout::eColorAttachmentOptimal, + .newLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .image = sceneColor.image(), + .subresourceRange = { + .aspectMask = vk::ImageAspectFlagBits::eColor, + .baseMipLevel = 0, + .levelCount = 1, + .baseArrayLayer = 0, + .layerCount = 1 + } +}; + +// Transition Depth from Attachment to Shader Read (Depth Aspect only!) 
+barriers[1] = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eLateFragmentTests, + .srcAccessMask = vk::AccessFlagBits2::eDepthStencilAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead, + .oldLayout = vk::ImageLayout::eDepthStencilAttachmentOptimal, + .newLayout = vk::ImageLayout::eDepthReadOnlyOptimal, + .image = depthBuffer.image(), + .subresourceRange = { + .aspectMask = vk::ImageAspectFlagBits::eDepth, + .baseMipLevel = 0, + .levelCount = 1, + .baseArrayLayer = 0, + .layerCount = 1 + } +}; + +commandBuffer.pipelineBarrier2(vk::DependencyInfo{ + .imageMemoryBarrierCount = static_cast<uint32_t>(barriers.size()), + .pImageMemoryBarriers = barriers.data() +}); +---- + +By batching these into a single `DependencyInfo`, the driver can optimize the state changes and cache flushes, ensuring the GPU spends more time drawing and less time waiting for barriers. + +== Simple Engine: The Unified Barrier + +In `Simple Engine`, we consolidate our image transitions to minimize driver overhead. If you look at `Renderer::Render` in `renderer_rendering.cpp`, you'll see how we handle the transition from the **Opaque Pass** to the **Post-Processing Pass**. We don't just transition the color buffer; we often transition the depth buffer and any auxiliary buffers (like our G-Buffer for Forward+ lighting) in a single `vk::DependencyInfo`. + +One specific trick we use in `Simple Engine` is the **Layout Tracking** system. Because our `Renderer` can switch between different rendering paths (like Rasterization vs. Ray Query), we keep track of the current layout of our main images (like `opaqueSceneColorImageLayouts`). When we begin a pass, we check the current layout and only emit a barrier if a transition is actually necessary. If the image is already in the correct layout, we skip the barrier entirely, saving precious GPU cycles. 
+ +== Navigation + +Previous: xref:Synchronization/Pipeline_Barriers_Transitions/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Pipeline_Barriers_Transitions/03_queue_family_ownership.adoc[Queue Family Ownership] diff --git a/en/Synchronization/Pipeline_Barriers_Transitions/03_queue_family_ownership.adoc b/en/Synchronization/Pipeline_Barriers_Transitions/03_queue_family_ownership.adoc new file mode 100644 index 00000000..67f739f3 --- /dev/null +++ b/en/Synchronization/Pipeline_Barriers_Transitions/03_queue_family_ownership.adoc @@ -0,0 +1,74 @@ +:pp: {plus}{plus} += Queue Family Ownership: The Handshake + +== Why We Transfer Ownership + +In many high-performance Vulkan engines, we don't just use a single "Graphics" queue for everything. We might use a dedicated **Transfer Queue** for background asset streaming or a **Compute Queue** for asynchronous post-processing. However, Vulkan resources (buffers and images) are generally "owned" by a specific queue family if they were created with `vk::SharingMode::eExclusive`. + +If you want to move an image from your Transfer queue (where you just uploaded it) to your Graphics queue (where you want to draw it), you must perform an explicit **Queue Family Ownership Transfer**. This is a two-step "handshake" that involves a release operation on the source queue and an acquire operation on the destination queue. + +== The Release and Acquire Handshake + +The transfer happens by recording a pipeline barrier on both queues. Crucially, both barriers must specify the source and destination queue family indices. + +=== 1. The Release Operation (Source Queue) + +On the queue that currently owns the resource, you record a barrier that "releases" it. The `srcQueueFamilyIndex` is your current queue, and the `dstQueueFamilyIndex` is the queue you are sending it to. 
+ +[,c++] +---- +auto releaseBarrier = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eAllTransfer, + .srcAccessMask = vk::AccessFlagBits2::eTransferWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eNone, // No stage on this queue + .dstAccessMask = vk::AccessFlagBits2::eNone, // No access on this queue + .oldLayout = vk::ImageLayout::eTransferDstOptimal, + .newLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .srcQueueFamilyIndex = transferQueueIndex, + .dstQueueFamilyIndex = graphicsQueueIndex, + .image = texture.image(), + .subresourceRange = subresourceRange +}; + +// Record on Transfer Command Buffer +transferCommandBuffer.pipelineBarrier2(vk::DependencyInfo{.imageMemoryBarrierCount = 1, .pImageMemoryBarriers = &releaseBarrier}); +---- + +=== 2. The Acquire Operation (Destination Queue) + +On the target queue, you record a barrier that "acquires" the resource. The indices remain the same, but now the `srcStageMask` and `srcAccessMask` are set to `eNone` because those stages happened on a different queue. + +[,c++] +---- +auto acquireBarrier = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eNone, + .srcAccessMask = vk::AccessFlagBits2::eNone, + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead, + .oldLayout = vk::ImageLayout::eTransferDstOptimal, + .newLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .srcQueueFamilyIndex = transferQueueIndex, + .dstQueueFamilyIndex = graphicsQueueIndex, + .image = texture.image(), + .subresourceRange = subresourceRange +}; + +// Record on Graphics Command Buffer +graphicsCommandBuffer.pipelineBarrier2(vk::DependencyInfo{.imageMemoryBarrierCount = 1, .pImageMemoryBarriers = &acquireBarrier}); +---- + +== Orchestration with Semaphores + +Recording the barriers is only half the battle. You also need to ensure that the Graphics queue doesn't try to acquire the resource before the Transfer queue has released it. 
This is typically handled with a **Semaphore**. The Transfer queue signals a semaphore upon completion of its command buffer, and the Graphics queue waits on that same semaphore before executing its own acquire barrier. + +This handshake is one of the more complex parts of Vulkan synchronization, but it's essential for building a multi-threaded, non-blocking engine architecture. In modern Vulkan, we prefer **Timeline Semaphores** for this orchestration, as they allow us to track this progress with a simple monotonic counter, which we'll cover in detail in the next chapter. + +== Simple Engine: Resource Handoff + +In `Simple Engine`, we avoid the complexity of ownership transfers where possible by using `vk::SharingMode::eConcurrent` when creating our major buffers and images. If the hardware supports it, this allows multiple queue families (like our `transferQueue` and `graphicsQueue`) to access the same memory concurrently without an explicit "Release/Acquire" barrier. + +However, even with `eConcurrent`, you still need to synchronize the *execution* of those queues! In `Simple Engine`, we use a dedicated **Transfer Semaphore** to ensure that our graphics queue doesn't start sampling a texture until the transfer queue has finished its work. This is handled during the `Renderer::ProcessPendingMeshUploads` call, ensuring that all background uploads are correctly "visible" to the graphics hardware before the next frame begins. + +== Navigation + +Previous: xref:Synchronization/Pipeline_Barriers_Transitions/02_image_barrier.adoc[The Image Barrier] | Next: xref:Synchronization/Pipeline_Barriers_Transitions/04_global_vs_local_barriers.adoc[Global vs. 
Local Barriers] diff --git a/en/Synchronization/Pipeline_Barriers_Transitions/04_global_vs_local_barriers.adoc b/en/Synchronization/Pipeline_Barriers_Transitions/04_global_vs_local_barriers.adoc new file mode 100644 index 00000000..a3b58599 --- /dev/null +++ b/en/Synchronization/Pipeline_Barriers_Transitions/04_global_vs_local_barriers.adoc @@ -0,0 +1,65 @@ +:pp: {plus}{plus} += Global vs. Local Barriers: Precision and Performance + +== The Dilemma of Choice + +Vulkan gives us two ways to synchronize memory: **Global Memory Barriers** and **Specific Resource Barriers** (Image and Buffer barriers). It's often tempting to just use a global barrier for everything—it's simpler to write, requires less bookkeeping, and covers all your bases. However, this convenience comes at a cost. + +A global barrier affects *all* memory accesses of the specified type across the entire GPU. If you only need to transition a single texture, but you use a global memory barrier, the GPU might end up flushing its entire L1 and L2 cache, potentially stalling other unrelated work that was running perfectly fine. + +== When to Use Global Barriers + +Global barriers are not "evil"; they are simply a broad tool. They are excellent for scenarios where you are about to perform a major state change that affects many resources simultaneously. + +For example, if you are moving from a G-Buffer pass to a complex lighting pass that will read from multiple textures and buffers, a single global barrier might be more efficient than recording ten individual image and buffer barriers. Consolidating into a single global barrier reduces the driver overhead of processing the `vk::DependencyInfo` and can sometimes lead to better hardware utilization if many resources are transitioning between similar stages. 
+ +[,c++] +---- +auto globalBarrier = vk::MemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eColorAttachmentOutput, + .srcAccessMask = vk::AccessFlagBits2::eColorAttachmentWrite, + .dstStageMask = vk::PipelineStageFlagBits2::eComputeShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead +}; + +commandBuffer.pipelineBarrier2(vk::DependencyInfo{.memoryBarrierCount = 1, .pMemoryBarriers = &globalBarrier}); +---- + +== When to Use Resource Barriers + +Resource-specific barriers (`vk::ImageMemoryBarrier2` and `vk::BufferMemoryBarrier2`) are your "surgical" tools. You should use them whenever the dependency is limited to a specific resource, especially if that resource is being transitioned between layouts. + +The primary advantage of an image barrier is that it allows the driver to perform layout-specific optimizations. A global memory barrier *cannot* transition an image layout. If you need to change an image from `eColorAttachmentOptimal` to `eShaderReadOnlyOptimal`, you *must* use an image memory barrier. + +== The Golden Rule: Batching + +Whether you choose global or local barriers, the most important rule for Vulkan synchronization performance is **Batching**. + +Avoid calling `pipelineBarrier2` multiple times in a row. Every call to `pipelineBarrier2` has a non-trivial overhead. Instead, collect all your barriers (global, image, and buffer) into a single `vk::DependencyInfo` and submit them in one go. + +[,c++] +---- +std::vector<vk::ImageMemoryBarrier2> imageBarriers = { /* ... */ }; +vk::MemoryBarrier2 globalBarrier = { /* ...
*/ }; + +auto dependencyInfo = vk::DependencyInfo{ + .memoryBarrierCount = 1, + .pMemoryBarriers = &globalBarrier, + .imageMemoryBarrierCount = static_cast<uint32_t>(imageBarriers.size()), + .pImageMemoryBarriers = imageBarriers.data() +}; + +commandBuffer.pipelineBarrier2(dependencyInfo); +---- + +By batching your barriers, you give the driver the opportunity to consolidate the cache flushes and stage stalls, ensuring that the GPU spends as little time as possible waiting and as much time as possible rendering. + +== Simple Engine: Optimization + +In `Simple Engine`, we primarily use **Image Memory Barriers** because most of our synchronization involves layout transitions (e.g., from `eColorAttachmentOptimal` to `eShaderReadOnlyOptimal`). However, we do use **Global Memory Barriers** in our `ComputeSystem` (e.g., in `physics_system.cpp`) when we need to ensure that all previous compute writes to any and all storage buffers are visible to subsequent shader stages. + +One area where `Simple Engine` could be further optimized is in the consolidation of these barriers. Currently, some of our systems emit their own barriers independently. In a future update, we plan to move toward a **Render Graph** architecture. This would allow the engine to collect all necessary barriers across all systems for an entire frame and batch them into a single, highly optimized `vkCmdPipelineBarrier2` call, further reducing driver overhead and improving GPU occupancy.
+ +== Navigation + +Previous: xref:Synchronization/Pipeline_Barriers_Transitions/03_queue_family_ownership.adoc[Queue Family Ownership] | Next: xref:Synchronization/Timeline_Semaphores/01_introduction.adoc[Timeline Semaphores: The Master Clock] diff --git a/en/Synchronization/Profiling_Optimization/01_introduction.adoc b/en/Synchronization/Profiling_Optimization/01_introduction.adoc new file mode 100644 index 00000000..760b8abf --- /dev/null +++ b/en/Synchronization/Profiling_Optimization/01_introduction.adoc @@ -0,0 +1,30 @@ +:pp: {plus}{plus} += Profiling, Batching, and Optimization: Squeezing the GPU + +== Introduction + +Congratulations! You've mastered the core mechanics of **Synchronization 2**, the monotonic world of **Timeline Semaphores**, and the complexities of **Asynchronous Compute** and **Asset Streaming**. You've built a renderer that is robust, modern, and validated. + +But in the world of high-performance graphics, "correct" is only the beginning. The final challenge is to make your synchronization as efficient as possible. Every barrier you record and every semaphore you signal has a cost—both in terms of driver overhead and potential hardware stalls. + +In this final chapter, we're going to move beyond the "how" and "why" of synchronization and focus on the "how fast." We'll explore the advanced techniques that professional engine developers use to squeeze every last drop of performance out of the GPU. + +== The Optimization Mindset + +Optimization in synchronization is a balancing act. On one hand, you want to be as specific as possible to avoid unnecessary stalls. On the other hand, you want to minimize the number of times you call into the driver. + +The key is to think in terms of **Batching** and **Visibility**. Instead of thinking about each resource in isolation, you should think about your frame as a whole. Where can you group dependencies? Where can you move barriers to allow more work to overlap? 
Where can you use hardware profiling tools to find the "invisible" bottlenecks that are holding your frame rate back? + +== What We'll Explore + +In this final chapter, we'll dive into the advanced world of Vulkan optimization. We'll explore: + +1. **Barrier Batching**: How to consolidate multiple global, image, and buffer barriers into a single `vkCmdPipelineBarrier2` call to reduce driver overhead. +2. **Visualizing Stalls**: We'll revisit the "bubble" problem from Chapter 6, but this time with a focus on using hardware profilers to identify and eliminate them at scale. +3. **Final Refinements**: We'll wrap up the series with a checklist of best practices and common pitfalls to ensure your engine remains high-performance as it grows. + +By the end of this chapter, you'll have the knowledge and the tools to take your renderer from "validated" to "optimized," ensuring that your synchronization code is as fast as it is correct. + +== Navigation + +Previous: xref:Synchronization/Synchronization_Validation/03_interpreting_vuids.adoc[Interpreting VUIDs] | Next: xref:Synchronization/Profiling_Optimization/02_barrier_batching.adoc[Barrier Batching] diff --git a/en/Synchronization/Profiling_Optimization/02_barrier_batching.adoc b/en/Synchronization/Profiling_Optimization/02_barrier_batching.adoc new file mode 100644 index 00000000..8243fb6b --- /dev/null +++ b/en/Synchronization/Profiling_Optimization/02_barrier_batching.adoc @@ -0,0 +1,52 @@ +:pp: {plus}{plus} += Barrier Batching: Consolidating Your Synchronization + +== The Cost of a Call + +Every time you call `commandBuffer.pipelineBarrier2`, you are making a trip from your CPU code into the Vulkan driver. The driver then has to parse your `vk::DependencyInfo`, validate your stage and access masks, and then record the actual hardware instructions into the command buffer. + +If you have ten different images to transition, and you record ten individual barriers, you are performing ten driver trips. 
This overhead can add up, especially in a complex frame with many passes. + +== The Solution: Batching + +**Barrier Batching** is the practice of collecting all your global, image, and buffer barriers and submitting them in a single `pipelineBarrier2` call. This is one of the easiest ways to reduce the CPU overhead of your synchronization code. + +The `vk::DependencyInfo` structure is specifically designed for this. It allows you to provide an array of barriers of each type. + +[,c++] +---- +std::vector<vk::ImageMemoryBarrier2> imageBarriers = { /* ... multiple image transitions ... */ }; +vk::MemoryBarrier2 globalBarrier = { /* ... a broad memory dependency ... */ }; + +auto dependencyInfo = vk::DependencyInfo{ + .memoryBarrierCount = 1, + .pMemoryBarriers = &globalBarrier, + .imageMemoryBarrierCount = static_cast<uint32_t>(imageBarriers.size()), + .pImageMemoryBarriers = imageBarriers.data() +}; + +// One call into the driver instead of many +commandBuffer.pipelineBarrier2(dependencyInfo); +---- + +== Hardware Benefits + +Batching is not just about reducing CPU overhead; it also provides significant benefits on the GPU. When you provide multiple barriers in a single call, the driver can consolidate the cache flushes and the pipeline stalls. + +Instead of stalling the pipeline and flushing caches five different times, the hardware can potentially do it all at once. This reduces the total time the GPU spends waiting and increases the time it spends rendering. + +== Implementation Strategy + +A good strategy for an engine is to have a "Barrier Manager" that collects barriers throughout a pass. When you reach a synchronization point—for example, at the end of a G-Buffer pass—the manager flushes all the collected barriers in a single batch. + +By thinking in terms of batches rather than individual barriers, you move toward a more "holistic" approach to synchronization, ensuring that your engine remains high-performance as you add more complexity to your renderer.
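To make the "Barrier Manager" strategy concrete, here is a minimal CPU-side sketch. It is illustrative only: the `BarrierManager` name and its interface are hypothetical, and plain structs stand in for `vk::ImageMemoryBarrier2` and `vk::MemoryBarrier2` so the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for vk::ImageMemoryBarrier2 / vk::MemoryBarrier2 so the sketch
// compiles on its own; real code would use the Vulkan-Hpp types directly.
struct ImageBarrier { std::uint64_t image; };
struct GlobalBarrier { std::uint32_t srcAccess, dstAccess; };

// Collects barriers recorded by independent systems during a pass and flushes
// them in one batch, mirroring a single pipelineBarrier2 call that carries one
// vk::DependencyInfo with all the arrays filled in.
class BarrierManager {
public:
    void addImageBarrier(ImageBarrier b) { imageBarriers_.push_back(b); }
    void addGlobalBarrier(GlobalBarrier b) { globalBarriers_.push_back(b); }

    // Returns the number of "driver calls" made: one per flush, no matter how
    // many barriers were collected, and zero when there is nothing to do.
    std::size_t flush() {
        if (imageBarriers_.empty() && globalBarriers_.empty()) return 0;
        // Real code: fill a vk::DependencyInfo from both vectors and record it
        // with exactly one commandBuffer.pipelineBarrier2(dependencyInfo) here.
        imageBarriers_.clear();
        globalBarriers_.clear();
        return 1;
    }

private:
    std::vector<ImageBarrier> imageBarriers_;
    std::vector<GlobalBarrier> globalBarriers_;
};
```

Ten image transitions plus a global dependency still cost a single flush, which is the whole point of the pattern.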
In the next section, we'll see how to use profiling tools to visualize the impact of these optimizations. + +== Simple Engine: Consolidation + +In `Simple Engine`, we apply this principle of barrier batching in our `Renderer::Render` loop. For example, during the **Opaque Pass** to **Post-Processing** transition, we collect all necessary image barriers—including those for the scene color and the depth buffer—into a single `vk::DependencyInfo`. + +One optimization we plan for a future version of `Simple Engine` is to centralize this further. By implementing a "Barrier Manager" that collects barriers across all systems (Renderer, Physics, Audio), we can reduce our total number of `pipelineBarrier2` calls per frame. This is a critical part of our roadmap toward a full **Render Graph** system, where all synchronization is calculated globally for each frame, ensuring that we never emit redundant barriers and that all transitions are batched for maximum hardware performance. + +== Navigation + +Previous: xref:Synchronization/Profiling_Optimization/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Profiling_Optimization/03_visualizing_stalls.adoc[Visualizing Stalls] diff --git a/en/Synchronization/Profiling_Optimization/03_visualizing_stalls.adoc b/en/Synchronization/Profiling_Optimization/03_visualizing_stalls.adoc new file mode 100644 index 00000000..475e80b1 --- /dev/null +++ b/en/Synchronization/Profiling_Optimization/03_visualizing_stalls.adoc @@ -0,0 +1,35 @@ +:pp: {plus}{plus} += Visualizing Stalls: Finding Your Pipeline Bubbles + +== Hardware Profilers + +The most effective way to optimize your synchronization code is to see what the GPU is actually doing. We use hardware profilers like **NVIDIA Nsight Graphics**, **AMD Radeon GPU Profiler**, or **Intel Graphics Performance Analyzers** to visualize the pipeline. 
+ +In these tools, you can see a "Timeline" view that shows exactly when each part of the GPU (graphics cores, compute cores, transfer engine) is busy. A **Pipeline Bubble** is a gap in this timeline—a period where the hardware is idle because it's waiting for a dependency that hasn't been reached. + +== Identifying the Cause + +When you find a bubble, you must determine its cause. Is it a real dependency (e.g., the lighting pass waiting for the G-Buffer to finish)? Or is it an artificial stall caused by a too-conservative barrier? + +A common mistake is using `vk::PipelineStageFlagBits2::eAllCommands` for every barrier. This tells the GPU: "Stop everything until all previous commands have finished." This is a massive "sledgehammer" that can create huge bubbles. Instead, you should always use the most specific stage mask possible (e.g., `eColorAttachmentOutput`). + +== Practical Refinement + +To refine your masks, follow this process: + +1. **Spot the Bubble**: Find a gap in the timeline in your profiler. +2. **Identify the Dependency**: Look at the barrier that precedes the gap. +3. **Refine the Stage Mask**: Check if the dependency can be satisfied by an earlier stage. For example, can your shadow pass start as soon as the vertex work of the previous frame is done? +4. **Verify the Fix**: Re-run the profiler and check if the bubble has shrunk or disappeared. + +== Closing the Series + +Congratulations! You've successfully navigated the complex and powerful world of **Synchronization 2**, **Timeline Semaphores**, and **Asynchronous Overlap**. You've built a renderer that is modern, validated, and optimized. + +Synchronization is one of the most challenging parts of Vulkan, but it's also where you have the most power to differentiate your engine's performance. 
By applying the principles we've learned in this series—using the most specific stage masks, batching your barriers, and visualizing your stalls—you can build a professional-grade renderer that squeezes every last drop of performance out of the hardware. + +Keep profiling, keep refining, and keep building! + +== Navigation + +Previous: xref:Synchronization/Profiling_Optimization/02_barrier_batching.adoc[Barrier Batching] | Next: xref:Synchronization/introduction.adoc[Back to Introduction] diff --git a/en/Synchronization/Synchronization_Validation/01_introduction.adoc b/en/Synchronization/Synchronization_Validation/01_introduction.adoc new file mode 100644 index 00000000..6960783d --- /dev/null +++ b/en/Synchronization/Synchronization_Validation/01_introduction.adoc @@ -0,0 +1,30 @@ +:pp: {plus}{plus} += Debugging with Synchronization Validation: Finding Your Hazards + +== Introduction + +Vulkan synchronization is a "trust but verify" system. You can write what you believe is perfectly correct `vk::DependencyInfo` and `vk::SubmitInfo2` code, but the only way to be absolutely certain is to test it against the actual hardware behavior. However, synchronization bugs are notoriously difficult to find. They often manifest as subtle flickering, occasional crashes, or—worst of all—perfect behavior on your development machine and complete failure on a customer's GPU. + +This is where the **LunarG Synchronization Validation** layer comes in. It is, without a doubt, the most important tool in your Vulkan debugging arsenal. Unlike the standard validation layers that check for API usage errors, the sync validation layer tracks the state of every resource in your engine and identifies the "Read-After-Write" (RAW), "Write-After-Read" (WAR), and "Write-After-Write" (WAW) hazards that lead to data corruption. + +== The Hazards We Face + +Synchronization is essentially about managing these three types of hazards: + +1. 
**Read-After-Write (RAW)**: A stage tries to read a resource before a previous stage has finished writing to it. This is the most common cause of "garbage" data. +2. **Write-After-Read (WAR)**: A stage tries to write to a resource while a previous stage is still reading from it. This can lead to the previous stage reading "half-updated" data. +3. **Write-After-Write (WAW)**: Two stages try to write to the same resource simultaneously. The result is unpredictable and almost always leads to corruption. + +== What We'll Explore + +In this chapter, we'll learn how to leverage the validation layers to make our engine perfectly robust. We'll explore: + +1. **The Validation Layer**: How to configure and enable the LunarG Synchronization Validation layer within your engine's debug build. +2. **Interpreting VUIDs**: Vulkan Valid Usage IDs (VUIDs) can be daunting. We'll learn how to decipher these complex error messages and turn them into actionable code fixes. +3. **Identifying Hazards**: We'll see real-world examples of how the validation layer catches hazards that are nearly impossible to find through manual inspection. + +By the end of this chapter, you'll have the tools and the knowledge to ensure that your synchronization code is not just "mostly correct," but "Vulkan-validated" correct.
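The three hazard categories can be captured in a few lines of code. The sketch below is purely illustrative (`classifyHazard` is not a Vulkan or validation-layer API); it reduces the layer's per-resource bookkeeping to its essence: compare the previous access to the current one.

```cpp
#include <cassert>
#include <string>

enum class Access { Read, Write };

// Classifies the hazard between two back-to-back accesses to the same
// resource when no barrier sits between them. This mirrors the per-resource
// tracking the sync validation layer performs, stripped to its core logic.
std::string classifyHazard(Access previous, Access current) {
    if (previous == Access::Write && current == Access::Read)  return "RAW";
    if (previous == Access::Read  && current == Access::Write) return "WAR";
    if (previous == Access::Write && current == Access::Write) return "WAW";
    return "none"; // read-after-read needs no synchronization
}
```

Note the asymmetry: two reads are always safe, but any pair involving a write needs a dependency between the accesses.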
+ +== Navigation + +Previous: xref:Synchronization/Host_Image_Copies_Memory_Sync/03_visibility_flushes.adoc[Visibility & Flushes] | Next: xref:Synchronization/Synchronization_Validation/02_validation_layer.adoc[The Validation Layer] diff --git a/en/Synchronization/Synchronization_Validation/02_validation_layer.adoc b/en/Synchronization/Synchronization_Validation/02_validation_layer.adoc new file mode 100644 index 00000000..8656b0b3 --- /dev/null +++ b/en/Synchronization/Synchronization_Validation/02_validation_layer.adoc @@ -0,0 +1,53 @@ +:pp: {plus}{plus} += The Validation Layer: Configuring Your Environment + +== Enabling Sync Validation + +**Synchronization Validation** ships as part of the standard `VK_LAYER_KHRONOS_validation` layer, but it is disabled by default. You must explicitly enable it through the `vk_layer_settings.txt` file or through your engine's `vk::InstanceCreateInfo`. + +To enable it via the instance, you add the `VK_VALIDATION_FEATURE_ENABLE_SYNCHRONIZATION_VALIDATION_EXT` flag to the `vk::ValidationFeaturesEXT` structure and pass it as the `pNext` of your `vk::InstanceCreateInfo`. + +[,c++] +---- +// The enumerator must live in a variable so we can take its address. +auto enabledFeature = vk::ValidationFeatureEnableEXT::eSynchronizationValidation; + +auto syncValidationFeature = vk::ValidationFeaturesEXT{ + .enabledValidationFeatureCount = 1, + .pEnabledValidationFeatures = &enabledFeature +}; + +auto instanceCreateInfo = vk::InstanceCreateInfo{ + .pNext = &syncValidationFeature, + // ... +}; + +auto instance = vk::raii::Instance(context, instanceCreateInfo); +---- + +== Working with vk_layer_settings.txt + +For a more flexible approach, you can create a `vk_layer_settings.txt` file in your application's working directory. This file allows you to configure many aspects of the validation layers without recompiling your code.
+ +---- +# Example vk_layer_settings.txt +khronos_validation.enables = VK_VALIDATION_FEATURE_ENABLE_SYNCHRONIZATION_VALIDATION_EXT +khronos_validation.report_flags = error,warn,perf,info +---- + +== What the Layer Actually Does + +Once enabled, the sync validation layer begins tracking every resource in your engine. It keeps a record of the last stage that wrote to a resource, the stage that is currently reading from it, and the stage that is next in line. + +If it detects a situation where two stages could be accessing the same memory without a proper barrier—for example, if a fragment shader starts reading a texture before its transfer copy has finished—the layer will emit a validation error. + +It’s important to note that the sync validation layer has a **non-trivial performance overhead**. It is not meant to be left on in your production or release builds. It should be used exclusively during development and testing to catch hazards before they become bugs. + +== Simple Engine: Development Workflow + +In `Simple Engine`, we integrate Synchronization Validation directly into our debug builds. When you run the engine with the `--debug-sync` command-line flag (or enable it in `renderer_core.cpp`), we automatically add `vk::ValidationFeatureEnableEXT::eSynchronizationValidation` to our instance creation. + +This is a critical part of our development workflow. Whenever we add a new rendering pass—like our recent **Forward+ Lighting** or **Ray Query Shadows**—we run the engine with synchronization validation enabled. This allows us to catch any "Write-After-Read" (WAR) or "Read-After-Write" (RAW) hazards early, before they manifest as flickering pixels or intermittent GPU hangs. By letting the tools find these hazards for us, we can spend more time optimizing our engine and less time chasing down elusive synchronization bugs. + +In the next section, we'll see how to decipher the error messages this layer produces.
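As a sketch of how such a debug-only switch can be wired up (the `--debug-sync` flag is Simple Engine's; the helper below is hypothetical):

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Hypothetical startup helper: decide whether to request synchronization
// validation. It returns true only when the debug flag is present, so release
// runs never pay the layer's performance overhead.
bool syncValidationRequested(const std::vector<std::string_view>& args) {
    for (std::string_view arg : args) {
        if (arg == "--debug-sync") {
            return true;
        }
    }
    return false;
}
```

When this returns true, the engine would append `vk::ValidationFeatureEnableEXT::eSynchronizationValidation` to its `vk::ValidationFeaturesEXT` before creating the instance, as shown earlier in this section.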
+ +== Navigation + +Previous: xref:Synchronization/Synchronization_Validation/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Synchronization_Validation/03_interpreting_vuids.adoc[Interpreting VUIDs] diff --git a/en/Synchronization/Synchronization_Validation/03_interpreting_vuids.adoc b/en/Synchronization/Synchronization_Validation/03_interpreting_vuids.adoc new file mode 100644 index 00000000..3275eaf7 --- /dev/null +++ b/en/Synchronization/Synchronization_Validation/03_interpreting_vuids.adoc @@ -0,0 +1,40 @@ +:pp: {plus}{plus} += Interpreting VUIDs: Deciphering Your Hazard Errors + +== The Anatomy of a VUID + +Vulkan **Valid Usage IDs (VUIDs)** are the specific error codes that the validation layers emit when they find a problem. These IDs, like `VUID-VkImageMemoryBarrier2-image-01199`, are not just random numbers. They correspond to specific rules in the Vulkan specification. + +When the sync validation layer finds a hazard, it will emit an error message that looks something like this: + +---- +VALIDATION [SYNC-HAZARD-READ-AFTER-WRITE] (0x01234567) +VUID: VUID-vkCmdDraw-None-07888 +Message: Read-After-Write (RAW) hazard on Image (0x89abcdef) in VkCommandBuffer (0x12345678). + - Current Stage: VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT + - Current Access: VK_ACCESS_2_SHADER_READ_BIT + - Previous Stage: VK_PIPELINE_STAGE_2_COPY_BIT + - Previous Access: VK_ACCESS_2_TRANSFER_WRITE_BIT +---- + +== Deciphering the Message + +To a new developer, this message can be overwhelming. But if you break it down, it’s actually telling you exactly what’s wrong: + +1. **Hazard Type**: The `[SYNC-HAZARD-READ-AFTER-WRITE]` tag tells you the nature of the problem. In this case, a read is happening before a previous write has finished. +2. **Resource**: The message identifies the specific resource (`Image (0x89abcdef)`) and the command buffer where the hazard occurred. +3.
**The Culprits**: The message lists the "Current" and "Previous" stages and access masks. In this example, the fragment shader is trying to read an image that was just being updated by a copy operation. + +== Actionable Fixes + +Once you understand what the message is telling you, the fix is usually straightforward: + +- **Add a Barrier**: If a previous stage is still writing when the current stage starts reading, you need to add a `vk::ImageMemoryBarrier2` (or a `vk::MemoryBarrier2`) between the two stages to ensure that the write is finished and visible. +- **Refine Your Stages**: If you already have a barrier, check that your `srcStageMask` and `dstStageMask` are correct. Did you wait for the correct stage? Did you use the correct access mask? +- **Check Your Submission**: If the hazard occurs between two different submissions, are you using a semaphore or a fence to coordinate them? + +By treating every VUID as a learning opportunity, you can systematically improve the quality and the performance of your synchronization code. In the final chapter, we'll see how to optimize these patterns for maximum GPU throughput. + +== Navigation + +Previous: xref:Synchronization/Synchronization_Validation/02_validation_layer.adoc[The Validation Layer] | Next: xref:Synchronization/Profiling_Optimization/01_introduction.adoc[Profiling, Batching, and Optimization] diff --git a/en/Synchronization/Timeline_Semaphores/01_introduction.adoc b/en/Synchronization/Timeline_Semaphores/01_introduction.adoc new file mode 100644 index 00000000..639bb515 --- /dev/null +++ b/en/Synchronization/Timeline_Semaphores/01_introduction.adoc @@ -0,0 +1,28 @@ +:pp: {plus}{plus} += Timeline Semaphores: The Master Clock + +== Introduction + +For years, Vulkan developers have had to juggle two very different synchronization primitives: **Binary Semaphores** and **Fences**. 
Binary semaphores were used exclusively for GPU-to-GPU synchronization (e.g., waiting for an image to be ready before sampling it), while Fences were used for GPU-to-CPU synchronization (e.g., waiting for a command buffer to finish before reusing its resources). + +This split forced us to write two different sets of logic for what is essentially the same problem: "Is the work done yet?" It also led to complex "semaphore chains" that were notoriously difficult to debug. + +**Timeline Semaphores**, introduced as an extension and now a core part of Vulkan 1.2+, change everything. They provide a single, unified primitive that can handle both GPU-to-GPU and GPU-to-CPU synchronization using a simple, monotonic `uint64_t` counter. + +== The Monotonic World + +In a timeline-based system, progress is measured by a value that only ever increases. When you submit a piece of work to the GPU, you tell it: "When you finish this, set the semaphore value to 10." If another piece of work needs that data, you tell it: "Don't start until the semaphore value is at least 10." + +This simple change has profound implications for how we architect our engines: + +1. **Unified Logic**: We no longer care if the "waiter" is the CPU or the GPU. The interface is the same: we wait for a specific value. +2. **Wait-Before-Signal**: One of the most powerful features of Timeline Semaphores is that you can submit work to the GPU that waits for a value that hasn't even been reached yet. This allows us to decouple our submission logic from our execution logic. +3. **Better Debugging**: Because the value is a simple integer, we can easily log it, inspect it in a debugger, or even use it to build a visual profiler of our engine's progress. + +In this chapter, we are going to explore how to implement Timeline Semaphores as the "master clock" of our renderer. 
We'll start by looking at how to replace our legacy fences and binary semaphores, then we'll dive into the implementation of the monotonic counter and the highly efficient wait-before-signal submission pattern. + +Let's begin by unifying our synchronization primitives. + +== Navigation + +Previous: xref:Synchronization/Pipeline_Barriers_Transitions/04_global_vs_local_barriers.adoc[Global vs. Local Barriers] | Next: xref:Synchronization/Timeline_Semaphores/02_unifying_sync.adoc[Unifying Synchronization] diff --git a/en/Synchronization/Timeline_Semaphores/02_unifying_sync.adoc b/en/Synchronization/Timeline_Semaphores/02_unifying_sync.adoc new file mode 100644 index 00000000..88a53b7a --- /dev/null +++ b/en/Synchronization/Timeline_Semaphores/02_unifying_sync.adoc @@ -0,0 +1,73 @@ +:pp: {plus}{plus} += Unifying Synchronization: Replacing Fences and Binary Semaphores + +== The Simplification + +The most immediate benefit of moving to Timeline Semaphores is that you can effectively delete your code for handling fences and binary semaphores. Instead of maintaining separate sets of primitives, you create a single `vk::raii::Semaphore` and configure it to be a **Timeline** type. + +In the RAII context, this configuration happens through the `vk::SemaphoreTypeCreateInfo` which is passed as the `pNext` of the standard `vk::SemaphoreCreateInfo`. + +[,c++] +---- +auto typeCreateInfo = vk::SemaphoreTypeCreateInfo{ + .semaphoreType = vk::SemaphoreType::eTimeline, + .initialValue = 0 +}; + +auto createInfo = vk::SemaphoreCreateInfo{ + .pNext = &typeCreateInfo +}; + +auto timelineSemaphore = vk::raii::Semaphore(device, createInfo); +---- + +== Handling CPU Waits + +Wait operations on the CPU, which used to require a `vk::Fence`, now use the `vk::Device::waitSemaphores` function. This function can wait for multiple semaphores simultaneously and will return as soon as all specified values have been reached. 
+ +[,c++] +---- +auto waitInfo = vk::SemaphoreWaitInfo{ + .semaphoreCount = 1, + .pSemaphores = &(*timelineSemaphore), + .pValues = &targetValue +}; + +// Wait for the GPU to reach targetValue (equivalent to vkWaitForFences) +auto result = device.waitSemaphores(waitInfo, timeoutInNanoseconds); +---- + +The beauty here is that we can now query the current value of the semaphore at any time using `device.getSemaphoreCounterValue`. This allows for much more flexible engine logic than the binary "is it done yet?" state of a fence. + +== Handling GPU Waits + +GPU-to-GPU synchronization, which used to require binary semaphores, now happens within the `vk::SubmitInfo2` (part of Synchronization 2). You specify the timeline semaphore and the specific value that the queue must wait for before beginning execution. + +[,c++] +---- +auto waitSemaphoreInfo = vk::SemaphoreSubmitInfo{ + .semaphore = *timelineSemaphore, + .value = requiredValue, + .stageMask = vk::PipelineStageFlagBits2::eAllCommands +}; + +auto submitInfo = vk::SubmitInfo2{ + .waitSemaphoreInfoCount = 1, + .pWaitSemaphoreInfos = &waitSemaphoreInfo, + // ... +}; + +queue.submit2(submitInfo); +---- + +By using the same primitive for both, we eliminate the need to synchronize between fences and semaphores. The GPU signals the timeline, and both the CPU and other GPU queues can respond to that same signal by waiting for the appropriate value. + +== Simple Engine: The Roadmap to Timeline + +Currently, `Simple Engine` uses the legacy combination of `inFlightFences` (for CPU-to-GPU sync) and `imageAvailableSemaphores` / `renderFinishedSemaphores` (for GPU-to-GPU sync). This requires us to carefully manage `MAX_FRAMES_IN_FLIGHT` sets of each primitive, leading to the "ping-pong" logic you've likely seen in `Renderer::Render`. + +Our next major architectural update will replace these with a single `Renderer::frameTimeline` semaphore. This will allow us to unify our wait logic. 
Instead of `device.waitForFences`, we will use `device.waitSemaphores` to wait for the specific frame index value. This significantly simplifies our `Renderer::Render` function and makes the frame loop much easier to reason about, especially as we introduce more complex asynchronous tasks. + +== Navigation + +Previous: xref:Synchronization/Timeline_Semaphores/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Timeline_Semaphores/03_monotonic_counter.adoc[The Monotonic Counter] diff --git a/en/Synchronization/Timeline_Semaphores/03_monotonic_counter.adoc b/en/Synchronization/Timeline_Semaphores/03_monotonic_counter.adoc new file mode 100644 index 00000000..161b5bf5 --- /dev/null +++ b/en/Synchronization/Timeline_Semaphores/03_monotonic_counter.adoc @@ -0,0 +1,50 @@ +:pp: {plus}{plus} += The Monotonic Counter: Tracking Global Progress + +== Understanding the Counter + +At the heart of every timeline semaphore is a single `uint64_t` value. This value is monotonic, meaning it can only ever increase. This simple property allows us to treat the entire execution of our GPU/CPU engine as a single, unified timeline. + +When you submit a command buffer to a queue, you associate it with a signal operation on a timeline semaphore. You assign a specific value to that signal—say, `frame_index * 10 + pass_index`. As the GPU completes each pass, the semaphore value increments. + +== Tracking Progress + +Because we can query this value from the CPU at any time using `device.getSemaphoreCounterValue`, we can build much more intelligent engine logic. For example, instead of waiting for a "Render Complete" fence, we can query the timeline and see exactly which stage the GPU is currently working on. 
+ +[,c++] +---- +uint64_t currentValue = device.getSemaphoreCounterValue(*timelineSemaphore); +if (currentValue >= PassValues::eShadowPassComplete) { + // We can start preparing the next pass that depends on shadows +} +---- + +This is particularly useful for asynchronous resource management. You can tag resources with the timeline value at which they were last used. When you need to reuse or destroy a resource, you simply check if the current semaphore value has exceeded that tag. This eliminates the need for conservative `deviceWaitIdle()` calls, which are often the primary cause of GPU bubbles and CPU stalls. + +== Strategic Value Selection + +Choosing how to increment your timeline values is an architectural decision. A common pattern is to use a large increment for each frame (e.g., 1000) and then use small sub-increments for each major pass within that frame. + +* Frame 1: + * Start: 1000 + * Shadow Pass: 1010 + * G-Buffer Pass: 1020 + * Lighting Pass: 1030 +* Frame 2: + * Start: 2000 + * Shadow Pass: 2010 + * ... + +This numbering scheme provides plenty of "headroom" for adding new passes or sub-steps without having to re-calculate every single synchronization value in your engine. It also makes your logs much easier to read, as the frame number is clearly encoded in the timeline value. + +By treating the timeline as a "master clock," you move away from micro-managing individual dependencies and toward managing the overall state and progress of your renderer. In the next section, we'll see how this enables one of the most powerful submission patterns in Vulkan: the wait-before-signal. + +== Simple Engine: Tracking Frame Progress + +In `Simple Engine`, we will use the monotonic counter to track the progress of each system. We'll define a set of `TimelineValues` that represent major milestones in our frame. For example, our `Renderer` could signal a value like `currentFrameIndex * 10 + passOffset` to indicate that a specific rendering stage has finished. 
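The value scheme described above is easy to centralize in a pair of helpers. This is a hypothetical sketch (names like `passValue` and `kValuesPerFrame` are not actual Simple Engine code), assuming the frame-times-ten scheme from the surrounding text:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding: each frame owns a block of kValuesPerFrame timeline
// values, and each pass signals frameIndex * kValuesPerFrame + passOffset.
constexpr std::uint64_t kValuesPerFrame = 10;

constexpr std::uint64_t passValue(std::uint64_t frameIndex, std::uint64_t passOffset) {
    return frameIndex * kValuesPerFrame + passOffset;
}

// A resource tagged with the timeline value of its last GPU use is safe to
// reclaim once the semaphore counter has reached (or passed) that tag.
constexpr bool safeToReclaim(std::uint64_t currentCounter, std::uint64_t resourceTag) {
    return currentCounter >= resourceTag;
}
```

Because the encoding is pure arithmetic on a monotonic counter, the same two functions serve both the renderer (signaling pass completion) and the memory allocator (deciding when a tagged resource can be freed).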
+ +This becomes incredibly powerful when paired with our `MemoryPool`. Instead of using a simple "frames since destroy" counter (like we currently do in `pendingASDeletions`), we can tag each resource with the exact `TimelineValue` at which it was last used by the GPU. When the `MemoryPool` needs to reclaim memory, it can simply query the current semaphore value. If `currentValue >= resourceTagValue`, the resource is guaranteed to be safe for destruction or reuse, with no extra stalls or conservative waits required. + +== Navigation + +Previous: xref:Synchronization/Timeline_Semaphores/02_unifying_sync.adoc[Unifying Synchronization] | Next: xref:Synchronization/Timeline_Semaphores/04_wait_before_signal.adoc[Wait-Before-Signal Submission] diff --git a/en/Synchronization/Timeline_Semaphores/04_wait_before_signal.adoc b/en/Synchronization/Timeline_Semaphores/04_wait_before_signal.adoc new file mode 100644 index 00000000..8fef30d4 --- /dev/null +++ b/en/Synchronization/Timeline_Semaphores/04_wait_before_signal.adoc @@ -0,0 +1,69 @@ +:pp: {plus}{plus} += Wait-Before-Signal Submission: Decoupling Execution + +== The Paradigm Shift + +In legacy Vulkan, you generally had to submit work in the order it was intended to execute. If Command Buffer B depended on Command Buffer A, you either had to submit A first, or submit them both in the same `vkQueueSubmit` call with a binary semaphore connecting them. + +Timeline Semaphores introduce the **Wait-Before-Signal** submission pattern. This allows you to submit Command Buffer B to the GPU *before* Command Buffer A has even been recorded, let alone submitted. You simply tell Command Buffer B to wait for a specific value on a timeline semaphore. As long as Command Buffer A (or some other process) eventually signals that value, the GPU will correctly manage the dependency. + +== How It Works + +This pattern works because Vulkan separates the **submission** of work from the **execution** of work. 
When you call `queue.submit2`, the driver simply adds your commands to the queue's internal buffer. The hardware's command processor then monitors the specified timeline semaphores. It will not begin executing the commands until all the "wait" values have been reached. + +[,c++] +---- +// Submit the "Waiter" first! +auto waitInfo = vk::SemaphoreSubmitInfo{ + .semaphore = *timelineSemaphore, + .value = 10, + .stageMask = vk::PipelineStageFlagBits2::eAllCommands +}; + +auto submitWaiter = vk::SubmitInfo2{ + .waitSemaphoreInfoCount = 1, + .pWaitSemaphoreInfos = &waitInfo, + .commandBufferInfoCount = 1, + .pCommandBufferInfos = &waitCommandBufferInfo +}; + +graphicsQueue.submit2(submitWaiter); + +// ... Later, perhaps in a different thread or even a different frame ... + +// Submit the "Signaler" +auto signalInfo = vk::SemaphoreSubmitInfo{ + .semaphore = *timelineSemaphore, + .value = 10, + .stageMask = vk::PipelineStageFlagBits2::eAllCommands +}; + +auto submitSignaler = vk::SubmitInfo2{ + .signalSemaphoreInfoCount = 1, + .pSignalSemaphoreInfos = &signalInfo, + .commandBufferInfoCount = 1, + .pCommandBufferInfos = &signalCommandBufferInfo +}; + +transferQueue.submit2(submitSignaler); +---- + +== Why This Matters + +This decoupling is a game-changer for modern, multi-threaded engine architectures. + +1. **Reduced CPU Latency**: Your main thread can submit all its work to the GPU as soon as the command buffers are recorded, without waiting for background threads (like an asset loader or a physics engine) to finish their work. +2. **Asynchronous Overlap**: It makes it much easier to implement overlapping passes. For example, your GPU can start its geometry pass while the CPU is still finishing the recording of the post-processing pass, as long as the post-processing pass waits for the geometry timeline value. +3. **Simplified Architecture**: You can build your submission logic around the "needs" of each pass, rather than worrying about the strict ordering of API calls. 
+ +Wait-before-signal is the final piece of the puzzle for a truly modern Vulkan renderer. By combining the precision of Synchronization 2 with the flexibility of Timeline Semaphores, you can build an engine that is both easier to reason about and capable of squeezing every last drop of performance out of the hardware. + +== Simple Engine: Non-Blocking Submission + +In `Simple Engine`, we will use this pattern to decouple our `PhysicsSystem` from our `Renderer`. Currently, the renderer must wait for the physics simulation to finish on the CPU before it can even *record* its command buffers. This creates a massive CPU stall every frame. + +With wait-before-signal, our `Renderer` will simply record its commands to wait for the `physicsTimeline` to reach a specific value (e.g., `currentFrameIndex`). It can then submit those commands immediately to the `graphicsQueue`. Even if the `PhysicsSystem` hasn't finished its simulation on the `computeQueue` yet, the GPU will correctly wait at the beginning of the frame's rendering. This allows the CPU to move on to other tasks (like audio processing or input handling) while the GPU is efficiently managing the dependency itself. + +== Navigation + +Previous: xref:Synchronization/Timeline_Semaphores/03_monotonic_counter.adoc[The Monotonic Counter] | Next: xref:Synchronization/Frame_in_Flight/01_introduction.adoc[Frame-in-Flight Architecture] diff --git a/en/Synchronization/Transfer_Queues_Streaming/01_introduction.adoc b/en/Synchronization/Transfer_Queues_Streaming/01_introduction.adoc new file mode 100644 index 00000000..86a6fa6b --- /dev/null +++ b/en/Synchronization/Transfer_Queues_Streaming/01_introduction.adoc @@ -0,0 +1,32 @@ +:pp: {plus}{plus} += Transfer Queues & Asset Streaming Sync: Non-Blocking Uploads + +== Introduction + +In a modern, open-world game or a complex architectural visualization, we can't afford to load all our assets upfront. 
We need to stream textures, meshes, and animation data in the background as the player moves through the world. If we do this on the main graphics queue, we risk introducing "stutters" (dropped frames) every time we submit a large upload. + +The solution is to use a **Dedicated Transfer Queue**. Most modern GPUs have a specialized engine designed specifically for moving data from CPU-visible staging buffers to GPU-optimal memory. This engine can run completely independently of the graphics and compute units, allowing us to stream gigabytes of data without affecting the frame rate. + +== The Staging Pipeline + +Asset streaming is a multi-step process. First, the CPU maps a **Staging Buffer** and writes the raw asset data (for example, decoded image pixels or mesh vertex data). Then, the transfer queue is used to copy that data into a **GPU-Optimal Buffer or Image**. Finally, the graphics queue is notified that the data is ready so it can begin using it in a shader. + +The challenge, as always, is synchronization. We must ensure that: + +1. **CPU to Transfer**: The transfer queue doesn't start copying until the CPU has finished writing to the staging buffer. +2. **Transfer to GPU**: The transfer operation is complete and the data is visible in GPU memory. +3. **Transfer to Graphics**: The graphics queue doesn't try to sample the texture until the transfer queue has finished its work and, if necessary, released ownership of the resource. + +== What We'll Build + +In this chapter, we will implement a robust, non-blocking asset streaming system. We'll explore: + +1. **Non-Blocking Data Uploads**: How to utilize a dedicated transfer queue for background texture and buffer streaming. +2. **Staging Synchronization**: Coordinating **Timeline Semaphores** to ensure the graphics queue waits for the transfer to complete before sampling new data. +3. **Ownership Handshakes**: Implementing the queue family ownership transfers we learned about in Chapter 3, but in the context of a background streaming system.
+ +By the end of this chapter, you'll have a streaming architecture that allows your engine to load massive amounts of data in the background while maintaining a perfectly smooth, stutter-free frame rate. + +== Navigation + +Previous: xref:Synchronization/Async_Compute_Overlap/04_bubble_problem.adoc[The Bubble Problem] | Next: xref:Synchronization/Transfer_Queues_Streaming/02_non_blocking_uploads.adoc[Non-Blocking Data Uploads] diff --git a/en/Synchronization/Transfer_Queues_Streaming/02_non_blocking_uploads.adoc b/en/Synchronization/Transfer_Queues_Streaming/02_non_blocking_uploads.adoc new file mode 100644 index 00000000..3e63987d --- /dev/null +++ b/en/Synchronization/Transfer_Queues_Streaming/02_non_blocking_uploads.adoc @@ -0,0 +1,66 @@ +:pp: {plus}{plus} += Non-Blocking Data Uploads: Utilizing the Dedicated Transfer Queue + +== Why Use a Dedicated Queue? + +In a simple Vulkan application, we might use the same queue for graphics, compute, and transfer work. This is easy to implement, but it's not efficient. Every time we submit a large transfer, the graphics queue has to stop what it's doing and wait for the transfer engine to finish. This creates a "stutter" in our frame rate. + +By using a **Dedicated Transfer Queue**, we can perform these uploads in the background. The transfer engine is a specialized piece of hardware that can move data between memory locations without using the GPU's compute or graphics cores. By offloading these tasks, we can keep our main rendering pipeline running at full speed. + +== Implementing the Transfer + +When we use a dedicated transfer queue, we must be careful with how we record and submit our command buffers. We typically use a specialized **Transfer Command Pool** that is tied to our transfer queue family. 
+ +[,c++] +---- +// Allocate and record a transfer command buffer. +// (The image is assumed to already be in eTransferDstOptimal.) +auto allocInfo = vk::CommandBufferAllocateInfo{ .commandPool = *transferPool, .level = vk::CommandBufferLevel::ePrimary, .commandBufferCount = 1 }; +auto cmd = std::move(vk::raii::CommandBuffers(device, allocInfo).front()); +cmd.begin({ .flags = vk::CommandBufferUsageFlagBits::eOneTimeSubmit }); + +// Copy from staging buffer to GPU-optimal image +auto region = vk::BufferImageCopy{ + .bufferOffset = 0, + .imageSubresource = { .aspectMask = vk::ImageAspectFlagBits::eColor, .mipLevel = 0, .baseArrayLayer = 0, .layerCount = 1 }, + .imageExtent = extent +}; +cmd.copyBufferToImage(*stagingBuffer, *gpuImage, vk::ImageLayout::eTransferDstOptimal, region); + +cmd.end(); +---- + +== Submitting for Parallel Execution + +The key to non-blocking uploads is submitting our transfer work to the transfer queue *independently* of our main graphics loop. We don't want our CPU to wait for the transfer to finish. Instead, we use a **Timeline Semaphore** to signal when the transfer is complete. + +[,c++] +---- +// On the background thread +auto cmdInfo = vk::CommandBufferSubmitInfo{ .commandBuffer = *cmd }; + +auto signalInfo = vk::SemaphoreSubmitInfo{ + .semaphore = *transferTimeline, + .value = nextTransferValue++, + .stageMask = vk::PipelineStageFlagBits2::eAllTransfer +}; + +auto submit = vk::SubmitInfo2{ + .commandBufferInfoCount = 1, + .pCommandBufferInfos = &cmdInfo, + .signalSemaphoreInfoCount = 1, + .pSignalSemaphoreInfos = &signalInfo +}; + +transferQueue.submit2(submit); +---- + +Because we are using a dedicated queue, the GPU can process this transfer while it is simultaneously rendering frame N or frame N+1 on its graphics queue. There is no contention for the command processor or the shader units. + +== Simple Engine: The Streaming Thread + +In `Simple Engine`, we have a dedicated `LoadingThread` that handles the background loading and uploading of textures. This thread uses a separate `vk::raii::CommandPool` and a dedicated `transferQueue` (if available on the hardware).
When a new texture needs to be uploaded, the loading thread records its own transfer commands and submits them to the `transferQueue` independently of the main rendering loop. + +This architecture ensures that our frame rates remain smooth even when loading large new areas of the Bistro scene. The main `Renderer::Render` function is never blocked by the transfer engine. Instead, the renderer only needs to check the status of the `transferTimeline` before it can start using the new texture. This is a much more scalable and responsive approach than the traditional "stop-the-world" loading screen, and it's a key part of how `Simple Engine` achieves high performance on a wide range of hardware. + +In the next section, we'll see how to coordinate the synchronization to ensure that the graphics queue waits for the transfer to finish before trying to sample the newly uploaded data. + +== Navigation + +Previous: xref:Synchronization/Transfer_Queues_Streaming/01_introduction.adoc[Introduction] | Next: xref:Synchronization/Transfer_Queues_Streaming/03_staging_sync.adoc[Staging Synchronization] diff --git a/en/Synchronization/Transfer_Queues_Streaming/03_staging_sync.adoc b/en/Synchronization/Transfer_Queues_Streaming/03_staging_sync.adoc new file mode 100644 index 00000000..a0326ec7 --- /dev/null +++ b/en/Synchronization/Transfer_Queues_Streaming/03_staging_sync.adoc @@ -0,0 +1,72 @@ +:pp: {plus}{plus} += Staging Synchronization: Coordinating Graphics and Transfer + +== The Handshake + +Once your background transfer queue has finished copying a new asset to GPU-optimal memory, you need a way to tell the graphics queue that the data is ready to be sampled. This is where the **Timeline Semaphore** and **Queue Family Ownership Transfer** come together. + +The coordination follows a simple three-step process: + +1. **Transfer Release**: The transfer queue performs its copy and then records a pipeline barrier to "release" ownership of the resource (as we learned in Chapter 3). 
+2. **Semaphore Signal**: The transfer queue signals a specific value on a timeline semaphore when its work is finished. +3. **Graphics Acquire**: The graphics queue waits for that same semaphore value and then records its own pipeline barrier to "acquire" ownership of the resource. + +== Coordinating the Handshake + +The beauty of the **Wait-Before-Signal** pattern we learned in Chapter 4 is that the graphics queue can be submitted *before* the transfer has even finished. As long as the graphics queue waits for the correct timeline value, the hardware will ensure the transfer completes first. + +[,c++] +---- +// Graphics Queue Submission (Wait for Transfer) +auto waitInfo = vk::SemaphoreSubmitInfo{ + .semaphore = *transferTimeline, + .value = upload_complete_value, + .stageMask = vk::PipelineStageFlagBits2::eFragmentShader +}; + +auto acquireBarrier = vk::ImageMemoryBarrier2{ + .srcStageMask = vk::PipelineStageFlagBits2::eNone, + .srcAccessMask = vk::AccessFlagBits2::eNone, + .dstStageMask = vk::PipelineStageFlagBits2::eFragmentShader, + .dstAccessMask = vk::AccessFlagBits2::eShaderRead, + .oldLayout = vk::ImageLayout::eTransferDstOptimal, + .newLayout = vk::ImageLayout::eShaderReadOnlyOptimal, + .srcQueueFamilyIndex = transferQueueIndex, + .dstQueueFamilyIndex = graphicsQueueIndex, + .image = newTexture.image(), + .subresourceRange = subresourceRange +}; + +// Record and submit the acquire barrier on the graphics queue +graphicsCommandBuffer.pipelineBarrier2(vk::DependencyInfo{.imageMemoryBarrierCount = 1, .pImageMemoryBarriers = &acquireBarrier}); + +auto submit = vk::SubmitInfo2{ + .waitSemaphoreInfoCount = 1, + .pWaitSemaphoreInfos = &waitInfo, + // ... +}; + +graphicsQueue.submit2(submit); +---- + +== Handling Resource Lifetimes + +When you stream assets in the background, you must also be careful with the lifetime of your **Staging Buffer**. You cannot reuse or destroy the staging buffer until the transfer queue has finished copying from it. 
This is another area where the timeline semaphore is invaluable. By tagging each staging allocation with its completion value, you can build a simple "garbage collector" that reclaims memory only when it's safe to do so. + +By coordinating your staging synchronization with your timeline semaphores, you can build an engine that is both high-performance and extremely robust. In the next chapter, we'll see how these same principles apply to the modern world of **Dynamic Rendering**, where traditional subpass dependencies have been replaced by these explicit synchronization patterns. + +== Coordinating in Simple Engine + +In `Simple Engine`, this coordination happens in the `Renderer::ProcessPendingMeshUploads` method. When the `LoadingThread` has finished a transfer, it adds the new mesh/texture to a thread-safe queue. The main renderer then checks this queue once per frame. + +For each new asset, the renderer: + +1. **Gets the Timeline Value**: It retrieves the `uploadTimelineValue` that the loading thread signaled for this specific upload. +2. **Records the Acquire Barrier**: It records a `vk::ImageMemoryBarrier2` on the main graphics command buffer. Since we use `eConcurrent` for our images, this barrier primarily handles the layout transition from `eTransferDstOptimal` to `eShaderReadOnlyOptimal` and ensures the data is invalidated in the graphics caches. +3. **Waits for Completion**: It adds a `vk::SemaphoreSubmitInfo` to the main frame submission. This tells the `graphicsQueue` to wait until the `uploadTimeline` has reached the `uploadTimelineValue` before it can start executing the fragment shaders that sample the new texture. + +This robust handshake ensures that `Simple Engine` never tries to draw a texture that is still being copied, even if the background thread is running on a different CPU core and submitting to a different GPU queue. 
== Navigation + +Previous: xref:Synchronization/Transfer_Queues_Streaming/02_non_blocking_uploads.adoc[Non-Blocking Data Uploads] | Next: xref:Synchronization/Dynamic_Rendering_Sync/01_introduction.adoc[Synchronization in Dynamic Rendering] diff --git a/en/Synchronization/introduction.adoc b/en/Synchronization/introduction.adoc new file mode 100644 index 00000000..5626f86c --- /dev/null +++ b/en/Synchronization/introduction.adoc @@ -0,0 +1,59 @@ +:pp: {plus}{plus} += Synchronization 2: Mastering the GPU/CPU Handshake + +== Introduction + +Welcome to the **Synchronization 2** tutorial series! If you've spent any significant time with Vulkan, you've likely encountered the "Sync Wall." It’s that moment when your code runs perfectly on your development machine but flickers on another, or when you realize that your high-performance GPU is spending half its time waiting for a single, overly conservative barrier. + +Synchronization is arguably the most challenging part of Vulkan, but it’s also the most powerful. It is the language we use to tell the hardware exactly how data flows through the pipeline. In this series, we are going to move beyond the legacy Vulkan 1.0 synchronization systems—those fragmented bitmasks and binary semaphores—and embrace the modern standard: **Synchronization 2** and **Timeline Semaphores**. + +=== Why a New System? + +Vulkan 1.0 synchronization was a breakthrough in control, but it is notoriously difficult to work with and understand. The original pipeline barriers were split across multiple structures, and the stage masks were designed around the hardware of a decade ago. That is no accident: Vulkan is ten years old at the time of writing, and modern hardware and techniques have since enabled cleaner approaches that preserve the same level of control.
+ +Synchronization 2, which arrived as an extension and is now a core part of Vulkan 1.3, simplifies this landscape by unifying everything into the `vk::DependencyInfo` structure. It provides a clearer, more intuitive way to define dependencies, using 64-bit masks that can target modern hardware units—like task and mesh shaders—with surgical precision. + +When we combine this with **Timeline Semaphores**, we move from a world of "binary" signals (on or off) to a world of monotonic counters. This allows us to treat the entire GPU/CPU execution as a single, unified timeline, drastically simplifying how we manage multiple frames in flight and asynchronous work. + +=== What You'll Learn + +This isn't just a list of API calls. We are going to build an engine-grade synchronization architecture. Throughout this series, we will: + +1. **Deconstruct the Dependency**: We'll look under the hood at how GPUs actually handle memory and why an "execution dependency" alone isn't enough to prevent data corruption. +2. **Master the New Barrier**: You'll learn how to use `vk::DependencyInfo` to replace legacy barriers, making your code cleaner and more performant. +3. **Harness the Timeline**: We'll implement Timeline Semaphores as the "master clock" of our engine, replacing fences and binary semaphores with a more robust monotonic counter. +4. **Architect for Concurrency**: We'll rebuild the main engine loop to handle multiple frames in flight and implement asynchronous compute and transfer operations that overlap with your main graphics work. +5. **Leverage Modern Vulkan**: We'll dive into Vulkan 1.4 features, including **Host Image Copies** and tile-local reads in **Dynamic Rendering**, to stay on the cutting edge of the API. + +=== Prerequisites + +This series is designed as an "Advanced Topic" that builds directly on the foundations established in our main Vulkan tutorial. 
We assume you are comfortable with: + +* The basic Vulkan rendering loop (Command Buffers, Pipelines, and Descriptor Sets). +* Modern C{pp} (RAII, smart pointers, and basic templates). +* The fundamental concepts of graphics pipelines (Vertex/Fragment stages). + +If you’re new to Vulkan, we strongly recommend completing the xref:00_Introduction.adoc[main tutorial] first. For those following along with our engine-building journey, this series perfectly complements the xref:Building_a_Simple_Engine/introduction.adoc[Building a Simple Engine] tutorial by providing the deep-dive synchronization knowledge required for a truly professional-grade renderer. + +=== A Note on Tooling + +In this series, we will be using **Slang** for all our shader examples. Slang’s productivity features and its ability to target SPIR-V natively make it the perfect companion for modern synchronization. We’ll also lean heavily on the **LunarG Synchronization Validation** layer—your best friend when it comes to identifying the "Write-After-Read" (WAR) and "Read-After-Write" (RAW) hazards that can be so hard to track down manually. + +Let's begin by tearing down a dependency to see what it's really made of. + +== Chapters in this series + +1. xref:Synchronization/Anatomy_of_a_Dependency/01_introduction.adoc[The Anatomy of a Dependency] - Understanding the core mechanics of how data moves through the pipeline. +2. xref:Synchronization/Pipeline_Barriers_Transitions/01_introduction.adoc[Pipeline Barriers and Image Layout Transitions] - Mastering the new barrier system. +3. xref:Synchronization/Timeline_Semaphores/01_introduction.adoc[Timeline Semaphores: The Master Clock] - Moving to a monotonic world. +4. xref:Synchronization/Frame_in_Flight/01_introduction.adoc[Frame-in-Flight Architecture] - Building the heartbeat of your engine. +5. xref:Synchronization/Async_Compute_Overlap/01_introduction.adoc[Asynchronous Compute & Execution Overlap] - Parallelizing your GPU work. +6.
xref:Synchronization/Transfer_Queues_Streaming/01_introduction.adoc[Transfer Queues & Asset Streaming Sync] - Streaming assets without the stutter. +7. xref:Synchronization/Dynamic_Rendering_Sync/01_introduction.adoc[Synchronization in Dynamic Rendering] - Modern sync in a pass-less world. +8. xref:Synchronization/Host_Image_Copies_Memory_Sync/01_introduction.adoc[Host Image Copies & Memory Mapped Sync] - Direct CPU-to-GPU memory management. +9. xref:Synchronization/Synchronization_Validation/01_introduction.adoc[Debugging with Synchronization Validation] - Letting the tools find your hazards. +10. xref:Synchronization/Profiling_Optimization/01_introduction.adoc[Profiling, Batching, and Optimization] - Squeezing out every last millisecond. + +== Navigation + +Next: xref:Synchronization/Anatomy_of_a_Dependency/01_introduction.adoc[The Anatomy of a Dependency] diff --git a/images/image_barrier_anatomy.svg b/images/image_barrier_anatomy.svg new file mode 100644 index 00000000..9a8abc23 --- /dev/null +++ b/images/image_barrier_anatomy.svg (SVG markup lost in extraction; diagram labels: "Source State" / "vk::ImageMemoryBarrier2" / "Transition" / "Destination State" / "Execution Wait" / "Memory Visibility" / "Layout Transition") diff --git a/images/sync2_problem_over_sync.svg b/images/sync2_problem_over_sync.svg new file mode 100644 index 00000000..75c05e0b --- /dev/null +++ b/images/sync2_problem_over_sync.svg (SVG markup lost in extraction; diagram title: Legacy Vulkan 1.0: The "Log Jam" Problem) diff --git a/images/sync2_solution_granular.svg b/images/sync2_solution_granular.svg new file mode 100644 index 00000000..4a880495 --- /dev/null +++ b/images/sync2_solution_granular.svg (SVG markup lost in extraction; diagram title: Synchronization 2: Granular and Parallel)