Fix: reset signal slots after Phase 2 barrier for reentrant collectiv…#1
Open
georgebisbas wants to merge 2 commits into
Open
Conversation
…e kernels The TNOTIFY(AtomicAdd, 1) + TWAIT(GE, 1) barrier pattern leaves each signal slot permanently at 1 after the first kernel invocation. A second call on the same HCCL window (e.g. inside a training loop) would find signal_base[peer] >= 1 already, causing every TWAIT to return immediately without actually synchronising — a silent correctness bug. Fix: after all TWAIT calls complete and pipe_barrier(PIPE_ALL) drains in-flight MTE2 reads, zero signal_base[0..nranks-1] in the local window and flush with a second pipe_barrier(PIPE_ALL). Safety of the reset: signal_base[i] in OUR window is where peer i writes its arrival notification. TWAIT(signal_base[i], GE 1) returning guarantees that peer i's AtomicAdd DMA has already committed to our local memory; no further write to signal_base[i] can arrive from the current invocation. We are the exclusive writer of our own signal_base, so zeroing it here races with nothing. Applied uniformly to all three collective kernels: - allreduce_distributed - allgather_distributed (new, added in parent commit) - reduce_scatter_distributed (new, added in parent commit) Addresses the HIGH-priority barrier comment from the Gemini review on hw-native-sys#842
Keep kernel source concise while preserving full safety rationale in prior commit message on this branch (60db4df). Behavior is unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
…e kernels
The TNOTIFY(AtomicAdd, 1) + TWAIT(GE, 1) barrier pattern leaves each signal slot permanently at 1 after the first kernel invocation. A second call on the same HCCL window (e.g. inside a training loop) would find signal_base[peer] >= 1 already, causing every TWAIT to return immediately without actually synchronising — a silent correctness bug.
Fix: after all TWAIT calls complete and pipe_barrier(PIPE_ALL) drains in-flight MTE2 reads, zero signal_base[0..nranks-1] in the local window and flush with a second pipe_barrier(PIPE_ALL).
Safety of the reset:
signal_base[i] in OUR window is where peer i writes its arrival
notification. TWAIT(signal_base[i], GE 1) returning guarantees that
peer i's AtomicAdd DMA has already committed to our local memory; no
further write to signal_base[i] can arrive from the current invocation.
We are the exclusive writer of our own signal_base, so zeroing it here
races with nothing.
Applied uniformly to all three collective kernels:
Addresses the HIGH-priority barrier comment from the Gemini review on hw-native-sys#842