Fix webgpu native SDPA test: select attention output by position, not numel#20282
Fix webgpu native SDPA test: select attention output by position, not numel#20282shoumikhin wants to merge 4 commits into
Conversation
… numel The sdpa_with_kv_cache sweep picked the attention output among the mutating op's outputs by matching element count. For llama1b_prefill (Hq=32,Hkv=8,D=64,S=128,Cmax=512) the attention output (S*Hq*D=262144) shares its numel with both the k and v caches (Cmax*Hkv*D=262144, since S*Hq==Cmax*Hkv), so three tensors matched and the test failed with 'ambiguous attention output'. The op returns [k_cache, v_cache, attn_output] in a fixed order, so select the last output instead and keep a numel sanity check. Test-only; the kernel is correct.
This PR needs a
|
There was a problem hiding this comment.
Pull request overview
Fixes the WebGPU native SDPA test harness by selecting the attention output by output position (fixed output order) rather than by element count, avoiding ambiguity when caches and attention output share the same numel (e.g. llama1b_prefill).
Changes:
- Select SDPA attention output as the last output (
outputs.back()) instead of searching bynumel. - Add a
numelsanity check on the selected output and update related failure messages.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…numel asserts Use the documented ExecuTorch output order [*mutated_inputs, *user_outputs] = [k_cache, v_cache, attn_output] and select the attention output by position (index 2) instead of by element count. Assert outputs.size()==3, that each output is a tensor, and that the two cache slots have numel Cmax*Hkv*D, so a future change in output arity/order fails loudly. Print numel as %zu/size_t to match the rest of the file and avoid truncation. Addresses Copilot review comments.
The replay and dynamic-input_pos decode tests had the same latent bug as the sweep: they classified outputs by element count (ne==qn attn, ne==cn cache), which is ambiguous when the attention output and caches share numel. Switch both to the documented positional order [k_cache, v_cache, attn_output] and assert 3 tensor outputs; the existing content-based k/v cache disambiguation and cross-step threading are unchanged.
- Remove the leftover replay comment claiming the attn output has a unique numel (the whole point of this PR is that it can equal the cache numel). - In both replay and dynamic-decode, after positional selection, assert the two cache outputs have numel cn before step-0 identification/threading read cn/kvn elements from them, turning a short/changed tensor into a clean failure instead of an out-of-bounds read.
|
Superseded by #20283, which fixes the same One thing #20283 does not cover: it only updates |
Summary
test-webgpu-native / linux-job(Test WebGPU Native (Dawn)) has been red on every commit since the WebGPU SDPA test suite landed on June 13. This is a test-harness bug, not a kernel bug.The
sdpa_with_kv_cachesweep selects the attention output among the mutating op's outputs by matching the element count (numel). For thellama1b_prefillconfig (Hq=32, Hkv=8, D=64, S=128, Cmax=512) that selection is ambiguous:Because
S*Hq == Cmax*Hkv(both 4096), all three tensors share numel 262144, so three tensors match and the test fails withambiguous attention output: 3 tensors match numel 262144. Every other shape passes (max error ~1e-8), so the kernel is correct.Fix
The mutating op returns
[k_cache, v_cache, attn_output]in a fixed order (already documented in the original comment), so the attention output is always the last output. Select the last output directly and keep a numel sanity check. Test-only change.Test plan
test-webgpu-native / linux-jobshould go green; thellama1b_prefillconfig now passes its numeric check instead of failing output selection. All other SDPA configs are unaffected.Note: the SDPA replay tests in the same file use a separate cache-detection scheme and are not affected by this change.