Skip to content

Fix webgpu native SDPA test: select attention output by position, not numel#20282

Closed
shoumikhin wants to merge 4 commits into
mainfrom
oncall-fix-webgpu-sdpa-output-selection
Closed

Fix webgpu native SDPA test: select attention output by position, not numel#20282
shoumikhin wants to merge 4 commits into
mainfrom
oncall-fix-webgpu-sdpa-output-selection

Conversation

@shoumikhin

Copy link
Copy Markdown
Contributor

Summary

test-webgpu-native / linux-job (Test WebGPU Native (Dawn)) has been red on every commit since the WebGPU SDPA test suite landed on June 13. This is a test-harness bug, not a kernel bug.

The sdpa_with_kv_cache sweep selects the attention output among the mutating op's outputs by matching the element count (numel). For the llama1b_prefill config (Hq=32, Hkv=8, D=64, S=128, Cmax=512) that selection is ambiguous:

  • attention output numel = SHqD = 3212864 = 262144
  • k cache numel = v cache numel = CmaxHkvD = 851264 = 262144

Because S*Hq == Cmax*Hkv (both 4096), all three tensors share numel 262144, so three tensors match and the test fails with ambiguous attention output: 3 tensors match numel 262144. Every other shape passes (max error ~1e-8), so the kernel is correct.

Fix

The mutating op returns [k_cache, v_cache, attn_output] in a fixed order (already documented in the original comment), so the attention output is always the last output. Select the last output directly and keep a numel sanity check. Test-only change.

Test plan

test-webgpu-native / linux-job should go green; the llama1b_prefill config now passes its numeric check instead of failing output selection. All other SDPA configs are unaffected.

Note: the SDPA replay tests in the same file use a separate cache-detection scheme and are not affected by this change.

… numel

The sdpa_with_kv_cache sweep picked the attention output among the
mutating op's outputs by matching element count. For llama1b_prefill
(Hq=32,Hkv=8,D=64,S=128,Cmax=512) the attention output (S*Hq*D=262144)
shares its numel with both the k and v caches (Cmax*Hkv*D=262144, since
S*Hq==Cmax*Hkv), so three tensors matched and the test failed with
'ambiguous attention output'. The op returns [k_cache, v_cache,
attn_output] in a fixed order, so select the last output instead and
keep a numel sanity check. Test-only; the kernel is correct.
Copilot AI review requested due to automatic review settings June 15, 2026 14:54
@pytorch-bot

pytorch-bot Bot commented Jun 15, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20282

Note: Links to docs will display an error until the docs builds have been completed.

❌ 79 Cancelled Jobs, 74 Pending

As of commit 87ab706 with merge base 06143cb (image):

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 15, 2026
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes the WebGPU native SDPA test harness by selecting the attention output by output position (fixed output order) rather than by element count, avoiding ambiguity when caches and attention output share the same numel (e.g. llama1b_prefill).

Changes:

  • Select SDPA attention output as the last output (outputs.back()) instead of searching by numel.
  • Add a numel sanity check on the selected output and update related failure messages.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backends/webgpu/test/test_webgpu_native.cpp Outdated
Comment thread backends/webgpu/test/test_webgpu_native.cpp
…numel asserts

Use the documented ExecuTorch output order [*mutated_inputs, *user_outputs] = [k_cache, v_cache, attn_output] and select the attention output by position (index 2) instead of by element count. Assert outputs.size()==3, that each output is a tensor, and that the two cache slots have numel Cmax*Hkv*D, so a future change in output arity/order fails loudly. Print numel as %zu/size_t to match the rest of the file and avoid truncation. Addresses Copilot review comments.
The replay and dynamic-input_pos decode tests had the same latent bug as the
sweep: they classified outputs by element count (ne==qn attn, ne==cn cache),
which is ambiguous when the attention output and caches share numel. Switch
both to the documented positional order [k_cache, v_cache, attn_output] and
assert 3 tensor outputs; the existing content-based k/v cache disambiguation
and cross-step threading are unchanged.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment thread backends/webgpu/test/test_webgpu_native.cpp Outdated
Comment thread backends/webgpu/test/test_webgpu_native.cpp
Comment thread backends/webgpu/test/test_webgpu_native.cpp
- Remove the leftover replay comment claiming the attn output has a unique
  numel (the whole point of this PR is that it can equal the cache numel).
- In both replay and dynamic-decode, after positional selection, assert the
  two cache outputs have numel cn before step-0 identification/threading read
  cn/kvn elements from them, turning a short/changed tensor into a clean
  failure instead of an out-of-bounds read.
@shoumikhin

Copy link
Copy Markdown
Contributor Author

Superseded by #20283, which fixes the same test-webgpu-native (Dawn) red by selecting the SDPA attention output by shape instead of element count. That landed on main first, so I'm closing this PR to avoid a redundant, conflicting change.

One thing #20283 does not cover: it only updates test_sdpa_config. The test_sdpa_replay and test_sdpa_dynamic_decode functions in the same file still select the attention output by numel and carry the same latent ambiguity (not triggered by any currently configured shape, since none has SHq == CmaxHkv). Happy to send a small follow-up that hardens those two to match the shape-based selection if that's useful.

@shoumikhin shoumikhin closed this Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants