Skip to content

increasing precision tolerance#3060

Open
francesco-bertolotti wants to merge 1 commit into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance
Open

increasing precision tolerance#3060
francesco-bertolotti wants to merge 1 commit into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance

Conversation

@francesco-bertolotti
Copy link
Copy Markdown
Contributor

Here’s a cleaner Markdown rewrite of your PR description:


Description

test_kv_cache compares full-sequence attention against incremental KV-cache decoding. In the TransformerLayer configuration where head_dim > 128 in fp16, these two execution paths use different kernels and masking strategies (e.g., causal vs. padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulated fp16 rounding differences.

On Ampere, this divergence can reach the current tolerance threshold in rare cases, producing a spurious failure. In one observed instance, a single element out of 4096 showed an absolute difference of ~0.0107, which narrowly exceeds the existing 1e-2 tolerance.

This change slightly relaxes the fp16 tolerance for the affected configuration to make the test robust across architectures. No runtime or library code is modified.

Fixes: N/A (spurious test_kv_cache failure on sm80 / fp16 / head_dim=256)


Type of change

  • Documentation change
  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change
  • Infra/Build change
  • Code refactoring

Checklist

  • I have read and followed the contributing guidelines (https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (N/A — test-only change)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works (N/A — test-only tolerance adjustment)
  • New and existing unit tests pass locally with my changes

Notes

  • Contributing guidelines and full test runs still need to be verified before final submission.
  • bfloat16 tolerance in this branch is unchanged, as it was not failing.
  • The Pyright import warnings are unrelated noise from missing stubs in the local environment.
  • Potential reviewer feedback: this could be further refined to apply the tolerance only conditionally by compute capability (e.g., sm < 90), rather than broadly for the branch.

Diff summary

# tests/pytorch/attention/test_kv_cache.py (get_tols)
- torch.half: (1e-2, 1e-2),
+ torch.half: (1.5e-2, 1.5e-2),

# plus explanatory comment added

@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 29, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 29, 2026

Greptile Summary

This PR relaxes the torch.half absolute/relative tolerance in get_tols from 1e-2 to 1.5e-2 for the TransformerLayer path where head_dim > 128 and the backend is not UnfusedAttention. The change targets the infer_1 model config (head_dim=256) to prevent a spurious test failure on sm80 where full-sequence and incremental KV-cache paths produce slightly larger per-element differences in fp16.

  • Applies only to the two fused-attention backends (FusedAttention, FlashAttention) with head_dim=256 in fp16; all other dtype/backend/head_dim combinations are untouched.
  • A detailed inline comment was added explaining the sm80 / cuDNN version sensitivity that motivates the looser bound.

Confidence Score: 5/5

Safe to merge — only a test tolerance constant and its explanatory comment are changed; no production or library code is touched.

The change is limited to widening a single fp16 tolerance pair from 1e-2 to 1.5e-2 in a test helper. The new value (0.015) gives ~40% headroom above the observed worst-case difference (~0.0107) on sm80, the comment clearly traces the reasoning, and all other dtype/backend/head-dim branches are unaffected.

No files require special attention.

Important Files Changed

Filename Overview
tests/pytorch/attention/test_kv_cache.py Widens fp16 tolerance from 1e-2 to 1.5e-2 in get_tols for the head_dim > 128, non-UnfusedAttention branch, with an explanatory comment; no runtime code changed

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_tols called] --> B{module == TransformerLayer?}
    B -- No --> C[DotProductAttention tols\nfp16: 1e-3 / bf16: 1e-2]
    B -- Yes --> D{head_dim_qk <= 128?}
    D -- Yes --> E[fp16: 5e-3\nbf16: 3.5e-2]
    D -- No --> F{backend == UnfusedAttention?}
    F -- Yes --> G[fp16: 1.6e-2\nbf16: 1.2e-1]
    F -- No --> H["fp16: 1.5e-2 ← widened from 1e-2\nbf16: 8e-2\n(FlashAttention / FusedAttention, head_dim=256 on sm80)"]
Loading

Reviews (1): Last reviewed commit: "increasing precision tolerance" | Re-trigger Greptile

@cyanguwa cyanguwa self-requested a review May 29, 2026 19:58
@cyanguwa
Copy link
Copy Markdown
Collaborator

/te-ci pytorch L0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants