increasing precision tolerance by francesco-bertolotti · Pull Request #3060 · NVIDIA/TransformerEngine

francesco-bertolotti · 2026-05-29T17:22:11Z

Here’s a cleaner Markdown rewrite of your PR description:

Description

test_kv_cache compares full-sequence attention against incremental KV-cache decoding. In the TransformerLayer configuration where head_dim > 128 in fp16, these two execution paths use different kernels and masking strategies (e.g., causal vs. padding_causal_bottom_right, and full-matrix vs. single-query-row kernels). As a result, their outputs diverge slightly due to accumulated fp16 rounding differences.

On Ampere, this divergence can reach the current tolerance threshold in rare cases, producing a spurious failure. In one observed instance, a single element out of 4096 showed an absolute difference of ~0.0107, which narrowly exceeds the existing 1e-2 tolerance.

This change slightly relaxes the fp16 tolerance for the affected configuration to make the test robust across architectures. No runtime or library code is modified.

Fixes: N/A (spurious test_kv_cache failure on sm80 / fp16 / head_dim=256)

Type of change

Checklist

I have read and followed the contributing guidelines (https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (N/A — test-only change)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works (N/A — test-only tolerance adjustment)
New and existing unit tests pass locally with my changes

Notes

Contributing guidelines and full test runs still need to be verified before final submission.
bfloat16 tolerance in this branch is unchanged, as it was not failing.
The Pyright import warnings are unrelated noise from missing stubs in the local environment.
Potential reviewer feedback: this could be further refined to apply the tolerance only conditionally by compute capability (e.g., sm < 90), rather than broadly for the branch.

Diff summary

# tests/pytorch/attention/test_kv_cache.py (get_tols)
- torch.half: (1e-2, 1e-2),
+ torch.half: (1.5e-2, 1.5e-2),

# plus explanatory comment added

greptile-apps · 2026-05-29T17:23:39Z

Greptile Summary

This PR relaxes the torch.half absolute/relative tolerance in get_tols from 1e-2 to 1.5e-2 for the TransformerLayer path where head_dim > 128 and the backend is not UnfusedAttention. The change targets the infer_1 model config (head_dim=256) to prevent a spurious test failure on sm80 where full-sequence and incremental KV-cache paths produce slightly larger per-element differences in fp16.

Applies only to the two fused-attention backends (FusedAttention, FlashAttention) with head_dim=256 in fp16; all other dtype/backend/head_dim combinations are untouched.
A detailed inline comment was added explaining the sm80 / cuDNN version sensitivity that motivates the looser bound.

Confidence Score: 5/5

Safe to merge — only a test tolerance constant and its explanatory comment are changed; no production or library code is touched.

The change is limited to widening a single fp16 tolerance pair from 1e-2 to 1.5e-2 in a test helper. The new value (0.015) gives ~40% headroom above the observed worst-case difference (~0.0107) on sm80, the comment clearly traces the reasoning, and all other dtype/backend/head-dim branches are unaffected.

No files require special attention.

Important Files Changed

Filename	Overview
tests/pytorch/attention/test_kv_cache.py	Widens fp16 tolerance from 1e-2 to 1.5e-2 in get_tols for the head_dim > 128, non-UnfusedAttention branch, with an explanatory comment; no runtime code changed

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_tols called] --> B{module == TransformerLayer?}
    B -- No --> C[DotProductAttention tols\nfp16: 1e-3 / bf16: 1e-2]
    B -- Yes --> D{head_dim_qk <= 128?}
    D -- Yes --> E[fp16: 5e-3\nbf16: 3.5e-2]
    D -- No --> F{backend == UnfusedAttention?}
    F -- Yes --> G[fp16: 1.6e-2\nbf16: 1.2e-1]
    F -- No --> H["fp16: 1.5e-2 ← widened from 1e-2\nbf16: 8e-2\n(FlashAttention / FusedAttention, head_dim=256 on sm80)"]

_{Reviews (1): Last reviewed commit: "increasing precision tolerance" | Re-trigger Greptile}

cyanguwa · 2026-05-29T19:59:15Z

/te-ci pytorch L0

increasing precision tolerance

3695fe4

github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label May 29, 2026

cyanguwa self-requested a review May 29, 2026 19:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

increasing precision tolerance#3060

increasing precision tolerance#3060
francesco-bertolotti wants to merge 1 commit into
NVIDIA:mainfrom
francesco-bertolotti:f14-increased-tolerance

francesco-bertolotti commented May 29, 2026

Uh oh!

greptile-apps Bot commented May 29, 2026

Uh oh!

cyanguwa commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

francesco-bertolotti commented May 29, 2026

Description

Type of change

Checklist

Notes

Diff summary

Uh oh!

greptile-apps Bot commented May 29, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

cyanguwa commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants