Add LongRoPE support and fp64 RoPE precompute for Phi-3 / Phi-4 family #19235
Summary:
Adds LongRoPE plumbing and an fp64 cos/sin precompute pass to
hf_precompute_freqs_cis. Together these eliminate Phi-4 Mini decode-time
n-gram repetition under both XNNPACK and Vulkan delegates.
Phi-3 and Phi-4 family models use HF's "longrope" RoPE scaling, which
multiplies cos/sin by an attention_factor (~1.19 for Phi-4 Mini) and
divides inv_freq element-wise by a per-dimension short_factor (when
seq_len <= original_max_position_embeddings) or long_factor. ET's
hf_precompute_freqs_cis was vanilla RoPE -- missing both terms.
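For context, a minimal sketch of the longrope math described above (illustrative function and parameter names, not the actual hf_precompute_freqs_cis signature; short_factor/long_factor are assumed to be per-dimension lists of length dim // 2):

```python
import torch

def longrope_freqs_sketch(dim, seq_len, theta, short_factor, long_factor,
                          original_max_position_embeddings, attention_factor):
    # Pick the per-dimension rescale: short_factor inside the original
    # training context, long_factor beyond it.
    if seq_len <= original_max_position_embeddings:
        ext = torch.tensor(short_factor, dtype=torch.float64)
    else:
        ext = torch.tensor(long_factor, dtype=torch.float64)

    # Vanilla RoPE inverse frequencies, divided element-wise by the factor.
    inv_freq = 1.0 / (
        ext * theta ** (torch.arange(0, dim, 2, dtype=torch.float64) / dim)
    )
    freqs = torch.outer(torch.arange(seq_len, dtype=torch.float64), inv_freq)

    # Both tables are scaled by attention_factor (~1.19 for Phi-4 Mini),
    # then stored as fp32; the fp64 intermediate is the second fix below.
    cos = (freqs.cos() * attention_factor).to(torch.float32)
    sin = (freqs.sin() * attention_factor).to(torch.float32)
    return cos, sin
```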
At typical export configurations the dominant effect is the missing
attention_factor, which leaves attention scores ~1.42x softer than
the model was trained for. Compounded across 32 layers this pushes
Phi-4 Mini's narrow top-2 logit margins past their tipping point and
triggers greedy-decode n-gram repetition; the same error explains
prior on-device looping observed under both XNNPACK and Vulkan.
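For reference, the ~1.42x figure follows because attention_factor scales the cos/sin tables used for both the query and key rotations, so pre-softmax attention scores move with its square:

```python
attention_factor = 1.19               # approximate Phi-4 Mini value quoted above
score_scale = attention_factor ** 2   # the factor enters both q and k rotations
# score_scale ~= 1.42: without it, every pre-softmax score is ~1.42x smaller
# than the model saw in training.
```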
Adds LongRoPE plumbing through ModelArgs (short_factor, long_factor,
original_max_position_embeddings, max_position_embeddings,
rope_scaling_attention_factor) and into hf_precompute_freqs_cis,
with attention_factor derived as
sqrt(1 + log(scaling)/log(original_max)) when not explicitly set.
The non-HF precompute_freqs_cis path is left vanilla; longrope models
must set use_hf_rope=True (noted in Rope.__init__).
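A sketch of that default derivation (the 131072/4096 figures are illustrative, chosen to match the ~1.19 value quoted above; the model's real values come from its HF config):

```python
import math

def derived_attention_factor(max_position_embeddings, original_max_position_embeddings):
    # Default used when rope_scaling_attention_factor is not set explicitly.
    scaling = max_position_embeddings / original_max_position_embeddings
    return math.sqrt(1 + math.log(scaling) / math.log(original_max_position_embeddings))

# e.g. a 131072-token window over a 4096-token original context:
# sqrt(1 + ln(32) / ln(4096)) ~= 1.19
print(derived_attention_factor(131072, 4096))
```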
Also moves the cos/sin precompute to fp64, casting to fp32 once at
the end. After LongRoPE corrects the 19% scale error, fp32 ULP-level
rounding in the cos/sin tables becomes the next-largest contributor
to logit drift -- and it is load-bearing on Vulkan under sampling: with fp32
precompute, 1 of 2 T=0.5 trajectories collapsed into a 4-gram loop
("avoiding data and data biases") even with LongRoPE applied. fp64
precompute is one-time at construction (microseconds on a few-KB
table); runtime tables remain fp32, so no inference-time cost.
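A toy comparison of the two precompute precisions (illustrative dims, vanilla RoPE for brevity):

```python
import torch

dim, seq_len, theta = 64, 4096, 10000.0

def cos_table(dtype):
    # Build the RoPE cos table entirely in `dtype`, then store as fp32.
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=dtype) / dim))
    freqs = torch.outer(torch.arange(seq_len, dtype=dtype), inv_freq)
    return freqs.cos().to(torch.float32)

# The stored tables differ only at the ULP level, but per the summary that
# rounding is the next-largest contributor to logit drift once the
# attention_factor error is fixed.
max_diff = (cos_table(torch.float32) - cos_table(torch.float64)).abs().max()
```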
Wires the LongRoPE fields into examples/models/phi_4_mini/config/config.json
sourced from HF's Phi-4 Mini config.
Test Plan:
Validated end-to-end on Samsung Galaxy S25:
- Eager bf16 (host, 12 threads): 3/3 loop-free at T=0 greedy and
T=0.5 sampling x 2 seeds.
- XNNPACK 8da4w-g32 on device: 3/3 loop-free, ~20.7 tok/s decode.
- Vulkan 8da4w-g32 on device: 3/3 loop-free, ~17.1 tok/s decode.
Reproduced across two distinct S25 units to confirm the result is not
device-specific. Verified that omitting either fix regresses Vulkan
sampling: LongRoPE alone leaves residual sampling loops, and fp64 alone
was already known to be insufficient.
Co-Authored-By: Claude <noreply@anthropic.com>