
Add Gemma4-31B (text-only) ExecuTorch CUDA support [WIP] #19217

Draft
Gasoonjia wants to merge 1 commit into main from gemma4-wip

Conversation

@Gasoonjia
Contributor

Summary:
Adds an end-to-end CUDA path for Google's Gemma4-31B (text-only) mirroring the qwen3_5_moe prequantized pipeline:

  examples/models/gemma4/
    model.py                 - Gemma4TextModel + Gemma4ScaledEmbedding,
                               attention_k_eq_v, partial RoPE per layer type,
                               final logit softcapping, tied embeddings
                               (see the first sketch below).
    quantize_and_save.py     - Layerwise INT4 HQQ prequant (one decoder
                               layer at a time on CUDA, then back to CPU;
                               sketched below) -> safetensors bundle.
                               embed_tokens / lm_head stay bf16 (tied;
                               quantizing them breaks output).
    export.py                - load_full_model / load_prequantized_model;
                               --tiny-test / --model-dir / --prequantized
                               CLI; emits decode + prefill methods sharing
                               the KV cache (share_mutable_buffers=True;
                               sketched below).
    inference.py             - 6-stage validation pipeline (eager bf16 ->
                               prequant -> export -> lower -> .pte -> .ptd).
    main.cpp                 - C++ runner mirroring qwen3_5_moe_runner
                               (greedy decode only; no temperature/cuda_graph
                               flags yet; see the last sketch below).
    convert_weights.py       - HF checkpoint -> internal-layout helper.
    CMakeLists.txt + CMakePresets.json + __init__.py + requirements.txt
    README.md                - usage notes.
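
The scaled embedding and final logit softcapping called out for model.py are, in the usual Gemma-style formulation, one multiply and one tanh. A minimal sketch, assuming the standard sqrt(hidden_size) embedding scale and a tanh softcap (the cap value 30.0 is an assumption, not taken from this PR):

```python
import math
import torch
import torch.nn as nn

class Gemma4ScaledEmbedding(nn.Module):
    """Token embedding scaled by sqrt(hidden_size) (assumed Gemma-style)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.scale = math.sqrt(hidden_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Scaling here keeps downstream residual-stream magnitudes stable.
        return self.embed(tokens) * self.scale

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Final logit softcapping: smoothly squashes logits into (-cap, cap).
    return cap * torch.tanh(logits / cap)
```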
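
quantize_and_save.py's layerwise scheme bounds GPU memory by keeping at most one decoder layer on CUDA during quantization. A sketch of that loop, with quantize_layer_int4_hqq_ as a hypothetical stand-in for the actual HQQ call and model.layers as an assumed attribute:

```python
import torch

def quantize_layer_int4_hqq_(layer: torch.nn.Module) -> None:
    """Hypothetical stand-in for the real in-place INT4 HQQ quantizer."""
    ...

@torch.no_grad()
def layerwise_prequant(model: torch.nn.Module) -> None:
    # embed_tokens / lm_head are left in bf16 (tied weights; quantizing
    # them breaks output, per the summary above).
    for layer in model.layers:
        layer.cuda()                      # only this layer occupies the GPU
        quantize_layer_int4_hqq_(layer)
        layer.cpu()                       # move back, then free GPU memory
        torch.cuda.empty_cache()
```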
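
The decode/prefill split only works because both exported methods mutate the same KV cache; share_mutable_buffers=True (per the summary) is what makes the two methods alias one buffer after export instead of each receiving a private copy. A toy illustration of the two-method shape, not the PR's model:

```python
import torch
import torch.nn as nn

class ToyKVModel(nn.Module):
    def __init__(self, max_seq: int = 8, dim: int = 4):
        super().__init__()
        # One mutable buffer; prefill fills a prompt's worth of slots,
        # decode writes one slot per step.
        self.register_buffer("k_cache", torch.zeros(max_seq, dim))

    def prefill(self, k: torch.Tensor, start: int) -> torch.Tensor:
        self.k_cache[start : start + k.shape[0]] = k
        return self.k_cache.sum()

    def decode(self, k: torch.Tensor, pos: int) -> torch.Tensor:
        self.k_cache[pos] = k
        return self.k_cache.sum()
```

Without buffer sharing, an exported decode method would get its own copy of k_cache and never see what prefill wrote.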
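
The C++ runner's generation loop is plain greedy decoding. In Python terms (prefill/decode mirror the exported method names; tensor shapes are assumptions for illustration):

```python
import torch

@torch.no_grad()
def greedy_generate(model, tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Prefill: run the full prompt once, populating the shared KV cache.
    logits = model.prefill(tokens, 0)
    pos = tokens.shape[0]
    for _ in range(max_new_tokens):
        next_tok = logits[-1].argmax()                 # greedy: no temperature
        tokens = torch.cat([tokens, next_tok.view(1)])
        logits = model.decode(next_tok.view(1), pos)   # one token, KV cache reused
        pos += 1
    return tokens
```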

CI scaffolding (mirrors qwen3_5_moe; uses placeholder HF repo gasoonjia/Gemma4-31B-HQQ-INT4; sed-rename if your HF user differs):

  .github/workflows/cuda.yml             - matrix entry + A100 runner
                                           conditional + tile-packed-only
                                           excludes for both export and
                                           e2e jobs.
  .ci/scripts/export_model_artifact.sh   - gemma4 case: snapshot_download
                                           -> inference sanity -> export.
  .ci/scripts/test_model_e2e.sh          - gemma4 case: skip-tokenizer-
                                           download list + RUNNER_ARGS
                                           (no --temperature, no
                                           --cuda_graph, EXPECTED_OUTPUT
                                           empty pending real-weight
                                           validation).
  Makefile                               - gemma4-cuda target.

Perf (p=2000, d=500):

  Metric      Value          Notes
  Prefill     236.4 tok/s    417-token prompt / 1.76 s (true throughput)
  Decode      37.2 tok/s     avg over 200 tokens, steady-state, no degradation
  Cold load   ~33 s          20.8 GB .ptd -> GPU
  GPU peak    20.8 GB        fits on a 24 GB card

Status (WIP):

  • Text-only path. Vision tower (gemma4_vision, 27 layers) NOT loaded, NOT quantized, NOT exported. Multimodal fusion not implemented.
  • Validated end-to-end on local hardware: prequant -> export -> .pte -> C++ runner produces coherent output ("The capital of France is a city that is always in motion..."). Peak GPU: ~20.8 GB on A100.
  • HF prequant bundle (model.safetensors + config.json + tokenizer.*) not yet uploaded; CI references gasoonjia/Gemma4-31B-HQQ-INT4 as placeholder.
  • GPU peak memory budget assert in test_model_e2e.sh skipped (gemma4 runner doesn't print the "GPU peak memory usage:" line yet).
  • Two files contain hardcoded /home/gasoonjia paths (quantize_and_save.py docstring, README.md examples) — sanitize before public review.

Test Plan:

  • python -m executorch.examples.models.gemma4.inference --prequantized /home/gasoonjia/models/gemma4_31B_int4_hqq (eager validation OK)
  • python -m executorch.examples.models.gemma4.export --prequantized /home/gasoonjia/models/gemma4_31B_int4_hqq --output-dir /tmp/gemma4 (lowered .pte + .ptd produced, ~15 MB + 20.8 GB)
  • cmake-out/examples/models/gemma4/gemma4_runner --model_path ... --data_path ... --tokenizer_path ... --prompt 'The capital of France is' --max_new_tokens 50 (coherent generation)

@pytorch-bot

pytorch-bot Bot commented Apr 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19217

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please review it.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label on Apr 30, 2026. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Review comment on the gemma4 case in .ci/scripts/export_model_artifact.sh:

    PREPROCESSOR_FEATURE_SIZE=""
    PREPROCESSOR_OUTPUT=""
    ;;
  gasoonjia/Gemma4-31B-HQQ-INT4)
Contributor Author


It is a placeholder link.

