
Add Gemma4-31B (text-only) ExecuTorch CUDA support [WIP] #19217

Draft
Gasoonjia wants to merge 1 commit into main from gemma4-wip

Conversation

@Gasoonjia
Contributor

Summary:
Adds an end-to-end CUDA path for Google's Gemma4-31B (text-only) mirroring the qwen3_5_moe prequantized pipeline:

  examples/models/gemma4/
    model.py                 - Gemma4TextModel + Gemma4ScaledEmbedding,
                               attention_k_eq_v, partial RoPE per layer type,
                               final logit softcapping, tied embeddings
                               (see the first sketch below).
    quantize_and_save.py     - Layerwise INT4 HQQ prequant (one decoder
                               layer at a time on CUDA, then back to CPU;
                               sketched below) -> safetensors bundle.
                               embed_tokens / lm_head stay bf16 (tied;
                               quantizing them breaks output).
    export.py                - load_full_model / load_prequantized_model;
                               --tiny-test / --model-dir / --prequantized
                               CLI; emits decode + prefill methods sharing
                               the KV cache (share_mutable_buffers=True;
                               sketched below).
    inference.py             - 6-stage validation pipeline (eager bf16 ->
                               prequant -> export -> lower -> .pte -> .ptd).
    main.cpp                 - C++ runner mirroring qwen3_5_moe_runner
                               (greedy decode only; no temperature/cuda_graph
                               flags yet; see the last sketch below).
    convert_weights.py       - HF checkpoint -> internal-layout helper.
    CMakeLists.txt + CMakePresets.json + __init__.py + requirements.txt
    README.md                - usage notes.
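
The scaled embedding and final logit softcapping called out for model.py are, in the usual Gemma-style formulation, one multiply and one tanh. A minimal sketch, assuming the standard sqrt(hidden_size) embedding scale and a tanh softcap (the cap value 30.0 is an assumption, not taken from this PR):

```python
import math
import torch
import torch.nn as nn

class Gemma4ScaledEmbedding(nn.Module):
    """Token embedding scaled by sqrt(hidden_size) (assumed Gemma-style)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.scale = math.sqrt(hidden_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Scaling here keeps downstream residual-stream magnitudes stable.
        return self.embed(tokens) * self.scale

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Final logit softcapping: smoothly squashes logits into (-cap, cap).
    return cap * torch.tanh(logits / cap)
```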
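
quantize_and_save.py's layerwise scheme bounds GPU memory by keeping at most one decoder layer on CUDA during quantization. A sketch of that loop, with quantize_layer_int4_hqq_ as a hypothetical stand-in for the actual HQQ call and model.layers as an assumed attribute:

```python
import torch

def quantize_layer_int4_hqq_(layer: torch.nn.Module) -> None:
    """Hypothetical stand-in for the real in-place INT4 HQQ quantizer."""
    ...

@torch.no_grad()
def layerwise_prequant(model: torch.nn.Module) -> None:
    # embed_tokens / lm_head are left in bf16 (tied weights; quantizing
    # them breaks output, per the summary above).
    for layer in model.layers:
        layer.cuda()                      # only this layer occupies the GPU
        quantize_layer_int4_hqq_(layer)
        layer.cpu()                       # move back, then free GPU memory
        torch.cuda.empty_cache()
```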
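
The decode/prefill split only works because both exported methods mutate the same KV cache; share_mutable_buffers=True (per the summary) is what makes the two methods alias one buffer after export instead of each receiving a private copy. A toy illustration of the two-method shape, not the PR's model:

```python
import torch
import torch.nn as nn

class ToyKVModel(nn.Module):
    def __init__(self, max_seq: int = 8, dim: int = 4):
        super().__init__()
        # One mutable buffer; prefill fills a prompt's worth of slots,
        # decode writes one slot per step.
        self.register_buffer("k_cache", torch.zeros(max_seq, dim))

    def prefill(self, k: torch.Tensor, start: int) -> torch.Tensor:
        self.k_cache[start : start + k.shape[0]] = k
        return self.k_cache.sum()

    def decode(self, k: torch.Tensor, pos: int) -> torch.Tensor:
        self.k_cache[pos] = k
        return self.k_cache.sum()
```

Without buffer sharing, an exported decode method would get its own copy of k_cache and never see what prefill wrote.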
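
The C++ runner's generation loop is plain greedy decoding. In Python terms (prefill/decode mirror the exported method names; tensor shapes are assumptions for illustration):

```python
import torch

@torch.no_grad()
def greedy_generate(model, tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Prefill: run the full prompt once, populating the shared KV cache.
    logits = model.prefill(tokens, 0)
    pos = tokens.shape[0]
    for _ in range(max_new_tokens):
        next_tok = logits[-1].argmax()                 # greedy: no temperature
        tokens = torch.cat([tokens, next_tok.view(1)])
        logits = model.decode(next_tok.view(1), pos)   # one token, KV cache reused
        pos += 1
    return tokens
```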

CI scaffolding (mirrors qwen3_5_moe; uses placeholder HF repo gasoonjia/Gemma4-31B-HQQ-INT4; sed-rename if your HF user differs):

  .github/workflows/cuda.yml             - matrix entry + A100 runner
                                           conditional + tile-packed-only
                                           excludes for both export and
                                           e2e jobs.
  .ci/scripts/export_model_artifact.sh   - gemma4 case: snapshot_download
                                           -> inference sanity -> export.
  .ci/scripts/test_model_e2e.sh          - gemma4 case: skip-tokenizer-
                                           download list + RUNNER_ARGS
                                           (no --temperature, no
                                           --cuda_graph, EXPECTED_OUTPUT
                                           empty pending real-weight
                                           validation).
  Makefile                               - gemma4-cuda target.

Perf (p=2000, d=500):

  Metric      Value          Notes
  Prefill     236.4 tok/s    417-token prompt / 1.76 s (true throughput)
  Decode      37.2 tok/s     avg over 200 tokens, steady-state, no degradation
  Cold load   ~33 s          20.8 GB .ptd -> GPU
  GPU peak    20.8 GB        fits on a 24 GB card

Status (WIP):

  • Text-only path. Vision tower (gemma4_vision, 27 layers) NOT loaded, NOT quantized, NOT exported. Multimodal fusion not implemented.
  • Validated end-to-end on local hardware: prequant -> export -> .pte -> C++ runner produces coherent output ("The capital of France is a city that is always in motion..."). Peak GPU: ~20.8 GB on A100.
  • HF prequant bundle (model.safetensors + config.json + tokenizer.*) not yet uploaded; CI references gasoonjia/Gemma4-31B-HQQ-INT4 as placeholder.
  • GPU peak memory budget assert in test_model_e2e.sh skipped (gemma4 runner doesn't print the "GPU peak memory usage:" line yet).
  • Two files contain hardcoded /home/gasoonjia paths (quantize_and_save.py docstring, README.md examples) — sanitize before public review.

Test Plan:

  • python -m executorch.examples.models.gemma4.inference --prequantized /home/gasoonjia/models/gemma4_31B_int4_hqq (eager validation OK)
  • python -m executorch.examples.models.gemma4.export --prequantized /home/gasoonjia/models/gemma4_31B_int4_hqq --output-dir /tmp/gemma4 (lowered .pte + .ptd produced, ~15 MB + 20.8 GB)
  • cmake-out/examples/models/gemma4/gemma4_runner --model_path ... --data_path ... --tokenizer_path ... --prompt 'The capital of France is' --max_new_tokens 50 (coherent generation)

@pytorch-bot

pytorch-bot Bot commented Apr 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19217

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please review it.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label on Apr 30, 2026. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Review comment on the gemma4 case in .ci/scripts/export_model_artifact.sh:

    PREPROCESSOR_FEATURE_SIZE=""
    PREPROCESSOR_OUTPUT=""
    ;;
  gasoonjia/Gemma4-31B-HQQ-INT4)
Contributor Author


It is a placeholder link.

