Add Gemma4-31B (text-only) ExecuTorch CUDA support [WIP] #19217
Summary:
Adds an end-to-end CUDA path for Google's Gemma4-31B (text-only) mirroring
the qwen3_5_moe prequantized pipeline:
examples/models/gemma4/
model.py - Gemma4TextModel + Gemma4ScaledEmbedding,
attention_k_eq_v, partial RoPE per layer-type,
final logit softcapping, tied embeddings.
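    As a reading aid, a minimal sketch of two of the Gemma-style pieces named
    above. The sqrt(hidden_size) scale and the cap value follow the usual Gemma
    convention and are assumptions, not values read out of this PR:

    ```python
    import torch
    import torch.nn as nn

    class Gemma4ScaledEmbedding(nn.Embedding):
        # Gemma-style embedding: output scaled by sqrt(hidden_size).
        # The sqrt scale is the common Gemma convention (assumed here).
        def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
            return super().forward(input_ids) * (self.embedding_dim ** 0.5)

    def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
        # Final logit softcapping: tanh squashes logits into (-cap, cap).
        # cap=30.0 is illustrative; the real value comes from the model config.
        return cap * torch.tanh(logits / cap)
    ```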
quantize_and_save.py - Layerwise INT4 HQQ prequant (one decoder
layer at a time on CUDA, then back to CPU)
-> safetensors bundle. embed_tokens / lm_head
stay bf16 (tied; quantizing breaks output).
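    A hedged sketch of the layerwise pattern described above, using the public
    hqq package API. The attribute path `model.layers`, the 4-bit/group-64
    config, and the direct `HQQLinear` swap are assumptions; the PR's actual
    quantize_and_save.py may differ:

    ```python
    import torch
    from hqq.core.quantize import BaseQuantizeConfig, HQQLinear  # public hqq API

    def quantize_layerwise(model: torch.nn.Module) -> torch.nn.Module:
        # One decoder layer at a time: move to CUDA, swap nn.Linear -> HQQLinear,
        # move back to CPU so peak GPU memory stays near one layer's worth.
        cfg = BaseQuantizeConfig(nbits=4, group_size=64)  # assumed config
        for layer in model.layers:  # 'model.layers' is an assumed attribute path
            layer.to("cuda")
            linears = [(n, m) for n, m in layer.named_modules()
                       if isinstance(m, torch.nn.Linear)]
            for name, mod in linears:
                parent = (layer.get_submodule(name.rsplit(".", 1)[0])
                          if "." in name else layer)
                setattr(parent, name.rsplit(".", 1)[-1],
                        HQQLinear(mod, cfg, compute_dtype=torch.bfloat16,
                                  device="cuda"))
            layer.to("cpu")
            torch.cuda.empty_cache()
        # embed_tokens / lm_head are deliberately left in bf16: they are tied,
        # and quantizing them breaks the output (per the note above).
        return model
    ```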
export.py - load_full_model / load_prequantized_model;
--tiny-test / --model-dir / --prequantized
CLI; emits decode + prefill methods sharing
KV cache (share_mutable_buffers=True).
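    A sketch of the two-method export shape, assuming wrapper modules per
    method. The wrapper signatures are assumptions, and where export.py threads
    share_mutable_buffers=True into the ExecuTorch config (so both methods
    alias one KV cache) is not reproduced here:

    ```python
    import torch
    from executorch.exir import to_edge

    class PrefillWrapper(torch.nn.Module):
        def __init__(self, model: torch.nn.Module):
            super().__init__()
            self.model = model

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            return self.model(tokens)  # (1, prompt_len) -> logits; fills KV cache

    class DecodeWrapper(torch.nn.Module):
        def __init__(self, model: torch.nn.Module):
            super().__init__()
            self.model = model

        def forward(self, token: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
            return self.model(token, pos)  # single step against the cache

    def export_both(model, prompt, token, pos):
        eps = {
            "prefill": torch.export.export(PrefillWrapper(model), (prompt,)),
            "decode": torch.export.export(DecodeWrapper(model), (token, pos)),
        }
        return to_edge(eps).to_executorch()  # one program, two methods
    ```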
inference.py - 6-stage validation pipeline (eager bf16 ->
prequant -> export -> lower -> .pte -> .ptd).
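    The per-stage check presumably reduces to comparing logits between adjacent
    stages; a minimal sketch, with an illustrative tolerance that is not taken
    from inference.py:

    ```python
    import torch

    def check_stage(reference: torch.Tensor, candidate: torch.Tensor) -> None:
        # Compare one stage's logits against the previous stage's output.
        # atol=1e-1 is an assumed tolerance, not a value from this PR.
        torch.testing.assert_close(candidate, reference, atol=1e-1, rtol=0.0)
    ```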
main.cpp - C++ runner mirroring qwen3_5_moe_runner
(greedy decode only; no temperature/cuda_graph
flags yet).
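    main.cpp itself is C++; to keep this note in one language, here is the
    greedy loop it implements, sketched in Python against the two exported
    methods (argmax only, matching the absence of a --temperature flag):

    ```python
    import torch

    def greedy_generate(prefill, decode, prompt_ids: torch.Tensor,
                        max_new_tokens: int, eos_id: int) -> list[int]:
        # Greedy decode as the C++ runner does it: no temperature, no sampling.
        logits = prefill(prompt_ids)                  # fills the shared KV cache
        token = int(torch.argmax(logits[:, -1, :]))
        out, pos = [token], prompt_ids.shape[1]
        while len(out) < max_new_tokens and token != eos_id:
            logits = decode(torch.tensor([[token]]), torch.tensor([pos]))
            token = int(torch.argmax(logits[:, -1, :]))
            out.append(token)
            pos += 1
        return out
    ```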
convert_weights.py - HF checkpoint -> internal layout helper.
CMakeLists.txt + CMakePresets.json + __init__.py + requirements.txt
README.md - usage notes.
CI scaffolding (mirrors qwen3_5_moe; uses placeholder HF repo
gasoonjia/Gemma4-31B-HQQ-INT4 — sed-rename if your HF user differs):
.github/workflows/cuda.yml - matrix entry + A100 runner
conditional + tile-packed-only
excludes for both export and
e2e jobs.
.ci/scripts/export_model_artifact.sh - gemma4 case: snapshot_download
-> inference sanity -> export.
.ci/scripts/test_model_e2e.sh - gemma4 case: skip-tokenizer-
download list + RUNNER_ARGS
(no --temperature, no
--cuda_graph, EXPECTED_OUTPUT
empty pending real-weight
validation).
Makefile - gemma4-cuda target.
Status (WIP):
- Text-only path. Vision tower (gemma4_vision, 27 layers) NOT loaded,
NOT quantized, NOT exported. Multimodal fusion not implemented.
- Validated end-to-end on local hardware: prequant -> export -> .pte
-> C++ runner produces coherent output ("The capital of France is
a city that is always in motion..."). Peak GPU: ~20.8 GB on A100.
- HF prequant bundle (model.safetensors + config.json + tokenizer.*)
not yet uploaded; CI references gasoonjia/Gemma4-31B-HQQ-INT4 as
placeholder.
- GPU peak memory budget assert in test_model_e2e.sh skipped (gemma4
runner doesn't print the "GPU peak memory usage:" line yet).
- Two files contain hardcoded /home/gasoonjia paths
(quantize_and_save.py docstring, README.md examples) — sanitize
before public review.
Test Plan:
- python -m executorch.examples.models.gemma4.inference --prequantized
/home/gasoonjia/models/gemma4_31B_int4_hqq (eager validation OK)
- python -m executorch.examples.models.gemma4.export --prequantized
/home/gasoonjia/models/gemma4_31B_int4_hqq --output-dir /tmp/gemma4
  (lowered artifacts produced: ~15 MB .pte + ~20.8 GB .ptd)
- cmake-out/examples/models/gemma4/gemma4_runner --model_path ...
--data_path ... --tokenizer_path ... --prompt 'The capital of
France is' --max_new_tokens 50 (coherent generation)
Review comment on the CI script's model case statement:

    PREPROCESSOR_FEATURE_SIZE=""
    PREPROCESSOR_OUTPUT=""
    ;;
    gasoonjia/Gemma4-31B-HQQ-INT4)
It is a placeholder link; the real prequant bundle has not been uploaded yet (see Status).