enhance hstu export by eric-gecheng · Pull Request #443 · alibaba/TorchEasyRec

eric-gecheng · 2026-03-26T09:31:46Z

When exporting HSTU model, user needs to add a command line argument to specify the target item_id column name, tzrec will save the column name into model_acc.json. Online processor will read it.
Tzrec will detect the HSTU model kernel, pytorch or triton, and write a field into model_acc.json.
Since PAI-EAS container will detect and extract any zipped file in the model directory, and .pt2 file is actually a zipped format, we have to put it into a sub-folder "aoti" so as to avoid being unzipped.
For triton kernel, on-line inference processor needs to know what are the output key names, so we need to save them during export. A file called output_field_names.json is created under "aoti" folder.

docs/source/models/dlrm_hstu.md

tzrec/acc/aot_utils.py

tzrec/utils/export_util.py

tzrec/acc/utils.py

github-actions · 2026-03-27T03:34:56Z

Code Review Summary

The changes are well-structured and the intent is clear — adding HSTU-specific export config, reorganizing AOTI output into a subfolder, and persisting output field names. A few items to address:

Issues

Bug: typo in docs — experiments/dltm_hstu/export should be experiments/dlrm_hstu/export in the export example (dlrm_hstu.md:322).
Missing torch.no_grad() on extra forward pass — aot_utils.py:90 runs dense_model(sparse_output) just to capture output keys, building autograd graphs needlessly. Consider wrapping in torch.no_grad() or extracting keys from the ExportedProgram to avoid the redundant pass entirely.
hstu_kernel silently swallowed in RTP path — export_model passes both hstu_item_id and hstu_kernel to impl(), but export_rtp_model absorbs hstu_kernel via **kwargs without using it. If the RTP path doesn't need it, add a comment; if it does, this is a bug.

Test Coverage Gaps

The core new feature has no test coverage — no test passes --hstu_item_id, so the code path writing hstu_item_id/hstu_kernel into model_acc.json is never exercised. Similarly, output_field_names.json creation is never verified. Consider:
- Adding hstu_item_id to the existing HSTU export test
- Adding a unit test for export_acc_config() with the new parameters
- Asserting output_field_names.json exists in AOTI export tests

Minor

Kernel.Name() called for all model types — returns misleading default (TRITON) for non-HSTU models. Harmless but confusing; guard with if hstu_item_id.
Docstring/implementation casing mismatch — docstring says default is "PYTORCH" but the code lowercases to "pytorch". Clarify the expected contract for downstream consumers.

🤖 Generated with Claude Code

tzrec/utils/export_util.py

tzrec/acc/aot_utils.py

tzrec/tests/rank_integration_test.py

github-actions · 2026-03-31T11:54:57Z

Code Review Summary

Overall this PR is well-structured — the parameter threading through the normal export path is clean, the validation guard for hstu_item_id is a good UX improvement, and the tests verify the new model_acc.json fields properly. A few items worth addressing:

Bug

export_rtp_model silently discards hstu_item_id and hstu_kernel: The function accepts these parameters but never uses them — no model_acc.json is written in the RTP path. test_dlrm_hstu_rtp_train_export asserts these fields exist in model_acc.json, which will likely fail. See inline comment for details.

Test Coverage Gaps

ValueError for missing hstu_item_id is untested: The validation guard at export_util.py:93-101 is a user-facing error that deserves a unit test.
test_hstu_with_fg_train_eval_export is unconditionally skipped (@unittest.skip): The new model_acc.json assertions for match model towers (lines 461-474) are dead code. Consider @unittest.skipIf with a condition (e.g., GPU availability) instead.
output_field_names.json only checked for existence, not content. See inline comment.

Minor

hstu_kernel is computed for all model types, not just HSTU. See inline comment with suggestion.
The extra forward pass in aot_utils.py to capture output keys could be avoided by passing keys from the caller which already has them. See inline comment.

🤖 Generated with Claude Code

tzrec/utils/export_util.py

tzrec/acc/utils.py

tzrec/acc/aot_utils.py

github-actions · 2026-04-07T09:45:13Z

tzrec/tests/utils.py

-        cmd_str += f"--asset_files {asset_files}"
+        cmd_str += f"--asset_files {asset_files} "
+    if hstu_item_id:
+        cmd_str += f"--hstu_item_id {hstu_item_id}"


Shell injection surface: hstu_item_id is interpolated into a command string that is later executed with shell=True (via misc_util.run_cmd). While current callers use hardcoded literals, this extends a pre-existing injection-prone pattern. Consider using shlex.quote():

Suggested change

cmd_str += f"--hstu_item_id {hstu_item_id}"

cmd_str += f"--hstu_item_id {shlex.quote(hstu_item_id)}"

(Same concern applies to asset_files and other interpolated values in this function.)

github-actions · 2026-04-07T09:45:18Z

tzrec/tests/match_integration_test.py

+        # Verify model_acc.json contains HSTU kernel for user tower
+        user_acc_path = os.path.join(self.test_dir, "export/user/model_acc.json")
+        self.assertTrue(os.path.exists(user_acc_path))
+        with open(user_acc_path) as f:
+            acc_cfg = json.load(f)
+            self.assertNotIn("hstu_item_id", acc_cfg)
+            self.assertEqual(acc_cfg["hstu_kernel"], "pytorch")
+        # Verify model_acc.json contains HSTU kernel for item tower
+        item_acc_path = os.path.join(self.test_dir, "export/item/model_acc.json")
+        self.assertTrue(os.path.exists(item_acc_path))
+        with open(item_acc_path) as f:
+            acc_cfg = json.load(f)
+            self.assertNotIn("hstu_item_id", acc_cfg)
+            self.assertEqual(acc_cfg["hstu_kernel"], "pytorch")


Dead test code: The parent test test_hstu_with_fg_train_eval_export is decorated with @unittest.skip("skip hstu match test"), so these new assertions verifying hstu_kernel and hstu_item_id absence in model_acc.json will never execute. The PR appears to add match-side validation but it's illusory coverage.

Is there a plan to re-enable this test? If so, a tracking comment or issue would help.

github-actions · 2026-04-07T09:45:38Z

Code Review Summary

PR looks good overall — the HSTU export enhancements are well-structured and the rank integration tests provide solid verification of the new model_acc.json fields and output_field_names.json.

Key Findings

Performance

Redundant forward pass in aot_utils.py:90-91: An extra dense_model(sparse_output) call runs the full dense model on GPU just to capture output key names. The caller already has these keys from its own forward pass — pass them in as a parameter instead.

Code Quality

Unreachable fallback in utils.py:226: The else "pytorch" branch in the hstu_kernel ternary can never trigger given the caller's usage of Kernel.Name(...). Simplify to hstu_kernel.lower().

Test Coverage

Dead test assertions in match_integration_test.py:460-473: The parent test is @unittest.skip'd, so the new model_acc.json assertions for the match model path never run.
Missing negative test: No test verifies the ValueError raised when hstu_item_id is omitted for dlrm_hstu models. This is a simple unit test worth adding.

Security

Shell injection surface in tests/utils.py:1109: hstu_item_id is f-string interpolated into a shell=True command. Current callers use hardcoded strings, but shlex.quote() would be a low-effort hardening.

See inline comments for details and suggestions.

🤖 Generated with Claude Code

Two bugs caused test_rank_dlrm_hstu_cutlass_train_eval_export to fail: 1. tzrec/tests/rank_integration_test.py: the test called utils.test_export without hstu_item_id, but the DlrmHSTU export path now requires it (ValueError raised in export_util.py). Pass hstu_item_id ="cand_seq__video_id" mirroring the sibling triton test. 2. tzrec/acc/aot_utils.py: the new dry-run that captures output field names called dense_model(sparse_output) in raw fp32, before the DenseAutocastWrapper was applied. The CUTLASS HSTU attention kernel rejects fp32 inputs, which made the dry-run crash for any mixed_precision=BF16/FP16 + CUTLASS configuration. Reorder so dense_to_export (autocast-wrapped when mixed_precision is set) is constructed first, then dry-run through it to collect output keys. Verified locally: both test_rank_dlrm_hstu_cutlass_train_eval_export and test_rank_dlrm_hstu_train_eval_export pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eric-gecheng added 3 commits March 26, 2026 17:23

enhance hstu export

2d30498

code style fix

32ec6ca

update doc and code format check

6b23573

tiankongdeguiji added the claude-review Let Claude Review label Mar 27, 2026

github-actions bot removed the claude-review Let Claude Review label Mar 27, 2026

Merge branch 'master' into features/hstu_export_enhance

9083fff

eric-gecheng added the claude-review Let Claude Review label Mar 27, 2026

github-actions bot removed the claude-review Let Claude Review label Mar 27, 2026

github-actions bot reviewed Mar 27, 2026

View reviewed changes

docs/source/models/dlrm_hstu.md Outdated Show resolved Hide resolved

github-actions bot reviewed Mar 27, 2026

View reviewed changes

tzrec/acc/aot_utils.py Outdated Show resolved Hide resolved

github-actions bot reviewed Mar 27, 2026

View reviewed changes

tzrec/utils/export_util.py Outdated Show resolved Hide resolved

github-actions bot reviewed Mar 27, 2026

View reviewed changes

tzrec/acc/utils.py Outdated Show resolved Hide resolved

eric-gecheng added 5 commits March 30, 2026 09:58

typo fix

1f375b5

forward with no_grad

02018ca

hstu_item_id param has to be specified when exporting hstu model

addcded

update doc string

f83dde1

enhance hstu test case

a15fa47

tiankongdeguiji added the claude-review Let Claude Review label Mar 31, 2026

github-actions bot removed the claude-review Let Claude Review label Mar 31, 2026

github-actions bot reviewed Mar 31, 2026

View reviewed changes

tzrec/utils/export_util.py Outdated Show resolved Hide resolved

github-actions bot reviewed Mar 31, 2026

View reviewed changes

tzrec/utils/export_util.py Outdated Show resolved Hide resolved

github-actions bot reviewed Mar 31, 2026

View reviewed changes

tzrec/acc/aot_utils.py Outdated Show resolved Hide resolved

github-actions bot reviewed Mar 31, 2026

View reviewed changes

tzrec/tests/rank_integration_test.py Show resolved Hide resolved

tiankongdeguiji requested changes Mar 31, 2026

View reviewed changes

tzrec/utils/export_util.py Outdated Show resolved Hide resolved

eric-gecheng added 2 commits April 7, 2026 17:01

remove enhancements for rtp export

b0cce9f

remove hstu_item_id for HstuMatch model

50f19b9

eric-gecheng added the claude-review Let Claude Review label Apr 7, 2026

github-actions bot removed the claude-review Let Claude Review label Apr 7, 2026

github-actions bot reviewed Apr 7, 2026

View reviewed changes

tzrec/acc/utils.py Outdated Show resolved Hide resolved

github-actions bot reviewed Apr 7, 2026

View reviewed changes

tzrec/acc/aot_utils.py Outdated Show resolved Hide resolved

github-actions bot reviewed Apr 7, 2026

View reviewed changes

eric-gecheng and others added 4 commits April 7, 2026 19:56

code review fix

7f3a81e

add test case when hstu_item_id is missing

c62ad7c

fix _HSTU_MODEL_TYPES

d6a1530

Merge branch 'master' into features/hstu_export_enhance

95467a6

tiankongdeguiji previously approved these changes Apr 8, 2026

View reviewed changes

tiankongdeguiji dismissed their stale review via cebd2cf April 8, 2026 11:23

tiankongdeguiji approved these changes Apr 8, 2026

View reviewed changes

tiankongdeguiji merged commit 9d4a142 into alibaba:master Apr 9, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enhance hstu export#443

enhance hstu export#443
tiankongdeguiji merged 16 commits intoalibaba:masterfrom
eric-gecheng:features/hstu_export_enhance

eric-gecheng commented Mar 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 31, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot Apr 7, 2026

Uh oh!

github-actions bot Apr 7, 2026

Uh oh!

github-actions bot commented Apr 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	cmd_str += f"--hstu_item_id {hstu_item_id}"
	cmd_str += f"--hstu_item_id {shlex.quote(hstu_item_id)}"

Conversation

eric-gecheng commented Mar 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 27, 2026

Code Review Summary

Issues

Test Coverage Gaps

Minor

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 31, 2026

Code Review Summary

Bug

Test Coverage Gaps

Minor

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 7, 2026

Code Review Summary

Key Findings

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants