
[Distillation] Layer-wise LTI #3769

Open
vlad-karp wants to merge 6 commits into main from vladk/lti2

Conversation


@vlad-karp vlad-karp commented Apr 29, 2026

Layer-wise LTI

Generalization of Learn-To-Init (LTI) Mechanism

Relevant details, context, and examples:

  • The current LTI implementation is model-specific and imposes several model-structure inconveniences because of the extra intermediate LTI wrapper. It also works only in layer-scanned mode.
  • This PR refactors the Learn-To-Init (LTI) approach to make it model-agnostic and reusable for other models, and enables the non-layer-scan mode so that LTI can be trained layer-wise.
  • Generalized LTI modifications (apply_lti_modification) can now be injected dynamically into any instantiated base NNX layer, which allows layer-wise LTI logic (i.e., augmenting only specific layers).
  • Removed the Llama2-specific LTI decoder.
  • Dynamic module augmentation: NNX modules are LTI-augmented as they are created in the Linen flow (see the sketch after this list).
  • The distillation utilities (lti_utils.py and train_distill.py) were upgraded to use regex patterns for weight sharing, copying, and freezing instead of exact path matching (illustrated under Tests below).
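
As a rough illustration of the dynamic augmentation idea: the hook name mirrors apply_lti_modification, but the signature, the nnx.Linear target, and the added lti_delta parameter below are hypothetical, not the exact code in this PR.

```python
# Minimal sketch of layer-wise LTI augmentation of an NNX module; the signature
# and the lti_delta parameter are illustrative placeholders, not this PR's API.
import re

import jax.numpy as jnp
from flax import nnx


def apply_lti_modification(module: nnx.Module, layer_name: str, lti_layer_regex: str) -> nnx.Module:
  """Attach learnable LTI state only to layers whose name matches the pattern."""
  if re.fullmatch(lti_layer_regex, layer_name) and isinstance(module, nnx.Linear):
    # Layer-wise logic: augment only the matched layers with an extra learnable
    # initialization delta, without inserting an intermediate wrapper module.
    module.lti_delta = nnx.Param(jnp.zeros_like(module.kernel.value))
  return module


# Augment only layer 3 of a hypothetical decoder stack; other layers stay untouched.
layer = nnx.Linear(8, 8, rngs=nnx.Rngs(0))
layer = apply_lti_modification(layer, "dense_layers_3", r"dense_layers_3")
```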

Shortcomings:

  • One still has to pass the apply_lti_modification function (or another model-specific LTI method) when instantiating the model's Linen version class.
  • The change is of limited use until a layer-wise student model configuration is available.

Tests

learn_to_init_test.py and train_distill_test.py were refactored to validate the new generic LTI augmentation functionality and the regex-based weight preparation logic.
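
For illustration, the regex-based selection that the weight-preparation logic relies on amounts to something like the following; the helper name and example parameter paths are made up, not the actual lti_utils.py implementation.

```python
# Illustrative sketch of regex-based parameter-path matching for weight
# sharing/copying/freezing; names and paths here are hypothetical.
import re


def select_params_by_regex(param_paths, patterns):
  """Return every parameter path matched by any of the given regex patterns."""
  compiled = [re.compile(p) for p in patterns]
  return [path for path in param_paths if any(c.search(path) for c in compiled)]


teacher_paths = [
    "decoder/dense_layers_0/mlp/wi/kernel",
    "decoder/dense_layers_1/mlp/wi/kernel",
    "decoder/moe_layers_0/gate/kernel",
]

# e.g. copy every dense-layer MLP kernel teacher -> student and freeze MoE gates,
# without enumerating exact paths.
copy_targets = select_params_by_regex(teacher_paths, [r"dense_layers_\d+/mlp"])
freeze_targets = select_params_by_regex(teacher_paths, [r"moe_layers_\d+/gate"])
print(copy_targets)    # both dense-layer kernels
print(freeze_targets)  # the MoE gate kernel
```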

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@vlad-karp vlad-karp marked this pull request as ready for review May 1, 2026 17:20
@vlad-karp vlad-karp changed the title from "generalized LTI" to "Layer-wise LTI" May 1, 2026
@vlad-karp vlad-karp changed the title from "Layer-wise LTI" to "[Distillation] Layer-wise LTI" May 1, 2026

github-actions Bot commented May 1, 2026

🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions Bot left a comment


📋 Review Summary

This PR introduces a significant refactor to the Learn-To-Init (LTI) mechanism, making it model-agnostic and more flexible by using dynamic NNX module augmentation. While the overall design improvement is positive and aligns with the goal of generalizing LTI, there are several critical and high-severity issues that need to be addressed, including syntax errors in assertions, missing f-string prefixes, and logical inconsistencies in layer collection.

🔍 General Feedback

  • Regex Support: The move to regex-based weight sharing and copying is a great addition that improves the flexibility of the distillation pipeline.
  • Model Agnostic LTI: Decoupling LTI from specific model architectures (like Llama2) is a good architectural move.
  • Testing: New tests were added for the generic augmentation, but they currently have structural issues (missing indentation) and incorrect mock patch paths that will prevent them from running correctly.
  • Inconsistencies: There is some inconsistency in how different layer prefixes (e.g., dense_layers_, moe_layers_) are handled between initialization and the final weight update.

@JamesDeng42 JamesDeng42 left a comment


Generally LGTM



3 participants