Reorganize pre-training doc#3778

Open
melissawm wants to merge 1 commit into AI-Hypercomputer:main from melissawm:pre-training-reorg

Conversation

@melissawm
Collaborator

Description

This PR implements the following changes to the documentation:

  • Moves pre-training documentation to be a sub-section of the Run MaxText section;
  • Adds more information beyond just dataset configuration to the pre-training guide;
  • Adds some extra content to the individual data pipeline guides;
  • Fixes some links and formatting in the hlo_diff_testing.md, gepa_optimization.md, and knowledge_distillation.md files.

Tests

Documentation builds correctly locally.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [x] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • [N/A] I have necessary comments in my code, particularly in hard-to-understand areas.
  • [N/A] I have run end-to-end tests and provided workload links above if applicable.
  • [x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

Collaborator Author

@melissawm melissawm left a comment


Left a few comments below.

> [!IMPORTANT]\
> While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.**
> The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines.
```{important}


Added syntax to show this note correctly on readthedocs.
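For context, the converted form renders as an admonition on Read the Docs. A minimal sketch of the MyST fenced-directive syntax (assuming the docs build uses myst-parser, as the diff suggests):

````markdown
```{important}
While running the update script locally is not the end of the world,
**relying on local execution can cause remote CI tests to fail.**
```
````

The `{note}` directive in the next hunk follows the same pattern.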

> 2. Squash the commits in your branch once again to keep your PR history clean.
> 3. Push the squashed commit to remote.
> 4. Retry the `tpu-integration` workflow to verify tests pass on your PR.
```{note}

Added syntax to show this note correctly on readthedocs.

base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=100
```

Unfortunately I could not verify any of the code snippets in this file, as I don't have access to this kind of machine. It would be very helpful if the team could sign off on the correctness and completeness of the code blocks in the run_maxtext_pretraining.md file.
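For context, the fragment shown in the hunk above is the tail of a MaxText training invocation. A hedged sketch of what the full command might look like (the entry point, config path, and bucket name are assumptions based on the MaxText README, not verified against this PR, and running it requires TPU/GPU hardware):

```shell
# Hypothetical invocation -- entry point and config path are assumptions.
export BASE_OUTPUT_DIRECTORY=gs://your-bucket/maxtext-runs   # placeholder bucket
python3 -m MaxText.train MaxText/configs/base.yml \
  run_name=pretraining-demo \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  dataset_type=synthetic \
  steps=100
```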

Comment on lines +132 to +137
Grain is the **recommended input pipeline** for production training due to its determinism and resilience to preemption. It supports ArrayRecord (random access) and Parquet (sequential access) formats.

To get started, you need to:

1. **Download data** to a Cloud Storage bucket
2. **Mount the bucket** using [Cloud Storage FUSE (GCSFuse)](https://cloud.google.com/storage/docs/gcs-fuse)

Since we are using the c4 dataset as the example for all three data pipelines in this page, could we leave instructions for folks wanting to use the c4 dataset with Grain?
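For example, the staging steps for Grain could be sketched as follows (bucket name and mount point are placeholders, and the `--implicit-dirs` flag is taken from the Cloud Storage FUSE docs; this needs a GCP environment to actually run):

```shell
# Hypothetical: mount a bucket containing c4 ArrayRecord files via GCSFuse.
BUCKET=your-maxtext-data          # placeholder bucket name
MOUNT_POINT=/tmp/gcsfuse
mkdir -p "${MOUNT_POINT}"
gcsfuse --implicit-dirs "${BUCKET}" "${MOUNT_POINT}"
# The Grain pipeline can then read e.g. ${MOUNT_POINT}/c4/*.array_record
```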

## Overview

This document explains how to use **GEPA** (Generic Evaluation and Prompt Adaptation) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb).
This document explains how to use **GEPA** (Generic Evaluation and Prompt Adaptation) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/maxtext_with_gepa.ipynb).

Sphinx documentation builds can only see files contained in the root "docs" folder, so the original link here was broken. Pointing to the GitHub version of the notebook is the best approach here.


A complete, runnable tutorial is available in the repository as a Jupyter Notebook:
[maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example)
[maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example)

Sphinx documentation builds can only see files contained in the root "docs" folder, so the original link here was broken. Pointing to the GitHub version of the notebook is the best approach here.

Comment thread docs/development.md
---
development/update_dependencies.md
development/contribute_docs.md
development/hlo_diff_testing.md

This file had not been added to any ToCs, and so was not reachable.
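For reference, making a page reachable means listing it in a `toctree` directive; a minimal MyST sketch using the file names from the hunk above (the `:maxdepth:` value is illustrative):

````markdown
```{toctree}
:maxdepth: 1

development/update_dependencies.md
development/contribute_docs.md
development/hlo_diff_testing.md
```
````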

Comment thread docs/guides.md
:gutter: 2

:::{grid-item-card} ⚡ Optimization
````{grid} 1 2 2 2

Both the colon and backtick syntaxes can be used here, but the Markdown linter we are using prefers backticks, so I converted the syntax to be compatible with the linter.
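For comparison, the backtick-fenced equivalent of the colon syntax looks like this (the card body text is illustrative; outer directives use more backticks than inner ones so the fences nest correctly):

`````markdown
````{grid} 1 2 2 2
:gutter: 2

```{grid-item-card} ⚡ Optimization
Card body goes here.
```
````
`````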

@melissawm melissawm force-pushed the pre-training-reorg branch 2 times, most recently from 83f52ce to 2f89016 on April 30, 2026 17:26
* Moves pre-training documentation to be a sub-section of the Run MaxText section;
* Adds more information beyond just dataset configuration for the pre-training guide;
* Adds some extra content to data pipeline individual guides.
@melissawm melissawm force-pushed the pre-training-reorg branch from 2f89016 to 93c0cfa on April 30, 2026 17:42