Reorganize pre-training doc#3778

Open
melissawm wants to merge 1 commit into AI-Hypercomputer:main from melissawm:pre-training-reorg

Conversation

@melissawm
Collaborator

Description

This PR implements the following changes to the documentation:

  • Moves pre-training documentation to be a sub-section of the Run MaxText section;
  • Adds more information beyond just dataset configuration to the pre-training guide;
  • Adds some extra content to the individual data pipeline guides;
  • Fixes some links and formatting in the hlo_diff_testing.md, gepa_optimization.md, and knowledge_distillation.md files.

Tests

Documentation builds correctly locally.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [x] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • [N/A] I have necessary comments in my code, particularly in hard-to-understand areas.
  • [N/A] I have run end-to-end tests and provided workload links above if applicable.
  • [x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

Collaborator Author

@melissawm melissawm left a comment


Left a few comments below.

> [!IMPORTANT]\
> While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.**
> The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines.
```{important}


Added syntax to show this note correctly on readthedocs.
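For context, the converted form renders as an admonition on Read the Docs. A minimal sketch of the MyST fenced-directive syntax (assuming the docs build uses myst-parser, as the diff suggests):

````markdown
```{important}
While running the update script locally is not the end of the world,
**relying on local execution can cause remote CI tests to fail.**
```
````

The `{note}` directive in the next hunk follows the same pattern.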

> 2. Squash the commits in your branch once again to keep your PR history clean.
> 3. Push the squashed commit to remote.
> 4. Retry the `tpu-integration` workflow to verify tests pass on your PR.
```{note}

Added syntax to show this note correctly on readthedocs.

base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=100
```

Unfortunately I could not verify any of the code snippets in this file, as I don't have access to this kind of machine. It would be very helpful if the team could sign off on the correctness and completeness of the code blocks in the run_maxtext_pretraining.md file.
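For context, the fragment shown in the hunk above is the tail of a MaxText training invocation. A hedged sketch of what the full command might look like (the entry point, config path, and bucket name are assumptions based on the MaxText README, not verified against this PR, and running it requires TPU/GPU hardware):

```shell
# Hypothetical invocation -- entry point and config path are assumptions.
export BASE_OUTPUT_DIRECTORY=gs://your-bucket/maxtext-runs   # placeholder bucket
python3 -m MaxText.train MaxText/configs/base.yml \
  run_name=pretraining-demo \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  dataset_type=synthetic \
  steps=100
```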

Comment on lines +132 to +137
Grain is the **recommended input pipeline** for production training due to its determinism and resilience to preemption. It supports ArrayRecord (random access) and Parquet (sequential access) formats.

To get started, you need to:

1. **Download data** to a Cloud Storage bucket
2. **Mount the bucket** using [Cloud Storage FUSE (GCSFuse)](https://cloud.google.com/storage/docs/gcs-fuse)

Since we are using the c4 dataset as the example for all three data pipelines in this page, could we leave instructions for folks wanting to use the c4 dataset with Grain?
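For example, the staging steps for Grain could be sketched as follows (bucket name and mount point are placeholders, and the `--implicit-dirs` flag is taken from the Cloud Storage FUSE docs; this needs a GCP environment to actually run):

```shell
# Hypothetical: mount a bucket containing c4 ArrayRecord files via GCSFuse.
BUCKET=your-maxtext-data          # placeholder bucket name
MOUNT_POINT=/tmp/gcsfuse
mkdir -p "${MOUNT_POINT}"
gcsfuse --implicit-dirs "${BUCKET}" "${MOUNT_POINT}"
# The Grain pipeline can then read e.g. ${MOUNT_POINT}/c4/*.array_record
```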

## Overview

This document explains how to use **GEPA** (Generic Evaluation and Prompt Adaptation) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb).
This document explains how to use **GEPA** (Generic Evaluation and Prompt Adaptation) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/maxtext_with_gepa.ipynb).

Sphinx documentation builds can only see files contained in the root "docs" folder, so the original link here was broken. Pointing to the GitHub version of the notebook is the best approach here.


A complete, runnable tutorial is available in the repository as a Jupyter Notebook:
[maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example)
[maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example)

Sphinx documentation builds can only see files contained in the root "docs" folder, so the original link here was broken. Pointing to the GitHub version of the notebook is the best approach here.

Comment thread docs/development.md
---
development/update_dependencies.md
development/contribute_docs.md
development/hlo_diff_testing.md

This file had not been added to any ToCs, and so was not reachable.
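For reference, making a page reachable means listing it in a `toctree` directive; a minimal MyST sketch using the file names from the hunk above (the `:maxdepth:` value is illustrative):

````markdown
```{toctree}
:maxdepth: 1

development/update_dependencies.md
development/contribute_docs.md
development/hlo_diff_testing.md
```
````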

Comment thread docs/guides.md
:gutter: 2

:::{grid-item-card} ⚡ Optimization
````{grid} 1 2 2 2

Both the colon and backtick syntaxes can be used here, but the Markdown linter we are using prefers backticks, so I converted the syntax to be compatible with the linter.
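For comparison, the backtick-fenced equivalent of the colon syntax looks like this (the card body text is illustrative; outer directives use more backticks than inner ones so the fences nest correctly):

`````markdown
````{grid} 1 2 2 2
:gutter: 2

```{grid-item-card} ⚡ Optimization
Card body goes here.
```
````
`````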

@melissawm melissawm force-pushed the pre-training-reorg branch 2 times, most recently from 83f52ce to 2f89016 on April 30, 2026 17:26
* Moves pre-training documentation to be a sub-section of the Run MaxText section;
* Adds more information beyond just dataset configuration for the pre-training guide;
* Adds some extra content to data pipeline individual guides.
@melissawm melissawm force-pushed the pre-training-reorg branch from 2f89016 to 93c0cfa on April 30, 2026 17:42