Reorganize pre-training doc #3778
Conversation
melissawm
left a comment
Left a few comments below.
| > [!IMPORTANT] |
| > While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.** |
| > The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines. |
| ```{important} |
Added syntax to show this note correctly on readthedocs.
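For reference, the MyST directive form that renders as an admonition on Read the Docs looks like this (content abbreviated):

````markdown
```{important}
Relying on local execution can cause remote CI tests to fail; use the
GitHub Action trigger to generate environment-matching baselines.
```
````

Unlike the `> [!IMPORTANT]` GitHub-alert syntax, the fenced `{important}` directive is understood by the MyST parser used in the Sphinx build.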
| > 2. Squash the commits in your branch once again to keep your PR history clean. |
| > 3. Push the squashed commit to remote. |
| > 4. Retry the `tpu-integration` workflow to verify tests pass on your PR. |
| ```{note} |
Added syntax to show this note correctly on readthedocs.
| base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ |
| dataset_type=synthetic \ |
| steps=100 |
| ``` |
Unfortunately I could not verify any of the code snippets in this file, as I don't have access to this kind of machine. It would be very helpful if the team could sign off on the correctness and completeness of the code blocks in the run_maxtext_pretraining.md file.
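For context, the truncated snippet above is the tail of a standard MaxText training invocation. A fuller, illustrative form (the entry point and config path are assumptions based on the usual MaxText layout and should be confirmed by the team) might be:

```shell
# Illustrative only -- verify the entry point and config path against the repo.
python3 -m MaxText.train src/MaxText/configs/base.yml \
  run_name=${RUN_NAME?} \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  dataset_type=synthetic \
  steps=100
```

With `dataset_type=synthetic`, no real dataset is required, which makes this the easiest snippet to smoke-test on a TPU VM.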
| Grain is the **recommended input pipeline** for production training due to its determinism and resilience to preemption. It supports ArrayRecord (random access) and Parquet (sequential access) formats. |
| To get started, you need to: |
| 1. **Download data** to a Cloud Storage bucket |
| 2. **Mount the bucket** using [Cloud Storage FUSE (GCSFuse)](https://cloud.google.com/storage/docs/gcs-fuse) |
Since we are using the c4 dataset as the example for all three data pipelines on this page, could we include instructions for folks wanting to use the c4 dataset with Grain?
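As a starting point, a hedged sketch of what c4-with-Grain instructions could look like, following the two steps above (bucket paths and the `grain_train_files` flag are assumptions to be verified against the MaxText configs):

```shell
# Illustrative sketch -- paths and flags should be verified by the team.
# 1. Copy an ArrayRecord-formatted c4 split into your own bucket.
gsutil -m cp -r gs://<your-prepared-c4-arrayrecord-path> gs://${DATASET_BUCKET?}/c4/

# 2. Mount the bucket locally with GCSFuse.
mkdir -p /tmp/gcsfuse
gcsfuse --implicit-dirs ${DATASET_BUCKET?} /tmp/gcsfuse

# 3. Point the Grain pipeline at the mounted files, e.g. via training flags:
#    (parameter names assumed; check src/MaxText/configs/base.yml)
#    dataset_type=grain
#    grain_train_files=/tmp/gcsfuse/c4/*.array_record
```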
| ## Overview |
| - This document explains how to use **GEPA** (Generic Evaluation and Prompt Adaptation) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb). |
| + This document explains how to use **GEPA** (Generic Evaluation and Prompt Adaptation) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/maxtext_with_gepa.ipynb). |
Documentation builds on Sphinx can only see the files contained in the root "docs" folder, and so the original link here was broken. Pointing to the github version of the notebook is the best approach here.
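GEPA's iterative refinement can be illustrated with a minimal, self-contained evolutionary loop. This is not the actual GEPA API; the scoring metric and mutation operator are toy stand-ins, and all names here are hypothetical:

```python
import random

# Toy stand-ins: GEPA uses LLM-based evaluation feedback and reflective
# prompt mutation; here we use keyword counting and random fragments.
FRAGMENTS = ["Be concise.", "Use bullet points.", "Summarize the key facts."]
KEYWORDS = ("summarize", "concise", "bullet")

def evaluate(prompt: str) -> int:
    """Toy fitness: count task keywords present in the prompt."""
    lower = prompt.lower()
    return sum(kw in lower for kw in KEYWORDS)

def mutate(prompt: str, rng: random.Random) -> str:
    """Toy mutation: append one randomly chosen instruction fragment."""
    return prompt + " " + rng.choice(FRAGMENTS)

def optimize_prompt(seed_prompt: str, generations: int = 20, pop_size: int = 4) -> str:
    """Keep the best-scoring candidate across generations of mutations."""
    rng = random.Random(0)  # fixed seed for reproducibility
    best, best_score = seed_prompt, evaluate(seed_prompt)
    for _ in range(generations):
        for _ in range(pop_size):
            candidate = mutate(best, rng)
            score = evaluate(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best

seed = "You are a helpful assistant."
tuned = optimize_prompt(seed)
# The tuned prompt scores at least as well as the seed on the toy metric.
```

The real framework replaces `evaluate` with task-specific evaluation feedback and `mutate` with LLM-guided prompt rewriting; see the linked notebook for the actual usage.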
| A complete, runnable tutorial is available in the repository as a Jupyter Notebook: |
| - [maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example) |
| + [maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example) |
Documentation builds on Sphinx can only see the files contained in the root "docs" folder, and so the original link here was broken. Pointing to the github version of the notebook is the best approach here.
| --- |
| development/update_dependencies.md |
| development/contribute_docs.md |
| development/hlo_diff_testing.md |
This file had not been added to any ToCs, and so was not reachable.
| :gutter: 2 |
| :::{grid-item-card} ⚡ Optimization |
| ````{grid} 1 2 2 2 |
Both the colon and backticks syntax can be used here, but the markdown linter we are using prefers backticks, so I converted the syntax here to be compatible with the linter.
Force-pushed from 83f52ce to 2f89016
* Moves pre-training documentation to be a sub-section of the Run MaxText section;
* Adds more information beyond just dataset configuration for the pre-training guide;
* Adds some extra content to the individual data pipeline guides.
Force-pushed from 2f89016 to 93c0cfa
Description
This PR implements the following changes to the documentation:
Tests
Documentation builds correctly locally.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
- [ ] Added the `gemini-review` label.