1 change: 1 addition & 0 deletions docs/development.md
@@ -7,4 +7,5 @@ hidden:
---
development/update_dependencies.md
development/contribute_docs.md
development/hlo_diff_testing.md
This file had not been added to any ToCs, and so was not reachable.

24 changes: 14 additions & 10 deletions docs/development/hlo_diff_testing.md
@@ -44,9 +44,12 @@ ______________________________________________________________________

When intentional architecture transformations alter graph lowering, the reference file baselines require updates.

> [!IMPORTANT]\
> While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.**
> The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines.
```{important}
While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.**

The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. Even small discrepancies in locally installed libraries can introduce slight deviations in the lowered backend graphs. If your local run causes a remote CI check to fail, use the GitHub Action trigger described below to generate environment-matching baselines.
```

### Method 1: Run the manual GitHub Action Workflow (Highly Recommended)

@@ -66,13 +69,14 @@ Alternatively, you can trigger the remote workflow via terminal CLI execution:
gh workflow run update_reference_hlo.yml --ref <branch>
```

> [!NOTE]
> A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must:
>
> 1. Pull the new commit from remote.
> 2. Squash the commits in your branch once again to keep your PR history clean.
> 3. Push the squashed commit to remote.
> 4. Retry the `tpu-integration` workflow to verify tests pass on your PR.
```{note}
A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must:

1. Pull the new commit from remote.
2. Squash the commits in your branch once again to keep your PR history clean.
3. Push the squashed commit to remote.
4. Retry the `tpu-integration` workflow to verify tests pass on your PR.
```
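In practice, the four follow-up steps above might look like the following (the branch name, base branch, and run ID are placeholders; adapt them to your PR):

```shell
# 1. Pull the baseline-update commit the workflow pushed to your branch.
git pull origin my-feature-branch

# 2. Squash the branch back down to a single commit.
#    (Assumes your PR branches off main; use your actual base branch.)
git reset --soft main
git commit -m "My feature + updated reference HLO baselines"

# 3. Push the squashed commit; the force push rewrites the branch history.
git push --force-with-lease origin my-feature-branch

# 4. Re-run the failed tpu-integration checks, e.g. via the GitHub CLI.
gh run rerun <run-id> --failed
```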

### Method 2: Local Execution

37 changes: 19 additions & 18 deletions docs/guides.md
@@ -18,58 +18,59 @@

Explore our how-to guides for optimizing, debugging, and managing your MaxText workloads.

::::{grid} 1 2 2 2
:gutter: 2

:::{grid-item-card} ⚡ Optimization
````{grid} 1 2 2 2
---
gutter: 2
---
```{grid-item-card} ⚡ Optimization
:link: guides/optimization
:link-type: doc

Techniques for maximizing performance, including sharding strategies, Pallas kernels, and benchmarking.
:::
```

:::{grid-item-card} 💾 Data Pipelines
```{grid-item-card} 💾 Data Pipelines
:link: guides/data_input_pipeline
:link-type: doc

Configure input pipelines using **Grain** (recommended for determinism), **HuggingFace**, or **TFDS**.
:::
```

:::{grid-item-card} 🔄 Checkpointing
```{grid-item-card} 🔄 Checkpointing
:link: guides/checkpointing_solutions
:link-type: doc

Manage GCS checkpoints, handle preemption with emergency checkpointing, and configure multi-tier storage.
:::
```

:::{grid-item-card} 🔍 Monitoring & Debugging
```{grid-item-card} 🔍 Monitoring & Debugging
:link: guides/monitoring_and_debugging
:link-type: doc

Tools for observability: goodput monitoring, hung job debugging, and Vertex AI TensorBoard integration.
:::
```

:::{grid-item-card} 🐍 Python Notebooks
```{grid-item-card} 🐍 Python Notebooks
:link: guides/run_python_notebook
:link-type: doc

Interactive development guides for running MaxText on Google Colab or local JupyterLab environments.
:::
```

:::{grid-item-card} 🌱 Model Bringup
```{grid-item-card} 🌱 Model Bringup
:link: guides/model_bringup
:link-type: doc

A step-by-step guide for the community to help expand MaxText's model library.
:::
```

:::{grid-item-card} 🎓 Distillation
```{grid-item-card} 🎓 Distillation
:link: guides/distillation
:link-type: doc

How online distillation works in MaxText: loss anatomy, α / β / temperature schedule tuning, layer indices, monitoring metrics, and troubleshooting.
:::
::::
```
````

```{toctree}
---
3 changes: 2 additions & 1 deletion docs/guides/data_input_pipeline.md
@@ -37,7 +37,8 @@ Training in a multi-host environment presents unique challenges for data input p

### Random access dataset (Recommended)

Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.<br>
Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.

In MaxText, this is best supported by the ArrayRecord format using the Grain input pipeline. This approach gracefully handles the key challenges:

- **Concurrent access and uniqueness**: Grain assigns a unique set of indices to each host. ArrayRecord allows different hosts to read from different indices in the same file.
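As a toy illustration of this idea (this is not Grain's actual sharding code, just a sketch of the principle), a disjoint per-host index assignment could look like:

```python
# Sketch: assign each data-loading host a disjoint set of record indices,
# so every record is read by exactly one host. Grain's real sampler is
# more sophisticated (shuffling, sharding options, checkpointable state).
def host_indices(num_records: int, num_hosts: int, host_id: int) -> list[int]:
    """Host h reads records h, h + num_hosts, h + 2 * num_hosts, ..."""
    return list(range(host_id, num_records, num_hosts))

# With 10 records and 4 hosts, the index sets are disjoint and cover
# every record exactly once.
shards = [host_indices(10, 4, h) for h in range(4)]
```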
36 changes: 29 additions & 7 deletions docs/guides/data_input_pipeline/data_input_grain.md
@@ -32,9 +32,12 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state

## Using Grain

1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)(sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
- **Community Resource**: The MaxText community has created a [ArrayRecord Documentation](https://array-record.readthedocs.io/). Note: we appreciate the contribution from the community, but as of now it has not been verified by the MaxText or ArrayRecord developers yet.
2. If the dataset is hosted on a Cloud Storage bucket, the path `gs://` can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This will significantly improve the perf for the ArrayRecord format as it allows meta data caching to speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount.
Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random access through row groups), and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) (sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. To convert a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random-access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.

```{admonition} Community Resource

The MaxText community has created [ArrayRecord documentation](https://array-record.readthedocs.io/). Note: we appreciate this community contribution, but it has not yet been verified by the MaxText or ArrayRecord developers.
```

If the dataset is hosted on a Cloud Storage bucket, the `gs://` path can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This significantly improves performance for the ArrayRecord format, as it enables metadata caching that speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path on each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount.

```sh
bash src/dependencies/scripts/setup_gcsfuse.sh \
@@ -45,11 +50,13 @@ MOUNT_PATH=${MOUNT_PATH?} \

Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` to pre-fill the metadata cache (see ["Performance tuning best practices" in the Google Cloud documentation](https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/performance)).

### Configuration

1. Set `dataset_type=grain`, `grain_file_type={arrayrecord|parquet|tfrecord}`, `grain_train_files` in `src/maxtext/configs/base.yml` or through command line arguments to match the file pattern on the mounted local path.

2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh) to avoid gcsfuse throttling.

3. ArrayRecord Only: For multi-source blending, you can specify multiple data sources with their respective weights using semicolon (;) as a separator and a comma (,) for weights. The weights will be automatically normalized to sum to 1.0. For example:
3. *ArrayRecord Only*: For multi-source blending, you can specify multiple data sources with their respective weights, using a semicolon (;) to separate sources and a comma (,) to separate weights. The weights are automatically normalized to sum to 1.0. For example:

```
# Blend two data sources with 30% from first source and 70% from second source
@@ -120,17 +127,32 @@ grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* \
grain_worker_count=2
```
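Returning to the multi-source blending described earlier: the weight normalization can be sketched as follows (a hypothetical helper for illustration, not MaxText's actual implementation):

```python
# Hypothetical sketch of multi-source blending weight normalization:
# sources are ';'-separated, weights ','-separated, and the weights
# are normalized to sum to 1.0.
def parse_blend(files_spec: str, weights_spec: str):
    sources = files_spec.split(";")
    weights = [float(w) for w in weights_spec.split(",")]
    if len(sources) != len(weights):
        raise ValueError("need exactly one weight per data source")
    total = sum(weights)
    return sources, [w / total for w in weights]

# 30% from the first source, 70% from the second.
sources, weights = parse_blend("gs://bucket/a*;gs://bucket/b*", "0.3,0.7")
```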

1. Using validation set for evaluation
### Using validation set for evaluation

When setting eval_interval > 0, evaluation will be run with a specified eval dataset. Example config (set in [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml) or through command line):
When setting `eval_interval > 0`, evaluation will be run with a specified eval dataset. Example config (set in [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml) or through command line):

```yaml
eval_interval: 10000
eval_steps: 50
grain_eval_files: '/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record*'
```
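As a quick sanity check on such a config, the number of examples consumed per evaluation run is roughly `eval_steps` times the global batch size (the device count and batch size below are illustrative assumptions):

```python
# Rough arithmetic for sizing eval_steps against a validation set.
# Assumes global batch = per_device_batch_size * device_count;
# MaxText's exact batch accounting may differ.
eval_steps = 50
per_device_batch_size = 8
device_count = 4  # illustrative, e.g. one TPU host

examples_per_eval = eval_steps * per_device_batch_size * device_count  # 1600
```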

1. Experimental: resuming training with a different chip count
### Tokenizer support

The Grain pipeline supports three tokenizer types:

- `sentencepiece`: For SentencePiece tokenizers
- `huggingface`: For HuggingFace tokenizers (requires `hf_access_token` for gated models)
- `tiktoken`: For OpenAI's tiktoken tokenizers

Example with SentencePiece:

```bash
tokenizer_type=sentencepiece \
tokenizer_path=gs://<your-bucket>/tokenizers/c4_en_301_5Mexp2_spm.model
```

### Experimental: resuming training with a different chip count

In Grain checkpoints, each data-loading host has a corresponding JSON file. For cases where a user wants to resume training with a different number of data-loading hosts, MaxText provides an experimental feature:

34 changes: 34 additions & 0 deletions docs/guides/data_input_pipeline/data_input_hf.md
@@ -39,6 +39,40 @@ hf_eval_files: 'gs://<bucket>/<folder>/*-validation-*.parquet' # match the val
tokenizer_path: 'google-t5/t5-large' # for using https://huggingface.co/google-t5/t5-large
```

## Tokenizer configuration

The Hugging Face pipeline only supports Hugging Face tokenizers and will ignore the `tokenizer_type` flag.

## Using gated datasets

For [gated datasets](https://huggingface.co/docs/hub/en/datasets-gated) or tokenizers from [gated models](https://huggingface.co/docs/hub/en/models-gated), you need to:

1. Request access on HuggingFace
2. Generate an access token from your [HuggingFace settings](https://huggingface.co/settings/tokens)
3. Provide the token in your command:

```bash
hf_access_token=<YOUR_TOKEN>
```

Example with gated model:

```bash
python3 -m maxtext.trainers.pre_train.train \
base_output_directory=gs://<your-bucket> \
run_name=llama2_demo \
model_name=llama2-7b \
dataset_type=hf \
hf_path=allenai/c4 \
hf_data_dir=en \
train_split=train \
tokenizer_type=huggingface \
tokenizer_path=meta-llama/Llama-2-7b \
hf_access_token=hf_xxxxxxxxxxxxx \
steps=1000 \
per_device_batch_size=8
```

## Limitations and Recommendations

1. Streaming data directly from the Hugging Face Hub may be affected by server traffic. During peak hours you may encounter "504 Server Error: Gateway Time-out". For the most stable experience, it's recommended to download the Hugging Face dataset to a Cloud Storage bucket or disk.
12 changes: 12 additions & 0 deletions docs/guides/data_input_pipeline/data_input_tfds.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
# TFDS pipeline

The TensorFlow Datasets (TFDS) pipeline uses datasets in TFRecord format, which is performant and widely supported in the TensorFlow ecosystem.

## Example config for streaming from TFDS dataset in a Cloud Storage bucket

1. Download the Allenai C4 dataset in TFRecord format to a Cloud Storage bucket. For information about cost, see [this discussion](https://github.com/allenai/allennlp/discussions/5056).

```shell
@@ -18,3 +22,11 @@ eval_split: 'validation'
# TFDS input pipeline only supports tokenizer in spm format
tokenizer_path: 'src/maxtext/assets/tokenizers/tokenizer.llama2'
```

### Tokenizer support

The TFDS pipeline supports three tokenizer types:

- `sentencepiece`: For SentencePiece tokenizers
- `huggingface`: For HuggingFace tokenizers (requires `hf_access_token` for gated models)
- `tiktoken`: For OpenAI's tiktoken tokenizers
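As an illustration, switching to a HuggingFace tokenizer might look like this (the model path is an example; `hf_access_token` is only needed for gated models):

```yaml
tokenizer_type: 'huggingface'
tokenizer_path: 'meta-llama/Llama-2-7b'
hf_access_token: '<YOUR_TOKEN>'  # only for gated models
```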
35 changes: 22 additions & 13 deletions docs/run_maxtext.md
@@ -2,50 +2,59 @@

Choose your environment and orchestration method to run MaxText.

::::{grid} 1 2 2 2
:gutter: 2
````{grid} 1 2 2 2
---
gutter: 2
---
```{grid-item-card} 🚀 Pre-training
:link: run_maxtext/run_maxtext_pretraining
:link-type: doc

Complete guide to pre-training language models from scratch. Covers model selection, hyperparameters, dataset configuration, deployment options, and monitoring.
```

:::{grid-item-card} 💻 Localhost / Single VM
```{grid-item-card} 💻 Localhost / Single VM
:link: run_maxtext/run_maxtext_localhost
:link-type: doc

Get started quickly on a single machine. Clone the repo, install dependencies, and run your first training job on a single TPU or GPU VM.
:::
```

:::{grid-item-card} 🎮 Single-host GPU
```{grid-item-card} 🎮 Single-host GPU
:link: run_maxtext/run_maxtext_single_host_gpu
:link-type: doc

Run MaxText on single-host NVIDIA GPUs (e.g., A3 High/Mega). Includes Docker setup, NVIDIA Container Toolkit installation, and 1B/7B model training examples.
:::
```

:::{grid-item-card} 🏗️ At scale with XPK (GKE)
```{grid-item-card} 🏗️ At scale with XPK (GKE)
:link: run_maxtext/run_maxtext_via_xpk
:link-type: doc

Deploy to Google Kubernetes Engine (GKE) using XPK. Orchestrate large-scale training jobs on TPU or GPU clusters with simple CLI commands.
:::
```

:::{grid-item-card} 🌐 Multi-host via Pathways
```{grid-item-card} 🌐 Multi-host via Pathways
:link: run_maxtext/run_maxtext_via_pathways
:link-type: doc

Run large-scale JAX jobs on TPUs using Pathways. Supports batch and headless (interactive) workloads on GKE.
:::
```

:::{grid-item-card} 🔌 Decoupled Mode
```{grid-item-card} 🔌 Decoupled Mode
:link: run_maxtext/decoupled_mode
:link-type: doc

Run tests and local development without Google Cloud dependencies (no `gcloud`, GCS, or Vertex AI required).
:::
::::
```
````

```{toctree}
---
hidden:
maxdepth: 1
---
run_maxtext/run_maxtext_pretraining.md
run_maxtext/run_maxtext_localhost.md
run_maxtext/run_maxtext_single_host_gpu.md
run_maxtext/run_maxtext_via_xpk.md