diff --git a/docs/development.md b/docs/development.md index c4e671f44b..7b65580846 100644 --- a/docs/development.md +++ b/docs/development.md @@ -7,4 +7,5 @@ hidden: --- development/update_dependencies.md development/contribute_docs.md +development/hlo_diff_testing.md ``` diff --git a/docs/development/hlo_diff_testing.md b/docs/development/hlo_diff_testing.md index f486f3e6de..d79acb31ec 100644 --- a/docs/development/hlo_diff_testing.md +++ b/docs/development/hlo_diff_testing.md @@ -44,9 +44,12 @@ ______________________________________________________________________ When intended architectures transformations alter graph lowering, reference file baselines require updates. -> [!IMPORTANT]\ -> While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.** -> The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines. +```{important} + +While running the update script locally is not the end of the world, **relying on local execution can cause remote CI tests to fail.** + +The PR verification pipelines run the tests in a strictly locked GitHub Actions environment. The smallest discrepancies in local library installations will introduce slight backend lowering graph deviations. If your local execution leads to a remote CI check failure, rely on the GitHub Action trigger described below to generate environment-matching baselines. +``` ### Method 1: Run the manual GitHub Action Workflow (Highly Recommended) @@ -66,13 +69,14 @@ Alternatively, you can trigger the remote workflow via terminal CLI execution: gh workflow run update_reference_hlo.yml --ref ``` -> [!NOTE] -> A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must: -> -> 1. Pull the new commit from remote. -> 2. Squash the commits in your branch once again to keep your PR history clean. -> 3. Push the squashed commit to remote. -> 4. Retry the `tpu-integration` workflow to verify tests pass on your PR. +```{note} +A successful run of the manual update workflow will add a new commit to your Pull Request branch. Once complete, you must: + +1. Pull the new commit from remote. +2. Squash the commits in your branch once again to keep your PR history clean. +3. Push the squashed commit to remote. +4. Retry the `tpu-integration` workflow to verify tests pass on your PR. +``` ### Method 2: Local Execution diff --git a/docs/guides.md b/docs/guides.md index bff2cc0042..bfd5a0eaf1 100644 --- a/docs/guides.md +++ b/docs/guides.md @@ -18,58 +18,59 @@ Explore our how-to guides for optimizing, debugging, and managing your MaxText workloads. -::::{grid} 1 2 2 2 -:gutter: 2 - -:::{grid-item-card} ⚡ Optimization +````{grid} 1 2 2 2 +--- +gutter: 2 +--- +```{grid-item-card} ⚡ Optimization :link: guides/optimization :link-type: doc Techniques for maximizing performance, including sharding strategies, Pallas kernels, and benchmarking. -::: +``` -:::{grid-item-card} 💾 Data Pipelines +```{grid-item-card} 💾 Data Pipelines :link: guides/data_input_pipeline :link-type: doc Configure input pipelines using **Grain** (recommended for determinism), **HuggingFace**, or **TFDS**. 
-::: +``` -:::{grid-item-card} 🔄 Checkpointing +```{grid-item-card} 🔄 Checkpointing :link: guides/checkpointing_solutions :link-type: doc Manage GCS checkpoints, handle preemption with emergency checkpointing, and configure multi-tier storage. -::: +``` -:::{grid-item-card} 🔍 Monitoring & Debugging +```{grid-item-card} 🔍 Monitoring & Debugging :link: guides/monitoring_and_debugging :link-type: doc Tools for observability: goodput monitoring, hung job debugging, and Vertex AI TensorBoard integration. -::: +``` -:::{grid-item-card} 🐍 Python Notebooks +```{grid-item-card} 🐍 Python Notebooks :link: guides/run_python_notebook :link-type: doc Interactive development guides for running MaxText on Google Colab or local JupyterLab environments. -::: +``` -:::{grid-item-card} 🌱 Model Bringup +```{grid-item-card} 🌱 Model Bringup :link: guides/model_bringup :link-type: doc A step-by-step guide for the community to help expand MaxText's model library. -::: +``` -:::{grid-item-card} 🎓 Distillation +```{grid-item-card} 🎓 Distillation :link: guides/distillation :link-type: doc How online distillation works in MaxText: loss anatomy, α / β / temperature schedule tuning, layer indices, monitoring metrics, and troubleshooting. -::: -:::: +``` +```` ```{toctree} --- diff --git a/docs/guides/data_input_pipeline.md b/docs/guides/data_input_pipeline.md index 0b65bdfa6c..019a8dd07f 100644 --- a/docs/guides/data_input_pipeline.md +++ b/docs/guides/data_input_pipeline.md @@ -37,7 +37,8 @@ Training in a multi-host environment presents unique challenges for data input p ### Random access dataset (Recommended) -Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.
+Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index. + In MaxText, this is best supported by the ArrayRecord format using the Grain input pipeline. This approach gracefully handles the key challenges: - **Concurrent access and uniqueness**: Grain assigns a unique set of indices to each host. ArrayRecord allows different hosts to read from different indices in the same file. diff --git a/docs/guides/data_input_pipeline/data_input_grain.md b/docs/guides/data_input_pipeline/data_input_grain.md index 5a7d66981d..1bac071639 100644 --- a/docs/guides/data_input_pipeline/data_input_grain.md +++ b/docs/guides/data_input_pipeline/data_input_grain.md @@ -32,9 +32,14 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state ## Using Grain -1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)(sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class. - - **Community Resource**: The MaxText community has created a [ArrayRecord Documentation](https://array-record.readthedocs.io/). Note: we appreciate the contribution from the community, but as of now it has not been verified by the MaxText or ArrayRecord developers yet. -2. If the dataset is hosted on a Cloud Storage bucket, the path `gs://` can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This will significantly improve the perf for the ArrayRecord format as it allows meta data caching to speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount. +Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)(sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class. + +```{admonition} Community Resource + +The MaxText community has created a [ArrayRecord Documentation](https://array-record.readthedocs.io/). 
Note: we appreciate this community contribution, but it has not yet been verified by the MaxText or ArrayRecord developers. +``` + +If the dataset is hosted on a Cloud Storage bucket, the path `gs://` can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This will significantly improve performance for the ArrayRecord format, as it allows metadata caching to speed up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount. ```sh bash src/dependencies/scripts/setup_gcsfuse.sh \ @@ -45,11 +50,13 @@ MOUNT_PATH=${MOUNT_PATH?} \ Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pre-filling the metadata cache (see ["Performance tuning best practices" on the Google Cloud documentation](https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/performance)). +### Configuration + 1. Set `dataset_type=grain`, `grain_file_type={arrayrecord|parquet|tfrecord}`, `grain_train_files` in `src/maxtext/configs/base.yml` or through command line arguments to match the file pattern on the mounted local path. 2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh) to avoid gcsfuse throttling. -3. ArrayRecord Only: For multi-source blending, you can specify multiple data sources with their respective weights using semicolon (;) as a separator and a comma (,) for weights. The weights will be automatically normalized to sum to 1.0. For example: +3. *ArrayRecord Only*: For multi-source blending, you can specify multiple data sources with their respective weights using semicolon (;) as a separator and a comma (,) for weights. The weights will be automatically normalized to sum to 1.0. For example: ``` # Blend two data sources with 30% from first source and 70% from second source ``` @@ -120,9 +127,9 @@ grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* \ grain_worker_count=2 ``` -1. Using validation set for evaluation +### Using validation set for evaluation -When setting eval_interval > 0, evaluation will be run with a specified eval dataset. Example config (set in [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml) or through command line): +When setting `eval_interval > 0`, evaluation will be run with a specified eval dataset. Example config (set in [`src/maxtext/configs/base.yml`](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml) or through command line): ```yaml eval_interval: 10000 @@ -130,7 +137,22 @@ eval_steps: 50 grain_eval_files: '/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record*' ``` -1. 
Experimental: resuming training with a different chip count +### Tokenizer support + +Grain pipeline supports three tokenizer types: + +- `sentencepiece`: For SentencePiece tokenizers; +- `huggingface`: For HuggingFace tokenizers (requires `hf_access_token` for gated models); +- `tiktoken`: For OpenAI's tiktoken tokenizers. + +Example with SentencePiece: + +```bash +tokenizer_type=sentencepiece \ +tokenizer_path=gs:///tokenizers/c4_en_301_5Mexp2_spm.model +``` + +### Experimental: resuming training with a different chip count In Grain checkpoints, each data-loading host has a corresponding JSON file. For cases where a user wants to resume training with a different number of data-loading hosts, MaxText provides an experimental feature: diff --git a/docs/guides/data_input_pipeline/data_input_hf.md b/docs/guides/data_input_pipeline/data_input_hf.md index ee8d0c67a6..be00844d4f 100644 --- a/docs/guides/data_input_pipeline/data_input_hf.md +++ b/docs/guides/data_input_pipeline/data_input_hf.md @@ -39,6 +39,40 @@ hf_eval_files: 'gs:////*-validation-*.parquet' # match the val tokenizer_path: 'google-t5/t5-large' # for using https://huggingface.co/google-t5/t5-large ``` +## Tokenizer configuration + +The Hugging Face pipeline only supports Hugging Face tokenizers and will ignore the `tokenizer_type` flag. + +## Using gated datasets + +For [gated datasets](https://huggingface.co/docs/hub/en/datasets-gated) or tokenizers from [gated models](https://huggingface.co/docs/hub/en/models-gated), you need to: + +1. Request access on HuggingFace +2. Generate an access token from your [HuggingFace settings](https://huggingface.co/settings/tokens) +3. Provide the token in your command: + +```bash +hf_access_token= +``` + +Example with gated model: + +```bash +python3 -m maxtext.trainers.pre_train.train \ + base_output_directory=gs:// \ + run_name=llama2_demo \ + model_name=llama2-7b \ + dataset_type=hf \ + hf_path=allenai/c4 \ + hf_data_dir=en \ + train_split=train \ + tokenizer_type=huggingface \ + tokenizer_path=meta-llama/Llama-2-7b \ + hf_access_token=hf_xxxxxxxxxxxxx \ + steps=1000 \ + per_device_batch_size=8 +``` + ## Limitations and Recommendations 1. Streaming data directly from Hugging Face Hub may be impacted by the traffic of the server. During peak hours you may encounter "504 Server Error: Gateway Time-out". It's recommended to download the Hugging Face dataset to a Cloud Storage bucket or disk for the most stable experience. diff --git a/docs/guides/data_input_pipeline/data_input_tfds.md b/docs/guides/data_input_pipeline/data_input_tfds.md index acbf064055..714653a4a5 100644 --- a/docs/guides/data_input_pipeline/data_input_tfds.md +++ b/docs/guides/data_input_pipeline/data_input_tfds.md @@ -1,5 +1,9 @@ # TFDS pipeline +The TensorFlow Datasets (TFDS) pipeline uses datasets in TFRecord format, which is performant and widely supported in the TensorFlow ecosystem. + +## Example config for streaming from TFDS dataset in a Cloud Storage bucket + 1. Download the Allenai C4 dataset in TFRecord format to a Cloud Storage bucket. 
For information about cost, see [this discussion](https://github.com/allenai/allennlp/discussions/5056) ```shell @@ -18,3 +22,11 @@ eval_split: 'validation' # TFDS input pipeline only supports tokenizer in spm format tokenizer_path: 'src/maxtext/assets/tokenizers/tokenizer.llama2' ``` + +### Tokenizer support + +TFDS pipeline supports three tokenizer types: + +- `sentencepiece`: For SentencePiece tokenizers +- `huggingface`: For HuggingFace tokenizers (requires `hf_access_token` for gated models) +- `tiktoken`: For OpenAI's tiktoken tokenizers diff --git a/docs/run_maxtext.md b/docs/run_maxtext.md index accd7ea3b5..47bcb20400 100644 --- a/docs/run_maxtext.md +++ b/docs/run_maxtext.md @@ -2,50 +2,59 @@ Choose your environment and orchestration method to run MaxText. -::::{grid} 1 2 2 2 -:gutter: 2 +````{grid} 1 2 2 2 +--- +gutter: 2 +--- +```{grid-item-card} 🚀 Pre-training +:link: run_maxtext/run_maxtext_pretraining +:link-type: doc + +Complete guide to pre-training language models from scratch. Covers model selection, hyperparameters, dataset configuration, deployment options, and monitoring. +``` -:::{grid-item-card} 💻 Localhost / Single VM +```{grid-item-card} 💻 Localhost / Single VM :link: run_maxtext/run_maxtext_localhost :link-type: doc Get started quickly on a single machine. Clone the repo, install dependencies, and run your first training job on a single TPU or GPU VM. -::: +``` -:::{grid-item-card} 🎮 Single-host GPU +```{grid-item-card} 🎮 Single-host GPU :link: run_maxtext/run_maxtext_single_host_gpu :link-type: doc Run MaxText on single-host NVIDIA GPUs (e.g., A3 High/Mega). Includes Docker setup, NVIDIA Container Toolkit installation, and 1B/7B model training examples. -::: +``` -:::{grid-item-card} 🏗️ At scale with XPK (GKE) +```{grid-item-card} 🏗️ At scale with XPK (GKE) :link: run_maxtext/run_maxtext_via_xpk :link-type: doc Deploy to Google Kubernetes Engine (GKE) using XPK. Orchestrate large-scale training jobs on TPU or GPU clusters with simple CLI commands. -::: +``` -:::{grid-item-card} 🌐 Multi-host via Pathways +```{grid-item-card} 🌐 Multi-host via Pathways :link: run_maxtext/run_maxtext_via_pathways :link-type: doc Run large-scale JAX jobs on TPUs using Pathways. Supports batch and headless (interactive) workloads on GKE. -::: +``` -:::{grid-item-card} 🔌 Decoupled Mode +```{grid-item-card} 🔌 Decoupled Mode :link: run_maxtext/decoupled_mode :link-type: doc Run tests and local development without Google Cloud dependencies (no `gcloud`, GCS, or Vertex AI required). -::: -:::: +``` +```` ```{toctree} --- hidden: maxdepth: 1 --- +run_maxtext/run_maxtext_pretraining.md run_maxtext/run_maxtext_localhost.md run_maxtext/run_maxtext_single_host_gpu.md run_maxtext/run_maxtext_via_xpk.md diff --git a/docs/run_maxtext/run_maxtext_pretraining.md b/docs/run_maxtext/run_maxtext_pretraining.md new file mode 100644 index 0000000000..8015f769f5 --- /dev/null +++ b/docs/run_maxtext/run_maxtext_pretraining.md @@ -0,0 +1,356 @@ + + +(run-pretraining)= + +# Pre-training + +Pre-training is the process of training a language model from scratch (or from randomly initialized weights) on large-scale datasets to learn general language understanding. This guide covers running pre-training workloads with MaxText, including model selection, hyperparameter configuration, dataset setup, deployment options, and monitoring. + +## Prerequisites + +Before starting, ensure you have: + +1. **Completed the installation** - See [Install MaxText](../install_maxtext.md) +2. 
**Set up a Cloud Storage bucket** - For storing logs and checkpoints. See [First Run](../tutorials/first_run.md#prerequisites-set-up-storage-and-configure-maxtext) for detailed instructions. +3. **Configured your environment** - Set `BASE_OUTPUT_DIRECTORY` environment variable: + ```bash + export BASE_OUTPUT_DIRECTORY=gs:/// + ``` + +## 1. Model selection + +MaxText provides pre-configured models for popular architectures. To use one, +specify the `model_name` parameter when using `maxtext.trainers.pre_train.train`. +MaxText will load the corresponding configuration from `src/maxtext/configs/models/` +(for TPU defaults) or `src/maxtext/configs/models/gpu/` (for GPU defaults). + +MaxText supports many open-source models. For a complete list, see the [configs/models](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/configs/models) directory and [our README](https://github.com/AI-Hypercomputer/maxtext/blob/main/README.md). + +### Example: Training with a specific model + +```bash +python3 -m maxtext.trainers.pre_train.train \ + model_name=llama3-8b \ + run_name=my_pretraining_run \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + dataset_type=synthetic \ + steps=100 +``` + +### Custom model configurations + +You can also create custom model configurations or override specific architecture parameters: + +```bash +# Override specific parameters +python3 -m maxtext.trainers.pre_train.train \ + model_name=llama3-8b \ + base_emb_dim=4096 \ + base_num_decoder_layers=32 \ + run_name=custom_model \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + dataset_type=synthetic \ + steps=100 +``` + +**Note:** You cannot override parameters that are already defined in the model config file from the command line. To fully customize a model, create a new YAML config file similar to the ones in the [MaxText repository](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/configs) and pass it as a parameter: + +```bash +# Use a custom model config file +python3 -m maxtext.trainers.pre_train.train \ + /path/to/model_config.yml \ + run_name=custom_model \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + dataset_type=synthetic \ + steps=100 +``` + +## 2. Hyperparameter configuration + +Key hyperparameters control the training process. Here are the most important ones. 
+ +- **Training duration:** + - `steps`: Total number of training steps (required) + - `max_target_length`: Maximum sequence length in tokens (default: 1024) + - `per_device_batch_size`: Batch size per device/chip (default: 12.0) +- **Learning rate and optimizer:** + - `learning_rate`: Peak learning rate (default: 3e-5) + - `opt_type`: Optimizer type - `adamw`, `adam_pax`, `sgd` or `muon` (default: `adamw`) +- **Checkpointing:** + - `enable_checkpointing`: Save checkpoints during training (default: `True`) + - `checkpoint_period`: (default: 10000) +- **Logging and monitoring:** + - `log_period`: The frequency of Tensorboard flush, gcs metrics writing, and managed profiler metrics updating (default: 100) + - `enable_tensorboard`: Enable TensorBoard logging (default: `True`) + +### Example with common hyperparameters + +```bash +python3 -m maxtext.trainers.pre_train.train \ + model_name=qwen3-4b \ + run_name=qwen_pretrain \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + dataset_type=hf \ + hf_path=allenai/c4 \ + hf_data_dir=en \ + train_split=train \ + tokenizer_type=huggingface \ + tokenizer_path=Qwen/Qwen2.5-4B \ + steps=10000 \ + per_device_batch_size=16 \ + max_target_length=4096 \ + learning_rate=1e-4 \ + checkpoint_period=1000 \ + log_period=50 +``` + +For a complete list of configurable parameters, see +[src/maxtext/configs/base.yml](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/base.yml). + +## 3. Dataset configuration + +MaxText supports three dataset input pipelines: Grain, HuggingFace and TFDS +(TensorFlow Datasets). We will briefly describe how to get started with each of +them in this document. See [Data Input Pipeline](../guides/data_input_pipeline.md) and the individual guides listed in that document for more details. + +### Grain pipeline + +Grain is the **recommended input pipeline** for production training due to its determinism and resilience to preemption. It supports ArrayRecord (random access) and Parquet (sequential access) formats. + +To get started, you need to: + +1. **Download data** to a Cloud Storage bucket +2. 
**Mount the bucket** using [Cloud Storage FUSE (GCSFuse)](https://cloud.google.com/storage/docs/gcs-fuse) + +```bash +bash src/dependencies/scripts/setup_gcsfuse.sh \ + DATASET_GCS_BUCKET=gs:// \ + MOUNT_PATH=/tmp/gcsfuse +``` + +After training, unmount the bucket: + +```bash +fusermount -u /tmp/gcsfuse +``` + +#### Example: using GCSFuse and ArrayRecord + +```bash +# Replace DATASET_GCS_BUCKET and base_output_directory with your buckets; replace run-name with your run name +python3 -m maxtext.trainers.pre_train.train \ + base_output_directory=gs:// \ + run_name= \ + model_name=deepseek2-16b \ + per_device_batch_size=1 \ + steps=10 \ + max_target_length=2048 \ + enable_checkpointing=false \ + dataset_type=grain \ + grain_file_type=arrayrecord \ + grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* \ + grain_worker_count=2 +``` + +#### Dataset configuration parameters + +- `dataset_type`: Set to `grain` +- `grain_file_type`: Format type - `arrayrecord` (recommended) or `parquet` +- `grain_train_files`: Path pattern to training files (supports wildcard patterns like `*`) +- `grain_worker_count`: Number of child processes for data loading (tune for performance) + +#### Evaluation during training (optional) + +To add periodic evaluation during training, specify an evaluation interval: + +```bash +# Add to your command +eval_interval=5 \ +eval_steps=10 \ +grain_eval_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record* +``` + +For comprehensive Grain configuration and best practices, see [Grain Pipeline Guide](../guides/data_input_pipeline/data_input_grain.md). + +### HuggingFace pipeline + +The HuggingFace pipeline provides the easiest way to get started with real datasets, streaming data directly from the HuggingFace Hub without requiring downloads. Alternatively, you can stream from a Cloud Storage bucket; see [HuggingFace Pipeline](../guides/data_input_pipeline/data_input_hf.md) for details. + +#### Example: [allenai/c4](https://huggingface.co/datasets/allenai/c4) dataset + +We'll use the [allenai/c4](https://huggingface.co/datasets/allenai/c4) dataset from HuggingFace, a processed version of Google's C4 (Colossal Clean Crawled Corpus). This dataset is organized into subsets (e.g., `en`, `es`), each containing data splits (e.g., `train`, `validation`). + +```bash +# Replace base_output_directory with your bucket and run-name with your run name +python3 -m maxtext.trainers.pre_train.train \ + base_output_directory=gs:// \ + run_name= \ + model_name=deepseek2-16b \ + per_device_batch_size=1 \ + steps=10 \ + max_target_length=2048 \ + enable_checkpointing=false \ + dataset_type=hf \ + hf_path=allenai/c4 \ + hf_data_dir=en \ + train_split=train \ + tokenizer_type=huggingface \ + tokenizer_path=deepseek-ai/DeepSeek-V2-Lite +``` + +#### Dataset configuration parameters + +- `dataset_type`: set to `hf` +- `hf_path`: set to the HF dataset you want to use +- `hf_data_dir`: path to the data on the HuggingFace repository + +#### Evaluation during training (optional) + +To add periodic evaluation during training, specify an evaluation interval and split: + +```bash +eval_interval=5 \ +eval_steps=10 \ +hf_eval_split=validation +``` + +This runs 10 evaluation steps every 5 training steps using the `validation` split. + +For a more comprehensive description of the available configuration parameters, see [HuggingFace Pipeline](../guides/data_input_pipeline/data_input_hf.md). 
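+
+#### Gated datasets and tokenizers (optional)
+
+If the dataset or tokenizer comes from a gated HuggingFace repository, you also need to pass an access token. The snippet below is a minimal sketch using the `hf_access_token` flag described in [HuggingFace Pipeline](../guides/data_input_pipeline/data_input_hf.md); the token value shown is only a placeholder.
+
+```bash
+# Append to the training command above. Request access to the gated dataset or
+# model on HuggingFace and generate a token in your HuggingFace settings first;
+# replace the placeholder value with your own token.
+hf_access_token=hf_xxxxxxxxxxxxx
+```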
+ +### TFDS pipeline + +The TensorFlow Datasets (TFDS) pipeline uses datasets in TFRecord format, which is performant and widely supported in the TensorFlow ecosystem. + +To get started, you need to: + +1. **Create a Cloud Storage bucket** for your dataset +2. **Download dataset** to your bucket + +You can use [the `download_dataset.sh` script provided by MaxText](https://github.com/AI-Hypercomputer/maxtext/blob/main/tools/data_generation/download_dataset.sh) to download the AllenAI C4 dataset: + +```bash +bash download_dataset.sh +``` + +#### Example: training with TFDS + +```bash +# Replace base_output_directory and dataset_path with your buckets +python3 -m maxtext.trainers.pre_train.train \ + base_output_directory=gs:// \ + run_name=demo \ + model_name=deepseek2-16b \ + per_device_batch_size=1 \ + steps=10 \ + max_target_length=2048 \ + enable_checkpointing=false \ + dataset_type=tfds \ + dataset_path=gs:// \ + dataset_name='c4/en:3.0.1' \ + train_split=train \ + tokenizer_type=huggingface \ + tokenizer_path=deepseek-ai/DeepSeek-V2-Lite +``` + +#### Dataset configuration parameters + +- `dataset_type`: Set to `tfds` +- `dataset_path`: Cloud Storage bucket containing the dataset (e.g., `gs://`) +- `dataset_name`: Subdirectory path within the bucket (e.g., `c4/en:3.0.1`) +- `train_split`: Split name for training (typically `train`, corresponds to `*-train.tfrecord-*` files) + +The pipeline reads from files matching the pattern: + +``` +gs:///c4/en/3.0.1/c4-train.tfrecord-0000-of-01024 +``` + +#### Evaluation during training (optional) + +To add periodic evaluation during training, specify an evaluation interval: + +```bash +# Add to your command +eval_interval=5 \ +eval_steps=10 \ +eval_dataset_name='c4/en:3.0.1' \ +eval_split=validation +``` + +- `eval_dataset_name`: Can be different from `dataset_name` if evaluating on a different dataset +- `eval_split`: Split name for evaluation (corresponds to `*-validation.tfrecord-*` files) + +For comprehensive TFDS configuration, see [TFDS Pipeline Guide](../guides/data_input_pipeline/data_input_tfds.md). + +## 4. Deployment options + +Choose your deployment method based on your scale and infrastructure. + +- **Localhost / Single VM:** Best for getting started and testing configurations on a single machine with a single TPU or GPU. See how to [run MaxText via localhost](./run_maxtext_localhost) or get specific instructions for [running on a single-host GPU](./run_maxtext_single_host_gpu). +- **XPK (Google Kubernetes Engine):** Best for large-scale training on TPU or GPU clusters. See how to [run MaxText via XPK](./run_maxtext_via_xpk). +- **Pathways:** Best for large-scale multi-host JAX jobs on TPUs. See how to [run MaxText via Pathways](./run_maxtext_via_pathways). +- **Decoupled Mode:** Best for local testing and development without Google Cloud dependencies. See how to [run MaxText in Decoupled Mode](./decoupled_mode). + +## 5. Monitoring training progress + +### Understanding logs + +MaxText produces detailed logs during training. 
Here's what to look for: + +``` +completed step: 100, seconds: 1.234, TFLOP/s/device: 156.789, Tokens/s/device: 10234.567, total_weights: 8192, loss: 3.456 +``` + +- `step`: Current training step +- `seconds`: Time taken for this step +- `TFLOP/s/device`: Compute throughput per device +- `Tokens/s/device`: Token processing rate per device +- `total_weights`: Number of actual tokens processed (excluding padding) +- `loss`: Training loss value + +For detailed log interpretation, see [Understand Logs and Metrics](../guides/monitoring_and_debugging/understand_logs_and_metrics). + +## 6. Complete pre-training example + +Here's a complete example that combines all the concepts, using the HuggingFace pipeline to pre-train a Llama3-8B model on the C4 dataset: + +```bash +# Pre-training Llama3-8B on C4 dataset using HuggingFace pipeline +python3 -m maxtext.trainers.pre_train.train \ + model_name=llama3-8b \ + run_name=llama3_c4_pretrain \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ + dataset_type=hf \ + hf_path=allenai/c4 \ + hf_data_dir=en \ + train_split=train \ + tokenizer_type=huggingface \ + tokenizer_path=meta-llama/Meta-Llama-3-8B \ + steps=100000 \ + per_device_batch_size=16 \ + max_target_length=4096 \ + learning_rate=3e-4 \ + warmup_steps=2000 \ + checkpoint_period=5000 \ + log_period=100 \ + enable_tensorboard=true \ + eval_interval=1000 \ + eval_steps=100 \ + hf_eval_split=validation \ + enable_goodput_recording=true +``` diff --git a/docs/tutorials.md b/docs/tutorials.md index 0528b1ab39..ee0b66252d 100644 --- a/docs/tutorials.md +++ b/docs/tutorials.md @@ -18,37 +18,31 @@ Explore our tutorials to learn how to use MaxText, from your first run to advanced post-training techniques. -::::{grid} 1 2 2 2 -:gutter: 2 - -:::{grid-item-card} 🚀 Getting Started +````{grid} 1 2 2 2 +--- +gutter: 2 +--- +```{grid-item-card} 🚀 Getting Started :link: tutorials/first_run :link-type: doc Installation, prerequisites, verification, and your first training run. -::: - -:::{grid-item-card} 📚 Pre-training -:link: tutorials/pretraining -:link-type: doc - -Step-by-step guides for pre-training with real datasets like C4 using HuggingFace, Grain, or TFDS. -::: +``` -:::{grid-item-card} 🧩 Post-training +```{grid-item-card} 🧩 Post-training :link: tutorials/post_training_index :link-type: doc Techniques for SFT, RL, and other post-training workflows on TPU. -::: +``` -:::{grid-item-card} 📊 Inference +```{grid-item-card} 📊 Inference :link: tutorials/inference :link-type: doc Step-by-step guides for running inference of MaxText models on vLLM. -::: -:::: +``` +```` ```{toctree} --- @@ -56,7 +50,6 @@ hidden: maxdepth: 1 --- tutorials/first_run.md -tutorials/pretraining.md tutorials/post_training_index.md tutorials/inference.md ``` diff --git a/docs/tutorials/posttraining/gepa_optimization.md b/docs/tutorials/posttraining/gepa_optimization.md index 2b705aad20..5f4f28aa36 100644 --- a/docs/tutorials/posttraining/gepa_optimization.md +++ b/docs/tutorials/posttraining/gepa_optimization.md @@ -2,7 +2,7 @@ ## Overview -This document explains how to use **GEPA** (Generic Evaluation and Prompt Adaptation) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. 
A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb). +This document explains how to use **GEPA** (Generic Evaluation and Prompt Adaptation) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/maxtext_with_gepa.ipynb). ## How GEPA Optimization Works @@ -44,7 +44,7 @@ Prompt optimization frameworks like GEPA are highly sensitive to the reward sign ## Tutorial Notebook A complete, runnable tutorial is available in the repository as a Jupyter Notebook: -[maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example) +[maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example) This notebook walks through: diff --git a/docs/tutorials/posttraining/knowledge_distillation.md b/docs/tutorials/posttraining/knowledge_distillation.md index ea6222a09c..2280990e66 100644 --- a/docs/tutorials/posttraining/knowledge_distillation.md +++ b/docs/tutorials/posttraining/knowledge_distillation.md @@ -348,7 +348,7 @@ python3 -m maxtext.trainers.post_train.distillation.train_distill \ profiler=xplane ``` -The schedule values above are a strong default for same-size pruning recovery. See [α and β schedule guide](../../guides/distillation.md#alpha-schedule-guide) for other scenarios (large teacher → small student, logit-only, aggressive recovery, etc.). +The schedule values above are a strong default for same-size pruning recovery. See [α and β schedule guide](../../guides/distillation.md#%CE%B1-alpha-schedule-guide) for other scenarios (large teacher → small student, logit-only, aggressive recovery, etc.). > **Note:** `distill_layer_indices` is applied to **both** student and teacher activations identically. When the two have different depths (Pattern A or a depth-pruned Pattern B), every index must be valid on the *smaller* side, and same-numbered layers are aligned across the two models. The trainer cannot map student layer *i* to teacher layer *f(i)* for arbitrary *f*. If the depths differ significantly, prefer logit-only distillation (`distill_beta=0`). diff --git a/docs/tutorials/pretraining.md b/docs/tutorials/pretraining.md deleted file mode 100644 index a1ae985db0..0000000000 --- a/docs/tutorials/pretraining.md +++ /dev/null @@ -1,172 +0,0 @@ - - -(pretraining)= - -# Pre-training - -In this tutorial, we introduce how to run pretraining with real datasets. While synthetic data is commonly used for benchmarking, we rely on real datasets to obtain meaningful weights. Currently, MaxText supports three dataset input pipelines: HuggingFace, Grain, and TensorFlow Datasets (TFDS). We will walk you through: setting up dataset, modifying the [dataset configs](https://github.com/AI-Hypercomputer/maxtext/blob/f11f5507c987fdb57272c090ebd2cbdbbadbd36c/src/maxtext/configs/base.yml#L631-L675) and [tokenizer configs](https://github.com/AI-Hypercomputer/maxtext/blob/f11f5507c987fdb57272c090ebd2cbdbbadbd36c/src/maxtext/configs/base.yml#L566) for training, and optionally enabling evaluation. 
- -```{note} -Before starting this tutorial, ensure you have installed MaxText following the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). For pre-training, install `maxtext[tpu]` for TPUs or `maxtext[cuda12]` for GPUs. -``` - -To start with, we focus on HuggingFace datasets for convenience. - -- Later on, we will give brief examples for Grain and TFDS. For a comprehensive guide, see the [Data Input Pipeline](../guides/data_input_pipeline.md) topic. -- For demonstration, we use Deepseek-V2-Lite model and C4 dataset. C4 stands for "Colossal Clean Crawled Corpus", a high-quality pretraining dataset first introduced by Google's [T5](https://arxiv.org/pdf/1910.10683) work. Feel free to try other models and datasets. - -## 1. HuggingFace pipeline - -We use the HuggingFace dataset [allenai/c4](https://huggingface.co/datasets/allenai/c4), a processed version of Google's C4. This dataset is organized into subsets (e.g., `en`, `es`), and each subset contains data splits (e.g., `train`, `validation`). - -**Data preparation**: You don't need to download data, as the pipeline can stream data directly from the HuggingFace Hub. Alternatively, it can stream from a Cloud Storage bucket; see the [HuggingFace Pipeline](../guides/data_input_pipeline/data_input_hf.md) page. - -We can use this **command** for pretraining: - -```bash -# replace base_output_directory with your bucket -python3 -m maxtext.trainers.pre_train.train \ -base_output_directory=gs://runner-maxtext-logs run_name=demo \ -model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \ -dataset_type=hf hf_path=allenai/c4 hf_data_dir=en train_split=train \ -tokenizer_type=huggingface tokenizer_path=deepseek-ai/DeepSeek-V2-Lite -``` - -**Dataset config**: - -- `dataset_type`: `hf` -- `hf_path`: the HuggingFace dataset repository is `allenai/c4` -- `hf_data_dir`: the subset is `en`, corresponding to English data. -- `train_split`: `train`. Training will use the `train` split. - -The above command runs training only: `steps=10` on the `train` split, for `en` subset of `allenai/c4`. The log shows: - -``` -completed step: 1, seconds: 0.287, TFLOP/s/device: 110.951, Tokens/s/device: 7131.788, total_weights: 7517, loss: 12.021 -... -completed step: 9, seconds: 1.010, TFLOP/s/device: 31.541, Tokens/s/device: 2027.424, total_weights: 7979, loss: 9.436 -``` - -The total weights is the number of real tokens processed in each step. More explanation can be found in [Understand Logs and Metrics](../guides/monitoring_and_debugging/understand_logs_and_metrics.md#understand-logs-and-metrics) page. - -**Evaluation config (optional)**: - -To add evaluation steps, we can specify a positive evaluation interval and the dataset split, for instance `eval_interval=5 eval_steps=10 hf_eval_split=validation`. For every 5 training step, we run evaluation for 10 steps, using the `validation` split. In the log, you will additionally see: - -``` -Completed eval step 0 -... -Completed eval step 9 -eval metrics after step: 4, loss=9.855, total_weights=75264.0 -Completed eval step 0 -... -Completed eval step 9 -eval metrics after step: 9, loss=9.420, total_weights=75264.0 -``` - -**Tokenizer config**: - -- `tokenizer_type`: `huggingface`. Note HuggingFace input pipeline only supports HuggingFace tokenizer. -- `tokenizer_path`: `deepseek-ai/DeepSeek-V2-Lite`, corresponding to the HuggingFace [model repository](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite/tree/main). 
- -**HuggingFace access token (optional)**: - -- For a [gated dataset](https://huggingface.co/docs/hub/en/datasets-gated) or a tokenizer from a [gated model](https://huggingface.co/docs/hub/en/models-gated), you need to request access on HuggingFace and provide `hf_access_token=` in the command. For instance, [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b) is a gated model. - -## 2. Grain pipeline - -Grain is a library for reading data for training and evaluating JAX models. It is the recommended input pipeline for determinism and resilience! It supports data formats like ArrayRecord and Parquet. You can check [Grain pipeline](../guides/data_input_pipeline/data_input_grain.md) for more details. - -**Data preparation**: You need to download data to a Cloud Storage bucket, and read data via Cloud Storage Fuse with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). - -- For example, we can mount the bucket `gs://maxtext-dataset` on the local path `/tmp/gcsfuse` before training - ```bash - bash setup_gcsfuse.sh DATASET_GCS_BUCKET=maxtext-dataset MOUNT_PATH=/tmp/gcsfuse - ``` -- After training, we unmount the local path - ```bash - fusermount -u /tmp/gcsfuse - ``` - -This **command** shows pretraining with Grain pipeline, along with evaluation: - -```bash -# replace DATASET_GCS_BUCKET and base_output_directory with your buckets -python3 -m maxtext.trainers.pre_train.train \ -base_output_directory=gs://runner-maxtext-logs run_name=demo \ -model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \ -dataset_type=grain grain_file_type=arrayrecord grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* grain_worker_count=2 \ -eval_interval=5 eval_steps=10 grain_eval_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record* \ -tokenizer_type=huggingface tokenizer_path=deepseek-ai/DeepSeek-V2-Lite -``` - -**Dataset config**: - -- `dataset_type`: `grain` -- `grain_file_type`: `arrayrecord`. We also support `parquet`. -- `grain_train_files`: `/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record*`, which is a regex pattern. -- `grain_worker_count`: `2`. This parameter controls the number of child processes used by Grain, which should be tuned for performance. - -**Evaluation config (optional)**: - -- `eval_interval=5 eval_steps=10`: after every 5 train steps, perform 10 evaluation steps -- `grain_eval_files`: `/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record*`, which is a regex pattern. - -**Tokenizer config**: - -- The Grain pipeline supports tokenizer_type: `sentencepiece, huggingface` -- Here we use the same `huggingface` tokenizer as in Section 1. If you use a HuggingFace tokenizer from a gated model, you will need to provide `hf_access_token`. - -## 3. TFDS pipeline - -The TensorFlow Datasets (TFDS) pipeline uses dataset in the TFRecord format. You can check [TFDS Pipeline](../guides/data_input_pipeline/data_input_tfds.md) for more details. - -**Data preparation**: You need to download data to a [Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets), and the pipeline streams data from the bucket. 
- -- To download the AllenAI C4 dataset to your bucket, you can use [download_dataset.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/tools/data_generation/download_dataset.sh): `bash download_dataset.sh ` - -This **command** shows pretraining with TFDS pipeline, along with evaluation: - -```bash -# replace base_output_directory and dataset_path with your buckets -python3 -m maxtext.trainers.pre_train.train \ -base_output_directory=gs://runner-maxtext-logs run_name=demo \ -model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \ -dataset_type=tfds dataset_path=gs://maxtext-dataset dataset_name='c4/en:3.0.1' train_split=train \ -eval_interval=5 eval_steps=10 eval_dataset_name='c4/en:3.0.1' eval_split=validation \ -tokenizer_type=huggingface tokenizer_path=deepseek-ai/DeepSeek-V2-Lite -``` - -**Dataset config**: - -- `dataset_type`: `tfds` -- `dataset_path`: the cloud storage bucket is `gs://maxtext-dataset` -- `dataset_name`: `c4/en:3.0.1` corresponds to the subdirectory inside dataset_path `gs://maxtext-dataset/c4/en/3.0.1` -- `train_split`: `train`, corresponds to `*-train.tfrecord-*` files -- Putting together, we are training on files like `gs://maxtext-dataset/c4/en/3.0.1/c4-train.tfrecord-0000-of-01024` - -**Evaluation config (optional)**: - -- `eval_interval=5 eval_steps=10`: after every 5 train steps, perform 10 evaluation steps -- `eval_dataset_name`: `c4/en:3.0.1`, corresponds to the subdirectory inside dataset_path `gs://maxtext-dataset/c4/en/3.0.1`. It can be different from `dataset_name`. -- `eval_split`: `validation`, corresponds to `*-validation.tfrecord-*` files -- Putting together, we are evaluating on files like `gs://maxtext-dataset/c4/en/3.0.1/c4-validation.tfrecord-00000-of-00008` - -**Tokenizer config**: - -- TFDS pipeline supports tokenizer_type: `sentencepiece, huggingface, tiktoken` -- Here we use the same `huggingface` tokenizer as in Section 1. If you use a HuggingFace tokenizer from a gated model, you will need to provide `hf_access_token`.