diff --git a/docs/guides/checkpointing_solutions/gcs_checkpointing.md b/docs/guides/checkpointing_solutions/gcs_checkpointing.md index b80a9c536b..e78a6939ef 100644 --- a/docs/guides/checkpointing_solutions/gcs_checkpointing.md +++ b/docs/guides/checkpointing_solutions/gcs_checkpointing.md @@ -28,30 +28,30 @@ startup. The first valid condition met is the one executed: ### MaxText configuration -Flag | Description | Type | Default -:------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-------- | :------ -`enable_checkpointing` | A master switch to enable (`True`) or disable (`False`) saving checkpoints during the training run. | `boolean` | `False` -`async_checkpointing` | When set to (`True`), this flag makes checkpoint saving asynchronous. The training step is only blocked for the minimal time needed to capture the model's state, and the actual writing to storage happens in a background thread. This is highly recommended for performance. It's enabled by default. | `boolean` | `True` -`checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved. | `integer` | `10000` -`enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage.
**Note**: This feature is only compatible with training jobs that utilize a Distributed Data Parallel (DDP) strategy. | `boolean` | `False` -`checkpoint_todelete_subdir` | Subdirectory to move checkpoints to before deletion. For example: `".todelete"` (Ignored if directory is prefixed with gs://) | `string` | `""` -`checkpoint_todelete_full_path` | Full path to move checkpoints to before deletion. | `string` | `""` -`load_parameters_path` | Specifies a path to a checkpoint directory to load a parameter only checkpoint.
**Example**: `"gs://my-bucket/my-previous-run/checkpoints/items/1000"` | `string` | `""` (disabled) -`load_full_state_path` | Specifies a path to a checkpoint directory to load a full checkpoint including optimizer state and step count from a specific directory.
**Example**: `"gs://my-bucket/my-interrupted-run/checkpoints/items/500"` | `string` | `""` (disabled) -`lora_input_adapters_path` | Specifies a parent directory containing LoRA (Low-Rank Adaptation) adapters. | `string` | `""` (disabled) -`force_unroll` | If `True`, unrolls the loop when generating a parameter-only checkpoint. | `boolean` | `False` +| Flag | Description | Type | Default | +| :------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-------- | :-------------- | +| `enable_checkpointing` | A master switch to enable (`True`) or disable (`False`) saving checkpoints during the training run. | `boolean` | `False` | +| `async_checkpointing` | When set to (`True`), this flag makes checkpoint saving asynchronous. The training step is only blocked for the minimal time needed to capture the model's state, and the actual writing to storage happens in a background thread. This is highly recommended for performance. It's enabled by default. | `boolean` | `True` | +| `checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved. | `integer` | `10000` | +| `enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage.
**Note**: This feature is only compatible with training jobs that utilize a Distributed Data Parallel (DDP) strategy. | `boolean` | `False` |
+| `checkpoint_todelete_subdir` | Subdirectory to move checkpoints to before deletion. For example: `".todelete"` (Ignored if directory is prefixed with `gs://`) | `string` | `""` |
+| `checkpoint_todelete_full_path` | Full path to move checkpoints to before deletion. | `string` | `""` |
+| `load_parameters_path` | Specifies a path to a checkpoint directory to load a parameter-only checkpoint.
**Example**: `"gs://my-bucket/my-previous-run/checkpoints/items/1000"` | `string` | `""` (disabled) | +| `load_full_state_path` | Specifies a path to a checkpoint directory to load a full checkpoint including optimizer state and step count from a specific directory.
**Example**: `"gs://my-bucket/my-interrupted-run/checkpoints/items/500"` | `string` | `""` (disabled) | +| `lora_input_adapters_path` | Specifies a parent directory containing LoRA (Low-Rank Adaptation) adapters. | `string` | `""` (disabled) | +| `force_unroll` | If `True`, unrolls the loop when generating a parameter-only checkpoint. | `boolean` | `False` | ## Storage and format configuration These settings control the underlying storage mechanism ([Orbax](https://orbax.readthedocs.io)) for performance and compatibility. -Flag | Description | Type | Default -:----------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------- | :------ -`checkpoint_storage_target_data_file_size_bytes` | Sets a target file size for Orbax to chunk large arrays into smaller physical files. This can dramatically speed up loading over a network and in distributed environments. | `integer` | `2147483648` (2 GB) -`checkpoint_storage_use_ocdbt` | If `True`, uses the TensorStore **OCDBT** (Optionally-Cooperative Distributed B+ Tree)) key-value store as the underlying storage format for checkpointing. Set to `0` for Pathways. | `boolean` | `True` -`checkpoint_storage_use_zarr3` | If `True`, uses the Zarr v3 storage format within Orbax, which is optimized for chunked, compressed, N-dimensional arrays. Set to `0` for Pathways. | `boolean` | `True` -`checkpoint_storage_concurrent_gb` | Controls the concurrent I/O limit in gigabytes for the checkpointer. Larger models may require increasing this value to avoid I/O bottlenecks. | `integer` | `96` -`enable_orbax_v1` | A boolean flag to explicitly enable features and behaviors from Orbax version 1. 
| `boolean` | `False` -`source_checkpoint_layout` | Specifies the format of the checkpoint being **loaded**. This tells the system how to interpret the files at the source path.
**Options**: `"orbax"`, `"safetensors"` | `string` | `"orbax"`
-`checkpoint_conversion_fn` | A user-defined function to process a loaded checkpoint dictionary into a format that the model can understand. This is essential for loading checkpoints from different frameworks or formats (e.g., converting keys from a Hugging Face SafeTensors file). | `function` or `None` | `None`
+| Flag | Description | Type | Default |
+| :--- | :--- | :--- | :--- |
+| `checkpoint_storage_target_data_file_size_bytes` | Sets a target file size for Orbax to chunk large arrays into smaller physical files. This can dramatically speed up loading over a network and in distributed environments. | `integer` | `2147483648` (2 GB) |
+| `checkpoint_storage_use_ocdbt` | If `True`, uses the TensorStore **OCDBT** (Optionally-Cooperative Distributed B+ Tree) key-value store as the underlying storage format for checkpointing. Set to `0` for Pathways. | `boolean` | `True` |
+| `checkpoint_storage_use_zarr3` | If `True`, uses the Zarr v3 storage format within Orbax, which is optimized for chunked, compressed, N-dimensional arrays. Set to `0` for Pathways. | `boolean` | `True` |
+| `checkpoint_storage_concurrent_gb` | Controls the concurrent I/O limit in gigabytes for the checkpointer. Larger models may require increasing this value to avoid I/O bottlenecks. | `integer` | `96` |
+| `enable_orbax_v1` | A boolean flag to explicitly enable features and behaviors from Orbax version 1. | `boolean` | `False` |
+| `source_checkpoint_layout` | Specifies the format of the checkpoint being **loaded**. This tells the system how to interpret the files at the source path.
**Options**: `"orbax"`, `"safetensors"` | `string` | `"orbax"` | +| `checkpoint_conversion_fn` | A user-defined function to process a loaded checkpoint dictionary into a format that the model can understand. This is essential for loading checkpoints from different frameworks or formats (e.g., converting keys from a Hugging Face SafeTensors file). | `function` or `None` | `None` | diff --git a/docs/guides/data_input_pipeline/data_input_grain.md b/docs/guides/data_input_pipeline/data_input_grain.md index 5a7d66981d..5522020b3f 100644 --- a/docs/guides/data_input_pipeline/data_input_grain.md +++ b/docs/guides/data_input_pipeline/data_input_grain.md @@ -110,10 +110,10 @@ Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pr ```sh bash src/dependencies/scripts/setup_gcsfuse.sh \ -DATASET_GCS_BUCKET=maxtext-dataset \ +DATASET_GCS_BUCKET=gs:// \ MOUNT_PATH=/tmp/gcsfuse && \ python3 -m maxtext.trainers.pre_train.train \ -run_name= base_output_directory=gs:// \ +run_name= base_output_directory=gs:// \ dataset_type=grain \ grain_file_type=arrayrecord # or parquet \ grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* \ diff --git a/docs/guides/monitoring_and_debugging/features_and_diagnostics.md b/docs/guides/monitoring_and_debugging/features_and_diagnostics.md index a6952fae04..4a0f5efbef 100644 --- a/docs/guides/monitoring_and_debugging/features_and_diagnostics.md +++ b/docs/guides/monitoring_and_debugging/features_and_diagnostics.md @@ -87,7 +87,7 @@ export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true" python3 -m maxtext.trainers.pre_train.train run_name=example_load_compile \ compiled_trainstep_file=my_compiled_train.pickle \ global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \ - base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket + base_output_directory=gs:// dataset_path=gs:// ``` In the save step of example 2 above we included exporting the compiler 
flag `LIBTPU_INIT_ARGS` and `learning_rate` because those affect the compiled object `my_compiled_train.pickle.` The sizes of the model (e.g. `global_parameter_scale`, `max_sequence_length` and `per_device_batch`) are fixed when you initially compile via `compile_train.py`, you will see a size error if you try to run the saved compiled object with different sizes than you compiled with. However a subtle note is that the **learning rate schedule** is also fixed when you run `compile_train` - which is determined by both `steps` and `learning_rate`. The optimizer parameters such as `adam_b1` are passed only as shaped objects to the compiler - thus their real values are determined when you run `train.py`, not during the compilation. If you do pass in different shapes (e.g. `per_device_batch`), you will get a clear error message reporting that the compiled signature has different expected shapes than what was input. If you attempt to run on different hardware than the compilation targets requested via `compile_topology`, you will get an error saying there is a failure to map the devices from the compiled to your real devices. Using different XLA flags or a LIBTPU than what was compiled will probably run silently with the environment you compiled in without error. However there is no guaranteed behavior in this case; you should run in the same environment you compiled in. @@ -125,7 +125,7 @@ export XLA_FLAGS="--xla_gpu_enable_async_collectives=true" python3 -m maxtext.trainers.pre_train.train run_name=example_load_compile \ compiled_trainstep_file=my_compiled_train.pickle \ attention=dot_product global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \ - base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket + base_output_directory=gs:// dataset_path=gs:// ``` As in the TPU case, note that the compilation environment must match the execution environment, in this case by setting the same `XLA_FLAGS`. 
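To make the "fixed shapes" behavior above concrete, here is a small illustrative sketch (plain Python, not MaxText's or JAX's actual implementation): a saved compiled train step effectively records the input shapes it was traced with, and invoking it with different shapes fails immediately with a signature error.

```python
# Illustrative toy only -- JAX/MaxText implement this internally. A compiled
# train step records the batch shapes fixed at compile time; running it with
# a different per_device_batch_size or max_target_length is rejected.
class CompiledTrainStep:
    def __init__(self, batch_shape):
        self.batch_shape = batch_shape  # fixed when the compile step runs

    def __call__(self, batch_shape):
        if batch_shape != self.batch_shape:
            raise ValueError(
                f"compiled signature expects {self.batch_shape}, got {batch_shape}"
            )
        return "train step ok"

# Compiled for per_device_batch_size=4, max_target_length=2048:
compiled = CompiledTrainStep(batch_shape=(4, 2048))
compiled((4, 2048))   # matching shapes run
# compiled((8, 2048)) would raise ValueError, mirroring the size error you
# see when running a saved compiled object with different sizes.
```

Optimizer values such as `adam_b1`, by contrast, are only shape-traced, which is why they can differ at run time without recompilation.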
diff --git a/docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md b/docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md index 81206bff6b..dccaebc02f 100644 --- a/docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md +++ b/docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md @@ -37,8 +37,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu ``` python3 -m maxtext.trainers.pre_train.train \ run_name=${USER}-tpu-job \ - base_output_directory="gs://your-output-bucket/" \ - dataset_path="gs://your-dataset-bucket/" \ + base_output_directory="gs:///" \ + dataset_path="gs:///" \ steps=100 \ log_period=10 \ managed_mldiagnostics=True @@ -49,8 +49,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu ``` python3 -m maxtext.trainers.pre_train.train \ run_name=${USER}-tpu-job \ - base_output_directory="gs://your-output-bucket/" \ - dataset_path="gs://your-dataset-bucket/" \ + base_output_directory="gs:///" \ + dataset_path="gs:///" \ steps=100 \ log_period=10 \ profiler=xplane \ @@ -62,8 +62,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu ``` python3 -m maxtext.trainers.pre_train.train \ run_name=${USER}-tpu-job \ - base_output_directory="gs://your-output-bucket/" \ - dataset_path="gs://your-dataset-bucket/" \ + base_output_directory="gs:///" \ + dataset_path="gs:///" \ steps=100 \ log_period=10 \ profiler=xplane \ diff --git a/docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md b/docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md index e381e7e8ac..e9833d7e6e 100644 --- a/docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md +++ b/docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md @@ -20,11 +20,11 @@ When you run a training job, MaxText produces detailed output logs. 
This guide shows you how to interpret these logs to understand your configuration and monitor performance. -To start, run a simple pretraining job on a single-host TPU. For instance, we can run the following command on TPU v5p-8. The resulting log is used as an example throughout this guide. +To start, run a simple pretraining job on a single-host TPU. For instance, we can run the following command on TPU v5p-8. The resulting log is used as an example throughout this guide. Replace `` in the command below to point to the GCS bucket you want to use for your output. ```bash python3 -m maxtext.trainers.pre_train.train \ -base_output_directory=gs://runner-maxtext-logs run_name=demo \ +base_output_directory=gs:// run_name=demo \ model_name=deepseek2-16b \ per_device_batch_size=24 max_target_length=2048 steps=10 dataset_type=synthetic enable_checkpointing=false ``` @@ -80,23 +80,23 @@ Config param data_sharding: (('data', 'stage', 'fsdp', 'fsdp_transpose', 'sequen This also includes the **output paths** for your run artifacts. ``` -Config param base_output_directory: gs://runner-maxtext-logs +Config param base_output_directory: gs:// Config param run_name: demo -Config param metrics_dir: gs://runner-maxtext-logs/demo/metrics/ -Config param tensorboard_dir: gs://runner-maxtext-logs/demo/tensorboard/ -Config param checkpoint_dir: gs://runner-maxtext-logs/demo/checkpoints/ +Config param metrics_dir: gs:///demo/metrics/ +Config param tensorboard_dir: gs:///demo/tensorboard/ +Config param checkpoint_dir: gs:///demo/checkpoints/ ``` ### Understanding output paths -MaxText organizes all of your run's artifacts into a main output directory. The primary location for your run is constructed by combining the `base_output_directory` and the `run_name` you specify in your command. Based on the logs above, the base path for this specific run is `gs://runner-maxtext-logs/demo`. +MaxText organizes all of your run's artifacts into a main output directory. 
The primary location for your run is constructed by combining the `base_output_directory` and the `run_name` you specify in your command. Based on the logs above, the base path for this specific run is `gs:///demo`. Within this base path, MaxText creates several subdirectories for different types of artifacts. Many of these are optional and only created if you enable them with a specific flag. - **TensorBoard logs (`tensorboard/`)** - Flag: `enable_tensorboard=True` (default) - - Path: `gs://runner-maxtext-logs/demo/tensorboard/` + - Path: `gs:///demo/tensorboard/` - **Profiler traces (`tensorboard/plugins/profile/`)** @@ -106,17 +106,17 @@ Within this base path, MaxText creates several subdirectories for different type - **Metrics in plain text (`metrics/`)** - Flag: `gcs_metrics=True` - - Path: `gs://runner-maxtext-logs/demo/metrics/` + - Path: `gs:///demo/metrics/` - **Configuration file (`config.yml`)** - Flag: `save_config_to_gcs=True` - - Path: `gs://runner-maxtext-logs/demo/config.yml` + - Path: `gs:///demo/config.yml` - **Checkpoints (`checkpoints/`)** - Flag: `enable_checkpointing=True` - - Path: `gs://runner-maxtext-logs/demo/checkpoints/` + - Path: `gs:///demo/checkpoints/` To generate all optional artifacts in one run, you can set the corresponding flags in the command line, like in the example below. @@ -124,7 +124,7 @@ This command enables tensorboard, profiler, text metrics, config saving, and che ```bash python3 -m maxtext.trainers.pre_train.train \ -base_output_directory=gs://runner-maxtext-logs run_name=demo2 \ +base_output_directory=gs:// run_name=demo2 \ model_name=deepseek2-16b \ per_device_batch_size=24 max_target_length=2048 steps=10 dataset_type=synthetic \ enable_tensorboard=True \ diff --git a/docs/install_maxtext.md b/docs/install_maxtext.md index 47d31e93cb..ec7ab961ed 100644 --- a/docs/install_maxtext.md +++ b/docs/install_maxtext.md @@ -112,7 +112,7 @@ environment to avoid dependency conflicts. 
cd maxtext ``` -:::\{only} is_not_latest +````{only} is_not_latest By default, cloning the repository provides the latest version (**HEAD**). If you wish to use the latest features, please follow the [latest guide](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). @@ -126,7 +126,7 @@ before proceeding with the installation. git checkout |version| ``` -::: +```` 2. Create virtual environment: diff --git a/docs/reference/core_concepts/quantization.md b/docs/reference/core_concepts/quantization.md index dae117a85a..1b2ef6518d 100644 --- a/docs/reference/core_concepts/quantization.md +++ b/docs/reference/core_concepts/quantization.md @@ -87,7 +87,7 @@ Common options for the `quantization` flag when using Qwix include: Here is an example of how to run a training job with int8 quantization enabled via Qwix: ```bash -python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?} base_output_directory=gs:// dataset_type=synthetic use_qwix_quantization=true quantization='int8' +python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?} base_output_directory=gs:// dataset_type=synthetic use_qwix_quantization=true quantization='int8' ``` #### The Qwix Interception API @@ -142,7 +142,7 @@ When using AQT, you can pass one of the following values to the `quantization` f #### Example command for AQT ```bash -python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?} base_output_directory=gs:// dataset_type=synthetic use_qwix_quantization=false quantization='int8' +python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?} base_output_directory=gs:// dataset_type=synthetic use_qwix_quantization=false quantization='int8' ``` Note that `use_qwix_quantization` is not set to `True`. 
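For intuition about what `quantization='int8'` means numerically, here is a minimal symmetric abs-max quantize/dequantize sketch. This is illustrative only: Qwix and AQT use their own calibration logic and fused low-precision kernels, but the core round-trip looks like this.

```python
# Symmetric abs-max int8 quantization sketch (illustrative only; Qwix and
# AQT implement their own calibration and fused low-precision kernels).
def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0  # largest magnitude maps to 127
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int8(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.1, -0.5, 2.54, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each value is recovered to within half a quantization step (scale / 2);
# that rounding error is the accuracy/memory trade-off int8 exploits.
```

The per-tensor `scale` is why outliers matter: one large weight stretches the quantization step for every other value in the tensor.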
diff --git a/docs/run_maxtext/run_maxtext_localhost.md b/docs/run_maxtext/run_maxtext_localhost.md
index 843c52a5f3..1a148cc6d8 100644
--- a/docs/run_maxtext/run_maxtext_localhost.md
+++ b/docs/run_maxtext/run_maxtext_localhost.md
@@ -26,7 +26,11 @@ MaxText uses a primary YAML file, `configs/base.yml`, to manage its settings. Th
 - `learning_rate`: The core hyperparameter for the optimizer.
 - Mode shape parameters: `base_num_decoder_layers`, `base_emb_dim`, `base_num_query_heads`, `base_num_kv_heads`, and `head_dim`.
 - **Override settings (optional):** You can modify training parameters in two ways: by editing `configs/base.yml` directly or by passing them as command-line arguments to the training script which is the recommended method. For example, to change the number of training steps, you can pass `--steps=500` when running `train.py`.
-- **Note**: You **must** update the variable `base_output_directory` which is initialized in `configs/base.yml` to point to a folder within the GCS bucket you just created (e.g., `gs://your-bucket-name/maxtext-output`).
+- **Note**: You **must** update the variable `base_output_directory` which is initialized in `configs/base.yml` to point to a folder within the GCS bucket you just created (e.g., `gs:///maxtext-output`). You can set an environment variable for that:
+
+```bash
+export BASE_OUTPUT_DIRECTORY=gs://
+```
 
 ## Development
 
@@ -40,12 +44,12 @@ Local development on a single host TPU/GPU VM is a convenient way to run MaxText
 
 #### Run a Test Training Job
 
-After the installation is complete, run a short training job using synthetic data to confirm everything is working correctly. This command trains a model for just 10 steps. Remember to replace `$YOUR_JOB_NAME` with a unique name for your run and `gs://` with the path to the GCS bucket you configured in the prerequisites.
+After the installation is complete, run a short training job using synthetic data to confirm everything is working correctly.
This command trains a model for just 10 steps. Remember to replace `$YOUR_JOB_NAME` with a unique name for your run and to set `$BASE_OUTPUT_DIRECTORY` with the path to the GCS bucket you configured in the prerequisites. ```bash python3 -m maxtext.trainers.pre_train.train \ run_name=${YOUR_JOB_NAME?} \ - base_output_directory=gs:// \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ dataset_type=synthetic \ steps=10 ``` @@ -59,7 +63,7 @@ To demonstrate model output, run the following command: ```bash python3 -m maxtext.inference.decode \ run_name=${YOUR_JOB_NAME?} \ - base_output_directory=gs:// \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ per_device_batch_size=1 ``` @@ -80,7 +84,7 @@ To use a pre-configured model for TPUs, you override the `model_name` parameter, python3 -m maxtext.trainers.pre_train.train \ model_name=llama3-8b \ run_name=${YOUR_JOB_NAME?} \ - base_output_directory=gs:// \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ dataset_type=synthetic \ steps=10 ``` @@ -94,7 +98,7 @@ python3 -m maxtext.trainers.pre_train.train \ python3 -m maxtext.trainers.pre_train.train \ model_name=qwen3-4b \ run_name=${YOUR_JOB_NAME?} \ - base_output_directory=gs:// \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ dataset_type=synthetic \ steps=10 ``` @@ -111,7 +115,7 @@ To use a GPU-optimized configuration, you should specify the path to the model's ```bash python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/gpu/models/mixtral_8x7b.yml \ run_name=${YOUR_JOB_NAME?} \ - base_output_directory=gs:// \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ dataset_type=synthetic \ steps=10 ``` @@ -126,7 +130,7 @@ This will load `gpu/mixtral_8x7b.yml`, which inherits from `base.yml`. 
```bash python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/gpu/models/llama3-8b.yml \ run_name=${YOUR_JOB_NAME?} \ - base_output_directory=gs:// \ + base_output_directory=${BASE_OUTPUT_DIRECTORY?} \ dataset_type=synthetic \ steps=10 ``` diff --git a/docs/run_maxtext/run_maxtext_via_xpk.md b/docs/run_maxtext/run_maxtext_via_xpk.md index c800366aee..12c3a733cc 100644 --- a/docs/run_maxtext/run_maxtext_via_xpk.md +++ b/docs/run_maxtext/run_maxtext_via_xpk.md @@ -144,8 +144,8 @@ This guide focuses on submitting workloads to an existing cluster. Cluster creat # region as your TPUs to minimize latency and costs. # You can list your buckets and their locations in the # [Cloud Console](https://console.cloud.google.com/storage/browser). - export BASE_OUTPUT_DIRECTORY= # e.g., gs://my-bucket/maxtext-runs - export DATASET_PATH="gs://your-dataset-bucket/" + export BASE_OUTPUT_DIRECTORY= # e.g., gs:///maxtext-runs + export DATASET_PATH="gs:///" ``` 2. **Configure gcloud CLI** diff --git a/docs/tutorials/first_run.md b/docs/tutorials/first_run.md index f04c6acb9c..3bc0ca4e68 100644 --- a/docs/tutorials/first_run.md +++ b/docs/tutorials/first_run.md @@ -24,7 +24,11 @@ This topic provides a basic introduction to get your MaxText workload up and run 1. To store logs and checkpoints, [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) in your project. To run MaxText, the TPU or GPU VMs must have read/write permissions for the bucket. These permissions are granted by service account roles, such as the `STORAGE ADMIN` role. -2. MaxText reads a yaml file for configuration. We also recommend reviewing the configurable options in `configs/base.yml`. This file includes a decoder-only model of ~1B parameters. The configurable options can be overwritten from the command line. 
For instance, you can change the `steps` or `log_period` by either modifying `configs/base.yml` or by passing in `steps` and `log_period` as additional arguments to the `train.py` call. Set `base_output_directory` to a folder in the bucket you just created.
+2. MaxText reads a yaml file for configuration. We also recommend reviewing the configurable options in `configs/base.yml`. This file includes a decoder-only model of ~1B parameters. The configurable options can be overwritten from the command line. For instance, you can change the `steps` or `log_period` by either modifying `configs/base.yml` or by passing in `steps` and `log_period` as additional arguments to the `train.py` call. Set `base_output_directory` to a folder in the bucket you just created. You can set an environment variable for that:
+
+```bash
+export BASE_OUTPUT_DIRECTORY=gs://
+```
 
 ## Local development for single host
 
@@ -42,7 +46,7 @@ multiple hosts but is a good way to learn about MaxText.
 ```sh
 python3 -m maxtext.trainers.pre_train.train \
 run_name=${YOUR_JOB_NAME?} \
-base_output_directory=gs:// \
+base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
 dataset_type=synthetic \
 steps=10
 ```
@@ -54,7 +58,7 @@ Optional: If you want to try training on a Hugging Face dataset, see [Data Input
 ```sh
 python3 -m maxtext.inference.decode \
 run_name=${YOUR_JOB_NAME?} \
-base_output_directory=gs:// \
+base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
 per_device_batch_size=1
 ```
@@ -76,7 +80,7 @@ You can use [demo_decoding.ipynb](https://github.com/AI-Hypercomputer/maxtext/bl
 ```sh
 python3 -m maxtext.trainers.pre_train.train \
 run_name=${YOUR_JOB_NAME?} \
-base_output_directory=gs:// \
+base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
 dataset_type=synthetic \
 steps=10
 ```
@@ -86,7 +90,7 @@ python3 -m maxtext.trainers.pre_train.train \
 ```sh
 python3 -m maxtext.inference.decode \
 run_name=${YOUR_JOB_NAME?} \
-base_output_directory=gs:// \
+base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
per_device_batch_size=1 ``` diff --git a/docs/tutorials/posttraining/multimodal.md b/docs/tutorials/posttraining/multimodal.md index 8590761be6..8fff8da747 100644 --- a/docs/tutorials/posttraining/multimodal.md +++ b/docs/tutorials/posttraining/multimodal.md @@ -127,8 +127,8 @@ Supervised Fine-Tuning (SFT) of multimodal LLMs in MaxText focuses specifically Here, we use [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) as an example to demonstrate SFT functionality: ```shell -export MAXTEXT_CKPT_PATH=... # either set to an already available MaxText ckpt or to the one we just converted in the previous step -export BASE_OUTPUT_DIRECTORY=gs://... +export MAXTEXT_CKPT_PATH= # either set to an already available MaxText ckpt or to the one we just converted in the previous step +export BASE_OUTPUT_DIRECTORY=gs:// export STEPS=1000 python -m maxtext.trainers.post_train.sft.train_sft_deprecated \ src/maxtext/configs/post_train/sft-vision-chartqa.yml \ diff --git a/docs/tutorials/pretraining.md b/docs/tutorials/pretraining.md index a1ae985db0..e19c04b83e 100644 --- a/docs/tutorials/pretraining.md +++ b/docs/tutorials/pretraining.md @@ -40,7 +40,7 @@ We can use this **command** for pretraining: ```bash # replace base_output_directory with your bucket python3 -m maxtext.trainers.pre_train.train \ -base_output_directory=gs://runner-maxtext-logs run_name=demo \ +base_output_directory=gs:// run_name=demo \ model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \ dataset_type=hf hf_path=allenai/c4 hf_data_dir=en train_split=train \ tokenizer_type=huggingface tokenizer_path=deepseek-ai/DeepSeek-V2-Lite @@ -93,9 +93,9 @@ Grain is a library for reading data for training and evaluating JAX models. 
It i
 **Data preparation**: You need to download data to a Cloud Storage bucket, and read data via Cloud Storage Fuse with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh).
 
-- For example, we can mount the bucket `gs://maxtext-dataset` on the local path `/tmp/gcsfuse` before training
+- For example, we can mount the bucket `gs://` on the local path `/tmp/gcsfuse` before training
   ```bash
-  bash setup_gcsfuse.sh DATASET_GCS_BUCKET=maxtext-dataset MOUNT_PATH=/tmp/gcsfuse
+  bash setup_gcsfuse.sh DATASET_GCS_BUCKET=gs:// MOUNT_PATH=/tmp/gcsfuse
   ```
 - After training, we unmount the local path
   ```bash
@@ -107,7 +107,7 @@ This **command** shows pretraining with Grain pipeline, along with evaluation: 
 ```bash
 # replace DATASET_GCS_BUCKET and base_output_directory with your buckets
 python3 -m maxtext.trainers.pre_train.train \
-base_output_directory=gs://runner-maxtext-logs run_name=demo \
+base_output_directory=gs:// run_name=demo \
 model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \
 dataset_type=grain grain_file_type=arrayrecord grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* grain_worker_count=2 \
 eval_interval=5 eval_steps=10 grain_eval_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record* \
@@ -144,7 +144,7 @@ This **command** shows pretraining with TFDS pipeline, along with evaluation: 
 ```bash
 # replace base_output_directory and dataset_path with your buckets
 python3 -m maxtext.trainers.pre_train.train \
-base_output_directory=gs://runner-maxtext-logs run_name=demo \
+base_output_directory=gs:// run_name=demo \
 model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \
 dataset_type=tfds dataset_path=gs://maxtext-dataset dataset_name='c4/en:3.0.1' train_split=train \
 eval_interval=5 eval_steps=10 eval_dataset_name='c4/en:3.0.1' eval_split=validation \
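As a rough mental model of what the `grain_train_files` glob and `grain_worker_count=2` do together, the pipeline expands the pattern into concrete record files and divides the work among workers. The sketch below only mirrors that idea with a round-robin split over synthetic file names; it is not Grain's actual sharding code.

```python
import glob
import os
import tempfile

# Illustrative only: expand a grain_train_files-style glob and split the
# matching record files round-robin across grain_worker_count workers.
def shard_files(pattern, worker_count):
    files = sorted(glob.glob(pattern))
    return [files[i::worker_count] for i in range(worker_count)]

# Synthetic stand-ins for c4-train.array_record-* shards:
tmp = tempfile.mkdtemp()
for i in range(4):
    open(os.path.join(tmp, f"c4-train.array_record-{i:05d}"), "w").close()

shards = shard_files(os.path.join(tmp, "c4-train.array_record*"), worker_count=2)
# 4 files across 2 workers -> 2 files each, with no overlap.
```

Sorting before splitting keeps the assignment deterministic across hosts, which is the property a real data pipeline needs for reproducible epochs.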