diff --git a/docs/guides/checkpointing_solutions/gcs_checkpointing.md b/docs/guides/checkpointing_solutions/gcs_checkpointing.md
index b80a9c536b..e78a6939ef 100644
--- a/docs/guides/checkpointing_solutions/gcs_checkpointing.md
+++ b/docs/guides/checkpointing_solutions/gcs_checkpointing.md
@@ -28,30 +28,30 @@ startup. The first valid condition met is the one executed:
### MaxText configuration
-Flag | Description | Type | Default
-:------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-------- | :------
-`enable_checkpointing` | A master switch to enable (`True`) or disable (`False`) saving checkpoints during the training run. | `boolean` | `False`
-`async_checkpointing` | When set to (`True`), this flag makes checkpoint saving asynchronous. The training step is only blocked for the minimal time needed to capture the model's state, and the actual writing to storage happens in a background thread. This is highly recommended for performance. It's enabled by default. | `boolean` | `True`
-`checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved. | `integer` | `10000`
-`enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage. **Note**: This feature is only compatible with training jobs that utilize a Distributed Data Parallel (DDP) strategy. | `boolean` | `False`
-`checkpoint_todelete_subdir` | Subdirectory to move checkpoints to before deletion. For example: `".todelete"` (Ignored if directory is prefixed with gs://) | `string` | `""`
-`checkpoint_todelete_full_path` | Full path to move checkpoints to before deletion. | `string` | `""`
-`load_parameters_path` | Specifies a path to a checkpoint directory to load a parameter only checkpoint. **Example**: `"gs://my-bucket/my-previous-run/checkpoints/items/1000"` | `string` | `""` (disabled)
-`load_full_state_path` | Specifies a path to a checkpoint directory to load a full checkpoint including optimizer state and step count from a specific directory. **Example**: `"gs://my-bucket/my-interrupted-run/checkpoints/items/500"` | `string` | `""` (disabled)
-`lora_input_adapters_path` | Specifies a parent directory containing LoRA (Low-Rank Adaptation) adapters. | `string` | `""` (disabled)
-`force_unroll` | If `True`, unrolls the loop when generating a parameter-only checkpoint. | `boolean` | `False`
+| Flag | Description | Type | Default |
+| :------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | :-------- | :-------------- |
+| `enable_checkpointing` | A master switch to enable (`True`) or disable (`False`) saving checkpoints during the training run. | `boolean` | `False` |
+| `async_checkpointing`                  | If `True`, checkpoint saving is asynchronous: the training step is blocked only for the minimal time needed to capture the model's state, and the actual write to storage happens in a background thread. This is highly recommended for performance. | `boolean` | `True` |
+| `checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved. | `integer` | `10000` |
+| `enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage. **Note**: This feature is only compatible with training jobs that utilize a Distributed Data Parallel (DDP) strategy. | `boolean` | `False` |
+| `checkpoint_todelete_subdir` | Subdirectory to move checkpoints to before deletion. For example: `".todelete"` (Ignored if directory is prefixed with `gs://`) | `string` | `""` |
+| `checkpoint_todelete_full_path` | Full path to move checkpoints to before deletion. | `string` | `""` |
+| `load_parameters_path`                 | Specifies a path to a checkpoint directory from which to load a parameter-only checkpoint. **Example**: `"gs://my-bucket/my-previous-run/checkpoints/items/1000"` | `string` | `""` (disabled) |
+| `load_full_state_path`                 | Specifies a path to a checkpoint directory from which to load a full checkpoint, including optimizer state and step count. **Example**: `"gs://my-bucket/my-interrupted-run/checkpoints/items/500"` | `string` | `""` (disabled) |
+| `lora_input_adapters_path` | Specifies a parent directory containing LoRA (Low-Rank Adaptation) adapters. | `string` | `""` (disabled) |
+| `force_unroll` | If `True`, unrolls the loop when generating a parameter-only checkpoint. | `boolean` | `False` |
## Storage and format configuration
These settings control the underlying storage mechanism
([Orbax](https://orbax.readthedocs.io)) for performance and compatibility.
-Flag | Description | Type | Default
-:----------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------- | :------
-`checkpoint_storage_target_data_file_size_bytes` | Sets a target file size for Orbax to chunk large arrays into smaller physical files. This can dramatically speed up loading over a network and in distributed environments. | `integer` | `2147483648` (2 GB)
-`checkpoint_storage_use_ocdbt` | If `True`, uses the TensorStore **OCDBT** (Optionally-Cooperative Distributed B+ Tree)) key-value store as the underlying storage format for checkpointing. Set to `0` for Pathways. | `boolean` | `True`
-`checkpoint_storage_use_zarr3` | If `True`, uses the Zarr v3 storage format within Orbax, which is optimized for chunked, compressed, N-dimensional arrays. Set to `0` for Pathways. | `boolean` | `True`
-`checkpoint_storage_concurrent_gb` | Controls the concurrent I/O limit in gigabytes for the checkpointer. Larger models may require increasing this value to avoid I/O bottlenecks. | `integer` | `96`
-`enable_orbax_v1` | A boolean flag to explicitly enable features and behaviors from Orbax version 1. | `boolean` | `False`
-`source_checkpoint_layout` | Specifies the format of the checkpoint being **loaded**. This tells the system how to interpret the files at the source path. **Options**: `"orbax"`, `"safetensors"` | `string` | `"orbax"`
-`checkpoint_conversion_fn` | A user-defined function to process a loaded checkpoint dictionary into a format that the model can understand. This is essential for loading checkpoints from different frameworks or formats (e.g., converting keys from a Hugging Face SafeTensors file). | `function` or `None` | `None`
+| Flag | Description | Type | Default |
+| :----------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------- | :------------------ |
+| `checkpoint_storage_target_data_file_size_bytes` | Sets a target file size for Orbax to chunk large arrays into smaller physical files. This can dramatically speed up loading over a network and in distributed environments. | `integer` | `2147483648` (2 GB) |
+| `checkpoint_storage_use_ocdbt`                   | If `True`, uses the TensorStore **OCDBT** (Optionally-Cooperative Distributed B+ Tree) key-value store as the underlying storage format for checkpointing. Set to `0` for Pathways. | `boolean` | `True` |
+| `checkpoint_storage_use_zarr3` | If `True`, uses the Zarr v3 storage format within Orbax, which is optimized for chunked, compressed, N-dimensional arrays. Set to `0` for Pathways. | `boolean` | `True` |
+| `checkpoint_storage_concurrent_gb` | Controls the concurrent I/O limit in gigabytes for the checkpointer. Larger models may require increasing this value to avoid I/O bottlenecks. | `integer` | `96` |
+| `enable_orbax_v1` | A boolean flag to explicitly enable features and behaviors from Orbax version 1. | `boolean` | `False` |
+| `source_checkpoint_layout`                       | Specifies the format of the checkpoint being **loaded**. This tells the system how to interpret the files at the source path. **Options**: `"orbax"`, `"safetensors"` | `string` | `"orbax"` |
+| `checkpoint_conversion_fn` | A user-defined function to process a loaded checkpoint dictionary into a format that the model can understand. This is essential for loading checkpoints from different frameworks or formats (e.g., converting keys from a Hugging Face SafeTensors file). | `function` or `None` | `None` |
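+
+As an illustration, `checkpoint_conversion_fn` is an ordinary Python callable that receives the loaded checkpoint dictionary and returns a remapped one. The sketch below is hypothetical — the key prefix and function name are illustrative, not MaxText's real parameter tree:
+
+```python
+def my_conversion_fn(loaded_state):
+  # Hypothetical example: strip a source-framework prefix from every key
+  # so the remaining names line up with the model's expected parameters.
+  return {key.removeprefix("model."): value for key, value in loaded_state.items()}
+```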
diff --git a/docs/guides/data_input_pipeline/data_input_grain.md b/docs/guides/data_input_pipeline/data_input_grain.md
index 5a7d66981d..5522020b3f 100644
--- a/docs/guides/data_input_pipeline/data_input_grain.md
+++ b/docs/guides/data_input_pipeline/data_input_grain.md
@@ -110,10 +110,10 @@ Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pr
```sh
bash src/dependencies/scripts/setup_gcsfuse.sh \
-DATASET_GCS_BUCKET=maxtext-dataset \
+DATASET_GCS_BUCKET=gs:// \
MOUNT_PATH=/tmp/gcsfuse && \
python3 -m maxtext.trainers.pre_train.train \
-run_name= base_output_directory=gs:// \
+run_name= base_output_directory=gs:// \
dataset_type=grain \
grain_file_type=arrayrecord # or parquet \
grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* \
diff --git a/docs/guides/monitoring_and_debugging/features_and_diagnostics.md b/docs/guides/monitoring_and_debugging/features_and_diagnostics.md
index a6952fae04..4a0f5efbef 100644
--- a/docs/guides/monitoring_and_debugging/features_and_diagnostics.md
+++ b/docs/guides/monitoring_and_debugging/features_and_diagnostics.md
@@ -87,7 +87,7 @@ export LIBTPU_INIT_ARGS="--xla_enable_async_all_gather=true"
python3 -m maxtext.trainers.pre_train.train run_name=example_load_compile \
compiled_trainstep_file=my_compiled_train.pickle \
global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \
- base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
+ base_output_directory=gs:// dataset_path=gs://
```
In the save step of example 2 above we included exporting the compiler flag `LIBTPU_INIT_ARGS` and `learning_rate` because those affect the compiled object `my_compiled_train.pickle.` The sizes of the model (e.g. `global_parameter_scale`, `max_sequence_length` and `per_device_batch`) are fixed when you initially compile via `compile_train.py`, you will see a size error if you try to run the saved compiled object with different sizes than you compiled with. However a subtle note is that the **learning rate schedule** is also fixed when you run `compile_train` - which is determined by both `steps` and `learning_rate`. The optimizer parameters such as `adam_b1` are passed only as shaped objects to the compiler - thus their real values are determined when you run `train.py`, not during the compilation. If you do pass in different shapes (e.g. `per_device_batch`), you will get a clear error message reporting that the compiled signature has different expected shapes than what was input. If you attempt to run on different hardware than the compilation targets requested via `compile_topology`, you will get an error saying there is a failure to map the devices from the compiled to your real devices. Using different XLA flags or a LIBTPU than what was compiled will probably run silently with the environment you compiled in without error. However there is no guaranteed behavior in this case; you should run in the same environment you compiled in.
@@ -125,7 +125,7 @@ export XLA_FLAGS="--xla_gpu_enable_async_collectives=true"
python3 -m maxtext.trainers.pre_train.train run_name=example_load_compile \
compiled_trainstep_file=my_compiled_train.pickle \
attention=dot_product global_parameter_scale=16 per_device_batch_size=4 steps=10000 learning_rate=1e-3 \
- base_output_directory=gs://my-output-bucket dataset_path=gs://my-dataset-bucket
+ base_output_directory=gs:// dataset_path=gs://
```
As in the TPU case, note that the compilation environment must match the execution environment, in this case by setting the same `XLA_FLAGS`.
diff --git a/docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md b/docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md
index 81206bff6b..dccaebc02f 100644
--- a/docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md
+++ b/docs/guides/monitoring_and_debugging/ml_workload_diagnostics.md
@@ -37,8 +37,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu
```
python3 -m maxtext.trainers.pre_train.train \
run_name=${USER}-tpu-job \
- base_output_directory="gs://your-output-bucket/" \
- dataset_path="gs://your-dataset-bucket/" \
+ base_output_directory="gs:///" \
+ dataset_path="gs:///" \
steps=100 \
log_period=10 \
managed_mldiagnostics=True
@@ -49,8 +49,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu
```
python3 -m maxtext.trainers.pre_train.train \
run_name=${USER}-tpu-job \
- base_output_directory="gs://your-output-bucket/" \
- dataset_path="gs://your-dataset-bucket/" \
+ base_output_directory="gs:///" \
+ dataset_path="gs:///" \
steps=100 \
log_period=10 \
profiler=xplane \
@@ -62,8 +62,8 @@ MaxText has integrated the ML Diagnostics [SDK](https://github.com/AI-Hypercompu
```
python3 -m maxtext.trainers.pre_train.train \
run_name=${USER}-tpu-job \
- base_output_directory="gs://your-output-bucket/" \
- dataset_path="gs://your-dataset-bucket/" \
+ base_output_directory="gs:///" \
+ dataset_path="gs:///" \
steps=100 \
log_period=10 \
profiler=xplane \
diff --git a/docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md b/docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md
index e381e7e8ac..e9833d7e6e 100644
--- a/docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md
+++ b/docs/guides/monitoring_and_debugging/understand_logs_and_metrics.md
@@ -20,11 +20,11 @@
When you run a training job, MaxText produces detailed output logs. This guide shows you how to interpret these logs to understand your configuration and monitor performance.
-To start, run a simple pretraining job on a single-host TPU. For instance, we can run the following command on TPU v5p-8. The resulting log is used as an example throughout this guide.
+To start, run a simple pretraining job on a single-host TPU. For instance, we can run the following command on TPU v5p-8. The resulting log is used as an example throughout this guide. Replace the `gs://` placeholder in the command below with the GCS bucket you want to use for your output.
```bash
python3 -m maxtext.trainers.pre_train.train \
-base_output_directory=gs://runner-maxtext-logs run_name=demo \
+base_output_directory=gs:// run_name=demo \
model_name=deepseek2-16b \
per_device_batch_size=24 max_target_length=2048 steps=10 dataset_type=synthetic enable_checkpointing=false
```
@@ -80,23 +80,23 @@ Config param data_sharding: (('data', 'stage', 'fsdp', 'fsdp_transpose', 'sequen
This also includes the **output paths** for your run artifacts.
```
-Config param base_output_directory: gs://runner-maxtext-logs
+Config param base_output_directory: gs://
Config param run_name: demo
-Config param metrics_dir: gs://runner-maxtext-logs/demo/metrics/
-Config param tensorboard_dir: gs://runner-maxtext-logs/demo/tensorboard/
-Config param checkpoint_dir: gs://runner-maxtext-logs/demo/checkpoints/
+Config param metrics_dir: gs:///demo/metrics/
+Config param tensorboard_dir: gs:///demo/tensorboard/
+Config param checkpoint_dir: gs:///demo/checkpoints/
```
### Understanding output paths
-MaxText organizes all of your run's artifacts into a main output directory. The primary location for your run is constructed by combining the `base_output_directory` and the `run_name` you specify in your command. Based on the logs above, the base path for this specific run is `gs://runner-maxtext-logs/demo`.
+MaxText organizes all of your run's artifacts into a main output directory. The primary location for your run is constructed by combining the `base_output_directory` and the `run_name` you specify in your command. Based on the logs above, the base path for this specific run is `gs:///demo`.
Within this base path, MaxText creates several subdirectories for different types of artifacts. Many of these are optional and only created if you enable them with a specific flag.
- **TensorBoard logs (`tensorboard/`)**
- Flag: `enable_tensorboard=True` (default)
- - Path: `gs://runner-maxtext-logs/demo/tensorboard/`
+ - Path: `gs:///demo/tensorboard/`
- **Profiler traces (`tensorboard/plugins/profile/`)**
@@ -106,17 +106,17 @@ Within this base path, MaxText creates several subdirectories for different type
- **Metrics in plain text (`metrics/`)**
- Flag: `gcs_metrics=True`
- - Path: `gs://runner-maxtext-logs/demo/metrics/`
+ - Path: `gs:///demo/metrics/`
- **Configuration file (`config.yml`)**
- Flag: `save_config_to_gcs=True`
- - Path: `gs://runner-maxtext-logs/demo/config.yml`
+ - Path: `gs:///demo/config.yml`
- **Checkpoints (`checkpoints/`)**
- Flag: `enable_checkpointing=True`
- - Path: `gs://runner-maxtext-logs/demo/checkpoints/`
+ - Path: `gs:///demo/checkpoints/`
To generate all optional artifacts in one run, you can set the corresponding flags in the command line, like in the example below.
@@ -124,7 +124,7 @@ This command enables tensorboard, profiler, text metrics, config saving, and che
```bash
python3 -m maxtext.trainers.pre_train.train \
-base_output_directory=gs://runner-maxtext-logs run_name=demo2 \
+base_output_directory=gs:// run_name=demo2 \
model_name=deepseek2-16b \
per_device_batch_size=24 max_target_length=2048 steps=10 dataset_type=synthetic \
enable_tensorboard=True \
diff --git a/docs/install_maxtext.md b/docs/install_maxtext.md
index 47d31e93cb..ec7ab961ed 100644
--- a/docs/install_maxtext.md
+++ b/docs/install_maxtext.md
@@ -112,7 +112,7 @@ environment to avoid dependency conflicts.
cd maxtext
```
-:::\{only} is_not_latest
+````{only} is_not_latest
By default, cloning the repository provides the latest version (**HEAD**).
If you wish to use the latest features, please follow the [latest guide](https://maxtext.readthedocs.io/en/latest/install_maxtext.html).
@@ -126,7 +126,7 @@ before proceeding with the installation.
git checkout |version|
```
-:::
+````
2. Create virtual environment:
diff --git a/docs/reference/core_concepts/quantization.md b/docs/reference/core_concepts/quantization.md
index dae117a85a..1b2ef6518d 100644
--- a/docs/reference/core_concepts/quantization.md
+++ b/docs/reference/core_concepts/quantization.md
@@ -87,7 +87,7 @@ Common options for the `quantization` flag when using Qwix include:
Here is an example of how to run a training job with int8 quantization enabled via Qwix:
```bash
-python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?} base_output_directory=gs:// dataset_type=synthetic use_qwix_quantization=true quantization='int8'
+python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?} base_output_directory=gs:// dataset_type=synthetic use_qwix_quantization=true quantization='int8'
```
#### The Qwix Interception API
@@ -142,7 +142,7 @@ When using AQT, you can pass one of the following values to the `quantization` f
#### Example command for AQT
```bash
-python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?} base_output_directory=gs:// dataset_type=synthetic use_qwix_quantization=false quantization='int8'
+python3 -m maxtext.trainers.pre_train.train run_name=${YOUR_JOB_NAME?} base_output_directory=gs:// dataset_type=synthetic use_qwix_quantization=false quantization='int8'
```
Note that `use_qwix_quantization` is not set to `True`.
diff --git a/docs/run_maxtext/run_maxtext_localhost.md b/docs/run_maxtext/run_maxtext_localhost.md
index 843c52a5f3..1a148cc6d8 100644
--- a/docs/run_maxtext/run_maxtext_localhost.md
+++ b/docs/run_maxtext/run_maxtext_localhost.md
@@ -26,7 +26,11 @@ MaxText uses a primary YAML file, `configs/base.yml`, to manage its settings. Th
- `learning_rate`: The core hyperparameter for the optimizer.
- Mode shape parameters: `base_num_decoder_layers`, `base_emb_dim`, `base_num_query_heads`, `base_num_kv_heads`, and `head_dim`.
- **Override settings (optional):** You can modify training parameters in two ways: by editing `configs/base.yml` directly or by passing them as command-line arguments to the training script which is the recommended method. For example, to change the number of training steps, you can pass `--steps=500` when running `train.py`.
-- **Note**: You **must** update the variable `base_output_directory` which is initialized in `configs/base.yml` to point to a folder within the GCS bucket you just created (e.g., `gs://your-bucket-name/maxtext-output`).
+- **Note**: You **must** update the variable `base_output_directory`, which is initialized in `configs/base.yml`, to point to a folder within the GCS bucket you just created (e.g., `gs:///maxtext-output`). You can set an environment variable for this:
+
+```bash
+export BASE_OUTPUT_DIRECTORY=gs://
+```
## Development
@@ -40,12 +44,12 @@ Local development on a single host TPU/GPU VM is a convenient way to run MaxText
#### Run a Test Training Job
-After the installation is complete, run a short training job using synthetic data to confirm everything is working correctly. This command trains a model for just 10 steps. Remember to replace `$YOUR_JOB_NAME` with a unique name for your run and `gs://` with the path to the GCS bucket you configured in the prerequisites.
+After the installation is complete, run a short training job using synthetic data to confirm everything is working correctly. This command trains a model for just 10 steps. Remember to replace `$YOUR_JOB_NAME` with a unique name for your run and to set `$BASE_OUTPUT_DIRECTORY` to the path of the GCS bucket you configured in the prerequisites.
```bash
python3 -m maxtext.trainers.pre_train.train \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=10
```
@@ -59,7 +63,7 @@ To demonstrate model output, run the following command:
```bash
python3 -m maxtext.inference.decode \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
per_device_batch_size=1
```
@@ -80,7 +84,7 @@ To use a pre-configured model for TPUs, you override the `model_name` parameter,
python3 -m maxtext.trainers.pre_train.train \
model_name=llama3-8b \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=10
```
@@ -94,7 +98,7 @@ python3 -m maxtext.trainers.pre_train.train \
python3 -m maxtext.trainers.pre_train.train \
model_name=qwen3-4b \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=10
```
@@ -111,7 +115,7 @@ To use a GPU-optimized configuration, you should specify the path to the model's
```bash
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/gpu/models/mixtral_8x7b.yml \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=10
```
@@ -126,7 +130,7 @@ This will load `gpu/mixtral_8x7b.yml`, which inherits from `base.yml`.
```bash
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/gpu/models/llama3-8b.yml \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=10
```
diff --git a/docs/run_maxtext/run_maxtext_via_xpk.md b/docs/run_maxtext/run_maxtext_via_xpk.md
index c800366aee..12c3a733cc 100644
--- a/docs/run_maxtext/run_maxtext_via_xpk.md
+++ b/docs/run_maxtext/run_maxtext_via_xpk.md
@@ -144,8 +144,8 @@ This guide focuses on submitting workloads to an existing cluster. Cluster creat
# region as your TPUs to minimize latency and costs.
# You can list your buckets and their locations in the
# [Cloud Console](https://console.cloud.google.com/storage/browser).
- export BASE_OUTPUT_DIRECTORY= # e.g., gs://my-bucket/maxtext-runs
- export DATASET_PATH="gs://your-dataset-bucket/"
+ export BASE_OUTPUT_DIRECTORY= # e.g., gs:///maxtext-runs
+ export DATASET_PATH="gs:///"
```
2. **Configure gcloud CLI**
diff --git a/docs/tutorials/first_run.md b/docs/tutorials/first_run.md
index f04c6acb9c..3bc0ca4e68 100644
--- a/docs/tutorials/first_run.md
+++ b/docs/tutorials/first_run.md
@@ -24,7 +24,11 @@ This topic provides a basic introduction to get your MaxText workload up and run
1. To store logs and checkpoints, [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) in your project. To run MaxText, the TPU or GPU VMs must have read/write permissions for the bucket. These permissions are granted by service account roles, such as the `STORAGE ADMIN` role.
-2. MaxText reads a yaml file for configuration. We also recommend reviewing the configurable options in `configs/base.yml`. This file includes a decoder-only model of ~1B parameters. The configurable options can be overwritten from the command line. For instance, you can change the `steps` or `log_period` by either modifying `configs/base.yml` or by passing in `steps` and `log_period` as additional arguments to the `train.py` call. Set `base_output_directory` to a folder in the bucket you just created.
+2. MaxText reads a YAML file for configuration. We also recommend reviewing the configurable options in `configs/base.yml`. This file includes a decoder-only model of ~1B parameters. The configurable options can be overridden from the command line. For instance, you can change `steps` or `log_period` either by modifying `configs/base.yml` or by passing them as additional arguments to the `train.py` call. Set `base_output_directory` to a folder in the bucket you just created. You can set an environment variable for this:
+
+```bash
+export BASE_OUTPUT_DIRECTORY=gs://
+```
## Local development for single host
@@ -42,7 +46,7 @@ multiple hosts but is a good way to learn about MaxText.
```sh
python3 -m maxtext.trainers.pre_train.train \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=10
```
@@ -54,7 +58,7 @@ Optional: If you want to try training on a Hugging Face dataset, see [Data Input
```sh
python3 -m maxtext.inference.decode \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
per_device_batch_size=1
```
@@ -76,7 +80,7 @@ You can use [demo_decoding.ipynb](https://github.com/AI-Hypercomputer/maxtext/bl
```sh
python3 -m maxtext.trainers.pre_train.train \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
dataset_type=synthetic \
steps=10
```
@@ -86,7 +90,7 @@ python3 -m maxtext.trainers.pre_train.train \
```sh
python3 -m maxtext.inference.decode \
run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
+ base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
per_device_batch_size=1
```
diff --git a/docs/tutorials/posttraining/multimodal.md b/docs/tutorials/posttraining/multimodal.md
index 8590761be6..8fff8da747 100644
--- a/docs/tutorials/posttraining/multimodal.md
+++ b/docs/tutorials/posttraining/multimodal.md
@@ -127,8 +127,8 @@ Supervised Fine-Tuning (SFT) of multimodal LLMs in MaxText focuses specifically
Here, we use [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) as an example to demonstrate SFT functionality:
```shell
-export MAXTEXT_CKPT_PATH=... # either set to an already available MaxText ckpt or to the one we just converted in the previous step
-export BASE_OUTPUT_DIRECTORY=gs://...
+export MAXTEXT_CKPT_PATH= # either set to an already available MaxText ckpt or to the one we just converted in the previous step
+export BASE_OUTPUT_DIRECTORY=gs://
export STEPS=1000
python -m maxtext.trainers.post_train.sft.train_sft_deprecated \
src/maxtext/configs/post_train/sft-vision-chartqa.yml \
diff --git a/docs/tutorials/pretraining.md b/docs/tutorials/pretraining.md
index a1ae985db0..e19c04b83e 100644
--- a/docs/tutorials/pretraining.md
+++ b/docs/tutorials/pretraining.md
@@ -40,7 +40,7 @@ We can use this **command** for pretraining:
```bash
# replace base_output_directory with your bucket
python3 -m maxtext.trainers.pre_train.train \
-base_output_directory=gs://runner-maxtext-logs run_name=demo \
+base_output_directory=gs:// run_name=demo \
model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \
dataset_type=hf hf_path=allenai/c4 hf_data_dir=en train_split=train \
tokenizer_type=huggingface tokenizer_path=deepseek-ai/DeepSeek-V2-Lite
@@ -93,9 +93,9 @@ Grain is a library for reading data for training and evaluating JAX models. It i
**Data preparation**: You need to download data to a Cloud Storage bucket, and read data via Cloud Storage Fuse with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh).
-- For example, we can mount the bucket `gs://maxtext-dataset` on the local path `/tmp/gcsfuse` before training
+- For example, we can mount the bucket `gs://` on the local path `/tmp/gcsfuse` before training
```bash
- bash setup_gcsfuse.sh DATASET_GCS_BUCKET=maxtext-dataset MOUNT_PATH=/tmp/gcsfuse
+ bash setup_gcsfuse.sh DATASET_GCS_BUCKET=gs:// MOUNT_PATH=/tmp/gcsfuse
```
- After training, we unmount the local path
```bash
@@ -107,7 +107,7 @@ This **command** shows pretraining with Grain pipeline, along with evaluation:
```bash
# replace DATASET_GCS_BUCKET and base_output_directory with your buckets
python3 -m maxtext.trainers.pre_train.train \
-base_output_directory=gs://runner-maxtext-logs run_name=demo \
+base_output_directory=gs:// run_name=demo \
model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \
dataset_type=grain grain_file_type=arrayrecord grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* grain_worker_count=2 \
eval_interval=5 eval_steps=10 grain_eval_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record* \
@@ -144,7 +144,7 @@ This **command** shows pretraining with TFDS pipeline, along with evaluation:
```bash
# replace base_output_directory and dataset_path with your buckets
python3 -m maxtext.trainers.pre_train.train \
-base_output_directory=gs://runner-maxtext-logs run_name=demo \
+base_output_directory=gs:// run_name=demo \
model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \
dataset_type=tfds dataset_path=gs://maxtext-dataset dataset_name='c4/en:3.0.1' train_split=train \
eval_interval=5 eval_steps=10 eval_dataset_name='c4/en:3.0.1' eval_split=validation \