hpcaitech · Zhao73 · Jun 1, 2026
@@ -29,7 +29,7 @@ We will cover the whole workflow in the `basic tutorials` section.
 The Colossal-AI system will be expanded to include more training skills, these new developments may include but are not limited to:
 
 1. optimization of distributed operations
-2. optimization of training on heterogenous system
+2. optimization of training on heterogeneous system
 3. implementation of training utilities to reduce model size and speed up training while preserving model performance
 4. expansion of existing parallelism methods
 

@@ -133,7 +133,7 @@ model on a single machine.
 
 <figure style={{textAlign: "center"}}>
 <img src="https://s2.loli.net/2022/01/28/qLHD5lk97hXQdbv.png"/>
-<figcaption>Heterogenous system illustration</figcaption>
+<figcaption>Heterogeneous system illustration</figcaption>
 </figure>
 
 Related paper:

@@ -12,7 +12,7 @@ Author: [Wenxuan Tan](https://github.com/Edenzzzz), [Junwen Duan](https://github
 Apart from the widely adopted Adam and SGD, many modern optimizers require layer-wise statistics to update parameters, and thus aren't directly applicable to settings where model layers are sharded across multiple devices. We provide optimized distributed implementations with minimal extra communications, and seamless integrations with Tensor Parallel, DDP and ZeRO plugins, which automatically uses distributed optimizers with 0 code change.
 
 ## Optimizers
-Adafactor is a first-order Adam variant using Non-negative Matrix Factorization(NMF) to reduce memory footprint. CAME improves by introducting a confidence matrix to correct NMF. GaLore further reduces memory by projecting gradients into a low-rank space and 8-bit block-wise quantization. Lamb allows huge batch sizes without lossing accuracy via layer-wise adaptive update bounded by the inverse of its Lipschiz constant.
+Adafactor is a first-order Adam variant using Non-negative Matrix Factorization (NMF) to reduce memory footprint. CAME improves by introducing a confidence matrix to correct NMF. GaLore further reduces memory by projecting gradients into a low-rank space and 8-bit block-wise quantization. Lamb allows huge batch sizes without losing accuracy via layer-wise adaptive update bounded by the inverse of its Lipschitz constant.
 
 
 ## Hands-On Practice
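
Note on the Optimizers hunk above: the memory saving that the paragraph attributes to Adafactor comes from storing row and column statistics of the squared gradients instead of the full matrix. The following is a minimal sketch of that factorization for a single step, assuming a non-negative matrix with a non-zero total. The helper name `factored_second_moment` and the sample values are hypothetical. It is not the Colossal-AI implementation, and it omits the exponential moving average and update clipping that a real optimizer would apply.

```python
# Illustrative sketch only; not the Colossal-AI or PyTorch implementation.
# `factored_second_moment` is a hypothetical helper name.

def factored_second_moment(sq_grad):
    """Approximate a non-negative matrix from its row and column sums.

    Keeps n + m numbers instead of n * m. The rank-1 estimate is
    outer(row_sums, col_sums) / total_sum. Assumes total_sum != 0.
    """
    row_sums = [sum(row) for row in sq_grad]
    col_sums = [sum(col) for col in zip(*sq_grad)]
    total = sum(row_sums)
    approx = [[r * c / total for c in col_sums] for r in row_sums]
    return row_sums, col_sums, approx


# Squared gradients of a 2 x 3 weight matrix (made-up values, rank 1 here).
sq_grad = [[1.0, 2.0, 4.0],
           [2.0, 4.0, 8.0]]
row_sums, col_sums, approx = factored_second_moment(sq_grad)
```

Because the sample matrix is rank 1, the reconstruction is exact. For a general matrix the estimate is only an approximation, which is the trade-off made for the smaller optimizer state.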
@@ -28,7 +28,7 @@ import colossalai
 import torch
 ```
 
-### step 2. Initialize Distributed Environment and Parallism Group
+### step 2. Initialize Distributed Environment and Parallelism Group
 We need to initialize distributed environment. For demo purpose, we use `colossal run --nproc_per_node 4`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
 
 ```python

@@ -21,7 +21,7 @@ Author: [Baizhou Zhang](https://github.com/Fridge003), [Bin Jia](https://github.
 ## Introduction
 
 When training large transformer models such as LLaMa-2 70B or OPT 175B, model parallelism methods that divide a huge model into smaller shards, including tensor parallelism or pipeline parallelism, are essential so as to meet the limitation of GPU memory.
-However, manually cutting model and rewriting its forward/backword logic could be difficult for users who are not familiar with distributed training.
+However, manually cutting model and rewriting its forward/backward logic could be difficult for users who are not familiar with distributed training.
 Meanwhile, the Huggingface transformers library has gradually become users' first choice of model source, and most mainstream large models have been open-sourced in Huggingface transformers model library.
 
 Out of this motivation, the ColossalAI team develops **Shardformer**, a feature that automatically does preparation of model parallelism (tensor parallelism/pipeline parallelism) for popular transformer models in HuggingFace.

@@ -29,7 +29,7 @@ More details can be found in our [blog of Stable Diffusion v1](https://www.hpc-a
 ## Roadmap
 This project is in rapid development.
 
-- [X] Train a stable diffusion model v1/v2 from scatch
+- [X] Train a stable diffusion model v1/v2 from scratch
 - [X] Finetune a pretrained Stable diffusion v1 model
 - [X] Inference a pretrained model using PyTorch
 - [ ] Finetune a pretrained Stable diffusion v2 model
@@ -40,7 +40,7 @@ This project is in rapid development.
 ### Option #1: Install from source
 #### Step 1: Requirements
 
-To begin with, make sure your operating system has the cuda version suitable for this exciting training session, which is cuda11.6/11.8. For your convience, we have set up the rest of packages here. You can create and activate a suitable [conda](https://conda.io/) environment named `ldm` :
+To begin with, make sure your operating system has the cuda version suitable for this exciting training session, which is cuda11.6/11.8. For your convenience, we have set up the rest of packages here. You can create and activate a suitable [conda](https://conda.io/) environment named `ldm` :
 
 ```
 conda env create -f environment.yaml
@@ -202,7 +202,7 @@ python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml
 ```
 
 ## Inference
-if you want to test with pretrain model,as bellow:
+If you want to test with a pretrained model, run as below:
 python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms    --outdir ./output     --ckpt 512-base-ema.ckpt     --config configs/train_ddp.yaml
 
 You can get your training last.ckpt and train config.yaml in your `--logdir`, and run by