2 changes: 1 addition & 1 deletion docs/source/en/concepts/colossalai_overview.md
@@ -29,7 +29,7 @@ We will cover the whole workflow in the `basic tutorials` section.
The Colossal-AI system will be expanded to include more training skills, these new developments may include but are not limited to:

1. optimization of distributed operations
-2. optimization of training on heterogenous system
+2. optimization of training on heterogeneous system
3. implementation of training utilities to reduce model size and speed up training while preserving model performance
4. expansion of existing parallelism methods

2 changes: 1 addition & 1 deletion docs/source/en/concepts/paradigms_of_parallelism.md
@@ -133,7 +133,7 @@ model on a single machine.

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qLHD5lk97hXQdbv.png"/>
-<figcaption>Heterogenous system illustration</figcaption>
+<figcaption>Heterogeneous system illustration</figcaption>
</figure>

Related paper:
4 changes: 2 additions & 2 deletions docs/source/en/features/distributed_optimizers.md
@@ -12,7 +12,7 @@ Author: [Wenxuan Tan](https://github.com/Edenzzzz), [Junwen Duan](https://github
Apart from the widely adopted Adam and SGD, many modern optimizers require layer-wise statistics to update parameters, and thus aren't directly applicable to settings where model layers are sharded across multiple devices. We provide optimized distributed implementations with minimal extra communications, and seamless integrations with Tensor Parallel, DDP and ZeRO plugins, which automatically uses distributed optimizers with 0 code change.

## Optimizers
-Adafactor is a first-order Adam variant using Non-negative Matrix Factorization(NMF) to reduce memory footprint. CAME improves by introducting a confidence matrix to correct NMF. GaLore further reduces memory by projecting gradients into a low-rank space and 8-bit block-wise quantization. Lamb allows huge batch sizes without lossing accuracy via layer-wise adaptive update bounded by the inverse of its Lipschiz constant.
+Adafactor is a first-order Adam variant using Non-negative Matrix Factorization(NMF) to reduce memory footprint. CAME improves by introducing a confidence matrix to correct NMF. GaLore further reduces memory by projecting gradients into a low-rank space and 8-bit block-wise quantization. Lamb allows huge batch sizes without losing accuracy via layer-wise adaptive update bounded by the inverse of its Lipschiz constant.
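
Reviewer note: the factorization idea in the paragraph above may be easier to follow with a small example. The sketch below is not Colossal-AI code; it is a minimal, pure-Python illustration of the rank-1 factored second-moment estimate that Adafactor is generally described as using, where only row sums and column sums of the squared gradients are kept. The function name `factored_second_moment` is hypothetical, and the sketch deliberately omits the exponential moving averages, epsilon terms, and update clipping a real optimizer would need.

```python
# Illustrative sketch only: a rank-1 factored estimate of squared gradients.
# `factored_second_moment` is a hypothetical helper, not a Colossal-AI API.

def factored_second_moment(grad_sq):
    """Approximate an n x m matrix of squared gradients from its
    row sums and column sums: V[i][j] ~= row[i] * col[j] / total."""
    row_sums = [sum(row) for row in grad_sq]          # n values stored
    col_sums = [sum(col) for col in zip(*grad_sq)]    # m values stored
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


# A small gradient whose squared entries happen to form a rank-1 matrix,
# so the factored estimate reproduces them exactly.
grad = [[1.0, 2.0], [2.0, 4.0]]
grad_sq = [[g * g for g in row] for row in grad]
approx = factored_second_moment(grad_sq)
print(approx)
```

The memory saving comes from storing `n + m` accumulated sums instead of `n * m` per-element statistics; for matrices that are not rank-1 the result is only an approximation.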


## Hands-On Practice
@@ -28,7 +28,7 @@ import colossalai
import torch
```

-### step 2. Initialize Distributed Environment and Parallism Group
+### step 2. Initialize Distributed Environment and Parallelism Group
We need to initialize distributed environment. For demo purpose, we use `colossal run --nproc_per_node 4`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)

```python
2 changes: 1 addition & 1 deletion docs/source/en/features/shardformer.md
@@ -21,7 +21,7 @@ Author: [Baizhou Zhang](https://github.com/Fridge003), [Bin Jia](https://github.
## Introduction

When training large transformer models such as LLaMa-2 70B or OPT 175B, model parallelism methods that divide a huge model into smaller shards, including tensor parallelism or pipeline parallelism, are essential so as to meet the limitation of GPU memory.
-However, manually cutting model and rewriting its forward/backword logic could be difficult for users who are not familiar with distributed training.
+However, manually cutting model and rewriting its forward/backward logic could be difficult for users who are not familiar with distributed training.
Meanwhile, the Huggingface transformers library has gradually become users' first choice of model source, and most mainstream large models have been open-sourced in Huggingface transformers model library.

Out of this motivation, the ColossalAI team develops **Shardformer**, a feature that automatically does preparation of model parallelism (tensor parallelism/pipeline parallelism) for popular transformer models in HuggingFace.
6 changes: 3 additions & 3 deletions examples/images/diffusion/README.md
@@ -29,7 +29,7 @@ More details can be found in our [blog of Stable Diffusion v1](https://www.hpc-a
## Roadmap
This project is in rapid development.

-- [X] Train a stable diffusion model v1/v2 from scatch
+- [X] Train a stable diffusion model v1/v2 from scratch
- [X] Finetune a pretrained Stable diffusion v1 model
- [X] Inference a pretrained model using PyTorch
- [ ] Finetune a pretrained Stable diffusion v2 model
@@ -40,7 +40,7 @@ This project is in rapid development.
### Option #1: Install from source
#### Step 1: Requirements

-To begin with, make sure your operating system has the cuda version suitable for this exciting training session, which is cuda11.6/11.8. For your convience, we have set up the rest of packages here. You can create and activate a suitable [conda](https://conda.io/) environment named `ldm` :
+To begin with, make sure your operating system has the cuda version suitable for this exciting training session, which is cuda11.6/11.8. For your convenience, we have set up the rest of packages here. You can create and activate a suitable [conda](https://conda.io/) environment named `ldm` :

```
conda env create -f environment.yaml
@@ -202,7 +202,7 @@ python main.py --logdir /tmp/ -t -b configs/Teyvat/train_colossalai_teyvat.yaml
```

## Inference
-if you want to test with pretrain model,as bellow:
+if you want to test with pretrain model,as below:
python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms --outdir ./output --ckpt 512-base-ema.ckpt --config configs/train_ddp.yaml

You can get your training last.ckpt and train config.yaml in your `--logdir`, and run by