Merged
2 changes: 1 addition & 1 deletion docs/source/BestPractices/Qwen3_5-Best-Practice.md
@@ -309,7 +309,7 @@ swift infer \

Tips for training Qwen3.5 with Megatron-SWIFT:
- Full-parameter training: refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
- Regarding MTP training: ms-swift does not yet support multimodal MTP training. If you are training on text-only data, set the `SKIP_MULTIMODAL_MTP_VALIDATION=1` environment variable to skip the validation check.
- Regarding MTP training: `mcore-bridge>=1.1.0` supports multimodal MTP training (for now this requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/14)); please install the corresponding version.
- TP restriction lifted: with `megatron-core>=0.16`, the `num_query_groups` restriction on TP is removed.
- By default, `GatedDeltaNet` uses the Megatron implementation, which requires `megatron-core>=0.16` (ms-swift>=4.1.0; earlier versions defaulted to the transformers implementation). Set the environment variable `USE_MCORE_GDN=0` to switch to the transformers implementation; note that the transformers implementation supports neither packing nor TP for GDN.
- Support for padding_free/packing: packing can speed up training. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).
4 changes: 3 additions & 1 deletion docs/source/GetStarted/SWIFT-installation.md
@@ -7,7 +7,9 @@
```shell
# Recommended
pip install 'ms-swift' -U
# For evaluation usage
# Install additional Megatron dependencies
pip install 'ms-swift[megatron]' -U
# Install additional evaluation dependencies
pip install 'ms-swift[eval]' -U
# Full capabilities
pip install 'ms-swift[all]' -U
1 change: 1 addition & 0 deletions docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -206,6 +206,7 @@
**MTP parameters**
- mtp_num_layers: Number of Multi-Token Prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to sequentially predict D additional tokens. Default is None. (requires "megatron-core>=0.14")
- Note: the value of mtp_num_layers is not automatically retrieved from config.json and must be set manually; you can refer to the `num_nextn_predict_layers` field in config.json when filling it in. When using mcore-bridge, MTP weights are loaded from the safetensors files first; if they cannot be found, they are randomly initialized. (To use blockwise fp8 + MTP, use mcore>=0.15.)
- Multimodal MTP support: requires "mcore-bridge>=1.1.0".
- mtp_loss_scaling_factor: Scaling factor for the Multi-Token Prediction (MTP) loss. We compute the average of the MTP losses across all depths, then multiply it by this scaling factor to obtain the overall MTP loss, which serves as an additional training objective. Default is 0.1.

**Tuner parameters**:
11 changes: 8 additions & 3 deletions docs/source/Megatron-SWIFT/Quick-start.md
@@ -32,8 +32,13 @@ git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

# mcore-bridge megatron-core
pip install "megatron-core==0.16.*" mcore-bridge -U
# mcore-bridge
pip install mcore-bridge -U
# Install from the main branch
# pip install git+https://github.com/modelscope/mcore-bridge.git

# megatron-core
pip install "megatron-core==0.16.*" -U

# If you are using multi-node training, additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path.
# This ensures the dataset cache is shared, which speeds up preprocessing.
@@ -67,7 +72,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2
| transformer-engine | >=2.3 | 2.12.0 | |
| apex | | 0.1 | |
| megatron-core | >=0.12,<0.17 | 0.16 | |
| mcore-bridge | >=1.0.1 | | |
| mcore-bridge | >=1.0.2 | | |
| flash-attn | | 2.8.3/3.0.0b1 | |
| transformers | >=4.33 | 4.57.6/5.2.0 | |
| modelscope | >=1.23 | | |
2 changes: 1 addition & 1 deletion docs/source_en/BestPractices/Qwen3_5-Best-Practice.md
@@ -307,7 +307,7 @@ swift infer \
Tips for training Qwen3.5 with Megatron-SWIFT:

- Full parameter training: Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
- Regarding MTP training: ms-swift currently does not support multimodal MTP training. If you are only training on pure text data, please set the `SKIP_MULTIMODAL_MTP_VALIDATION=1` environment variable to skip the validation check.
- Regarding MTP training: `mcore-bridge>=1.1.0` supports multimodal MTP training (currently requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/14)). Please install the corresponding version.
- TP Limitation Removed: Using `megatron-core>=0.16` removes the `num_query_groups` limitation on TP.
- By default, `GatedDeltaNet` uses the Megatron implementation, which requires "megatron-core>=0.16" (ms-swift>=4.1.0; previous versions defaulted to the transformers implementation). Set the environment variable `USE_MCORE_GDN=0` to switch to the transformers implementation. Note that the transformers implementation does not support packing and GDN's TP.
- Support for padding_free/packing: Packing can improve training speed. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).
4 changes: 3 additions & 1 deletion docs/source_en/GetStarted/SWIFT-installation.md
@@ -7,7 +7,9 @@ You can install it using pip:
```shell
# Recommended
pip install 'ms-swift' -U
# For evaluation usage
# Install additional Megatron dependencies
pip install 'ms-swift[megatron]' -U
# Install additional evaluation dependencies
pip install 'ms-swift[eval]' -U
# Full capabilities
pip install 'ms-swift[all]' -U
1 change: 1 addition & 0 deletions docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -218,6 +218,7 @@ For guidance on selecting parallelization strategies, please refer to the [Train
**MTP Parameters**
- mtp_num_layers: Number of Multi-Token Prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to sequentially predict D additional tokens. Default is None. (requires "megatron-core>=0.14")
- Note: The value of mtp_num_layers will not be automatically retrieved from config.json and must be set manually. You can refer to the `num_nextn_predict_layers` field in config.json to fill in this value. When using mcore-bridge, MTP weights will be loaded from safetensors files first. If not found, random initialization will be performed. (To use blockwise fp8 + mtp, please use mcore>=0.15)
- Multimodal MTP support: Requires installing "mcore-bridge>=1.1.0".
- mtp_loss_scaling_factor: Scaling factor of Multi-Token Prediction (MTP) loss. We compute the average of MTP losses across all depths, then multiply it by this scaling factor to obtain the overall MTP loss, which serves as an additional training objective. Default is 0.1.
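The scaling arithmetic described above can be sketched in a few lines. This is illustrative only, not ms-swift code, and the loss values are made up:

```python
# Overall MTP loss = mtp_loss_scaling_factor × mean of per-depth MTP losses.
mtp_losses = [2.0, 1.5, 1.0]     # hypothetical MTP losses at depths 1..D
mtp_loss_scaling_factor = 0.1    # the default

overall_mtp_loss = mtp_loss_scaling_factor * sum(mtp_losses) / len(mtp_losses)
print(round(overall_mtp_loss, 6))  # 0.15
```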

**Tuner Parameters**:
11 changes: 8 additions & 3 deletions docs/source_en/Megatron-SWIFT/Quick-start.md
@@ -31,8 +31,13 @@ git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

# mcore-bridge megatron-core
pip install "megatron-core==0.16.*" mcore-bridge -U
# mcore-bridge
pip install mcore-bridge -U
# Install from main branch
# pip install git+https://github.com/modelscope/mcore-bridge.git

# megatron-core
pip install "megatron-core==0.16.*" -U

# If you are using multi-node training, please additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path.
# This will ensure that the dataset cache is shared, thereby speeding up preprocessing.
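# For example, on every node before launching (the path below is hypothetical —
# substitute your cluster's shared mount):
export MODELSCOPE_CACHE=/mnt/shared/modelscope_cache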
@@ -67,7 +72,7 @@ Recommended Operating Environment:
| transformer-engine | >=2.3 | 2.12.0 | |
| apex | | 0.1 | |
| megatron-core | >=0.12,<0.17 | 0.16 | |
| mcore-bridge | >=1.0.1 | | |
| mcore-bridge | >=1.0.2 | | |
| flash-attn | | 2.8.3/3.0.0b1 | |
| transformers | >=4.33 | 4.57.6/5.2.0 | |
| modelscope | >=1.23 | | |
1 change: 0 additions & 1 deletion examples/models/qwen3_5/packing.sh
@@ -5,7 +5,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=12 \
SKIP_MULTIMODAL_MTP_VALIDATION=1 \
megatron sft \
--model Qwen/Qwen3.5-35B-A3B \
--save_safetensors true \
3 changes: 3 additions & 0 deletions requirements/megatron.txt
@@ -0,0 +1,3 @@
mcore-bridge>=1.0.2
megatron-core>=0.12
peft>=0.15
1 change: 1 addition & 0 deletions setup.py
@@ -120,6 +120,7 @@ def gen_packages_items():
install_requires, deps_link = parse_requirements('requirements.txt')
extra_requires = {}
all_requires = []
extra_requires['megatron'], _ = parse_requirements('requirements/megatron.txt')
extra_requires['eval'], _ = parse_requirements('requirements/eval.txt')
extra_requires['swanlab'], _ = parse_requirements('requirements/swanlab.txt')
extra_requires['ray'], _ = parse_requirements('requirements/ray.txt')
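The extras wiring above can be sketched with a simplified stand-in. The helper name below is hypothetical; the real `parse_requirements` in setup.py does more (it also returns dependency links):

```python
# Hypothetical, simplified stand-in for setup.py's parse_requirements:
# one requirement spec per line, skipping blanks and comments.
def parse_requirements_text(text):
    return [ln.strip() for ln in text.splitlines()
            if ln.strip() and not ln.strip().startswith("#")]

megatron_txt = "mcore-bridge>=1.0.2\nmegatron-core>=0.12\npeft>=0.15\n"
extra_requires = {"megatron": parse_requirements_text(megatron_txt)}

# `pip install 'ms-swift[megatron]'` would then resolve exactly these pins.
print(extra_requires["megatron"])  # ['mcore-bridge>=1.0.2', 'megatron-core>=0.12', 'peft>=0.15']
```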
2 changes: 1 addition & 1 deletion swift/megatron/init.py
@@ -139,7 +139,7 @@ def _new_load_inline(*args, **kwargs):


def _patch_mcore_bridge():
require_version('mcore-bridge>=1.0.1.dev', 'please install mcore-bridge via `pip install mcore-bridge -U`')
require_version('mcore-bridge>=1.0.2', 'please install mcore-bridge via `pip install mcore-bridge -U`')
import mcore_bridge
from mcore_bridge import GPTBridge
logger.info(f'mcore_bridge.__version__: {mcore_bridge.__version__}')
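The tightened bound (from `1.0.1.dev` to `1.0.2`) can be illustrated with a simplified comparison on numeric release segments. This is not the real `require_version`, which handles full PEP 440 semantics:

```python
# Simplified sketch: compare only the numeric release parts of a version string.
# Pre-release suffixes such as "dev0" are dropped, so a dev build of 1.0.1
# fails a >=1.0.2 bound here, just as it does under the stricter requirement.
def meets_minimum(installed: str, minimum: str) -> bool:
    release = lambda v: tuple(int(p) for p in v.split(".") if p.isdigit())
    return release(installed) >= release(minimum)

print(meets_minimum("1.0.2", "1.0.2"))       # True
print(meets_minimum("1.0.1.dev0", "1.0.2"))  # False
```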