Open
46 commits
da34ee7
update branch (#9189)
ericharper May 14, 2024
bb26e98
Enable CUDA graphs by default only for transcription (#9196)
artbataev May 14, 2024
fc9d6dc
Update to using Model Optimizer (formerly AMMO) in PTQ workflow (#9178)
janekl May 15, 2024
fd36bcc
update branch
ericharper May 15, 2024
7690b3d
update branch (#9211)
ericharper May 16, 2024
3251cdc
rename paths2audiofiles to audio (#9209)
nithinraok May 16, 2024
7d20f0d
fix graphviz installation for local run (#9233)
andrusenkoau May 17, 2024
6a5187b
Support dataloader as input to `audio` for transcription (#9201) (#9235)
titu1994 May 17, 2024
5a68d2a
Revert rope fusion defaults (#9238)
cuichenx May 17, 2024
f5ad4ab
dist adam transpose fix (#9239)
dimapihtar May 17, 2024
6b170b8
increase time limit for Speech_Checkpoints_tests (#9186) (#9247)
pablo-garay May 17, 2024
60c9257
Update Online_Offline_Microphone_VAD_Demo.ipynb (#9252)
stevehuang52 May 20, 2024
2a2d985
Dgalvez/fix greedy batch strategy name r2.0.0rc0 (#9243)
galv May 20, 2024
38fcd5f
Merge branch 'r2.0.0rc0' of github.com:NVIDIA/NeMo into r2.0.0rc0
ericharper May 21, 2024
44a6676
sum-reduce grad_norm in DP+CP domain (#9262)
erhoo82 May 21, 2024
91003a0
Fix T5 G2P Input and Output Types (#9224)
blisc May 21, 2024
4a6f158
Update nemo.export module for quantized models (#9250)
janekl May 21, 2024
26f566b
Pin transformers (#9261)
ericharper May 22, 2024
212023c
Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfe…
galv May 22, 2024
fe594b5
Fix loading github raw images on notebook (#9282)
nithinraok May 22, 2024
2e4f5aa
TRTLLM new API support (#9003)
meatybobby May 13, 2024
833d4cc
Add TRT-LLM params like max_num_tokens and opt_num_tokens (#9210)
oyilmaz-nvidia May 21, 2024
d8afaba
Merge branch 'r2.0.0rc0' of https://github.com/NVIDIA/NeMo into r2.0.…
pablo-garay May 22, 2024
8b65e3e
Merge branch 'r2.0.0rc0' of github.com:NVIDIA/NeMo into r2.0.0rc0
ericharper May 23, 2024
c12030f
Fix circular import for MM dataprep notebook (#9287)
cuichenx May 23, 2024
afd3c7e
Remove .nemo instead of renaming (#9281)
mikolajblaz May 23, 2024
60a1588
neva media_type + text generation default fix (#9257)
paul-gibbons May 23, 2024
40dbcf1
fix lora and ptuning and isort/black (#9290)
oyilmaz-nvidia May 23, 2024
16aace0
add check if num layers is divisible by pp size (#9208)
dimapihtar May 23, 2024
f073ed9
Fix P-tuning for Llama based models (#9297)
apanteleev May 23, 2024
6040af5
Fix typo in HF tutorial (#9302)
titu1994 May 24, 2024
7235f2b
Guard cuda memory allocator update (#9312)
titu1994 May 25, 2024
0411b7c
typos (#9314)
nithinraok May 25, 2024
525604f
add deprecation warnings (#9266)
pablo-garay May 25, 2024
86a0bb6
move pooler under post_process (#9328)
dimapihtar May 29, 2024
4cefd5d
Re-enable cuda graphs in training modes. (#9338)
galv May 29, 2024
452bd95
add large model stable training fix and contrastive loss update for v…
nithinraok May 30, 2024
d2bdb49
add deprecation note for nmt (#9342)
dimapihtar May 30, 2024
0113f47
Fix incorrect checkpoint removal logic (#9192) (#9204)
mikolajblaz Jun 1, 2024
81ab4ed
conv1d stable version (#9330) (#9369)
pablo-garay Jun 3, 2024
60525c8
Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) (#9…
titu1994 Jun 4, 2024
e999061
Adding the original change made for label_models (#9377)
tango4j Jun 4, 2024
f2dffaa
Force diarizer to use CUDA if cuda is available and if device=None. (…
tango4j Jun 5, 2024
d02bb32
fix fp16 precision issue (#9376)
dimapihtar Jun 5, 2024
265bd73
Merge branch 'r2.0.0rc0' of github.com:NVIDIA/NeMo into r2.0.0rc0
ericharper Jun 5, 2024
9294486
Update config for NeMo 2
diarray-hub Jun 9, 2026
1,357 changes: 167 additions & 1,190 deletions .github/workflows/cicd-main.yml

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions .github/workflows/codeql.yml
@@ -13,10 +13,10 @@ name: "CodeQL"

on:
push:
branches: [ "main", "[rv][0-9]*", "gh-pages-src" ]
branches: [ 'r2.0.0rc0', "[rv][0-9]*", "gh-pages-src" ]
pull_request:
# The branches below must be a subset of the branches above
branches: [ "main" ]
branches: [ 'r2.0.0rc0' ]
schedule:
- cron: '19 1 * * 4'
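
As a rough illustration of which branch names the updated `push` filter selects, the sketch below approximates the patterns with Python's `fnmatch`. This is an assumption made for illustration only: GitHub Actions uses its own filter-pattern matcher, and the sample branch names other than `main`, `r2.0.0rc0` and `gh-pages-src` are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the updated `push.branches` list in codeql.yml.
PUSH_PATTERNS = ["r2.0.0rc0", "[rv][0-9]*", "gh-pages-src"]


def matches_push_filter(branch: str) -> bool:
    """Return True if `branch` matches any pattern (fnmatch approximation)."""
    return any(fnmatchcase(branch, pattern) for pattern in PUSH_PATTERNS)


if __name__ == "__main__":
    # "v1.23.0" and "feature/x" are hypothetical branch names.
    for name in ["main", "r2.0.0rc0", "v1.23.0", "gh-pages-src", "feature/x"]:
        print(f"{name}: {matches_push_filter(name)}")
```

Under this approximation, `main` no longer matches any `push` pattern, while `r2.0.0rc0` matches both its explicit entry and `[rv][0-9]*`.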

2 changes: 0 additions & 2 deletions Dockerfile
@@ -133,8 +133,6 @@ RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-chec
RUN pip install flash-attn
# install numba for latest containers
RUN pip install "numba>=0.57.1"
# install ammo
RUN pip install nvidia-ammo~=0.9.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir

# copy nemo source into a scratch image
FROM scratch as nemo-src
6 changes: 3 additions & 3 deletions README.rst
@@ -247,7 +247,7 @@ Use this installation mode if you want the latest released version.
.. code-block:: bash

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython
pip install Cython packaging
pip install nemo_toolkit['all']

Depending on the shell used, you may need to use ``"nemo_toolkit[all]"`` instead in the above command.
@@ -272,7 +272,7 @@ Use this installation mode if you want the version from a particular GitHub bran
.. code-block:: bash

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython
pip install Cython packaging
python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[all]


@@ -310,7 +310,7 @@ To install NeMo on Mac with Apple M-Series GPU:
conda install -c conda-forge pynini

# install Cython manually
pip install cython
pip install cython packaging

# clone the repo and install in development mode
git clone https://github.com/NVIDIA/NeMo
48 changes: 45 additions & 3 deletions docs/source/nlp/quantization.rst
@@ -10,18 +10,60 @@ PTQ enables deploying a model in a low-precision format -- FP8, INT4, or INT8 --

Model quantization has two primary benefits: reduced model memory requirements and increased inference throughput.

In NeMo, quantization is enabled by the Nvidia AMMO library -- a unified algorithmic model optimization & deployment toolkit.
In NeMo, quantization is enabled by the `NVIDIA TensorRT Model Optimizer (ModelOpt) <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ library -- a library to quantize and compress deep learning models for optimized inference on GPUs.

The quantization process consists of the following steps:

1. Loading a model checkpoint using an appropriate parallelism strategy
2. Calibrating the model to obtain appropriate algorithm-specific scaling factors
3. Producing an output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml).

Loading models requires using an AMMO spec defined in `megatron.core.inference.gpt.model_specs.py <https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/inference/gpt/model_specs.py>`_ module. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also available in NeMo project in ``nemo.deploy`` and ``nemo.export`` modules.
Loading models requires using a ModelOpt spec defined in the `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library. The engine build step is also available in the NeMo project in the ``nemo.deploy`` and ``nemo.export`` modules.
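
To make the calibration step more concrete, here is a minimal sketch of max calibration for symmetric INT8 quantization: a small set of sample values yields a scaling factor, which is then used to quantize and dequantize. This is a generic textbook illustration, not ModelOpt's implementation; the function names and sample data are hypothetical.

```python
from typing import List


def calibrate_scale(samples: List[float], num_bits: int = 8) -> float:
    """Derive a per-tensor scale from the largest absolute calibration value."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = max(abs(x) for x in samples)
    return amax / qmax if amax > 0 else 1.0


def quantize(values: List[float], scale: float, num_bits: int = 8) -> List[int]:
    """Round to the nearest integer step and clamp to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values: List[int], scale: float) -> List[float]:
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in q_values]


if __name__ == "__main__":
    calibration_data = [0.02, -1.27, 0.5, 0.9]  # hypothetical activations
    scale = calibrate_scale(calibration_data)
    q = quantize(calibration_data, scale)
    print(scale, q, dequantize(q, scale))
```

Values outside the calibrated range are clamped, which is why a representative calibration dataset matters even when it is small.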

The quantization algorithm can also be set to ``"null"`` to perform only the weights export step, using the default precision, for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.

Support Matrix
^^^^^^^^^^^^^^

The table below presents the verified model support matrix for popular LLM architectures. Where available, a model entry links to a corresponding NeMo checkpoint for testing purposes. Support for other model families is experimental.

.. list-table:: Model Support Matrix
:widths: 15 15 15 15
:header-rows: 1

* - **Model Family**
- **FP8**
- **INT8_SQ**
- **INT4_AWQ**
* - Llama (1, 2, 3)
- ✅
- ✅
- ✅
* - Mistral
- ✅
- ✅
- ✅
* - `GPT-3 <https://huggingface.co/nvidia/GPT-2B-001>`_
- ✅
- ✅
- ✅
* - `Nemotron-3 8b <https://huggingface.co/nvidia/nemotron-3-8b-base-4k>`_
- ✅
- ✅
- ✅
* - Nemotron-4 15b
- ✅
- ✅
- ✅
* - StarCoder 2
- ✅
- ✅
- ✅
* - Gemma
- ✅
- ✅
- ✅


Example
^^^^^^^
@@ -31,7 +73,7 @@ The script must be launched correctly with the number of processes equal to tens

.. code-block:: bash

torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_llama_quantization.py \
torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_quantization.py \
model_file=llama2-70b-base-bf16.nemo \
tensor_model_parallel_size=8 \
pipeline_model_parallel_size=1 \
6 changes: 3 additions & 3 deletions docs/source/starthere/intro.rst
@@ -96,13 +96,13 @@ This section details the steps to clone and install the Megatron Core.
git checkout a5415fcfacef2a37416259bd38b7c4b673583675 && \
pip install .

AMMO Installation
Model Optimizer Installation

This final step involves installing the AMMO package.
This final step involves installing the Model Optimizer package.

.. code-block:: bash

pip install nvidia-ammo~=0.7.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir
pip install nvidia-modelopt[torch]~=0.11.0 --extra-index-url https://pypi.nvidia.com


.. code-block:: bash

@@ -80,6 +80,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling parameters
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -78,6 +78,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling params
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -85,6 +85,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling parameters
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -84,6 +84,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling params
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -90,6 +90,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling parameters
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -88,6 +88,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling params
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -87,6 +87,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling parameters
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -85,6 +85,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling params
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -88,6 +88,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 18
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling params
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding

@@ -90,6 +90,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling parameters
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding
1 change: 1 addition & 0 deletions examples/asr/conf/ssl/fastconformer/fast-conformer.yaml
@@ -79,6 +79,7 @@ model:
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 17
d_model: 512
use_bias: True # whether to apply bias in the feedforward, MHA and convolution modules

# Sub-sampling params
subsampling: dw_striding # vggnet, striding, stacking or stacking_norm, dw_striding
13 changes: 9 additions & 4 deletions examples/asr/transcribe_speech.py
@@ -29,6 +29,7 @@
from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig
from nemo.collections.asr.parts.submodules.multitask_decoding import MultiTaskDecoding, MultiTaskDecodingConfig
from nemo.collections.asr.parts.submodules.rnnt_decoding import RNNTDecodingConfig
from nemo.collections.asr.parts.submodules.rnnt_greedy_decoding import GreedyBatchedRNNTInferConfig
from nemo.collections.asr.parts.utils.eval_utils import cal_write_wer
from nemo.collections.asr.parts.utils.rnnt_utils import Hypothesis
from nemo.collections.asr.parts.utils.transcribe_utils import (
@@ -121,9 +122,9 @@ class TranscriptionConfig:
pretrained_name: Optional[str] = None # Name of a pretrained model
audio_dir: Optional[str] = None # Path to a directory which contains audio files
dataset_manifest: Optional[str] = None # Path to dataset's JSON manifest
channel_selector: Optional[
Union[int, str]
] = None # Used to select a single channel from multichannel audio, or use average across channels
channel_selector: Optional[Union[int, str]] = (
None # Used to select a single channel from multichannel audio, or use average across channels
)
audio_key: str = 'audio_filepath' # Used to override the default audio key in dataset_manifest
eval_config_yaml: Optional[str] = None # Path to a yaml file of config of evaluation
presort_manifest: bool = True # Significant inference speedup on short-form data due to padding reduction
@@ -161,6 +162,7 @@ class TranscriptionConfig:
ctc_decoding: CTCDecodingConfig = CTCDecodingConfig()

# Decoding strategy for RNNT models
# enable CUDA graphs for transcription
rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig(fused_batch_size=-1)

# Decoding strategy for AED models
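
The config fields above use dataclass instances as default values. As a hedged aside, on Python 3.11 and later the `dataclasses` module rejects unhashable defaults, which includes instances of ordinary non-frozen dataclasses, so an equivalent declaration can be written with `field(default_factory=...)`. The sketch below uses hypothetical stand-in classes, not the real NeMo configs, and the `strategy` default is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DecodingConfigSketch:
    """Hypothetical stand-in for a decoding config such as RNNTDecodingConfig."""

    strategy: str = "greedy_batch"  # assumed default, for illustration only
    fused_batch_size: int = -1


@dataclass
class TranscriptionConfigSketch:
    """Hypothetical stand-in for a top-level transcription config."""

    # default_factory builds a fresh nested config for every instance.
    rnnt_decoding: DecodingConfigSketch = field(
        default_factory=lambda: DecodingConfigSketch(fused_batch_size=-1)
    )


if __name__ == "__main__":
    a = TranscriptionConfigSketch()
    b = TranscriptionConfigSketch()
    a.rnnt_decoding.fused_batch_size = 16
    print(a.rnnt_decoding.fused_batch_size, b.rnnt_decoding.fused_batch_size)
```

Because each instance gets its own nested config, changing `a.rnnt_decoding` leaves `b.rnnt_decoding` untouched.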
@@ -407,7 +409,10 @@ def autocast(dtype=None):
override_cfg.augmentor = augmentor
override_cfg.text_field = cfg.gt_text_attr_name
override_cfg.lang_field = cfg.gt_lang_attr_name
transcriptions = asr_model.transcribe(audio=filepaths, override_config=override_cfg,)
transcriptions = asr_model.transcribe(
audio=filepaths,
override_config=override_cfg,
)

if cfg.dataset_manifest is not None:
logging.info(f"Finished transcribing from manifest file: {cfg.dataset_manifest}")
4 changes: 3 additions & 1 deletion examples/asr/transcribe_speech_parallel.py
@@ -84,6 +84,7 @@
from nemo.collections.asr.models import ASRModel, EncDecHybridRNNTCTCModel
from nemo.collections.asr.models.configs.asr_models_config import ASRDatasetConfig
from nemo.collections.asr.parts.submodules.rnnt_decoding import RNNTDecodingConfig
from nemo.collections.asr.parts.submodules.rnnt_greedy_decoding import GreedyBatchedRNNTInferConfig
from nemo.core.config import TrainerConfig, hydra_runner
from nemo.utils import logging
from nemo.utils.get_rank import is_global_rank_zero
@@ -100,7 +101,8 @@ class ParallelTranscriptionConfig:
use_cer: bool = False

# decoding strategy for RNNT models
rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig()
# Double check whether fused_batch_size=-1 is right
rnnt_decoding: RNNTDecodingConfig = RNNTDecodingConfig(fused_batch_size=-1)

# decoder type: ctc or rnnt, can be used to switch between CTC and RNNT decoder for Hybrid RNNT/CTC models
decoder_type: Optional[str] = None
8 changes: 4 additions & 4 deletions examples/nlp/language_modeling/conf/megatron_bert_config.yaml
@@ -5,7 +5,7 @@ trainer:
devices: 1
num_nodes: 1
accelerator: gpu
precision: 16
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
@@ -41,7 +41,7 @@ exp_manager:

model:
# model parallelism
mcore_bert: False
mcore_bert: True
micro_batch_size: 4
global_batch_size: 8
tensor_model_parallel_size: 1
@@ -85,7 +85,7 @@ model:
fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16

# Megatron O2-style half-precision
megatron_amp_O2: False # Enable O2-level automatic mixed precision using main parameters
megatron_amp_O2: True # Enable O2-level automatic mixed precision using main parameters
grad_allreduce_chunk_size_mb: 125
grad_div_ar_fusion: False

@@ -158,4 +158,4 @@ model:
name: CosineAnnealing
warmup_steps: 500
constant_steps: 50000
min_lr: 2e-5
min_lr: 2e-5

@@ -81,7 +81,7 @@ model:
position_embedding_type: 'rope' # Position embedding type. Options ['learned_absolute', 'rope']
rotary_percentage: 0.5 # If using position_embedding_type=rope, then the per head dim is multiplied by this. For chatglm2, it is 0.5 (https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L754)
rotary_interleaved: True # chatglm2 uses interleaved rotary embedding
apply_rope_fusion: True
apply_rope_fusion: False
attention_type: 'multihead' # Attention type. Options ['multihead']
share_embeddings_and_output_weights: False # Share embedding and output layer weights.
overlap_p2p_comm: False # Overlap p2p communication with computes. This argument is valid only when `virtual_pipeline_model_parallel_size` is larger than 1

@@ -113,7 +113,7 @@ model:
bias_dropout_add_fusion: False # Use a kernel that fuses the bias addition, dropout and residual connection addition.
masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages.
apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope
apply_rope_fusion: False # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope


# Miscellaneous
8 changes: 4 additions & 4 deletions examples/nlp/language_modeling/conf/megatron_gpt_config.yaml
@@ -9,7 +9,7 @@ trainer:
devices: 1
num_nodes: 1
accelerator: gpu
precision: 16
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
@@ -55,7 +55,7 @@ exp_manager:

model:
# use GPTModel from megatron.core
mcore_gpt: False
mcore_gpt: True

# specify micro_batch_size, global_batch_size, and model parallelism
# gradient accumulation will be done automatically based on data_parallel_size
@@ -120,7 +120,7 @@ model:
fp16_lm_cross_entropy: False # Move the cross entropy unreduced loss calculation for lm head to fp16

# Megatron O2-style half-precision
megatron_amp_O2: False # Enable O2-level automatic mixed precision using main parameters
megatron_amp_O2: True # Enable O2-level automatic mixed precision using main parameters
grad_allreduce_chunk_size_mb: 125

# Fusion
@@ -130,7 +130,7 @@ model:
bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition.
masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages.
apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope
apply_rope_fusion: False # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope


# Miscellaneous

@@ -112,7 +112,7 @@ model:
bias_dropout_add_fusion: False # Use a kernel that fuses the bias addition, dropout and residual connection addition.
masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages.
apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope
apply_rope_fusion: False # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope


# Miscellaneous

@@ -31,7 +31,7 @@ export:
decoder_type: llama # gptnext, gpt2, llama
inference_tensor_parallel: 1 # Default using 1 TP for inference
inference_pipeline_parallel: 1 # Default using 1 PP for inference
dtype: 16 # Default precision data type
dtype: bf16 # Default precision data type

model_file: llama2-7b-fp16.nemo # Nemo file path
model_save: llama2-7b-fp8.qnemo # Path where the quantized model will be saved

@@ -117,7 +117,7 @@ model:
bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition.
masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
get_attention_mask_from_fusion: True # When using fused softmax it will create the attention mask so we won't copy it to the pipeline stages.
apply_rope_fusion: True # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope
apply_rope_fusion: False # Use a kernel to add rotary positional embeddings. Only used if position_embedding_type=rope

# Miscellaneous
seed: 1234