Draft
83 commits
fb651ee
wip
paarthneekhara Apr 15, 2026
f06dd2d
WIP
paarthneekhara Apr 16, 2026
526ff55
speaker encoder optional
paarthneekhara Apr 19, 2026
aef605f
Apply isort and black reformatting
paarthneekhara Apr 22, 2026
a4b3fe1
add option to remove text embedding and lm head
paarthneekhara May 2, 2026
01652a7
cas encoder layers
paarthneekhara May 2, 2026
a2ac6f8
use IPA as text prob added during training
paarthneekhara May 3, 2026
0e2bfaa
Add multiturn dataloader
Edresson Apr 13, 2026
8bfb7bd
Update multiturn dataloader
Edresson Apr 14, 2026
68a1978
Add multiturn config
Edresson Apr 14, 2026
c9ebcb2
Add a intermediary fix for prior
Edresson Apr 15, 2026
bd39122
Fix audio tokens name
Edresson Apr 15, 2026
1231ffb
Add formatter to support json dataset on lhotse inference
Edresson Apr 15, 2026
ff60947
Update inference to support multiturn dataloader
Edresson Apr 16, 2026
87df539
Bug fix in dataloder
Edresson Apr 17, 2026
c868e21
Add parameter to remove user turns
Edresson Apr 17, 2026
c56b8c2
Add multiturn inference script
Edresson Apr 21, 2026
47b2fe4
Update inference script
Edresson Apr 21, 2026
e373e44
Remove unused codes
Edresson Apr 21, 2026
8e9ebb0
Update inference recipe
Edresson Apr 22, 2026
baa8e77
Remove librosa resample
Edresson Apr 22, 2026
ebf7f0a
Add silence aug
Edresson Apr 22, 2026
7c63ae3
Add silence tts data augmentation
Edresson Apr 23, 2026
329cfc6
Add parameter to remove subword text conditioning
Edresson Apr 27, 2026
0d5ef87
Add support for extra duplex dataloaders
Edresson Apr 27, 2026
fc99f28
Add partial loading
Edresson Apr 28, 2026
967306b
Fix interruption handling for validation dataset
Edresson Apr 28, 2026
23b0f48
Add user silence mask
Edresson Apr 30, 2026
d0580a9
Add use_user_speaking_token
Edresson May 1, 2026
c352a64
IPA handling in multiturn data
shehzeen May 2, 2026
bf66a01
Fix inference script
Edresson May 4, 2026
056cfc9
Add transition tokens on loss
Edresson May 4, 2026
a0e0a54
Add extra parameter type
Edresson May 4, 2026
43a46b8
Fix augmentation
Edresson May 5, 2026
7c31b34
Add min_number_of_turns and max_gap_duration_collapse_turns
Edresson May 5, 2026
048c91b
Add phoneme multiturn inference support and update silence augmentation
Edresson May 7, 2026
7b82ec6
Fix sil augmentation on formatter
Edresson May 8, 2026
4ed07aa
Fix inference
Edresson May 8, 2026
d554231
Add new inference script and fix data formatter
Edresson May 11, 2026
ec05d9b
Fix merge issue
Edresson May 12, 2026
f20e2e9
Add restore custom checkpoint to avoid full model loading on .nemo ch…
Edresson May 13, 2026
c7af376
Add raw tts data support on TTS dataloader
Edresson May 14, 2026
5252400
Remove complex prefil code and add slupport to nemotron_h on prefill
Edresson May 14, 2026
5e773a8
Add user aduio conditioning
Edresson May 17, 2026
a35c51d
Add multiturn inference support with user audio conditioning
Edresson May 18, 2026
373bf1a
Add silence if user audio is not available
Edresson May 19, 2026
6bbee02
Add multiturn augmentations
Edresson May 19, 2026
18cd4fd
Add update inference
Edresson May 20, 2026
cbbedd3
Add use_explicit_silence_for_streaming_audio_delay
Edresson May 20, 2026
a60a1a6
Add new trim augmentation
Edresson May 21, 2026
80c95c5
Add new trim aug
Edresson May 21, 2026
a19de1a
remove use_explicit_silence_for_streaming_audio_delay
Edresson May 21, 2026
3840141
Add new inference
Edresson May 21, 2026
8b1a0d3
Update inference script
Edresson May 22, 2026
e4a53f0
Fix phoneme loss
Edresson May 25, 2026
c7817e1
Remove debug print
Edresson May 25, 2026
ca69191
Fix phoneme inference
Edresson May 25, 2026
ae17a09
Add phoneme_loss_mask_padding
Edresson May 26, 2026
81cc757
Update inference
Edresson May 26, 2026
2d4bc28
Add full user prefill support on nemotron_h class
Edresson May 26, 2026
e91e0e9
Add phoneme_loss_mask_agent_expanded
Edresson May 27, 2026
5be9aac
Rename phoneme_loss_mask_include_transition
Edresson May 27, 2026
4a3f9fb
Remove unused code
Edresson May 28, 2026
42eb971
Add partial copy support
Edresson May 28, 2026
d490367
Add parameter to drop all turn in sample
Edresson May 29, 2026
f77e77e
phoneme turn dropout
shehzeen May 30, 2026
208e7c7
filewise metrics and aggregated metrics in the inference script
shehzeen Jun 1, 2026
1fabe29
Add multigpu inference script
Edresson Jun 1, 2026
3311de9
phoneme pad for short turns
shehzeen Jun 3, 2026
9619236
ignore punctuation in word counting
shehzeen Jun 4, 2026
40e50ef
Add turn based metrics
Edresson Jun 6, 2026
bac2c03
Remove unused methods
Edresson Jun 8, 2026
475038c
Add new easymagpie compatible inference script
Edresson Jun 8, 2026
a73132d
Update EasyMagpie inference script to support multiturn
Edresson Jun 8, 2026
0a99730
Fix new inference volume norm
Edresson Jun 8, 2026
de7bbb9
Expose ASR and EOU batch sizes on config
Edresson Jun 9, 2026
81ca466
Get language from dataloader for multiturn eval
Edresson Jun 9, 2026
430d1c3
Remove old multiturn eval scripts
Edresson Jun 9, 2026
c824a23
Undo unecessary changes on cutset
Edresson Jun 9, 2026
926722c
Remove unused user_audio_mask
Edresson Jun 9, 2026
b07acd9
Aplly Black
Edresson Jun 9, 2026
32fd5f3
Remove unused params
Edresson Jun 9, 2026
5d84500
short phoneme turn handling
shehzeen Jun 9, 2026
1 change: 1 addition & 0 deletions examples/tts/conf/magpietts/easy_magpietts.yaml
@@ -17,6 +17,7 @@ model:
disable_lm_text_head: false
disable_subword_embedding: false
use_bpe_char_tokenizer: true
cas_encoder_n_layers: 1

# HuggingFace backend config (used when decoder_type: "huggingface")
transformer_hf_backend: "Qwen/Qwen2.5-1.5B"
1 change: 1 addition & 0 deletions examples/tts/conf/magpietts/easy_magpietts_lhotse.yaml
@@ -15,6 +15,7 @@ model:
disable_lm_text_head: false
disable_subword_embedding: false
use_bpe_char_tokenizer: true
cas_encoder_n_layers: 1

# HuggingFace backend config (used when decoder_type: "huggingface")
transformer_hf_backend: "Qwen/Qwen2.5-1.5B"
231 changes: 231 additions & 0 deletions examples/tts/conf/magpietts/easy_magpietts_lhotse_multiturn.yaml
@@ -0,0 +1,231 @@
name: Magpie-TTS-DecoderOnly-EN

quadratic_duration: 20

# Adjust batch size based on GPU memory
# When doing weighted sampling with multiple manifests, this defines how many training steps are in an epoch.
# If null, then weighted sampling is disabled.

model:
use_lhotse: true

# Decoder backend selection
# Options: "huggingface" (default), "nemotron_h"
decoder_type: "huggingface"

# HuggingFace backend config (used when decoder_type: "huggingface")
transformer_hf_backend: "Qwen/Qwen2.5-1.5B"

# NemotronH config (used when decoder_type: "nemotron_h")
# Hybrid Mamba2/MoE/Attention model (~3B total, ~600-800M active). Layer types via hybrid_override_pattern:
# 'M' = Mamba2 layer, '*' = Attention layer, '-' = MLP layer, 'E' = MoE layer
nemotron_h_config:
hidden_size: 1536 # Should match embedding_dim
num_hidden_layers: 48
vocab_size: 131072
# Attention config
num_attention_heads: 12
num_key_value_heads: 4
attention_dropout: 0.0
attention_bias: false
max_position_embeddings: 8192
# Mamba config
mamba_num_heads: 64
mamba_head_dim: 24
ssm_state_size: 128
conv_kernel: 4
n_groups: 8
chunk_size: 256
mamba_hidden_act: "silu"
use_conv_bias: true
use_bias: false
# MLP config
intermediate_size: 4096
mlp_hidden_act: "silu"
mlp_bias: false
# MoE config (scaled from Nemotron-3-Nano-30B-A3B)
n_routed_experts: 48
num_experts_per_tok: 6
moe_intermediate_size: 1024
moe_shared_expert_intermediate_size: 2048
n_group: 1
topk_group: 1
routed_scaling_factor: 2.5
norm_topk_prob: true
# Layer pattern: (M E M E M *) x 8 => 24 Mamba, 16 MoE, 8 Attention (48 layers total)
hybrid_override_pattern: "MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*"
# Normalization
layer_norm_epsilon: 1e-5
residual_in_fp32: true

use_text_conditioning_encoder: true # If true, distilbert will be used to encode context_text if provided.
context_duration_min: 5.0
context_duration_max: 5.0
load_cached_codes_if_available: true

embedding_dim: 1536
hidden_dim: 1536
audio_embedding_dim: 1536 # Can set a smaller dimension for audio embeddings to reduce parameters. Set equal to hidden_dim for no projection.
codecmodel_path: ???

# Local transformer parameters for autoregressive codebook prediction within a frame
local_transformer_type: "autoregressive" # "none", "autoregressive"
# Below args are only relevant if local_transformer_type is "autoregressive"
local_transformer_loss_scale: 1.0
phoneme_loss_weight: 1.0
local_transformer_n_layers: 3
local_transformer_n_heads: 12
local_transformer_hidden_dim: 1536

cfg_unconditional_prob: 0.05

# Multi-mode training configuration
training_modes:
- text_input_mode: "streaming" # Options: "full", "streaming"
streaming_phonemes_delay: 0
streaming_speech_delay: 1

frame_stacking_factor: 2
phoneme_stacking_factor: 1
phoneme_confidence_unk_threshold: 0.0 # If max phoneme probability is below this threshold at inference-time, replace the predicted timestep with UNK to reduce error propagation.
dropout_text_input_prob: 0.1
phoneme_corruption_batch_prob: 0.1
phoneme_corruption_timestep_ratio: 0.15
phoneme_corruption_unk_mode_prob: 0.5
phoneme_corruption_type: "repeat_skip_unk" # "repeat_skip_unk" or "complete_channel"
phoneme_turn_dropout_batch_prob: 0.0 # prob of applying turn dropout to a sample
phoneme_turn_dropout_turn_prob: 0.0 # prob of dropping each phoneme turn within a sample
phoneme_turn_max_words_to_drop: 0 # turns with <= this many words keep phoneme tokens as pad_id

phoneme_tokenizer:
_target_: nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPABPETokenizer
tokenizer_path: ???

text_tokenizers:
nemotron_nano_30b:
_target_: AutoTokenizer
pretrained_model: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

train_ds:
use_lhotse: ${model.use_lhotse}
volume_norm: true
dataset:
multi_config: true
shuffle: true
seed: 42
shard_seed: "trng"

sampler_fusion: randomized_round_robin
sampler_weights:
tts_data: 0.5
duplex_data: 0.5
tts_data:
min_duration: 0.2
min_context_speaker_similarity: 0.6
max_cer: 0.03
batch_duration : ??? # in seconds. Adjust based on your GPU memory.
quadratic_duration: ${quadratic_duration}
use_bucketing: true
num_buckets: 20
bucket_buffer_size: 10_000
shuffle_buffer_size: 10_000
num_cuts_for_bins_estimate: 10_000
shard_seed: "trng"
drop_last: true
shuffle: true
num_workers: 6
pin_memory: true

input_cfg:
- type: lhotse_shar
shar_path: ???
weight: 1.0
tags:
tokenizer_names: ["english_phoneme"]

duplex_data:
input_cfg: /lustre/fsw/convai_convaird_nemo-speech/data/duplex/multispeaker_syn_duplex.yaml
use_bucketing: true
num_buckets: 20
bucket_buffer_size: 1_000
shuffle_buffer_size: 1_000
num_cuts_for_bins_estimate: 1_000
max_duration: 300 # 5 min max duration
bucket_duration_bins: [4.0, 8.9, 10.2, 11.6, 13.2, 15.0, 17.0, 19.3, 25.0, 31.5, 38.5, 46.0, 55.5, 66.5, 79.5, 93.3, 110.0, 130.0, 156.8, 203.3]
bucket_batch_size: [75, 33, 29, 25, 23, 20, 18, 15, 12, 10, 8, 7, 5, 4, 3, 3, 2, 2, 1, 1]


validation_ds:
use_lhotse: ${model.use_lhotse}
volume_norm: true

dataset:
min_duration: 0.2
min_context_speaker_similarity: 0.6
max_cer: 0.03
batch_duration: ??? # Use a smaller batch_duration for validation than for training.
quadratic_duration: ${quadratic_duration}
use_bucketing: false
force_finite: true
force_map_dataset: true
drop_last: false
shuffle: false
num_workers: 2
pin_memory: true
seed: 42
shard_seed: "randomized"

input_cfg:
- type: lhotse_shar
shar_path: ???
weight: 1.0
tags:
tokenizer_names: ["english_phoneme"]

optim:
_target_: torch.optim.AdamW
lr: 1e-4

sched:
name: ExponentialLR
gamma: 0.998

trainer:
num_nodes: 1
devices: -1
accelerator: gpu
strategy: ddp_find_unused_parameters_true
precision: bf16-mixed
max_steps: ???
accumulate_grad_batches: 1
enable_checkpointing: False # Provided by exp_manager
logger: false # Provided by exp_manager
log_every_n_steps: 100
limit_train_batches: 1_000
val_check_interval: 1_000
num_sanity_val_steps: 0
benchmark: false
use_distributed_sampler: false # required because Lhotse has its own handling
gradient_clip_val: 2.5

exp_manager:
exp_dir: null
name: ${name}
create_tensorboard_logger: true
create_wandb_logger: false
wandb_logger_kwargs:
entity: null
name: ${name}
project: null
group: null
resume: true
create_checkpoint_callback: true
checkpoint_callback_params:
monitor: val_loss
mode: min
save_top_k: 5
save_best_model: true
always_save_nemo: true
filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.4f}-{step}-{epoch}'
resume_if_exists: true
resume_ignore_no_checkpoint: true
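
A quick cross-check of the values in `easy_magpietts_lhotse_multiturn.yaml` can be sketched in Python. This is a minimal sketch and not part of the PR: the helper names `count_layer_types` and `max_audio_seconds_per_bucket` are hypothetical, the constants are copied by hand from the config above, and the bucket calculation assumes each entry in `bucket_duration_bins` is the upper duration bound (in seconds) of its bucket.

```python
from collections import Counter

# Values copied by hand from easy_magpietts_lhotse_multiturn.yaml.
HYBRID_OVERRIDE_PATTERN = "MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*MEMEM*"
NUM_HIDDEN_LAYERS = 48
BUCKET_DURATION_BINS = [
    4.0, 8.9, 10.2, 11.6, 13.2, 15.0, 17.0, 19.3, 25.0, 31.5,
    38.5, 46.0, 55.5, 66.5, 79.5, 93.3, 110.0, 130.0, 156.8, 203.3,
]
BUCKET_BATCH_SIZE = [75, 33, 29, 25, 23, 20, 18, 15, 12, 10, 8, 7, 5, 4, 3, 3, 2, 2, 1, 1]

# Symbol legend taken from the comment above nemotron_h_config.
LAYER_NAMES = {"M": "mamba2", "*": "attention", "-": "mlp", "E": "moe"}


def count_layer_types(pattern: str) -> dict:
    """Count how many layers of each type a hybrid pattern string describes."""
    unknown = set(pattern) - set(LAYER_NAMES)
    if unknown:
        raise ValueError(f"Unknown layer symbols: {sorted(unknown)}")
    counts = Counter(pattern)
    # Only report layer types that actually occur in the pattern.
    return {LAYER_NAMES[s]: counts[s] for s in LAYER_NAMES if counts[s]}


def max_audio_seconds_per_bucket(bins, batch_sizes):
    """Upper bound on audio seconds per batch, assuming bins are per-bucket upper bounds."""
    if len(bins) != len(batch_sizes):
        raise ValueError("bucket_duration_bins and bucket_batch_size must have the same length")
    return [round(duration * size, 1) for duration, size in zip(bins, batch_sizes)]


if __name__ == "__main__":
    layer_counts = count_layer_types(HYBRID_OVERRIDE_PATTERN)
    print(layer_counts, "total:", sum(layer_counts.values()))
    print(max_audio_seconds_per_bucket(BUCKET_DURATION_BINS, BUCKET_BATCH_SIZE))
```

With the values above, the pattern resolves to 24 Mamba2, 16 MoE, and 8 attention layers (48 in total, matching `num_hidden_layers`). Under the stated assumption about bin semantics, no duplex bucket holds more than about 322 seconds of audio per batch.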
3 changes: 3 additions & 0 deletions examples/tts/easy_magpietts.py
@@ -55,6 +55,9 @@ def main(cfg):
else:
raise NotImplementedError(f"Only train, onlinepo_train and test modes are supported. Got {mode}")

if cfg.get("pretrained_model", None):
model.restore_from_pretrained_checkpoint(cfg.pretrained_model)

model.maybe_init_from_pretrained_checkpoint(cfg=cfg)

if mode in ['train', 'onlinepo_train']:
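
The control flow added in this hunk can be illustrated in isolation. The following is a sketch under stated assumptions, not the PR's implementation: `StubModel` and `init_weights` are hypothetical stand-ins, a plain `dict` replaces the real config object, and the stub methods only record calls instead of loading weights. The two method names are the ones that appear in the diff.

```python
class StubModel:
    """Hypothetical stand-in for the model; records calls instead of loading weights."""

    def __init__(self):
        self.calls = []

    def restore_from_pretrained_checkpoint(self, path):
        self.calls.append(("restore_from_pretrained_checkpoint", path))

    def maybe_init_from_pretrained_checkpoint(self, cfg):
        self.calls.append(("maybe_init_from_pretrained_checkpoint", None))


def init_weights(model, cfg):
    """Mirror the order of calls in the hunk above."""
    # The custom restore runs only when `pretrained_model` is set to a truthy value.
    if cfg.get("pretrained_model", None):
        model.restore_from_pretrained_checkpoint(cfg["pretrained_model"])
    # The pre-existing initialization hook always runs afterwards.
    model.maybe_init_from_pretrained_checkpoint(cfg=cfg)
    return model.calls


if __name__ == "__main__":
    print(init_weights(StubModel(), {}))
    print(init_weights(StubModel(), {"pretrained_model": "/path/to/checkpoint"}))
```

Because the guard tests truthiness, a missing key, `None`, and an empty string all skip the custom restore, while the existing `maybe_init_from_pretrained_checkpoint` call still runs in every case.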