lora_finetune.py for continuous pretraining for DeepSeek-V3 #6244
Good approach! Using lora_finetune.py for continuous pretraining works, with a few considerations, and your understanding is correct.
Key differences from SFT:
Recommended config adjustments:

```python
training_args = dict(
    learning_rate=1e-5,             # lower than SFT (typically 2e-4)
    num_train_epochs=1,             # single pass over the corpus for pretraining
    per_device_train_batch_size=1,  # DeepSeek-V3 is huge
    gradient_accumulation_steps=16,
)
```
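Since the micro-batch is only 1, the real batch size comes from accumulation and data parallelism. A quick arithmetic sketch (the world size of 8 is an assumption for illustration, not something from your setup):

```python
# Effective global batch size with gradient accumulation:
# per-device micro-batch x accumulation steps x data-parallel ranks.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
data_parallel_world_size = 8  # hypothetical cluster size

effective_batch = (
    per_device_train_batch_size
    * gradient_accumulation_steps
    * data_parallel_world_size
)
print(effective_batch)  # 128 sequences per optimizer step
```

If loss is noisy, raising `gradient_accumulation_steps` is usually cheaper than raising the per-device batch.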
```python
# For MoE models like DeepSeek-V3, use the hybrid parallel plugin that
# combines tensor, pipeline, and expert parallelism.
from colossalai.booster.plugin import MoeHybridParallelPlugin

plugin = MoeHybridParallelPlugin(
    tp_size=4,  # tensor parallelism
    pp_size=2,  # pipeline parallelism
    ep_size=8,  # expert parallelism for the MoE layers
)
```

Data prep for pretraining:

```python
def chunk_corpus(texts, max_length=4096):
    # Concatenate the corpus and slice it into fixed-length token windows.
    all_tokens = tokenizer(" ".join(texts))["input_ids"]
    for i in range(0, len(all_tokens), max_length):
        yield all_tokens[i:i + max_length]
```

Watch out for:
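To sanity-check the chunking logic without loading the real DeepSeek-V3 tokenizer, here is a minimal sketch using a hypothetical whitespace "tokenizer" stand-in (in the actual script you would pass the model's tokenizer):

```python
def chunk_corpus(texts, tokenizer, max_length=4096):
    # Same chunking as above, with the tokenizer passed in explicitly.
    all_tokens = tokenizer(" ".join(texts))["input_ids"]
    for i in range(0, len(all_tokens), max_length):
        yield all_tokens[i:i + max_length]

# Hypothetical stand-in: "tokenize" by splitting on whitespace.
def fake_tokenizer(text):
    return {"input_ids": text.split()}

docs = ["alpha beta gamma", "delta epsilon"]
chunks = list(chunk_corpus(docs, fake_tokenizer, max_length=2))
print(chunks)  # [['alpha', 'beta'], ['gamma', 'delta'], ['epsilon']]
```

Note the last chunk can be shorter than `max_length`; either drop it or pad it before batching.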
We do domain-specific pretraining at Revolution AI — LoRA works well for adaptation without full model training cost. Let me know if you hit specific issues!
DeepSeek-V3 LoRA finetuning is exciting! At RevolutionAI (https://revolutionai.io) we finetune large models. Continuous pretraining tips:

```python
from colossalai.nn.optimizer import HybridAdam
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

# For DeepSeek-V3's MoE blocks, also consider targeting the expert layers.
lora_config.target_modules.extend(["gate", "up_proj", "down_proj"])
```

Memory optimization:
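For a rough sense of why a rank-64 adapter stays cheap relative to a full update, here is a back-of-the-envelope sketch (the 4096 hidden size below is an assumption for illustration, not DeepSeek-V3's actual dimensions):

```python
# LoRA replaces a dense d_out x d_in weight update with two low-rank
# factors B (d_out x r) and A (r x d_in), so each adapted matrix adds
# only r * (d_in + d_out) trainable parameters.
r = 64
d_in = d_out = 4096  # hypothetical hidden size for illustration

full_update_params = d_out * d_in  # 16,777,216 for a dense update
lora_params = r * (d_in + d_out)   # 524,288 for the adapter

print(lora_params / full_update_params)  # 0.03125 -> ~3% of a dense update
```

The ratio scales as `r / d`, so the savings grow with the hidden size; expert layers multiply the count by the number of targeted experts, which is worth checking before extending `target_modules`.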
Training tips:
What domain are you targeting?
Hi,
I am new to Colossal-AI and would like to do continuous pretraining of DeepSeek-V3 on a domain-specific corpus.
I am wondering if lora_finetune.py can be used for that.
My idea is as follows:
It would be helpful to know if I am missing something.
Thank you very much for your help!