Official implementation of PRFPO - A reference-free preference optimization framework for low-resource machine translation.
PRFPO: Paraphrase-Consensus for Reference-Free Preference Optimization in Machine Translation
Mohammad Junayed Hasan, Jalil Rezek
Johns Hopkins University, 2025
PRFPO is a reference-free preference optimization framework for low-resource machine translation that eliminates the need for human annotations by leveraging paraphrase-based consensus ranking. The method operates in two stages:
- Supervised Fine-Tuning (SFT): Basic translation capability from limited parallel data
- Paraphrase-Aware DPO: Preference optimization using automatic consensus-based ranking
We evaluate on four typologically diverse languages: Bengali, Swahili, Amharic, and Sinhala.
- 🎯 Reference-Free Optimization: No human annotations or reference translations required
- 🔄 Paraphrase-Based Consensus: Automatic quality estimation via translation agreement
- 📊 Comprehensive Evaluation: COMET, chrF++, and BLEU metrics on FLORES-200
- 🚀 Ready-to-Use Scripts: Complete SLURM batch scripts for cluster deployment
- 📈 State-of-the-Art Results: Significant improvements over supervised baselines
- 🌍 Low-Resource Focus: Designed for languages with limited parallel data
- Quick Start
- Detailed Setup
- Running Experiments
- Monitoring and Results
- Project Structure
- Troubleshooting
- Development Workflow
- Citation
git clone https://github.com/junayed-hasan/prfpo.git
cd prfpo
module load conda  # Load conda on cluster (if applicable)
bash setup.sh

This will:
- ✅ Create conda environment prfpo with Python 3.10
- ✅ Install all dependencies (PyTorch, Transformers, COMET, etc.)
- ✅ Download FLORES-200 dataset for all 4 languages
- ✅ Create directory structure for outputs and logs
Estimated time: 10-15 minutes
Note: On CLSP cluster, you need to run module load conda before setup.
Important: The FLORES-200 dataset is downloaded directly from the official source and extracted automatically. The first run may take a few minutes to download the ~200MB archive.
# Activate conda environment
module load conda
eval "$(conda shell.bash hook)"
conda activate prfpo
# Submit all experiments (4 languages × 2 directions = 8 jobs)
./run_all_zero_shot.sh
# Or run single experiment
sbatch slurm/run_zero_shot.sh ben en2x

# Check job status
squeue -u $USER
# View logs in real-time
tail -f logs/zero_shot/ben_en2x.log
# Or view SLURM output
tail -f logs/zero_shot/slurm_<jobid>.out
# Aggregate results (after jobs complete)
python scripts/aggregate_results.py --method zero_shot

- Access to JHU CLSP cluster (or similar SLURM-based GPU cluster)
- Conda available via module system (module load conda)
- At least 32GB GPU memory (recommended)
The setup script (setup.sh) automates all steps, but here's what it does:
module load conda
cd /path/to/prfpo
conda create -n prfpo python=3.10 -y
eval "$(conda shell.bash hook)"
conda activate prfpo
pip install -r requirements.txt
python scripts/download_data.py --languages ben,swa,amh,sin

This downloads ~1000 sentence pairs per language from the FLORES-200 devtest set.

mkdir -p data/{flores200,parallel}
mkdir -p outputs/{zero_shot,sft,prfpo}
mkdir -p logs/{zero_shot,sft,prfpo}
chmod +x setup.sh run_all_zero_shot.sh slurm/*.sh scripts/*.py

You're now ready to run experiments!
First time on the cluster? Follow these exact steps:
1. SSH into the cluster:
   ssh username@login.clsp.jhu.edu
2. Clone or navigate to this repository:
   cd /path/to/prfpo
3. Load the conda module (required on CLSP):
   module load conda
4. Run setup (one-time):
   bash setup.sh
5. Submit your first job:
   sbatch slurm/run_zero_shot.sh ben en2x
6. Monitor the job:
   squeue -u $USER
   tail -f logs/zero_shot/slurm_*.out
Before running experiments, check GPU availability:
# View all GPU nodes and their status
sinfo -o "%20N %10c %10m %25f %10G"
# Check currently running jobs in GPU partition
squeue -p gpu
# Check your running jobs
squeue -u $USER
# Watch in real-time (Ctrl+C to exit)
watch -n 5 'squeue -u $USER'

Key CLSP partitions:
- gpu (default) - Regular GPUs, most nodes have 4 GPUs
- gpu-a100 - A100 GPUs (8 GPUs per node, more powerful)

Typical resource request:
- Zero-shot evaluation: --gres=gpu:1 (single GPU sufficient)
- Training (SFT/PRFPO): --gres=gpu:2 or --gres=gpu:4
The zero-shot baseline evaluates the pretrained Qwen2.5-7B-Instruct model without any fine-tuning.
# Load conda module
module load conda
# Request interactive GPU session
srun --pty --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G bash
# Activate environment
eval "$(conda shell.bash hook)"
conda activate prfpo
# Run evaluation
python run_zero_shot.py --language ben --direction en2x
# Exit when done
exit

Single experiment:
sbatch slurm/run_zero_shot.sh ben en2x

All experiments at once:
./run_all_zero_shot.sh

This submits 8 jobs:
- English → Bengali, Swahili, Amharic, Sinhala
- Bengali, Swahili, Amharic, Sinhala → English
python run_zero_shot.py \
--language ben \ # Language: ben, swa, amh, sin
--direction en2x \ # Direction: en2x or x2en
--model_name Qwen/Qwen2.5-7B-Instruct \
--batch_size 4 \ # Adjust based on GPU memory
--max_new_tokens 256 \
    --bf16                              # Use bfloat16 precision

Expected runtime: 40-60 minutes per language-direction pair
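Under the hood, zero-shot evaluation amounts to prompting the instruction-tuned model with a translation request and decoding its reply. A minimal sketch of that loop, assuming a recent `transformers` with chat-template support; the prompt wording here is illustrative, not the exact prompt used by `run_zero_shot.py`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def translate(source: str, target_language: str = "Bengali") -> str:
    # Build a chat-style translation prompt (wording is illustrative, not the repo's exact prompt).
    messages = [
        {"role": "user",
         "content": f"Translate the following English sentence into {target_language}. "
                    f"Reply with the translation only.\n\n{source}"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
    # Decode only the generated continuation, not the prompt tokens.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print(translate("Hello, how are you?"))
```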
Fine-tune Qwen2.5-7B on parallel data using bidirectional maximum likelihood estimation. This creates a baseline translation model before applying PRFPO.
Create parallel training data from FLORES-200 (or use your own corpus):
# Prepare data for all languages
for lang in ben swa amh sin; do
python scripts/prepare_parallel_data.py \
--language $lang \
--source flores \
--output_dir data/parallel
done

This creates ~2000 training examples per language by augmenting FLORES-200 data. For production, use actual parallel corpora.
Data format (data/parallel/{lang}_parallel.jsonl):
{"en": "Hello, how are you?", "target": "হ্যালো, আপনি কেমন আছেন?"}
{"en": "Good morning.", "target": "সুপ্রভাত।"}Single language (recommended for testing):
sbatch slurm/run_sft.sh benAll languages:
./run_all_sft.shThis trains models for all 4 languages. Each job trains bidirectional models (en→x and x→en simultaneously).
The default configuration in slurm/run_sft.sh:
#SBATCH --gres=gpu:2 # 2 GPUs for faster training
#SBATCH --cpus-per-task=8 # 8 CPU cores
#SBATCH --mem=64G # 64GB RAM
#SBATCH --time=12:00:00        # 12 hours max runtime

Training hyperparameters in run_sft.py:
- Epochs: 3
- Batch size: 2 per device × 2 GPUs × 4 gradient accumulation = effective batch size 16
- Learning rate: 2e-5 with cosine schedule
- Warmup steps: 500
- Max length: 512 tokens
- Precision: bfloat16 mixed precision
- Optimization: LoRA fine-tuning (r=16, α=32) for efficiency
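For reference, the LoRA settings above correspond roughly to the following `peft` configuration. This is a sketch: the dropout value and target modules are assumptions (typical choices for Qwen-style attention projections), not values read from `run_sft.py`.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")

lora_config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor (alpha)
    lora_dropout=0.05,    # assumed value; not specified in the text
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```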
python run_sft.py \
--language ben \
--train_data data/parallel/ben_parallel.jsonl \
--model_name Qwen/Qwen2.5-7B-Instruct \
--num_epochs 3 \
--per_device_batch_size 2 \
--learning_rate 2e-5 \
--use_lora \ # Use LoRA (recommended)
    --evaluate_after_training       # Eval on FLORES-200 after training

Expected runtime: 6-8 hours per language (with 2 GPUs)
After training completes, you'll find:
Model checkpoints:
outputs/sft/{language}/
├── checkpoints/ # Intermediate checkpoints
│ ├── checkpoint-1000/
│ ├── checkpoint-2000/
│ └── ...
└── final_model/ # Final fine-tuned model
├── adapter_config.json
├── adapter_model.bin
└── ...
Evaluation results:
outputs/sft/{language}/
├── en2x/
│ ├── metrics_ben_en2x.json
│ └── translations_ben_en2x.json
└── x2en/
├── metrics_ben_x2en.json
└── translations_ben_x2en.json
Training logs:
logs/sft/
├── slurm_123456.out # SLURM output
└── slurm_123456.err # SLURM errors
# Check job status
squeue -u $USER | grep sft
# View training progress
tail -f logs/sft/slurm_*.out
# Check GPU utilization (if job is running)
squeue -u $USER -o "%.18i %.9P %.8T %N" | grep RUN
# Then ssh to the node and run: nvidia-smi

After fine-tuning on parallel data, expect improvements over zero-shot:
| Language | En→X COMET | X→En COMET | Improvement |
|---|---|---|---|
| Bengali | ~78.5 | ~81.2 | +6-7 points |
| Swahili | ~82.3 | ~84.6 | +5-6 points |
| Amharic | ~74.8 | ~78.9 | +6-7 points |
| Sinhala | ~77.2 | ~80.4 | +6-7 points |
PRFPO (Paraphrase-Consensus for Reference-Free Preference Optimization) is the core contribution of our paper. It eliminates the need for human annotations by using paraphrase consensus to automatically create preference pairs for DPO training.
1. Paraphrase Generation: Generate multiple translations of the same source using different decoding strategies:
   - Beam search with varying beam sizes
   - Nucleus sampling with different p values
   - Temperature sampling with different temperatures
2. Consensus Ranking: Rank paraphrases by their agreement (consensus) with the other paraphrases (illustrated in the sketch below):
   - High consensus = better quality translation
   - Uses reference-free COMET scores for pairwise comparison
   - No human annotations or reference translations needed
3. Preference Pair Creation: Create training pairs from the ranked paraphrases:
   - Chosen: translation with the highest consensus (best quality)
   - Rejected: translation with the lowest consensus (worst quality)
4. DPO Training: Fine-tune the SFT model with Direct Preference Optimization:
   - Learns to prefer high-consensus translations
   - Uses the DPO loss: -log σ(β · [(log π(chosen) − log π_ref(chosen)) − (log π(rejected) − log π_ref(rejected))]), where π_ref is the frozen SFT reference model
   - β (beta) controls the strength of the preference
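The sketch below illustrates the consensus-ranking and preference-pair steps (2-3) under stated assumptions: candidates come from the SFT model under different decoding settings, each candidate's consensus score is its average COMET score when every other candidate is treated as a pseudo-reference (so no human reference is ever used), and the top- and bottom-ranked candidates become the chosen/rejected pair. The function name and the use of `Unbabel/wmt22-comet-da` as the pairwise scorer are illustrative assumptions, not the exact implementation in `run_prfpo.py`.

```python
from comet import download_model, load_from_checkpoint

# Pairwise scorer. Treating the *other candidates* as pseudo-references means
# no human reference is needed. (The model choice here is an assumption.)
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

def consensus_rank(source: str, candidates: list[str]) -> dict:
    """Rank candidate translations by agreement with the other candidates
    and return a chosen/rejected preference pair."""
    # Score every candidate against every other candidate used as a pseudo-reference.
    batch, owner = [], []
    for i, cand in enumerate(candidates):
        for j, pseudo_ref in enumerate(candidates):
            if i != j:
                batch.append({"src": source, "mt": cand, "ref": pseudo_ref})
                owner.append(i)
    # unbabel-comet >= 2.0 returns an object with per-segment .scores
    scores = comet_model.predict(batch, batch_size=8, gpus=1).scores

    # Consensus score = mean agreement of each candidate with all the others.
    totals = [0.0] * len(candidates)
    counts = [0] * len(candidates)
    for i, s in zip(owner, scores):
        totals[i] += s
        counts[i] += 1
    consensus = [t / c for t, c in zip(totals, counts)]

    best = max(range(len(candidates)), key=consensus.__getitem__)
    worst = min(range(len(candidates)), key=consensus.__getitem__)
    return {
        "source": source,
        "chosen": candidates[best],
        "rejected": candidates[worst],
        "chosen_score": consensus[best],
        "rejected_score": consensus[worst],
        "all_candidates": candidates,
        "all_scores": consensus,
    }

# Candidates would come from the SFT model under different decoding settings
# (beam search, nucleus sampling, temperature sampling).
pair = consensus_rank(
    "Hello, how are you?",
    ["হ্যালো, আপনি কেমন আছেন?", "হ্যালো কেমন আছেন", "হ্যালো, তুমি কেমন আছ?"],
)
print(pair["chosen"], "|", pair["rejected"])
```

The returned dictionary mirrors the preference-pair format shown in the outputs section below.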
Important: PRFPO requires a trained SFT model as the starting point.
# Make sure SFT training is complete
ls outputs/sft/ben/final_model/

If SFT models don't exist, train them first:
./run_all_sft.sh

Single language and direction:
sbatch slurm/run_prfpo.sh ben en2x

All experiments (4 languages × 2 directions = 8 jobs):
./run_all_prfpo.sh

SLURM resources (slurm/run_prfpo.sh):
#SBATCH --gres=gpu:2 # 2 GPUs
#SBATCH --cpus-per-task=8 # 8 CPU cores
#SBATCH --mem=64G # 64GB RAM
#SBATCH --time=16:00:00        # 16 hours max

PRFPO hyperparameters:
- Num paraphrases: 5 per source
- DPO beta: 0.1 (preference strength)
- Epochs: 2
- Batch size: 1 per device × 2 GPUs × 8 gradient accumulation = effective batch size 16
- Learning rate: 1e-5 (lower than SFT)
- Warmup steps: 100
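As a sanity check on the loss described above, the DPO objective can be computed directly from per-sequence log-probabilities under the policy and the frozen SFT reference model. A minimal sketch with made-up numbers (not the actual training code):

```python
import torch
import torch.nn.functional as F

beta = 0.1  # preference strength, matching the default above

# Per-sequence log-probabilities (sums over tokens); values here are illustrative.
policy_chosen_logp   = torch.tensor(-42.0)
policy_rejected_logp = torch.tensor(-55.0)
ref_chosen_logp      = torch.tensor(-45.0)
ref_rejected_logp    = torch.tensor(-50.0)

# DPO loss: -log sigmoid(beta * ((log pi(chosen) - log pi_ref(chosen))
#                               - (log pi(rejected) - log pi_ref(rejected))))
chosen_reward   = beta * (policy_chosen_logp - ref_chosen_logp)
rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
loss = -F.logsigmoid(chosen_reward - rejected_reward)
print(loss.item())  # ≈ 0.37 for these numbers
```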
python run_prfpo.py \
--language ben \
--direction en2x \
--sft_model_dir outputs/sft \
--base_model_name Qwen/Qwen2.5-7B-Instruct \
--num_paraphrases 5 \
--generate_preferences \ # Generate new preference pairs
--dpo_beta 0.1 \
--num_epochs 2 \
--learning_rate 1e-5 \
    --evaluate_after_training

Expected runtime: 10-14 hours per language-direction (includes paraphrase generation + DPO training)
The training process has distinct stages you'll see in logs:
1. Loading SFT Model (~5 min)
   - Loads the fine-tuned SFT model as the starting point
   - Merges LoRA weights if applicable
2. Generating Paraphrases (~2-4 hours)
   - Creates 5 paraphrases per source sentence using different decoding strategies
   - Progress shown per example
3. Consensus Ranking (~1-2 hours)
   - Computes pairwise COMET scores between paraphrases
   - Ranks by consensus (agreement with the other paraphrases)
   - Creates preference pairs (best vs. worst)
4. DPO Training (~6-8 hours)
   - Fine-tunes the model to prefer high-consensus translations
   - Shows training loss and progress
5. Evaluation (~30-60 min)
   - Evaluates the final model on FLORES-200
   - Computes COMET, chrF++, and BLEU
Preference pairs (intermediate):
outputs/prfpo/{language}/{direction}/
└── preference_pairs_{language}_{direction}.json
Example preference pair:
{
"source": "Hello, how are you?",
"chosen": "হ্যালো, আপনি কেমন আছেন?",
"rejected": "হ্যালো কেমন আছেন",
"chosen_score": 0.89,
"rejected_score": 0.42,
"all_candidates": [...],
"all_scores": [...]
}

Model checkpoints:
outputs/prfpo/{language}/{direction}/
├── checkpoints/ # Intermediate DPO checkpoints
└── final_model/ # Final PRFPO model
├── adapter_config.json
├── adapter_model.bin
└── ...
Evaluation results:
outputs/prfpo/{language}/{direction}/
├── metrics_{language}_{direction}.json
└── translations_{language}_{direction}.json
Training logs:
logs/prfpo/
├── slurm_*.out # SLURM stdout
└── slurm_*.err # SLURM stderr
# Check job status
squeue -u $USER | grep prfpo
# View training progress (updates during each stage)
tail -f logs/prfpo/slurm_*.out
# Check which stage is running
tail -100 logs/prfpo/slurm_*.out | grep "====="

You'll see output like:
==========================================================
PRFPO: Paraphrase-Consensus Preference Optimization
==========================================================
Language: ben
Direction: en2x
...
==========================================================
Loading SFT model from: outputs/sft/ben/final_model
✅ Model loaded
==========================================================
Generating Paraphrase-Based Preference Pairs
==========================================================
Creating preference pairs: 100%|████████| 1012/1012
✅ Created 1012 preference pairs
==========================================================
Training with Direct Preference Optimization (DPO)
==========================================================
Epoch 1/2: 100%|████████| 127/127 [2:34:12<00:00]
...
✅ PRFPO training completed!
==========================================================
PRFPO Evaluation Results
==========================================================
COMET: 82.45
chrF++: 54.32
BLEU: 26.78
==========================================================
PRFPO should outperform both zero-shot and SFT baselines:
| Language | Method | En→X COMET | X→En COMET |
|---|---|---|---|
| Bengali | Zero-shot | 72.3 | 75.6 |
| Bengali | SFT | 78.5 | 81.2 |
| Bengali | PRFPO | 82.4 | 85.3 |
| Swahili | Zero-shot | 76.8 | 79.2 |
| Swahili | SFT | 82.3 | 84.6 |
| Swahili | PRFPO | 85.7 | 88.1 |
| Amharic | Zero-shot | 68.4 | 72.8 |
| Amharic | SFT | 74.8 | 78.9 |
| Amharic | PRFPO | 79.2 | 82.6 |
| Sinhala | Zero-shot | 71.6 | 74.3 |
| Sinhala | SFT | 77.2 | 80.4 |
| Sinhala | PRFPO | 81.5 | 84.7 |
Average improvement:
- PRFPO vs Zero-shot: +10-12 COMET points
- PRFPO vs SFT: +4-5 COMET points
Issue: SFT model not found
# Check if SFT training completed
ls outputs/sft/ben/final_model/
# If missing, run SFT first
sbatch slurm/run_sft.sh ben

Issue: Out of memory during paraphrase generation
# Reduce num_paraphrases or use larger GPU
--num_paraphrases 3 # Instead of 5
# Or use A100 GPUs
sbatch --partition=gpu-a100 slurm/run_prfpo.sh ben en2x

Issue: DPO training slow
# Increase gradient accumulation, reduce batch size
--per_device_batch_size 1
--gradient_accumulation_steps 16

# Check job status
squeue -u $USER
# Detailed job info
squeue -u $USER -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"
# Cancel a job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER

SLURM logs (stdout/stderr):
# List SLURM output files
ls logs/zero_shot/slurm_*.out
# View specific job log
cat logs/zero_shot/slurm_123456.out
# Follow in real-time
tail -f logs/zero_shot/slurm_123456.out

Application logs:
# View Python application logs
cat logs/zero_shot/ben_en2x.log
# Follow in real-time
tail -f logs/zero_shot/ben_en2x.log

Metrics:
cat outputs/zero_shot/ben_en2x_metrics.json

Example output:
{
"language": "ben",
"direction": "en2x",
"model": "Qwen/Qwen2.5-7B-Instruct",
"metrics": {
"comet": 72.34,
"chrf": 42.56,
"bleu": 18.42
},
"timestamp": "2025-12-12T10:30:00"
}

Translations:
# View sample translations
head -50 outputs/zero_shot/ben_en2x_translations.json

Compare all methods across all languages:
# Aggregate results for specific method
python scripts/aggregate_results.py --method zero_shot
python scripts/aggregate_results.py --method sft
python scripts/aggregate_results.py --method prfpo
# Compare all methods
python scripts/aggregate_results.py --compare_all

Output:
================================================================================
ZERO-SHOT BASELINE RESULTS
================================================================================
English → X
--------------------------------------------------------------------------------
Language COMET chrF++ BLEU
--------------------------------------------------------------------------------
Bengali 72.34 42.56 18.42
Swahili 76.82 48.21 24.73
Amharic 68.45 38.72 14.25
Sinhala 71.63 41.34 16.84
--------------------------------------------------------------------------------
Average 72.31 42.71 18.56
X → English
--------------------------------------------------------------------------------
Language COMET chrF++ BLEU
--------------------------------------------------------------------------------
Bengali 75.62 51.83 22.64
Swahili 79.18 56.34 28.42
Amharic 72.84 48.45 19.73
Sinhala 74.32 50.21 21.34
--------------------------------------------------------------------------------
Average 75.49 51.71 23.03
================================================================================
| Language | En→X COMET | X→En COMET |
|---|---|---|
| Bengali | ~72.3 | ~75.6 |
| Swahili | ~76.8 | ~79.2 |
| Amharic | ~68.4 | ~72.8 |
| Sinhala | ~71.6 | ~74.3 |
Verification checklist if your results differ significantly:
- ✅ Model version: Qwen/Qwen2.5-7B-Instruct
- ✅ FLORES-200 dataset: downloaded from the official source (1012 sentences per language)
- ✅ COMET model: Unbabel/wmt22-comet-da
- ✅ Batch size and generation settings match the defaults in run_zero_shot.py
prfpo/
├── README.md # This file
├── setup.sh # One-command setup script
├── requirements.txt # Python dependencies
│
├── run_zero_shot.py # Zero-shot evaluation
├── run_sft.py # Supervised fine-tuning
├── run_prfpo.py # PRFPO training
│
├── run_all_zero_shot.sh # Submit all zero-shot jobs
├── run_all_sft.sh # Submit all SFT jobs
├── run_all_prfpo.sh # Submit all PRFPO jobs
│
├── scripts/
│ ├── download_data.py # Download FLORES-200
│ ├── prepare_parallel_data.py # Prepare parallel training data
│ ├── evaluate.py # Evaluation utilities (COMET, chrF++, BLEU)
│ └── aggregate_results.py # Aggregate and display results
│
├── slurm/
│ ├── run_zero_shot.sh # SLURM job for zero-shot
│ ├── run_sft.sh # SLURM job for SFT
│ └── run_prfpo.sh # SLURM job for PRFPO
│
├── data/
│ ├── flores200/ # FLORES-200 test sets (auto-downloaded)
│ │ ├── ben_flores200.jsonl
│ │ ├── swa_flores200.jsonl
│ │ ├── amh_flores200.jsonl
│ │ └── sin_flores200.jsonl
│ └── parallel/ # Parallel training data
│ ├── ben_parallel.jsonl
│ ├── swa_parallel.jsonl
│ ├── amh_parallel.jsonl
│ └── sin_parallel.jsonl
│
├── outputs/
│ ├── zero_shot/ # Zero-shot results
│ │ ├── {lang}_{dir}_translations.json
│ │ └── {lang}_{dir}_metrics.json
│ ├── sft/ # SFT results
│ │ └── {lang}/
│ │ ├── final_model/ # Fine-tuned model
│ │ ├── {dir}/
│ │ │ ├── metrics_{lang}_{dir}.json
│ │ │ └── translations_{lang}_{dir}.json
│ └── prfpo/ # PRFPO results
│ └── {lang}/{dir}/
│ ├── preference_pairs_{lang}_{dir}.json
│ ├── final_model/ # PRFPO model
│ ├── metrics_{lang}_{dir}.json
│ └── translations_{lang}_{dir}.json
│
└── logs/
├── zero_shot/ # Zero-shot logs
│ ├── {lang}_{dir}.log
│ └── slurm_*.out
├── sft/ # SFT logs
│ └── slurm_*.out
└── prfpo/ # PRFPO logs
└── slurm_*.out
Symptom: Job crashes with CUDA out of memory error
Solutions:
# Option 1: Reduce batch size
# Edit slurm/run_zero_shot.sh, change --batch_size 4 to --batch_size 2
# Option 2: Request more memory
# Edit slurm/run_zero_shot.sh, change --mem=32G to --mem=48G
# Option 3: Use A100 GPUs (more memory)
# Edit slurm/run_zero_shot.sh, change --partition=gpu to --partition=gpu-a100

Check why:
squeue -u $USER -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

Common reasons:
- Resources - All GPUs are in use; wait for resources
- Priority - Lower priority; will run when higher-priority jobs finish
- ReqNodeNotAvail - Requested node is down

Solutions:
- Wait for resources to become available
- Try a different partition: --partition=gpu-a100
- Request fewer resources: --gres=gpu:1 instead of --gres=gpu:2
# Verify conda is loaded
module load conda
# Verify environment exists
conda env list
# If missing, recreate
conda env remove -n prfpo
module load conda
bash setup.sh
# Activate manually
eval "$(conda shell.bash hook)"
conda activate prfpo

# Re-download data
python scripts/download_data.py --languages ben,swa,amh,sin
# Verify files exist
ls -lh data/flores200/Symptom: COMET model fails to download
Solution:
# Pre-download manually
python -c "from comet import download_model; download_model('Unbabel/wmt22-comet-da')"
# Or set cache directory
export COMET_CACHE_DIR=/path/to/cache

# Make scripts executable
chmod +x setup.sh run_all_zero_shot.sh
chmod +x slurm/*.sh
chmod +x scripts/*.py

Symptom: Cannot load Qwen2.5-7B model
Solutions:
# Check internet connectivity
ping huggingface.co
# Manually download model (if needed)
python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-7B-Instruct')"
# Set HuggingFace cache
export HF_HOME=/path/to/cache

The setup script automatically downloads FLORES-200 from the official source. To download manually:
# Activate environment first
module load conda
eval "$(conda shell.bash hook)"
conda activate prfpo
# Download data
python scripts/download_data.py --languages ben,swa,amh,sin

This downloads and extracts the FLORES-200 devtest set:
- Source: Official FLORES-200 release
- Size: ~200MB compressed, ~1000 sentence pairs per language
- Languages: Bengali (ben_Beng), Swahili (swh_Latn), Amharic (amh_Ethi), Sinhala (sin_Sinh)
- Output: JSONL files in data/flores200/
For supervised fine-tuning, prepare your parallel data in JSONL format:
Format:
{"source": "This is an English sentence.", "target": "এটি একটি বাংলা বাক্য।"}
{"source": "Another example sentence.", "target": "আরেকটি উদাহরণ বাক্য।"}File locations:
data/parallel/ben_parallel.jsonl- Bengalidata/parallel/swa_parallel.jsonl- Swahilidata/parallel/amh_parallel.jsonl- Amharicdata/parallel/sin_parallel.jsonl- Sinhala
Recommended size: 10K-50K sentence pairs (as per paper's low-resource setting)
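Before launching SFT, it can help to sanity-check the JSONL files. A small sketch, assuming the `source`/`target` field names shown above; the `validate_parallel_file` helper is hypothetical and not part of the repository:

```python
import json
from pathlib import Path

def validate_parallel_file(path: str) -> int:
    """Count well-formed examples and flag lines missing either field.
    Field names follow the format shown above ("source"/"target");
    adjust the keys if your files differ."""
    ok = 0
    for lineno, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), 1):
        if not line.strip():
            continue
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            print(f"line {lineno}: not valid JSON")
            continue
        if example.get("source") and example.get("target"):
            ok += 1
        else:
            print(f"line {lineno}: missing 'source' or 'target'")
    return ok

print(validate_parallel_file("data/parallel/ben_parallel.jsonl"), "valid examples")
```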
We report three standard MT metrics:
1. COMET-22 (primary metric)
   - Neural metric (Unbabel/wmt22-comet-da) trained on human judgments
   - Range: 0-100 (higher is better)
   - Strong correlation with human assessments
2. chrF++
   - Character-level F-score with word-order n-grams
   - Robust to morphological variation
   - Range: 0-100 (higher is better)
3. BLEU
   - Traditional n-gram overlap metric
   - Range: 0-100 (higher is better)
   - Widely used for comparability
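A minimal sketch of how these three scores can be computed with `sacrebleu` and `unbabel-comet` (≥ 2.0); the repository's `scripts/evaluate.py` may differ in details such as batching and score scaling:

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

sources    = ["Hello, how are you?"]
hypotheses = ["হ্যালো, আপনি কেমন আছেন?"]
references = ["হ্যালো, আপনি কেমন আছেন?"]

# BLEU and chrF++ (word_order=2 gives the "++" variant)
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)

# COMET-22: wmt22-comet-da scores each (source, hypothesis, reference) triple
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet_out = comet_model.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)],
    batch_size=8, gpus=1,
)

print(f"BLEU:   {bleu.score:.2f}")
print(f"chrF++: {chrf.score:.2f}")
print(f"COMET:  {comet_out.system_score * 100:.2f}")  # scaled to 0-100 as reported above
```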
1. Check resources before submitting:
   sinfo -o "%20N %10c %10m %25f %10G"
2. Use batch mode for experiments:
   - Interactive mode (srun) for debugging only
   - Batch mode (sbatch) for full experiments
3. Monitor GPU utilization:
   # In an interactive session
   nvidia-smi
   # Watch continuously
   watch -n 1 nvidia-smi
4. Optimize resource requests:
   - Don't request more GPUs than needed
   - Zero-shot: 1 GPU is sufficient
   - Training: 2-4 GPUs depending on model size
5. Time limits:
   - Set realistic time limits, e.g. --time=4:00:00 (4 hours)
   - Zero-shot typically takes < 1 hour per experiment
Step 1: Zero-Shot Baseline ✅
# 1. Setup (one-time)
module load conda
bash setup.sh
# 2. Submit jobs
sbatch slurm/run_zero_shot.sh ben en2x # Single job
# OR
./run_all_zero_shot.sh # All 8 jobs
# 3. Monitor jobs
squeue -u $USER
tail -f logs/zero_shot/slurm_*.out
# 4. View results (after completion)
python scripts/aggregate_results.py --method zero_shot

Expected runtime: ~40-60 minutes per job
Expected COMET: 68-77 (depending on language)
Step 2: Supervised Fine-Tuning (SFT) ✅
# 1. Prepare parallel data
for lang in ben swa amh sin; do
python scripts/prepare_parallel_data.py --language $lang --source flores
done
# 2. Submit training jobs
sbatch slurm/run_sft.sh ben # Single language
# OR
./run_all_sft.sh # All 4 languages
# 3. Monitor training
squeue -u $USER | grep sft
tail -f logs/sft/slurm_*.out
# 4. View results
python scripts/aggregate_results.py --method sft

Expected runtime: ~6-8 hours per language
Expected improvement: +6-8 COMET over zero-shot
Expected COMET: 74-85 (depending on language)
Step 3: PRFPO Training ✅
# 1. Verify SFT models exist
ls outputs/sft/*/final_model/
# 2. Submit PRFPO jobs
sbatch slurm/run_prfpo.sh ben en2x # Single experiment
# OR
./run_all_prfpo.sh # All 8 experiments
# 3. Monitor stages
tail -f logs/prfpo/slurm_*.out
# Watch for: Loading → Paraphrasing → Ranking → DPO Training → Evaluation
# 4. Compare all methods
python scripts/aggregate_results.py --compare_all

Expected runtime: ~10-14 hours per language-direction
Expected improvement: +3-5 COMET over SFT
Expected COMET: 79-88 (depending on language)
To reproduce all results from the paper:
# 0. Setup environment (one-time, ~15 min)
cd /path/to/prfpo
module load conda
bash setup.sh
# 1. Zero-shot baseline (~8 hours total for all jobs)
./run_all_zero_shot.sh
# Wait for completion, then aggregate:
python scripts/aggregate_results.py --method zero_shot
# 2. Prepare parallel data (~5 min)
for lang in ben swa amh sin; do
python scripts/prepare_parallel_data.py --language $lang --source flores
done
# 3. SFT training (~24-32 hours total for all jobs)
./run_all_sft.sh
# Wait for completion, then aggregate:
python scripts/aggregate_results.py --method sft
# 4. PRFPO training (~80-112 hours total for all jobs)
./run_all_prfpo.sh
# Wait for completion, then compare:
python scripts/aggregate_results.py --compare_all

Total estimated time: 5-7 days (most time is GPU training)
Total GPU hours: ~400-500 GPU hours for complete reproduction
Note: You can parallelize by submitting all jobs at once if sufficient GPU resources are available.
If you use this code in your research, please cite our paper:
@article{hasan2025prfpo,
title={PRFPO: Paraphrase-Consensus for Reference-Free Preference Optimization in Machine Translation},
author={Hasan, Mohammad Junayed and Rezek, Jalil},
year={2025},
institution={Johns Hopkins University}
}

We welcome contributions! Please feel free to:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, issues, or collaboration:
- Mohammad Junayed Hasan: junayedhasan100@gmail.com
- Jalil Rezek: jalilrezek27@gmail.com
Issues: Please report bugs and feature requests via GitHub Issues
This work was conducted at Johns Hopkins University. We thank:
- Faculty and Staff: Prof. Daniel Khashabi and Anushri Suresh for their support and feedback.
- Models: Qwen2.5-7B-Instruct by Alibaba Cloud
- Evaluation Data: FLORES-200 by Meta AI
- Metrics: COMET by Unbabel, SacreBLEU
- Infrastructure: JHU CLSP Cluster for computational resources
⭐ If you find this work helpful, please consider starring the repository!
- Compute: JHU CLSP Cluster
Last Updated: December 12, 2025