PRFPO: Paraphrase-Consensus for Reference-Free Preference Optimization


Official implementation of PRFPO, a reference-free preference optimization framework for low-resource machine translation.

PRFPO: Paraphrase-Consensus for Reference-Free Preference Optimization in Machine Translation
Mohammad Junayed Hasan, Jalil Rezek
Johns Hopkins University, 2025

Overview

PRFPO is a reference-free preference optimization framework for low-resource machine translation that eliminates the need for human annotations by leveraging paraphrase-based consensus ranking. The method operates in two stages:

  1. Supervised Fine-Tuning (SFT): Basic translation capability from limited parallel data
  2. Paraphrase-Aware DPO: Preference optimization using automatic consensus-based ranking

We evaluate on four typologically diverse languages: Bengali, Swahili, Amharic, and Sinhala.

✨ Key Features

  • 🎯 Reference-Free Optimization: No human annotations or reference translations required
  • 🔄 Paraphrase-Based Consensus: Automatic quality estimation via translation agreement
  • 📊 Comprehensive Evaluation: COMET, chrF++, and BLEU metrics on FLORES-200
  • 🚀 Ready-to-Use Scripts: Complete SLURM batch scripts for cluster deployment
  • 📈 State-of-the-Art Results: Significant improvements over supervised baselines
  • 🌍 Low-Resource Focus: Designed for languages with limited parallel data

Table of Contents

  • Quick Start
  • Detailed Setup
  • Running Experiments on CLSP Cluster
  • Zero-Shot Baseline
  • Supervised Fine-Tuning (SFT)
  • PRFPO Training
  • Monitoring and Results
  • Project Structure
  • Troubleshooting
  • Data Preparation
  • Evaluation Metrics
  • Cluster-Specific Tips
  • Development Workflow
  • Citation
  • Contributing
  • License
  • Contact
  • Acknowledgments

Quick Start

Clone the Repository

git clone https://github.com/junayed-hasan/prfpo.git
cd prfpo

One-Command Setup

module load conda              # Load conda on cluster (if applicable)
bash setup.sh

This will:

  • ✅ Create conda environment prfpo with Python 3.10
  • ✅ Install all dependencies (PyTorch, Transformers, COMET, etc.)
  • ✅ Download FLORES-200 dataset for all 4 languages
  • ✅ Create directory structure for outputs and logs

Estimated time: 10-15 minutes

Note: On the CLSP cluster, you need to run module load conda before setup.

Important: The FLORES-200 dataset is downloaded directly from the official source and extracted automatically. The first run may take a few minutes to download the ~200MB archive.

Run Zero-Shot Baseline

# Activate conda environment
module load conda
eval "$(conda shell.bash hook)"
conda activate prfpo

# Submit all experiments (4 languages × 2 directions = 8 jobs)
./run_all_zero_shot.sh

# Or run single experiment
sbatch slurm/run_zero_shot.sh ben en2x

Monitor and View Results

# Check job status
squeue -u $USER

# View logs in real-time
tail -f logs/zero_shot/ben_en2x.log

# Or view SLURM output
tail -f logs/zero_shot/slurm_<jobid>.out

# Aggregate results (after jobs complete)
python scripts/aggregate_results.py --method zero_shot

Detailed Setup

Prerequisites

  • Access to the JHU CLSP cluster (or a similar SLURM-based GPU cluster)
  • Conda available via module system (module load conda)
  • At least 32GB GPU memory (recommended)

Complete Setup Process

The setup script (setup.sh) automates all steps, but here's what it does:

Step-by-Step Installation

1. Load conda module (CLSP cluster)

module load conda

2. Navigate to project directory

cd /path/to/prfpo

3. Create conda environment

conda create -n prfpo python=3.10 -y
eval "$(conda shell.bash hook)"
conda activate prfpo

4. Install dependencies

pip install -r requirements.txt

5. Download FLORES-200 data

python scripts/download_data.py --languages ben,swa,amh,sin

This downloads ~1000 sentence pairs per language from FLORES-200 devtest set.

6. Create directory structure

mkdir -p data/{flores200,parallel}
mkdir -p outputs/{zero_shot,sft,prfpo}
mkdir -p logs/{zero_shot,sft,prfpo}

7. Make scripts executable

chmod +x setup.sh run_all_zero_shot.sh slurm/*.sh scripts/*.py

You're now ready to run experiments!


Running Experiments on CLSP Cluster

First time on the cluster? Follow these exact steps:

  1. SSH into the cluster:

    ssh username@login.clsp.jhu.edu
  2. Clone or navigate to this repository:

    cd /path/to/prfpo
  3. Load conda module (required on CLSP):

    module load conda
  4. Run setup (one-time):

    bash setup.sh
  5. Submit your first job:

    sbatch slurm/run_zero_shot.sh ben en2x
  6. Monitor the job:

    squeue -u $USER
    tail -f logs/zero_shot/slurm_*.out

Understanding GPU Resources

Before running experiments, check GPU availability:

# View all GPU nodes and their status
sinfo -o "%20N %10c %10m %25f %10G"

# Check currently running jobs in GPU partition
squeue -p gpu

# Check your running jobs
squeue -u $USER

# Watch in real-time (Ctrl+C to exit)
watch -n 5 'squeue -u $USER'

Key CLSP partitions:

  • gpu (default) - Regular GPUs, most nodes have 4 GPUs
  • gpu-a100 - A100 GPUs (8 GPUs per node, more powerful)

Typical resource request:

  • Zero-shot evaluation: --gres=gpu:1 (single GPU sufficient)
  • Training (SFT/PRFPO): --gres=gpu:2 or --gres=gpu:4

Zero-Shot Baseline

The zero-shot baseline evaluates the pretrained Qwen2.5-7B-Instruct model without any fine-tuning.

Method 1: Interactive Mode (for testing/debugging)

# Load conda module
module load conda

# Request interactive GPU session
srun --pty --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G bash

# Activate environment
eval "$(conda shell.bash hook)"
conda activate prfpo

# Run evaluation
python run_zero_shot.py --language ben --direction en2x

# Exit when done
exit

Method 2: Batch Mode (recommended)

Single experiment:

sbatch slurm/run_zero_shot.sh ben en2x

All experiments at once:

./run_all_zero_shot.sh

This submits 8 jobs:

  • English → Bengali, Swahili, Amharic, Sinhala
  • Bengali, Swahili, Amharic, Sinhala → English

Command-line Arguments

python run_zero_shot.py \
    --language ben \
    --direction en2x \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --batch_size 4 \
    --max_new_tokens 256 \
    --bf16

# --language: ben, swa, amh, sin
# --direction: en2x or x2en
# --batch_size: adjust based on GPU memory
# --bf16: use bfloat16 precision

Expected runtime: 40-60 minutes per language-direction pair


Supervised Fine-Tuning (SFT)

Fine-tune Qwen2.5-7B on parallel data using bidirectional maximum likelihood estimation. This creates a baseline translation model before applying PRFPO.

Step 1: Prepare Parallel Data

Create parallel training data from FLORES-200 (or use your own corpus):

# Prepare data for all languages
for lang in ben swa amh sin; do
    python scripts/prepare_parallel_data.py \
        --language $lang \
        --source flores \
        --output_dir data/parallel
done

This creates ~2000 training examples per language by augmenting FLORES-200 data. For production, use actual parallel corpora.

Data format (data/parallel/{lang}_parallel.jsonl):

{"en": "Hello, how are you?", "target": "হ্যালো, আপনি কেমন আছেন?"}
{"en": "Good morning.", "target": "সুপ্রভাত।"}

Step 2: Run SFT Training

Single language (recommended for testing):

sbatch slurm/run_sft.sh ben

All languages:

./run_all_sft.sh

This trains models for all 4 languages. Each job trains bidirectional models (en→x and x→en simultaneously).

SFT Configuration

The default configuration in slurm/run_sft.sh:

#SBATCH --gres=gpu:2          # 2 GPUs for faster training
#SBATCH --cpus-per-task=8     # 8 CPU cores
#SBATCH --mem=64G             # 64GB RAM
#SBATCH --time=12:00:00       # 12 hours max runtime

Training hyperparameters in run_sft.py (a configuration sketch follows the list):

  • Epochs: 3
  • Batch size: 2 per device × 2 GPUs × 4 gradient accumulation = effective batch size 16
  • Learning rate: 2e-5 with cosine schedule
  • Warmup steps: 500
  • Max length: 512 tokens
  • Precision: bfloat16 mixed precision
  • Optimization: LoRA fine-tuning (r=16, α=32) for efficiency
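
The following is a configuration sketch matching the hyperparameters above, using the transformers and peft libraries (peft is assumed to be among the installed dependencies). The LoRA target modules and output path are illustrative assumptions; run_sft.py is the authoritative implementation.

import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA adapter with r=16, alpha=32 as listed above; the target modules are an assumption
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Effective batch size: 2 per device x 2 GPUs x 4 accumulation steps = 16
training_args = TrainingArguments(
    output_dir="outputs/sft/ben/checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    bf16=True,
)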

Command-line Arguments

python run_sft.py \
    --language ben \
    --train_data data/parallel/ben_parallel.jsonl \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --num_epochs 3 \
    --per_device_batch_size 2 \
    --learning_rate 2e-5 \
    --use_lora \
    --evaluate_after_training

# --use_lora: use LoRA fine-tuning (recommended)
# --evaluate_after_training: evaluate on FLORES-200 after training

Expected runtime: 6-8 hours per language (with 2 GPUs)

SFT Outputs

After training completes, you'll find:

Model checkpoints:

outputs/sft/{language}/
├── checkpoints/           # Intermediate checkpoints
│   ├── checkpoint-1000/
│   ├── checkpoint-2000/
│   └── ...
└── final_model/          # Final fine-tuned model
    ├── adapter_config.json
    ├── adapter_model.bin
    └── ...

Evaluation results:

outputs/sft/{language}/
├── en2x/
│   ├── metrics_ben_en2x.json
│   └── translations_ben_en2x.json
└── x2en/
    ├── metrics_ben_x2en.json
    └── translations_ben_x2en.json

Training logs:

logs/sft/
├── slurm_123456.out      # SLURM output
└── slurm_123456.err      # SLURM errors

Monitoring SFT Training

# Check job status
squeue -u $USER | grep sft

# View training progress
tail -f logs/sft/slurm_*.out

# Check GPU utilization (if job is running)
squeue -u $USER -o "%.18i %.9P %.8T %N" | grep RUN
# Then ssh to the node and run: nvidia-smi

Expected SFT Results (from paper)

After fine-tuning on parallel data, expect improvements over zero-shot:

Language     En→X COMET    X→En COMET    Improvement
Bengali      ~78.5         ~81.2         +6-7 points
Swahili      ~82.3         ~84.6         +5-6 points
Amharic      ~74.8         ~78.9         +6-7 points
Sinhala      ~77.2         ~80.4         +6-7 points

PRFPO Training

PRFPO (Paraphrase-Consensus for Reference-Free Preference Optimization) is the core contribution of our paper. It eliminates the need for human annotations by using paraphrase consensus to automatically create preference pairs for DPO training.

How PRFPO Works

  1. Paraphrase Generation: Generate multiple translations of the same source using different decoding strategies:

    • Beam search with varying beam sizes
    • Nucleus sampling with different p values
    • Temperature sampling with different temperatures
  2. Consensus Ranking: Rank paraphrases by their agreement (consensus) with the other paraphrases (see the sketch after this list):

    • High consensus = better quality translation
    • Uses reference-free COMET scores for pairwise comparison
    • No human annotations or reference translations needed
  3. Preference Pair Creation: Create training pairs from ranked paraphrases:

    • Chosen: Translation with highest consensus (best quality)
    • Rejected: Translation with lowest consensus (worst quality)
  4. DPO Training: Fine-tune the SFT model using Direct Preference Optimization:

    • Learns to prefer high-consensus translations
    • Uses the DPO loss: -log(sigmoid(β * [(log P_policy(chosen) - log P_policy(rejected)) - (log P_ref(chosen) - log P_ref(rejected))])), where P_ref is the frozen SFT reference model
    • β (beta) parameter controls strength of preference
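
Below is a minimal sketch of steps 2-3, assuming unbabel-comet ≥ 2.0: each candidate is scored by its average COMET agreement with the other candidates used as pseudo-references, and the highest/lowest scoring candidates become the chosen/rejected pair. The actual selection logic lives in run_prfpo.py and may differ in detail.

from comet import download_model, load_from_checkpoint

comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

def build_preference_pair(source, candidates):
    """Rank candidate translations by consensus and return a chosen/rejected pair."""
    batch, owner = [], []
    for i, cand in enumerate(candidates):
        for j, pseudo_ref in enumerate(candidates):
            if i != j:
                # Score candidate i against candidate j used as a pseudo-reference
                batch.append({"src": source, "mt": cand, "ref": pseudo_ref})
                owner.append(i)
    seg_scores = comet_model.predict(batch, batch_size=8, gpus=1).scores

    # Average each candidate's agreement with all other candidates
    consensus = [0.0] * len(candidates)
    for i, score in zip(owner, seg_scores):
        consensus[i] += score / (len(candidates) - 1)

    ranked = sorted(range(len(candidates)), key=lambda i: consensus[i], reverse=True)
    return {
        "source": source,
        "chosen": candidates[ranked[0]],
        "rejected": candidates[ranked[-1]],
        "chosen_score": consensus[ranked[0]],
        "rejected_score": consensus[ranked[-1]],
    }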

Prerequisites

Important: PRFPO requires a trained SFT model as the starting point.

# Make sure SFT training is complete
ls outputs/sft/ben/final_model/

If SFT models don't exist, train them first:

./run_all_sft.sh

Step 1: Run PRFPO Training

Single language and direction:

sbatch slurm/run_prfpo.sh ben en2x

All experiments (4 languages × 2 directions = 8 jobs):

./run_all_prfpo.sh

PRFPO Configuration

SLURM resources (slurm/run_prfpo.sh):

#SBATCH --gres=gpu:2          # 2 GPUs 
#SBATCH --cpus-per-task=8     # 8 CPU cores
#SBATCH --mem=64G             # 64GB RAM
#SBATCH --time=16:00:00       # 16 hours max

PRFPO hyperparameters (a loss sketch follows the list):

  • Num paraphrases: 5 per source
  • DPO beta: 0.1 (preference strength)
  • Epochs: 2
  • Batch size: 1 per device × 2 GPUs × 8 gradient accumulation = effective batch size 16
  • Learning rate: 1e-5 (lower than SFT)
  • Warmup steps: 100
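
For intuition, the per-pair objective with the beta above can be written in a few lines of PyTorch. This is a sketch of the loss only; the actual training loop in run_prfpo.py may rely on a library implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities of the chosen or
    rejected translation under the trained policy or the frozen reference (SFT) model.
    """
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    # -log sigmoid(beta * (policy log-ratio minus reference log-ratio))
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Toy example with made-up log-probabilities
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
                torch.tensor([-11.0]), torch.tensor([-13.0]))
print(loss.item())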

Command-line Arguments

python run_prfpo.py \
    --language ben \
    --direction en2x \
    --sft_model_dir outputs/sft \
    --base_model_name Qwen/Qwen2.5-7B-Instruct \
    --num_paraphrases 5 \
    --generate_preferences \
    --dpo_beta 0.1 \
    --num_epochs 2 \
    --learning_rate 1e-5 \
    --evaluate_after_training

# --generate_preferences: generate new preference pairs before DPO training

Expected runtime: 10-14 hours per language-direction (includes paraphrase generation + DPO training)

PRFPO Pipeline Stages

The training process has distinct stages you'll see in logs:

  1. Loading SFT Model (~5 min)

    • Loads fine-tuned SFT model as starting point
    • Merges LoRA weights if applicable
  2. Generating Paraphrases (~2-4 hours)

    • Creates 5 paraphrases per source sentence using different decoding strategies (see the sketch after this list)
    • Progress shown per example
  3. Consensus Ranking (~1-2 hours)

    • Computes pairwise COMET scores between paraphrases
    • Ranks by consensus (agreement with other paraphrases)
    • Creates preference pairs (best vs worst)
  4. DPO Training (~6-8 hours)

    • Fine-tunes model to prefer high-consensus translations
    • Shows training loss and progress
  5. Evaluation (~30-60 min)

    • Evaluates final model on FLORES-200
    • Computes COMET, chrF++, BLEU
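
Below is a minimal sketch of stage 2, generating diverse candidate translations with the decoding strategies listed above (beam search, nucleus sampling, temperature sampling). The prompt format and exact decoding settings are assumptions, and in the pipeline generation starts from the SFT model rather than the base checkpoint shown here; run_prfpo.py holds the actual values.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

def generate_candidates(prompt, num_candidates=5):
    """Produce diverse translations by varying the decoding strategy."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    decoding_configs = [
        dict(num_beams=5, do_sample=False),        # beam search
        dict(do_sample=True, top_p=0.9),           # nucleus sampling
        dict(do_sample=True, top_p=0.7),
        dict(do_sample=True, temperature=0.7),     # temperature sampling
        dict(do_sample=True, temperature=1.0),
    ]
    candidates = []
    for cfg in decoding_configs[:num_candidates]:
        output = model.generate(**inputs, max_new_tokens=256, **cfg)
        new_tokens = output[0, inputs["input_ids"].shape[1]:]
        candidates.append(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
    return candidates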

PRFPO Outputs

Preference pairs (intermediate):

outputs/prfpo/{language}/{direction}/
└── preference_pairs_{language}_{direction}.json

Example preference pair:

{
  "source": "Hello, how are you?",
  "chosen": "হ্যালো, আপনি কেমন আছেন?",
  "rejected": "হ্যালো কেমন আছেন",
  "chosen_score": 0.89,
  "rejected_score": 0.42,
  "all_candidates": [...],
  "all_scores": [...]
}

Model checkpoints:

outputs/prfpo/{language}/{direction}/
├── checkpoints/          # Intermediate DPO checkpoints
└── final_model/         # Final PRFPO model
    ├── adapter_config.json
    ├── adapter_model.bin
    └── ...

Evaluation results:

outputs/prfpo/{language}/{direction}/
├── metrics_{language}_{direction}.json
└── translations_{language}_{direction}.json

Training logs:

logs/prfpo/
├── slurm_*.out          # SLURM stdout
└── slurm_*.err          # SLURM stderr

Monitoring PRFPO

# Check job status
squeue -u $USER | grep prfpo

# View training progress (updates during each stage)
tail -f logs/prfpo/slurm_*.out

# Check which stage is running
tail -100 logs/prfpo/slurm_*.out | grep "====="

You'll see output like:

==========================================================
PRFPO: Paraphrase-Consensus Preference Optimization
==========================================================
Language: ben
Direction: en2x
...
==========================================================

Loading SFT model from: outputs/sft/ben/final_model
✅ Model loaded

==========================================================
Generating Paraphrase-Based Preference Pairs
==========================================================
Creating preference pairs: 100%|████████| 1012/1012
✅ Created 1012 preference pairs

==========================================================
Training with Direct Preference Optimization (DPO)
==========================================================
Epoch 1/2: 100%|████████| 127/127 [2:34:12<00:00]
...
✅ PRFPO training completed!

==========================================================
PRFPO Evaluation Results
==========================================================
COMET:  82.45
chrF++: 54.32
BLEU:   26.78
==========================================================

Expected PRFPO Results (from paper)

PRFPO should outperform both zero-shot and SFT baselines:

Language    Method       En→X COMET    X→En COMET
Bengali     Zero-shot    72.3          75.6
Bengali     SFT          78.5          81.2
Bengali     PRFPO        82.4          85.3
Swahili     Zero-shot    76.8          79.2
Swahili     SFT          82.3          84.6
Swahili     PRFPO        85.7          88.1
Amharic     Zero-shot    68.4          72.8
Amharic     SFT          74.8          78.9
Amharic     PRFPO        79.2          82.6
Sinhala     Zero-shot    71.6          74.3
Sinhala     SFT          77.2          80.4
Sinhala     PRFPO        81.5          84.7

Average improvement:

  • PRFPO vs Zero-shot: +10-12 COMET points
  • PRFPO vs SFT: +4-5 COMET points

Troubleshooting PRFPO

Issue: SFT model not found

# Check if SFT training completed
ls outputs/sft/ben/final_model/

# If missing, run SFT first
sbatch slurm/run_sft.sh ben

Issue: Out of memory during paraphrase generation

# Reduce num_paraphrases or use larger GPU
--num_paraphrases 3  # Instead of 5
# Or use A100 GPUs
sbatch --partition=gpu-a100 slurm/run_prfpo.sh ben en2x

Issue: DPO training slow

# Increase gradient accumulation, reduce batch size
--per_device_batch_size 1
--gradient_accumulation_steps 16

Monitoring and Results

Job Management

# Check job status
squeue -u $USER

# Detailed job info
squeue -u $USER -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

# Cancel a job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER

Viewing Logs

SLURM logs (stdout/stderr):

# List SLURM output files
ls logs/zero_shot/slurm_*.out

# View specific job log
cat logs/zero_shot/slurm_123456.out

# Follow in real-time
tail -f logs/zero_shot/slurm_123456.out

Application logs:

# View Python application logs
cat logs/zero_shot/ben_en2x.log

# Follow in real-time
tail -f logs/zero_shot/ben_en2x.log

Viewing Results

Individual Experiment

Metrics:

cat outputs/zero_shot/ben_en2x_metrics.json

Example output:

{
  "language": "ben",
  "direction": "en2x",
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "metrics": {
    "comet": 72.34,
    "chrf": 42.56,
    "bleu": 18.42
  },
  "timestamp": "2025-12-12T10:30:00"
}

Translations:

# View sample translations
head -50 outputs/zero_shot/ben_en2x_translations.json

Aggregated Results

Compare all methods across all languages:

# Aggregate results for specific method
python scripts/aggregate_results.py --method zero_shot
python scripts/aggregate_results.py --method sft
python scripts/aggregate_results.py --method prfpo

# Compare all methods
python scripts/aggregate_results.py --compare_all

Output:

================================================================================
ZERO-SHOT BASELINE RESULTS
================================================================================

English → X
--------------------------------------------------------------------------------
Language        COMET     chrF++       BLEU
--------------------------------------------------------------------------------
Bengali         72.34      42.56      18.42
Swahili         76.82      48.21      24.73
Amharic         68.45      38.72      14.25
Sinhala         71.63      41.34      16.84
--------------------------------------------------------------------------------
Average         72.31      42.71      18.56

X → English
--------------------------------------------------------------------------------
Language        COMET     chrF++       BLEU
--------------------------------------------------------------------------------
Bengali         75.62      51.83      22.64
Swahili         79.18      56.34      28.42
Amharic         72.84      48.45      19.73
Sinhala         74.32      50.21      21.34
--------------------------------------------------------------------------------
Average         75.49      51.71      23.03
================================================================================

Expected Baseline Results (from paper)

Language    En→X COMET    X→En COMET
Bengali     ~72.3         ~75.6
Swahili     ~76.8         ~79.2
Amharic     ~68.4         ~72.8
Sinhala     ~71.6         ~74.3

Verification checklist if your results differ significantly:

  • ✅ Model version: Qwen/Qwen2.5-7B-Instruct
  • ✅ FLORES-200 dataset: Downloaded from official source (1012 sentences per language)
  • ✅ COMET model: Unbabel/wmt22-comet-da
  • ✅ Batch size and generation settings match the defaults in run_zero_shot.py

Project Structure

prfpo/
├── README.md                          # This file
├── setup.sh                           # One-command setup script
├── requirements.txt                   # Python dependencies
│
├── run_zero_shot.py                   # Zero-shot evaluation
├── run_sft.py                         # Supervised fine-tuning
├── run_prfpo.py                       # PRFPO training
│
├── run_all_zero_shot.sh               # Submit all zero-shot jobs
├── run_all_sft.sh                     # Submit all SFT jobs
├── run_all_prfpo.sh                   # Submit all PRFPO jobs
│
├── scripts/
│   ├── download_data.py               # Download FLORES-200
│   ├── prepare_parallel_data.py       # Prepare parallel training data
│   ├── evaluate.py                    # Evaluation utilities (COMET, chrF++, BLEU)
│   └── aggregate_results.py           # Aggregate and display results
│
├── slurm/
│   ├── run_zero_shot.sh               # SLURM job for zero-shot
│   ├── run_sft.sh                     # SLURM job for SFT
│   └── run_prfpo.sh                   # SLURM job for PRFPO
│
├── data/
│   ├── flores200/                     # FLORES-200 test sets (auto-downloaded)
│   │   ├── ben_flores200.jsonl
│   │   ├── swa_flores200.jsonl
│   │   ├── amh_flores200.jsonl
│   │   └── sin_flores200.jsonl
│   └── parallel/                      # Parallel training data
│       ├── ben_parallel.jsonl
│       ├── swa_parallel.jsonl
│       ├── amh_parallel.jsonl
│       └── sin_parallel.jsonl
│
├── outputs/
│   ├── zero_shot/                     # Zero-shot results
│   │   ├── {lang}_{dir}_translations.json
│   │   └── {lang}_{dir}_metrics.json
│   ├── sft/                           # SFT results
│   │   └── {lang}/
│   │       ├── final_model/           # Fine-tuned model
│   │       └── {dir}/
│   │           ├── metrics_{lang}_{dir}.json
│   │           └── translations_{lang}_{dir}.json
│   └── prfpo/                         # PRFPO results
│       └── {lang}/{dir}/
│           ├── preference_pairs_{lang}_{dir}.json
│           ├── final_model/           # PRFPO model
│           ├── metrics_{lang}_{dir}.json
│           └── translations_{lang}_{dir}.json
│
└── logs/
    ├── zero_shot/                     # Zero-shot logs
    │   ├── {lang}_{dir}.log
    │   └── slurm_*.out
    ├── sft/                           # SFT logs
    │   └── slurm_*.out
    └── prfpo/                         # PRFPO logs
        └── slurm_*.out

Troubleshooting

Common Issues and Solutions

1. Out of Memory (OOM) Error

Symptom: Job crashes with CUDA out of memory error

Solutions:

# Option 1: Reduce batch size
# Edit slurm/run_zero_shot.sh, change --batch_size 4 to --batch_size 2

# Option 2: Request more memory
# Edit slurm/run_zero_shot.sh, change --mem=32G to --mem=48G

# Option 3: Use A100 GPUs (more memory)
# Edit slurm/run_zero_shot.sh, change --partition=gpu to --partition=gpu-a100

2. Job Stays in Pending (PD) State

Check why:

squeue -u $USER -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

Common reasons:

  • Resources - All GPUs are in use, wait for resources
  • Priority - Lower priority, will run when higher priority jobs finish
  • ReqNodeNotAvail - Requested node is down

Solutions:

  • Wait for resources to become available
  • Try different partition: --partition=gpu-a100
  • Request fewer resources: --gres=gpu:1 instead of --gres=gpu:2

3. Conda Environment Not Found

# Verify conda is loaded
module load conda

# Verify environment exists
conda env list

# If missing, recreate
conda env remove -n prfpo
module load conda
bash setup.sh

# Activate manually
eval "$(conda shell.bash hook)"
conda activate prfpo

4. FLORES-200 Data Not Found

# Re-download data
python scripts/download_data.py --languages ben,swa,amh,sin

# Verify files exist
ls -lh data/flores200/

5. COMET Model Download Issues

Symptom: COMET model fails to download

Solution:

# Pre-download manually
python -c "from comet import download_model; download_model('Unbabel/wmt22-comet-da')"

# Or set cache directory
export COMET_CACHE_DIR=/path/to/cache

6. Permission Denied on Scripts

# Make scripts executable
chmod +x setup.sh run_all_zero_shot.sh
chmod +x slurm/*.sh
chmod +x scripts/*.py

7. Model Loading Errors

Symptom: Cannot load Qwen2.5-7B model

Solutions:

# Check internet connectivity
ping huggingface.co

# Manually download model (if needed)
python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-7B-Instruct')"

# Set HuggingFace cache
export HF_HOME=/path/to/cache

Data Preparation

FLORES-200 Test Set (Auto-downloaded)

The setup script automatically downloads FLORES-200 from the official source. To manually download:

# Activate environment first
module load conda
eval "$(conda shell.bash hook)"
conda activate prfpo

# Download data
python scripts/download_data.py --languages ben,swa,amh,sin

This downloads and extracts the FLORES-200 devtest set:

  • Source: Official FLORES-200 release
  • Size: ~200MB compressed, ~1000 sentence pairs per language
  • Languages: Bengali (ben_Beng), Swahili (swh_Latn), Amharic (amh_Ethi), Sinhala (sin_Sinh)
  • Output: JSONL files in data/flores200/

Parallel Data for SFT (User-provided)

For supervised fine-tuning, prepare your parallel data in JSONL format:

Format:

{"source": "This is an English sentence.", "target": "এটি একটি বাংলা বাক্য।"}
{"source": "Another example sentence.", "target": "আরেকটি উদাহরণ বাক্য।"}

File locations:

  • data/parallel/ben_parallel.jsonl - Bengali
  • data/parallel/swa_parallel.jsonl - Swahili
  • data/parallel/amh_parallel.jsonl - Amharic
  • data/parallel/sin_parallel.jsonl - Sinhala

Recommended size: 10K-50K sentence pairs (as per paper's low-resource setting)


Evaluation Metrics

We report three standard MT metrics (a scoring sketch follows the list):

  1. COMET-22 (Primary metric)

    • Neural quality-estimation metric trained on human judgments (Unbabel/wmt22-comet-da)
    • Range: 0-100 (higher is better)
    • Strong correlation with human assessments
  2. chrF++

    • Character-level F-score with word order
    • Robust to morphological variation
    • Range: 0-100 (higher is better)
  3. BLEU

    • Traditional n-gram overlap metric
    • Range: 0-100 (higher is better)
    • Widely used for comparability
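
For reference, here is a minimal sketch of computing all three metrics with sacrebleu and unbabel-comet (≥ 2.0). scripts/evaluate.py is the authoritative implementation; the 0-100 scaling of COMET is an assumption based on the result tables above.

import sacrebleu
from comet import download_model, load_from_checkpoint

def score_translations(sources, hypotheses, references):
    """Compute COMET-22, chrF++, and BLEU for a set of translations."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2).score  # word_order=2 gives chrF++
    comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    comet_data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
    comet = comet_model.predict(comet_data, batch_size=8, gpus=1).system_score * 100
    return {"comet": comet, "chrf": chrf, "bleu": bleu}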

Cluster-Specific Tips

CLSP Cluster Best Practices

  1. Check resources before submitting:

    sinfo -o "%20N %10c %10m %25f %10G"
  2. Use batch mode for experiments:

    • Interactive mode (srun) for debugging only
    • Batch mode (sbatch) for full experiments
  3. Monitor GPU utilization:

    # In interactive session
    nvidia-smi
    
    # Watch continuously
    watch -n 1 nvidia-smi
  4. Optimize resource requests:

    • Don't request more GPUs than needed
    • Zero-shot: 1 GPU is sufficient
    • Training: 2-4 GPUs depending on model size
  5. Time limits:

    • Set realistic time limits: --time=4:00:00 (4 hours)
    • Zero-shot typically takes < 1 hour per experiment

Development Workflow

Complete Pipeline

Step 1: Zero-Shot Baseline

# 1. Setup (one-time)
module load conda
bash setup.sh

# 2. Submit jobs
sbatch slurm/run_zero_shot.sh ben en2x    # Single job
# OR
./run_all_zero_shot.sh                     # All 8 jobs

# 3. Monitor jobs
squeue -u $USER
tail -f logs/zero_shot/slurm_*.out

# 4. View results (after completion)
python scripts/aggregate_results.py --method zero_shot

Expected runtime: ~40-60 minutes per job
Expected COMET: 68-77 (depending on language)


Step 2: Supervised Fine-Tuning (SFT)

# 1. Prepare parallel data
for lang in ben swa amh sin; do
    python scripts/prepare_parallel_data.py --language $lang --source flores
done

# 2. Submit training jobs
sbatch slurm/run_sft.sh ben               # Single language
# OR
./run_all_sft.sh                          # All 4 languages

# 3. Monitor training
squeue -u $USER | grep sft
tail -f logs/sft/slurm_*.out

# 4. View results
python scripts/aggregate_results.py --method sft

Expected runtime: ~6-8 hours per language
Expected improvement: +6-8 COMET over zero-shot
Expected COMET: 74-85 (depending on language)


Step 3: PRFPO Training

# 1. Verify SFT models exist
ls outputs/sft/*/final_model/

# 2. Submit PRFPO jobs
sbatch slurm/run_prfpo.sh ben en2x        # Single experiment
# OR
./run_all_prfpo.sh                        # All 8 experiments

# 3. Monitor stages
tail -f logs/prfpo/slurm_*.out
# Watch for: Loading → Paraphrasing → Ranking → DPO Training → Evaluation

# 4. Compare all methods
python scripts/aggregate_results.py --compare_all

Expected runtime: ~10-14 hours per language-direction
Expected improvement: +3-5 COMET over SFT
Expected COMET: 79-88 (depending on language)


Full Reproduction (from scratch)

To reproduce all results from the paper:

# 0. Setup environment (one-time, ~15 min)
cd /path/to/prfpo
module load conda
bash setup.sh

# 1. Zero-shot baseline (~8 hours total for all jobs)
./run_all_zero_shot.sh
# Wait for completion, then aggregate:
python scripts/aggregate_results.py --method zero_shot

# 2. Prepare parallel data (~5 min)
for lang in ben swa amh sin; do
    python scripts/prepare_parallel_data.py --language $lang --source flores
done

# 3. SFT training (~24-32 hours total for all jobs)
./run_all_sft.sh
# Wait for completion, then aggregate:
python scripts/aggregate_results.py --method sft

# 4. PRFPO training (~80-112 hours total for all jobs)
./run_all_prfpo.sh
# Wait for completion, then compare:
python scripts/aggregate_results.py --compare_all

Total estimated time: 5-7 days (most time is GPU training)
Total GPU hours: ~400-500 for a complete reproduction

Note: You can parallelize by submitting all jobs at once if sufficient GPU resources are available.


Citation

If you use this code in your research, please cite our paper:

@article{hasan2025prfpo,
  title={PRFPO: Paraphrase-Consensus for Reference-Free Preference Optimization in Machine Translation},
  author={Hasan, Mohammad Junayed and Rezek, Jalil},
  year={2025},
  institution={Johns Hopkins University}
}

Contributing

We welcome contributions! Please feel free to:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

For major changes, please open an issue first to discuss what you would like to change.


License

This project is licensed under the MIT License - see the LICENSE file for details.


Contact

For questions, issues, or collaboration:

Issues: Please report bugs and feature requests via GitHub Issues


Acknowledgments

This work was conducted at Johns Hopkins University. We thank:

  • Faculty and Staff: Prof. Daniel Khashabi and Anushri Suresh for their support and feedback.
  • Models: Qwen2.5-7B-Instruct by Alibaba Cloud
  • Evaluation Data: FLORES-200 by Meta AI
  • Metrics: COMET by Unbabel, SacreBLEU
  • Infrastructure: JHU CLSP Cluster for computational resources

⭐ If you find this work helpful, please consider starring the repository!

Last Updated: December 12, 2025
