Minimal Seq2Seq model with attention for neural machine translation in PyTorch.
This implementation focuses on the following features:
- Modular structure to be used in other projects
- Minimal code for readability
- Full utilization of batches and GPU.
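As an illustration of what batch utilization typically involves for a recurrent encoder, the sketch below pads variable-length sentences into one tensor and packs them so the GRU skips padding positions. This is a minimal sketch under assumptions, not the repository's code: `PAD_IDX`, the toy token ids, and the layer sizes are all hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

PAD_IDX = 0  # hypothetical padding index, not taken from the repository

# Three toy "sentences" of token ids with different lengths.
sentences = [torch.tensor([5, 8, 2, 9]), torch.tensor([4, 7]), torch.tensor([6, 3, 1])]
lengths = torch.tensor([len(s) for s in sentences])  # lengths must stay on the CPU

# Pad to a single (batch, max_len) tensor so the whole batch is processed at once.
batch = pad_sequence(sentences, batch_first=True, padding_value=PAD_IDX)

embed = nn.Embedding(num_embeddings=10, embedding_dim=6, padding_idx=PAD_IDX)
gru = nn.GRU(input_size=6, hidden_size=8, batch_first=True, bidirectional=True)

# Packing tells the GRU the true length of each sequence.
packed = pack_padded_sequence(embed(batch), lengths, batch_first=True, enforce_sorted=False)
packed_out, hidden = gru(packed)

# Unpack back to a padded tensor; padded time steps come back as zeros.
outputs, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```

Here `outputs` has shape (batch, max_len, 2 * hidden_size) because the GRU is bidirectional, and `hidden` has shape (2, batch, hidden_size).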
Dataset (Multi30k DE→EN) is loaded via HuggingFace datasets; tokenization uses spaCy.
- Encoder: Bidirectional GRU
- Decoder: GRU with Attention Mechanism
- Attention: additive attention from "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2015)
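The attention component listed above can be sketched roughly as follows. This is a minimal, hedged illustration of additive attention in the style of the cited paper, not the code shipped in this repository: the class name `AdditiveAttention`, the `mask` argument, and the assumption that encoder outputs have already been reduced to `hidden_size` (for example by summing the two GRU directions) are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Additive attention: score(s, h_i) = v^T tanh(W [s; h_i])."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs, mask=None):
        # decoder_hidden:  (batch, hidden)
        # encoder_outputs: (batch, src_len, hidden)
        # mask:            (batch, src_len) bool, True for real tokens
        src_len = encoder_outputs.size(1)
        query = decoder_hidden.unsqueeze(1).expand(-1, src_len, -1)
        energy = torch.tanh(self.attn(torch.cat([query, encoder_outputs], dim=-1)))
        scores = self.v(energy).squeeze(-1)  # (batch, src_len)
        if mask is not None:
            # Padding positions get zero weight after the softmax.
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        # Weighted sum of encoder states: (batch, hidden)
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights


torch.manual_seed(0)
attention = AdditiveAttention(hidden_size=8)
enc = torch.randn(2, 5, 8)
dec = torch.randn(2, 8)
mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]], dtype=torch.bool)
context, weights = attention(dec, enc, mask)
```

Each row of `weights` sums to 1 over the source positions, and `context` is the vector the decoder GRU would consume alongside the previous target embedding.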
- Python 3.9+
- PyTorch >= 2.0 (CPU, CUDA, or Apple MPS)
- datasets (HuggingFace, replaces torchtext)
- spaCy >= 3.7
pip install -r requirements.txt
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python train.py -epochs 30 -batch_size 32 -lr 3e-4
Device is auto-detected (CUDA → MPS → CPU). Smaller -hidden_size / -embed_size flags are useful for CPU smoke runs.
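The CUDA → MPS → CPU fallback described above could look roughly like this. It is a sketch of one common way to do it, assuming PyTorch 2.x; the helper name `pick_device` is hypothetical and may not match what `train.py` actually does.

```python
import torch


def pick_device() -> torch.device:
    """Return the best available device in the order CUDA, MPS, CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # torch.backends.mps is present in recent PyTorch builds; guard for older ones.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
# Models and batches are then moved with .to(device).
x = torch.zeros(4, 3).to(device)
print(f"running on {device}")
```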
Sanity check (CPU, 500 batches, hidden=128/embed=64):
| step | train loss | perplexity |
|---|---|---|
| init | 9.19 | 9803 |
| 50 | 6.98 | 1071 |
| 100 | 5.48 | 239 |
| 250 | 5.15 | 173 |
| 500 | 4.84 | 127 |
Final val loss: 4.93 (random-init prior is log(|V|) ≈ 9.19).
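The perplexity column is the exponential of the cross-entropy loss, and the log(|V|) baseline corresponds to a uniform guess over the vocabulary. The snippet below recomputes both from the rounded losses in the table. The results differ slightly from the reported perplexities, presumably because those were computed from unrounded losses; the function name `perplexity` is only illustrative.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(loss)


# Rounded train losses copied from the table above.
losses = {"init": 9.19, "50": 6.98, "100": 5.48, "250": 5.15, "500": 4.84}
for step, loss in losses.items():
    print(f"{step:>4}: loss={loss:.2f}  perplexity~{perplexity(loss):.0f}")

# A uniform prediction over |V| tokens has loss log(|V|), so the initial
# loss implies a vocabulary of roughly exp(9.19) tokens.
implied_vocab = perplexity(losses["init"])
```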
Based on the following implementations
