Home
hydropix edited this page May 11, 2026 · 36 revisions
Last updated: 2026-05-11 23:18
This wiki contains translation quality benchmarks for various LLM models across 23 languages.
Methodology v2. Judge: Claude Opus 4.7 applying the formal rubric v2 via Poe. Split data layout: per-model `translations/` joined to a single `judgments/<judge-id>.json` by `(model_id, text_id, target_lang)` with an `output_hash` integrity check. Previous benchmark archived under Archive-Index.
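The join described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the record field names `output` and `scores`, the helper names, and the use of SHA-256 for `output_hash` are all assumptions; only the join key `(model_id, text_id, target_lang)` and the existence of an `output_hash` check come from the description.

```python
import hashlib


def output_hash(text: str) -> str:
    # Assumption: SHA-256 over the UTF-8 bytes of the translation.
    # The real benchmark may hash differently.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def join_judgments(translations, judgments):
    """Pair translations with judgments by (model_id, text_id, target_lang).

    Returns (joined, stale): joined records carry the judge's scores;
    stale lists keys whose stored hash no longer matches the translation.
    """
    by_key = {
        (j["model_id"], j["text_id"], j["target_lang"]): j for j in judgments
    }
    joined, stale = [], []
    for t in translations:
        key = (t["model_id"], t["text_id"], t["target_lang"])
        j = by_key.get(key)
        if j is None:
            continue  # no judgment recorded for this translation yet
        if j["output_hash"] != output_hash(t["output"]):
            stale.append(key)  # translation changed after it was judged
            continue
        joined.append({**t, "scores": j["scores"]})
    return joined, stale
```

Under these assumptions, a judgment whose hash does not match is reported as stale rather than silently attached to a translation it was never scored against.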
| Indicator | Range | Label |
|---|---|---|
| 🟢 | 9-10 | Excellent |
| 🟡 | 7-8 | Good |
| 🟠 | 5-6 | Acceptable |
| 🔴 | 3-4 | Poor |
| ⚫ | 1-2 | Failed |
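The legend lists integer ranges, while the averages reported on this page are fractional. One way to map a fractional score to an indicator is to treat each range's lower bound as a threshold, as in the sketch below. That rule is an assumption: it agrees with the rows shown on this page (for example, 8.2 is Good and 6.2 is Acceptable), but the generator may round instead. The function name `indicator` is hypothetical.

```python
def indicator(score: float) -> tuple[str, str]:
    """Map a score to (emoji, label) using lower-bound thresholds.

    Assumption: a band starts at its lower bound, so 8.9 is still "Good".
    """
    bands = [
        (9.0, "🟢", "Excellent"),
        (7.0, "🟡", "Good"),
        (5.0, "🟠", "Acceptable"),
        (3.0, "🔴", "Poor"),
    ]
    for lower, emoji, label in bands:
        if score >= lower:
            return emoji, label
    return "⚫", "Failed"
```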
Overall performance across all tested languages:
| Rank | Model | Avg Score | Accuracy | Fluency | Style | Tests | Obs | Verified |
|---|---|---|---|---|---|---|---|---|
| 1 | gemini-3.1-flash-lite | 🟡 8.2 | 8.6 | 8.7 | 8.0 | 45 | 45 | verified |
| 2 | mistral-small-latest | 🟡 7.9 | 8.1 | 8.6 | 7.7 | 45 | 45 | verified |
| 3 | gemma4:31b | 🟡 7.7 | 8.1 | 8.3 | 7.5 | 227 | 227 | self-reported |
| 4 | claude-haiku-4.5 | 🟡 7.6 | 8.1 | 7.9 | 7.2 | 227 | 227 | verified |
| 5 | qwen3.5:35b | 🟡 7.5 | 7.8 | 8.0 | 7.3 | 45 | 45 | self-reported |
| 6 | gemma3:27b | 🟡 7.5 | 7.8 | 8.2 | 7.2 | 227 | 227 | self-reported |
| 7 | gemma4:e2b | 🟠 6.2 | 6.6 | 6.5 | 5.9 | 227 | 227 | self-reported |
Languages are ranked by the best model's average overall score on each target language. (An average across all models would depend on how many models were tested for each language, so the best-model score is a fairer cross-language signal.)
| Rank | Language | Native | Best Score | Best Model | Tests | Obs | Verified |
|---|---|---|---|---|---|---|---|
| 1 | Chinese (Simplified) | 简体中文 | 🟡 8.4 | gemini-3.1-flash-lite | 77 | 77 | mixed |
| 2 | French | Français | 🟡 8.3 | gemini-3.1-flash-lite | 70 | 70 | mixed |
| 3 | Portuguese | Português | 🟡 8.2 | gemma4:31b | 40 | 40 | mixed |
| 4 | Spanish | Español | 🟡 8.2 | gemma4:31b | 70 | 70 | mixed |
| 5 | Korean | 한국어 | 🟡 8.2 | gemma4:31b | 40 | 40 | mixed |
| 6 | Vietnamese | Tiếng Việt | 🟡 8.1 | gemini-3.1-flash-lite | 70 | 70 | mixed |
| 7 | Thai | ไทย | 🟡 8.1 | gemma4:31b | 40 | 40 | mixed |
| 8 | Arabic | العربية | 🟡 8.0 | gemma4:31b | 40 | 40 | mixed |
| 9 | English | English | 🟡 8.0 | gemini-3.1-flash-lite | 28 | 28 | mixed |
| 10 | Italian | Italiano | 🟡 8.0 | gemma3:27b | 40 | 40 | mixed |
| 11 | Japanese | 日本語 | 🟡 8.0 | gemma4:31b | 48 | 48 | mixed |
| 12 | Hindi | हिन्दी | 🟡 8.0 | gemma4:31b | 40 | 40 | mixed |
| 13 | Indonesian | Bahasa Indonesia | 🟡 8.0 | gemma4:31b | 40 | 40 | mixed |
| 14 | Dutch | Nederlands | 🟡 7.9 | gemma4:31b | 40 | 40 | mixed |
| 15 | Bengali | বাংলা | 🟡 7.9 | gemma4:31b | 40 | 40 | mixed |
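The ranking rule behind the language table (take each model's average overall score per language, keep the best one, then sort) can be sketched as below. The record fields `target_lang`, `model_id`, and `overall`, and the function name `rank_languages`, are hypothetical; this illustrates the rule as described, not the generator's actual code.

```python
from collections import defaultdict
from statistics import mean


def rank_languages(observations):
    """Return [(language, best_avg, best_model), ...], best score first."""
    # Collect overall scores per (language, model) pair.
    per_pair = defaultdict(list)
    for obs in observations:
        per_pair[(obs["target_lang"], obs["model_id"])].append(obs["overall"])

    # For each language, keep the model with the highest average.
    best = {}
    for (lang, model), scores in per_pair.items():
        avg = mean(scores)
        if lang not in best or avg > best[lang][0]:
            best[lang] = (avg, model)

    rows = [(lang, avg, model) for lang, (avg, model) in best.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Because only the top model counts, adding a weak model for one language does not lower that language's rank, which is the property the ranking note relies on.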
- Total Models Tested: 7
- Total Languages: 23
- Total Translations: 1043
- Evaluator Model: claude-opus-4-7-rubric-v2-poe
- Source Language: English
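As a quick consistency check, the totals in this list can be recomputed from the model ranking table. The snippet below copies the Obs values from that table by hand; the variable names are illustrative only.

```python
# Obs values copied from the model ranking table on this page.
obs_per_model = {
    "gemini-3.1-flash-lite": 45,
    "mistral-small-latest": 45,
    "gemma4:31b": 227,
    "claude-haiku-4.5": 227,
    "qwen3.5:35b": 45,
    "gemma3:27b": 227,
    "gemma4:e2b": 227,
}

total_models = len(obs_per_model)          # 7 models
total_translations = sum(obs_per_model.values())  # 3 * 45 + 4 * 227 = 1043

print(total_models, total_translations)
```

Both values match the "Total Models Tested" and "Total Translations" figures listed here.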
| Language | Best Score | Best Model |
|---|---|---|
| Chinese (Simplified) | 🟡 8.4 | gemini-3.1-flash-lite |
| Korean | 🟡 8.2 | gemma4:31b |
| Vietnamese | 🟡 8.1 | gemini-3.1-flash-lite |
| Thai | 🟡 8.1 | gemma4:31b |
| Japanese | 🟡 8.0 | gemma4:31b |
| Hindi | 🟡 8.0 | gemma4:31b |
| Indonesian | 🟡 8.0 | gemma4:31b |
| Bengali | 🟡 7.9 | gemma4:31b |
| Tamil | 🟡 7.5 | gemma3:27b |
| Language | Best Score | Best Model |
|---|---|---|
| French | 🟡 8.3 | gemini-3.1-flash-lite |
| Portuguese | 🟡 8.2 | gemma4:31b |
| Spanish | 🟡 8.2 | gemma4:31b |
| English | 🟡 8.0 | gemini-3.1-flash-lite |
| Italian | 🟡 8.0 | gemma3:27b |
| Dutch | 🟡 7.9 | gemma4:31b |
| Swedish | 🟡 7.9 | gemma3:27b |
| Polish | 🟡 7.9 | gemma4:31b |
| Danish | 🟡 7.8 | gemma4:31b |
| German | 🟡 7.8 | claude-haiku-4.5 |
| Greek | 🟡 7.5 | gemma3:27b |
| Language | Best Score | Best Model |
|---|---|---|
| Arabic | 🟡 8.0 | gemma4:31b |
| Hebrew | 🟡 7.3 | claude-haiku-4.5 |
| Language | Best Score | Best Model |
|---|---|---|
| Russian | 🟡 7.7 | gemma4:31b |
- By Language: All Languages
- By Model: All Models
- Benchmark Documentation: How to Run Benchmarks
- Raw Data: Translations / Judgments
- Archived v1 benchmark: Archive-Index — previous methodology, kept for reference
Generated by TranslateBookWithLLM benchmark system