hydropix edited this page May 11, 2026 · 36 revisions

Translation Quality Benchmark

Last updated: 2026-05-11 23:18

This wiki contains translation quality benchmarks for several LLMs across 23 target languages.

Methodology v2: the judge is Claude Opus 4.7, applying the formal rubric v2 via Poe. Data is stored in a split layout: the per-model translations/ files are joined to a single judgments/<judge-id>.json on the key (model_id, text_id, target_lang), with an output_hash integrity check. The previous benchmark is archived under Archive-Index.
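The join described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the record fields `output` and `scores`, the helper names, and the use of SHA-256 for `output_hash` are all assumptions, and the sample records are made up.

```python
import hashlib


def output_hash(text: str) -> str:
    # Assumption: SHA-256 over the UTF-8 translation text.
    # The real hashing scheme is not documented on this page.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def join_judgments(translations, judgments):
    """Join translations to judgments on (model_id, text_id, target_lang).

    Returns (joined, stale): joined rows carry the judge's scores; stale
    lists keys whose stored output_hash no longer matches the translation.
    """
    index = {
        (j["model_id"], j["text_id"], j["target_lang"]): j for j in judgments
    }
    joined, stale = [], []
    for t in translations:
        key = (t["model_id"], t["text_id"], t["target_lang"])
        j = index.get(key)
        if j is None:
            continue  # no judgment recorded for this translation yet
        if j["output_hash"] != output_hash(t["output"]):
            stale.append(key)  # translation changed after it was judged
            continue
        joined.append({**t, "scores": j["scores"]})
    return joined, stale


# Made-up records, for illustration only.
translations = [
    {"model_id": "gemma3:27b", "text_id": "t1", "target_lang": "fr",
     "output": "Bonjour le monde"},
    {"model_id": "gemma3:27b", "text_id": "t2", "target_lang": "fr",
     "output": "Texte modifié"},
]
judgments = [
    {"model_id": "gemma3:27b", "text_id": "t1", "target_lang": "fr",
     "output_hash": output_hash("Bonjour le monde"), "scores": {"overall": 8}},
    {"model_id": "gemma3:27b", "text_id": "t2", "target_lang": "fr",
     "output_hash": output_hash("Texte original"), "scores": {"overall": 7}},
]
joined, stale = join_judgments(translations, judgments)
```

Keeping the hash next to each judgment means a re-generated translation is flagged as stale rather than silently inheriting an old score.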

Score Legend

| Indicator | Range | Label |
|---|---|---|
| 🟢 | 9-10 | Excellent |
| 🟡 | 7-8 | Good |
| 🟠 | 5-6 | Acceptable |
| 🔴 | 3-4 | Poor |
| | 1-2 | Failed |
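The legend can be read as a simple threshold lookup. The sketch below assumes that fractional averages fall into the band whose lower bound they reach (so 8.2 is Good and 6.2 is Acceptable, consistent with the tables on this page); how the benchmark treats values between bands, such as 8.9, is not stated here, and the function name is hypothetical.

```python
def score_label(score: float) -> str:
    """Map a 1-10 score to its legend label.

    Assumption: a score belongs to the highest band whose lower bound
    it reaches; this page does not document the exact rounding rule.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    bands = [(9, "Excellent"), (7, "Good"), (5, "Acceptable"), (3, "Poor")]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "Failed"
```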

Model Rankings

Overall performance across all tested languages:

| Rank | Model | Avg Score | Accuracy | Fluency | Style | Tests | Obs | Verified |
|---|---|---|---|---|---|---|---|---|
| 1 | gemini-3.1-flash-lite | 🟡 8.2 | 8.6 | 8.7 | 8.0 | 45 | 45 | verified |
| 2 | mistral-small-latest | 🟡 7.9 | 8.1 | 8.6 | 7.7 | 45 | 45 | verified |
| 3 | gemma4:31b | 🟡 7.7 | 8.1 | 8.3 | 7.5 | 227 | 227 | self-reported |
| 4 | claude-haiku-4.5 | 🟡 7.6 | 8.1 | 7.9 | 7.2 | 227 | 227 | verified |
| 5 | qwen3.5:35b | 🟡 7.5 | 7.8 | 8.0 | 7.3 | 45 | 45 | self-reported |
| 6 | gemma3:27b | 🟡 7.5 | 7.8 | 8.2 | 7.2 | 227 | 227 | self-reported |
| 7 | gemma4:e2b | 🟠 6.2 | 6.6 | 6.5 | 5.9 | 227 | 227 | self-reported |

Language Rankings (Top 15)

Ranked by the best model's average overall score on each target language. (An average across all models would depend on how many models were tested on each language, so the best-model score is a fairer cross-language signal.)
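As a rough sketch of this ranking rule: average the overall scores per (language, model) pair, keep the highest average for each language, then sort languages by that value. The function name, the tuple layout, and the sample observations below are hypothetical and not taken from the benchmark data.

```python
from collections import defaultdict
from statistics import mean


def rank_languages(observations):
    """Rank languages by their best model's average overall score.

    observations: iterable of (model_id, language, overall_score) tuples.
    Returns a list of (language, best_average, best_model), best first.
    """
    scores = defaultdict(list)
    for model, language, overall in observations:
        scores[(language, model)].append(overall)

    best = {}  # language -> (average, model)
    for (language, model), values in scores.items():
        average = mean(values)
        if language not in best or average > best[language][0]:
            best[language] = (average, model)

    rows = [(lang, avg, model) for lang, (avg, model) in best.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up observations, for illustration only.
observations = [
    ("model-a", "French", 8), ("model-a", "French", 9),
    ("model-b", "French", 7),
    ("model-b", "German", 8),
    ("model-a", "German", 6), ("model-a", "German", 7),
]
ranking = rank_languages(observations)
```

Under this rule a language tested with one strong model and several weak ones is not penalized for the weak ones, which is the skew the note above is trying to avoid.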

| Rank | Language | Native | Best Score | Best Model | Tests | Obs | Verified |
|---|---|---|---|---|---|---|---|
| 1 | Chinese (Simplified) | 简体中文 | 🟡 8.4 | gemini-3.1-flash-lite | 77 | 77 | mixed |
| 2 | French | Français | 🟡 8.3 | gemini-3.1-flash-lite | 70 | 70 | mixed |
| 3 | Portuguese | Português | 🟡 8.2 | gemma4:31b | 40 | 40 | mixed |
| 4 | Spanish | Español | 🟡 8.2 | gemma4:31b | 70 | 70 | mixed |
| 5 | Korean | 한국어 | 🟡 8.2 | gemma4:31b | 40 | 40 | mixed |
| 6 | Vietnamese | Tiếng Việt | 🟡 8.1 | gemini-3.1-flash-lite | 70 | 70 | mixed |
| 7 | Thai | ไทย | 🟡 8.1 | gemma4:31b | 40 | 40 | mixed |
| 8 | Arabic | العربية | 🟡 8.0 | gemma4:31b | 40 | 40 | mixed |
| 9 | English | English | 🟡 8.0 | gemini-3.1-flash-lite | 28 | 28 | mixed |
| 10 | Italian | Italiano | 🟡 8.0 | gemma3:27b | 40 | 40 | mixed |
| 11 | Japanese | 日本語 | 🟡 8.0 | gemma4:31b | 48 | 48 | mixed |
| 12 | Hindi | हिन्दी | 🟡 8.0 | gemma4:31b | 40 | 40 | mixed |
| 13 | Indonesian | Bahasa Indonesia | 🟡 8.0 | gemma4:31b | 40 | 40 | mixed |
| 14 | Dutch | Nederlands | 🟡 7.9 | gemma4:31b | 40 | 40 | mixed |
| 15 | Bengali | বাংলা | 🟡 7.9 | gemma4:31b | 40 | 40 | mixed |

Quick Stats

  • Total Models Tested: 7
  • Total Languages: 23
  • Total Translations: 1043
  • Evaluator Model: claude-opus-4-7-rubric-v2-poe
  • Source Language: English

Categories

By Language Category

Asian Languages

| Language | Best Score | Best Model |
|---|---|---|
| Chinese (Simplified) | 🟡 8.4 | gemini-3.1-flash-lite |
| Korean | 🟡 8.2 | gemma4:31b |
| Vietnamese | 🟡 8.1 | gemini-3.1-flash-lite |
| Thai | 🟡 8.1 | gemma4:31b |
| Japanese | 🟡 8.0 | gemma4:31b |
| Hindi | 🟡 8.0 | gemma4:31b |
| Indonesian | 🟡 8.0 | gemma4:31b |
| Bengali | 🟡 7.9 | gemma4:31b |
| Tamil | 🟡 7.5 | gemma3:27b |

European Major Languages

| Language | Best Score | Best Model |
|---|---|---|
| French | 🟡 8.3 | gemini-3.1-flash-lite |
| Portuguese | 🟡 8.2 | gemma4:31b |
| Spanish | 🟡 8.2 | gemma4:31b |
| English | 🟡 8.0 | gemini-3.1-flash-lite |
| Italian | 🟡 8.0 | gemma3:27b |
| Dutch | 🟡 7.9 | gemma4:31b |
| Swedish | 🟡 7.9 | gemma3:27b |
| Polish | 🟡 7.9 | gemma4:31b |
| Danish | 🟡 7.8 | gemma4:31b |
| German | 🟡 7.8 | claude-haiku-4.5 |
| Greek | 🟡 7.5 | gemma3:27b |

Semitic Languages

| Language | Best Score | Best Model |
|---|---|---|
| Arabic | 🟡 8.0 | gemma4:31b |
| Hebrew | 🟡 7.3 | claude-haiku-4.5 |

Cyrillic Languages

| Language | Best Score | Best Model |
|---|---|---|
| Russian | 🟡 7.7 | gemma4:31b |


Generated by TranslateBookWithLLM benchmark system
