A small bench for evaluating dependency parsers on the EvaLatin 2024 test data, with the goal of beating the published winners.
Comes with two reference models (UDPipe 2, ÚFAL LatinPipe), a local-LLM template via LM Studio, and a stub for your own model. Adding a new model means writing a roughly 30-line Python file.
Two ways to use this repo:
- Just want to read what's been tried? See Explore the repo — nothing to install.
- Want to run the bench or add a model? See Get started.
The repo ships with the actual parse outputs of every system we've run, so you can inspect results without installing anything. Start here if you're just reading along.
LAS = labeled attachment score, CLAS = content-word LAS. Gold tokenization, so the system contribution is purely the parse. See docs/01_findings.md for the full breakdown (UAS, fallback rates, per-finding analysis).
| split | system | LAS | CLAS |
|---|---|---|---|
| poetry | LatinPipe (1× checkpoint) | 72.27 | 71.28 |
| poetry | UDPipe 2 (latin-perseus-ud-2.17) | 61.19 | 59.90 |
| poetry | qwen3-vl-8b-instruct-mlx (LM Studio) | 18.21 | 17.42 |
| poetry | qwen3-0.6b-mlx (LM Studio) | 2.67 | 2.72 |
| prose | LatinPipe (1× checkpoint) | 75.06 | 70.90 |
| prose | UDPipe 2 (latin-perseus-ud-2.17) | 62.43 | 57.46 |
| prose | qwen3-vl-8b-instruct-mlx (LM Studio) | 17.80 | 14.53 |
| prose | qwen3-0.6b-mlx (LM Studio) | 1.62 | 1.51 |
The bar to beat is the published LatinPipe 7-model ensemble: ~78 LAS poetry, ~83 LAS prose. The 1× checkpoint above is one model from that ensemble and is what we ship as the local reference.
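As a rough illustration of what the LAS column measures, the sketch below scores one toy sentence. It is not the CoNLL-18 scorer the bench calls (that one also handles alignment and the content-word filtering behind CLAS); the function name `attachment_scores` and the arcs are made up for this example.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for token-aligned gold and predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold tokenization assumed: sequences must align")
    if not gold:
        return 0.0, 0.0
    # UAS: the head matches. LAS: head and relation both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Four-token toy sentence; the arcs are illustrative, not taken from the gold data.
gold = [(4, "nsubj"), (4, "aux"), (1, "det"), (0, "root")]
pred = [(4, "obj"), (4, "aux"), (4, "det"), (0, "root")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=75.00 LAS=50.00
```

Token 1 has the right head but the wrong label, so it counts toward UAS only; token 3 has the wrong head and counts toward neither.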
- docs/01_findings.md — distilled research log: what we tried, what worked, what's next.
- docs/00_task_explained.md — plain-English walkthrough of dependency parsing on Latin.
- notebooks/02_compare_models.ipynb — the bench end-to-end: running the comparison, the LLM error analysis, the 0.6B → 8B scale-up. Renders directly on GitHub.
- predictions/ — each system's actual parse output:
  - predictions/<system>/scores.json: full CoNLL-18 scorer output (UAS / LAS / CLAS / MLAS / BLEX, plus token/sent/word F1).
  - predictions/<system>/{poetry,prose}_pred.conllu: the predicted UD trees, sentence-by-sentence. Diff against data/EvaLatin-2024-test/ gold to see where each system goes wrong.
- src/latinbench/ — the small Python package:
  - Model ABC and Bench orchestrator (core.py), scorer wrapper (score.py), one file per model in models/.
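One way to do the "diff against gold" step mentioned above is to compare the HEAD and DEPREL columns token by token. The following is a minimal sketch that relies only on the standard 10-column CoNLL-U layout; the helper names `read_tokens` and `mismatches` are hypothetical and not part of latinbench.

```python
from pathlib import Path
from typing import Iterator, List, Tuple

Token = Tuple[str, str, str, str]  # (id, form, head, deprel)


def read_tokens(path: Path) -> List[Token]:
    """Collect syntactic words from a CoNLL-U file.

    Comment lines, blank lines, multiword-token ranges (ids like 1-2)
    and empty nodes (ids like 1.1) are skipped.
    """
    tokens: List[Token] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append((cols[0], cols[1], cols[6], cols[7]))
    return tokens


def mismatches(gold_path: Path, pred_path: Path) -> Iterator[Tuple[Token, Token]]:
    """Yield (gold, pred) token pairs whose head or relation differ."""
    gold, pred = read_tokens(gold_path), read_tokens(pred_path)
    if len(gold) != len(pred):
        raise ValueError("files do not share the same tokenization")
    for g, p in zip(gold, pred):
        if g[2:] != p[2:]:
            yield g, p
```

Call `mismatches(...)` with a gold file from data/EvaLatin-2024-test/ and the matching predictions/<system>/poetry_pred.conllu; each yielded pair is one token the system attached or labeled differently from gold.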
Skip this section if you only want to read results. Everything below assumes you'll be running models locally.
git clone https://github.com/Nicolas-Py/NLP-LLM-SS2026
cd NLP-LLM-SS2026
# Main env (notebook kernel + Python API)
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
# Subprocess env for LatinPipe (Keras + PyTorch backend)
python3 -m venv third_party/latinpipe/.venv
third_party/latinpipe/.venv/bin/pip install -e .
# Download the LatinPipe checkpoint (≈700 MB) from
# https://hdl.handle.net/11234/1-5671 and extract its contents
# (model.weights.h5, mappings.pkl, la_evalatin24.tokenizer, options.json)
# into checkpoints/latinpipe-evalatin24-240520/
Dependencies live in pyproject.toml; there is no requirements.txt because it would only duplicate that list. pip install -e . is the modern equivalent.
The two reference models (udpipe, latinpipe) come pre-registered. Pick
whichever interface you prefer:
Open PyCharm's Terminal tab (View → Tool Windows → Terminal, or ⌥F12)
and paste:
# One-time setup (skip if already done)
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
python3 -m venv third_party/latinpipe/.venv
third_party/latinpipe/.venv/bin/pip install -e .
# Then drop the LatinPipe checkpoint files into
# checkpoints/latinpipe-evalatin24-240520/
# Every PyCharm session — activate the venv first
source .venv/bin/activate
# Smoke-test both reference models
python -c "
from latinbench import Bench, MODELS
print(Bench().compare([MODELS['udpipe'], MODELS['latinpipe']]).to_string(index=False))
"
To make PyCharm itself (its notebook runner, autocomplete, "Run" buttons) use
this venv, point its project interpreter at <repo>/.venv/bin/python once.
source .venv/bin/activate
jupyter lab          # or jupyter notebook
Open notebooks/02_compare_models.ipynb → Run All. Same flow as PyCharm;
just a browser instead of the IDE.
Open the NLP-LLM-SS2026/ folder, pick the kernel <repo>/.venv/bin/python
in the top-right of the notebook, then Run All.
.venv/bin/python -c "
from latinbench import Bench, MODELS
print(Bench().compare([MODELS['udpipe'], MODELS['latinpipe']]).to_string(index=False))
"
Results cache at predictions/<model_name>/scores.json. The committed
predictions/ tree means Bench().run(...) will short-circuit to cached
scores on a fresh checkout. To genuinely re-run:
# Re-run everything from scratch
rm -rf predictions/*/scores.json
# Re-run just one model
rm -rf predictions/latinpipe
# Or pass force=True from Python:
.venv/bin/python -c "from latinbench import Bench, MODELS; Bench().run(MODELS['latinpipe'], force=True)"
LatinPipe inference takes ~1–2 min per split on M1 CPU; UDPipe takes ~5–10 sec (REST API).
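The short-circuit behaviour described above follows a common check-the-file-first pattern. The sketch below shows that pattern in isolation; it is not the code in core.py, and `run_with_cache`, its arguments, and the `compute` callback are all hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def run_with_cache(
    name: str,
    compute: Callable[[], Dict[str, float]],
    root: Path = Path("predictions"),
    force: bool = False,
) -> Dict[str, float]:
    """Return cached scores for `name` if present, otherwise compute and store them."""
    scores_path = root / name / "scores.json"
    if scores_path.exists() and not force:
        # Cache hit: nothing is re-run, which is why a fresh checkout is instant.
        return json.loads(scores_path.read_text(encoding="utf-8"))
    scores = compute()  # the expensive step (prediction plus scoring)
    scores_path.parent.mkdir(parents=True, exist_ok=True)
    scores_path.write_text(json.dumps(scores, indent=2), encoding="utf-8")
    return scores
```

Under this pattern, deleting scores.json and passing force=True have the same effect: both make the existence check fail to short-circuit.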
- ModuleNotFoundError: No module named 'latinbench'. Wrong Python interpreter. In PyCharm, fix via Settings → Interpreter. From the shell, which python should point at <repo>/.venv/bin/python.
- LatinPipe fails with TensorFlow import error. The subprocess venv exists but KERAS_BACKEND=torch wasn't set. The bench sets it automatically; if you're invoking latinpipe_evalatin24.py by hand, prefix the command with KERAS_BACKEND=torch.
- LatinPipe checkpoint not found. Verify that checkpoints/latinpipe-evalatin24-240520/model.weights.h5 exists (≈663 MB). Re-download from the LINDAT link above.
- Open src/latinbench/models/template.py.
- Subclass Model, set name, implement predict(test_path, out_path).
- Import it in notebooks/02_compare_models.ipynb and add it to the bench.compare([...]) call.
That's it. The bench handles writing predictions, calling the official scorer, parsing results, and plotting.
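To make the three steps concrete, here is a sketch of what such a file could look like. The `Model` class below is a stand-in defined so the example runs on its own; the real ABC lives in core.py and may differ. Only `name` and `predict(test_path, out_path)` come from this README, and the `LeftChainBaseline` parser is a hypothetical toy, not a model shipped with the repo.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Model(ABC):
    """Stand-in for latinbench's Model ABC (assumption: the real one may differ)."""

    name: str

    @abstractmethod
    def predict(self, test_path: Path, out_path: Path) -> None:
        """Read CoNLL-U from test_path, write predicted CoNLL-U to out_path."""


class LeftChainBaseline(Model):
    """Toy parser: the first word is the root, every other word attaches to its left neighbour."""

    name = "left-chain-baseline"

    def predict(self, test_path: Path, out_path: Path) -> None:
        out_lines = []
        for line in Path(test_path).read_text(encoding="utf-8").splitlines():
            cols = line.split("\t")
            # Only rewrite ordinary word lines; comments, blank lines,
            # ranges (1-2) and empty nodes (1.1) pass through unchanged.
            if len(cols) == 10 and cols[0].isdigit():
                idx = int(cols[0])
                cols[6] = str(idx - 1)                  # HEAD column (0 means root)
                cols[7] = "root" if idx == 1 else "dep"  # DEPREL column
                line = "\t".join(cols)
            out_lines.append(line)
        Path(out_path).write_text("\n".join(out_lines) + "\n", encoding="utf-8")
```

In the repo itself you would import Model from latinbench rather than redefining it, then pass an instance to the bench.compare([...]) call in the notebook.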
src/latinbench/ # the Python package
├── core.py # Model ABC + Bench orchestrator
├── score.py # scorer subprocess wrapper
├── data.py # canonical paths
└── models/ # one file per model
data/ # EvaLatin 2024 test + gold (committed)
third_party/
├── scorer/ # CoNLL-18 official scorer
└── latinpipe/ # vendored ÚFAL LatinPipe (no .git, no venv)
checkpoints/ # gitignored; LatinPipe weights live here
predictions/ # tracked; one subdir per system (see Explore above)
notebooks/ # 01_explore_data, 02_compare_models
docs/
├── 00_task_explained.md # what dependency parsing is, in plain English
└── 01_findings.md # research log: what's been tried, what worked
from latinbench import Bench
from latinbench.models.udpipe import UdpipeModel, list_perseus_models
print(list_perseus_models()) # all available LINDAT ids
Bench().run(UdpipeModel('latin-perseus-ud-2.6-200830')) # try an older one
Bench().run(UdpipeModel(model_id='latest'))             # auto-pick newest
Each id gets its own predictions/<id>/ cache dir, so swapping versions
doesn't clobber prior results.
Three LM Studio entries are registered (one 0.6B baseline, one 8B Qwen3-VL, one Gemma-3-12B). They share a single running LM Studio server; only one model is hot in memory at a time but LM Studio auto-swaps on request, so the same workflow handles all three.
One-time setup:
- Install LM Studio.
- Discover tab → search "Qwen3 0.6B MLX" → download lmstudio-community/Qwen3-0.6B-MLX-4bit (~400 MB). Repeat for any other model you want to bench.
- Developer tab → load the model → Start Server (defaults to port 1234).
- Recommended server settings: max parallel requests 8, Flash Attention on, KV cache q8_0.
- Sanity check: curl http://localhost:1234/v1/models should list the loaded model.
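The same sanity check can be run from Python, which is handy for finding the exact model id to pass to the bench. This sketch assumes the server answers in the OpenAI-style list shape, `{"data": [{"id": ...}, ...]}`; the helper names `parse_model_ids` and `list_models` are hypothetical.

```python
import json
from typing import List
from urllib.request import urlopen


def parse_model_ids(payload: dict) -> List[str]:
    """Extract model ids from an OpenAI-style /v1/models response body (assumed shape)."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:1234/v1", timeout: float = 5.0) -> List[str]:
    """Ask a running LM Studio server which models it exposes."""
    with urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

With the server started, `print(list_models())` should show the ids of the models LM Studio exposes; if the server is not running, urlopen raises a URLError.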
Run any registered LM Studio model:
from latinbench import Bench, MODELS
Bench().run(MODELS["qwen3-lmstudio"]) # 0.6B baseline
Bench().run(MODELS["qwen3-vl-8b-lmstudio"]) # 8B
Bench().run(MODELS["gemma-3-12b-lmstudio"])  # 12B, different family
Swap to any other model loaded in LM Studio by constructing directly (pass
the exact id LM Studio reports for GET /v1/models):
from latinbench import Bench
from latinbench.models.lmstudio_llm import LMStudioModel
Bench().run(LMStudioModel("qwen/qwen3-4b")) # different size
Bench().run(LMStudioModel("meta-llama/llama-3.2-3b"))  # different family
Each model_id gets its own predictions/<slug>/ cache dir (: and / are
sanitized to -). Expect ~5 s per short sentence on a 0.6B model, so full
splits take many minutes; the bench caches per-model so re-runs are instant.
A <pred>.partial.json sidecar updates as each sentence completes, so a run
that crashes mid-predict resumes from where it left off.
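The slug rule and the resume-from-sidecar idea can be sketched in a few lines. This is an illustration only, not the code in lmstudio_llm.py: the sidecar layout used here (a JSON object mapping sentence index to output) is an assumption, and `slugify` and `predict_resumable` are hypothetical names.

```python
import json
import re
from pathlib import Path
from typing import Callable, Dict, List


def slugify(model_id: str) -> str:
    """Replace ':' and '/' with '-' so the id is safe as a directory name."""
    return re.sub(r"[:/]", "-", model_id)


def predict_resumable(
    sentences: List[str],
    parse: Callable[[str], str],
    partial_path: Path,
) -> List[str]:
    """Parse sentences one by one, checkpointing after each so a crash can resume."""
    done: Dict[str, str] = {}
    if partial_path.exists():
        # Assumed sidecar format: {"0": "<parse of sentence 0>", "1": ...}
        done = json.loads(partial_path.read_text(encoding="utf-8"))
    for i, sentence in enumerate(sentences):
        key = str(i)
        if key in done:
            continue  # finished in an earlier run
        done[key] = parse(sentence)
        partial_path.write_text(json.dumps(done), encoding="utf-8")
    return [done[str(i)] for i in range(len(sentences))]


print(slugify("qwen/qwen3-4b"))  # qwen-qwen3-4b
```

Rewriting the whole sidecar after every sentence is wasteful for large splits but keeps the sketch short; an append-only log would serve the same purpose.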