Rethunk-AI/bakeoff

Local LLM N-vs-N Benchmark (LM Studio GGUFs via llama.cpp)

A small harness that serves models from ~/.lmstudio/models/ through a llama-swap proxy in front of llama.cpp podman containers and benchmarks them on quality, latency, and cost (energy). It supports any number of models, compared either by round-robin tournament (pairwise_all) or by absolute rubric (scored), and emits JSON, Markdown, and a single-file HTML dashboard.

Matrix: tasks × prompt_variants × models.
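
As a rough illustration of how the matrix grows, the sketch below enumerates the cells and counts judge calls for the two modes. The task, variant, and model names are placeholders (real values come from config.yaml), and the assumption that the judge is called once per model pair per (task, variant) in pairwise_all, and once per generation in scored, is ours rather than something the harness documents here.

```python
from itertools import combinations, product

# Placeholder names for illustration only; the real lists live in config.yaml.
tasks = ["qa", "code", "summarize", "classify"]
prompt_variants = ["plain", "cot"]
models = ["model-a", "model-b", "model-c", "model-d"]

# One generation per cell of tasks x prompt_variants x models.
cells = list(product(tasks, prompt_variants, models))

# pairwise_all (assumed): every unordered model pair meets once per (task, variant).
pairs = list(combinations(models, 2))
pairwise_judge_calls = len(tasks) * len(prompt_variants) * len(pairs)

# scored (assumed): one rubric judgement per generation.
scored_judge_calls = len(cells)

print(len(cells), pairwise_judge_calls, scored_judge_calls)
```

With four models this gives 32 generations, 48 pairwise judgements, and 32 rubric judgements. The pairwise count grows with N(N-1)/2 while the rubric count grows with N.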

Documentation

Start here:

  • HUMANS.md — operators & developers: prerequisites, install, run, configure, troubleshoot, clean up.
  • AGENTS.md — LLMs & contributors: design invariants, hardware caveats, judge-mode selection, editing conventions. CLAUDE.md is a symlink here.
  • CONTRIBUTING.md — opening a PR: pre-commit checklist, commit style, what not to change without discussing.

Reference:

  • config.yaml — single source of truth for server, models, prompts, dataset, judge, cost, output. Inline comments describe every knob.

Design choices (explicit)

| Concern | Choice | Reason |
| --- | --- | --- |
| Serving | llama-swap proxy in front of podman + ghcr.io/ggml-org/llama.cpp:server-vulkan | llama-swap owns model lifecycle (boot/unload on demand). The Vulkan image works on AMD (Strix Halo tested), NVIDIA, Intel without per-backend wrangling. No llama-server binary ships in LM Studio's own ~/.lmstudio/extensions/backends/*/, only .so libraries for LM Studio's internal runtime. |
| One model at a time | llama-swap unloads current backend before starting next; runner iterates per-model-sequentially on top of that | Unified-memory APUs (and modest-VRAM discrete GPUs) can't hold A + B + judge concurrently. Each model pays exactly one swap, absorbed by warmup. |
| Transport | OpenAI-compatible /v1/chat/completions | llama.cpp server exposes it; one client class works for Ollama, vLLM, LM Studio, llama.cpp. |
| Quality scorer | pairwise_all tournament (default) or scored 1-5 rubric via LLM judge; plus heuristic (exact / contains / regex) for structured tasks | Tournament gives sharp ranking on small N (2-4); rubric scales linearly for larger N. Heuristics catch hard ground-truth items without a judge round trip. |
| Pairwise positional-bias mitigation | Order randomized per call (seeded from run.seed); swapped verdicts inverted before counting | Judges show a 5-15% preference for slot A; flipping per call averages it out across the matrix. The order ("AB" or "BA") is stored on every judgement. |
| Default context | 4096 | Benchmark prompts are short; keeping ctx small cuts load time + memory. Override per model or in server.ctx. |
| "Cost" for local models | Energy estimate via nvidia-smi --query-gpu=power.draw or rocm-smi --showpower sampled at call start + end; average × wall time × $/kWh | No per-token price for local. Energy is the honest cost axis. |
| Cost fallback | energy_wh / cost_usd = null when neither tool works | No silent substitution with latency. On Strix Halo, rocm-smi often fails on libdrm_amdgpu.so; expect null. |
| Dataset | Generated from seeded templates across qa / code / summarize / classify | Simple stack, no external corpus. Deterministic via run.seed. |
| Stack | Python + httpx + pyyaml, stdlib for everything else | No promptfoo / lm-eval / framework. |
| Dashboard | One static HTML file with Chart.js via CDN, reads embedded run JSON | No build step, opens from disk. |
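
The positional-bias mitigation described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names, the "X"/"Y" labels for the two models, and the verdict strings ("A", "B", "tie") are all hypothetical, and the seed value stands in for run.seed.

```python
import random

def present_pair(rng, answer_x, answer_y):
    """Randomly decide which answer the judge sees in slot A."""
    order = rng.choice(["AB", "BA"])
    if order == "AB":
        return order, answer_x, answer_y
    # "BA": model Y's answer is shown in slot A, model X's in slot B.
    return order, answer_y, answer_x

def normalize_verdict(order, verdict):
    """Map a slot verdict ("A", "B", "tie") back to the model labels ("X", "Y", "tie")."""
    if verdict == "tie":
        return "tie"
    if order == "AB":
        return "X" if verdict == "A" else "Y"
    # Swapped presentation: invert the verdict before counting.
    return "Y" if verdict == "A" else "X"

rng = random.Random(1234)  # stand-in for run.seed, so orders are reproducible
order, slot_a, slot_b = present_pair(rng, "answer from X", "answer from Y")
# Pretend the judge picked slot A; store the order alongside the winner.
judgement = {"order": order, "winner": normalize_verdict(order, "A")}
```

Because the order is drawn per call from a seeded generator, a judge that favours slot A helps each model about equally often across the matrix, and reruns with the same seed see the same presentation orders.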

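The energy-based cost in the design choices above comes down to simple arithmetic: average the two power samples, multiply by wall time, convert to kWh, and apply the tariff. The sketch below shows that calculation under assumed names (energy_cost and its parameters are hypothetical); a missing sample yields None for both values, mirroring the null fallback rather than substituting latency.

```python
def energy_cost(power_start_w, power_end_w, wall_seconds, usd_per_kwh):
    """Return (energy_wh, cost_usd), or (None, None) if a power sample is missing."""
    if power_start_w is None or power_end_w is None:
        # Neither nvidia-smi nor rocm-smi produced a reading: report null, not a guess.
        return None, None
    avg_w = (power_start_w + power_end_w) / 2   # mean of start and end samples, in watts
    energy_wh = avg_w * wall_seconds / 3600     # watt-seconds to watt-hours
    cost_usd = energy_wh / 1000 * usd_per_kwh   # Wh to kWh, then apply the tariff
    return energy_wh, cost_usd

# Example: 100 W at start, 140 W at end, a 30 s call, $0.15 per kWh.
energy_wh, cost_usd = energy_cost(100.0, 140.0, 30.0, 0.15)
```

For that example the average draw is 120 W, so the call uses 1.0 Wh and costs about $0.00015.
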
Layout

config.yaml          the contract
bin/llama-swap.sh    proxy launcher (up / down / sweep / wait)
bench/               clients · dataset · download · llama_swap · metrics · publish · runner · report
run.sh               uv venv + install + pinned llama-swap bootstrap + run
.cache/              vendored llama-swap binary + generated proxy config (gitignored)
datasets/            generated inputs (gitignored)
results/             run-<ts>.json / .md / .html (gitignored)

Per-module breakdown with behavior notes: AGENTS.md § Layout.

For prerequisites, configuration, and troubleshooting see HUMANS.md.
