Research: 32K-vocab English-optimized small model for quant.cpp #92

@unamedkr

Summary

Investigate creating or fine-tuning a small (1.7-3.8B) model with a 32K vocabulary optimized for English document QA on quant.cpp. This addresses the fundamental speed/quality tension discovered in our benchmarking.

The Vocab Size Dilemma

Most models released in 2025-2026 have moved to large vocabularies for multilingual support:

| Model | Year | Vocab | Our tok/s (M3, Q8) |
|---|---|---|---|
| Phi-3.5-mini | 2024 | 32K | 6.5 |
| SmolLM2-1.7B | 2024 | 49K | 23 |
| Qwen3-4B | 2025 | 152K | ~2 (est) |
| Phi-4-mini | 2025 | 200K | ~1 (est) |
| Gemma-3-4B | 2025 | 262K | ~0.8 (est) |

The industry trend (bigger vocab) is the opposite of what local CPU inference needs (smaller vocab).

Phi-3.5 is the last model we know of with a small (32K), English-focused vocabulary, and its benchmark results date from 2024 and are now outdated.

Options

Option A: Vocabulary pruning

Take Qwen3-4B (best quality) and prune its 152K vocab to ~32K English-only tokens. Re-train the embedding/lm_head layers.

  • Pro: Best underlying model quality
  • Con: Requires GPU training, may degrade quality
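
As a rough illustration of the pruning step, the sketch below slices an embedding table down to a kept set of token IDs and builds the old-to-new ID mapping. The names `prune_vocab` and `remap` are hypothetical, the embedding is a plain list of rows rather than a real tensor, and it ignores tokenizer details such as BPE merge rules, which would also need updating in practice. The same row selection would apply to the lm_head weights.

```python
from typing import Dict, List, Sequence, Tuple


def prune_vocab(
    embedding: Sequence[Sequence[float]],
    keep_ids: Sequence[int],
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Keep only the embedding rows for keep_ids (hypothetical helper).

    Returns the pruned table and a mapping from old token ID to new token ID.
    """
    kept = sorted(set(keep_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    pruned = [list(embedding[old]) for old in kept]
    return pruned, old_to_new


def remap(token_ids: Sequence[int], old_to_new: Dict[int, int], unk_id: int) -> List[int]:
    """Translate old token IDs to pruned IDs; dropped tokens fall back to unk_id."""
    return [old_to_new.get(t, unk_id) for t in token_ids]


# Toy example: a 5-token vocabulary pruned to 3 tokens.
toy_embedding = [[float(i), float(i) * 2] for i in range(5)]
pruned, mapping = prune_vocab(toy_embedding, keep_ids=[4, 0, 2])
remapped = remap([0, 1, 4], mapping, unk_id=0)
```

After slicing, the pruned rows would serve as the initialization for the re-training step described above, not as a finished model.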

Option B: Knowledge distillation

Distill Qwen3-4B's knowledge into a Phi-3.5-architecture student with 32K vocab.

  • Pro: Purpose-built architecture
  • Con: Significant training effort
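
A minimal sketch of a distillation objective, assuming the common temperature-softened KL formulation. It also assumes teacher and student logits are indexed over the same vocabulary, which is not true out of the box here: Qwen3-4B and a 32K-vocab student use different tokenizers, so the teacher logits would first need to be projected onto the student vocabulary, or the student could instead be trained on teacher-generated text. `distillation_loss` is a hypothetical name.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes both logit vectors cover the same vocabulary in the same order.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The loss is zero when the student matches the teacher and grows as the two distributions diverge.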

Option C: Fine-tune Phi-3.5 on document QA

Keep Phi-3.5's 32K vocab but fine-tune on document QA tasks (SQuAD, NaturalQuestions, etc.).

  • Pro: No vocabulary changes, just quality improvement
  • Con: Limited by Phi-3.5's 2024-era pre-training
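
For illustration, one way to turn a SQuAD-style record into a prompt/completion pair for fine-tuning. The field names (`context`, `question`, `answers["text"]`) follow the Hugging Face `squad` dataset layout; the prompt template, the `"unanswerable"` fallback, and the function name `format_qa_example` are assumptions, not part of this issue.

```python
from typing import Any, Dict


def format_qa_example(record: Dict[str, Any]) -> Dict[str, str]:
    """Build a prompt/completion pair from a SQuAD-style record (hypothetical template)."""
    answers = record["answers"]["text"]
    # SQuAD v2-style records may carry no answer; use a placeholder in that case.
    answer = answers[0] if answers else "unanswerable"
    prompt = (
        f"Context:\n{record['context']}\n\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )
    return {"prompt": prompt, "completion": " " + answer}


example = format_qa_example(
    {
        "context": "quant.cpp runs quantized models on CPU.",
        "question": "Where does quant.cpp run models?",
        "answers": {"text": ["on CPU"], "answer_start": [30]},
    }
)
```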

Option D: Community model search

Monitor HuggingFace for new models with small vocabularies. Some research groups may release English-focused models.

  • Pro: Zero effort
  • Con: May never appear (industry trend is opposite)
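
The monitoring could be as simple as filtering candidate model configs by vocabulary size. The sketch below assumes each config is a dict with a top-level `vocab_size` key, as in many Hugging Face `config.json` files; fetching the configs is left out. The 50,000 threshold and the name `small_vocab_models` are hypothetical, and the vocab sizes are the rounded figures from the table above.

```python
from typing import Dict, List, Tuple


def small_vocab_models(
    configs: Dict[str, Dict[str, int]],
    max_vocab: int = 50_000,
) -> List[Tuple[str, int]]:
    """Return (model, vocab_size) pairs at or below max_vocab, smallest first."""
    hits = [
        (name, cfg["vocab_size"])
        for name, cfg in configs.items()
        if "vocab_size" in cfg and cfg["vocab_size"] <= max_vocab
    ]
    return sorted(hits, key=lambda item: item[1])


# Rounded vocab sizes taken from the table in this issue.
candidates = {
    "Phi-3.5-mini": {"vocab_size": 32_000},
    "SmolLM2-1.7B": {"vocab_size": 49_000},
    "Qwen3-4B": {"vocab_size": 152_000},
    "Phi-4-mini": {"vocab_size": 200_000},
    "Gemma-3-4B": {"vocab_size": 262_000},
}
matches = small_vocab_models(candidates)
```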

Why This Matters

The speed formula for local inference is approximately:

tok/s ∝ 1 / (vocab_size × params^0.5 × quant_overhead)

Under this formula, a 3.8B model with a 32K vocab is about 6.25x faster than the same model with a 200K vocab (200K / 32K = 6.25). This is not an incremental optimization; it is a fundamental architectural advantage for the English-only use case.
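
To make the arithmetic concrete, here is a small sketch that evaluates the proportionality above as written. Because it is only a proportionality, the absolute score is meaningless; only ratios between configurations are. The function names and the default `quant_overhead=1.0` are assumptions for illustration.

```python
def relative_speed(vocab_size: int, params: float, quant_overhead: float = 1.0) -> float:
    """Unnormalized score from tok/s ∝ 1 / (vocab_size × params^0.5 × quant_overhead)."""
    return 1.0 / (vocab_size * params ** 0.5 * quant_overhead)


def speedup(vocab_small: int, vocab_large: int, params: float = 3.8e9) -> float:
    """Ratio of scores for two vocab sizes at equal params and quantization."""
    return relative_speed(vocab_small, params) / relative_speed(vocab_large, params)


ratio = speedup(32_000, 200_000)
print(f"{ratio:.2f}x")  # prints "6.25x"
```

With parameter count and quantization held fixed, those terms cancel and the ratio reduces to the ratio of vocabulary sizes.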

Priority: P3

Long-term research direction. Immediate impact comes from #83 (KV cache) and #84 (coherence API).

Labels: enhancement
