
Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0 #1024

Open
immartian wants to merge 1 commit into openai:main from immartian:binding-ctw-improvement

Conversation


@immartian immartian commented Mar 28, 2026

Evidence-Aware Dirichlet Concentration for N-gram Mixing

The problem

Hierarchical Dirichlet mixing (CTW / Willems et al. 1995) blends n-gram predictions across orders using a concentration parameter c:

p = (c * p_backup + count) / (c + ctx_count)

Current implementations use fixed c (typically 5.0) for all contexts. But contexts vary enormously in reliability:

Context A:  "the" -> next token?     (ctx_count=50,000, but 800+ distinct continuations)
Context B:  "mitochondri" -> ?       (ctx_count=47, but almost always "a" or "al")

Fixed c=5.0 treats both the same. That's wrong.
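For reference, a minimal sketch of the fixed-concentration mixing step described above (variable names like `count_vec` and `backup_p` are illustrative, not this repo's API):

```python
import numpy as np

def dirichlet_mix_fixed(count_vec, ctx_count, backup_p, c=5.0):
    """One order of hierarchical Dirichlet mixing with a fixed concentration.

    count_vec : per-token continuation counts for this context (length V)
    ctx_count : total observations of this context (count_vec.sum())
    backup_p  : normalized prediction from the next-shorter context (length V)
    """
    # Fixed c: the backup gets the same pull regardless of how reliable
    # this context's counts are.
    return (c * backup_p + count_vec) / (c + ctx_count)
```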

The fix (one line)

# Before: fixed concentration
c = 5.0

# After: evidence-aware concentration
c = c_base / (1 + beta * log(ctx_count) * avg_idf(context_tokens))

How it works:

              ctx_count    context specificity    c_eff     behavior
              ---------    -------------------    -----     --------
"the" ->         50,000    low  (IDF ~ 0.1)       3.8      smooth toward backup
"mitochondri"->      47    high (IDF ~ 0.9)       0.6      trust the counts
random noise ->       2    any                     4.9      fall back to backup

The compression scheme knows its own reliability. High evidence + rare context -> trust the counts. Low evidence + common context -> smooth.
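A minimal sketch of the adaptive concentration, assuming a precomputed per-token `idf` array scaled to roughly [0, 1] (the `log1p` follows the integration snippet below; the `c_base` and `beta` defaults here are illustrative):

```python
import numpy as np

def evidence_aware_c(ctx_count, context_tokens, idf, c_base=5.0, beta=1.0):
    """Shrink the concentration when a context is both well observed
    (large ctx_count) and specific (high average IDF of its tokens)."""
    spec = float(np.mean(idf[np.asarray(context_tokens)]))   # context specificity
    return c_base / (1.0 + beta * np.log1p(ctx_count) * spec)

# High evidence + rare context   -> small c_eff (trust the counts)
# Low evidence or common context -> c_eff near c_base (smooth toward backup)
```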

Results (synthetic benchmark)

Two-regime corpus (200K tokens, vocab=1024):

  • Regime A: rare deterministic patterns ([950,951,952] -> 953 always)
  • Regime B: common ambiguous patterns ([5] -> uniform random)
| Method | All (bpt) | Rare ctx | Common ctx |
| --- | --- | --- | --- |
| Uniform baseline | 10.000 | 10.000 | 10.000 |
| Fixed CTW (c=5.0) | 1.051 | 1.519 | 1.087 |
| Evidence-aware | 0.687 | 0.976 | 0.720 |

35% improvement. Wins on both regimes: trusts strong evidence, smooths weak evidence.
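For readers who want a picture of the setup without running test/proof_binding_beats_fixed.py, here is a rough sketch of what such a two-regime stream could look like (the 5% rare-pattern rate is an assumption; the actual generator may differ):

```python
import numpy as np

def make_two_regime_corpus(n_tokens=200_000, vocab=1024, rare_rate=0.05, seed=0):
    """Interleave a rare deterministic pattern ([950, 951, 952] -> 953 always)
    with a common ambiguous one ([5] -> uniformly random next token)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n_tokens:
        if rng.random() < rare_rate:                 # Regime A: rare, deterministic
            out.extend([950, 951, 952, 953])
        else:                                        # Regime B: common, ambiguous
            out.extend([5, int(rng.integers(vocab))])
    return np.array(out[:n_tokens], dtype=np.int64)
```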

Integration

Drop-in replacement for any Dirichlet n-gram mixer. The only addition is an IDF table (vocab_size floats = 4KB for vocab=1024):

# In the inner loop of hierarchical Dirichlet mixing
# (cc = context count, fc = continuation count for this token,
#  prev_p = backup probability from the shorter context):
spec = avg_idf(context_tokens[position])             # IDF lookup, O(1)
c_eff = c_base / (1 + beta * np.log1p(cc) * spec)    # scalar math
blended[idx] = (c_eff * prev_p + fc) / (c_eff + cc)  # same formula, adaptive c

Zero additional memory beyond the IDF table. Compatible with any properly normalized n-gram backend.
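Building the IDF table is a one-time pass over the training stream. A minimal sketch, assuming fixed-length chunks stand in for documents (binding_ctw.py may compute it differently):

```python
import numpy as np

def build_idf(token_stream, vocab_size, doc_len=1024):
    """Return vocab_size floats (~4 KB at vocab=1024 in float32):
    a smoothed IDF per token id, rescaled to [0, 1]."""
    df = np.zeros(vocab_size)
    n_docs = 0
    for i in range(0, len(token_stream), doc_len):
        df[np.unique(token_stream[i:i + doc_len])] += 1.0   # document frequency
        n_docs += 1
    idf = np.log((1.0 + n_docs) / (1.0 + df))               # smoothed IDF
    return (idf / max(idf.max(), 1e-12)).astype(np.float32)
```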

Note on normalization

This improvement is orthogonal to the hash-based normalization issues discussed in #677. The evidence-aware concentration adapts the mixing weight, not the underlying probability computation. It works with any valid n-gram scoring method.

Files

| File | Description |
| --- | --- |
| binding_ctw.py | Evidence-aware CTW module with full and single-pass scoring |
| test/test_binding_ctw.py | 19 tests |
| test/proof_binding_beats_fixed.py | Reproducible benchmark |

Test plan

  • 19 unit tests passing
  • Reproducible proof on synthetic data
  • Test on FineWeb validation set
  • Tune beta and c_base on real data

One-line change to hierarchical Dirichlet CTW mixing:
  c_eff = c_base / (1 + β × log(ctx_count) × avg_idf(context))

Instead of fixed c=5.0 for all contexts, adapt concentration based on
evidence strength (ctx_count) and context specificity (IDF):
  - High counts + rare context → low c → trust n-gram counts
  - Low counts + common context → c ≈ c_base → smooth toward backup

Results (synthetic two-regime corpus, 200K tokens):
  Fixed CTW (c=5.0):    1.0511 bits/token
  Binding CTW (c=c(B)): 0.6868 bits/token  (35% better)

Wins on both regimes:
  Rare deterministic:  0.976 vs 1.519 (+0.543 bpt)
  Common ambiguous:    0.720 vs 1.087 (+0.366 bpt)

19 tests + reproducible proof script included.
@immartian immartian changed the title Evidence-aware Dirichlet concentration — 35% improvement over fixed c=5.0 Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0 Mar 28, 2026
@immartian immartian (Author) commented

Note on normalization: We're aware of the ongoing discussion in #677 regarding n-gram cache normalization issues. Our binding CTW module (binding_ctw.py) is agnostic to the normalization backend — the evidence-aware concentration formula works with any properly normalized probability computation, not just hash-based caches.

The 35% improvement shown here compares fixed vs adaptive concentration under identical conditions (same cache, same normalization). The relative improvement should transfer to any valid n-gram scoring method.

We're happy to adapt the implementation to whatever evaluation rules the maintainers settle on.

