
Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0 #1024

Open
immartian wants to merge 1 commit into openai:main from immartian:binding-ctw-improvement

Conversation


@immartian immartian commented Mar 28, 2026

Evidence-Aware Dirichlet Concentration for N-gram Mixing

The problem

Hierarchical Dirichlet mixing (CTW / Willems et al. 1995) blends n-gram predictions across orders using a concentration parameter c:

p = (c * p_backup + count) / (c + ctx_count)

Current implementations use fixed c (typically 5.0) for all contexts. But contexts vary enormously in reliability:

Context A:  "the" -> next token?     (ctx_count=50,000, but 800+ distinct continuations)
Context B:  "mitochondri" -> ?       (ctx_count=47, but almost always "a" or "al")

Fixed c=5.0 treats both the same. That's wrong.
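For reference, a minimal sketch of the fixed-concentration mixing step described above (variable names like `count_vec` and `backup_p` are illustrative, not this repo's API):

```python
import numpy as np

def dirichlet_mix_fixed(count_vec, ctx_count, backup_p, c=5.0):
    """One order of hierarchical Dirichlet mixing with a fixed concentration.

    count_vec : per-token continuation counts for this context (length V)
    ctx_count : total observations of this context (count_vec.sum())
    backup_p  : normalized prediction from the next-shorter context (length V)
    """
    # Fixed c: the backup gets the same pull regardless of how reliable
    # this context's counts are.
    return (c * backup_p + count_vec) / (c + ctx_count)
```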

The fix (one line)

# Before: fixed concentration
c = 5.0

# After: evidence-aware concentration
c = c_base / (1 + beta * log(ctx_count) * avg_idf(context_tokens))

How it works:

              ctx_count    context specificity    c_eff     behavior
              ---------    -------------------    -----     --------
"the" ->         50,000    low  (IDF ~ 0.1)       3.8      smooth toward backup
"mitochondri"->      47    high (IDF ~ 0.9)       0.6      trust the counts
random noise ->       2    any                     4.9      fall back to backup

The compression scheme knows its own reliability. High evidence + rare context -> trust the counts. Low evidence + common context -> smooth.
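A minimal sketch of the adaptive concentration, assuming a precomputed per-token `idf` array scaled to roughly [0, 1] (the `log1p` follows the integration snippet below; the `c_base` and `beta` defaults here are illustrative):

```python
import numpy as np

def evidence_aware_c(ctx_count, context_tokens, idf, c_base=5.0, beta=1.0):
    """Shrink the concentration when a context is both well observed
    (large ctx_count) and specific (high average IDF of its tokens)."""
    spec = float(np.mean(idf[np.asarray(context_tokens)]))   # context specificity
    return c_base / (1.0 + beta * np.log1p(ctx_count) * spec)

# High evidence + rare context   -> small c_eff (trust the counts)
# Low evidence or common context -> c_eff near c_base (smooth toward backup)
```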

Results (synthetic benchmark)

Two-regime corpus (200K tokens, vocab=1024):

  • Regime A: rare deterministic patterns ([950,951,952] -> 953 always)
  • Regime B: common ambiguous patterns ([5] -> uniform random)
| Method | All (bpt) | Rare ctx | Common ctx |
| --- | --- | --- | --- |
| Uniform baseline | 10.000 | 10.000 | 10.000 |
| Fixed CTW (c=5.0) | 1.051 | 1.519 | 1.087 |
| Evidence-aware | 0.687 | 0.976 | 0.720 |

35% improvement. Wins on both regimes: trusts strong evidence, smooths weak evidence.
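For readers who want a picture of the setup without running test/proof_binding_beats_fixed.py, here is a rough sketch of what such a two-regime stream could look like (the 5% rare-pattern rate is an assumption; the actual generator may differ):

```python
import numpy as np

def make_two_regime_corpus(n_tokens=200_000, vocab=1024, rare_rate=0.05, seed=0):
    """Interleave a rare deterministic pattern ([950, 951, 952] -> 953 always)
    with a common ambiguous one ([5] -> uniformly random next token)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n_tokens:
        if rng.random() < rare_rate:                 # Regime A: rare, deterministic
            out.extend([950, 951, 952, 953])
        else:                                        # Regime B: common, ambiguous
            out.extend([5, int(rng.integers(vocab))])
    return np.array(out[:n_tokens], dtype=np.int64)
```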

Integration

Drop-in replacement for any Dirichlet n-gram mixer. The only addition is an IDF table (vocab_size floats = 4KB for vocab=1024):

# In the inner loop of hierarchical Dirichlet mixing
# (cc = context count, fc = continuation count for this token,
#  prev_p = backup probability from the shorter context):
spec = avg_idf(context_tokens[position])             # IDF lookup, O(1)
c_eff = c_base / (1 + beta * np.log1p(cc) * spec)    # scalar math
blended[idx] = (c_eff * prev_p + fc) / (c_eff + cc)  # same formula, adaptive c

Zero additional memory beyond the IDF table. Compatible with any properly normalized n-gram backend.
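Building the IDF table is a one-time pass over the training stream. A minimal sketch, assuming fixed-length chunks stand in for documents (binding_ctw.py may compute it differently):

```python
import numpy as np

def build_idf(token_stream, vocab_size, doc_len=1024):
    """Return vocab_size floats (~4 KB at vocab=1024 in float32):
    a smoothed IDF per token id, rescaled to [0, 1]."""
    df = np.zeros(vocab_size)
    n_docs = 0
    for i in range(0, len(token_stream), doc_len):
        df[np.unique(token_stream[i:i + doc_len])] += 1.0   # document frequency
        n_docs += 1
    idf = np.log((1.0 + n_docs) / (1.0 + df))               # smoothed IDF
    return (idf / max(idf.max(), 1e-12)).astype(np.float32)
```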

Note on normalization

This improvement is orthogonal to the hash-based normalization issues discussed in #677. The evidence-aware concentration adapts the mixing weight, not the underlying probability computation. It works with any valid n-gram scoring method.

Files

| File | Description |
| --- | --- |
| binding_ctw.py | Evidence-aware CTW module with full and single-pass scoring |
| test/test_binding_ctw.py | 19 tests |
| test/proof_binding_beats_fixed.py | Reproducible benchmark |

Test plan

  • 19 unit tests passing
  • Reproducible proof on synthetic data
  • Test on FineWeb validation set
  • Tune beta and c_base on real data

One-line change to hierarchical Dirichlet CTW mixing:
  c_eff = c_base / (1 + β × log(ctx_count) × avg_idf(context))

Instead of fixed c=5.0 for all contexts, adapt concentration based on
evidence strength (ctx_count) and context specificity (IDF):
  - High counts + rare context → low c → trust n-gram counts
  - Low counts + common context → c ≈ c_base → smooth toward backup

Results (synthetic two-regime corpus, 200K tokens):
  Fixed CTW (c=5.0):    1.0511 bits/token
  Binding CTW (c=c(B)): 0.6868 bits/token  (35% better)

Wins on both regimes:
  Rare deterministic:  0.976 vs 1.519 (+0.543 bpt)
  Common ambiguous:    0.720 vs 1.087 (+0.366 bpt)

19 tests + reproducible proof script included.
@immartian immartian changed the title Evidence-aware Dirichlet concentration — 35% improvement over fixed c=5.0 Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0 Mar 28, 2026
@immartian immartian (Author) commented

Note on normalization: We're aware of the ongoing discussion in #677 regarding n-gram cache normalization issues. Our binding CTW module (binding_ctw.py) is agnostic to the normalization backend — the evidence-aware concentration formula works with any properly normalized probability computation, not just hash-based caches.

The 35% improvement shown here compares fixed vs adaptive concentration under identical conditions (same cache, same normalization). The relative improvement should transfer to any valid n-gram scoring method.

We're happy to adapt the implementation to whatever evaluation rules the maintainers settle on.

