Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0#1024
Open
immartian wants to merge 1 commit intoopenai:mainfrom
Open
Evidence-aware Dirichlet concentration, 35% improvement over fixed c=5.0#1024immartian wants to merge 1 commit intoopenai:mainfrom
immartian wants to merge 1 commit intoopenai:mainfrom
Conversation
One-line change to hierarchical Dirichlet CTW mixing: c_eff = c_base / (1 + β × log(ctx_count) × avg_idf(context)) Instead of fixed c=5.0 for all contexts, adapt concentration based on evidence strength (ctx_count) and context specificity (IDF): - High counts + rare context → low c → trust n-gram counts - Low counts + common context → c ≈ c_base → smooth toward backup Results (synthetic two-regime corpus, 200K tokens): Fixed CTW (c=5.0): 1.0511 bits/token Binding CTW (c=c(B)): 0.6868 bits/token (35% better) Wins on both regimes: Rare deterministic: 0.976 vs 1.519 (+0.543 bpt) Common ambiguous: 0.720 vs 1.087 (+0.366 bpt) 19 tests + reproducible proof script included.
Author
|
Note on normalization: We're aware of the ongoing discussion in #677 regarding n-gram cache normalization issues. Our binding CTW module ( The 35% improvement shown here compares fixed vs adaptive concentration under identical conditions (same cache, same normalization). The relative improvement should transfer to any valid n-gram scoring method. We're happy to adapt the implementation to whatever evaluation rules the maintainers settle on. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Evidence-Aware Dirichlet Concentration for N-gram Mixing
The problem
Hierarchical Dirichlet mixing (CTW / Willems et al. 1995) blends n-gram predictions across orders using a concentration parameter c:
Current implementations use fixed c (typically 5.0) for all contexts. But contexts vary enormously in reliability:
The fix (one line)
How it works:
The compression scheme knows its own reliability. High evidence + rare context -> trust the counts. Low evidence + common context -> smooth.
Results (synthetic benchmark)
Two-regime corpus (200K tokens, vocab=1024):
35% improvement. Wins on both regimes: trusts strong evidence, smooths weak evidence.
Integration
Drop-in replacement for any Dirichlet n-gram mixer. The only addition is an IDF table (vocab_size floats = 4KB for vocab=1024):
Zero additional memory beyond the IDF table. Compatible with any properly normalized n-gram backend.
Note on normalization
This improvement is orthogonal to the hash-based normalization issues discussed in #677. The evidence-aware concentration adapts the mixing weight, not the underlying probability computation. It works with any valid n-gram scoring method.
Files
Test plan