feat: BM25/IEF-style lexical weighting for element discrimination #14

Merged
luigi-agosti merged 2 commits into main from feat/issue-3-bm25-idf-scoring
Mar 25, 2026

@Djain912 Djain912 commented Mar 24, 2026

Summary

  • add `ElementFrequency` with `Build()` and `IEF()` for per-snapshot token discrimination
  • compute frequency stats once per lexical `Find()` call and reuse for all element scoring
  • add `LexicalScoreWithFrequency(query, desc, ef)` while keeping `LexicalScore` behavior unchanged for nil weighting
  • apply IEF weighting to weighted Jaccard intersection/union to reduce impact of common tokens like 'button' and 'link'
  • add tests for rare-vs-common IEF behavior, nil backward compatibility, and checkout-button discrimination in a many-button snapshot
  • add a 200-element lexical benchmark case
  • fix E2E test fixture: 'table of contents' now correctly ranks e3 (link: contents) over e9 (heading: contents) because IEF downweights the common 'link' role token
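
For reviewers who want a feel for the per-snapshot statistics described above, here is a minimal sketch of what `Build()` and `IEF()` could look like. This is not the code in this PR: the field names, the tokenizer, and the smoothed BM25-style formula `ln(1 + (N - n + 0.5) / (n + 0.5))` are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// ElementFrequency records, per token, how many element descriptions in a
// snapshot contain it. Field names are hypothetical, not taken from the PR.
type ElementFrequency struct {
	total int            // number of descriptions in the snapshot
	df    map[string]int // token -> number of descriptions containing it
}

// tokenize lowercases and splits on anything that is not a letter or digit.
// The real tokenizer may differ; this is an assumption.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Build scans every description once, counting each token at most once
// per description.
func Build(descs []string) *ElementFrequency {
	ef := &ElementFrequency{total: len(descs), df: make(map[string]int)}
	for _, d := range descs {
		seen := make(map[string]bool)
		for _, tok := range tokenize(d) {
			if !seen[tok] {
				seen[tok] = true
				ef.df[tok]++
			}
		}
	}
	return ef
}

// IEF returns a smoothed inverse element frequency (assumed formula).
// A nil receiver yields 1.0, so nil weighting degrades to unweighted scoring.
func (ef *ElementFrequency) IEF(tok string) float64 {
	if ef == nil || ef.total == 0 {
		return 1.0
	}
	n := float64(ef.df[tok])
	total := float64(ef.total)
	return math.Log(1 + (total-n+0.5)/(n+0.5))
}

func main() {
	ef := Build([]string{
		"button: submit", "button: cancel", "button: checkout", "link: home",
	})
	fmt.Printf("button=%.3f checkout=%.3f\n", ef.IEF("button"), ef.IEF("checkout"))
}
```

In this sketch, 'button' appears in three of four descriptions and so receives a lower weight than 'checkout', which appears in one. Building the table once per `Find()` call keeps the per-element scoring cost to a map lookup per token.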

Issue

Closes #3

Validation

  • `go test -c ./internal/engine` (compile-only validation)
  • E2E Docker tests: all 123 tests passing (fixture updated for IEF ranking change)
  • Full benchmark execution is blocked in this environment by Windows Application Control policy for generated test executables

@luigi-agosti
Contributor

@Djain912 fix the conflict and feel free to merge this pr

Djain912 added 2 commits March 25, 2026 09:08
IEF (inverse element frequency) weighting reduces impact of common tokens
like 'link' (very common role). For 'table of contents' query:
- e3 (link: contents) has common 'link' token → lower union weight
- e9 (heading: contents) has rarer 'heading' token → higher union weight

Result: e3 now ranks higher due to smaller denominator in IEF-weighted Jaccard.
This is correct IEF behavior - update test expectation from e9 to e3.
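
The denominator argument above can be checked with a small sketch. It assumes a simplified set-based weighted Jaccard (sum of shared token weights over sum of all token weights) and made-up IEF values; neither the function name `weightedJaccard` nor the numbers in `hypotheticalIEF` come from this PR.

```go
package main

import "fmt"

// hypotheticalIEF holds made-up weights for illustration only: 'link' is
// assumed to be common (low weight) and 'heading' rarer (higher weight).
var hypotheticalIEF = map[string]float64{
	"table": 2.0, "of": 0.5, "contents": 1.0, "link": 0.3, "heading": 1.5,
}

// ief looks up the hypothetical weight for a token.
func ief(tok string) float64 { return hypotheticalIEF[tok] }

// toSet converts a token slice into a set.
func toSet(tokens []string) map[string]bool {
	s := make(map[string]bool, len(tokens))
	for _, t := range tokens {
		s[t] = true
	}
	return s
}

// weightedJaccard returns (weight of shared tokens) / (weight of all tokens).
// A nil weight function treats every token as weight 1, i.e. plain Jaccard.
func weightedJaccard(query, desc []string, weight func(string) float64) float64 {
	if weight == nil {
		weight = func(string) float64 { return 1.0 }
	}
	qs, ds := toSet(query), toSet(desc)
	var inter, union float64
	for t := range qs {
		w := weight(t)
		union += w
		if ds[t] {
			inter += w
		}
	}
	for t := range ds {
		if !qs[t] {
			union += weight(t)
		}
	}
	if union == 0 {
		return 0
	}
	return inter / union
}

func main() {
	query := []string{"table", "of", "contents"}
	e3 := weightedJaccard(query, []string{"link", "contents"}, ief)
	e9 := weightedJaccard(query, []string{"heading", "contents"}, ief)
	fmt.Printf("e3=%.3f e9=%.3f\n", e3, e9)
}
```

With these hypothetical weights both elements share only 'contents' (weight 1.0), but the union weight is 3.8 for e3 and 5.0 for e9, so e3 scores higher. With a nil weight function both unions have four tokens and the two scores are equal.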
@luigiagent force-pushed the feat/issue-3-bm25-idf-scoring branch from fa67281 to da0df83 on March 25, 2026 09:08
@luigi-agosti luigi-agosti merged commit 9e09a12 into main Mar 25, 2026
7 checks passed

Development

Successfully merging this pull request may close these issues.

BM25/IDF-weighted scoring for element discrimination

2 participants