feat(candidates): add corpus mining (Wikipedia dump scan) #244

Open
send wants to merge 1 commit into main from feat/extras-corpus-mining

Conversation

send (Owner) commented May 9, 2026

Summary

Surface-first vocabulary mining from a Wikipedia jawiki dump. Adds `dictool candidates corpus <dump>` that:

  • streams the bz2 directly (no full decompress to disk)
  • filters to the article namespace (`0`)
  • skips wikitext templates (`{{...}}`) and `<ref>...</ref>` blocks
  • extracts maximal kanji runs (`[一-龥々]+`, length 2-20) — see the sketch after this list
  • frequency-counts and diffs against the build dict's surface set
  • outputs `wikipedia.tsv` (gitignored) with `surface\tfreq` rows
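A minimal sketch of the extraction step, assuming a stand-in `scan_kanji_runs` (illustrative only — the PR's actual function and signature may differ):

```rust
use std::collections::HashMap;

/// Same class as the PR's `[一-龥々]+` pattern: CJK ideographs
/// 一 (U+4E00) through 龥 (U+9FA5) plus the iteration mark 々.
fn is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FA5}').contains(&c) || c == '々'
}

/// Count maximal kanji runs of 2..=20 characters.
fn scan_kanji_runs(text: &str, counts: &mut HashMap<String, u64>) {
    let mut run = String::new();
    // Chain a non-kanji sentinel so a trailing run gets flushed.
    for c in text.chars().chain(std::iter::once('\n')) {
        if is_kanji(c) {
            run.push(c);
        } else if !run.is_empty() {
            if (2..=20).contains(&run.chars().count()) {
                *counts.entry(run.clone()).or_insert(0) += 1;
            }
            run.clear();
        }
    }
}
```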

Reading assignment is intentionally deferred — the user picks the top-N gap surfaces and looks up readings by hand before promoting to `extras/<domain>.tsv`. This mirrors the existing `mine`-then-promote workflow.

Why this approach (vs. Sudachi and Wikidata, both tried before — Sudachi in PR #243, Wikidata as a separate experiment)

Per `feedback_extras_promotion.md`:

  • Sudachi naive scan: 1.9M candidates → 2 promoted (yield ~10⁻⁶) — frozen vocab, no frequency signal
  • Wikidata SPARQL: 5000 → 5 promoted (all 神話/宗教 — mythology/religion) — P1814 only fills 固有名詞 (proper nouns)
  • Corpus mining (this PR): real-text frequency signal, catches modern vocab, surface-first

User insight (this session): forget reading + POS during extraction; just yank kanji runs. Reading lookup is the bottleneck — apply it only to the small post-diff candidate set.
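That ordering — cheap counting first, dict diff second, manual reading lookup last — is what keeps the expensive human step small. A hedged sketch of the diff-and-emit stage (names and signature are illustrative, not the PR's actual API):

```rust
use std::collections::{HashMap, HashSet};
use std::io::{self, Write};

/// Keep surfaces the build dict doesn't cover, above a frequency
/// floor, sorted by descending frequency; emit `surface\tfreq` rows.
fn write_gaps(
    counts: &HashMap<String, u64>,
    dict_surfaces: &HashSet<String>,
    min_freq: u64,
    out: &mut impl Write,
) -> io::Result<()> {
    let mut gaps: Vec<_> = counts
        .iter()
        .filter(|&(s, &f)| f >= min_freq && !dict_surfaces.contains(s))
        .collect();
    gaps.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
    for (surface, freq) in gaps {
        writeln!(out, "{surface}\t{freq}")?;
    }
    Ok(())
}
```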

Pilot result

`jawiki-articles1.bz2` (398MB compressed → 1.5GB raw, 80K articles):

  • 32s end-to-end
  • 304K gap surfaces (freq >= 5) after template/ref/ns filter
  • Top entries dominated by lattice-composable compounds (Mozc has 徳川/家康 separately and composes via Viterbi). `lextool explain` confirms 徳川家康 / 室町時代 / 令和元年 are all top-1 already.
  • Real misses surface in the mix: e.g. `宇宙戦艦` (Mozc top-1 returns 宇宙船感). Per-candidate verification via `lextool explain` is the next step before promotion.

No extras additions in this PR — this is the tool only. Future PRs hand-pick from the candidate file.

Test plan

  • `cargo fmt --all --check` / `cargo clippy --workspace --all-features -- -D warnings` / `cargo test --workspace --all-features` all green
  • 12 new unit tests for `scan_kanji_runs` (length filters, iteration mark, multi-occurrence) and `scan_prose_kanji_runs` (template skip, nested templates, `<ref>` skip, self-closing ref, cross-slice state) — a sketch of one such test follows this list
  • Smoke test on synthetic XML (3 pages, 13 surfaces, all hit dict)
  • Pilot run on jawiki-articles1.bz2: 80K articles in 32s, 304K freq>=5 gap surfaces
  • Spot-checked `徳川家康`/`室町時代`/`令和元年`/`宇宙戦艦` against `lextool explain` — confirms tool finds genuine miss (`宇宙戦艦`) along with lattice-composable noise
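For flavor, a test in the spirit of those described above, written against the hypothetical `scan_kanji_runs` sketch from the summary (the real tests exercise the PR's actual functions):

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::collections::HashMap;

    #[test]
    fn length_filter_and_iteration_mark() {
        let mut counts = HashMap::new();
        scan_kanji_runs("人々が東京タワーへ。山", &mut counts);
        assert_eq!(counts.get("人々"), Some(&1)); // 々 extends a run
        assert_eq!(counts.get("東京"), Some(&1)); // katakana breaks the run
        assert!(!counts.contains_key("山")); // length-1 run dropped
    }
}
```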

Follow-ups (not this PR)

  • Reading-assignment helper (Sudachi lookup or manual) for the top-N gap surfaces
  • Verification step that runs `lextool explain` per candidate to filter lattice-composable rows
  • Document workflow in feedback memory once a real promotion lands

🤖 Generated with Claude Code

Surface-first vocabulary mining from a Wikipedia jawiki dump. Streams
the bz2 directly, skips wikitext templates `{{...}}` and `<ref>` blocks,
filters to article namespace, extracts maximal kanji runs, and diffs
against the build dict's surface set. Outputs `wikipedia.tsv` with
`surface\tfreq` rows.

Reading-assignment is intentionally deferred — the user picks top-N gap
surfaces and looks up readings before promoting to `extras/<domain>.tsv`,
mirroring the existing `mine`-then-promote-by-hand workflow.

Pilot run on jawiki-articles1 (80K articles, ~1.5GB raw text) finishes
in ~32s and yields 304K freq>=5 gap surfaces. Most are lattice-
composable (徳川家康, 室町時代, 令和元年 — Mozc handles via segment
composition) but real misses surface in the mix (e.g. 宇宙戦艦 →
Mozc top-1 returns 宇宙船感). Per-candidate verification via
`lextool explain` is still required before promotion.

deps: bzip2 0.4 (lex-cli only — same dev-tool scope as the existing
zip dep used by `candidates mine`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
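For reference, the kind of streaming setup this dep enables, using the bzip2 crate's `read::BzDecoder` (the helper below is illustrative, not the PR's actual wiring):

```rust
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

use bzip2::read::BzDecoder;

/// Open a dump for buffered streaming, decompressing `.bz2` on the
/// fly — no full decompress to disk.
fn open_dump(path: &Path) -> std::io::Result<BufReader<Box<dyn Read>>> {
    let file = File::open(path)?;
    let reader: Box<dyn Read> = if path.extension().and_then(|e| e.to_str()) == Some("bz2") {
        Box::new(BzDecoder::new(file))
    } else {
        Box::new(file)
    };
    Ok(BufReader::new(reader))
}
```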
Copilot AI review requested due to automatic review settings May 9, 2026 18:28

Copilot AI left a comment


Pull request overview

Adds a new "surface-first" candidate mining path that scans a Wikipedia XML dump to extract frequent kanji-run surfaces, diffs them against the merged build dictionary's surface set, and writes the remaining gaps to a TSV for manual curation into `extras/`.

Changes:

  • Added `dictool candidates corpus <dump>` subcommand to scan `.xml` / `.xml.bz2` Wikipedia dumps and emit `wikipedia.tsv` (surface + frequency).
  • Implemented a streaming Wikipedia dump scanner with template (`{{...}}`) and `<ref>...</ref>` skipping plus kanji-run extraction and unit tests — a simplified sketch of the skipping pass follows this list.
  • Added a `bzip2` dependency to support streaming decompression of `.bz2` dumps.
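A simplified per-line sketch of that skipping pass (the PR's scanner carries template depth and `<ref>` state across buffer slices; this illustrative version does not):

```rust
/// Strip `{{...}}` templates (with nesting) and `<ref>...</ref>`
/// blocks from one line of wikitext, keeping only prose.
fn strip_markup(line: &str) -> String {
    let mut depth = 0usize; // {{ ... }} nesting
    let mut in_ref = false; // inside <ref>...</ref>
    let mut out = String::new();
    let mut rest = line;
    while !rest.is_empty() {
        if let Some(r) = rest.strip_prefix("{{") {
            depth += 1;
            rest = r;
        } else if depth > 0 && rest.starts_with("}}") {
            depth -= 1;
            rest = &rest[2..];
        } else if !in_ref && rest.starts_with("<ref") {
            // Self-closing `<ref ... />` never opens a block.
            match rest.find('>') {
                Some(i) => {
                    in_ref = !rest[..i].ends_with('/');
                    rest = &rest[i + 1..];
                }
                None => break, // tag continues past this line
            }
        } else if in_ref && rest.starts_with("</ref>") {
            in_ref = false;
            rest = &rest["</ref>".len()..];
        } else {
            let mut chars = rest.chars();
            let c = chars.next().unwrap();
            if depth == 0 && !in_ref {
                out.push(c);
            }
            rest = chars.as_str();
        }
    }
    out
}
```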

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| engine/crates/lex-cli/src/commands/candidates_ops.rs | Adds `corpus()` orchestration: run the Wikipedia scan, build the surface coverage set from the build dict, diff and write `wikipedia.tsv`. |
| engine/crates/lex-cli/src/candidates/wikipedia.rs | New scanner/extractor for Wikipedia dump streaming, markup skipping, kanji-run frequency counting, plus unit tests. |
| engine/crates/lex-cli/src/candidates/mod.rs | Exposes the new `wikipedia` candidates module. |
| engine/crates/lex-cli/src/bin/dictool.rs | Adds the `CandidatesAction::Corpus` CLI subcommand and wiring to `candidates_ops::corpus`. |
| engine/crates/lex-cli/Cargo.toml | Adds the `bzip2` dependency for dump decompression. |
| engine/Cargo.lock | Locks `bzip2` / `bzip2-sys` transitive additions. |


Comment on lines +206 to +208:

```rust
if !in_block && b == b'<' && s[i..].starts_with("<ref") {
    // Self-closing `<ref ... />` is one shot; full `<ref>...</ref>`
    // is multi-token. Cheaply check the next `>`.
```

Comment on lines +76 to +79:

```rust
// Reset at each <page> boundary so a non-article page doesn't
// poison the next page when no explicit <ns> is provided.
if line.contains("<page>") {
    current_page_is_article = true;
```

Comment on lines +273 to +275:

```rust
/// Drop surfaces with frequency below this (default: 3).
#[arg(long, default_value_t = 3)]
min_freq: u32,
```
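For context on the second fragment above: the per-page reset pairs with an `<ns>` check so that only namespace-0 pages survive. Hypothetically, something along these lines (not the PR's exact code):

```rust
// Flip the flag off for any page whose <ns> element is not 0.
if let Some(start) = line.find("<ns>") {
    let rest = &line[start + 4..];
    if let Some(end) = rest.find("</ns>") {
        current_page_is_article = rest[..end].trim() == "0";
    }
}
```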