feat(candidates): add corpus mining (Wikipedia dump scan) #244

Open
send wants to merge 1 commit into main from feat/extras-corpus-mining

Conversation

send (Owner) commented May 9, 2026

Summary

Surface-first vocabulary mining from a Wikipedia jawiki dump. Adds `dictool candidates corpus <dump>` that:

  • streams the bz2 directly (no full decompress to disk)
  • filters to the article namespace (`0`)
  • skips wikitext templates (`{{...}}`) and `<ref>...</ref>` blocks
  • extracts maximal kanji runs (`[一-龥々]+`, length 2-20) — see the sketch after this list
  • frequency-counts and diffs against the build dict's surface set
  • outputs `wikipedia.tsv` (gitignored) with `surface\tfreq` rows
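A minimal sketch of the extraction step, assuming a stand-in `scan_kanji_runs` (illustrative only — the PR's actual function and signature may differ):

```rust
use std::collections::HashMap;

/// Same class as the PR's `[一-龥々]+` pattern: CJK ideographs
/// 一 (U+4E00) through 龥 (U+9FA5) plus the iteration mark 々.
fn is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FA5}').contains(&c) || c == '々'
}

/// Count maximal kanji runs of 2..=20 characters.
fn scan_kanji_runs(text: &str, counts: &mut HashMap<String, u64>) {
    let mut run = String::new();
    // Chain a non-kanji sentinel so a trailing run gets flushed.
    for c in text.chars().chain(std::iter::once('\n')) {
        if is_kanji(c) {
            run.push(c);
        } else if !run.is_empty() {
            if (2..=20).contains(&run.chars().count()) {
                *counts.entry(run.clone()).or_insert(0) += 1;
            }
            run.clear();
        }
    }
}
```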

Reading assignment is intentionally deferred — the user picks the top-N gap surfaces and looks up readings by hand before promoting to `extras/<domain>.tsv`. This mirrors the existing `mine`-then-promote workflow.

Why this approach (vs. Sudachi and Wikidata, both tried before — Sudachi in PR #243, Wikidata as a separate experiment)

Per `feedback_extras_promotion.md`:

  • Sudachi naive scan: 1.9M candidates → 2 promoted (yield ~10⁻⁶) — frozen vocab, no frequency signal
  • Wikidata SPARQL: 5000 → 5 promoted (all 神話/宗教 — mythology/religion) — P1814 only fills 固有名詞 (proper nouns)
  • Corpus mining (this PR): real-text frequency signal, catches modern vocab, surface-first

User insight (this session): forget reading + POS during extraction; just yank kanji runs. Reading lookup is the bottleneck — apply it only to the small post-diff candidate set.
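That ordering — cheap counting first, dict diff second, manual reading lookup last — is what keeps the expensive human step small. A hedged sketch of the diff-and-emit stage (names and signature are illustrative, not the PR's actual API):

```rust
use std::collections::{HashMap, HashSet};
use std::io::{self, Write};

/// Keep surfaces the build dict doesn't cover, above a frequency
/// floor, sorted by descending frequency; emit `surface\tfreq` rows.
fn write_gaps(
    counts: &HashMap<String, u64>,
    dict_surfaces: &HashSet<String>,
    min_freq: u64,
    out: &mut impl Write,
) -> io::Result<()> {
    let mut gaps: Vec<_> = counts
        .iter()
        .filter(|&(s, &f)| f >= min_freq && !dict_surfaces.contains(s))
        .collect();
    gaps.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
    for (surface, freq) in gaps {
        writeln!(out, "{surface}\t{freq}")?;
    }
    Ok(())
}
```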

Pilot result

`jawiki-articles1.bz2` (398MB compressed → 1.5GB raw, 80K articles):

  • 32s end-to-end
  • 304K gap surfaces (freq >= 5) after template/ref/ns filter
  • Top entries dominated by lattice-composable compounds (Mozc has 徳川/家康 separately and composes via Viterbi). `lextool explain` confirms 徳川家康 / 室町時代 / 令和元年 are all top-1 already.
  • Real misses surface in the mix: e.g. `宇宙戦艦` (Mozc top-1 returns 宇宙船感). Per-candidate verification via `lextool explain` is the next step before promotion.

No extras additions in this PR — this is the tool only. Future PRs hand-pick from the candidate file.

Test plan

  • `cargo fmt --all --check` / `cargo clippy --workspace --all-features -- -D warnings` / `cargo test --workspace --all-features` all green
  • 12 new unit tests for `scan_kanji_runs` (length filters, iteration mark, multi-occurrence) and `scan_prose_kanji_runs` (template skip, nested templates, `<ref>` skip, self-closing ref, cross-slice state) — a sketch of one such test follows this list
  • Smoke test on synthetic XML (3 pages, 13 surfaces, all hit dict)
  • Pilot run on jawiki-articles1.bz2: 80K articles in 32s, 304K freq>=5 gap surfaces
  • Spot-checked `徳川家康`/`室町時代`/`令和元年`/`宇宙戦艦` against `lextool explain` — confirms tool finds genuine miss (`宇宙戦艦`) along with lattice-composable noise
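For flavor, a test in the spirit of those described above, written against the hypothetical `scan_kanji_runs` sketch from the summary (the real tests exercise the PR's actual functions):

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::collections::HashMap;

    #[test]
    fn length_filter_and_iteration_mark() {
        let mut counts = HashMap::new();
        scan_kanji_runs("人々が東京タワーへ。山", &mut counts);
        assert_eq!(counts.get("人々"), Some(&1)); // 々 extends a run
        assert_eq!(counts.get("東京"), Some(&1)); // katakana breaks the run
        assert!(!counts.contains_key("山")); // length-1 run dropped
    }
}
```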

Follow-ups (not this PR)

  • Reading-assignment helper (Sudachi lookup or manual) for the top-N gap surfaces
  • Verification step that runs `lextool explain` per candidate to filter lattice-composable rows
  • Document workflow in feedback memory once a real promotion lands

🤖 Generated with Claude Code

Surface-first vocabulary mining from a Wikipedia jawiki dump. Streams
the bz2 directly, skips wikitext templates `{{...}}` and `<ref>` blocks,
filters to article namespace, extracts maximal kanji runs, and diffs
against the build dict's surface set. Outputs `wikipedia.tsv` with
`surface\tfreq` rows.

Reading-assignment is intentionally deferred — the user picks top-N gap
surfaces and looks up readings before promoting to `extras/<domain>.tsv`,
mirroring the existing `mine`-then-promote-by-hand workflow.

Pilot run on jawiki-articles1 (80K articles, ~1.5GB raw text) finishes
in ~32s and yields 304K freq>=5 gap surfaces. Most are lattice-
composable (徳川家康, 室町時代, 令和元年 — Mozc handles via segment
composition) but real misses surface in the mix (e.g. 宇宙戦艦 →
Mozc top-1 returns 宇宙船感). Per-candidate verification via
`lextool explain` is still required before promotion.

deps: bzip2 0.4 (lex-cli only — same dev-tool scope as the existing
zip dep used by `candidates mine`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
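For reference, the kind of streaming setup this dep enables, using the bzip2 crate's `read::BzDecoder` (the helper below is illustrative, not the PR's actual wiring):

```rust
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

use bzip2::read::BzDecoder;

/// Open a dump for buffered streaming, decompressing `.bz2` on the
/// fly — no full decompress to disk.
fn open_dump(path: &Path) -> std::io::Result<BufReader<Box<dyn Read>>> {
    let file = File::open(path)?;
    let reader: Box<dyn Read> = if path.extension().and_then(|e| e.to_str()) == Some("bz2") {
        Box::new(BzDecoder::new(file))
    } else {
        Box::new(file)
    };
    Ok(BufReader::new(reader))
}
```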
Copilot AI review requested due to automatic review settings May 9, 2026 18:28

Copilot AI left a comment


Pull request overview

Adds a new "surface-first" candidate mining path that scans a Wikipedia XML dump to extract frequent kanji-run surfaces, diffs them against the merged build dictionary's surface set, and writes the remaining gaps to a TSV for manual curation into `extras/`.

Changes:

  • Added `dictool candidates corpus <dump>` subcommand to scan `.xml` / `.xml.bz2` Wikipedia dumps and emit `wikipedia.tsv` (surface + frequency).
  • Implemented a streaming Wikipedia dump scanner with template (`{{...}}`) and `<ref>...</ref>` skipping plus kanji-run extraction and unit tests — a simplified sketch of the skipping pass follows this list.
  • Added a `bzip2` dependency to support streaming decompression of `.bz2` dumps.
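A simplified per-line sketch of that skipping pass (the PR's scanner carries template depth and `<ref>` state across buffer slices; this illustrative version does not):

```rust
/// Strip `{{...}}` templates (with nesting) and `<ref>...</ref>`
/// blocks from one line of wikitext, keeping only prose.
fn strip_markup(line: &str) -> String {
    let mut depth = 0usize; // {{ ... }} nesting
    let mut in_ref = false; // inside <ref>...</ref>
    let mut out = String::new();
    let mut rest = line;
    while !rest.is_empty() {
        if let Some(r) = rest.strip_prefix("{{") {
            depth += 1;
            rest = r;
        } else if depth > 0 && rest.starts_with("}}") {
            depth -= 1;
            rest = &rest[2..];
        } else if !in_ref && rest.starts_with("<ref") {
            // Self-closing `<ref ... />` never opens a block.
            match rest.find('>') {
                Some(i) => {
                    in_ref = !rest[..i].ends_with('/');
                    rest = &rest[i + 1..];
                }
                None => break, // tag continues past this line
            }
        } else if in_ref && rest.starts_with("</ref>") {
            in_ref = false;
            rest = &rest["</ref>".len()..];
        } else {
            let mut chars = rest.chars();
            let c = chars.next().unwrap();
            if depth == 0 && !in_ref {
                out.push(c);
            }
            rest = chars.as_str();
        }
    }
    out
}
```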

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| engine/crates/lex-cli/src/commands/candidates_ops.rs | Adds `corpus()` orchestration: run the Wikipedia scan, build the surface coverage set from the build dict, diff and write `wikipedia.tsv`. |
| engine/crates/lex-cli/src/candidates/wikipedia.rs | New scanner/extractor for Wikipedia dump streaming, markup skipping, kanji-run frequency counting, plus unit tests. |
| engine/crates/lex-cli/src/candidates/mod.rs | Exposes the new `wikipedia` candidates module. |
| engine/crates/lex-cli/src/bin/dictool.rs | Adds the `CandidatesAction::Corpus` CLI subcommand and wiring to `candidates_ops::corpus`. |
| engine/crates/lex-cli/Cargo.toml | Adds the `bzip2` dependency for dump decompression. |
| engine/Cargo.lock | Locks `bzip2` / `bzip2-sys` transitive additions. |


Comment on lines +206 to +208:

```rust
if !in_block && b == b'<' && s[i..].starts_with("<ref") {
    // Self-closing `<ref ... />` is one shot; full `<ref>...</ref>`
    // is multi-token. Cheaply check the next `>`.
```

Comment on lines +76 to +79:

```rust
// Reset at each <page> boundary so a non-article page doesn't
// poison the next page when no explicit <ns> is provided.
if line.contains("<page>") {
    current_page_is_article = true;
```

Comment on lines +273 to +275:

```rust
/// Drop surfaces with frequency below this (default: 3).
#[arg(long, default_value_t = 3)]
min_freq: u32,
```
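For context on the second fragment above: the per-page reset pairs with an `<ns>` check so that only namespace-0 pages survive. Hypothetically, something along these lines (not the PR's exact code):

```rust
// Flip the flag off for any page whose <ns> element is not 0.
if let Some(start) = line.find("<ns>") {
    let rest = &line[start + 4..];
    if let Some(end) = rest.find("</ns>") {
        current_page_is_article = rest[..end].trim() == "0";
    }
}
```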