feat(candidates): add corpus mining (Wikipedia dump scan) #244
Open
Surface-first vocabulary mining from a Wikipedia jawiki dump. Streams
the bz2 directly, skips wikitext templates `{{...}}` and `<ref>` blocks,
filters to article namespace, extracts maximal kanji runs, and diffs
against the build dict's surface set. Outputs `wikipedia.tsv` with
`surface\tfreq` rows.
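For orientation, here is a minimal sketch of that pipeline, assuming a line-oriented scan over the decompressed stream; the function names (`is_kanji`, `count_kanji_runs`, `scan_dump`) and the exact kanji character class are illustrative assumptions, not the PR's actual code:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

use bzip2::read::BzDecoder; // the dep this PR adds (bzip2 0.4)

/// Rough "kanji" check: CJK Unified Ideographs plus the iteration mark.
/// The PR's exact character classes may differ.
fn is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c) || c == '々'
}

/// Count maximal kanji runs in one line of text.
fn count_kanji_runs(text: &str, freqs: &mut HashMap<String, u32>) {
    let mut run = String::new();
    for c in text.chars() {
        if is_kanji(c) {
            run.push(c);
        } else if !run.is_empty() {
            *freqs.entry(std::mem::take(&mut run)).or_insert(0) += 1;
        }
    }
    if !run.is_empty() {
        *freqs.entry(run).or_insert(0) += 1;
    }
}

/// Stream the .bz2 dump without materializing the decompressed XML on disk.
fn scan_dump(path: &str) -> std::io::Result<HashMap<String, u32>> {
    let reader = BufReader::new(BzDecoder::new(File::open(path)?));
    let mut freqs = HashMap::new();
    for line in reader.lines() {
        // The real scanner also tracks <page>/<ns> state and skips
        // {{...}} templates and <ref>...</ref> spans before counting.
        count_kanji_runs(&line?, &mut freqs);
    }
    Ok(freqs)
}
```

Skipping readings and POS at this stage is what keeps the scan cheap: no tokenizer, just a per-line character-class pass.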
Reading-assignment is intentionally deferred — the user picks top-N gap
surfaces and looks up readings before promoting to `extras/<domain>.tsv`,
mirroring the existing `mine`-then-promote-by-hand workflow.
Pilot run on jawiki-articles1 (80K articles, ~1.5GB raw text) finishes
in ~32s and yields 304K freq>=5 gap surfaces. Most are lattice-
composable (徳川家康, 室町時代, 令和元年 — Mozc handles via segment
composition) but real misses surface in the mix (e.g. 宇宙戦艦 →
Mozc top-1 returns 宇宙船感). Per-candidate verification via
`lextool explain` is still required before promotion.
deps: bzip2 0.4 (lex-cli only — same dev-tool scope as the existing
zip dep used by `candidates mine`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new “surface-first” candidate mining path that scans a Wikipedia XML dump to extract frequent kanji-run surfaces, diffs them against the merged build dictionary’s surface set, and writes the remaining gaps to a TSV for manual curation into extras/.
Changes:
- Added `dictool candidates corpus <dump>` subcommand to scan `.xml` / `.xml.bz2` Wikipedia dumps and emit `wikipedia.tsv` (surface + frequency).
- Implemented a streaming Wikipedia dump scanner with template (`{{...}}`) and `<ref>...</ref>` skipping plus kanji-run extraction, with unit tests (see the sketch after this list).
- Added the `bzip2` dependency to support streaming decompression of `.bz2` dumps.
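A rough sketch of the markup skipping named in the scanner item above, assuming per-line state carried across a page body; `SkipState`, `visible_text`, and the single-pass string matching are illustrative assumptions rather than the PR's implementation:

```rust
/// Skip state carried across lines while scanning one page body.
#[derive(Default)]
struct SkipState {
    template_depth: usize, // nesting depth of {{ ... }}
    in_ref: bool,          // inside <ref> ... </ref>
}

/// Return only the text that lies outside templates and <ref> blocks.
fn visible_text(line: &str, st: &mut SkipState) -> String {
    let mut out = String::new();
    let mut i = 0;
    while i < line.len() {
        let rest = &line[i..];
        if st.in_ref {
            if rest.starts_with("</ref>") {
                st.in_ref = false;
                i += "</ref>".len();
            } else {
                i += rest.chars().next().unwrap().len_utf8();
            }
        } else if rest.starts_with("{{") {
            st.template_depth += 1;
            i += 2;
        } else if st.template_depth > 0 {
            if rest.starts_with("}}") {
                st.template_depth -= 1;
                i += 2;
            } else {
                i += rest.chars().next().unwrap().len_utf8();
            }
        } else if rest.starts_with("<ref") {
            // NOTE: the real code distinguishes self-closing <ref .../>;
            // this sketch treats every <ref as opening a block.
            st.in_ref = true;
            i += 4;
        } else {
            let c = rest.chars().next().unwrap();
            out.push(c);
            i += c.len_utf8();
        }
    }
    out
}
```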
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| engine/crates/lex-cli/src/commands/candidates_ops.rs | Adds corpus() orchestration: run Wikipedia scan, build surface coverage set from build dict, diff + write wikipedia.tsv. |
| engine/crates/lex-cli/src/candidates/wikipedia.rs | New scanner/extractor for Wikipedia dump streaming, markup skipping, kanji-run frequency counting, and unit tests. |
| engine/crates/lex-cli/src/candidates/mod.rs | Exposes the new wikipedia candidates module. |
| engine/crates/lex-cli/src/bin/dictool.rs | Adds the CandidatesAction::Corpus CLI subcommand and wiring to candidates_ops::corpus. |
| engine/crates/lex-cli/Cargo.toml | Adds bzip2 dependency for dump decompression. |
| engine/Cargo.lock | Locks bzip2 / bzip2-sys transitive additions. |
Comment on lines +206 to +208:

```rust
if !in_block && b == b'<' && s[i..].starts_with("<ref") {
    // Self-closing `<ref ... />` is one shot; full `<ref>...</ref>`
    // is multi-token. Cheaply check the next `>`.
```
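One plausible shape for that "cheaply check the next `>`" step, offered only as a sketch; the helper name `ref_is_self_closing` and the missing-`>` fallback are assumptions:

```rust
/// Treat a `<ref ...>` tag as self-closing when the byte just before the
/// first `>` is `/` (hypothetical helper, not the PR's actual code).
fn ref_is_self_closing(rest: &str) -> bool {
    match rest.find('>') {
        Some(pos) => rest[..pos].ends_with('/'),
        None => false, // tag spills onto a later line; treat as an open block
    }
}
```

With that, `<ref name="a" />` stays a one-shot skip while `<ref name="a">...` flips the multi-token skip state.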
Comment on lines +76 to +79:

```rust
// Reset at each <page> boundary so a non-article page doesn't
// poison the next page when no explicit <ns> is provided.
if line.contains("<page>") {
    current_page_is_article = true;
```
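The excerpt shows only the `<page>` reset; a matching `<ns>` check might look like the sketch below (an assumption based on the comment, treating namespace 0 as the article namespace):

```rust
/// Assumed companion to the <page> reset: when an explicit <ns> tag is
/// present on the line, keep only namespace 0 (articles).
fn page_is_article(line: &str, current: bool) -> bool {
    if let Some(start) = line.find("<ns>") {
        if let Some(end) = line[start + 4..].find("</ns>") {
            let ns = &line[start + 4..start + 4 + end];
            return ns.trim() == "0";
        }
    }
    current
}
```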
Comment on lines +273 to +275:

```rust
/// Drop surfaces with frequency below this (default: 3).
#[arg(long, default_value_t = 3)]
min_freq: u32,
```
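A sketch of how `--min-freq` and the build-dict surface diff could combine when emitting `wikipedia.tsv` rows; the function name, sort order, and types are illustrative assumptions, not the PR's `candidates_ops::corpus` code:

```rust
use std::collections::{HashMap, HashSet};

/// Keep only surfaces at or above min_freq that the build dict does not
/// already cover, and render them as `surface\tfreq` rows.
fn gap_rows(
    freqs: &HashMap<String, u32>,
    known_surfaces: &HashSet<String>,
    min_freq: u32,
) -> String {
    let mut rows: Vec<(&str, u32)> = freqs
        .iter()
        .filter(|(surface, freq)| **freq >= min_freq && !known_surfaces.contains(*surface))
        .map(|(surface, freq)| (surface.as_str(), *freq))
        .collect();
    // Highest-frequency gaps first, so picking top-N is just the head of the file.
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows.iter()
        .map(|(surface, freq)| format!("{surface}\t{freq}\n"))
        .collect()
}
```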
Summary
Surface-first vocabulary mining from a Wikipedia jawiki dump. Adds `dictool candidates corpus <dump>` that:
- streams the `.bz2` dump directly, skipping wikitext templates (`{{...}}`) and `<ref>` blocks and filtering to the article namespace,
- extracts maximal kanji runs and diffs them against the build dict's surface set,
- writes `wikipedia.tsv` with `surface\tfreq` rows.
Reading assignment is intentionally deferred: the user picks top-N gap surfaces and looks up readings by hand before promoting to `extras/<domain>.tsv`. This mirrors the existing `mine`-then-promote workflow.
Why this approach (vs. Sudachi and Wikidata, both tried before: Sudachi in PR #243, Wikidata in a separate experiment)
Per `feedback_extras_promotion.md`:
User insight (this session): forget reading + POS during extraction; just yank kanji runs. Reading lookup is the bottleneck — apply it only to the small post-diff candidate set.
Pilot result
`jawiki-articles1.bz2` (398MB compressed → 1.5GB raw, 80K articles):
- scan finishes in ~32s
- 304K gap surfaces with freq >= 5
- most are lattice-composable (徳川家康, 室町時代, 令和元年; Mozc handles these via segment composition), but real misses show up too (e.g. 宇宙戦艦, where Mozc's top-1 is 宇宙船感)
No extras additions in this PR — this is the tool only. Future PRs hand-pick from the candidate file.
Test plan
Follow-ups (not this PR)
🤖 Generated with Claude Code