feat(cli): dictool candidates mine — extras dict candidate pool from SudachiDict (#242)
Merged
Adds a `dictool candidates mine` subcommand that downloads SudachiDict-full, diffs against the merged build dict, classifies each `(reading, surface)` pair into one of three coarse buckets by Sudachi POS, and writes per-bucket TSVs to `engine/data/extras-candidates/` (gitignored). This addresses the proactive side of "I typed X and there was no candidate": mining surfaces a pool of words Mozc lacks before the user encounters them, without ever auto-merging into the build dict (which is what got the Sudachi bulk merge removed in PR #156).

## Why a separate module from `dict_source`

`dict_source` modules feed entries directly into `lexime.dict`. PR #156 (2026-02-20) deleted the Mozc+SudachiDict merge after it produced too much top-1 noise — that lesson holds. Putting Sudachi into a new `candidates/` module makes the contract explicit: mined output never enters the build dict; the user reviews TSVs in `extras-candidates/` and hand-copies useful rows into `dict_source/extras/<domain>.tsv`.

## Buckets

Sudachi POS doesn't carry semantic categories, so domain-tagged splits (food / it / medical) aren't possible from the source alone. Coarse syntactic split:

- `sudachi-place.tsv` — `名詞,固有名詞,地名,*` (place names; ~537K seed)
- `sudachi-common.tsv` — `名詞,普通名詞,一般,*` (general nouns; ~149K seed)
- `sudachi-other.tsv` — verbs/adjectives/suffixes/etc. (~1.2M, low signal)

Each file is sorted by Sudachi cost ascending so the highest-priority candidates surface first when scanning. The user filters via grep:

```
grep -E "椒|醤|油" engine/data/extras-candidates/sudachi-common.tsv | head
```

## Confirmed limits

Mining can't help when:

- The reading is borrowed cross-language (Sudachi has 老抽 with reading `ろうちゅう`, but Japanese cooks type `らおちゅう` (from the Mandarin pinyin) — Sudachi's reading isn't useful).
- The word is too niche to be in Sudachi at all (一保堂, 雲母坂, 冪等 — all absent).

Curated TSVs remain the only path for these, so mining augments the curated layer rather than replacing it.
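The three-way bucket split above maps naturally onto a small enum plus a slice match on the Sudachi POS hierarchy. A minimal sketch (the `Bucket` and `classify` names come from the layout section; the exact signatures in the PR may differ):

```rust
// Sketch of the POS bucketing. Sudachi POS is a 6-level hierarchy,
// e.g. ["名詞", "固有名詞", "地名", "一般", "*", "*"].
#[derive(Debug, PartialEq, Eq)]
enum Bucket {
    Place,  // -> sudachi-place.tsv
    Common, // -> sudachi-common.tsv
    Other,  // -> sudachi-other.tsv
}

fn classify(pos: &[&str]) -> Bucket {
    match pos {
        ["名詞", "固有名詞", "地名", ..] => Bucket::Place,
        ["名詞", "普通名詞", "一般", ..] => Bucket::Common,
        _ => Bucket::Other,
    }
}

fn main() {
    assert_eq!(classify(&["名詞", "固有名詞", "地名", "一般", "*", "*"]), Bucket::Place);
    assert_eq!(classify(&["名詞", "普通名詞", "一般", "*", "*", "*"]), Bucket::Common);
    assert_eq!(classify(&["動詞", "一般", "*", "*", "*", "*"]), Bucket::Other);
}
```

The trailing `..` keeps the match stable across Sudachi's fourth-level POS variants.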
## Layout

```
engine/crates/lex-cli/src/
├── candidates/
│   ├── mod.rs            # Bucket enum, classify(), write_candidates()
│   └── sudachi.rs        # fetch + parse 18-col CSV (resurrected from PR #156)
└── commands/
    └── candidates_ops.rs # mine() orchestration, default paths
```

`fetch + parse` is the same code path as the deleted PR #156 sudachi.rs (commit 0a45265^), confined to candidate mining. The `zip` crate dependency is reintroduced as a hard dep on lex-cli only — the IME runtime is unaffected.
Pull request overview
Adds a new `dictool candidates mine` workflow to proactively mine potential `extras/` dictionary entries from SudachiDict-full without auto-merging them into the build dictionary, outputting gitignored per-bucket TSVs for manual review/promotion.
Changes:
- Introduces the `lex_cli::candidates` module (Sudachi fetch + parse, POS-based bucketing, TSV output).
- Adds the `dictool candidates mine` CLI subcommand and orchestration helpers (default paths, optional `--clean`).
- Restores `zip` as a `lex-cli` dependency to handle Sudachi ZIP releases and updates docs around extras entry workflows.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| engine/crates/lex-cli/src/lib.rs | Exposes new candidates module. |
| engine/crates/lex-cli/src/dict_source/extras.rs | Documents “friction” vs “proactive mining” paths for adding extras entries. |
| engine/crates/lex-cli/src/commands/mod.rs | Registers new candidates_ops command module. |
| engine/crates/lex-cli/src/commands/candidates_ops.rs | Implements mine() orchestration + default/cache/output path helpers. |
| engine/crates/lex-cli/src/candidates/mod.rs | Defines candidate data model, bucketing, and TSV writer. |
| engine/crates/lex-cli/src/candidates/sudachi.rs | Implements SudachiDict version discovery, download/extract, and CSV parsing. |
| engine/crates/lex-cli/src/bin/dictool.rs | Adds dictool candidates mine CLI wiring and flags. |
| engine/crates/lex-cli/Cargo.toml | Adds zip dependency for Sudachi ZIP extraction. |
| engine/Cargo.lock | Locks new zip dependency graph for lex-cli. |
- (IMP) sudachi.rs:60 — version-aware cache: a stamp/latest mismatch now wipes stale CSVs before re-downloading, instead of silently reusing old data while writing the new version into the stamp/output headers. Adds regression tests `read_stamp` / `stale_cache_csvs_wiped_on_version_mismatch`.
- (MIN perf) sudachi.rs:103 — stream `*_lex.csv` via `BufReader::lines` rather than slurping the whole (>100 MB) file into a String.
- (MIN perf) sudachi.rs:189 — stage downloaded ZIPs to `.tmp.zip` on disk and stream-extract from that file instead of buffering ZIP bytes + extracted CSVs simultaneously in memory.
- (MIN perf) mod.rs:126 — use `sort_by` with a borrowed string compare to avoid cloning surface/reading at sort-key construction (millions of allocs at full-Sudachi scale).
- (MIN doc) candidates_ops.rs:78 — the comment claimed the cache lives outside `engine/data/`; aligned it with the actual `engine/data/.sudachi-cache` path and explained the leading-dot convention.

Memory peak observation: ZIP-on-disk + line-streaming roughly halves peak RSS during `dictool candidates mine` on Sudachi-full.
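The `BufReader::lines` change can be sketched like this; the generic reader, the `parse_lex_csv` helper name, and the returned tuple shape are illustrative, not the PR's exact code. Column positions follow the Sudachi 18-column lex CSV (0 = surface, 3 = cost, 5..=10 = POS, 11 = reading):

```rust
use std::io::{self, BufRead};

// Streaming parse sketch: one String allocation per line instead of
// slurping the >100 MB file into memory at once.
fn parse_lex_csv<R: BufRead>(reader: R) -> io::Result<Vec<(String, String, i32)>> {
    let mut rows = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let cols: Vec<&str> = line.split(',').collect();
        if cols.len() < 12 {
            continue; // malformed / short row
        }
        if let Ok(cost) = cols[3].parse::<i32>() {
            // (reading, surface, cost)
            rows.push((cols[11].to_string(), cols[0].to_string(), cost));
        }
    }
    Ok(rows)
}

fn main() -> io::Result<()> {
    let csv = "東京都,1,1,5000,東京都,名詞,固有名詞,地名,一般,*,*,トウキョウト\n";
    let rows = parse_lex_csv(io::Cursor::new(csv))?;
    assert_eq!(
        rows,
        vec![("トウキョウト".to_string(), "東京都".to_string(), 5000)]
    );
    Ok(())
}
```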
`zip` (added in PR #242 for SudachiDict ZIP extraction) pulls in 4 crates that weren't previously exempt at `safe-to-deploy`:

- `arbitrary 1.4.2` (new)
- `bumpalo 3.19.1` (was `safe-to-run`; raised since zip's path needs deploy-grade)
- `derive_arbitrary 1.4.2` (new)
- `zopfli 0.8.3` (new)

Adds them to `engine/supply-chain/config.toml` as exemptions to unblock the CI `audit` job. Manually auditing 23k+ lines of zopfli / arbitrary upstream is out of scope; if any of these crates publishes a security concern, cargo-vet will surface it on the next run.
`zip 2.4.2` (added in PR #242 for SudachiDict ZIP extraction) ships a `build.rs`. Reviewed: standard cfg-detection pattern, no network access or codegen. Updates the build-script baseline so the audit job stops flagging it.
- (IMP) sudachi.rs:68 — robust cache invalidation: `fetch` now treats
ANY of {missing stamp, empty stamp, version mismatch, no stamp +
orphan CSVs} as a fully invalid cache and wipes before re-downloading.
R1's fix only handled version-mismatch; this closes the missing-
stamp + stale-CSV path. Adds `test_stale_cache_invariant` covering
all five state combinations and `test_any_csv_exists_detects_orphans`.
- (MIN perf) Cargo.toml:32 — switched from `zip = "2"` to `zip = "7"`,
collapsing the two zip majors in Cargo.lock into one. Drops the
arbitrary/derive_arbitrary transitive-dep tail that came in with
zip 2.x's deflate path; their cargo-vet exemptions are removed.
- (MIN perf) mod.rs:156 — stream rows to `BufWriter<File>` instead of
building the whole TSV (~100MB for full Sudachi) in a String first.
- (MIN perf) candidates_ops.rs:50 — drop the per-reading
`HashSet<String>` allocation and clone; linear scan over the small
Vec<DictEntry> from `dict.lookup` is faster for typical homophone
counts.
- (MIN doc) mod.rs:94 — fixed the off-by-one in the `classify` POS
column comment (was 4..=9, actually 5..=10 since col 4 is the
normalized surface, not POS).
Build artifact cleanup: build-script-baseline.txt drops `zip` (zip
7.x has no build.rs, unlike zip 2.x).
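The cache-validity rule described above can be condensed into a couple of predicates. This is a hypothetical distillation for illustration — the real `fetch` inspects stamp files and CSVs on disk, not plain values:

```rust
// Valid only with a non-empty stamp that matches the latest version;
// every other state is treated as fully invalid.
fn cache_is_valid(stamp: Option<&str>, latest: &str) -> bool {
    matches!(stamp, Some(s) if !s.is_empty() && s == latest)
}

// Anything that could leak stale data gets wiped before re-download:
// a stamp in any bad state, or orphan CSVs with no stamp at all.
fn needs_wipe(stamp: Option<&str>, latest: &str, any_csv_exists: bool) -> bool {
    !cache_is_valid(stamp, latest) && (stamp.is_some() || any_csv_exists)
}

fn main() {
    let latest = "20260301";
    assert!(cache_is_valid(Some("20260301"), latest));  // healthy cache
    assert!(!cache_is_valid(Some("20250101"), latest)); // version mismatch
    assert!(!cache_is_valid(Some(""), latest));         // empty stamp
    assert!(!cache_is_valid(None, latest));             // missing stamp
    assert!(needs_wipe(None, latest, true));            // orphan CSVs, no stamp
    assert!(!needs_wipe(None, latest, false));          // truly empty cache
}
```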
- (CRIT CI) supply-chain/config.toml — add zlib-rs 0.6.2 exemption.
zip 7's deflate path pulls in flate2 → zlib-rs, which wasn't covered
by the previous R2 supply-chain edits and was breaking the audit job.
- (MIN doc/style) mod.rs:151 — emit each TSV header line via its own
`writeln!` so source-code indentation doesn't bleed into the output
as leading whitespace before `#`. Adds an assertion that header
lines start at col 0.
- (MIN perf) sudachi.rs:185 — split the parsed row type into
`CandidateRow` (no reading) for the multimap value and `Candidate`
(with reading) only at the output site. The reading lives in the
HashMap key alone, dropping ~1.9M duplicate String allocations at
full-Sudachi scale.
- (MIN safety) sudachi.rs:245 — use a per-process unique tmp filename
(`.tmp.{pid}.zip`) and install `TmpFileGuard` BEFORE the download
block, so a failed `io::copy` no longer leaks a partial tmp and
parallel `mine` runs against the same cache dir don't clobber each
other.
- (MIN perf) candidates_ops.rs:57 — drop `cand.reading.clone()` (now
the outer `reading` is the canonical key) and replace
`pos.split('-').collect::<Vec>()` + `classify(&[..])` with a new
`classify_pos_string` helper that does the split inline without
allocating a Vec.
Tests: +1 `classify_pos_string_matches_classify`, header
col-0 assertion in `write_candidates_splits_buckets_and_sorts_by_cost`.
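A guard of the kind described above could look like the following sketch; the `TmpFileGuard` name comes from the review, but the field layout and usage are illustrative:

```rust
use std::fs;
use std::path::PathBuf;

// Illustrative RAII guard: removes the staged file on drop, so an
// early return from a failed download never leaks a partial tmp file.
struct TmpFileGuard {
    path: PathBuf,
}

impl Drop for TmpFileGuard {
    fn drop(&mut self) {
        // Best-effort; "not found" after a successful rename is fine.
        let _ = fs::remove_file(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    // Per-process unique name, as in the review's `.tmp.{pid}.zip`.
    let tmp = std::env::temp_dir().join(format!(".tmp.{}.zip", std::process::id()));
    {
        let _guard = TmpFileGuard { path: tmp.clone() };
        fs::write(&tmp, b"partial download")?; // simulate an interrupted io::copy
    } // guard drops here and deletes the partial file
    assert!(!tmp.exists());
    Ok(())
}
```

Installing the guard before the download block is what closes the leak window: the drop runs on every exit path, including `?` early returns.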
- (IMP) sudachi.rs:225 — `parse_latest_version` now accepts the `YYYYMMDD_N` re-cut form (e.g. `20201023_2`). Previously the digit-only filter dropped these entries, so a future re-cut release at the top of the listing would silently fall back to an older date-only version. Adds an `is_sudachi_version` helper, a regression test with `_N` at the head of the listing, and a noise-filter test covering empty / dangling / double-underscore / alpha cases.
- (MIN test) sudachi.rs:499 — switch the parse_dir test fixture from a `\`-continued multi-line string (which leaks source-code indentation into the data) to `concat!` of separate literals, and add the missing `surface == "東京都"` assertion that would have caught the leakage.
- (MIN doc) supply-chain/config.toml:69 — annotate why bumpalo's exemption was raised from `safe-to-run` to `safe-to-deploy` (zip ZIP-extraction tool path; bumpalo is not used at IME runtime, but cargo-vet evaluates the dep graph at deploy-grade).
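A version filter matching the behavior above might look like this; the `is_sudachi_version` name comes from the review, but the implementation here is a guess at equivalent logic:

```rust
// Accept plain YYYYMMDD and the YYYYMMDD_N re-cut form; reject
// everything else (empty, dangling/double underscore, alpha noise).
fn is_sudachi_version(s: &str) -> bool {
    let (date, recut) = match s.split_once('_') {
        Some((d, n)) => (d, Some(n)),
        None => (s, None),
    };
    let all_digits = |t: &str| !t.is_empty() && t.bytes().all(|b| b.is_ascii_digit());
    date.len() == 8 && all_digits(date) && recut.map_or(true, all_digits)
}

fn main() {
    assert!(is_sudachi_version("20201023"));
    assert!(is_sudachi_version("20201023_2"));   // re-cut form
    assert!(!is_sudachi_version(""));            // empty
    assert!(!is_sudachi_version("20201023_"));   // dangling underscore
    assert!(!is_sudachi_version("20201023__2")); // double underscore
    assert!(!is_sudachi_version("latest"));      // alpha noise
}
```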
- (perf) sudachi.rs:152 — walk the split iterator with an enumerate
match instead of `collect::<Vec<&str>>()` per CSV line. Drops
~3M Vec allocations on Sudachi-full and breaks early at col 11
(cols 12..17 aren't read).
- (safety) sudachi.rs:297 — extract each CSV to a per-process
`.{name}.extract.{pid}` temp path then atomically rename to the
final csv_path. Two parallel `mine` runs against the same cache
dir no longer race the prior exists-check + truncate-create
pattern.
- (perf) candidates_ops.rs:69 — consume `upstream` by value so each
`CandidateRow`'s `surface` and `pos` Strings can be moved into
`Candidate` instead of cloned. Reading is still cloned per row
(the HashMap key remains canonical across the outer loop).
- (doc) supply-chain/config.toml:73 — fix the bumpalo rationale.
R4's comment claimed bumpalo only came in via build-side
proc-macro tooling, but PR #242's `zip 7` switch added a
`zip → zopfli → bumpalo` runtime path through dictool. Comment
now lists both paths.
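The stage-then-rename step can be sketched as below; `write_atomically` is a hypothetical stand-in for the streamed extraction, and the staged-name format mirrors the one quoted in the review:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Write to a per-process staged path, then rename atomically onto the
// final path, so parallel runs never observe a half-written CSV.
fn write_atomically(final_path: &Path, contents: &[u8]) -> io::Result<()> {
    let name = final_path.file_name().unwrap().to_string_lossy();
    let staged =
        final_path.with_file_name(format!(".{}.extract.{}", name, std::process::id()));
    fs::write(&staged, contents)?; // the real code streams via io::copy
    fs::rename(&staged, final_path) // atomic when on the same filesystem
}

fn main() -> io::Result<()> {
    let target = std::env::temp_dir().join(format!("demo.{}.csv", std::process::id()));
    write_atomically(&target, b"surface,cost\n")?;
    assert_eq!(fs::read(&target)?, b"surface,cost\n");
    fs::remove_file(&target)
}
```

The rename is only atomic on the same filesystem, which holds here because the staged path lives next to the final one in the cache dir.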
- (perf) candidates_ops.rs:60 — scope the `seen` dedupe HashSet to a
single reading instead of growing it globally as `(reading, surface)`
tuples. Surface duplicates are only meaningful within one reading
(when Sudachi has multiple POS variants for the same row), so the
per-reading set drops both the cross-reading entries and the
per-row reading clone.
- (safety) sudachi.rs:321 — wrap `extract_tmp` with the existing
`TmpFileGuard` RAII pattern so a failure in `io::copy` or the
subsequent `fs::rename` doesn't leak a `.{basename}.extract.{pid}`
file in the cache dir. Successful rename leaves nothing at the
staged path, so the guard's drop is a no-op then.
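The per-reading scoping can be illustrated with a toy helper; `dedupe_per_reading` and its input shape are invented for the example, not the PR's actual types:

```rust
use std::collections::HashSet;

// `seen` is recreated per reading group, so it never accumulates
// cross-reading (reading, surface) tuples and holds only &str keys.
fn dedupe_per_reading(groups: &[(&str, Vec<&str>)]) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for (reading, surfaces) in groups {
        let mut seen = HashSet::new(); // lives for one reading only
        for s in surfaces {
            if seen.insert(*s) {
                out.push((reading.to_string(), s.to_string()));
            }
        }
    }
    out
}

fn main() {
    let groups = [
        ("とうきょう", vec!["東京", "東京", "東亰"]), // duplicate surface from POS variants
        ("きょう", vec!["今日", "京"]),
    ];
    let out = dedupe_per_reading(&groups);
    assert_eq!(out.len(), 4); // one duplicate dropped within its reading
}
```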
Summary
`dictool candidates mine` fetches SudachiDict-full → diffs it against the build dict → classifies into 3 buckets by POS → writes per-bucket TSVs to `engine/data/extras-candidates/`.
A tool that fills the proactive side of the "I typed it and no candidate appeared" friction. Mining output is never auto-merged into the build dict (following the lesson of the full Sudachi merge removed in #156). The workflow: the user reviews the TSVs by hand → copies only the good rows into `dict_source/extras/<domain>.tsv` → opens a PR.
Buckets
Each TSV is sorted by Sudachi cost ascending (a proxy for frequency). Narrow down by domain with grep:
```bash
grep -E "椒|醤|油" engine/data/extras-candidates/sudachi-common.tsv | head -50
```
Mining limits (confirmed)
→ Mining does not replace the curated layer; it complements it.
Design differences from the previous full Sudachi merge (#156, 0a45265)
The `fetch + parse` code itself is resurrected from 0a45265^ (~300 LoC), but it lives in a dedicated `candidates/` module rather than `dict_source/` and is not registered in the `SOURCES` registry (so neither `dictool fetch --source sudachi` nor `compile --source sudachi` can invoke it).
Layout
```
engine/crates/lex-cli/src/
├── candidates/
│ ├── mod.rs # Bucket / classify / write_candidates
│ └── sudachi.rs # fetch + 18-col CSV parse
└── commands/
└── candidates_ops.rs # mine() orchestration + default paths
```
Dependency: the `zip` crate is restored as a hard dep of lex-cli (Sudachi releases ship as ZIPs). The IME runtime is unaffected (lex-cli is a build/dev tool).
CLI
```bash
dictool candidates mine [--build-dict PATH] [--cache-dir DIR] [--out-dir DIR] [--clean]
```
Defaults:
Test plan
🤖 Generated with Claude Code