Add OpenDataLoader parity coverage and Java-core auto rescue by JamesbbBriz · Pull Request #13 · doctruthhq/DocTruth

JamesbbBriz · 2026-06-27T17:54:52Z

Summary

Add OpenDataLoader pipeline parity contracts: processor matrix, stage order, heuristic ownership, behavior-family buckets, and full200 gate metadata.
Gate Java-core preset=auto prediction so readable Java/PDFBox output stays canonical, while sparse output can route through Rust/MNN OCR rescue.
Keep model workers warm across OpenDataLoader prediction batches and prepare local PP-OCRv5 MNN cache when needed.
Generate concrete low-score-buckets.json artifacts from evaluator output so full200 gaps are machine-readable by metric bucket and matrix-aligned behavior family.
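The auto gate described above can be pictured with a minimal sketch. This is not the runtime's implementation: the PR does not say how "readable" versus "sparse" is measured, so the character-density heuristic, the `MIN_CHARS_PER_PAGE` threshold, and the names `JavaCoreResult` and `choose_route` are all hypothetical, chosen only to illustrate the routing decision.

```python
from dataclasses import dataclass

# Hypothetical threshold: the PR does not state how "sparse" is measured.
MIN_CHARS_PER_PAGE = 40


@dataclass
class JavaCoreResult:
    """Stand-in for the Java/PDFBox extraction result (illustrative only)."""
    page_count: int
    text: str


def choose_route(result: JavaCoreResult,
                 min_chars_per_page: int = MIN_CHARS_PER_PAGE) -> str:
    """Return "java-core" when the text layer looks readable, else "ocr-rescue"."""
    pages = max(result.page_count, 1)  # guard against a zero page count
    visible_chars = sum(1 for ch in result.text if not ch.isspace())
    if visible_chars / pages >= min_chars_per_page:
        return "java-core"   # readable output stays canonical
    return "ocr-rescue"      # sparse output may go through the OCR rescue path


if __name__ == "__main__":
    print(choose_route(JavaCoreResult(page_count=1, text="x" * 200)))
    print(choose_route(JavaCoreResult(page_count=2, text="  \n ")))
```

Under these assumptions, a gate of this shape only sends documents with a near-empty text layer to the model worker, which is consistent with the single OCR route reported in the full200 run.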

Full200 Verification

Latest release full200, Java-core auto + Rust/MNN rescue:

artifact: third_party/opendataloader-bench/prediction/doctruth-java-core-20260627T190836Z/full200
parsed:   200/200
failed:   0
overall:  0.781875
nid:      0.900985
teds:     0.736174
mhs:      0.492119
latency:  108.615072 ms/doc
rss:      start 7 MB, peak 71 MB, end 71 MB
python/torch/docling residency: false
model routes: 1 OCR route, 01030000000141
low-score artifact: low-score-buckets.json
metric buckets: heading 58, reading-order 23, table 7, overall 22, missing 0
behavior buckets: text-noise 22, two-column 23, sidebar 23, heading 58, bordered-table 7, borderless-table 7, ocr-rescue 0
classification basis: metric_proxy until richer layout tags are available
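The counts above are consistent with a one-to-many proxy from metric buckets to behavior families. The sketch below reproduces the reported behavior counts from the reported metric counts. The mapping table is an assumption inferred only from the matching numbers, not taken from the evaluator, and `ocr-rescue` is seeded at zero because this sketch gives it no metric proxy.

```python
# Assumed proxy mapping, inferred only from the matching counts in the report.
METRIC_TO_BEHAVIOR = {
    "heading": ["heading"],
    "reading-order": ["two-column", "sidebar"],
    "table": ["bordered-table", "borderless-table"],
    "overall": ["text-noise"],
}


def behavior_buckets(metric_buckets: dict[str, int]) -> dict[str, int]:
    """Fan each metric bucket count out to its assumed behavior families."""
    behavior = {"ocr-rescue": 0}  # no metric proxy in this sketch
    for metric, count in metric_buckets.items():
        for family in METRIC_TO_BEHAVIOR.get(metric, []):
            behavior[family] = behavior.get(family, 0) + count
    return behavior


reported = {"heading": 58, "reading-order": 23, "table": 7, "overall": 22, "missing": 0}

if __name__ == "__main__":
    print(behavior_buckets(reported))
```

Because each low reading-order score lands in both two-column and sidebar, the behavior counts in a proxy like this are upper bounds on each family rather than a diagnosis, which matches the metric_proxy caveat.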

Earlier release full200, deterministic Java-core lite baseline:

artifact: third_party/opendataloader-bench/prediction/doctruth-java-core-20260627T174900Z/full200
parsed:   200/200
failed:   0
overall:  0.779731
nid:      0.898148
teds:     0.736174
mhs:      0.489455
latency:  64.569802 ms/doc
rss peak: 19 MB
python/torch/docling residency: false
model routes: 0
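To make the comparison between the two runs explicit, the following sketch subtracts the lite baseline from the auto run. The numbers are copied from the two reports above; the dictionary keys are just labels chosen for this example.

```python
# Values copied from the two full200 reports above.
auto = {"overall": 0.781875, "nid": 0.900985, "teds": 0.736174,
        "mhs": 0.492119, "latency_ms": 108.615072, "rss_peak_mb": 71}
lite = {"overall": 0.779731, "nid": 0.898148, "teds": 0.736174,
        "mhs": 0.489455, "latency_ms": 64.569802, "rss_peak_mb": 19}

# Difference auto - lite, rounded to the six decimals used in the reports.
delta = {key: round(auto[key] - lite[key], 6) for key in auto}

if __name__ == "__main__":
    for key, value in delta.items():
        print(f"{key:12s} {value:+.6f}")
```

The differences are overall +0.002144, nid +0.002837, teds 0, mhs +0.002664, latency about +44.05 ms/doc, and peak RSS +52 MB. The single OCR route lifts nid and mhs slightly, leaves teds unchanged, and accounts for the latency and memory increase in this run.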

Test Plan

cargo fmt --manifest-path runtime/doctruth-runtime/Cargo.toml -- --check
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml --test benchmark_corpus_contract opendataloader_ -- --nocapture
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml --test opendataloader_parity_matrix_contract -- --nocapture
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml
sh scripts/smoke-doctruth-runtime.sh
sh scripts/smoke-doctruth-runtime-model-worker.sh
sh scripts/smoke-doctruth-runtime-benchmark-corpus.sh
git diff --check
DOCTRUTH_RUNTIME_BUILD_PROFILE=release DOCTRUTH_OPENDATALOADER_PRESET=auto sh scripts/run-opendataloader-java-core-parity.sh --full200

CI Note

The GitHub CI build (25) job is still red on mvn -B -ntp spotless:check checkstyle:check. I reproduced the same 117-file Spotless failure on main in a temporary worktree, so it is a baseline formatting issue rather than a regression from this PR. I am intentionally not mixing a repo-wide Java formatting sweep into this parser parity PR.

Notes

OpenDataLoader remains a behavior reference and benchmark surface, not a canonical output schema or production fallback.
TrustDocument remains canonical.
low-score-buckets.json now separates raw metric buckets from matrix-aligned behavior buckets. Behavior buckets are metric-proxy classifications until the evaluator consumes richer layout tags.

JamesbbBriz · 2026-06-27T18:46:13Z

CI note: the current failing build (25) job is from mvn -B -ntp spotless:check checkstyle:check reporting 117 Java formatting violations. I reproduced the same failure on main at 9df5f96, with the same 117-file Spotless baseline issue. I am not mixing a repository-wide Java formatting sweep into this parser parity PR. The parser/Rust validation and full200 gates for this branch have passed locally.

JamesbbBriz · 2026-06-27T19:02:00Z

Updated after closing the concrete low-score artifact gap.

Latest head: 04231a4 feat: write opendataloader low-score bucket artifacts

Fresh release full200, Java-core auto + Rust/MNN rescue:

artifact: third_party/opendataloader-bench/prediction/doctruth-java-core-20260627T185345Z/full200
parsed:   200/200
failed:   0
overall:  0.781875
nid:      0.900985
teds:     0.736174
mhs:      0.492119
latency:  128.819003 ms/doc
rss:      start 7 MB, peak 60 MB, end 60 MB
python/torch/docling residency: false
model routes: 1 OCR route, 01030000000141
low-score artifact: low-score-buckets.json
low-score cases: 76/200; heading 58, reading-order 23, table 7, overall 22, missing 0
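The bucket counts sum to 58 + 23 + 7 + 22 = 110, which is more than the 76 low-score cases. If 76 counts distinct documents, some documents must sit in more than one bucket. A small sketch of that relationship, using made-up document IDs rather than anything from the artifact:

```python
def summarize(buckets: dict[str, set[str]]) -> tuple[int, int]:
    """Return (total bucket memberships, distinct documents) for a bucket map."""
    memberships = sum(len(ids) for ids in buckets.values())
    distinct = len(set().union(*buckets.values()))
    return memberships, distinct


# Hypothetical document IDs, for illustration only.
example = {
    "heading": {"doc-a", "doc-b"},
    "reading-order": {"doc-b"},
    "table": set(),
    "overall": {"doc-b", "doc-c"},
}

if __name__ == "__main__":
    memberships, distinct = summarize(example)
    print(memberships, distinct, memberships - distinct)
```

Applied to the reported numbers under the same assumption, 110 memberships over 76 documents means 34 of the memberships are repeats of a document already counted in another bucket.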

Fresh verification run locally:

cargo fmt --manifest-path runtime/doctruth-runtime/Cargo.toml -- --check
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml --test benchmark_corpus_contract opendataloader_ -- --nocapture
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml
sh scripts/smoke-doctruth-runtime.sh
sh scripts/smoke-doctruth-runtime-model-worker.sh
sh scripts/smoke-doctruth-runtime-benchmark-corpus.sh
git diff --check
DOCTRUTH_RUNTIME_BUILD_PROFILE=release DOCTRUTH_OPENDATALOADER_PRESET=auto sh scripts/run-opendataloader-java-core-parity.sh --full200

CI note remains: build (25) is red on the repo-wide Java Spotless/checkstyle baseline. I reproduced the same 117-file Spotless failure on main; this PR does not add that failure, and I did not mix a full Java formatting sweep into this parser parity branch.

JamesbbBriz added 30 commits June 18, 2026 11:27

test: cover mnn model worker routing

cc225fc

feat: add opendataloader benchmark oracle

7029c19

test: add parser reference harness

b5a4102

docs: record opendataloader rustification plan

b11473b

chore: vendor opendataloader pdf references

0240472

feat: route auto preset to ocr mnn worker

f6c48ee

feat: discover packaged rapidocr mnn worker

4305ca9

feat: add mnn promotion benchmark gate

13e9d01

feat: add mnn promotion bench lane

e8e77bf

feat: enrich rust opendataloader prediction artifacts

731a66f

feat: add direct rust opendataloader prediction command

93fd3a4

feat: import opendataloader metrics in rust prediction

051f8f7

feat: add rust opendataloader promotion report

83e10b2

feat: add rust opendataloader evaluator mvp

f3d8359

feat: align rust opendataloader evaluator normalization

cfb4de6

feat: add rust mhs tree evaluator

0f3e1d8

feat: add rust teds tree evaluator

72a6301

feat: convert markdown tables in rust evaluator

651050b

feat: rustify opendataloader bench runner

a86a5de

feat: add rust opendataloader timeout

0da0a2b

test: add opendataloader evaluator parity smoke

92aaaa6

test: fail closed python oracle baseline

7ade82d

test: guard legacy python prediction adapter

b8e4406

test: require opt in for official evaluator

833aed9

feat: normalize table attributes in rust evaluator

b4145c6

feat: match opendataloader markdown table conversion

614d4f0

feat: align rust opendataloader projection path

47fbb39

docs: clarify rust mnn parser boundary

c92d574

feat: add rust mnn model worker path

2d69b9b

fix: make rust mnn worker fail closed

044b787

JamesbbBriz added 3 commits June 28, 2026 00:52

docs: align opendataloader parity source of truth

5d87b1d

fix: align opendataloader full200 gate artifacts

9875fd2

feat: gate java-core auto with mnn rescue

d35c33a

feat: write opendataloader low-score bucket artifacts

04231a4

JamesbbBriz added 24 commits June 28, 2026 03:13

fix: align opendataloader low-score buckets with behavior matrix

47fca8e

chore: apply java spotless formatting

4daa45b

fix: suppress repeated running headers from headings

097b3eb

fix: configure rust runtime for java ci

4d7fcd5

fix: ratchet java coverage gate for parser parity

bb7abad

fix: make parser seed smoke use fake mnn ocr manifest

b20c6b1

fix: improve opendataloader reading order and tables

0422e3e

docs: plan opendataloader processor parity completion

e8ce995

docs: classify opendataloader temporary repairs

f979551

fix: harden opendataloader repair registry

a90fe62

feat: prioritize opendataloader processor work

d3425ba

fix: validate opendataloader next work buckets

9362434

refactor: rehome opendataloader table repairs

b20deea

fix: preserve opendataloader repair order

3660f37

style: apply spotless to processor parity test

f68a220

docs: fold opendataloader plan into parity docs

5587b8a

fix: promote opendataloader bare headings

c93cdd0

fix: promote opendataloader dotted headings

0e70b22

fix: improve opendataloader heading hierarchy

b933c78

fix: demote opendataloader procedure steps

bc0138a

fix: demote opendataloader toc headings

5075ea9

fix: merge opendataloader heading fragments

3266d0c

fix: split opendataloader inline headings

79c80e6

fix: demote opendataloader heading furniture

7786317

JamesbbBriz wants to merge 283 commits into main from feat/opendataloader-parity-coverage
