
Add OpenDataLoader parity coverage and Java-core auto rescue #13

Open
JamesbbBriz wants to merge 283 commits into main from feat/opendataloader-parity-coverage


JamesbbBriz commented Jun 27, 2026

Contributor

Summary

  • Add OpenDataLoader pipeline parity contracts: processor matrix, stage order, heuristic ownership, behavior-family buckets, and full200 gate metadata.
  • Gate Java-core preset=auto prediction so readable Java/PDFBox output stays canonical, while sparse output can route through Rust/MNN OCR rescue.
  • Keep model workers warm across OpenDataLoader prediction batches and prepare local PP-OCRv5 MNN cache when needed.
  • Generate concrete low-score-buckets.json artifacts from evaluator output so full200 gaps are machine-readable by metric bucket and matrix-aligned behavior family.
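
For illustration, the preset=auto gate described in the second bullet could look roughly like the sketch below. This is not the PR's implementation: the `JavaCoreResult` type, the `is_sparse` and `choose_route` helpers, and the characters-per-page threshold are all hypothetical, since the PR does not state how "sparse" is measured. The sketch also assumes that any preset other than auto never routes to OCR.

```python
from dataclasses import dataclass

# Hypothetical threshold: the PR does not say how sparseness is measured.
MIN_CHARS_PER_PAGE = 40


@dataclass
class JavaCoreResult:
    """Hypothetical container for the Java/PDFBox text output of one document."""
    doc_id: str
    page_count: int
    text: str


def is_sparse(result: JavaCoreResult, min_chars_per_page: int = MIN_CHARS_PER_PAGE) -> bool:
    """Treat output as sparse when it has too few non-whitespace characters per page."""
    visible = sum(1 for ch in result.text if not ch.isspace())
    pages = max(result.page_count, 1)  # guard against a zero page count
    return visible / pages < min_chars_per_page


def choose_route(result: JavaCoreResult, preset: str = "auto") -> str:
    """Return "java-core" to keep the deterministic output, or "ocr-rescue" to re-run via OCR."""
    if preset != "auto":
        # Assumption: non-auto presets never trigger the rescue path.
        return "java-core"
    return "ocr-rescue" if is_sparse(result) else "java-core"
```

With this shape, a text-rich document stays on the Java-core path and only a near-empty extraction is handed to the OCR route, which matches the single OCR route reported in the full200 run.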

Full200 Verification

Latest release full200, Java-core auto + Rust/MNN rescue:

artifact: third_party/opendataloader-bench/prediction/doctruth-java-core-20260627T190836Z/full200
parsed:   200/200
failed:   0
overall:  0.781875
nid:      0.900985
teds:     0.736174
mhs:      0.492119
latency:  108.615072 ms/doc
rss:      start 7 MB, peak 71 MB, end 71 MB
python/torch/docling residency: false
model routes: 1 OCR route, 01030000000141
low-score artifact: low-score-buckets.json
metric buckets: heading 58, reading-order 23, table 7, overall 22, missing 0
behavior buckets: text-noise 22, two-column 23, sidebar 23, heading 58, bordered-table 7, borderless-table 7, ocr-rescue 0
classification basis: metric_proxy until richer layout tags are available

Earlier release full200, deterministic Java-core lite baseline:

artifact: third_party/opendataloader-bench/prediction/doctruth-java-core-20260627T174900Z/full200
parsed:   200/200
failed:   0
overall:  0.779731
nid:      0.898148
teds:     0.736174
mhs:      0.489455
latency:  64.569802 ms/doc
rss peak: 19 MB
python/torch/docling residency: false
model routes: 0
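
To make the comparison between the two runs explicit, the differences can be computed directly from the numbers reported above. The dictionary keys and the `deltas` helper are illustrative names only; the values are copied from the two summaries.

```python
# Values copied from the two full200 summaries above.
auto_rescue = {
    "overall": 0.781875,
    "nid": 0.900985,
    "teds": 0.736174,
    "mhs": 0.492119,
    "latency_ms": 108.615072,
}
lite_baseline = {
    "overall": 0.779731,
    "nid": 0.898148,
    "teds": 0.736174,
    "mhs": 0.489455,
    "latency_ms": 64.569802,
}


def deltas(new: dict, old: dict) -> dict:
    """Return new minus old for every metric, rounded to six decimal places."""
    return {key: round(new[key] - old[key], 6) for key in new}


for key, value in deltas(auto_rescue, lite_baseline).items():
    print(f"{key:<12}{value:+.6f}")
```

TEDS is unchanged between the two runs, the other three scores rise by roughly 0.002 to 0.003, and mean latency rises by about 44 ms/doc.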

Test Plan

cargo fmt --manifest-path runtime/doctruth-runtime/Cargo.toml -- --check
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml --test benchmark_corpus_contract opendataloader_ -- --nocapture
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml --test opendataloader_parity_matrix_contract -- --nocapture
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml
sh scripts/smoke-doctruth-runtime.sh
sh scripts/smoke-doctruth-runtime-model-worker.sh
sh scripts/smoke-doctruth-runtime-benchmark-corpus.sh
git diff --check
DOCTRUTH_RUNTIME_BUILD_PROFILE=release DOCTRUTH_OPENDATALOADER_PRESET=auto sh scripts/run-opendataloader-java-core-parity.sh --full200

CI Note

The GitHub CI build (25) job is still red on mvn -B -ntp spotless:check checkstyle:check. I reproduced the same 117-file Spotless failure on main in a temporary worktree, so it is a baseline formatting issue rather than a regression from this PR. I am intentionally not mixing a repo-wide Java formatting sweep into this parser parity PR.

Notes

  • OpenDataLoader remains a behavior reference and benchmark surface, not a canonical output schema or production fallback.
  • TrustDocument remains canonical.
  • low-score-buckets.json now separates raw metric buckets from matrix-aligned behavior buckets. Behavior buckets are metric-proxy classifications until the evaluator consumes richer layout tags.
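
As a rough sketch of the split between metric buckets and behavior buckets, the code below groups low-scoring documents by metric and then expands each metric bucket into behavior families through a proxy table. Everything here is an assumption for illustration: the row fields, the 0.5 cutoff, the pairing of each metric with a bucket, and the proxy mapping (inferred only from the matching counts in the summary, such as reading-order 23 next to two-column 23 and sidebar 23). The actual artifact schema and classifier are not documented in this PR.

```python
from collections import defaultdict

# Assumed metric-to-bucket pairing and a made-up cutoff; neither comes from the PR.
METRIC_BUCKETS = {
    "mhs": "heading",
    "nid": "reading-order",
    "teds": "table",
    "overall": "overall",
}
LOW_SCORE_CUTOFF = 0.5

# Proxy mapping inferred from the reported counts; the real mapping may differ.
BEHAVIOR_PROXY = {
    "heading": ["heading"],
    "reading-order": ["two-column", "sidebar"],
    "table": ["bordered-table", "borderless-table"],
    "overall": ["text-noise"],
}


def bucket_low_scores(rows, cutoff=LOW_SCORE_CUTOFF):
    """Group document IDs by low metric, then by proxy behavior family."""
    metric_buckets = defaultdict(list)
    behavior_buckets = defaultdict(list)
    for row in rows:
        for metric, bucket in METRIC_BUCKETS.items():
            score = row.get(metric)
            if score is None:
                # Assumption: None means the metric does not apply to this document.
                continue
            if score < cutoff:
                metric_buckets[bucket].append(row["doc_id"])
                for family in BEHAVIOR_PROXY[bucket]:
                    behavior_buckets[family].append(row["doc_id"])
    return {
        "classification_basis": "metric_proxy",
        "metric_buckets": dict(metric_buckets),
        "behavior_buckets": dict(behavior_buckets),
    }
```

One document can land in several buckets under this scheme, which would be consistent with the counts reported later in this thread: the four metric buckets sum to 110 across 76 low-score cases.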

@JamesbbBriz

Contributor Author

CI note: the current failing build (25) job is from mvn -B -ntp spotless:check checkstyle:check reporting 117 Java formatting violations. I reproduced the same failure on main at 9df5f96, with the same 117-file Spotless baseline issue. I am not mixing a repository-wide Java formatting sweep into this parser parity PR. The parser/Rust validation and full200 gates for this branch have passed locally.

@JamesbbBriz

Contributor Author

Updated after closing the concrete low-score artifact gap.

Latest head: 04231a4 feat: write opendataloader low-score bucket artifacts

Fresh release full200, Java-core auto + Rust/MNN rescue:

artifact: third_party/opendataloader-bench/prediction/doctruth-java-core-20260627T185345Z/full200
parsed:   200/200
failed:   0
overall:  0.781875
nid:      0.900985
teds:     0.736174
mhs:      0.492119
latency:  128.819003 ms/doc
rss:      start 7 MB, peak 60 MB, end 60 MB
python/torch/docling residency: false
model routes: 1 OCR route, 01030000000141
low-score artifact: low-score-buckets.json
low-score cases: 76/200; heading 58, reading-order 23, table 7, overall 22, missing 0

Fresh verification run locally:

cargo fmt --manifest-path runtime/doctruth-runtime/Cargo.toml -- --check
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml --test benchmark_corpus_contract opendataloader_ -- --nocapture
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml
sh scripts/smoke-doctruth-runtime.sh
sh scripts/smoke-doctruth-runtime-model-worker.sh
sh scripts/smoke-doctruth-runtime-benchmark-corpus.sh
git diff --check
DOCTRUTH_RUNTIME_BUILD_PROFILE=release DOCTRUTH_OPENDATALOADER_PRESET=auto sh scripts/run-opendataloader-java-core-parity.sh --full200

CI note remains: build (25) is red on the repo-wide Java Spotless/checkstyle baseline. I reproduced the same 117-file Spotless failure on main; this PR does not add that failure, and I did not mix a full Java formatting sweep into this parser parity branch.
