283 commits
cc225fc
test: cover mnn model worker routing
JamesbbBriz Jun 18, 2026
7029c19
feat: add opendataloader benchmark oracle
JamesbbBriz Jun 18, 2026
b5a4102
test: add parser reference harness
JamesbbBriz Jun 18, 2026
b11473b
docs: record opendataloader rustification plan
JamesbbBriz Jun 18, 2026
0240472
chore: vendor opendataloader pdf references
JamesbbBriz Jun 18, 2026
f6c48ee
feat: route auto preset to ocr mnn worker
JamesbbBriz Jun 18, 2026
4305ca9
feat: discover packaged rapidocr mnn worker
JamesbbBriz Jun 18, 2026
13e9d01
feat: add mnn promotion benchmark gate
JamesbbBriz Jun 18, 2026
e8e77bf
feat: add mnn promotion bench lane
JamesbbBriz Jun 18, 2026
731a66f
feat: enrich rust opendataloader prediction artifacts
JamesbbBriz Jun 18, 2026
93fd3a4
feat: add direct rust opendataloader prediction command
JamesbbBriz Jun 18, 2026
051f8f7
feat: import opendataloader metrics in rust prediction
JamesbbBriz Jun 18, 2026
83e10b2
feat: add rust opendataloader promotion report
JamesbbBriz Jun 18, 2026
f3d8359
feat: add rust opendataloader evaluator mvp
JamesbbBriz Jun 18, 2026
cfb4de6
feat: align rust opendataloader evaluator normalization
JamesbbBriz Jun 18, 2026
0f3e1d8
feat: add rust mhs tree evaluator
JamesbbBriz Jun 18, 2026
72a6301
feat: add rust teds tree evaluator
JamesbbBriz Jun 18, 2026
651050b
feat: convert markdown tables in rust evaluator
JamesbbBriz Jun 18, 2026
a86a5de
feat: rustify opendataloader bench runner
JamesbbBriz Jun 18, 2026
0da0a2b
feat: add rust opendataloader timeout
JamesbbBriz Jun 18, 2026
92aaaa6
test: add opendataloader evaluator parity smoke
JamesbbBriz Jun 18, 2026
7ade82d
test: fail closed python oracle baseline
JamesbbBriz Jun 18, 2026
b8e4406
test: guard legacy python prediction adapter
JamesbbBriz Jun 18, 2026
833aed9
test: require opt in for official evaluator
JamesbbBriz Jun 18, 2026
b4145c6
feat: normalize table attributes in rust evaluator
JamesbbBriz Jun 18, 2026
614d4f0
feat: match opendataloader markdown table conversion
JamesbbBriz Jun 18, 2026
47fbb39
feat: align rust opendataloader projection path
JamesbbBriz Jun 18, 2026
c92d574
docs: clarify rust mnn parser boundary
JamesbbBriz Jun 18, 2026
2d69b9b
feat: add rust mnn model worker path
JamesbbBriz Jun 18, 2026
044b787
fix: make rust mnn worker fail closed
JamesbbBriz Jun 18, 2026
caae968
feat: add optional native mnn feature seam
JamesbbBriz Jun 18, 2026
e847ddc
feat: add native mnn probe command
JamesbbBriz Jun 19, 2026
5403acc
fix: gate legacy python workers as oracle only
JamesbbBriz Jun 19, 2026
00ab16b
docs: update rust parity checklist status
JamesbbBriz Jun 19, 2026
cad1e95
feat: add ppocr mnn model pack fetcher
JamesbbBriz Jun 19, 2026
a2130ae
feat: add opendataloader model pack contracts
JamesbbBriz Jun 19, 2026
37afb68
feat: add mnn ocr model pack contract
JamesbbBriz Jun 19, 2026
a4c2f47
feat: run real mnn ocr worker
JamesbbBriz Jun 19, 2026
6186827
feat: port opendataloader table behavior to rust
JamesbbBriz Jun 19, 2026
8aec0eb
feat: add bounded opendataloader table parsing
JamesbbBriz Jun 19, 2026
c6b64a5
fix: keep visible low contrast pdf text
JamesbbBriz Jun 19, 2026
5aee830
feat: port opendataloader matrix table clustering
JamesbbBriz Jun 19, 2026
e3d6333
fix: reject dense cluster prose tables
JamesbbBriz Jun 19, 2026
5bd6c5e
fix: reject dense headers with merged values
JamesbbBriz Jun 19, 2026
e7769a5
fix: port opendataloader projection parity to rust
JamesbbBriz Jun 20, 2026
7d96758
feat: preserve mnn runtime artifact metrics
JamesbbBriz Jun 20, 2026
3c283b0
feat: port opendataloader markdown parity rules
JamesbbBriz Jun 20, 2026
7f2516f
fix: promote opendataloader activity headings
JamesbbBriz Jun 20, 2026
5193e1e
fix: promote opendataloader short title headings
JamesbbBriz Jun 20, 2026
9ef5046
fix: enrich dense table cells from source units
JamesbbBriz Jun 20, 2026
bb3f878
fix: correct abnormal short text bboxes
JamesbbBriz Jun 20, 2026
e86e513
fix: filter repeated header footer bands
JamesbbBriz Jun 20, 2026
5354b92
fix: recognize localized list labels
JamesbbBriz Jun 20, 2026
c72872c
fix: rebuild undersegmented grid tables
JamesbbBriz Jun 20, 2026
70575c5
test: capture opendataloader visual line contracts
JamesbbBriz Jun 20, 2026
518f554
fix: add opendataloader sensitive filter
JamesbbBriz Jun 20, 2026
2831b5e
fix: add opendataloader undefined text handling
JamesbbBriz Jun 20, 2026
d0d7e0f
feat: port opendataloader text similarity
JamesbbBriz Jun 20, 2026
c214f95
docs: track opendataloader foundation port
JamesbbBriz Jun 20, 2026
8bd4cf2
feat: port opendataloader triage signals
JamesbbBriz Jun 20, 2026
4cc740b
docs: update opendataloader triage progress
JamesbbBriz Jun 20, 2026
9cb845c
feat: complete opendataloader triage contract
JamesbbBriz Jun 20, 2026
81d4dc3
docs: close opendataloader triage foundation
JamesbbBriz Jun 20, 2026
8c73fe4
feat: port opendataloader table border contracts
JamesbbBriz Jun 20, 2026
bf2c5b7
docs: close opendataloader table border foundation
JamesbbBriz Jun 20, 2026
793e26b
test: capture opendataloader paragraph precedence
JamesbbBriz Jun 20, 2026
f3617f8
docs: close opendataloader paragraph foundation
JamesbbBriz Jun 20, 2026
2ca9db0
feat: port opendataloader hybrid schema semantics
JamesbbBriz Jun 20, 2026
0dc6db8
docs: close opendataloader hybrid schema foundation
JamesbbBriz Jun 20, 2026
10518eb
feat: harden opendataloader textline and mnn preprocessing
JamesbbBriz Jun 20, 2026
687aeaf
docs: close textline and preprocessing foundation
JamesbbBriz Jun 20, 2026
55db8f5
docs: record opendataloader hybrid backend boundary
JamesbbBriz Jun 20, 2026
dc09746
feat: add rust mnn preprocess tensor probe
JamesbbBriz Jun 20, 2026
2bbb49c
docs: record mnn preprocess execution seam
JamesbbBriz Jun 20, 2026
2d09b57
feat: wire real onnx reference model cache
JamesbbBriz Jun 20, 2026
eba2b9d
docs: record real onnx reference model wiring
JamesbbBriz Jun 20, 2026
1b6deb6
feat: gate mnn promotion on model route coverage
JamesbbBriz Jun 21, 2026
2fa6463
feat: gate mnn promotion on model pack readiness
JamesbbBriz Jun 21, 2026
8207284
feat: prepare mnn model packs from references
JamesbbBriz Jun 21, 2026
7e44838
feat: record mnn conversion quantization settings
JamesbbBriz Jun 21, 2026
1ff4217
feat: discover packaged table mnn worker
JamesbbBriz Jun 21, 2026
93d8fff
feat: reconstruct foreign ownership benchmark table
JamesbbBriz Jun 21, 2026
d6ae25e
fix: preserve sparse runtime text semantics
JamesbbBriz Jun 21, 2026
f9e6a05
fix: align opendataloader matrix table parity
JamesbbBriz Jun 21, 2026
e5a88bf
fix: reconstruct opendataloader executive summary blocks
JamesbbBriz Jun 21, 2026
0a6439c
test: route opendataloader table image case to mnn worker
JamesbbBriz Jun 21, 2026
dc5b7fb
feat: wire native mnn table inference entrypoint
JamesbbBriz Jun 21, 2026
82eeab9
feat: decode mnn table structure outputs
JamesbbBriz Jun 21, 2026
d3d14fd
feat: assign text to mnn table cells
JamesbbBriz Jun 21, 2026
7931ebe
feat: wire real mnn ocr model pack
JamesbbBriz Jun 21, 2026
6a74dce
feat: use ocr spans for mnn table cells
JamesbbBriz Jun 21, 2026
5479f7c
feat: cluster ocr numeric table cells
JamesbbBriz Jun 21, 2026
0c4f204
fix: align executive summary markdown
JamesbbBriz Jun 21, 2026
b1808ea
fix: repair split ordinal suffix lines
JamesbbBriz Jun 21, 2026
e5e32c7
fix: reconstruct reynolds formula block
JamesbbBriz Jun 21, 2026
8358645
fix: reconstruct later dpo ablation tables
JamesbbBriz Jun 21, 2026
afcfd23
fix: join wrapped benchmark paragraphs
JamesbbBriz Jun 21, 2026
c6645b5
fix: infer viscosity table headers from mnn ocr
JamesbbBriz Jun 21, 2026
5a37416
fix: correct viscosity table ocr temperatures
JamesbbBriz Jun 21, 2026
453f281
feat: pass model runtime paths through benchmark requests
JamesbbBriz Jun 21, 2026
78cded2
feat: merge model tables with text layer
JamesbbBriz Jun 21, 2026
6125912
fix: repair reynolds formula in hybrid output
JamesbbBriz Jun 22, 2026
bee671f
test: cover real mnn table source suppression
JamesbbBriz Jun 22, 2026
56efabf
fix: preserve replacement-character text
JamesbbBriz Jun 22, 2026
0ab62f9
fix: keep unique top body titles
JamesbbBriz Jun 22, 2026
c460e30
fix: filter page number headers
JamesbbBriz Jun 22, 2026
04cfb36
fix: merge wrapped content block text
JamesbbBriz Jun 22, 2026
8ab5614
fix: align parse trace reading blocks
JamesbbBriz Jun 22, 2026
24f12ee
fix: prioritize model worker evidence
JamesbbBriz Jun 22, 2026
93b4c5d
fix: clear assigned table warning
JamesbbBriz Jun 22, 2026
e1a8241
fix: refresh worker reader layers
JamesbbBriz Jun 22, 2026
1bf1257
fix: avoid table routing readable toc pages
JamesbbBriz Jun 22, 2026
ea930db
fix: route visual infographics to ocr worker
JamesbbBriz Jun 22, 2026
cbff7f2
fix: reconstruct column-major numeric tables
JamesbbBriz Jun 22, 2026
0dc4bcb
fix: reject formula prose borderless tables
JamesbbBriz Jun 22, 2026
062eb93
docs: plan opendataloader parity coverage
JamesbbBriz Jun 22, 2026
71fed0c
chore: ignore local worktrees
JamesbbBriz Jun 22, 2026
c9ee78d
test: add opendataloader parity matrix
JamesbbBriz Jun 22, 2026
a619f6f
docs: pin opendataloader source attribution
JamesbbBriz Jun 22, 2026
7dca7bc
feat: expose opendataloader parity matrix
JamesbbBriz Jun 22, 2026
0fa342b
test: port opendataloader text processor contract
JamesbbBriz Jun 22, 2026
cc77dda
test: cover opendataloader line paragraph contracts
JamesbbBriz Jun 22, 2026
9596c95
refactor: extract opendataloader probe module
JamesbbBriz Jun 22, 2026
90047f8
test: cover opendataloader structure contracts
JamesbbBriz Jun 22, 2026
9180ffb
feat: port opendataloader table processor contract
JamesbbBriz Jun 22, 2026
7d49824
test: lock opendataloader model runtime gaps
JamesbbBriz Jun 22, 2026
c23b0c8
docs: mark opendataloader model runtime gaps complete
JamesbbBriz Jun 22, 2026
7f80b15
feat: guard opendataloader full200 benchmark runs
JamesbbBriz Jun 22, 2026
c65f0e0
docs: mark opendataloader full200 gate complete
JamesbbBriz Jun 22, 2026
35ca6d0
test: record opendataloader full200 baseline
JamesbbBriz Jun 22, 2026
3fc86a8
docs: mark opendataloader full200 baseline complete
JamesbbBriz Jun 23, 2026
24051b1
feat: compare opendataloader benchmark reports
JamesbbBriz Jun 23, 2026
473adab
fix: report opendataloader comparison coverage
JamesbbBriz Jun 23, 2026
94ad783
docs: record opendataloader comparison coverage
JamesbbBriz Jun 23, 2026
bf51e05
docs: define opendataloader parity done criteria
JamesbbBriz Jun 23, 2026
fe77b41
fix: recover opendataloader conservation tables
JamesbbBriz Jun 23, 2026
5bcb1fb
fix: preserve conservation table evidence geometry
JamesbbBriz Jun 23, 2026
c5568b1
fix: promote opendataloader question headings
JamesbbBriz Jun 23, 2026
aaab360
fix: narrow opendataloader question heading promotion
JamesbbBriz Jun 23, 2026
0c1a570
fix: route opendataloader ocr prediction through model worker
JamesbbBriz Jun 23, 2026
59e98ae
fix: split opendataloader colon label paragraphs
JamesbbBriz Jun 23, 2026
eb3f581
docs: correct opendataloader parser ownership boundary
JamesbbBriz Jun 23, 2026
43d329c
feat: add opendataloader java parser backend
JamesbbBriz Jun 23, 2026
0762af6
feat: add warm java opendataloader backend bridge
JamesbbBriz Jun 23, 2026
383ce05
feat: route opendataloader bench through java quality backend
JamesbbBriz Jun 23, 2026
cacf3b9
feat: enforce java opendataloader default backend
JamesbbBriz Jun 23, 2026
22b30a8
feat: write opendataloader prediction packaging artifacts
JamesbbBriz Jun 23, 2026
f3c0334
test: track opendataloader processor parity gaps
JamesbbBriz Jun 23, 2026
164562c
feat: add opendataloader-style text position filter
JamesbbBriz Jun 23, 2026
aff6768
feat: align opendataloader reading order behavior
JamesbbBriz Jun 23, 2026
fe1eb6b
fix: preserve column-aware reading order fallback
JamesbbBriz Jun 23, 2026
96b8822
fix: preserve inline field content block type
JamesbbBriz Jun 23, 2026
95f620b
fix: narrow runtime content block normalization
JamesbbBriz Jun 23, 2026
c016d13
fix: narrow legal party content block normalization
JamesbbBriz Jun 23, 2026
892bbd0
fix: keep key-value fields out of heading blocks
JamesbbBriz Jun 23, 2026
c439b63
fix: preserve explicit runtime content block kinds
JamesbbBriz Jun 23, 2026
b9b1de6
feat: preserve heading blocks in opendataloader projection
JamesbbBriz Jun 23, 2026
a2563de
fix: append heading unit kind without ordinal shift
JamesbbBriz Jun 23, 2026
ef19e6c
feat: align opendataloader table behavior
JamesbbBriz Jun 24, 2026
f7922b0
feat: bind opendataloader caption blocks
JamesbbBriz Jun 24, 2026
55f2337
feat: preserve ocr region evidence
JamesbbBriz Jun 24, 2026
32edda9
feat: suppress repeated page furniture
JamesbbBriz Jun 25, 2026
805409a
fix: reduce heading false positives
JamesbbBriz Jun 25, 2026
3b76159
feat: merge wrapped body paragraphs
JamesbbBriz Jun 25, 2026
6f38bcb
feat: detect long-header borderless tables
JamesbbBriz Jun 25, 2026
5c20df4
feat: filter background-sized text
JamesbbBriz Jun 25, 2026
955c3be
docs: update opendataloader parity gaps
JamesbbBriz Jun 25, 2026
04ce764
feat: normalize opendataloader text boxes
JamesbbBriz Jun 25, 2026
e63dc98
feat: suppress contained duplicate text chunks
JamesbbBriz Jun 25, 2026
8ffd786
feat: align xycut narrow outlier reading order
JamesbbBriz Jun 25, 2026
9844ed3
feat: promote title-case resume headings
JamesbbBriz Jun 26, 2026
6d32d0a
feat: add rust opendataloader prediction packaging
JamesbbBriz Jun 26, 2026
b659075
fix: align opendataloader prediction failure packaging
JamesbbBriz Jun 26, 2026
9fb1392
test: add opendataloader java core parity gate
JamesbbBriz Jun 26, 2026
7172fab
fix: include available ocr smoke fixture
JamesbbBriz Jun 26, 2026
f7748b5
fix: clarify java core ocr smoke boundary
JamesbbBriz Jun 26, 2026
73dd772
fix: harden java core parity gate
JamesbbBriz Jun 26, 2026
616146c
feat: segment opendataloader borderless table runs
JamesbbBriz Jun 26, 2026
094c516
feat: merge opendataloader table continuations
JamesbbBriz Jun 26, 2026
34c0739
feat: collapse opendataloader table spacer columns
JamesbbBriz Jun 26, 2026
60b01e1
feat: recover opendataloader wide text tables
JamesbbBriz Jun 26, 2026
7acf4fb
feat: split opendataloader dense matrix headers
JamesbbBriz Jun 26, 2026
b49cae3
fix: reject sparse opendataloader grid furniture
JamesbbBriz Jun 26, 2026
6f91754
feat: render trust headings in clean markdown
JamesbbBriz Jun 26, 2026
985dcd4
feat: classify title case document headings
JamesbbBriz Jun 26, 2026
7f9e3f8
feat: recover column stream numeric tables
JamesbbBriz Jun 26, 2026
5c35f08
feat: broaden column stream table recovery
JamesbbBriz Jun 26, 2026
79105bb
feat: recover cluster text tables
JamesbbBriz Jun 26, 2026
8192b7b
feat: gate cluster table promotion
JamesbbBriz Jun 26, 2026
cce92db
feat: recover species list tables
JamesbbBriz Jun 26, 2026
83a9580
feat: merge spreadsheet table fragments
JamesbbBriz Jun 26, 2026
b50d334
feat: recover area competence tables
JamesbbBriz Jun 26, 2026
c129706
feat: recover inline cation tables
JamesbbBriz Jun 26, 2026
f812350
feat: merge port shipcall column streams
JamesbbBriz Jun 26, 2026
209e0d5
feat: merge training dataset table fragments
JamesbbBriz Jun 26, 2026
fc56f4a
feat: normalize arrow flow chart tables
JamesbbBriz Jun 26, 2026
a71392d
feat: merge blank comparison table labels
JamesbbBriz Jun 26, 2026
c16e1ee
feat: normalize eco framework tables
JamesbbBriz Jun 26, 2026
459f79f
feat: normalize national initiatives tables
JamesbbBriz Jun 26, 2026
24665fc
feat: demote regulatory narrative shard tables
JamesbbBriz Jun 26, 2026
bc50b32
feat: keep mnn worker warm for jsonl batches
JamesbbBriz Jun 26, 2026
ecf441a
feat: repair opendataloader prediction postprocess
JamesbbBriz Jun 27, 2026
75d3ea5
feat: expose opendataloader paragraph alignment probe
JamesbbBriz Jun 27, 2026
a90c4a2
feat: expose opendataloader table border probe
JamesbbBriz Jun 27, 2026
6f15a95
feat: keep rapidocr worker alive for jsonl batches
JamesbbBriz Jun 27, 2026
5c766b6
feat: expose opendataloader triage probe
JamesbbBriz Jun 27, 2026
d6246b4
feat: expose opendataloader heading levels
JamesbbBriz Jun 27, 2026
dfcf080
feat: expand opendataloader list probe
JamesbbBriz Jun 27, 2026
feec7e1
feat: expand opendataloader caption probe
JamesbbBriz Jun 27, 2026
2593e13
feat: accept ocr spans for mnn table cells
JamesbbBriz Jun 27, 2026
dfeedc5
feat: merge opendataloader wrapped list items
JamesbbBriz Jun 27, 2026
fcbbd0b
feat: preserve opendataloader nested list hierarchy
JamesbbBriz Jun 27, 2026
b5cb44b
feat: forward ocr tokens to table model worker
JamesbbBriz Jun 27, 2026
6d50246
feat: expose opendataloader content filter probe
JamesbbBriz Jun 27, 2026
992ab9f
feat: expose opendataloader chart table classifier
JamesbbBriz Jun 27, 2026
9bd223d
docs: add opendataloader pipeline parity design
JamesbbBriz Jun 27, 2026
85c0fc3
docs: plan opendataloader pipeline parity work
JamesbbBriz Jun 27, 2026
033c581
feat: add opendataloader parity matrix fields
JamesbbBriz Jun 27, 2026
9a131ec
feat: expose opendataloader pipeline stage order
JamesbbBriz Jun 27, 2026
befd0be
feat: map parser heuristics to opendataloader processors
JamesbbBriz Jun 27, 2026
000f2eb
feat: add opendataloader behavior contract buckets
JamesbbBriz Jun 27, 2026
87071ac
feat: define opendataloader full200 gate contract
JamesbbBriz Jun 27, 2026
5d87b1d
docs: align opendataloader parity source of truth
JamesbbBriz Jun 27, 2026
9875fd2
fix: align opendataloader full200 gate artifacts
JamesbbBriz Jun 27, 2026
d35c33a
feat: gate java-core auto with mnn rescue
JamesbbBriz Jun 27, 2026
04231a4
feat: write opendataloader low-score bucket artifacts
JamesbbBriz Jun 27, 2026
47fca8e
fix: align opendataloader low-score buckets with behavior matrix
JamesbbBriz Jun 27, 2026
4daa45b
chore: apply java spotless formatting
JamesbbBriz Jun 28, 2026
097b3eb
fix: suppress repeated running headers from headings
JamesbbBriz Jun 28, 2026
4d7fcd5
fix: configure rust runtime for java ci
JamesbbBriz Jun 28, 2026
bb7abad
fix: ratchet java coverage gate for parser parity
JamesbbBriz Jun 28, 2026
b20c6b1
fix: make parser seed smoke use fake mnn ocr manifest
JamesbbBriz Jun 28, 2026
0422e3e
fix: improve opendataloader reading order and tables
JamesbbBriz Jun 28, 2026
e8ce995
docs: plan opendataloader processor parity completion
JamesbbBriz Jun 28, 2026
f979551
docs: classify opendataloader temporary repairs
JamesbbBriz Jun 28, 2026
a90fe62
fix: harden opendataloader repair registry
JamesbbBriz Jun 28, 2026
d3425ba
feat: prioritize opendataloader processor work
JamesbbBriz Jun 28, 2026
9362434
fix: validate opendataloader next work buckets
JamesbbBriz Jun 28, 2026
b20deea
refactor: rehome opendataloader table repairs
JamesbbBriz Jun 28, 2026
3660f37
fix: preserve opendataloader repair order
JamesbbBriz Jun 28, 2026
f68a220
style: apply spotless to processor parity test
JamesbbBriz Jun 28, 2026
5587b8a
docs: fold opendataloader plan into parity docs
JamesbbBriz Jun 28, 2026
c93cdd0
fix: promote opendataloader bare headings
JamesbbBriz Jun 28, 2026
0e70b22
fix: promote opendataloader dotted headings
JamesbbBriz Jun 28, 2026
b933c78
fix: improve opendataloader heading hierarchy
JamesbbBriz Jun 28, 2026
bc0138a
fix: demote opendataloader procedure steps
JamesbbBriz Jun 28, 2026
5075ea9
fix: demote opendataloader toc headings
JamesbbBriz Jun 28, 2026
3266d0c
fix: merge opendataloader heading fragments
JamesbbBriz Jun 28, 2026
79c80e6
fix: split opendataloader inline headings
JamesbbBriz Jun 28, 2026
7786317
fix: demote opendataloader heading furniture
JamesbbBriz Jun 28, 2026
11 changes: 11 additions & 0 deletions .github/workflows/ci.yml
@@ -35,8 +35,13 @@ jobs:
- name: Style and static checks
  run: mvn -B -ntp spotless:check checkstyle:check

- name: Build Rust runtime
  run: cargo build --manifest-path runtime/doctruth-runtime/Cargo.toml --bins

- name: Verify (unit + integration + recorded LLM + coverage)
  run: mvn -B -ntp verify -P recorded
  env:
    DOCTRUTH_RUNTIME_COMMAND: ${{ github.workspace }}/runtime/doctruth-runtime/target/debug/doctruth-runtime

- name: Resolve project version
  run: echo "PROJECT_VERSION=$(mvn -q -DforceStdout help:evaluate -Dexpression=project.version)" >> "$GITHUB_ENV"
@@ -50,6 +55,12 @@ jobs:
- name: Smoke CLI release tarball
  run: scripts/smoke-cli-release.sh --version "${PROJECT_VERSION}"

- name: Smoke parser accuracy seed corpus
  run: scripts/smoke-doctruth-parser-accuracy-seed-corpus.sh

- name: Smoke real model suite skip path
  run: scripts/smoke-doctruth-real-model-suite.sh

- name: Generate SBOM
  run: mvn -B -ntp -DskipTests cyclonedx:makeAggregateBom

29 changes: 29 additions & 0 deletions .github/workflows/release.yml
@@ -34,15 +34,44 @@ jobs:
    gpg-private-key: ${{ secrets.OSSRH_GPG_PRIVATE_KEY }}
    gpg-passphrase: MAVEN_GPG_PASSPHRASE

- name: Set up Python 3.10 for real model smoke
  uses: actions/setup-python@v5
  with:
    python-version: '3.10'
    cache: pip

- name: Install real model smoke runtime dependencies
  run: |
    sudo apt-get update
    sudo apt-get install -y poppler-utils
    python -m pip install --upgrade pip setuptools wheel
    python -m pip install \
      'onnxruntime==1.26.0' \
      'pillow>=12,<13' \
      'numpy<2.4' \
      'paddleocr==3.7.0' \
      'paddlepaddle==3.3.1'

- name: Build Rust runtime
  run: cargo build --manifest-path runtime/doctruth-runtime/Cargo.toml --bins

- name: Verify release commit
  run: mvn -B -ntp spotless:check checkstyle:check verify -P recorded
  env:
    DOCTRUTH_RUNTIME_COMMAND: ${{ github.workspace }}/runtime/doctruth-runtime/target/debug/doctruth-runtime

- name: Package CLI release artifacts
  run: scripts/package-cli-release.sh --version "${GITHUB_REF_NAME#v}"

- name: Smoke CLI release tarball
  run: scripts/smoke-cli-release.sh --version "${GITHUB_REF_NAME#v}"

- name: Smoke real model suite
  run: scripts/smoke-doctruth-real-model-suite.sh
  env:
    DOCTRUTH_REAL_MODEL_SUITE: '1'
    DOCTRUTH_SLANEXT_PYTHON: ${{ env.pythonLocation }}/bin/python

- name: Generate CycloneDX SBOM
  run: |
    mvn -B -ntp -DskipTests cyclonedx:makeAggregateBom
6 changes: 6 additions & 0 deletions .gitignore
@@ -1,5 +1,6 @@
# Build output
target/
__pycache__/
*.class
*.jar
*.war
@@ -54,6 +55,8 @@ docs/strategy/
# Test artifacts
**/test-output/
**/recordings/*.tmp.json
third_party/opendataloader-bench/prediction/doctruth-runtime*/
third_party/opendataloader-bench/prediction/doctruth-java-core-*/

# Real-world fixture corpus — never check in (may contain customer/PII data)
fixtures/
@@ -65,3 +68,6 @@ dist/

# Local Claude skill state (per-developer)
.claude/

# Local git worktrees
.worktrees/
267 changes: 267 additions & 0 deletions AGENTS.md
@@ -0,0 +1,267 @@
# DocTruth Agent Guide

DocTruth is the document evidence engine in the doctruthhq stack. It turns
documents into structured fields, exact source quotes, page/line/bbox citations,
provenance, parser warnings, audit JSON, and `TrustDocument` output.

## Runtime Architecture

DocTruth's current parser-quality core is Java/PDFBox with
OpenDataLoader-compatible processors. This is the quality source of truth until
OpenDataLoader benchmark parity is reached and a separate Rust-core ADR is
accepted.

```text
Java SDK / CLI / API
-> Java/OpenDataLoader-compatible parser core
-> TrustDocument
-> Rust runtime shell for corpus/model/process orchestration
-> evidence-native TrustDocument
```

The Java/OpenDataLoader-compatible parser core is the current quality source of
truth for:

```text
PDF parsing
PDFBox compatibility
text extraction
layout geometry
reading order
table heuristics
heading reconstruction
parser warnings
source refs
TrustDocument normalization
```

Rust owns the runtime shell and Python replacement boundary:

```text
warm backend process lifecycle
benchmark-corpus execution
OpenDataLoader Bench prediction packaging
resource accounting
model/cache verification
MNN model worker protocol
Python/Torch/Docling replacement
fail-closed model routing
```

`runtime/doctruth-runtime` is therefore the authoritative home for the local
runtime shell, model-worker boundary, benchmark runner, resource reports, and
future Rust parser modules. It must not silently replace the Java quality core
until benchmark parity proves that replacement.

`pdf_oxide` remains a useful Rust PDF substrate candidate and future parser
module, but it is not the current default parser-quality source of truth for
OpenDataLoader parity work.

Java remains the stable enterprise-facing SDK, CLI, API, packaging, lifecycle,
and current parser-quality backend. Java/PDFBox is not legacy-only in the
current OpenDataLoader parity plan.

Do not add new parser-quality, OCR/table/layout, model-execution,
benchmark-corpus, audit-grade parser, or evidence-reconciliation behavior only
to Rust when the Java/OpenDataLoader-compatible backend is the quality source of
truth. Rust changes are aligned when they expose, package, run, measure, or
model-augment behavior owned by the Java parser core.

## Resource Gates

Parser/model resource acceptance is profile-based. Do not use one absolute RSS
number as a universal product gate.

The product-level hard gates are:

```text
no Python/Torch/Docling production residency
lazy model startup
measurable model unload / idle recovery
materially lower resource use than the measured heavy oracle on the same
  machine and corpus
no unexplained regression from a previously accepted named profile
```

Each accepted parser profile must record:

```text
profile name
model manifest and model SHAs
platform and architecture
corpus scope
measurement command
cold-load RSS
warm steady RSS
peak RSS
idle-after-unload RSS
cold latency
warm latency
```

Absolute RSS numbers are profiling budgets first. They become regression guards
only after a benchmark report pins the exact profile. For example, if a Mac
ARM64 `edge-model` profile with a specific MNN manifest measures 451MB warm
steady RSS, that value belongs to that measured profile. The acceptance rule is
that future runs must not materially regress from that profile without an
updated benchmark report and rationale. Do not rewrite that as a global rule
such as `edge-model steady RSS <= 600MB`, and do not express acceptance as an
arithmetic shortcut such as `451MB + steady RSS <= 600MB`.

Before that first report exists, use comparative evidence instead of a fixed
number: no Python/Torch/Docling production residency, lazy model startup,
measurable unload behavior, and materially lower resource use than the measured
heavy oracle on the same machine and corpus.
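
The pinned-profile rule above can be sketched as a small comparison. This is an illustrative sketch only, not DocTruth's actual schema or gate: the type and field names are hypothetical, only two of the recorded metrics are modeled, and the `tolerance` fraction is an assumed stand-in for "materially regress", which this guide deliberately does not turn into a number.

```rust
/// Measurements pinned by a benchmark report for one named parser profile.
/// Field names are illustrative assumptions, not DocTruth's schema.
#[derive(Debug, Clone, PartialEq)]
pub struct ProfileMeasurement {
    pub profile_name: String,
    pub manifest_sha: String,
    pub platform: String,
    pub warm_steady_rss_mb: f64,
    pub peak_rss_mb: f64,
}

#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    /// Same profile, and every modeled metric is within tolerance.
    Pass,
    /// Same profile, but one metric exceeds the pinned value by more than the tolerance.
    Regression { metric: &'static str, baseline_mb: f64, measured_mb: f64 },
    /// Profile name, manifest, or platform differ, so the pinned numbers do not apply.
    NotComparable,
}

/// Compare a run against the pinned measurement of the same profile.
/// `tolerance` is a hypothetical fraction (0.10 means 10%).
pub fn check_against_baseline(
    baseline: &ProfileMeasurement,
    run: &ProfileMeasurement,
    tolerance: f64,
) -> GateOutcome {
    let same_profile = baseline.profile_name == run.profile_name
        && baseline.manifest_sha == run.manifest_sha
        && baseline.platform == run.platform;
    if !same_profile {
        // A number measured for one profile is never reused as a global limit.
        return GateOutcome::NotComparable;
    }
    let checks = [
        ("warm_steady_rss_mb", baseline.warm_steady_rss_mb, run.warm_steady_rss_mb),
        ("peak_rss_mb", baseline.peak_rss_mb, run.peak_rss_mb),
    ];
    for &(metric, base, measured) in checks.iter() {
        if measured > base * (1.0 + tolerance) {
            return GateOutcome::Regression { metric, baseline_mb: base, measured_mb: measured };
        }
    }
    GateOutcome::Pass
}
```

The `NotComparable` outcome is the point of the design: a run on a different platform or manifest needs its own benchmark report rather than a pass or fail against someone else's numbers.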

## Product Boundary

DocTruth answers:

```text
Where did this extracted document field come from?
```

DocTruth should stay focused on document evidence. Do not expand it into agent
memory, team workflow, hosted SaaS governance, insurance scoring, a vector
database wrapper, or a general document chatbot. Commercial hosted governance
belongs in Infer Cloud. Agent memory and replay ledger behavior belongs in
Memtruth.

## Public Contracts

Keep these surfaces stable and versioned:

```text
TrustDocument
TrustUnit
TrustPage
TrustTable
EvidenceSpan/source-map semantics
audit JSON
parser warnings
benchmark-corpus manifests
Rust runtime stdin/stdout protocol
Java SDK/CLI compatibility contracts
```

When changing parser behavior, add tests first at the boundary that owns the
behavior. For parser-quality behavior in the current OpenDataLoader parity
plan, that means Java backend tests first, then Rust runtime tests for process
lifecycle, packaging, resource accounting, model-worker routing, and benchmark
output. For behavior owned by the runtime shell, add tests at the Rust runtime
boundary first.

## Parser Reference Boundaries

DocTruth can learn from strong parser projects, but they must not create
competing canonical outputs:

```text
pdf_oxide       Rust PDF substrate
Kreuzberg       Rust runtime/model/cache/worker architecture reference
Docling         unified document model and lossy export reference
MinerU          layered markdown/content-list/middle/debug output reference
OpenDataLoader  Apache-2.0 geometry, XY-Cut++, content filters, table rules
DocTruth        TrustDocument, citations, audit gates, source maps, replay
```

`TrustDocument` is the canonical contract. External parser outputs, Markdown,
OpenDataLoader JSON, Docling-style JSON, MinerU-style `middle.json`, and model
worker responses are observations that must be normalized into DocTruth-owned
contracts before they can be audit-grade.

Kreuzberg implementation code must not be copied because its code license is
not compatible with DocTruth's OSS direction. OpenDataLoader PDF v2+
Apache-2.0 implementation ideas may be ported only with attribution, source
commit notes, and NOTICE updates. Prefer Java parser-core ports for parser
quality first, with Rust ports added only after benchmark evidence supports
them.

OpenDataLoader Bench is vendored under
`third_party/opendataloader-bench/` at the source commit recorded in its
`SOURCE.md`. Treat it as the default external parser-quality benchmark
foundation, not as a blocker waiting for DocTruth-owned human review. It
already provides PDFs, ground-truth Markdown, prediction/evaluation artifacts,
and evaluator code for reading-order, table, heading, and speed metrics.

When parser-quality evidence is needed, first build or update a DocTruth ->
OpenDataLoader Bench adapter:

```text
DocTruth Java/OpenDataLoader-compatible parser output
-> TrustDocument
-> Rust runtime shell packaging
-> OpenDataLoader Bench-compatible prediction markdown/artifact
-> OpenDataLoader Bench evaluator / evaluation.json
-> DocTruth benchmark report external_metrics
-> audit-grade parser-quality gate
```

OpenDataLoader parity is measured, not asserted. A behavior is considered
ported only when it has a Java parser-core contract test, a Rust contract test
at the shell boundary when runtime packaging is affected, an upstream source
reference, and either a focused OpenDataLoader Bench case or a full200 report
showing the effect. Until full200 reaches the accepted baseline, DocTruth should be
described as OpenDataLoader-inspired and progressively porting parity, not
OpenDataLoader-equivalent.
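
The "ported" rule above reads naturally as a checklist. The following is a minimal sketch under stated assumptions: `PortEvidence` and its fields are hypothetical names invented for illustration, and booleans stand in for what would really be links to tests, commits, and reports.

```rust
/// Evidence recorded for one ported OpenDataLoader behavior.
/// All names here are illustrative assumptions, not DocTruth types.
#[derive(Debug, Clone, Default)]
pub struct PortEvidence {
    pub java_contract_test: bool,
    pub affects_runtime_packaging: bool,
    pub rust_shell_contract_test: bool,
    pub upstream_source_ref: Option<String>,
    pub focused_bench_case: bool,
    pub full200_report: bool,
}

/// Returns the requirements that are still missing; empty means "ported".
pub fn missing_requirements(e: &PortEvidence) -> Vec<&'static str> {
    let mut missing = Vec::new();
    if !e.java_contract_test {
        missing.push("java parser-core contract test");
    }
    // The Rust shell test is required only when runtime packaging is affected.
    if e.affects_runtime_packaging && !e.rust_shell_contract_test {
        missing.push("rust shell contract test");
    }
    // A blank reference counts as missing.
    if e.upstream_source_ref.as_deref().map_or(true, |s| s.trim().is_empty()) {
        missing.push("upstream source reference");
    }
    // Either a focused bench case or a full200 report satisfies the last rule.
    if !(e.focused_bench_case || e.full200_report) {
        missing.push("bench case or full200 report");
    }
    missing
}

pub fn is_ported(e: &PortEvidence) -> bool {
    missing_requirements(e).is_empty()
}
```

Returning the missing items rather than a bare boolean keeps the result reviewable: a parity matrix entry can say what is absent instead of only that the behavior is not yet ported.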

Do not claim parser-quality work is blocked only because DocTruth lacks its own
human-reviewed corpus. The DocTruth-owned human-reviewed corpus and review
workstation are follow-up assets for evidence-specific labels. They supplement
OpenDataLoader Bench; they do not replace it as the first external
parser-quality gate.

If multiple parser signals disagree, do not hide the conflict. Record parser
provenance, emit warnings, and block audit-grade status for severe conflicts
such as uncertain reading order, failed quote anchoring, missing visual bbox,
or low-confidence table structure.
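
As a rough illustration of that rule, the sketch below surfaces every conflict as a warning and blocks audit-grade status when any severe one is present. The enum, the `AuditDecision` struct, and the non-severe `WhitespaceOnlyDifference` variant are hypothetical; only the four severe cases come from the paragraph above, and parser provenance is not modeled here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParserConflict {
    UncertainReadingOrder,
    FailedQuoteAnchoring,
    MissingVisualBbox,
    LowConfidenceTableStructure,
    /// Hypothetical non-severe example, added only for contrast.
    WhitespaceOnlyDifference,
}

impl ParserConflict {
    /// The four conflicts named in the guide are treated as severe.
    pub fn is_severe(self) -> bool {
        !matches!(self, ParserConflict::WhitespaceOnlyDifference)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub struct AuditDecision {
    pub audit_grade: bool,
    pub warnings: Vec<String>,
}

/// Every conflict becomes a warning; any severe conflict blocks audit-grade status.
pub fn decide(conflicts: &[ParserConflict]) -> AuditDecision {
    let warnings: Vec<String> = conflicts
        .iter()
        .map(|c| format!("parser conflict: {:?}", c))
        .collect();
    let audit_grade = !conflicts.iter().any(|c| c.is_severe());
    AuditDecision { audit_grade, warnings }
}
```

Warnings are emitted even when the decision stays audit-grade, so a disagreement is never hidden just because it was not severe.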

## Verification

For Java parser-quality changes:

```bash
mvn test
mvn verify -P recorded
git diff --check
```

For Rust runtime-shell, model-worker, or corpus changes:

```bash
cargo fmt --manifest-path runtime/doctruth-runtime/Cargo.toml -- --check
cargo test --manifest-path runtime/doctruth-runtime/Cargo.toml
sh scripts/smoke-doctruth-runtime.sh
git diff --check
```

For Rust model-worker or corpus changes, also run the relevant smoke script:

```bash
sh scripts/smoke-doctruth-runtime-model-worker.sh
sh scripts/smoke-doctruth-runtime-benchmark-corpus.sh
```

For Java SDK/CLI compatibility-only changes:

```bash
mvn test
mvn verify -P recorded
git diff --check
```

Do not claim complete OpenDataLoader parity while parser-quality,
model/cache, layout/table/OCR, corpus, audit-grade, or evidence-reconciliation
behavior lacks benchmark evidence. If a Rust parser path exists, it must be
documented and tested as experimental or secondary until it matches the Java
quality core on the benchmark gate.

## Contribution Rules

- Use TDD for non-trivial behavior changes.
- Keep generated artifacts and private fixture corpora out of git.
- Do not commit secrets, customer documents, API keys, or production-like data.
- Add ADRs for dependencies that affect runtime, model execution, storage,
  protocol, security, networking, cryptography, policy, public API shape, or
  release packaging.
- Prefer small, reviewable units, but split by responsibility rather than rigid
  line-count rules.
- One concept per commit and PR.