This repository contains a small rule-based system for detecting internal references in regulatory text and resolving them to paragraph IDs within the same document.
The goal was not to build a heavy NLP pipeline, but to put together a working and readable prototype that can:
- identify structural items such as paragraphs, annexes, appendices, and figures
- extract internal references from paragraph text
- match those references to the correct target IDs in the same JSON document
The solution is organized as a simple pipeline:
-
parser.pyParses each paragraph, detects its record type, extracts the label, and separates the body text. -
indexer.pyBuilds lookup tables from the parsed document so references can later be resolved to paragraph IDs. -
extractor.pyUses regex patterns to find references such as:
paragraph 2.4paragraphs 8.24.5.1.5 and 8.24.5.1.8Annex 3Appendix 2 of Annex 6Figure 1
-
resolver.pyMaps extracted references to the matching IDs using the lookup tables. -
pipeline.pyRuns the full end-to-end prediction flow and writes the output JSON. -
evaluation.pyRuns the pipeline on the reference dataset and computes evaluation metrics.
- src/parser.py
- src/indexer.py
- src/extractor.py
- src/resolver.py
- src/pipeline.py
- src/evaluation.py
- data/evaluation_data.json
- data/test_data.json
Both datasets follow the same structure:
{
"documentVersionKey": "...",
"documentVersionId": "...",
"rootRegion": "...",
"region": "...",
"paragraphLinks": [
{
"text": "3.2.1. ... paragraph 2.4. ...",
"id": "paragraph-id-1",
"targetIds": []
}
]
}The pipeline returns the same structure, but with predicted values in targetIds.
Run evaluation on the reference dataset:
python3 -m src.evaluationGenerate predictions for the test dataset:
python3 -m src.pipelineThis writes the output file to:
On the reference dataset, the current implementation produces:
- precision:
0.50 - recall:
0.476 - f1:
0.4875 - exact match rate:
0.9543
On the test set, the generated prediction file contains 298 paragraphs, with predicted references for 68 of them.