Cross Reference Linker

This repository contains a small rule-based system for detecting internal references in regulatory text and resolving them to paragraph IDs within the same document.

The goal was not to build a heavy NLP pipeline, but to put together a working and readable prototype that can:

- identify structural items such as paragraphs, annexes, appendices, and figures
- extract internal references from paragraph text
- match those references to the correct target IDs in the same JSON document

Approach

The solution is organized as a simple pipeline:

- parser.py: Parses each paragraph, detects its record type, extracts the label, and separates the body text.
- indexer.py: Builds lookup tables from the parsed document so references can later be resolved to paragraph IDs.
- extractor.py: Uses regex patterns to find references such as:
  - paragraph 2.4
  - paragraphs 8.24.5.1.5 and 8.24.5.1.8
  - Annex 3
  - Appendix 2 of Annex 6
  - Figure 1
- resolver.py: Maps extracted references to the matching IDs using the lookup tables.
- pipeline.py: Runs the full end-to-end prediction flow and writes the output JSON.
- evaluation.py: Runs the pipeline on the reference dataset and computes evaluation metrics.
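
The index, extract, and resolve steps can be illustrated with a minimal sketch. It is not the repository's code: the regex patterns, function names, and sample IDs are all assumptions made for illustration, and it covers only paragraph references (not annexes, appendices, or figures).

```python
import re

# Hypothetical label pattern: a leading dotted number such as "3.2.1."
LABEL_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+")
# Hypothetical reference pattern: "paragraph 2.4" or
# "paragraphs 8.24.5.1.5 and 8.24.5.1.8"
PARA_REF_RE = re.compile(
    r"\bparagraphs?\s+(\d+(?:\.\d+)*(?:\s*(?:,|and)\s*\d+(?:\.\d+)*)*)",
    re.IGNORECASE,
)
NUMBER_RE = re.compile(r"\d+(?:\.\d+)*")


def build_index(paragraphs):
    """Map each paragraph label (e.g. "2.4") to its paragraph ID."""
    index = {}
    for para in paragraphs:
        match = LABEL_RE.match(para["text"])
        if match:
            # Keep the first paragraph seen for a given label.
            index.setdefault(match.group(1), para["id"])
    return index


def extract_paragraph_refs(text):
    """Return the paragraph numbers referenced in the text, in order."""
    refs = []
    for match in PARA_REF_RE.finditer(text):
        refs.extend(NUMBER_RE.findall(match.group(1)))
    return refs


def resolve(text, index, own_id):
    """Resolve references in the text to IDs, skipping self-links and duplicates."""
    ids = []
    for label in extract_paragraph_refs(text):
        target = index.get(label)
        if target and target != own_id and target not in ids:
            ids.append(target)
    return ids


# Made-up sample data in the shape of "paragraphLinks".
paragraphs = [
    {"text": "2.4. Definitions apply.", "id": "p-a", "targetIds": []},
    {"text": "3.2.1. As required by paragraph 2.4. above.", "id": "p-b", "targetIds": []},
]
index = build_index(paragraphs)
print(resolve(paragraphs[1]["text"], index, "p-b"))
```

Unresolvable labels are dropped silently here; whether the actual resolver does the same is not stated in this README.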

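Repository Structure

- data/ input datasets and the generated data/test_predictions.json
- src/ Python modules for the pipeline, including pipeline.py and evaluation.py
- .gitignore
- README.md
- requirements.txt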

Input Format

Both datasets (the reference dataset and the test dataset) follow the same JSON structure:

{
  "documentVersionKey": "...",
  "documentVersionId": "...",
  "rootRegion": "...",
  "region": "...",
  "paragraphLinks": [
    {
      "text": "3.2.1. ... paragraph 2.4. ...",
      "id": "paragraph-id-1",
      "targetIds": []
    }
  ]
}

The pipeline returns the same structure, but with predicted values in targetIds.
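
As a rough illustration of that contract, the sketch below copies a document and fills in targetIds for each entry in paragraphLinks. The function name fill_target_ids, the predict_fn callback, and the sample IDs are hypothetical; the real pipeline.py may be organized differently.

```python
import copy
import json


def fill_target_ids(document, predict_fn):
    """Return a copy of the document with targetIds set for every paragraph.

    predict_fn is a hypothetical callable: (text, paragraph_id) -> list of IDs.
    """
    result = copy.deepcopy(document)  # leave the input untouched
    for para in result["paragraphLinks"]:
        para["targetIds"] = list(predict_fn(para["text"], para["id"]))
    return result


# Illustrative document in the input format (IDs are made up).
document = {
    "documentVersionKey": "...",
    "documentVersionId": "...",
    "rootRegion": "...",
    "region": "...",
    "paragraphLinks": [
        {"text": "2.4. ...", "id": "paragraph-id-0", "targetIds": []},
        {"text": "3.2.1. ... paragraph 2.4. ...", "id": "paragraph-id-1", "targetIds": []},
    ],
}


def stub_predict(text, paragraph_id):
    # Stand-in for the real extractor and resolver.
    return ["paragraph-id-0"] if "paragraph 2.4" in text else []


output = fill_target_ids(document, stub_predict)
print(json.dumps(output["paragraphLinks"], indent=2))
```

All top-level fields pass through unchanged, so the output can be compared with the input field by field.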

How To Run

Run evaluation on the reference dataset:

python3 -m src.evaluation

Generate predictions for the test dataset:

python3 -m src.pipeline

This writes the output file to:

data/test_predictions.json

Current Result

On the reference dataset, the current implementation produces:

- precision: 0.50
- recall: 0.476
- f1: 0.4875
- exact match rate: 0.9543
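
This README does not say how the metrics are defined. One common reading, assumed here, is micro-averaged precision, recall, and F1 over individual (paragraph, target ID) links, plus an exact match rate counting paragraphs whose predicted ID set equals the reference set. The sketch below follows that assumption; the evaluate function and its input shape are hypothetical.

```python
def evaluate(gold, predicted):
    """Compute link-level metrics and paragraph-level exact match.

    gold and predicted map paragraph ID -> list of target IDs.
    """
    tp = fp = fn = exact = 0
    for para_id, gold_ids in gold.items():
        g = set(gold_ids)
        p = set(predicted.get(para_id, []))
        tp += len(g & p)   # links predicted and present in the reference
        fp += len(p - g)   # links predicted but not in the reference
        fn += len(g - p)   # reference links that were missed
        exact += int(g == p)  # two empty sets also count as a match
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    exact_match_rate = exact / len(gold) if gold else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match_rate": exact_match_rate,
    }


# Made-up example: one partial hit, one empty match, one wrong link.
gold = {"a": ["x", "y"], "b": [], "c": ["z"]}
predicted = {"a": ["x"], "b": [], "c": ["w"]}
metrics = evaluate(gold, predicted)
print(metrics)
```

Under this definition, a paragraph with no references and no predictions counts as an exact match, so the exact match rate can sit well above the link-level F1.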

On the test set, the generated prediction file contains 298 paragraphs, with predicted references for 68 of them.

