e181337/cross_reference_linker

Cross Reference Linker

This repository contains a small rule-based system for detecting internal references in regulatory text and resolving them to paragraph IDs within the same document.

The goal was not to build a heavy NLP pipeline, but to put together a working and readable prototype that can:

  • identify structural items such as paragraphs, annexes, appendices, and figures
  • extract internal references from paragraph text
  • match those references to the correct target IDs in the same JSON document

Approach

The solution is organized as a simple pipeline:

  1. parser.py: Parses each paragraph, detects its record type, extracts the label, and separates the body text.

  2. indexer.py: Builds lookup tables from the parsed document so references can later be resolved to paragraph IDs.

  3. extractor.py: Uses regex patterns to find references such as:

       • paragraph 2.4
       • paragraphs 8.24.5.1.5 and 8.24.5.1.8
       • Annex 3
       • Appendix 2 of Annex 6
       • Figure 1

  4. resolver.py: Maps extracted references to the matching IDs using the lookup tables.

  5. pipeline.py: Runs the full end-to-end prediction flow and writes the output JSON.

  6. evaluation.py: Runs the pipeline on the reference dataset and computes evaluation metrics.
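The extraction and resolution steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the regex patterns, the function names (extract_references, build_index, resolve), and the (kind, label) key format are all assumptions made for the example, and only paragraph labels are indexed here.

```python
import re

# Hypothetical patterns; the repository's actual regexes may differ.
NUMBER = r"\d+(?:\.\d+)*"
PARAGRAPH_RE = re.compile(
    rf"\bparagraphs?\s+({NUMBER}(?:\s*(?:,|and|or)\s*{NUMBER})*)", re.IGNORECASE
)
APPENDIX_RE = re.compile(r"\bAppendix\s+(\d+)\s+(?:of|to)\s+Annex\s+(\d+)", re.IGNORECASE)
ANNEX_RE = re.compile(r"\bAnnex\s+(\d+)", re.IGNORECASE)
FIGURE_RE = re.compile(r"\bFigure\s+(\d+)", re.IGNORECASE)
# Leading label of a paragraph, e.g. "3.2.1." at the start of the text.
LABEL_RE = re.compile(rf"^\s*({NUMBER})\.?\s")


def extract_references(text):
    """Return (kind, label) tuples for each internal reference found in text."""
    refs = []
    for match in PARAGRAPH_RE.finditer(text):
        # One mention may list several numbers ("paragraphs 1.2 and 1.3").
        for number in re.findall(NUMBER, match.group(1)):
            refs.append(("paragraph", number))
    for match in APPENDIX_RE.finditer(text):
        refs.append(("appendix", f"{match.group(1)} of annex {match.group(2)}"))
    # Blank out appendix mentions so their "Annex N" part is not counted twice.
    remaining = APPENDIX_RE.sub(" ", text)
    for match in ANNEX_RE.finditer(remaining):
        refs.append(("annex", match.group(1)))
    for match in FIGURE_RE.finditer(text):
        refs.append(("figure", match.group(1)))
    return refs


def build_index(paragraphs):
    """Map ("paragraph", label) to the ID of the first paragraph with that label."""
    index = {}
    for paragraph in paragraphs:
        match = LABEL_RE.match(paragraph["text"])
        if match:
            index.setdefault(("paragraph", match.group(1)), paragraph["id"])
    return index


def resolve(refs, index):
    """Look up each reference, dropping unknown targets and duplicates."""
    target_ids = []
    for ref in refs:
        target_id = index.get(ref)
        if target_id is not None and target_id not in target_ids:
            target_ids.append(target_id)
    return target_ids
```

In this sketch, references with no entry in the lookup table are silently dropped, which favours precision over recall. Whether the actual resolver.py does the same is not stated here.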

Repository Structure

Input Format

Both datasets follow the same structure:

{
  "documentVersionKey": "...",
  "documentVersionId": "...",
  "rootRegion": "...",
  "region": "...",
  "paragraphLinks": [
    {
      "text": "3.2.1. ... paragraph 2.4. ...",
      "id": "paragraph-id-1",
      "targetIds": []
    }
  ]
}

The pipeline returns the same structure, but with predicted values in targetIds.
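As a rough illustration of that input/output contract, the sketch below copies a document and fills in targetIds from a mapping of paragraph ID to predicted target IDs. The function name with_predictions and the shape of the predictions mapping are assumptions for this example, not the repository's API.

```python
import copy


def with_predictions(document, predictions):
    """Return a copy of document with targetIds filled from predictions.

    predictions is a hypothetical mapping of paragraph ID to a list of
    predicted target IDs; paragraphs missing from it get an empty list.
    """
    result = copy.deepcopy(document)
    for link in result["paragraphLinks"]:
        link["targetIds"] = list(predictions.get(link["id"], []))
    return result
```

The result can then be serialized with json.dump. All top-level fields, such as documentVersionKey and region, are carried over unchanged.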

How To Run

Run evaluation on the reference dataset:

python3 -m src.evaluation

Generate predictions for the test dataset:

python3 -m src.pipeline

This writes the output file to:

Current Result

On the reference dataset, the current implementation produces:

  • precision: 0.50
  • recall: 0.476
  • f1: 0.4875
  • exact match rate: 0.9543
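The README does not state how these metrics are defined. One common reading, assumed in the sketch below, is micro-averaged precision, recall, and F1 over (paragraph, target) pairs, with exact match rate as the share of paragraphs whose predicted target set equals the reference set. The function name evaluate and its inputs are hypothetical.

```python
def evaluate(gold, predicted):
    """Compute micro-averaged link metrics and the per-paragraph exact match rate.

    gold and predicted map paragraph IDs to lists of target IDs (assumed shape).
    """
    true_pos = false_pos = false_neg = exact = 0
    for paragraph_id, gold_ids in gold.items():
        gold_set = set(gold_ids)
        pred_set = set(predicted.get(paragraph_id, []))
        true_pos += len(gold_set & pred_set)
        false_pos += len(pred_set - gold_set)
        false_neg += len(gold_set - pred_set)
        exact += gold_set == pred_set
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    exact_match_rate = exact / len(gold) if gold else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match_rate": exact_match_rate,
    }
```

Under these definitions, paragraphs with no references count as exact matches whenever the prediction is also empty, so the exact match rate can be much higher than the link-level F1.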

On the test set, the generated prediction file contains 298 paragraphs, with predicted references for 68 of them.
