Evaluating LLMs for Character Identification in Italian Renaissance Epics

This repository contains the code and resources accompanying the paper:

“Evaluating Large Language Models for Character Identification in Italian Renaissance Epics: A Case Study on Orlando Furioso”, submitted to the Ninth International Workshop on Narrative Extraction from Texts (Text2Story’26).

In this study, we evaluate how well instruction-tuned Large Language Models can perform character identification in a low-resource narrative domain: 16th-century Italian Renaissance epic poetry in ottava rima, focusing on the first twelve cantos of Orlando Furioso as a case study.

Data Source

The XML edition of Orlando Furioso was obtained from Biblioteca Italiana.

Ground truth dataset

The manually curated ground truth for Cantos I–XII of Orlando Furioso is available at data/GroundTruth_OF_1-12.csv. The CSV is a stanza-level list of character occurrences with columns Canto (canto number), Ottava (stanza number), character (short character identifier) and name (canonical character name).
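For illustration, the stanza-level occurrences can be grouped into gold sets with the standard library. The two sample rows below are hypothetical; only the column layout follows the description above:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample mirroring the documented columns of
# data/GroundTruth_OF_1-12.csv (Canto, Ottava, character, name).
sample = io.StringIO(
    "Canto,Ottava,character,name\n"
    "1,1,orlando,Orlando\n"
    "1,1,angelica,Angelica\n"
)

# Group character occurrences by (canto, stanza) for stanza-level evaluation.
gold = defaultdict(set)
for row in csv.DictReader(sample):
    key = (int(row["Canto"]), int(row["Ottava"]))
    gold[key].add(row["name"])

print(gold[(1, 1)])  # {'Orlando', 'Angelica'} (set order may vary)
```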

Experiments

We assess:

NER baselines: SpaCy CNN NER (it_core_news_lg) and Transformer NER (osiria/bert-italian-cased-ner). Their outputs are available under output/NER/.

LLMs: Llama-3.1-8B-Instruct, Mixtral-8x7B-Instruct, Llama-3.1-70B-Instruct and Qwen-2.5-72B-Instruct. Their outputs are available under output/baseline/.
Note: In the repository, the term baseline in files and folders refers to the main LLM-based extraction system used in the paper. This naming choice reflects our intention to introduce more advanced extraction methods in future developments.

The poem is processed by splitting cantos into batches of n stanzas, testing batch sizes n ∈ {4, 8, 12} and memory conditions (with vs. without narrative memory injection). Narrative memory is implemented as 2–3-sentence summaries of the previous stanza batch, generated by Llama-3.1-70B-Instruct and Qwen-2.5-72B-Instruct with a fixed batch size of n = 4. Summaries are available under output/summarization/.
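The batching scheme can be sketched as follows; the helper name is ours, not the repository's:

```python
def make_batches(stanzas, n):
    """Split an ordered list of stanzas into consecutive batches of size n
    (the last batch may be shorter)."""
    return [stanzas[i:i + n] for i in range(0, len(stanzas), n)]

stanzas = list(range(1, 11))   # stanza numbers 1..10 of one canto
batches = make_batches(stanzas, 4)
print(batches)                 # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```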

Prompts

The prompt template for narrative memory generation is provided below.

[Image: prompt template for narrative memory generation]

The prompt template for character extraction is provided below. The block highlighted in blue is included only when providing the preceding batch’s summary.

[Image: prompt template for character extraction]

Evaluation

The outputs for each experimental configuration are first normalized (outputs under output/normalization/) and then evaluated against the ground truth. For each configuration, evaluation results are stored in a JSON file within a dedicated subfolder under evaluation/.

Quantitative Results

Figure 1 reports F1 scores across Cantos I–XII. Column 1 compares NER baselines with the top-performing configurations (with n = 8 and Qwen summaries). Column 2 illustrates the impact of Llama memory injection at n = 4, while Column 3 shows scaling effects for n = 8 and n = 12 with and without memory. Dashed and dotted lines indicate the best NER baseline (bert-italian-ner) and the overall best configuration (Qwen-2.5-72B from the top left panel).

[Figure 1]

Figure 2 shows the distribution of TP, FP, and FN by canto for the optimal configuration (Qwen-2.5-72B, n=8, with Qwen summaries), with F1-score trend.

[Figure 2]

Analysis

The analysis/ folder contains the Jupyter notebooks used for the narrative memory assessment and the qualitative analyses reported in the paper.

The main notebooks are:

analysis_results.ipynb – Loads the normalized extraction outputs and produces detailed performance reports, including TP/FP/FN counts and Precision, Recall, and F1-score at stanza, canto, and global level. It also supports character-level inspection.

narrative_memory_impact.ipynb – Investigates the role of narrative memory injection by examining how performance varies depending on when a character is mentioned within a batch (e.g., before or after the first proper-name mention, or when no onomastic mention appears in the batch).

TP_hallucinated_mentions.ipynb – Verifies whether the extracted textual mention associated with correctly identified characters (True Positives) is actually present in the original stanza text, highlighting cases where the character is correctly named but the mention itself is hallucinated.

Additional subfolders (narrative_memory_impact/, tp_hallucinated_mentions/) contain auxiliary notebooks, scripts, and intermediate CSV files used to generate the datasets required for the analyses presented in the main notebooks.

Quick Start

To reproduce the main results:

pip install -r requirements.txt
python run_baseline.py
python run_normalizer.py
python run_evaluator.py

For full details and configuration options, see the Reproducibility section below (expanding "Full Reproduction Pipeline").


Repository Organization

.
├── analysis/                          
│   ├── narrative_memory_impact/       # Auxiliary scripts 
│   ├── tp_hallucinated_mentions/      # Auxiliary scripts
│   ├── analysis_results.ipynb         # Detailed evaluation breakdown
│   ├── narrative_memory_impact.ipynb  # Narrative memory impact
│   └── TP_hallucinated_mentions.ipynb # Analysis of textual mentions for TPs
├── config/
│   └── settings.yaml                  # Configuration file
├── data/
│   ├── GroundTruth_OF_1-12.csv        # Ground truth annotations
│   ├── orlando_furioso.json           # Structured poem (Cantos I–XII)
│   └── text_preprocessing.ipynb       # Converts poem XML into JSON
├── evaluation/                        # Evaluation results and plots
├── output/                            
│   ├── baseline/                      # LLM extraction outputs
│   ├── NER/                           # Classical NER baseline outputs
│   ├── normalization/                 # Normalized extraction files
│   └── summarization/                 # Narrative memory summaries
├── src/
│   ├── baseline.py                    # LLM character extraction
│   ├── evaluator.py                   # Evaluation and plots
│   ├── normalizer.py                  # Normalization
│   └── summarizer.py                  # LLM narrative summarization
├── utils/
│   ├── llm_client.py                  # HuggingFace LLM wrapper
│   ├── schemas.py                     # Pydantic output schemas
│   └── utils.py                       # Batching, logging, helpers
├── requirements.txt
├── run_baseline.py                    # Entry point: LLM character extraction
├── run_evaluator.py                   # Entry point: evaluation
├── run_ner.py                         # Entry point: NER baselines
├── run_normalizer.py                  # Entry point: normalization
└── run_summarizer.py                  # Entry point: narrative memory generation

Reproducibility

File Naming Convention


All summarization and extraction outputs follow a standardized naming convention. Each run is identified by a run_id with the following structure: <model>_C<c_start>-O<s_start>_to_C<c_end>-O<s_end>_batch<n>_<mode>[_wMemLlama]

Where:

<model> = model name

C<c_start>-O<s_start> = starting canto and stanza (ottava)

to_C<c_end>-O<s_end> = ending canto and stanza (ottava)

batch<n> = batch size

<mode> = pipeline mode (e.g., summarization or baseline)

_wMemLlama (or _wMemQwen) = optional suffix indicating memory injection was enabled (specifying the model that generated summaries)
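The convention above can be assembled programmatically; the helper below is a hypothetical sketch, not part of the repository:

```python
def build_run_id(model, c_start, s_start, c_end, s_end, n, mode, mem_model=None):
    """Assemble a run_id following the documented naming convention.
    mem_model ("Llama" or "Qwen") adds the optional memory suffix."""
    run_id = (f"{model}_C{c_start}-O{s_start}_to_C{c_end}-O{s_end}"
              f"_batch{n}_{mode}")
    if mem_model:
        run_id += f"_wMem{mem_model}"
    return run_id

print(build_run_id("Qwen2.5-72B-Instruct", 1, 1, 12, 94, 8,
                   "baseline", mem_model="Qwen"))
# Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch8_baseline_wMemQwen
```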

Configuration


The experiments are controlled through:

config/settings.yaml

Important parameters are:

Pipeline
  • min_canto, min_stanza, max_canto, max_stanza: poem span selection
  • batch_size: number of stanzas processed in one batch
  • use_memory: whether previous summaries are injected as contextual memory
LLM
  • model: HuggingFace-compatible instruction model
  • temperature: decoding temperature (default 0.1)
Normalization
  • similarity_threshold: string similarity for clustering name variants
Evaluation
  • similarity_threshold: fuzzy match threshold against ground truth

Full Reproduction Pipeline

To reproduce results reported in our paper:

  1. Install dependencies:

     pip install -r requirements.txt

  2. Run the NER baselines:

     python run_ner.py

  3. Generate summaries (if memory is used):

     python run_summarizer.py

  4. Run character extraction:

     python run_baseline.py

  5. Normalize the output:

     python run_normalizer.py

  6. Evaluate:

     python run_evaluator.py

Ensure:

  • The same batch sizes as in the paper (batch_size = 4 for summarization and batch_size ∈ {4, 8, 12} for character extraction)
  • The same similarity thresholds (0.83)

Below, a detailed breakdown is provided for each step.

1. NER Baselines


This module provides non-LLM baselines for character extraction using standard Named Entity Recognition (NER) systems trained on contemporary Italian:

  1. SpaCy: it_core_news_lg
  2. BERT NER: osiria/bert-italian-cased-ner

Entities labeled as PER (person) are extracted stanza-wise from the poem.
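The PER filter can be sketched as below. The Ent tuple is a stand-in for spaCy's entity spans; a real run would load the model with spacy.load and iterate over nlp(stanza).ents:

```python
from collections import namedtuple

# Duck-typed stand-in for spaCy entity spans (text, label_).
Ent = namedtuple("Ent", ["text", "label_"])

def extract_persons(ents):
    """Keep only entities tagged PER, as done for both NER baselines."""
    return [e.text for e in ents if e.label_ == "PER"]

ents = [Ent("Orlando", "PER"), Ent("Parigi", "LOC"), Ent("Angelica", "PER")]
print(extract_persons(ents))  # ['Orlando', 'Angelica']
```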

Reproducing the Results

First, install the SpaCy Italian model:

python -m spacy download it_core_news_lg

Then run:

python run_ner.py

Output

Results are saved in:

output/NER/
    it_core_news_lg_*.csv
    bert-italian-cased-ner_*.csv

Each file contains:

canto | stanza | character_name

Where:

  • canto = canto number

  • stanza = octave number

  • character_name = character detected by the NER model

These files can be directly used as input for:

  • run_normalizer.py

2. Narrative Memory Generation

This module generates short narrative summaries for batches of stanzas from the poem. These summaries can later be used as contextual memory during character extraction.

For each stanza batch (as defined by batch_size in settings.yaml), the system:

  • Receives all stanzas in the current batch and the summary of the previous stanza batch
  • Produces the summary in a structured JSON.

Reproducing the Results

Run:

python run_summarizer.py

The configuration parameters are controlled via config/settings.yaml:

pipeline:
  mode: "summarizer" 
  min_canto: 1
  min_stanza: 1
  max_canto: 12
  max_stanza: 94
  batch_size: 4

llm:
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct"
  temperature: 0.1

Output

Summaries are saved as:

output/summarization/<run_id>.json

Each entry has the following structure:

{
  "canto": int,
  "range": "start-end",
  "summary": "..."
}

Where:

canto = canto number

range = stanza range covered in the batch (e.g., "1-4")

summary = generated narrative summary
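A summary file in this format can be indexed for later memory injection, for example keyed by (canto, range). The entries below are hypothetical:

```python
import json

# Hypothetical excerpt in the documented entry format.
raw = json.dumps([
    {"canto": 1, "range": "1-4", "summary": "Angelica flees the camp."},
    {"canto": 1, "range": "5-8", "summary": "Rinaldo pursues her."},
])

# Index summaries so the extractor can fetch the previous batch's memory.
memory = {(e["canto"], e["range"]): e["summary"] for e in json.loads(raw)}
print(memory[(1, "1-4")])  # Angelica flees the camp.
```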

These files can be used as input for character extraction by inserting the corresponding file path in config/settings.yaml. For example:

pipeline:
  summary_file: "output/summarization/Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch4_summarizer.json"

3. Character Extraction

This module implements the main LLM-based character extraction system described in the paper.

The poem is processed in batches of stanzas (controlled by batch_size in settings.yaml). For each batch, the selected instruction-tuned LLM receives:

  • All stanzas in the current batch

  • (Optionally) a contextual summary of the preceding stanza batch, if memory is enabled

The model is prompted to extract fictional characters appearing in each stanza and to return structured JSON output.
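Assuming a reply shaped like the hypothetical JSON below (the actual schema is enforced via the Pydantic models in utils/schemas.py and may use different field names), flattening it into output rows looks like:

```python
import json

# Hypothetical structured reply for one batch.
reply = json.dumps({
    "stanzas": [
        {"stanza": 1,
         "characters": [{"character_name": "Orlando",
                         "textual_mention": "il conte Orlando"}]}
    ]
})

def to_rows(canto, reply_json):
    """Flatten a batch reply into (canto, stanza, name, mention) rows."""
    rows = []
    for s in json.loads(reply_json)["stanzas"]:
        for ch in s["characters"]:
            rows.append((canto, s["stanza"],
                         ch["character_name"], ch["textual_mention"]))
    return rows

print(to_rows(1, reply))
# [(1, 1, 'Orlando', 'il conte Orlando')]
```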

Reproducing the Results

Run:

python run_baseline.py

Extraction behavior is controlled via config/settings.yaml:

pipeline:
  mode: "baseline"
  summary_file: "output/summarization/Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch4_summarizer.json"
  min_canto: 1
  min_stanza: 1
  max_canto: 12
  max_stanza: 94
  batch_size: 4
  use_memory: true

llm:
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct"
  temperature: 0.1

If use_memory: true, the summaries generated by run_summarizer.py and stored in the specified file path (summary_file) are loaded and injected into the prompt as contextual memory.
If use_memory: false, extraction operates without cross-batch context.

Large models (e.g., 70B, Qwen, Mixtral) are automatically loaded with 4-bit quantization.

Output

Results are saved as:

output/baseline/<run_id>.csv

Each row contains:

canto | stanza | character_name | textual_mention

Where:

  • canto = canto number

  • stanza = octave number

  • character_name = canonical character name returned by the model

  • textual_mention = excerpt mentioning the character found in the stanza

These files can be used as input for:

  • run_normalizer.py

A detailed JSONL log of prompts and model outputs is also stored in:

output/baseline/logs/

4. Normalization

This module performs post-processing normalization of character names extracted by LLMs and NER. Its purpose is to merge near-duplicate character names produced by the model.

The normalization procedure:

  • Standardizes extracted names (uppercase, removal of articles/prepositions, punctuation cleanup)
  • Computes pairwise string similarity
  • Groups names whose similarity exceeds a configurable threshold
  • Retains the most frequent variant as the canonical form
  • Replaces all clustered variants with the retained canonical name

This process is purely string-based and does not use contextual information or external knowledge.

String Similarity

String similarity is computed using Python’s difflib.SequenceMatcher, which implements the Ratcliff/Obershelp (Gestalt pattern matching) algorithm. Similarity is calculated as:

similarity = 2 * M / T

where M is the number of matching characters and T is the total number of characters across both strings.
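Under these definitions, SequenceMatcher.ratio() returns exactly 2M/T. The clustering step can then be sketched as a greedy merge toward the most frequent variant; this is a simplified illustration, not necessarily identical to src/normalizer.py:

```python
from collections import Counter
from difflib import SequenceMatcher

def normalize_names(names, threshold=0.83):
    """Map each extracted name to the most frequent variant whose
    similarity exceeds the threshold (simplified greedy clustering)."""
    counts = Counter(names)
    canonical = {}   # variant -> canonical form
    clusters = []    # canonical forms, most frequent first
    for name, _ in counts.most_common():
        match = next((c for c in clusters
                      if SequenceMatcher(None, name, c).ratio() >= threshold),
                     None)
        if match is None:
            clusters.append(name)
            match = name
        canonical[name] = match
    return [canonical[n] for n in names]

# ratio = 2M/T: "ORLANDO" vs "ORLANDO," share M = 7 chars of T = 15 -> 14/15
names = ["ORLANDO", "ORLANDO", "ORLANDO,", "ANGELICA"]
print(normalize_names(names))  # ['ORLANDO', 'ORLANDO', 'ORLANDO', 'ANGELICA']
```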

Reproducing the Results

Run:

python run_normalizer.py

The input file and similarity threshold are controlled via config/settings.yaml:

normalization:
  similarity_threshold: 0.83
  target_path: "output/baseline/Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch8_baseline_wMemQwen.csv"

Where:

  • target_path = extraction CSV to normalize (it can be chosen among the files saved under output/NER/ and under output/baseline/)

  • similarity_threshold = minimum similarity score (between 0 and 1) required to merge two names

Output

The normalized file is saved as:

output/normalization/<run_id>_normalized.csv

The structure of the file remains:

canto | stanza | character_name | textual_mention

Only the character_name field may be modified.

These files can be used as input for:

  • run_evaluator.py

A detailed log of all merge operations is saved in:

output/normalization/logs/log_<run_id>.json

5. Evaluation

This module evaluates extracted character mentions against our manually curated ground truth for Orlando Furioso (Cantos I–XII): data/GroundTruth_OF_1-12.csv.

Evaluation operates at the stanza level. For each stanza, predicted character names are compared against annotated gold-standard names.

String Similarity

Character matching is performed using fuzzy string similarity, based on the Ratcliff/Obershelp algorithm (as explained for the normalization step).
A predicted name is considered a True Positive if its similarity with a ground-truth name exceeds the configurable threshold.
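A simplified greedy version of this stanza-level matching, where each gold name can be consumed by at most one prediction (the repository's evaluator may differ in tie-breaking):

```python
from difflib import SequenceMatcher

def match_stanza(predicted, gold, threshold=0.83):
    """Greedily match predicted names to gold names by fuzzy similarity;
    returns (TP, FP, FN) counts for one stanza."""
    remaining = list(gold)
    tp = 0
    for name in predicted:
        best = max(remaining,
                   key=lambda g: SequenceMatcher(None, name, g).ratio(),
                   default=None)
        if best and SequenceMatcher(None, name, best).ratio() >= threshold:
            remaining.remove(best)   # each gold name matched at most once
            tp += 1
    return tp, len(predicted) - tp, len(remaining)

print(match_stanza(["ORLANDO", "CARLO"], ["ORLANDO", "ANGELICA"]))
# (1, 1, 1)
```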

Metrics

The evaluator computes:

  • Precision

  • Recall

  • F1-score

  • True Positives (TP)

  • False Positives (FP)

  • False Negatives (FN)

  • True Negatives (TN) (defined at the stanza level, i.e., stanzas where the model correctly identifies no character mentions)
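From the raw counts, the first three metrics follow the standard definitions:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf1(tp=3, fp=1, fn=1))  # (0.75, 0.75, 0.75)
```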

Reproducing the Results

Run:

python run_evaluator.py

The input file and similarity threshold are controlled via config/settings.yaml:

evaluation:
  similarity_threshold: 0.83
  target_path: "output/normalization/Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch8_baseline_wMemQwen_normalized.csv"

Where:

  • target_path = any CSV file saved under output/normalization/.

Output

Evaluation results are saved in:

evaluation/<run_id>_normalized/

The main result file is:

eval_<run_id>_normalized.json

This JSON file contains:

  • Global metrics

  • Raw counts

  • Per-stanza detailed results

  • Evaluation configuration parameters

Generated visualizations

Plots are automatically generated and saved under:

evaluation/<run_id>_normalized/plots/

The following visualizations are produced:

  • Confusion matrix
  • Character-level recall
  • Top False Positives
  • Top False Negatives
  • Per-canto F1
  • Combined TP/FP/FN per canto
  • Stanza-level performance barcode

Citation

If using this repository, please cite our Text2Story 2026 paper (citation details to be updated upon publication).

License


This work is licensed under a Creative Commons Attribution 4.0 International License.


About

Code and accompanying material for our submission to Text2Story 2026.
