This repository contains the code and resources accompanying the paper:
“Evaluating Large Language Models for Character Identification in Italian Renaissance Epics: A Case Study on Orlando Furioso”, submitted to the Ninth International Workshop on Narrative Extraction from Texts (Text2Story’26).
In this study, we evaluate how well instruction-tuned Large Language Models can perform character identification in a low-resource narrative domain: 16th-century Italian Renaissance epic poetry in ottava rima, focusing on the first twelve cantos of Orlando Furioso as a case study.
The XML edition of Orlando Furioso was obtained from Biblioteca Italiana.
The manually curated ground truth for Cantos I–XII of Orlando Furioso is available at data/GroundTruth_OF_1-12.csv. The CSV is a stanza-level list of character occurrences with columns Canto (canto number), Ottava (stanza number), character (short character identifier) and name (canonical character name).
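For reference, the ground truth can be read with Python's standard csv module. This is a minimal sketch (the function name `load_ground_truth` is illustrative, not part of the repository); it assumes only the column layout described above:

```python
import csv

def load_ground_truth(path):
    """Read the stanza-level ground truth into a list of dicts.

    Expected columns: Canto, Ottava, character, name.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {
                "canto": int(row["Canto"]),
                "stanza": int(row["Ottava"]),
                "character": row["character"],
                "name": row["name"],
            }
            for row in csv.DictReader(f)
        ]

# e.g. rows = load_ground_truth("data/GroundTruth_OF_1-12.csv")
```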
We assess:
NER baselines: SpaCy CNN NER (it_core_news_lg) and Transformer NER (osiria/bert-italian-cased-ner). Their outputs are available under output/NER/.
LLMs: Llama-3.1-8B-Instruct, Mixtral-8x7B-Instruct, Llama-3.1-70B-Instruct and Qwen-2.5-72B-Instruct. Their outputs are available under output/baseline/.
Note: in the repository, the term baseline in file and folder names refers to the main LLM-based extraction system used in the paper; the name anticipates more advanced extraction methods planned for future work.
The poem is processed by splitting cantos into batches of n stanzas, testing batch sizes (n ∈ {4, 8, 12}) and memory condition (with vs. without narrative memory injection). Narrative memory is implemented as 2–3 sentence summaries of the previous stanza batch, which are generated by Llama-3.1-70B-Instruct and Qwen-2.5-72B-Instruct with a fixed batch size of n = 4. Summaries are available under output/summarization/.
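The batching scheme can be sketched as follows (a simplified illustration; `batch_stanzas` is a hypothetical helper, and `stanzas` is assumed to be the ordered list of stanzas of one canto):

```python
def batch_stanzas(stanzas, n):
    """Split a canto's stanzas into consecutive batches of at most n stanzas."""
    return [stanzas[i:i + n] for i in range(0, len(stanzas), n)]

# e.g. with n = 4, a 10-stanza canto yields batches of sizes 4, 4 and 2
```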
The prompt template for narrative memory generation is provided below.
The prompt template for character extraction is provided below. The block highlighted in blue is included only when providing the preceding batch’s summary.
The outputs for each experimental configuration are first normalized (outputs under output/normalization/) and then evaluated against the ground truth. For each configuration, evaluation results are stored in a JSON file within a dedicated subfolder under evaluation/.
Figure 1 reports F1 scores across Cantos I–XII. Column 1 compares NER baselines with the top-performing configurations (with n = 8 and Qwen summaries). Column 2 illustrates the impact of Llama memory injection at n = 4, while Column 3 shows scaling effects for n = 8 and n = 12 with and without memory. Dashed and dotted lines indicate the best NER baseline (bert-italian-ner) and the overall best configuration (Qwen-2.5-72B from the top left panel).
Figure 2 shows the distribution of TP, FP, and FN by canto for the optimal configuration (Qwen-2.5-72B, n=8, with Qwen summaries), with F1-score trend.
The analysis/ folder contains the Jupyter notebooks used for the narrative memory assessment and the qualitative analyses reported in the paper.
The main notebooks are:
analysis_results.ipynb – Loads the normalized extraction outputs and produces detailed performance reports, including TP/FP/FN counts and Precision, Recall, and F1-score at stanza, canto, and global level. It also supports character-level inspection.
narrative_memory_impact.ipynb – Investigates the role of narrative memory injection by examining how performance varies depending on when a character is mentioned within a batch (e.g., before or after the first proper-name mention, or when no onomastic mention appears in the batch).
TP_hallucinated_mentions.ipynb – Verifies whether the extracted textual mention associated with correctly identified characters (True Positives) is actually present in the original stanza text, highlighting cases where the character is correctly named but the mention itself is hallucinated.
Additional subfolders (narrative_memory_impact/, tp_hallucinated_mentions/) contain auxiliary notebooks, scripts, and intermediate CSV files used to generate the datasets required for the analyses presented in the main notebooks.
To reproduce the main results:
```
pip install -r requirements.txt
python run_baseline.py
python run_normalizer.py
python run_evaluator.py
```

For full details and configuration options, see the Reproducibility section below (expanding "Full Reproduction Pipeline").
```
.
├── analysis/
│   ├── narrative_memory_impact/       # Auxiliary scripts
│   ├── tp_hallucinated_mentions/      # Auxiliary scripts
│   ├── analysis_results.ipynb         # Detailed evaluation breakdown
│   ├── narrative_memory_impact.ipynb  # Narrative memory impact
│   └── TP_hallucinated_mentions.ipynb # Analysis of textual mentions for TPs
├── config/
│   └── settings.yaml                  # Configuration file
├── data/
│   ├── GroundTruth_OF_1-12.csv        # Ground truth annotations
│   ├── orlando_furioso.json           # Structured poem (Cantos I–XII)
│   └── text_preprocessing.ipynb       # Converts poem XML into JSON
├── evaluation/                        # Evaluation results and plots
├── output/
│   ├── baseline/                      # LLM extraction outputs
│   ├── NER/                           # Classical NER baseline outputs
│   ├── normalization/                 # Normalized extraction files
│   └── summarization/                 # Narrative memory summaries
├── src/
│   ├── baseline.py                    # LLM character extraction
│   ├── evaluator.py                   # Evaluation and plots
│   ├── normalizer.py                  # Normalization
│   └── summarizer.py                  # LLM narrative summarization
├── utils/
│   ├── llm_client.py                  # HuggingFace LLM wrapper
│   ├── schemas.py                     # Pydantic output schemas
│   └── utils.py                       # Batching, logging, helpers
├── requirements.txt
├── run_baseline.py                    # Entry point: LLM character extraction
├── run_evaluator.py                   # Entry point: evaluation
├── run_ner.py                         # Entry point: NER baselines
├── run_normalizer.py                  # Entry point: normalization
└── run_summarizer.py                  # Entry point: narrative memory generation
```
File Naming Convention
All summarization and extraction outputs follow a standardized naming convention. Each run is identified by a run_id with the following structure:
<model>_C<c_start>-O<s_start>_to_C<c_end>-O<s_end>_batch<n>_<mode>[_wMemLlama]
Where:
- <model> = model name
- C<c_start>-O<s_start> = starting canto and stanza (ottava)
- to_C<c_end>-O<s_end> = ending canto and stanza (ottava)
- batch<n> = batch size
- <mode> = pipeline mode (e.g., summarization or baseline)
- _wMemLlama (or _wMemQwen) = optional suffix indicating that memory injection was enabled, naming the model that generated the summaries
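As an illustration, a run_id following this convention can be decomposed with a regular expression (a sketch; `parse_run_id` is not part of the repository code):

```python
import re

RUN_ID_PATTERN = re.compile(
    r"(?P<model>.+)"
    r"_C(?P<c_start>\d+)-O(?P<s_start>\d+)"
    r"_to_C(?P<c_end>\d+)-O(?P<s_end>\d+)"
    r"_batch(?P<n>\d+)"
    r"_(?P<mode>[a-z]+)"
    r"(?P<mem>_wMem\w+)?$"
)

def parse_run_id(run_id):
    """Return the run_id components as a dict, or None if it does not match."""
    m = RUN_ID_PATTERN.match(run_id)
    return m.groupdict() if m else None

# e.g. parse_run_id("Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch8_baseline_wMemQwen")
```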
Configuration
The experiments are controlled through:
config/settings.yaml
Important parameters are:
- min_canto, min_stanza, max_canto, max_stanza: poem span selection
- batch_size: number of stanzas processed in one batch
- use_memory: whether previous summaries are injected as contextual memory
- model: HuggingFace-compatible instruction model
- temperature: decoding temperature (default 0.1)
- similarity_threshold (normalization): string similarity for clustering name variants
- similarity_threshold (evaluation): fuzzy match threshold against ground truth
Full Reproduction Pipeline
To reproduce results reported in our paper:
1. Install dependencies:
   pip install -r requirements.txt
2. Run NER:
   python run_ner.py
3. Generate summaries (if memory is used):
   python run_summarizer.py
4. Run character extraction:
   python run_baseline.py
5. Normalize output:
   python run_normalizer.py
6. Evaluate:
   python run_evaluator.py
Ensure:
- the same batch sizes as in the paper (batch_size = 4 for summarization and batch_size ∈ {4, 8, 12} for character extraction)
- the same similarity thresholds (0.83)
A detailed breakdown of each step is provided below.
1. NER Baselines
This module provides non-LLM baselines for character extraction using standard Named Entity Recognition (NER) systems trained on contemporary Italian:
- SpaCy – it_core_news_lg
- BERT NER – osiria/bert-italian-cased-ner
Entities labeled as PER (person) are extracted stanza-wise from the poem.
First, install the SpaCy Italian model:
python -m spacy download it_core_news_lg
Then run:
python run_ner.py
Results are saved in:
output/NER/
it_core_news_lg_*.csv
bert-italian-cased-ner_*.csv
Each file contains:
canto | stanza | character_name
Where:
- canto = canto number
- stanza = octave number
- character_name = character detected by the NER model
These files can be directly used as input for:
run_normalizer.py
2. Narrative Memory Generation
This module generates short narrative summaries for batches of stanzas from the poem. These summaries can later be used as contextual memory during character extraction.
For each stanza batch (as defined by batch_size in settings.yaml), the system:
- Receives all stanzas in the current batch and the summary of the previous stanza batch
- Produces the summary in a structured JSON.
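The rolling-summary loop sketched above can be illustrated as follows (`rolling_summaries` is a hypothetical helper, and `summarize` is a placeholder for the actual LLM call):

```python
def rolling_summaries(batches, summarize):
    """Generate one summary per stanza batch, feeding each call the summary
    of the previous batch as context.

    `summarize(batch, previous_summary)` stands in for the LLM call and is
    expected to return a short narrative summary string.
    """
    summaries = []
    previous = None  # the first batch has no preceding summary
    for batch in batches:
        previous = summarize(batch, previous)
        summaries.append(previous)
    return summaries
```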
Run:
python run_summarizer.py
The configuration parameters are controlled via config/settings.yaml:
```yaml
pipeline:
  mode: "summarizer"
  min_canto: 1
  min_stanza: 1
  max_canto: 12
  max_stanza: 94
  batch_size: 4
llm:
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct"
  temperature: 0.1
```
Summaries are saved as:
output/summarization/<run_id>.json
Each entry has the following structure:
```
{
  "canto": int,
  "range": "start-end",
  "summary": "..."
}
```
Where:
- canto = canto number
- range = stanza range covered in the batch (e.g., "1-4")
- summary = generated narrative summary
These files can be used as input for character extraction by inserting the corresponding file path in config/settings.yaml. For example:
```yaml
pipeline:
  summary_file: "output/summarization/Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch4_summarizer.json"
```
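Assuming the entry structure shown above and that the summary file is a JSON array of such entries, the file can be indexed for lookup during extraction (an illustrative sketch; `load_summary_index` is not part of the repository code):

```python
import json

def load_summary_index(path):
    """Map (canto, stanza_range) -> summary text, for memory injection."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return {(e["canto"], e["range"]): e["summary"] for e in entries}

# e.g. index[(1, "1-4")] would return the summary of Canto I, stanzas 1-4
```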
3. Character Extraction
This module implements the main LLM-based character extraction system described in the paper.
The poem is processed in batches of stanzas (controlled by batch_size in settings.yaml).
For each batch, the selected instruction-tuned LLM receives:
- All stanzas in the current batch
- (Optionally) a contextual summary of the preceding stanza batch, if memory is enabled
The model is prompted to extract fictional characters appearing in each stanza and to return structured JSON output.
Run:
python run_baseline.py
Extraction behavior is controlled via config/settings.yaml:
```yaml
pipeline:
  mode: "baseline"
  summary_file: "output/summarization/Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch4_summarizer.json"
  min_canto: 1
  min_stanza: 1
  max_canto: 12
  max_stanza: 94
  batch_size: 4
  use_memory: true
llm:
  model: "meta-llama/Meta-Llama-3.1-70B-Instruct"
  temperature: 0.1
```
If use_memory: true, the summaries generated by run_summarizer.py and stored in the specified file path (summary_file) are loaded and injected into the prompt as contextual memory.
If use_memory: false, extraction operates without cross-batch context.
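The conditional injection can be pictured as a simple prompt-assembly step (illustrative only; `build_prompt` and its wording are hypothetical, and the real prompt template is the one shown earlier in this README):

```python
def build_prompt(stanzas, memory_summary=None):
    """Assemble an extraction prompt, optionally prepending the previous
    batch's summary as contextual memory (template simplified here)."""
    parts = []
    if memory_summary is not None:  # corresponds to use_memory: true
        parts.append(f"Context from the preceding stanzas:\n{memory_summary}\n")
    parts.append("Extract the characters appearing in each stanza:")
    parts.extend(stanzas)
    return "\n".join(parts)
```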
Large models (e.g., 70B, Qwen, Mixtral) are automatically loaded with 4-bit quantization.
Results are saved as:
output/baseline/<run_id>.csv
Each row contains:
canto | stanza | character_name | textual_mention
Where:
- canto = canto number
- stanza = octave number
- character_name = canonical character name returned by the model
- textual_mention = excerpt mentioning the character found in the stanza
These files can be used as input for:
run_normalizer.py
A detailed JSONL log of prompts and model outputs is also stored in:
output/baseline/logs/
4. Normalization
This module performs post-processing normalization of character names extracted by LLMs and NER. Its purpose is to merge near-duplicate character names produced by the model.
The normalization procedure:
- Standardizes extracted names (uppercase, removal of articles/prepositions, punctuation cleanup)
- Computes pairwise string similarity
- Groups names whose similarity exceeds a configurable threshold
- Retains the most frequent variant as the canonical form
- Replaces all clustered variants with the retained canonical name
This process is purely string-based and does not use contextual information or external knowledge.
String similarity is computed using Python’s difflib.SequenceMatcher,
which implements the Ratcliff/Obershelp (Gestalt pattern matching) algorithm.
Similarity is calculated as:
similarity = 2 * M / T
where M is the number of matching characters and T is the total number of characters across both strings.
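A minimal sketch of the merge procedure on top of this similarity (a greedy, frequency-ordered clustering; `normalize_names` is illustrative and simplified, not the repository's exact implementation):

```python
from collections import Counter
from difflib import SequenceMatcher

def normalize_names(names, threshold=0.83):
    """Map each extracted name to the most frequent variant among those
    whose pairwise similarity exceeds the threshold."""
    counts = Counter(names)
    canonical = {}  # variant -> canonical form
    # Most frequent names become cluster representatives first.
    for name, _ in counts.most_common():
        for rep in set(canonical.values()):
            if SequenceMatcher(None, name, rep).ratio() >= threshold:
                canonical[name] = rep
                break
        else:
            canonical[name] = name  # no close representative: new cluster
    return [canonical[n] for n in names]

# e.g. "ORLANDO" vs "ORLANDO.": similarity = 2 * 7 / 15 ≈ 0.93 >= 0.83, so merged
```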
Run:
python run_normalizer.py
The input file and similarity threshold are controlled via config/settings.yaml:
```yaml
normalization:
  similarity_threshold: 0.83
  target_path: "output/baseline/Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch8_baseline_wMemQwen.csv"
```
Where:
- target_path = extraction CSV to normalize (any of the files saved under output/NER/ or output/baseline/)
- similarity_threshold = minimum similarity score (between 0 and 1) required to merge two names
The normalized file is saved as:
output/normalization/<run_id>_normalized.csv
The structure of the file remains:
canto | stanza | character_name | textual_mention
Only the character_name field may be modified.
These files can be used as input for:
run_evaluator.py
A detailed log of all merge operations is saved in:
output/normalization/logs/log_<run_id>.json
5. Evaluation
This module evaluates extracted character mentions against our manually curated ground truth for Orlando Furioso (Cantos I–XII): data/GroundTruth_OF_1-12.csv.
Evaluation operates at the stanza level. For each stanza, predicted character names are compared against annotated gold-standard names.
Character matching is performed using fuzzy string similarity, based on the Ratcliff/Obershelp algorithm (as explained for the normalization step).
A predicted name is considered a True Positive if its similarity with a ground-truth name exceeds the configurable threshold.
The evaluator computes:
- Precision
- Recall
- F1-score
- True Positives (TP)
- False Positives (FP)
- False Negatives (FN)
- True Negatives (TN), defined at the stanza level, i.e., stanzas where the model correctly identifies no character mentions
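The stanza-level matching and the derived metrics can be sketched as follows (a simplified greedy matcher; `evaluate_stanza` and `f1` are illustrative helpers, not the repository's exact implementation):

```python
from difflib import SequenceMatcher

def evaluate_stanza(predicted, gold, threshold=0.83):
    """Greedily match predicted names to gold names by fuzzy similarity;
    return (TP, FP, FN) counts for one stanza."""
    remaining = list(gold)
    tp = 0
    for name in predicted:
        match = next(
            (g for g in remaining
             if SequenceMatcher(None, name, g).ratio() >= threshold),
            None,
        )
        if match is not None:
            tp += 1
            remaining.remove(match)  # each gold name is matched at most once
    return tp, len(predicted) - tp, len(remaining)

def f1(tp, fp, fn):
    """F1 from raw counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```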
Run:
python run_evaluator.py
The input file and similarity threshold are controlled via config/settings.yaml:
```yaml
evaluation:
  similarity_threshold: 0.83
  target_path: "output/normalization/Qwen2.5-72B-Instruct_C1-O1_to_C12-O94_batch8_baseline_wMemQwen_normalized.csv"
```
Where:
- target_path = any CSV file saved under output/normalization/
Evaluation results are saved in:
evaluation/<run_id>_normalized/
The main result file is:
eval_<run_id>_normalized.json
This JSON file contains:
- Global metrics
- Raw counts
- Per-stanza detailed results
- Evaluation configuration parameters
Plots are automatically generated and saved under:
evaluation/<run_id>_normalized/plots/
The following visualizations are produced:
- Confusion matrix
- Character-level recall
- Top False Positives
- Top False Negatives
- Per-canto F1
- Combined TP/FP/FN per canto
- Stanza-level performance barcode
If using this repository, please cite our Text2Story 2026 paper (citation details to be updated upon publication).
This work is licensed under a Creative Commons Attribution 4.0 International License.




