This repository contains the code for Evolution Strategy for Metacognitive Alignment (ESMA), a method for improving large language models' awareness of their own knowledge. The work is described in the paper *Fine-Tuning Language Models to Know What They Know* (see the citation below).
Metacognition, knowing what one knows, is central to intelligence. This project provides:
- A measurement framework and evaluation tools for LLM metacognition: a dual-prompt protocol (Direct Questions + Meta Questions), the d′type2 metric from signal detection theory (see the sketch after this list), and evaluation scripts that report d′type2, raw alignment, accuracy, yes/no ratios, and related metrics on TriviaQA and other QA datasets.
- ESMA: evolution-strategy-based fine-tuning that strengthens the link between a model's internal knowledge and its explicit answers, including answers to "Do you know the answer?"-style meta-questions.
- Weight-patching scripts to extract weight deltas (tuned − base) and apply sparse or full updates (e.g. the top/bottom p% of deltas by magnitude), for analyzing which parameter changes drive the metacognitive improvement.
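For the d′type2 metric in the first bullet: treating correct direct answers as signal trials and a meta "yes" as the detection response, type-2 sensitivity is the z-transformed gap between the "yes" rate on correct answers and the "yes" rate on incorrect ones. A minimal sketch follows; the clamping of extreme rates is an assumed correction, not necessarily what `metric.py` does:

```python
from statistics import NormalDist

def d_prime_type2(yes_given_correct, yes_given_incorrect, eps=1e-3):
    """Type-2 d': separation, in z-units, between meta-'yes' rates on
    correct vs. incorrect direct answers (signal detection theory)."""
    z = NormalDist().inv_cdf
    # Clamp rates away from 0/1 so the inverse normal CDF stays finite
    # (a common correction; the repo's exact handling may differ).
    hit = min(max(yes_given_correct, eps), 1 - eps)
    fa = min(max(yes_given_incorrect, eps), 1 - eps)
    return z(hit) - z(fa)

# e.g. a model that says "yes, I know" on 70% of questions it answers
# correctly but only 31% of those it gets wrong:
print(d_prime_type2(0.70, 0.31))  # ≈ 1.02, near the d′type2 ≈ 1 reported below
```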
ESMA maintains a population of weight-perturbed models, scores each candidate with a joint reward over direct correctness and meta-alignment, and updates the model by weighted averaging of parameters. It improves metacognitive sensitivity (e.g. d′type2 ≈ 1) and generalizes to unseen prompts, languages, and datasets.
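A minimal sketch of one such iteration, flattening all parameters into a single vector. The softmax weighting and hyperparameter values here are assumptions, not necessarily what `evolution.py` implements:

```python
import numpy as np

def es_step(theta, reward_fn, pop_size=8, sigma=0.02, rng=None):
    """One ESMA-style iteration: sample weight-perturbed candidates,
    score each with the joint reward, then set the new parameters to
    the reward-weighted average of the candidates."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((pop_size, theta.size))   # Gaussian perturbations
    candidates = theta + sigma * eps                    # perturbed population
    rewards = np.array([reward_fn(c) for c in candidates])
    w = np.exp(rewards - rewards.max())                 # softmax over rewards
    w /= w.sum()
    return w @ candidates                               # weighted parameter average
```

Here `reward_fn` would score a candidate model on a batch of questions with the joint reward over direct correctness and meta-alignment.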
To install:

```bash
git clone https://github.com/cosmoquester/ESMA.git && cd ESMA
pip install -e .
```

Repository layout:

- `esma/` – Core library:
  - `metric.py` – d′type2, raw alignment, yes/no detection, RMI.
  - `reward.py` – ESMA joint reward and ablations (correctness-only, alignment-only).
  - `evolution.py` – Evolution strategy (perturbation, evaluation, weighted update).
  - `prompt.py` – Direct / Meta / IDK prompt templates.
  - `dataset.py` – Dataset and data loading utilities.
  - `data/` – TriviaQA, FreebaseQA, NQ Open, WebQuestions, MKQA, FictionalQA, etc.
- `scripts/` – Training and evaluation scripts:
  - `train_es.py` – ESMA training (evolution strategy on TriviaQA).
  - `train_sft_meta.py` – Supervised fine-tuning for meta-answers (SFT baseline).
  - `train_sft.py` – General SFT (e.g. for FictionalQA).
  - `evaluate_qa.py` – Evaluate models on dual-prompt QA (d′type2, alignment, accuracy).
  - `evaluate_qa_idw.py` – “I don’t know” (IDK) single-prompt evaluation.
  - `evaluate_qa_threshold.py` – Threshold-based / confidence evaluation.
  - `evaluate_qa_api.py` – Evaluation for API-based models.
  - `apply_weight_change.py` – Apply (e.g. sparse) weight deltas to a base model (see the sketch after this list).
  - `extract_weight_change.py` – Extract weight changes (e.g. for patching analysis).
- `notebooks/` – Plotting confidence distributions and weight-patching effects.
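As a hedged illustration of the patching analysis done by `extract_weight_change.py` / `apply_weight_change.py`: extract the delta (tuned − base), then keep only the top (or bottom) p fraction of entries by absolute magnitude. The function below is a sketch for a single weight tensor; the name and selection details are assumptions:

```python
import torch

def apply_sparse_delta(base, tuned, p=0.01, largest=True):
    """Apply only the top (largest=True) or bottom (largest=False)
    p fraction of the weight delta by magnitude; zero out the rest."""
    delta = tuned - base
    k = max(1, int(p * delta.numel()))
    idx = delta.abs().flatten().topk(k, largest=largest).indices
    mask = torch.zeros(delta.numel(), dtype=torch.bool)
    mask[idx] = True                      # keep only the selected entries
    return base + delta * mask.view_as(delta)
```

Sweeping p and re-running the dual-prompt evaluation shows which parameter changes carry the metacognitive improvement.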
To train with the evolution strategy on TriviaQA:

```bash
accelerate launch --num_processes 8 \
    scripts/train_es.py \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --reward-type esma
```

Hyperparameters (e.g. σ, α, iterations, population size) follow the paper; `--reward-type esma` uses the joint reward (correctness + meta-alignment).
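As a hedged sketch of what `--reward-type esma` optimizes per question: the joint reward combines direct correctness with meta-alignment. The equal weighting below is an assumption; see `reward.py` for the actual formulation and the correctness-only / alignment-only ablations:

```python
def joint_reward(direct_correct: bool, meta_yes: bool) -> float:
    """Joint reward sketch: one point for a correct direct answer,
    plus one point when the meta-answer matches reality (says "yes"
    iff the direct answer is correct). Equal weights are an assumption."""
    correctness = float(direct_correct)
    alignment = float(meta_yes == direct_correct)
    return correctness + alignment
```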
To evaluate a model on the dual-prompt QA protocol:

```bash
python scripts/evaluate_qa.py --model path/to/model
```

Use `--help` for data paths, batch size, and output options.
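For reference, the dual-prompt protocol this script implements can be sketched as follows. The prompt wording and answer parsing here are illustrative assumptions, not the actual templates in `prompt.py`:

```python
def evaluate_dual_prompt(generate, qa_pairs):
    """Dual-prompt evaluation sketch. `generate(prompt) -> str` is a
    hypothetical model-call hook; qa_pairs is [(question, gold_answer)].
    Returns (accuracy, meta yes-rate, raw alignment)."""
    n_correct = n_yes = n_aligned = 0
    for question, gold in qa_pairs:
        direct = generate(f"Question: {question}\nAnswer:")       # Direct prompt
        meta = generate(f"Do you know the answer to this question: "
                        f"{question}? Answer yes or no.")         # Meta prompt
        correct = gold.lower() in direct.lower()
        says_yes = meta.strip().lower().startswith("yes")
        n_correct += correct
        n_yes += says_yes
        n_aligned += (correct == says_yes)  # alignment: meta matches correctness
    n = len(qa_pairs)
    return n_correct / n, n_yes / n, n_aligned / n
```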
If you use this code or the method, please cite:
```bibtex
@misc{park2026finetuninglanguagemodelsknow,
  title={Fine-Tuning Language Models to Know What They Know},
  author={Sangjun Park and Elliot Meyerson and Xin Qiu and Risto Miikkulainen},
  year={2026},
  eprint={2602.02605},
  archivePrefix={arXiv},
  primaryClass={cs.NE},
  url={https://arxiv.org/abs/2602.02605},
}
```