This repository contains the code for Evolution Strategy for Metacognitive Alignment (ESMA), a method for improving large language models' awareness of their own knowledge. The work is described in the paper *Fine-Tuning Language Models to Know What They Know* (see the citation below).
Metacognition, knowing what one knows, is central to intelligence. This project provides:
- A measurement framework and evaluation tools for LLM metacognition: a dual-prompt protocol (Direct Questions + Meta Questions), the d′type2 metric from signal detection theory (see the sketch after this list), and evaluation scripts that report d′type2, raw alignment, accuracy, yes/no ratios, and related metrics on TriviaQA and other QA datasets.
- ESMA: evolution-strategy-based fine-tuning that strengthens the link between a model's internal knowledge and its explicit answers, including answers to "Do you know the answer?"-style meta-questions.
- Weight-patching scripts to extract weight deltas (tuned − base) and apply sparse or full updates (e.g. the top/bottom p% of deltas by magnitude), for analyzing which parameter changes drive the metacognitive improvement.
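For the d′type2 metric in the first bullet: treating correct direct answers as signal trials and a meta "yes" as the detection response, type-2 sensitivity is the z-transformed gap between the "yes" rate on correct answers and the "yes" rate on incorrect ones. A minimal sketch follows; the clamping of extreme rates is an assumed correction, not necessarily what `metric.py` does:

```python
from statistics import NormalDist

def d_prime_type2(yes_given_correct, yes_given_incorrect, eps=1e-3):
    """Type-2 d': separation, in z-units, between meta-'yes' rates on
    correct vs. incorrect direct answers (signal detection theory)."""
    z = NormalDist().inv_cdf
    # Clamp rates away from 0/1 so the inverse normal CDF stays finite
    # (a common correction; the repo's exact handling may differ).
    hit = min(max(yes_given_correct, eps), 1 - eps)
    fa = min(max(yes_given_incorrect, eps), 1 - eps)
    return z(hit) - z(fa)

# e.g. a model that says "yes, I know" on 70% of questions it answers
# correctly but only 31% of those it gets wrong:
print(d_prime_type2(0.70, 0.31))  # ≈ 1.02, near the d′type2 ≈ 1 reported below
```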
ESMA maintains a population of weight-perturbed models, scores each candidate with a joint reward over direct correctness and meta-alignment, and updates the model by weighted averaging of parameters. It improves metacognitive sensitivity (e.g. d′type2 ≈ 1) and generalizes to unseen prompts, languages, and datasets.
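A minimal sketch of one such iteration, flattening all parameters into a single vector. The softmax weighting and hyperparameter values here are assumptions, not necessarily what `evolution.py` implements:

```python
import numpy as np

def es_step(theta, reward_fn, pop_size=8, sigma=0.02, rng=None):
    """One ESMA-style iteration: sample weight-perturbed candidates,
    score each with the joint reward, then set the new parameters to
    the reward-weighted average of the candidates."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((pop_size, theta.size))   # Gaussian perturbations
    candidates = theta + sigma * eps                    # perturbed population
    rewards = np.array([reward_fn(c) for c in candidates])
    w = np.exp(rewards - rewards.max())                 # softmax over rewards
    w /= w.sum()
    return w @ candidates                               # weighted parameter average
```

Here `reward_fn` would score a candidate model on a batch of questions with the joint reward over direct correctness and meta-alignment.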
To install:

```bash
git clone https://github.com/cosmoquester/ESMA.git && cd ESMA
pip install -e .
```

Repository layout:

- `esma/` – Core library:
  - `metric.py` – d′type2, raw alignment, yes/no detection, RMI.
  - `reward.py` – ESMA joint reward and ablations (correctness-only, alignment-only).
  - `evolution.py` – Evolution strategy (perturbation, evaluation, weighted update).
  - `prompt.py` – Direct / Meta / IDK prompt templates.
  - `dataset.py` – Dataset and data loading utilities.
  - `data/` – TriviaQA, FreebaseQA, NQ Open, WebQuestions, MKQA, FictionalQA, etc.
- `scripts/` – Training and evaluation scripts:
  - `train_es.py` – ESMA training (evolution strategy on TriviaQA).
  - `train_sft_meta.py` – Supervised fine-tuning for meta-answers (SFT baseline).
  - `train_sft.py` – General SFT (e.g. for FictionalQA).
  - `evaluate_qa.py` – Evaluate models on dual-prompt QA (d′type2, alignment, accuracy).
  - `evaluate_qa_idw.py` – “I don’t know” (IDK) single-prompt evaluation.
  - `evaluate_qa_threshold.py` – Threshold-based / confidence evaluation.
  - `evaluate_qa_api.py` – Evaluation for API-based models.
  - `apply_weight_change.py` – Apply (e.g. sparse) weight deltas to a base model (see the sketch after this list).
  - `extract_weight_change.py` – Extract weight changes (e.g. for patching analysis).
- `notebooks/` – Plotting confidence distributions and weight-patching effects.
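As a hedged illustration of the patching analysis done by `extract_weight_change.py` / `apply_weight_change.py`: extract the delta (tuned − base), then keep only the top (or bottom) p fraction of entries by absolute magnitude. The function below is a sketch for a single weight tensor; the name and selection details are assumptions:

```python
import torch

def apply_sparse_delta(base, tuned, p=0.01, largest=True):
    """Apply only the top (largest=True) or bottom (largest=False)
    p fraction of the weight delta by magnitude; zero out the rest."""
    delta = tuned - base
    k = max(1, int(p * delta.numel()))
    idx = delta.abs().flatten().topk(k, largest=largest).indices
    mask = torch.zeros(delta.numel(), dtype=torch.bool)
    mask[idx] = True                      # keep only the selected entries
    return base + delta * mask.view_as(delta)
```

Sweeping p and re-running the dual-prompt evaluation shows which parameter changes carry the metacognitive improvement.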
To train with the evolution strategy on TriviaQA:

```bash
accelerate launch --num_processes 8 \
    scripts/train_es.py \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --reward-type esma
```

Hyperparameters (e.g. σ, α, iterations, population size) follow the paper; `--reward-type esma` uses the joint reward (correctness + meta-alignment).
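As a hedged sketch of what `--reward-type esma` optimizes per question: the joint reward combines direct correctness with meta-alignment. The equal weighting below is an assumption; see `reward.py` for the actual formulation and the correctness-only / alignment-only ablations:

```python
def joint_reward(direct_correct: bool, meta_yes: bool) -> float:
    """Joint reward sketch: one point for a correct direct answer,
    plus one point when the meta-answer matches reality (says "yes"
    iff the direct answer is correct). Equal weights are an assumption."""
    correctness = float(direct_correct)
    alignment = float(meta_yes == direct_correct)
    return correctness + alignment
```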
To evaluate a model on the dual-prompt QA protocol:

```bash
python scripts/evaluate_qa.py --model path/to/model
```

Use `--help` for data paths, batch size, and output options.
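For reference, the dual-prompt protocol this script implements can be sketched as follows. The prompt wording and answer parsing here are illustrative assumptions, not the actual templates in `prompt.py`:

```python
def evaluate_dual_prompt(generate, qa_pairs):
    """Dual-prompt evaluation sketch. `generate(prompt) -> str` is a
    hypothetical model-call hook; qa_pairs is [(question, gold_answer)].
    Returns (accuracy, meta yes-rate, raw alignment)."""
    n_correct = n_yes = n_aligned = 0
    for question, gold in qa_pairs:
        direct = generate(f"Question: {question}\nAnswer:")       # Direct prompt
        meta = generate(f"Do you know the answer to this question: "
                        f"{question}? Answer yes or no.")         # Meta prompt
        correct = gold.lower() in direct.lower()
        says_yes = meta.strip().lower().startswith("yes")
        n_correct += correct
        n_yes += says_yes
        n_aligned += (correct == says_yes)  # alignment: meta matches correctness
    n = len(qa_pairs)
    return n_correct / n, n_yes / n, n_aligned / n
```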
If you use this code or the method, please cite:
```bibtex
@misc{park2026finetuninglanguagemodelsknow,
  title={Fine-Tuning Language Models to Know What They Know},
  author={Sangjun Park and Elliot Meyerson and Xin Qiu and Risto Miikkulainen},
  year={2026},
  eprint={2602.02605},
  archivePrefix={arXiv},
  primaryClass={cs.NE},
  url={https://arxiv.org/abs/2602.02605},
}
```