Commit dcc2338: calibration evaluator (1 parent c0c7c8e)

File tree: 5 files changed (+432, -0 lines)

examples/calibration/README.md (40 additions, 0 deletions)
# Calibration Evaluator

This directory contains an evaluator for measuring the calibration of LLM classifiers. It calculates:

- **Accuracy**: Fraction of correct predictions.
- **Brier Score**: Mean squared error between the predicted probability vectors and the one-hot true labels. Lower is better.
- **ECE (Expected Calibration Error)**: Weighted average, over confidence bins, of the absolute difference between each bin's mean confidence and its accuracy. Lower is better.
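As a concrete illustration, the two calibration metrics above can be computed from predicted class probabilities and true labels roughly as follows. This is a minimal NumPy sketch, not the code in `evaluator.py`; the equal-width 10-bin layout for ECE is an assumption (bin count and edges vary between implementations):

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probability vectors
    and one-hot encodings of the true labels."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by confidence (max probability), then take the
    weighted average of |bin accuracy - bin mean confidence|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)      # confidence of the top prediction
    pred = probs.argmax(axis=1)   # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

# Perfectly confident, perfectly correct predictions score 0 on both.
probs = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
print(brier_score(probs, labels))                 # 0.0
print(expected_calibration_error(probs, labels))  # 0.0
```

A well-calibrated but uncertain model (e.g. 80% confidence, 80% accuracy) has low ECE but a nonzero Brier score, which is why the two metrics are reported together.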
## Usage

1. Install the dependencies:

   ```bash
   pip install datasets numpy openai python-dotenv
   ```

2. Set your Fireworks API key in `.env` or as an environment variable:

   ```bash
   export FIREWORKS_API_KEY=your_key
   ```

3. Run the evaluation script:

   ```bash
   python run_calibration.py
   ```
## Files

- `evaluator.py`: Contains the `calibration_evaluator` batch reward function.
- `run_calibration.py`: Script that loads the AG News dataset and runs the evaluation on the specified models.
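The actual interface of the batch reward function is defined in `evaluator.py`; as a rough, hypothetical sketch of the shape such a function might take (the sample dict keys and return format here are assumptions, not the real interface):

```python
def calibration_evaluator(samples):
    """Hypothetical batch reward function: takes a list of samples, each
    holding predicted class probabilities and a true label, and returns
    aggregate metrics for the whole batch."""
    n = len(samples)
    # Accuracy: fraction of samples where the argmax class matches the label.
    correct = sum(
        1 for s in samples
        if max(range(len(s["probs"])), key=s["probs"].__getitem__) == s["label"]
    )
    # Brier score: mean squared error vs. the one-hot true label.
    brier = sum(
        sum((p - (1.0 if i == s["label"] else 0.0)) ** 2
            for i, p in enumerate(s["probs"]))
        for s in samples
    ) / n
    return {"accuracy": correct / n, "brier": brier}

batch = [{"probs": [0.9, 0.1], "label": 0},
         {"probs": [0.3, 0.7], "label": 1}]
print(calibration_evaluator(batch))  # accuracy 1.0, brier 0.1
```

Operating on the whole batch at once (rather than per sample) is what makes bin-based metrics like ECE computable inside the same function.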
## Configuration

You can modify `run_calibration.py` to:

- Change the models being evaluated (the `MODELS` list).
- Change the dataset or the number of samples.
- Adjust the class mapping if using a different dataset.

You can modify `evaluator.py` to:

- Change the class tokens (`CLASS_TOKENS`) if the model uses different tokenization.
- Adjust `top_logprobs` if needed (note that some models cap this at 5).