Author: Sourav Roy
Email: royxlead@proton.me
Date: October 2025
This repository contains a complete implementation and evaluation suite for "Self-Diagnosing Neural Models": uncertainty quantification (UQ) methods and a novel unsupervised confidence metric that estimates model confidence without using labels. The project includes:
- Implementations of multiple UQ methods: Baseline (MSP), Monte Carlo Dropout (MC Dropout), Evidential Deep Learning (EDL), and Deep Ensembles.
- A novel unsupervised confidence metric combining prediction consistency across augmentations, entropy, feature-space dispersion, and softmax temperature analysis.
- Dataset: CIFAR-10 (ID) vs CIFAR-100 (OOD)
- Training epochs reported: 100
- Best single-model accuracy: Evidential (≈91.7%)
- Best calibration (lowest ECE): MC Dropout (≈0.0097)
- Best OOD detection AUROC (≈0.855): Baseline / Ensemble
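As a rough illustration of how the unsupervised metric could combine its signals, here is a NumPy sketch using only two of the four ingredients (entropy and augmentation consistency); the weighting and exact terms are illustrative placeholders, not the notebook's actual `UnsupervisedConfidenceMetric` implementation:

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def unsupervised_confidence(aug_logits, w_entropy=0.5, w_consistency=0.5):
    """Toy label-free confidence score from K augmented views of one input.

    aug_logits: array of shape (K, num_classes) -- logits for K augmentations.
    The real metric also uses feature-space dispersion and temperature
    analysis; this sketch keeps only two of the four ingredients, and the
    weights are arbitrary.
    """
    probs = softmax(aug_logits)                      # (K, C)
    mean_probs = probs.mean(axis=0)                  # average prediction
    # Low entropy of the averaged prediction -> high confidence.
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum()
    entropy_conf = 1.0 - entropy / np.log(len(mean_probs))
    # Agreement of per-augmentation argmax predictions -> consistency term.
    preds = probs.argmax(axis=1)
    consistency = (preds == preds[0]).mean()
    return w_entropy * entropy_conf + w_consistency * consistency

# A sharply and consistently predicted input scores higher than a diffuse one.
peaked = np.tile(np.array([8.0, 0.0, 0.0]), (4, 1))
flat = np.random.default_rng(0).normal(size=(4, 3))
print(unsupervised_confidence(peaked) > unsupervised_confidence(flat))  # True
```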
Create and activate a virtual environment and install the core packages. Adjust the torch wheel for your CUDA version.
# optional: create a venv and activate it
python -m venv .venv; .\.venv\Scripts\Activate.ps1
# common packages - pin or change versions if you need exact reproduction
pip install numpy scipy scikit-learn matplotlib seaborn tqdm tensorboard
# Install torch + torchvision following instructions at https://pytorch.org (pick the right CUDA)
pip install torch torchvision

- Open `self_diagnosing_neural_models_python.ipynb` in Jupyter or VS Code and run the cells from the top.
- The notebook exposes a `main_pipeline(...)` orchestration function to train, evaluate, and export results. When running experiments, you can set `train_models=False` to load existing checkpoints instead of retraining.
Example (from inside the notebook after converting to .py or using the notebook kernel):
models, results, unsupervised_results, evaluator = main_pipeline(
train_models=False, # load checkpoints instead of training from scratch
num_epochs=100,
run_ablations=True,
id_dataset='cifar10',
ood_dataset='cifar100',
batch_size=128
)

- Checkpoints for the Baseline, MC Dropout, and Evidential models are available in `checkpoints/`.
- Ensemble member weights (if present) live in `ensemble_model/ensemble_model_*.pth`.
- Install dependencies and set the appropriate PyTorch wheel for your GPU/CPU.
- Open and run the notebook cells in order — the notebook sets seeds for deterministic behavior where possible.
- For quick smoke tests, use the notebook's CLI flags (`--smoke-test` or `--fast-debug`) or set the `FAST_DEBUG_SUBSET` environment variable.
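If you drive the notebook programmatically, the environment variable must be set before the relevant cells run. A minimal sketch — the variable name comes from this repo, but the `"1"` value is an assumption; check the notebook for the format it expects:

```python
import os

# FAST_DEBUG_SUBSET is read by the notebook to shrink the dataset for
# debugging; the "1" value here is an assumption about its format.
os.environ["FAST_DEBUG_SUBSET"] = "1"
print(os.environ["FAST_DEBUG_SUBSET"])  # 1
```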
The images/ folder contains the main plotted outputs. A few representative figures are embedded below — click the images to open the full-size PNGs in the repository.
Comparison across methods (accuracy / ECE / AUROC):
Figure: Side-by-side comparison of key metrics across Baseline, MC Dropout, Evidential, and Ensemble models.
Confidence distributions and unsupervised metric behavior:
Figure: Predicted confidence histograms and the proposed unsupervised confidence score behavior across datasets.
OOD detection ROC curves (CIFAR-10 ID vs CIFAR-100 OOD):
Figure: ROC curves for OOD detection using CIFAR-10 as ID and CIFAR-100 as OOD; higher AUROC indicates better separability.
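For context on these AUROC numbers: MSP-based OOD detection reduces to scoring each sample by its maximum softmax probability and computing AUROC with ID samples as the positive class. A sketch using scikit-learn (which is in the install list); the function name and toy data are illustrative, not from the notebook:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def msp_ood_auroc(id_probs, ood_probs):
    """AUROC for MSP-based OOD detection.

    id_probs / ood_probs: (N, C) softmax outputs on ID and OOD data.
    ID samples are the positive class; a well-separated model assigns
    higher max-softmax scores to ID inputs.
    """
    scores = np.concatenate([id_probs.max(axis=1), ood_probs.max(axis=1)])
    labels = np.concatenate([np.ones(len(id_probs)), np.zeros(len(ood_probs))])
    return roc_auc_score(labels, scores)

# Toy check: confident ID predictions vs. diffuse OOD predictions.
id_probs = np.array([[0.90, 0.05, 0.05], [0.85, 0.10, 0.05]])
ood_probs = np.array([[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]])
print(msp_ood_auroc(id_probs, ood_probs))  # 1.0
```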
More visuals (in `images/`): training curves for each method, reliability diagrams (`*_reliability_diagram.png`), per-method unsupervised analyses (`*_unsupervised_analysis.png`), and ablation plots (`ablation_*.png`).
Training curves
Reliability diagrams
Unsupervised analyses
Ablation studies
- The notebook contains well-commented components: `DatasetManager`, `BaselineModel`, `MCDropoutModel`, `EvidentialModel`, `UnsupervisedConfidenceMetric`, `Trainer`, `DeepEnsemble`, `ComprehensiveEvaluator`, `Visualizer`, and `AblationStudies`.
- The main pipeline function `main_pipeline(...)` orchestrates dataset loading, training (or checkpoint loading), evaluation, and plotting.
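For reference, the expected calibration error (ECE) reported in the results above is the standard equal-width-binned metric. A self-contained NumPy sketch — independent of the notebook's `ComprehensiveEvaluator`, which may use a different bin count or binning scheme:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-binned ECE: weighted mean |accuracy - confidence| per bin.

    confidences: max softmax probability per sample, in [0, 1].
    correct: boolean array, True where the prediction was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # bin weight * calibration gap
    return ece

# Perfectly calibrated toy case: 80% confidence, 80% accuracy -> ECE of 0.
conf = np.full(10, 0.8)
corr = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
print(round(expected_calibration_error(conf, corr), 4))  # 0.0
```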
This project is licensed under the MIT License — see the LICENSE file.