This is the official repository of the CVPR 2026 paper "IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment" by Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov.
Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models.
(a) Standard CLIP intra-modal retrieval is sub-optimal due to intra-modal misalignment. (b) Inversion methods are effective but add high latency. (c) IsoCLIP isolates well-aligned directions from the inter-modal operator Ψ = WᵢᵀWₜ, improving intra-modal performance at no extra cost.
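The decomposition behind (c) can be written out explicitly. With x and y denoting pre-projection image and text features and Wᵢ, Wₜ the projection matrices, the similarity of projected features factors as (a sketch of the abstract's argument, not an excerpt from the paper):

```latex
s(x, y) = (W_i x)^{\top}(W_t y) = x^{\top}\,\underbrace{W_i^{\top} W_t}_{\Psi}\, y,
\qquad
\lVert W_i x\rVert^{2} = x^{\top}\,(W_i^{\top} W_i)\, x,
\quad
\lVert W_t y\rVert^{2} = y^{\top}\,(W_t^{\top} W_t)\, y .
```

Only Ψ couples the two modalities; the Gram operators WᵢᵀWᵢ and WₜᵀWₜ enter the cosine similarity purely through per-modality normalization, which is why the aligned subspace is extracted from Ψ alone.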
This guide provides step-by-step instructions on how to set up the isoclip conda environment and install all necessary dependencies.
- Create and activate the conda environment:

```bash
conda create -y -n isoclip python=3.10.14
conda activate isoclip
```

- Install PyTorch (with CUDA 12.1):

```bash
conda install pytorch==2.1.1 torchvision==0.16.1 pytorch-cuda=12.1 -c pytorch -c nvidia
```

- Clone the repository and install pip dependencies:

```bash
git clone https://github.com/simomagi/IsoCLIP.git
cd IsoCLIP
pip install --no-build-isolation git+https://github.com/KaiyangZhou/Dassl.pytorch
chmod +x install_requirements.sh
./install_requirements.sh
```

We recommend storing all datasets in a single directory:

```bash
mkdir -p /path/to/datasets
```

Use this path with the `--dataroot` flag when running:

```bash
python src/retrieval.py --dataroot /path/to/datasets
python src/classification/zero_shot_classification.py --dataroot /path/to/datasets
```

Our codebase supports all datasets from the Cross-the-Gap repository:
- Please refer to the official Cross-the-Gap dataset documentation for download instructions.
- For adding and setting up new datasets, see the dataset installation guide.
In addition, our codebase also supports:

- Places365: download the validation set and store it under

  ```
  /path/to/datasets/Places365_val/
  ├── places365_val.txt
  ├── places_devkit/
  └── val_large/
  ```

  Official website: http://places2.csail.mit.edu/download-private.html

- iNaturalist 2021: download the mini-train split and store it under

  ```
  /path/to/datasets/iNaturalist2021_train/
  ├── annotations
  └── train_mini
  ```

  Official website: https://github.com/visipedia/inat_comp/tree/master/2021
The main script for retrieval experiments is:
```bash
python src/retrieval.py
```

It supports:
- image-to-image retrieval
- text-to-text retrieval
- IsoCLIP-enabled evaluation
- standard CLIP/OpenCLIP evaluation
- `--dataroot`: Root directory containing all datasets.
- `--dataset_name`: Name of the dataset to evaluate. See `data_utils.py` for a list of dataset names.
- `--query_eval_type {image,text}`: Feature type used for the query set.
- `--gallery_eval_type {image,text}`: Feature type used for the gallery set.
- `--clip_model_name`: CLIP backbone to use. Default: `ViT-B/32`.
- `--no_iso`: Disable the IsoCLIP projection. By default, IsoCLIP is enabled.
- `--iso_ktop`: Number of top singular directions to remove. Default: `150`.
- `--iso_kbottom`: Number of bottom singular directions to remove. Default: `50`.
- `--query_split`: Query split to evaluate. If not specified, the default split for the dataset is used. See `data_utils.py` for default splits.
- `--gallery_split`: Gallery split to evaluate. If not specified, the default split for the dataset is used. See `data_utils.py` for default splits.
- `--use_open_clip`: Use OpenCLIP instead of OpenAI CLIP.
- `--open_clip_pretrained`: Name of the OpenCLIP pretrained weights.
- `--out_path`: Output directory for the experiments. If not specified, a folder named `local_run_retrieval` is created by default, and the experiment is saved inside it.
Image-to-image retrieval with IsoCLIP:

```bash
python src/retrieval.py \
    --dataroot /path/to/datasets/ \
    --dataset_name cub2011 \
    --clip_model_name ViT-B/16 \
    --query_eval_type image \
    --gallery_eval_type image \
    --iso_ktop 200 \
    --iso_kbottom 50 \
    --out_path iso_cub_img_retrieval
```

Text-to-text retrieval with IsoCLIP:

```bash
python src/retrieval.py \
    --dataroot /path/to/datasets/ \
    --dataset_name flickr30k_text \
    --clip_model_name ViT-B/16 \
    --query_eval_type text \
    --gallery_eval_type text \
    --iso_ktop 10 \
    --iso_kbottom 50 \
    --out_path iso_flickr_txt_retrieval
```

Baseline CLIP retrieval (IsoCLIP disabled):

```bash
python src/retrieval.py \
    --dataroot /path/to/datasets/ \
    --dataset_name cub2011 \
    --clip_model_name ViT-B/16 \
    --query_eval_type image \
    --gallery_eval_type image \
    --no_iso \
    --out_path baseline_retrieval
```

OpenCLIP evaluation:

```bash
python src/retrieval.py \
    --dataroot /path/to/datasets \
    --dataset_name cub2011 \
    --clip_model_name ViT-B-16 \
    --use_open_clip \
    --open_clip_pretrained datacomp_xl_s13b_b90k \
    --query_eval_type image \
    --gallery_eval_type image \
    --out_path openclip_eval \
    --iso_ktop 100 \
    --iso_kbottom 100
```
When running the codebase for the first time, the required model is automatically downloaded, and the extracted image or text features are stored under the data/ directory inside the repository.
If the same model and evaluation are executed again, the precomputed features are automatically loaded, avoiding redundant inference and significantly reducing runtime when multiple evaluations are performed with the same model and data.
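The cache-or-compute behavior described above can be illustrated with a minimal sketch (illustrative only; the function name, cache key scheme, and file layout are assumptions, not the repository's actual implementation):

```python
import hashlib
from pathlib import Path

import numpy as np


def cached_features(model_name, dataset_name, split, compute_fn, cache_dir="data"):
    """Load features from the cache if present, otherwise compute and store them."""
    # Build a filesystem-safe cache key from the model/dataset/split combination.
    key = hashlib.md5(f"{model_name}|{dataset_name}|{split}".encode()).hexdigest()
    path = Path(cache_dir) / f"{key}.npy"
    if path.exists():
        return np.load(path)   # reuse precomputed features, skipping inference
    feats = compute_fn()       # run the (expensive) encoder exactly once
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, feats)
    return feats
```

On the first call the encoder runs and the result is written to disk; any later call with the same model, dataset, and split loads the stored array instead.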
Each run:
- prints metrics to the terminal
- creates a unique folder
- saves a `summary.csv` with configuration and results

Results are stored under:

```
results/<out_path>/exp_<timestamp>_<random_tag>/
```

or (if `--out_path` is not set):

```
local_run_retrieval/exp_<timestamp>_<random_tag>/
```

The `summary.csv` file contains a single-row summary with:
- configuration: dataset, model, eval types, splits, ISO parameters, OpenCLIP settings
- metrics: `mAP`, `mAP_at_R`, `precision_at_R`, `recall_at_1`
- dataset-specific metrics (e.g. Oxford/Paris mAP variants)
- metadata: `folder_path`, `timestamp`

Notes:
- Some metrics may be empty depending on the dataset.
- The summary is designed for easy aggregation across multiple runs. Use `aggregate_retrieval.py` to aggregate multiple `summary.csv` files within the same `out_path` folder.
We provide bash scripts to run retrieval experiments for each model.
- `exp_img-img_retrieval/`: image-to-image retrieval
- `exp_txt-txt_retrieval/`: text-to-text retrieval

Each script corresponds to a specific model (e.g. `clip_b32_img_retrieval.sh`).

Before running, set:

```bash
PYTHON=""     # Python executable
ROOT_DIR=""   # Repository root
DATA_ROOT=""  # Dataset root
```

Run a script as:

```bash
cd exp_img-img_retrieval
bash clip_b32_img_retrieval.sh
```

The classification procedure is similar to the retrieval one. Run:

```bash
python src/classification/zero_shot_classification.py
```

Supports:
- `zeroshot` → CLIP zero-shot
- `ncm` → NCM with CLIP features
- `iso_ncm` → NCM with IsoCLIP
```
--dataroot                          # dataset root
--dataset_name                      # e.g. caltech101, oxford_pets, ...
--eval-type {zeroshot,ncm,iso_ncm}
--clip_model_name ViT-B/32
--iso_ktop 150                      # only for iso_ncm
--iso_kbottom 50                    # only for iso_ncm
--use_open_clip
--open_clip_pretrained <name>
--out_path <dir>                    # default: local_run_classification
```
```bash
python src/classification/zero_shot_classification.py \
    --dataroot /path/to/datasets \
    --dataset_name caltech101 \
    --eval-type iso_ncm \
    --clip_model_name ViT-B/32 \
    --iso_ktop 150 \
    --iso_kbottom 50 \
    --out_path iso_classification
```

Each run:
- prints accuracy
- creates: `<out_path>/exp_<timestamp>_<tag>/`
- saves: `classification_summary.csv`

The CSV contains: dataset, model, eval type, ISO params, and accuracy.
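For reference, nearest class mean (NCM) classification, as used by the `ncm` and `iso_ncm` modes, assigns each test feature to the most similar class centroid. A minimal cosine-similarity sketch (an illustration of the technique, not the repository's implementation):

```python
import numpy as np


def ncm_predict(train_feats, train_labels, test_feats):
    """Nearest class mean: assign each test feature to the closest class centroid."""
    classes = np.unique(train_labels)
    # Class centroids: mean of the training features of each class, L2-normalized.
    means = np.stack([train_feats[train_labels == c].mean(axis=0) for c in classes])
    means /= np.linalg.norm(means, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    # Cosine similarity to every centroid; predict the argmax class per sample.
    return classes[np.argmax(test @ means.T, axis=1)]
```

With CLIP features, `train_feats` would be image embeddings of the labeled set; with IsoCLIP, the same features after the IsoCLIP projection.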
We provide bash scripts to run classification experiments for each model.
- `exp_img_classification/`: image classification experiments

Each script corresponds to a specific model (e.g. `clip_b32_classification.sh`).

Before running, set:

```bash
PYTHON=""     # Python executable
ROOT_DIR=""   # Repository root
DATA_ROOT=""  # Dataset root
```

Run a script as:

```bash
cd exp_img_classification
bash clip_b32_classification.sh
```

The output directory is automatically set to the script name (without `.sh`). Use `aggregate_classification.py` to aggregate multiple `classification_summary.csv` files within the same folder.
This notebook provides a minimal demo of how to apply IsoCLIP to a pre-trained CLIP model.
It shows how to:
- load a CLIP model
- modify the image forward pass to extract pre-projection features
- extract the image and text projection layers
- compute the IsoCLIP projectors
- apply them in a simple forward pass
Open Demo Notebook
The notebook is intended as a lightweight example of the IsoCLIP pipeline, from model loading to the computation of IsoCLIP-transformed features.
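The pipeline the notebook walks through can be sketched in a few lines of NumPy. This is one plausible construction under the abstract's description (SVD of Ψ = WᵢᵀWₜ, keeping only the middle, approximately isotropic singular directions); the exact scaling and conventions used in the repository may differ, so treat the notebook as authoritative:

```python
import numpy as np


def isoclip_projectors(W_img, W_txt, k_top=150, k_bottom=50):
    """Sketch of the IsoCLIP construction: SVD of the inter-modal operator
    Psi = W_img^T @ W_txt, dropping the k_top largest and k_bottom smallest
    singular directions and keeping the middle, well-aligned ones.

    W_img: (d, p_img) image projection, W_txt: (d, p_txt) text projection,
    mapping pre-projection features into the d-dimensional shared space.
    """
    psi = W_img.T @ W_txt                   # inter-modal operator, (p_img, p_txt)
    U, S, Vt = np.linalg.svd(psi, full_matrices=False)
    keep = slice(k_top, len(S) - k_bottom)  # middle singular directions
    P_img = U[:, keep].T                    # projector for pre-projection image feats
    P_txt = Vt[keep]                        # projector for pre-projection text feats
    return P_img, P_txt


# Usage on random stand-in weights (the real W_img/W_txt come from the CLIP
# projection heads, as extracted in the notebook):
rng = np.random.default_rng(0)
W_img, W_txt = rng.normal(size=(512, 768)), rng.normal(size=(512, 512))
P_img, P_txt = isoclip_projectors(W_img, W_txt, k_top=150, k_bottom=50)
x = rng.normal(size=768)                    # a pre-projection image feature
z = P_img @ x                               # IsoCLIP image embedding
```

Intra-modal retrieval then compares `P_img`-projected image features with each other (likewise `P_txt` for text), instead of using the original projection heads.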
Adapting IsoCLIP to new CLIP-style models
The codebase is tested on the CLIP-like models and tasks used in the paper. For other CLIP-style models, you may need to adapt the model-specific code for:

- model loading (`utils.load_clip`)
- projector extraction (`encode_no_projection.get_projection_layers`)
- pre-projection image/text features (`get_encode_image_with_noproj`, `get_encode_text_with_noproj`)
- architecture-specific attention handling (`encode_attention_module`)

Before running IsoCLIP on a new model, check that the projection layers are identified correctly and that image/text features are extracted before the final projection head. Differences in pooling, projection, or encoder implementation are the most common source of integration issues.
If you find this work useful, please consider citing:
```bibtex
@InProceedings{Magistri_2026_CVPR,
    author    = {Magistri, Simone and Goswami, Dipam and Mistretta, Marco and Twardowski, Bart{\l}omiej and van de Weijer, Joost and Bagdanov, Andrew D.},
    title     = {IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    year      = {2026}
}
```
Our codebase builds upon Cross-the-Gap. If you find this codebase useful for your research, please also consider citing Cross the Gap:
```bibtex
@inproceedings{mistretta2025cross,
    title     = {Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion},
    author    = {Marco Mistretta and Alberto Baldrati and Lorenzo Agnolucci and Marco Bertini and Andrew D. Bagdanov},
    booktitle = {The Thirteenth International Conference on Learning Representations},
    year      = {2025},
    url       = {https://openreview.net/forum?id=VVVfuIcmKR}
}
```

This project is licensed under the MIT License. See LICENSE for details.