Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein and Volkan Cevher
LIONS @ EPFL and Tübingen AI Center
NeurIPS 2025
[Paper] [HuggingFace] [BibTeX]
We propose Levenshtein Efficient Adversarial Finetuning (LEAF) in order to obtain an adversarially robust text encoder for CLIP. Combining it with an existing robust image encoder, FARE, yields robustness in both input domains. All of our models are available on Huggingface 🤗 at https://huggingface.co/LEAF-CLIP.
To install the required packages, install Python 3.11 and run:
```bash
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:./src"
```

You can load our models like any other CLIP model. For example, LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2 can be loaded by following the "openai/clip-vit-large-patch14" example snippet:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(processor_name)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
When loading other model sizes, the processor_name must be changed accordingly (an illustrative loading snippet for ViT-H follows the table):
| Model Size | Processor Name |
|---|---|
| ViT-L-14 | "openai/clip-vit-large-patch14" |
| ViT-H-14 | "laion/CLIP-ViT-H-14-laion2B-s32B-b79K" |
| ViT-g-14 | "laion/CLIP-ViT-g-14-laion2B-s12B-b42K" |
| ViT-bigG-14 | "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k" |
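For instance, loading a ViT-H model with its matching processor might look as follows; the checkpoint name below is illustrative (it follows the ViT-L naming pattern above), so check our HuggingFace page for the exact identifiers:

```python
from transformers import CLIPModel, CLIPProcessor

# Illustrative ViT-H checkpoint name, following the ViT-L naming pattern above;
# see https://huggingface.co/LEAF-CLIP for the exact model identifiers.
model = CLIPModel.from_pretrained("LEAF-CLIP/CLIP-ViT-H-rho50-k1-constrained-FARE2")
processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
```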
If you just need the text encoder, you can load it with the following snippet:
```python
from transformers import CLIPTokenizer, CLIPTextModel

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPTextModel.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(processor_name)

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output  # pooled (EOS token) states
```

Our models are trained on the first 80k samples from DataComp-small. For instructions on downloading the dataset, follow the DataComp repository. In order to replicate our training runs, modify the corresponding file in the scripts folder to include the data location. For example, ViT-L can be finetuned from the FARE2 checkpoint as:
```bash
bash scripts/train_leaf_vitl.sh
```

If memory requirements are a bottleneck, the --accum-freq and --batch-size parameters can be adjusted so that their product equals the original batch size of 128. For example, setting --accum-freq 4 and --batch-size 32 performs 4 gradient accumulation steps with a batch size of 32 samples in order to simulate a batch size of 128.
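As a generic illustration of what gradient accumulation does, the following PyTorch sketch (not the repository's training loop, which comes from the adapted FARE/open_clip code) accumulates gradients over 4 micro-batches of 32 samples before a single optimizer step, matching an effective batch size of 128:

```python
import torch

# Hypothetical stand-ins: a tiny model and random data instead of the CLIP
# text encoder, the DataComp loader and the LEAF loss.
accum_freq, batch_size = 4, 32          # --accum-freq 4, --batch-size 32
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

optimizer.zero_grad()
for _ in range(accum_freq):
    micro_batch = torch.randn(batch_size, 512)
    loss = model(micro_batch).pow(2).mean()
    (loss / accum_freq).backward()      # scale so the accumulated gradient is an average
optimizer.step()                        # one update with an effective batch size of 128
```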
We adopt the training code from FARE and show the training commands in scripts/.
We train the models in OpenCLIP format and convert them afterwards to HuggingFace format via conversion/convert_2.py.
To measure the LEAF loss on the AG-News dataset, run:

```bash
python3 eval_textfare.py --model_name LEAF-CLIP/OpenCLIP-ViT-H-rho50-k1-constrained-FARE2
```

To evaluate the zero-shot adversarial accuracy on the AG-News dataset, run:

```bash
python3 eval_zero_shot_text.py --model_name LEAF-CLIP/OpenCLIP-ViT-H-rho50-k1-constrained-FARE2
```

Other datasets can be evaluated by changing the --dataset parameter.
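Conceptually, zero-shot text classification with the CLIP text encoder embeds the input text and one prompt per class, and predicts the class whose prompt embedding has the highest cosine similarity; adversarial accuracy is this accuracy under character-level perturbations of the input within a Levenshtein budget. A minimal sketch of the clean classification step, with illustrative class prompts rather than the eval script's templates, is:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Illustrative AG-News style class prompts; the eval script may use different templates.
class_prompts = [
    "a news article about world politics",
    "a news article about sports",
    "a news article about business",
    "a news article about science and technology",
]
text = "The championship final went to extra time before the home team won."

with torch.no_grad():
    class_emb = model.get_text_features(**tokenizer(class_prompts, padding=True, return_tensors="pt"))
    text_emb = model.get_text_features(**tokenizer([text], padding=True, return_tensors="pt"))

# Predict the class whose prompt embedding is closest in cosine similarity.
sims = torch.nn.functional.cosine_similarity(text_emb, class_emb)
print(class_prompts[sims.argmax().item()])
```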
To evaluate the text-to-image generation performance of Stable Diffusion v1.5 on the MS-COCO dataset at a given Levenshtein distance (controlled by --k), set the --coco_root variable to the location of your MS-COCO dataset and run:
```bash
python3 eval_text_to_image.py --model_name stable-diffusion-v1-5/stable-diffusion-v1-5 --batch_size 10 --adv --constrain --k 2 --text_encoder_name LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2 --coco_root path/to/coco/root
```
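Outside of the evaluation script, one way to plug a robust LEAF text encoder into a Stable Diffusion pipeline is to pass it to diffusers when constructing the pipeline. The following is a sketch assuming a recent diffusers installation, not part of the repository's code:

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

# Load the robust LEAF text encoder and hand it to the pipeline; all other
# components (UNet, VAE, tokenizer) come from the original SD v1.5 checkpoint.
text_encoder = CLIPTextModel.from_pretrained(
    "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2", torch_dtype=torch.float16
)
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photo of a cat wearing a tiny hat").images[0]
image.save("cat.png")
```

Since Stable Diffusion v1.5 already uses the CLIP ViT-L/14 tokenizer, the ViT-L text encoder is a drop-in replacement and no tokenizer change is needed.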
In order to compute the CLIPScores, modify --coco_real, --coco_captions and --path_fake and run:

```bash
python3 compute_clipscores.py --coco_real path/to/coco/images --coco_captions path/to/coco/captions.json --path_fake path/to/generated_images
```

To run the text embedding inversions, convert the HuggingFace models to OpenCLIP via conversion/convert_to_openclip.py. Then supply the checkpoint paths in src/pez/run_coco.py and run:

```bash
cd ./src/pez
pip install -r requirements.txt # install additional requirements
python3 run_coco.py --model vit-h-14 [--robust]
```

To evaluate the clean and robust performance on ImageNet, modify --imagenet_root and run:
```bash
python3 src/robust_vlm/eval/eval_imagenet.py --model_name LEAF-CLIP/OpenCLIP-ViT-H-rho50-k1-constrained-FARE2 --norm linf --eps 2 --imagenet_root path/to/imagenet
```

To evaluate the clean and robust performance on COCO retrieval (the LEAF robust models are downloaded automatically from HuggingFace), we use 1000 samples from the karpathy_test2017 set:
```bash
python eval_retrieval.py --num-samples 1000 --model-name 'ViT-L-14' --coco_root path/to/coco/val2017
```

This codebase gratefully forks from the FARE codebase.
If you find this project useful, please cite our paper:
```bibtex
@inproceedings{rocamora2025robustness,
  title     = {Robustness in Both Domains: {CLIP} Needs a Robust Text Encoder},
  author    = {Elias Abad Rocamora and Christian Schlarmann and Naman Deep Singh and Yongtao Wu and Matthias Hein and Volkan Cevher},
  booktitle = {{NeurIPS}},
  year      = {2025}
}
```