A modular benchmarking framework for multimodal evaluation of large language models (LLMs). ModalityEval cleanly isolates and compares causal attention (text-only) and cross-attention (image-only) pathways under realistic filters and perturbations.
- Amit Stein
- Ron Mondshein
- Koren Ben Ezra
Natural Language Processing (NLP) course project, Tel Aviv University.
- Isolated Attention Streams: Routes text inputs through causal self-attention and images through cross-attention for fair, repeatable comparisons.
- Extensible Filters: Apply character-level noise, Gaussian blur, contextual cues, and personalized hints to probe model vulnerabilities.
- Case Study Ready: Built-in support for benchmarking Meta’s LLaMA 3.2 Vision-Instruct on GSM8K.
- Modular Architecture: Swap models, datasets, and filters via a simple wrapper design.
- Comprehensive Reports: Generate per-filter summaries, consolidated CSVs, and publication-quality plots.
Figure 1. End-to-end flow of the Benchmark Manager: dataset ingestion, text & image filtering, multimodal wrapper execution, category extraction, and summary report generation.
Each component in the pipeline is implemented as a modular plugin. You can replace or extend:
- Data Loaders: Swap in different dataset ingestion scripts by following the `DatasetLoader` interface.
- Filters: Add or customize text/image filters by implementing new subclasses of the `Filter` base class.
- Model Wrappers: Integrate other LLMs or vision models by creating a new wrapper conforming to the `MultimodalWrapper` API.
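As a concrete illustration, a custom text filter might look like the sketch below. The `Filter` base class shown here is only a stand-in: the actual method names and signatures are defined by the framework's abstract classes, so treat this as a minimal example of the plugin pattern rather than the real interface.

```python
import random
from abc import ABC, abstractmethod


class Filter(ABC):
    """Stand-in for the framework's Filter base class (interface assumed)."""

    @abstractmethod
    def apply(self, sample):
        ...


class CharSwapFilter(Filter):
    """Character-level noise: swaps adjacent characters with probability p."""

    def __init__(self, p=0.1, seed=0):
        self.p = p
        self.rng = random.Random(seed)  # seeded for repeatable perturbations

    def apply(self, text):
        chars = list(text)
        i = 0
        while i < len(chars) - 1:
            if self.rng.random() < self.p:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 2  # skip the swapped pair so it is not swapped back
            else:
                i += 1
        return "".join(chars)
```

Because the filter is seeded, the same perturbation is applied on every run, which keeps comparisons between the text and image pathways repeatable.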
- Clone the repository:

  ```bash
  git clone https://github.com/Koren-Ben-Ezra/ModalityEval.git
  ```

- Set up the virtual environment (see below).
If you don’t have Miniconda installed, download it: https://www.anaconda.com/download/success
```bash
# Create the environment
conda env create -f environment/environment.yml

# Activate it
conda activate ModalityEval
```

You can run a premade evaluation with:

```bash
python ./main.py <Test Section (A-H)> <Test Number>
```

where the test section and number are chosen according to eval.py.
You can also create custom eval files by defining your own (1) multimodal wrapper, (2) dataset wrapper, and (3) text/image filters. For each of these, the framework provides an abstract base class you can implement to fit the pipeline. Then define your own benchmark manager and execute tests, similarly to how it is done in eval.py.
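The wiring of those pieces can be sketched as follows. Every class and method name here (`EchoWrapper`, `ListDataset`, `run_benchmark`, `generate`) is hypothetical and chosen only for illustration; the canonical interfaces are the abstract classes shipped with the framework, and eval.py shows the real wiring.

```python
class EchoWrapper:
    """Hypothetical stand-in for a multimodal model wrapper."""

    def generate(self, prompt):
        # Trivially "answers" with the last token of the prompt.
        return prompt.split()[-1]


class ListDataset:
    """Hypothetical dataset wrapper over (question, answer) pairs."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __iter__(self):
        return iter(self.pairs)


def run_benchmark(model, dataset, text_filter=None):
    """Toy benchmark loop: filter each question, query the model, score exact match."""
    correct = total = 0
    for question, answer in dataset:
        if text_filter is not None:
            question = text_filter(question)
        pred = model.generate(question)
        correct += int(pred == answer)
        total += 1
    return correct / total if total else 0.0


acc = run_benchmark(EchoWrapper(), ListDataset([("what is 2+2? 4", "4")]))
```

Once your wrapper, dataset, and filters satisfy the framework's abstract interfaces, the benchmark manager only needs to iterate, filter, query, and score in this way.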
```bash
./run_slurm <Test Section (A–H)> <Test Number>
```

All evaluation scripts reside in the eval_model/ directory. By default, outputs go to eval_model/results/ and generated plots to reports/.
- Aggregate results:

  ```bash
  python eval_model/eval_results.py
  ```

  Consolidates raw CSVs (`*_TF.csv`, `*_IF.csv`) into a single `eval_summary.csv`. Columns in `eval_summary.csv` are: `filter`, `correct`, `total`, and `accuracy`.
- Split summary CSV:

  ```bash
  python eval_model/separate_csv.py eval_summary.csv
  ```

  Reads `eval_summary.csv` and saves a summary with separate TF and IF accuracy columns into a new CSV file.
- Count empty entries:

  ```bash
  python eval_model/count_blank.py 'eval_model/results/*.csv'
  ```

  Counts empty entries in the last column of every `.csv` in the target folder and writes the results to `count_blank.csv` in the current directory.
- Plot accuracy vs. noise:

  ```bash
  python eval_model/plot.py eval_summary.csv --output reports/fig_text_image_accuracy_acl.pdf
  ```

  Generates `fig_text_image_accuracy_acl.pdf`, a plot of `accuracy` vs. shuffle probability for text (TF) and image (IF) inputs.
- Combine two CSV plots:

  ```bash
  python eval_model/plot2csv.py csv1.csv csv2.csv --output reports/combined_plot.pdf
  ```

  Creates a combined plot from two CSV files in a single figure.
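The aggregation step can be sketched in plain Python as follows. The raw-CSV layout assumed here (a `correct` field holding 0/1 per sample) is an assumption for illustration; the canonical logic lives in eval_model/eval_results.py.

```python
import csv
import glob
import os


def aggregate(results_glob, out_path):
    """Consolidate per-filter result CSVs into a filter/correct/total/accuracy summary."""
    rows = []
    for path in sorted(glob.glob(results_glob)):
        name = os.path.splitext(os.path.basename(path))[0]
        with open(path, newline="") as f:
            records = list(csv.DictReader(f))
        # Assumed convention: a sample is correct when its "correct" field is "1".
        correct = sum(r.get("correct") == "1" for r in records)
        total = len(records)
        rows.append({
            "filter": name,
            "correct": correct,
            "total": total,
            "accuracy": correct / total if total else 0.0,
        })
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["filter", "correct", "total", "accuracy"])
        writer.writeheader()
        writer.writerows(rows)
```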
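The TF/IF split can be sketched the same way, assuming filter names in `eval_summary.csv` end in `_TF` or `_IF` (the real script is eval_model/separate_csv.py, which may differ in details).

```python
import csv
from collections import defaultdict


def split_summary(in_path, out_path):
    """Merge _TF and _IF rows of a summary CSV into side-by-side accuracy columns."""
    merged = defaultdict(dict)
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            # "blur_TF" -> base name "blur", kind "TF" (naming convention assumed).
            name, _, kind = row["filter"].rpartition("_")
            merged[name][f"accuracy_{kind}"] = row["accuracy"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["filter", "accuracy_TF", "accuracy_IF"])
        writer.writeheader()
        for name, accs in merged.items():
            writer.writerow({"filter": name, **accs})
```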
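Finally, the blank-entry count amounts to scanning the last column of each matching CSV; this mirrors eval_model/count_blank.py in spirit, though the real script's exact behavior (e.g. header handling) may differ.

```python
import csv
import glob
import os


def count_blank(pattern):
    """Count rows whose last column is empty, per CSV file matching the glob."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        # Skip the header row; a blank last column usually means the model
        # produced no extractable answer for that sample.
        counts[os.path.basename(path)] = sum(
            1 for row in rows[1:] if row and row[-1].strip() == ""
        )
    return counts
```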
The full article PDF is located in the paper/ directory. Download it here: ModalityEval Article.
This project is licensed under the MIT License. See LICENSE for details.
