SAW-Bench (Situated Awareness in the Real World) is a benchmark for evaluating observer-centric situated awareness in multimodal foundation models (MFMs): the ability to reason about space, motion, and possible actions relative to one's own egocentric viewpoint as it evolves over time.
Unlike prior benchmarks that emphasize environment-centric relations (how objects relate to each other in a scene), SAW-Bench probes whether a model can maintain a coherent observer-centric spatial state from egocentric video. It comprises 786 real-world videos captured with Ray-Ban Meta (Gen 2) smart glasses and 2,071 human-annotated question-answer pairs across six tasks. Even the best model trails humans by 37.66%.
Paper: arXiv:2602.16682 · Project page · Dataset
| Task (key) | What it probes | # QA |
|---|---|---|
| Self-Localization (localization) | Where am I within the space (corner / side / center)? | 200 |
| Relative Direction (direction) | Where is a target relative to my current heading? | 834 |
| Route Shape (shape) | What is the shape of the path I traveled? | 546 |
| Reverse Route Plan (revplan) | How do I get back to where I started? | 229 |
| Spatial Memory (memory) | What changed in the scene between two visits? | 100 |
| Spatial Affordance (affordance) | What action is feasible from my current pose/position? | 162 |
| Total | | 2,071 |
This project uses uv.
git clone https://github.com/UCSB-AI/SAW-Bench.git
cd SAW-Bench
uv sync # core deps (hosted-API models + baselines)
uv sync --extra local # also install torch/transformers for local models
Then put your API keys in a .env file in the repo root (it's gitignored, so your
keys stay local). You only need the keys for the providers you want to run:
OPENAI_API_KEY=... # GPT models + the answer parser (parse_result.py)
GEMINI_API_KEY=... # Gemini models
DASHSCOPE_API_KEY=... # Qwen-API models
The benchmark (QA pairs and compressed videos) is hosted on the Hugging Face Hub
at ucsbai/SAW-Bench and is
not checked into this repo. Download and lay it out for evaluation with:
uv run bash scripts/download_data.sh # QA data + videos (~3 GB)
uv run bash scripts/download_data.sh --no-videos # QA data only
This downloads the Parquet shards, converts them into data/<task>.json, and
fetches the clips into videos_compressed/Scene_*/<key>.mp4, exactly the layout
the evaluation code reads from.
The pipeline has three stages: generate → parse → score. Run all three for one model with the helper script:
uv run bash scripts/run_eval.sh gemini-3-flash-preview 2 # <model> <fps>
uv run bash scripts/run_eval.sh blind # text-only baseline (no video; fps ignored)
That's all you need. Under the hood, run_eval.sh just runs the three Python
modules below in order. Call them directly only if you want finer control (e.g.
re-parse or re-score without re-generating).
uv run python src/evaluate.py --model gemini-3-flash-preview --fps 2
- --model: any model listed in src/config.json.
- --fps: sampling rate passed to the model (default 2). Mutually exclusive with --total_frames.
- --reasoning_type: restrict to specific tasks (comma-separated), default ALL.
Raw responses are written to results/<task>/<fps>/<model>.jsonl. Runs are
resumable: already-answered IDs are skipped if you re-run.
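The resume behaviour can be pictured with a short sketch. This is not the code in src/evaluate.py: it assumes each JSONL line is an object carrying the question id under an `id` field, and the helper names `answered_ids` and `pending` are hypothetical.

```python
import json
from pathlib import Path


def answered_ids(result_file: Path) -> set:
    """Collect the ids already present in a JSONL results file."""
    if not result_file.exists():
        return set()
    done = set()
    with result_file.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A run killed mid-write can leave a truncated line; skip it
                # so that question is simply answered again.
                continue
            # Assumption: each record stores its question id under "id".
            if isinstance(record, dict) and "id" in record:
                done.add(str(record["id"]))
    return done


def pending(queries: dict, result_file: Path) -> dict:
    """Return only the queries that still need a response."""
    done = answered_ids(result_file)
    return {qid: q for qid, q in queries.items() if qid not in done}
```

Appending to the results file and filtering with something like `pending` before each run is one simple way to make an interrupted evaluation safe to restart.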
uv run python src/parse_result.py
Converts free-text responses in results/ into a single choice (A/B/…) using
regex first and a GPT-4o-mini fallback, writing to parsed_results/.
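To illustrate the regex-first step, here is a minimal sketch. The patterns and the function name `parse_choice` are assumptions made for illustration, not the ones used in src/parse_result.py. A return value of `None` marks the point where a fallback (GPT-4o-mini in the real pipeline) would take over.

```python
import re
from typing import Optional

# Hypothetical patterns, tried from most to least explicit.
_PATTERNS = [
    # "The answer is (B)", "Answer: B"
    re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-Z])\)?(?![A-Za-z])"),
    # The whole response is a bare letter: "B", "(B)", "B."
    re.compile(r"^\s*\(?([A-Z])\)?\s*[.):]?\s*$"),
    # The response starts with a lettered option: "B. Corner", "(B) Corner"
    re.compile(r"^\s*\(?([A-Z])[.):]\s"),
]


def parse_choice(response: str, num_options: int) -> Optional[str]:
    """Return the chosen letter, or None if no pattern gives a valid one."""
    valid = {chr(ord("A") + i) for i in range(num_options)}
    for pattern in _PATTERNS:
        match = pattern.search(response)
        # Reject letters beyond the number of options for this question.
        if match and match.group(1) in valid:
            return match.group(1)
    return None
```

Keeping the letter case-sensitive avoids reading the article in "the answer is a corner" as choice A, at the cost of missing lowercase answers, which would then go to the fallback.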
uv run python src/get_score.py # overall accuracy per result file
uv run python src/result.py --fps 2 # leaderboard-style per-task accuracy table
See src/config.json for the full registry. Out of the box:
- Hosted APIs: Gemini (2.5 / 3), GPT-5.x, Qwen-VL (DashScope API).
- Baselines: blind (a text-only language-prior baseline that answers from the question and options alone, without any visual information; fps is ignored), socratic (caption-then-answer).
- Local (optional, needs --extra local): Qwen2.5/3-VL, LLaVA-NeXT-Video, LLaVA-OneVision, InternVL, VideoLLaVA.
To add a new model:
- Add src/generate_lib/<family>.py exposing generate_response(model_name, queries, fps, output_dir, shuffle=False).
- Register the model under its family in src/config.json.
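A skeleton for such an adapter module might look like the sketch below. Only the `generate_response` signature comes from this README; everything else is an assumption made for illustration: that `queries` is a dict of entries shaped like those in data/<task>.json, that `shuffle` permutes the option order, that output goes to `<output_dir>/<model_name>.jsonl` with `id`, `options`, and `response` fields, and the helpers `call_model` and `build_prompt` are hypothetical. Check an existing adapter in src/generate_lib/ for the actual contract.

```python
import json
import random
from pathlib import Path


def call_model(model_name, prompt, video_path, fps):
    """Hypothetical stand-in for the real inference call.

    Replace the body with your model's API or local inference code.
    """
    return "A"  # placeholder: always answers A


def build_prompt(entry, options):
    """Format the question and lettered options as a single prompt string."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [entry["question"]]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def generate_response(model_name, queries, fps, output_dir, shuffle=False):
    # Assumed output location; the real adapters may name files differently.
    out_path = Path(output_dir) / f"{model_name}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("a", encoding="utf-8") as fh:
        for qid, entry in queries.items():
            options = list(entry["options"])
            if shuffle:
                # Assumed meaning of `shuffle`: randomize option order.
                random.shuffle(options)
            scene = entry["key"].split("_", 1)[0]
            video = Path("videos_compressed") / f"Scene_{scene}" / f"{entry['key']}.mp4"
            response = call_model(model_name, build_prompt(entry, options), video, fps)
            # Store the option order shown to the model so the answer letter
            # can be mapped back to an option later.
            record = {"id": qid, "options": options, "response": response}
            fh.write(json.dumps(record) + "\n")
```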
Each data/<task>.json is a dict keyed by string id:
{
"0": {
"question": "Are you positioned near the corner, along the side, or near the center of the lawn?",
"options": ["Center", "Corner", "Side"],
"ground_truth": "Corner",
"answer": 1,
"key": "0_0",
"scene_category": "outdoor"
}
}
key is "<scene>_<video>" and maps to videos_compressed/Scene_<scene>/<key>.mp4.
answer is the index of ground_truth within options.
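The two rules above can be checked in a few lines. This is a small sketch; the function names `video_path`, `check_entry`, and `load_task` are hypothetical helpers, not part of the repo.

```python
import json
from pathlib import Path


def video_path(key: str, root: Path = Path("videos_compressed")) -> Path:
    """Map a "<scene>_<video>" key to its clip path."""
    scene = key.split("_", 1)[0]
    return root / f"Scene_{scene}" / f"{key}.mp4"


def check_entry(entry: dict) -> bool:
    """True if `answer` is the index of `ground_truth` within `options`."""
    idx = entry["answer"]
    options = entry["options"]
    return 0 <= idx < len(options) and options[idx] == entry["ground_truth"]


def load_task(task: str, data_dir: Path = Path("data")) -> dict:
    """Load data/<task>.json as a dict keyed by string id."""
    with (data_dir / f"{task}.json").open(encoding="utf-8") as fh:
        return json.load(fh)
```

For the sample entry shown above, `video_path("0_0")` resolves to videos_compressed/Scene_0/0_0.mp4, and `check_entry` returns `True` because `options[1]` is "Corner".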
SAW-Bench/
├── src/
│   ├── config.json # model registry + tasks + defaults
│   ├── evaluate.py # stage 1: generate responses
│   ├── parse_result.py # stage 2: parse to answer letters
│   ├── get_score.py # stage 3: overall accuracy
│   ├── result.py # stage 3: per-task leaderboard table
│   └── generate_lib/ # per-model adapters + prompts + frame sampling
├── scripts/
│   ├── download_data.sh # fetch QA + videos from the HF Hub
│   ├── run_eval.sh # run generate -> parse -> score for one model
│   ├── prepare_hf_dataset.py # (maintainer) build Parquet + upload to the HF Hub
│   └── hf_dataset_card.md # (maintainer) the HF dataset card
│
│   # the directories below are NOT in git; they are created at runtime:
├── data/ # QA pairs (download_data.sh)
├── videos_compressed/ # egocentric clips (download_data.sh)
├── results/ # raw model responses (run_eval.sh)
└── parsed_results/ # parsed answers + scores (run_eval.sh)
SAW-Bench consists of real-world egocentric videos. Please read this statement before using the data.
Collection & consent. Videos were self-recorded by participants who consented to wearing the camera (Ray-Ban Meta Gen 2 smart glasses). Recording took place in everyday indoor and outdoor environments, so incidental third parties (e.g., passers-by) and identifiable locations may appear in the background. No individuals were deliberately targeted, tracked, or directed.
Privacy minimization. Audio is removed from all clips, so no speech is included. A face/identity-blurred variant of the videos was produced during the study. Even so, faces, license plates, or other identifying details may remain partially visible in some frames.
Permitted use. The dataset is released for non-commercial academic research only, under CC BY-NC 4.0.
Prohibited use. You may not:
- attempt to identify, re-identify, locate, or contact any individual appearing in the videos;
- use the data to train or evaluate face-recognition, biometric, surveillance, or person-tracking systems;
- use the data for any commercial purpose.
Removal requests. If you appear in a video, or are a rights holder, and would like a clip removed, contact chuhan_li@ucsb.edu and we will promptly remove it.
By downloading the data you agree to these terms.
@inproceedings{li2026sawbench,
title = {{SAW}-Bench: Learning Situated Awareness in the Real World},
author = {Chuhan Li and Rilyn R. Han and Joy Hsu and Yongyuan Liang and
Rajiv Dhawan and Jiajun Wu and Ming-Hsuan Yang and Xin Eric Wang},
booktitle = {Forty-third International Conference on Machine Learning},
year = {2026},
url = {https://openreview.net/forum?id=8lwrYjv6r7}
}
Code and data are released under CC BY-NC 4.0.