SAW-Bench: Learning Situated Awareness in the Real World

SAW-Bench (Situated Awareness in the Real World) is a benchmark for evaluating observer-centric situated awareness in multimodal foundation models (MFMs) — the ability to reason about space, motion, and possible actions relative to one's own egocentric viewpoint as it evolves over time.

Unlike prior benchmarks that emphasize environment-centric relations (how objects relate to each other in a scene), SAW-Bench probes whether a model can maintain a coherent observer-centric spatial state from egocentric video. It comprises 786 real-world videos captured with Ray-Ban Meta (Gen 2) smart glasses and 2,071 human-annotated question–answer pairs across six tasks. Even the best model trails humans by 37.66%.

📄 Paper: arXiv:2602.16682 · 🌐 Project page · 🤗 Dataset

The six tasks

| Task (`key`) | What it probes | # QA |
|---|---|---|
| Self-Localization (`localization`) | Where am I within the space (corner / side / center)? | 200 |
| Relative Direction (`direction`) | Where is a target relative to my current heading? | 834 |
| Route Shape (`shape`) | What is the shape of the path I traveled? | 546 |
| Reverse Route Plan (`revplan`) | How do I get back to where I started? | 229 |
| Spatial Memory (`memory`) | What changed in the scene between two visits? | 100 |
| Spatial Affordance (`affordance`) | What action is feasible from my current pose/position? | 162 |
| Total | | 2,071 |

Installation

This project uses uv.

git clone https://github.com/UCSB-AI/SAW-Bench.git
cd SAW-Bench
uv sync                      # core deps (hosted-API models + baselines)
uv sync --extra local        # also install torch/transformers for local models

Then put your API keys in a .env file in the repo root (it's gitignored, so your keys stay local). You only need the keys for the providers you want to run:

OPENAI_API_KEY=...      # GPT models + the answer parser (parse_result.py)
GEMINI_API_KEY=...      # Gemini models
DASHSCOPE_API_KEY=...   # Qwen-API models
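
To see which providers are configured before starting a run, a small check along these lines may help. It is a minimal sketch and not part of the repository: the helper names `read_env_file` and `configured_providers` are hypothetical, and the parser only handles simple `KEY=value` lines like the ones above.

```python
import os
from pathlib import Path

# Key names are taken from the README; the descriptions are only labels.
PROVIDER_KEYS = {
    "OPENAI_API_KEY": "GPT models + answer parser",
    "GEMINI_API_KEY": "Gemini models",
    "DASHSCOPE_API_KEY": "Qwen-API models",
}


def read_env_file(path=".env"):
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        # Assumption: values never contain '#', so it always starts a comment.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_providers(path=".env"):
    """Return the key names that have a non-empty value in .env or the shell."""
    merged = {**read_env_file(path), **os.environ}
    return [key for key in PROVIDER_KEYS if merged.get(key)]
```

Calling `configured_providers()` from the repo root lists the keys that are set, without printing their values.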

Get the data

The benchmark — QA pairs and compressed videos — is hosted on the Hugging Face Hub at ucsbai/SAW-Bench and is not checked into this repo. Download and lay it out for evaluation with:

uv run bash scripts/download_data.sh             # QA data + videos (~3 GB)
uv run bash scripts/download_data.sh --no-videos # QA data only

This downloads the Parquet shards, converts them into data/<task>.json, and fetches the clips into videos_compressed/Scene_*/<key>.mp4 — exactly the layout the evaluation code reads from.
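
If a run fails with missing files, the layout described above can be checked with a short script. The following is only a sketch: `missing_files` is a hypothetical helper, and it assumes that `<task>` in `data/<task>.json` is the task key from the table (`localization`, `direction`, and so on) and that `<scene>` is the part of `key` before the first underscore.

```python
import json
from pathlib import Path

# Assumption: file names use the task keys listed in the task table.
TASK_KEYS = ["localization", "direction", "shape", "revplan", "memory", "affordance"]


def missing_files(root=".", check_videos=True):
    """Return a sorted list of expected QA files and clips absent under root."""
    root = Path(root)
    missing = set()
    for task in TASK_KEYS:
        qa_path = root / "data" / f"{task}.json"
        if not qa_path.exists():
            missing.add(str(qa_path))
            continue
        if not check_videos:
            continue
        with qa_path.open() as f:
            entries = json.load(f)
        for entry in entries.values():
            key = entry["key"]
            scene = key.split("_", 1)[0]
            clip = root / "videos_compressed" / f"Scene_{scene}" / f"{key}.mp4"
            if not clip.exists():
                # Several questions can share one clip; the set removes repeats.
                missing.add(str(clip))
    return sorted(missing)
```

Pass `check_videos=False` after a `--no-videos` download so that only the QA files are checked.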

Run the evaluation

The pipeline has three stages: generate → parse → score. Run all three for one model with the helper script:

uv run bash scripts/run_eval.sh gemini-3-flash-preview 2   # <model> <fps>
uv run bash scripts/run_eval.sh blind                      # text-only baseline (no video; fps ignored)

That's all you need. Under the hood run_eval.sh just runs the three Python modules below in order — call them directly only if you want finer control (e.g. re-parse or re-score without re-generating).
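
For readers who prefer to drive the stages from Python, the order described above can be sketched as follows. This is an illustration only: `stage_commands` and `run_pipeline` are hypothetical names, the commands are the ones documented in this README, and the actual `run_eval.sh` may pass extra flags or also call `src/result.py`.

```python
import subprocess


def stage_commands(model, fps=2):
    """Build the generate, parse and score commands in pipeline order."""
    return [
        ["uv", "run", "python", "src/evaluate.py", "--model", model, "--fps", str(fps)],
        ["uv", "run", "python", "src/parse_result.py"],
        ["uv", "run", "python", "src/get_score.py"],
    ]


def run_pipeline(model, fps=2):
    """Run the stages one after another, stopping at the first failure."""
    for cmd in stage_commands(model, fps):
        # check=True raises CalledProcessError if a stage exits non-zero.
        subprocess.run(cmd, check=True)
```

Stopping at the first failing stage avoids parsing or scoring a partially generated result set.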

1. Generate model responses

uv run python src/evaluate.py --model gemini-3-flash-preview --fps 2

--model: any model listed in src/config.json.
--fps: frame sampling rate, in frames per second, passed to the model (default 2). Mutually exclusive with --total_frames.
--reasoning_type: comma-separated list of tasks to run (default ALL).

Raw responses are written to results/<task>/<fps>/<model>.jsonl. Runs are resumable — already-answered IDs are skipped if you re-run.
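
The resume behaviour can be pictured roughly as below. This is a sketch under assumptions, not the code in `src/evaluate.py`: the helpers `answered_ids` and `pending_queries` are hypothetical, and so is the `"id"` field assumed on each JSONL line.

```python
import json
from pathlib import Path


def answered_ids(result_path):
    """Collect the ids already recorded in a JSONL results file."""
    path = Path(result_path)
    done = set()
    if not path.exists():
        return done
    with path.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Assumption: each line is a JSON object with an "id" field.
            done.add(str(json.loads(line)["id"]))
    return done


def pending_queries(queries, result_path):
    """Keep only the queries that have no recorded response yet."""
    done = answered_ids(result_path)
    return {qid: q for qid, q in queries.items() if str(qid) not in done}
```

Because responses are appended line by line, an interrupted run loses at most the query that was in flight.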

2. Parse responses into answer letters

uv run python src/parse_result.py

Converts free-text responses in results/ into a single choice (A/B/…) using regex first and a GPT-4o-mini fallback, writing to parsed_results/.
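
To give a feel for the regex-first step, here is a minimal sketch. The patterns and the `extract_choice` helper are hypothetical and may differ from what `src/parse_result.py` actually uses; returning `None` marks the point where an LLM fallback would take over.

```python
import re

# Hypothetical patterns, tried in order. The choice letter must be uppercase
# so that ordinary words ("answer is a ...") are not mistaken for a choice.
_PATTERNS = [
    # "The answer is B.", "Answer: (C)"
    re.compile(r"(?i:answer)\s*(?:(?i:is)\s*)?:?\s*\(?([A-Z])\)?(?![A-Za-z])"),
    # The whole response is a single letter: "B", "(B)", "B."
    re.compile(r"^\s*\(?([A-Z])\)?[.:]?\s*$"),
]


def extract_choice(response, num_options):
    """Return the choice letter found in response, or None if unclear."""
    valid = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:num_options]
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match and match.group(1) in valid:
            return match.group(1)
    return None  # a fallback parser would handle this case
```

Restricting matches to the letters that exist for a given question keeps a stray capital from being read as an answer.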

3. Score

uv run python src/get_score.py        # overall accuracy per result file
uv run python src/result.py --fps 2   # leaderboard-style per-task accuracy table
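
Per-task accuracy itself is a simple ratio. The sketch below shows the calculation under stated assumptions: `accuracy_by_task` is a hypothetical helper, records are `(task, predicted_letter, answer_index)` tuples, and letter `A` is assumed to correspond to option index 0. The real scripts read from parsed_results/ and may handle option order differently.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy in percent} for (task, letter, index) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, answer_index in records:
        total[task] += 1
        # Assumption: "A" maps to index 0. An unparsed answer (None) is wrong.
        if predicted == chr(ord("A") + answer_index):
            correct[task] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}
```

Counting unparsed responses as incorrect keeps the denominator equal to the number of questions in each task.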

Supported models

See src/config.json for the full registry. Out of the box:

Hosted APIs: Gemini (2.5 / 3), GPT-5.x, Qwen-VL (DashScope API).
Baselines: blind (a text-only language-prior baseline — it answers from the question and options alone, without any visual information; fps is ignored), socratic (caption-then-answer).
Local (optional, needs --extra local): Qwen2.5/3-VL, LLaVA-NeXT-Video, LLaVA-OneVision, InternVL, VideoLLaVA.

Adding a new model

Add src/generate_lib/<family>.py exposing generate_response(model_name, queries, fps, output_dir, shuffle=False).
Register the model under its family in src/config.json.
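
A new adapter might start from a skeleton like the one below. Only the `generate_response` signature comes from this README. Everything else is an assumption made for illustration: the shape of `queries` (taken to mirror data/<task>.json), the output file name, the `"id"` and `"response"` fields, and the placeholder `_call_model`. Check an existing adapter in src/generate_lib/ for the real conventions.

```python
import json
from pathlib import Path


def _call_model(model_name, question, options, video_key, fps):
    """Placeholder for the real inference call; replace with your model's API."""
    raise NotImplementedError


def generate_response(model_name, queries, fps, output_dir, shuffle=False):
    """Answer each query and append one JSON line per response.

    Assumes queries maps id -> dict with "question", "options" and "key".
    shuffle is accepted to match the expected signature; option shuffling
    is not sketched here.
    """
    out_path = Path(output_dir) / f"{model_name}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("a") as f:
        for qid, query in queries.items():
            text = _call_model(
                model_name, query["question"], query["options"], query["key"], fps
            )
            # Hypothetical record layout: one JSON object per line.
            f.write(json.dumps({"id": qid, "response": text}) + "\n")
```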

Data format

Each data/<task>.json is a dict keyed by string id:

{
  "0": {
    "question": "Are you positioned near the corner, along the side, or near the center of the lawn?",
    "options": ["Center", "Corner", "Side"],
    "ground_truth": "Corner",
    "answer": 1,
    "key": "0_0",
    "scene_category": "outdoor"
  }
}

key is "<scene>_<video>" and maps to videos_compressed/Scene_<scene>/<key>.mp4. answer is the index of ground_truth within options.
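
The relationship between `answer`, `ground_truth` and `options` can be verified with a few lines of Python. This is a sketch: `load_task` and `answer_letter` are hypothetical helpers, and mapping index 0 to the letter `A` is an assumption about how options are presented to the model.

```python
import json


def load_task(path):
    """Load one data/<task>.json file into a dict keyed by string id."""
    with open(path) as f:
        return json.load(f)


def answer_letter(entry):
    """Check that answer indexes ground_truth in options; return its letter."""
    options = entry["options"]
    index = entry["answer"]
    if not 0 <= index < len(options):
        raise ValueError(f"answer index {index} is out of range")
    if options[index] != entry["ground_truth"]:
        raise ValueError("answer does not point at ground_truth")
    # Assumption: options keep their stored order, so index 0 is "A".
    return chr(ord("A") + index)


example = {
    "question": "Are you positioned near the corner, along the side, "
                "or near the center of the lawn?",
    "options": ["Center", "Corner", "Side"],
    "ground_truth": "Corner",
    "answer": 1,
    "key": "0_0",
    "scene_category": "outdoor",
}
print(answer_letter(example))  # prints B
```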

Repository layout

SAW-Bench/
├── src/
│   ├── config.json             # model registry + tasks + defaults
│   ├── evaluate.py             # stage 1: generate responses
│   ├── parse_result.py         # stage 2: parse to answer letters
│   ├── get_score.py            # stage 3: overall accuracy
│   ├── result.py               # stage 3: per-task leaderboard table
│   └── generate_lib/           # per-model adapters + prompts + frame sampling
├── scripts/
│   ├── download_data.sh        # fetch QA + videos from the HF Hub
│   ├── run_eval.sh             # run generate -> parse -> score for one model
│   ├── prepare_hf_dataset.py   # (maintainer) build Parquet + upload to the HF Hub
│   └── hf_dataset_card.md      # (maintainer) the HF dataset card
│
│   # the directories below are NOT in git — they are created at runtime:
├── data/                       # QA pairs        (download_data.sh)
├── videos_compressed/          # egocentric clips (download_data.sh)
├── results/                    # raw model responses     (run_eval.sh)
└── parsed_results/             # parsed answers + scores (run_eval.sh)

Ethics, privacy & responsible use

SAW-Bench consists of real-world egocentric videos. Please read this statement before using the data.

Collection & consent. Videos were self-recorded by participants who consented to wearing the camera (Ray-Ban Meta Gen 2 smart glasses). Recording took place in everyday indoor and outdoor environments, so incidental third parties (e.g., passers-by) and identifiable locations may appear in the background. No individuals were deliberately targeted, tracked, or directed.

Privacy minimization. Audio is removed from all clips, so no speech is included. A face/identity-blurred variant of the videos was produced during the study. Even so, faces, license plates, or other identifying details may remain partially visible in some frames.

Permitted use. The dataset is released for non-commercial academic research only, under CC BY-NC 4.0.

Prohibited use. You may not:

attempt to identify, re-identify, locate, or contact any individual appearing in the videos;
use the data to train or evaluate face-recognition, biometric, surveillance, or person-tracking systems;
use the data for any commercial purpose.

Removal requests. If you appear in a video, or are a rights holder, and would like a clip removed, contact chuhan_li@ucsb.edu and we will promptly remove it.

By downloading the data you agree to these terms.

Citation

@inproceedings{li2026sawbench,
  title     = {{SAW}-Bench: Learning Situated Awareness in the Real World},
  author    = {Chuhan Li and Rilyn R. Han and Joy Hsu and Yongyuan Liang and
               Rajiv Dhawan and Jiajun Wu and Ming-Hsuan Yang and Xin Eric Wang},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=8lwrYjv6r7}
}

License

Code and data are released under CC BY-NC 4.0.
