Add image identification tasks by olearycrew · Pull Request #263 · pinchbench/skill

olearycrew · 2026-04-09T02:42:45Z

Test Plan

Use the --suite flag with the task ID.
./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_26_image_identification --no-upload
Direct Python entrypoint (equivalent):
python3 scripts/benchmark.py --model openrouter/openai/gpt-4o --suite task_26_image_identification --no-upload
Useful variants:

- Multiple runs for stability:
  ./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_26_image_identification --runs 3 --no-upload
- Use a direct API judge model:
  ./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_26_image_identification --judge openai/gpt-4o --no-upload
- Verbose debugging:
  ./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_26_image_identification --no-upload -v
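The variants above differ only in which optional flags are appended to the base command, so the argument list can be assembled programmatically. The following is a minimal sketch: `build_run_command` is a hypothetical helper, not part of the repository, and it uses only the flags that appear in this test plan (`--model`, `--suite`, `--runs`, `--judge`, `--no-upload`, `-v`).

```python
import shlex
from typing import List, Optional


def build_run_command(
    model: str,
    suite: str,
    runs: Optional[int] = None,
    judge: Optional[str] = None,
    verbose: bool = False,
    upload: bool = False,
) -> List[str]:
    """Assemble an argv list for scripts/run.sh (hypothetical helper)."""
    cmd = ["./scripts/run.sh", "--model", model, "--suite", suite]
    if runs is not None:
        cmd += ["--runs", str(runs)]  # repeat the suite for stability
    if judge is not None:
        cmd += ["--judge", judge]  # direct API judge model
    if not upload:
        cmd.append("--no-upload")  # keep results local
    if verbose:
        cmd.append("-v")  # verbose debugging
    return cmd


# Print the "multiple runs" variant as a shell-ready string.
print(shlex.join(build_run_command(
    "openrouter/openai/gpt-4o", "task_26_image_identification", runs=3)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.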

The result JSON is written under results/, and the transcript under results/<run_id>_transcripts/task_26_image_identification.jsonl.
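For inspecting a run afterwards, the transcript location can be derived from the run ID and read line by line. This is a sketch under assumptions: the path layout is taken from the sentence above, the record schema inside the `.jsonl` file is not described in this PR, and `transcript_path`, `parse_jsonl`, and `load_transcript` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, Iterable, List


def transcript_path(results_dir: str, run_id: str, task_id: str) -> Path:
    """Build results/<run_id>_transcripts/<task_id>.jsonl."""
    return Path(results_dir) / f"{run_id}_transcripts" / f"{task_id}.jsonl"


def parse_jsonl(lines: Iterable[str]) -> List[Any]:
    """Parse one JSON value per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def load_transcript(results_dir: str, run_id: str, task_id: str) -> List[Any]:
    """Read every record from a task transcript; the schema is not assumed."""
    path = transcript_path(results_dir, run_id, task_id)
    with path.open(encoding="utf-8") as fh:
        return parse_jsonl(fh)
```

Calling `load_transcript("results", run_id, "task_26_image_identification")` with the run ID of a finished run returns the transcript records as a list.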

ScuttleBot · 2026-04-09T03:08:10Z

🧪 Debug Test Results

Tested the image identification task on a fresh Vultr instance (Ubuntu 22.04, 2 CPU, 4GB RAM).

Results

| Model | Score | Time | Cost | Notes |
| --- | --- | --- | --- | --- |
| GPT-4o | ✅ 100% | 65s | $0.09 | All categories correct (phone, food, menu) |
| Claude Sonnet 4 | ✅ 100% | 70s | $0.14 | All categories correct |
| Gemini 2.5 Flash | ❌ 0% | 52s | $0.01 | Failed: claimed no image recognition capability |

Gemini Failure Details

Gemini refused to attempt classification, responding:

"I do not have direct image recognition or classification capabilities to analyze the content of these .jpg files."

This suggests that Gemini's vision capability may not be properly enabled in the OpenClaw agent configuration, or that the images are not reaching the model correctly in its prompt.

Grading Breakdown (passing models)

All 8 grading criteria passed:

file_created: ✅
valid_json_shape: ✅
has_required_categories: ✅
values_are_valid_paths: ✅
uses_each_image_once: ✅
phone_correct: ✅
food_correct: ✅
menu_correct: ✅
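The eight criteria above can be read as a chain of checks on the file the agent writes. The sketch below illustrates that chain and is not the task's actual grader: it assumes the output is a JSON object mapping each category (`phone`, `food`, `menu`) to one image path, which this PR does not state, and `grade`, `known_images`, and `expected` are hypothetical names.

```python
import json
from typing import Dict, Optional, Set

REQUIRED_CATEGORIES = ("phone", "food", "menu")
CRITERIA = (
    "file_created", "valid_json_shape", "has_required_categories",
    "values_are_valid_paths", "uses_each_image_once",
    "phone_correct", "food_correct", "menu_correct",
)


def grade(raw_text: Optional[str], known_images: Set[str],
          expected: Dict[str, str]) -> Dict[str, bool]:
    """Score an answer file's text against the eight criteria.

    raw_text is None when the agent never created the file.
    The {category: image_path} shape is an assumption of this sketch.
    """
    results = {name: False for name in CRITERIA}
    if raw_text is None:
        return results
    results["file_created"] = True

    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return results
    if not isinstance(data, dict) or not all(
        isinstance(v, str) for v in data.values()
    ):
        return results
    results["valid_json_shape"] = True

    values = list(data.values())
    results["has_required_categories"] = all(
        c in data for c in REQUIRED_CATEGORIES
    )
    results["values_are_valid_paths"] = all(v in known_images for v in values)
    # Every known image must appear exactly once among the values.
    results["uses_each_image_once"] = sorted(values) == sorted(known_images)
    for c in REQUIRED_CATEGORIES:
        results[f"{c}_correct"] = c in data and data[c] == expected.get(c)
    return results
```

Separating the structural checks from the per-category correctness checks lets a model earn partial credit for a well-formed answer with a wrong classification.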

Observations

- The task works correctly with vision-capable models.
- The Gemini issue needs investigation; it might be an OpenClaw config or API routing issue rather than a task problem.
- Cost-efficient: about $0.10 per run for GPT-4o.

Instance cleaned up after testing.

Add image identification tasks

9de1a20

olearycrew merged commit 643ccd9 into main Apr 9, 2026
1 check passed

olearycrew mentioned this pull request Apr 10, 2026

[task-proposal] Image Identification #137

Closed

olearycrew merged 1 commit into main from bdo/image-identification