Add image identification tasks #263

Merged
olearycrew merged 1 commit into main from bdo/image-identification on Apr 9, 2026

@olearycrew
Member

Test Plan

Use the --suite flag with the task ID.
./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_26_image_identification --no-upload
Direct Python entrypoint (equivalent):
python3 scripts/benchmark.py --model openrouter/openai/gpt-4o --suite task_26_image_identification --no-upload
Useful variants:

  • Multiple runs for stability:
    ./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_26_image_identification --runs 3 --no-upload
  • Use a direct API judge model:
    ./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_26_image_identification --judge openai/gpt-4o --no-upload
  • Verbose debugging:
    ./scripts/run.sh --model openrouter/openai/gpt-4o --suite task_26_image_identification --no-upload -v

Result JSON will be written under results/ and transcript under results/<run_id>_transcripts/task_26_image_identification.jsonl.
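
For inspecting a run afterwards, a minimal Python sketch for loading the transcript file might look like the following. It assumes only that the `.jsonl` file holds one JSON object per line; the record schema is not described in this PR, so the helper name `load_transcript` is hypothetical and the records are returned as plain dicts.

```python
import json
from pathlib import Path


def load_transcript(path):
    """Read a JSONL transcript into a list of parsed records.

    Assumption: one JSON value per non-blank line. The fields inside each
    record are not specified in this PR, so nothing is interpreted here.
    """
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records
```

Pass it the transcript path printed for your run, substituting the real run ID for `<run_id>`.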

@ScuttleBot

🧪 Debug Test Results

Tested the image identification task on a fresh Vultr instance (Ubuntu 22.04, 2 CPU, 4GB RAM).

Results

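| Model            | Score   | Time | Cost  | Notes                                            |
|------------------|---------|------|-------|--------------------------------------------------|
| GPT-4o           | ✅ 100% | 65s  | $0.09 | All categories correct (phone, food, menu)       |
| Claude Sonnet 4  | ✅ 100% | 70s  | $0.14 | All categories correct                           |
| Gemini 2.5 Flash | ❌ 0%   | 52s  | $0.01 | Failed: claimed no image recognition capability  |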

Gemini Failure Details

Gemini refused to attempt classification, responding:

"I do not have direct image recognition or classification capabilities to analyze the content of these .jpg files."

This suggests that either Gemini's vision capability is not properly enabled in the OpenClaw agent configuration, or the images are not reaching the model in its prompt.
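
When triaging failures like this one, it can help to separate "the model refused" from "the model tried and misclassified". The sketch below is a rough heuristic, not part of the benchmark: the function name and the phrase patterns are assumptions seeded from the single refusal quoted above, and they would need extending as other refusal wordings turn up.

```python
import re

# Hypothetical phrase list, seeded from the one refusal quoted in this report.
REFUSAL_PATTERNS = [
    r"\b(?:do not|don't|cannot|can't|unable to)\b.{0,60}"
    r"\bimage (?:recognition|classification|analysis)\b",
    r"\bno (?:direct )?(?:image|vision) (?:recognition|capabilit(?:y|ies))\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE | re.DOTALL)


def looks_like_vision_refusal(text):
    """Return True if the response text reads like a 'no vision' refusal."""
    return _REFUSAL_RE.search(text) is not None
```

A match only flags a response for a closer look; it does not say whether the cause is the agent configuration, the API route, or the model itself.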

Grading Breakdown (passing models)

All 8 grading criteria passed:

  • file_created: ✅
  • valid_json_shape: ✅
  • has_required_categories: ✅
  • values_are_valid_paths: ✅
  • uses_each_image_once: ✅
  • phone_correct: ✅
  • food_correct: ✅
  • menu_correct: ✅
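
To make the eight criteria concrete, here is a hedged Python sketch of how checks with these names could be computed. It is not the task's actual grader. It assumes the agent writes a JSON object mapping each category (`phone`, `food`, `menu`) to an image path, that the images live in one directory, and that the output file sits outside that directory; `grade`, `answer_file`, `image_dir`, and `expected` are all hypothetical names.

```python
import json
from pathlib import Path

REQUIRED_CATEGORIES = ("phone", "food", "menu")  # names taken from the results table

CRITERIA = [
    "file_created", "valid_json_shape", "has_required_categories",
    "values_are_valid_paths", "uses_each_image_once",
    "phone_correct", "food_correct", "menu_correct",
]


def grade(answer_file, image_dir, expected):
    """Return {criterion: bool} for the eight checks listed above.

    answer_file: JSON file written by the agent (assumed {category: path}).
    image_dir:   existing directory holding only the task images.
    expected:    hypothetical ground truth, {category: image filename}.
    """
    answer_file, image_dir = Path(answer_file), Path(image_dir)
    checks = dict.fromkeys(CRITERIA, False)

    checks["file_created"] = answer_file.is_file()
    if not checks["file_created"]:
        return checks

    try:
        answer = json.loads(answer_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return checks
    checks["valid_json_shape"] = isinstance(answer, dict) and all(
        isinstance(v, str) for v in answer.values()
    )
    if not checks["valid_json_shape"]:
        return checks

    checks["has_required_categories"] = set(REQUIRED_CATEGORIES) <= set(answer)

    # Compare by filename so relative and absolute paths are treated alike.
    names = [Path(v).name for v in answer.values()]
    checks["values_are_valid_paths"] = all((image_dir / n).is_file() for n in names)
    available = sorted(p.name for p in image_dir.iterdir() if p.is_file())
    checks["uses_each_image_once"] = sorted(names) == available

    for cat in REQUIRED_CATEGORIES:
        value = answer.get(cat)
        checks[f"{cat}_correct"] = value is not None and Path(value).name == expected[cat]
    return checks
```

Returning every criterion separately, rather than a single pass/fail, mirrors the per-criterion breakdown in this report.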

Observations

  1. Task works correctly when the model's vision input is working: GPT-4o and Claude Sonnet 4 both scored 100%.
  2. The Gemini issue needs investigation; it might be an OpenClaw config or API routing issue rather than a task problem.
  3. Cost efficient: about $0.09 per run for GPT-4o.

Instance cleaned up after testing.

@olearycrew olearycrew merged commit 643ccd9 into main Apr 9, 2026
1 check passed

2 participants