Fix grader compatibility with OpenClaw transcripts#86

Open
jijivski wants to merge 1 commit into pinchbench:main from jijivski:fix/openclaw-transcript-compat-v2
Conversation

@jijivski

@jijivski jijivski commented Apr 1, 2026

Improve grader compatibility with current OpenClaw transcripts

The grader currently assumes a narrower transcript format than the one produced by the current OpenClaw runtime, which can lead to false negatives.

Changes:

  • read tool inputs from toolCall.arguments
  • support file alongside path / file_path
  • improve judge score parsing robustness

These changes do not alter task requirements; they only make grading align with real transcript output.
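The first two changes above can be sketched as a single lookup helper. This is a minimal illustration, not the actual patch: the function name `extract_file_arg` and the `input` fallback key are assumptions, but the key-preference order (`file_path`, `path`, then `file`) matches the behavior described.

```python
def extract_file_arg(tool_call: dict):
    """Read tool inputs from toolCall.arguments, accepting 'file'
    alongside 'path'/'file_path' (hypothetical sketch, not the PR code)."""
    # OpenClaw emits inputs under "arguments"; "input" is an assumed fallback
    args = tool_call.get("arguments") or tool_call.get("input") or {}
    for key in ("file_path", "path", "file"):
        if key in args:
            return args[key]
    return None
```

A transcript event like `{"arguments": {"file": "src/app.py"}}` would then grade the same as one using `file_path`.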


@ScuttleBot ScuttleBot left a comment


ScuttleBot review 🦀

Solid defensive fix. The grader was too rigid about transcript formats, causing false negatives on valid runs.

What's good:

  • _coerce_score_value() handles the full zoo of judge response formats (nested dicts, string numbers, boolean rejection)
  • Supporting file alongside path/file_path aligns with how OpenClaw actually emits tool calls
  • The refactor into _extract_named_scores() and _extract_total_score() is cleaner than the previous inline conditionals
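The coercion behavior praised above could look roughly like the following. This is a hedged sketch of what "nested dicts, string numbers, boolean rejection" implies, not the PR's actual `_coerce_score_value()`; the recursion key `"score"` is an assumption.

```python
def coerce_score_value(value):
    """Sketch: coerce a judge-response value to a float score, or None.
    Handles nested dicts, numeric strings, and rejects booleans."""
    # Check bool first: in Python, bool is a subclass of int, so True
    # would otherwise slip through the numeric branch as 1.0.
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.strip())
        except ValueError:
            return None
    if isinstance(value, dict):
        # Assumed convention: recurse into a nested "score" key
        return coerce_score_value(value.get("score"))
    return None
```

The bool-before-int ordering is the subtle part: without it, a judge replying `true` would be scored as `1.0` instead of being rejected.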

One question:

  • Task file changes (task_08, task_10, task_18) — are these tested against transcripts from multiple agents? The file param support looks correct but I want to confirm this doesn't break Cursor/Windsurf/Claude Code grading.

Otherwise LGTM. This will reduce the "score 0 but the agent clearly did the work" cases.

@ScuttleBot

Merge conflict resolution available

I've rebased this PR onto main and resolved the conflict in lib_grading.py. The conflict was between the new _parse_judge_text() function (added in main via #87) and the helper functions in this PR (_coerce_score_value, _extract_named_scores, _extract_total_score).

Resolution: Keep both — _parse_judge_text() first, then the helper functions. Both are needed.

@jijivski — could you rebase your branch onto main? The resolution is straightforward:

git fetch upstream
git rebase upstream/main
# Resolve lib_grading.py by keeping both function sets
git add scripts/lib_grading.py
git rebase --continue
git push --force-with-lease

Alternatively, @olearycrew has admin access and can use GitHub's "Update branch" button if the repo allows maintainer edits on this PR.

@olearycrew
Member

@jijivski can you take a look at the conflicts here?
