Fix grader compatibility with OpenClaw transcripts #86
jijivski wants to merge 1 commit into pinchbench:main
ScuttleBot left a comment
ScuttleBot review 🦀
Solid defensive fix. The grader was too rigid about transcript formats, causing false negatives on valid runs.
What's good:
- `_coerce_score_value()` handles the full zoo of judge response formats (nested dicts, string numbers, boolean rejection)
- Supporting `file` alongside `path`/`file_path` aligns with how OpenClaw actually emits tool calls
- The refactor into `_extract_named_scores()` and `_extract_total_score()` is cleaner than the previous inline conditionals
One question:
- Task file changes (task_08, task_10, task_18): are these tested against transcripts from multiple agents? The `file` param support looks correct, but I want to confirm this doesn't break Cursor/Windsurf/Claude Code grading.
Otherwise LGTM. This will reduce the "score 0 but the agent clearly did the work" cases.
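For readers without the diff open, the coercion behaviour described in the review could look roughly like the sketch below. It is not the PR's code: `coerce_score_sketch` is a hypothetical stand-in for `_coerce_score_value()`, and the dict keys `score` and `value` are assumptions about what a judge response might contain.

```python
from typing import Any, Optional


def coerce_score_sketch(value: Any) -> Optional[float]:
    """Hypothetical stand-in for _coerce_score_value(); not the PR's code."""
    # Reject booleans first: bool is a subclass of int in Python, so True
    # would otherwise be silently read as a score of 1.0.
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # String numbers such as "0.75" are accepted; anything else is not.
        try:
            return float(value.strip())
        except ValueError:
            return None
    if isinstance(value, dict):
        # Assumed key names; a real judge response may use different ones.
        for key in ("score", "value"):
            if key in value:
                return coerce_score_sketch(value[key])
    return None
```

Returning `None` rather than `0.0` for unusable input lets the caller distinguish "the judge gave no score" from "the judge gave a score of zero", which is the distinction behind the false negatives mentioned above.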
Merge conflict resolution available

I've rebased this PR onto main and resolved the conflict.

Resolution: Keep both

@jijivski, could you rebase your branch onto main? The resolution is straightforward:

    git fetch upstream
    git rebase upstream/main
    # Resolve lib_grading.py by keeping both function sets
    git add scripts/lib_grading.py
    git rebase --continue
    git push --force-with-lease

Alternatively, @olearycrew has admin access and can use GitHub's "Update branch" button if the repo allows maintainer edits on this PR.
@jijivski can you take a look at the conflicts here?
Improve grader compatibility with current OpenClaw transcripts
The grader currently assumes a narrower transcript format than the one produced by the current OpenClaw runtime, which can lead to false negatives.
Changes:
- `toolCall.arguments`
- `file` alongside `path`/`file_path`
These changes do not alter task requirements; they only make grading align with real transcript output.
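As an illustration of the `file` / `path` / `file_path` handling, a grader-side lookup might be sketched as follows. The function name `tool_call_target_sketch`, the `PATH_KEYS` tuple, the lookup order, and the assumption that a tool call is a dict with an `arguments` dict are all hypothetical; the actual task graders may be structured differently.

```python
from typing import Any, Dict, Optional

# Assumed lookup order; "file" is the key this PR adds support for.
PATH_KEYS = ("path", "file_path", "file")


def tool_call_target_sketch(tool_call: Dict[str, Any]) -> Optional[str]:
    """Return the file a tool call targets, or None if none is given.

    Hypothetical helper for illustration; not taken from the PR.
    """
    args = tool_call.get("arguments")
    if not isinstance(args, dict):
        return None
    for key in PATH_KEYS:
        candidate = args.get(key)
        # Only non-empty strings count as a usable path.
        if isinstance(candidate, str) and candidate:
            return candidate
    return None
```

Because the existing keys are checked before `file`, a transcript that already used `path` or `file_path` resolves to the same value as before, which is one way to keep grading unchanged for other agents.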