Skip to content

chore(evals): unblock ToolPredictionScorer#962

Draft
JoshFerge wants to merge 1 commit into
mainfrom
chore/evals-unblock-prediction-scorer
Draft

chore(evals): unblock ToolPredictionScorer#962
JoshFerge wants to merge 1 commit into
mainfrom
chore/evals-unblock-prediction-scorer

Conversation

@JoshFerge
Copy link
Copy Markdown
Member

@JoshFerge JoshFerge commented May 13, 2026

Summary

Two unrelated bugs were both blocking the entire eval suite — every eval currently crashes with `MCPClientError: Connection closed` before any scoring runs. Both are fixed here.

  1. Stale CLI flag. `toolPredictionScorer` spawns `pnpm exec sentry-mcp` with `--access-token=mocked-access-token --all-scopes`. `--all-scopes` was removed in refactor: remove legacy grantedScopes in favor of grantedSkills #668 ("remove legacy grantedScopes in favor of grantedSkills"), so the spawned binary prints its usage banner and exits, surfacing as a closed-stdio error. Drop the flag.
  2. OpenAI strict response_format rejection. The prediction schema uses `arguments: z.record(z.any()).optional().default({})`. zod-to-JSON-Schema emits this as `additionalProperties: {}` (no `type` key), which OpenAI's strict structured-output endpoint refuses with `AI_APICallError: Invalid schema for response_format 'response': … schema must have a 'type' key`. Verified this also fails on the latest `@ai-sdk/openai@3.0.63` + `ai@6.0.182`, so it's not a stale-dep issue. Encode arguments as a JSON string instead.

With both fixed, the eval suite runs end-to-end again (verified locally — autofix.eval.ts cases score 1.00 against the current main).

Test plan

  • `pnpm run tsc`
  • `pnpm run lint`
  • `pnpm run test` — all 1341 tests pass
  • `pnpm vitest run autofix.eval.ts` with `OPENAI_API_KEY` set — both cases score 1.00

Notes for reviewers

  • `predictedTools[*].arguments` was already only read for the rationale/metadata field in the scorer's return value, so renaming to `argumentsJson` doesn't break scoring.
  • First in a stack of three. Subsequent PRs do the Sentry-side autofix migration to explorer-mode endpoints.

🤖 Generated with Claude Code

Agent transcript: https://claudescope.sentry.dev/share/HYSozfr18sU5lMEbTaUO5UX2YmVyaBdNoueQpOS6xJQ

- Drop the stale `--all-scopes` flag the bin no longer accepts; the
  scorer-side stdio spawn was printing usage and exiting, surfacing as
  `MCPClientError: Connection closed` across every eval.
- Switch the `predictedTools[*].arguments` field to a JSON-encoded string;
  `z.record(z.any())` emits `additionalProperties` with no `type`, which
  OpenAI's strict response_format rejects.

With both fixes, `autofix.eval.ts` scores 1.00 / 1.00 against the new
explorer-mode tool flow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
"exec",
"sentry-mcp",
"--access-token=mocked-access-token",
"--all-scopes",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im actually surprised this wouldnt break something

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants