@christso christso commented Dec 31, 2025

Summary

  • Add two OpenSpec proposals for built-in tool evaluation features based on research across 7 evaluation frameworks
  • Add plugin showcase demonstrating patterns that should remain as code judges

OpenSpec Proposals

1. add-trajectory-argument-matching

Extend the tool_trajectory evaluator to validate tool arguments, not just tool names:

  • Exact match: args: { query: "weather" }
  • Pattern match: args: { query: /weather/i }
  • Skip validation: args: any
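The three modes above could behave roughly as in the following Python sketch. This is an illustration only, not the evaluator's implementation: `ANY` and `args_match` are hypothetical names, a compiled `re.Pattern` stands in for the `/weather/i` literal, and the sketch assumes that arguments not listed in the expectation are ignored.

```python
import re
from typing import Any, Mapping

# Hypothetical sentinel standing in for `args: any` (not part of the proposal text).
ANY = object()


def args_match(expected: Any, actual: Mapping[str, Any]) -> bool:
    """Return True if the actual tool-call arguments satisfy the expectation."""
    if expected is ANY:
        # Skip validation: any arguments are accepted.
        return True
    for key, want in expected.items():
        if key not in actual:
            return False
        got = actual[key]
        if isinstance(want, re.Pattern):
            # Pattern match: the regex must be found in a string value.
            if not isinstance(got, str) or want.search(got) is None:
                return False
        elif got != want:
            # Exact match: values must compare equal.
            return False
    # Assumption: extra keys in `actual` that are not listed in `expected` are ignored.
    return True
```

For example, `args_match({"query": re.compile("weather", re.I)}, {"query": "Weather in Paris"})` accepts the call, while the exact expectation `{"query": "weather"}` rejects it because of the differing case.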

2. add-execution-metrics

Add extended execution metrics to traces:

  • tokenUsage: { input, output, cached? }
  • costUsd, durationMs, toolDurations
  • Computed: explorationRatio, tokensPerTool
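The PR description does not define the computed metrics, so the sketch below only shows one plausible reading. It assumes that tokensPerTool is total input plus output tokens divided by the number of tool calls, and that explorationRatio is the share of tool calls that use read-only "exploration" tools. Both definitions, and every name in the code, are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class TokenUsage:
    """Mirrors tokenUsage: { input, output, cached? } from the proposal."""
    input: int
    output: int
    cached: Optional[int] = None


def tokens_per_tool(usage: TokenUsage, tool_calls: Sequence[str]) -> Optional[float]:
    """Assumed definition: (input + output tokens) / number of tool calls."""
    if not tool_calls:
        return None  # undefined when no tools were called
    return (usage.input + usage.output) / len(tool_calls)


def exploration_ratio(
    tool_calls: Sequence[str], exploration_tools: Iterable[str]
) -> Optional[float]:
    """Assumed definition: fraction of calls that hit read-only exploration tools."""
    if not tool_calls:
        return None
    exploring = set(exploration_tools)
    return sum(1 for name in tool_calls if name in exploring) / len(tool_calls)
```

Returning `None` for a trace with no tool calls avoids a division by zero and keeps "not applicable" distinct from a ratio of 0.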

Plugin Showcase

Demonstrates code judge plugins for domain-specific tool evaluation:

| Plugin | Purpose |
| --- | --- |
| tool_selection_judge.py | Semantic tool selection correctness |
| efficiency_scorer.py | Efficiency metrics with custom thresholds |
| pairwise_tool_compare.py | Pairwise comparison with bias mitigation |
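A common way to mitigate position bias in pairwise comparison is to run the judge twice with the candidates swapped and accept a winner only when both runs agree. The sketch below illustrates that idea. It is not the code in pairwise_tool_compare.py, and the `Judge` signature and `compare_both_orders` name are hypothetical.

```python
from typing import Callable

# Hypothetical judge signature: given (first, second), return "first", "second", or "tie".
Judge = Callable[[str, str], str]


def compare_both_orders(judge: Judge, a: str, b: str) -> str:
    """Return "A", "B", or "tie", requiring agreement across both presentation orders."""
    forward = judge(a, b)   # A is shown first
    backward = judge(b, a)  # B is shown first
    # Map positional verdicts back to candidate labels.
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")
    # A verdict that flips when the order flips is treated as position bias.
    return forward_winner if forward_winner == backward_winner else "tie"
```

A judge that always prefers whichever candidate it sees first produces conflicting verdicts across the two runs, so the result collapses to "tie" instead of rewarding the position.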

Test plan

  • openspec validate add-trajectory-argument-matching --strict passes
  • openspec validate add-execution-metrics --strict passes
  • Review proposals for completeness
  • Review plugin examples for correctness

🤖 Generated with Claude Code

Add two OpenSpec proposals for built-in tool evaluation features:

1. add-trajectory-argument-matching: Extend tool_trajectory evaluator
   to validate tool arguments (exact, pattern, or skip modes)

2. add-execution-metrics: Add extended execution metrics to traces
   (tokenUsage, costUsd, durationMs, explorationRatio)

Add plugin showcase demonstrating patterns that should remain as
code judges rather than built-ins:
- tool_selection_judge.py: Semantic tool selection evaluation
- efficiency_scorer.py: Custom efficiency thresholds
- pairwise_tool_compare.py: Position-bias-mitigated comparison

Based on research across 7 evaluation frameworks (Google ADK, Azure
SDK, Mastra, Sniffbench, LangWatch, Letta, Agent-Skills).