docs: add tool evaluation proposals and plugin showcase #91

christso · 2025-12-31T13:49:03Z

Summary

- Add two OpenSpec proposals for built-in tool evaluation features, based on research across 7 evaluation frameworks
- Add a plugin showcase demonstrating patterns that should remain as code judges

OpenSpec Proposals

1. `add-trajectory-argument-matching`

Extend the `tool_trajectory` evaluator to validate tool arguments, not just tool names:

- Exact match: `args: { query: "weather" }`
- Pattern match: `args: { query: /weather/i }`
- Skip validation: `args: any`
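
The three modes above can be sketched as a small matcher. This is only an illustration of the intended semantics, not the proposed implementation: `args_match` and the `ANY` sentinel are hypothetical names, and the choice to ignore extra keys in the actual arguments is an assumption the proposal does not state.

```python
import re
from typing import Any, Mapping

# Hypothetical sentinel standing in for `args: any` (skip validation).
ANY = object()


def args_match(expected: Any, actual: Mapping[str, Any]) -> bool:
    """Return True if the actual tool-call arguments satisfy the expected spec."""
    if expected is ANY:
        # Skip validation: any arguments are accepted.
        return True
    for key, want in expected.items():
        if key not in actual:
            return False
        got = actual[key]
        if isinstance(want, re.Pattern):
            # Pattern match: the compiled regex must match somewhere in a string value.
            if not isinstance(got, str) or want.search(got) is None:
                return False
        elif got != want:
            # Exact match: values must compare equal.
            return False
    # Assumption: keys present in `actual` but absent from `expected` are ignored.
    return True
```

Here `re.compile("weather", re.IGNORECASE)` plays the role of `/weather/i`, so `args_match({"query": re.compile("weather", re.IGNORECASE)}, {"query": "Weather in Paris"})` accepts the call while the exact-match spec `{"query": "weather"}` rejects it.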

2. `add-execution-metrics`

Add extended execution metrics to traces:

- `tokenUsage: { input, output, cached? }`
- `costUsd`, `durationMs`, `toolDurations`
- Computed: `explorationRatio`, `tokensPerTool`
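
As a rough sketch of how the computed fields might be derived: the definitions below are assumptions, since this description does not define them. Here `explorationRatio` is taken to be the fraction of tool calls that use read-only "exploration" tools, and `tokensPerTool` is taken to be total input plus output tokens divided by the number of tool calls. The tool names in `EXPLORATION_TOOLS` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TokenUsage:
    """Mirrors tokenUsage: { input, output, cached? }."""
    input: int
    output: int
    cached: Optional[int] = None


# Hypothetical set of read-only tools counted as "exploration".
EXPLORATION_TOOLS = frozenset({"read", "search", "list"})


def exploration_ratio(
    tool_names: Sequence[str],
    exploration_tools: frozenset = EXPLORATION_TOOLS,
) -> Optional[float]:
    """Assumed definition: share of tool calls that are exploration tools."""
    if not tool_names:
        return None  # undefined when no tools were called
    explored = sum(1 for name in tool_names if name in exploration_tools)
    return explored / len(tool_names)


def tokens_per_tool(usage: TokenUsage, tool_call_count: int) -> Optional[float]:
    """Assumed definition: (input + output tokens) per tool call."""
    if tool_call_count == 0:
        return None  # avoid division by zero
    return (usage.input + usage.output) / tool_call_count
```

Returning `None` for a trace with no tool calls keeps the metric distinguishable from a genuine ratio of zero.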

Plugin Showcase

Demonstrates code judge plugins for domain-specific tool evaluation:

| Plugin | Purpose |
| --- | --- |
| `tool_selection_judge.py` | Semantic tool selection correctness |
| `efficiency_scorer.py` | Efficiency metrics with custom thresholds |
| `pairwise_tool_compare.py` | Pairwise comparison with bias mitigation |
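
For the pairwise case, one common way to mitigate position bias is to run the judge in both presentation orders and accept a winner only when the two verdicts agree. The sketch below illustrates that general technique; it is not taken from `pairwise_tool_compare.py`, and `compare_debiased` and the `Judge` signature are hypothetical.

```python
from typing import Callable

# Hypothetical judge: given (first, second), returns "first" or "second".
Judge = Callable[[str, str], str]


def compare_debiased(a: str, b: str, judge: Judge) -> str:
    """Compare a and b in both orders; return "a", "b", or "tie"."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # a is shown second
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    # The verdict flipped with presentation order, so treat it as a tie.
    return "tie"
```

A judge that always prefers whichever candidate it sees first produces a tie under this scheme, which is the behavior the swap is meant to expose.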

Test plan

- `openspec validate add-trajectory-argument-matching --strict` passes
- `openspec validate add-execution-metrics --strict` passes
- Review proposals for completeness
- Review plugin examples for correctness

🤖 Generated with Claude Code

Add two OpenSpec proposals for built-in tool evaluation features:

1. add-trajectory-argument-matching: Extend tool_trajectory evaluator to validate tool arguments (exact, pattern, or skip modes)
2. add-execution-metrics: Add extended execution metrics to traces (tokenUsage, costUsd, durationMs, explorationRatio)

Add plugin showcase demonstrating patterns that should remain as code judges rather than built-ins:

- tool_selection_judge.py: Semantic tool selection evaluation
- efficiency_scorer.py: Custom efficiency thresholds
- pairwise_tool_compare.py: Position-bias-mitigated comparison

Based on research across 7 evaluation frameworks (Google ADK, Azure SDK, Mastra, Sniffbench, LangWatch, Letta, Agent-Skills).
