Changes from all commits (46 commits)
d019e0b
feat: Add ExecuteSkillScriptTool for running skill scripts via code e…
caohy1988 Feb 21, 2026
06d995e
fix: Address Gemini Code Assist review — shell injection, shlex, chec…
caohy1988 Feb 21, 2026
e83de80
fix: Address code review findings for ExecuteSkillScriptTool
caohy1988 Feb 21, 2026
52b8563
docs: Add code executor enhancements design document
caohy1988 Feb 22, 2026
4142c37
docs: Address 8 architectural review findings in code executor design…
caohy1988 Feb 22, 2026
8ca1111
docs: Fix 6 review findings — execution_id, PID namespace, Py version…
caohy1988 Feb 22, 2026
d8692ba
docs: Fix container timeout DoS, pkill scope, stale recommendations
caohy1988 Feb 22, 2026
f55da53
docs: Align roadmap with Option A, unify recovery policy, fix fallbac…
caohy1988 Feb 22, 2026
f4fd794
docs: Fix PermissionError kill fallback, align non-goals with Option A
caohy1988 Feb 22, 2026
4bb83a0
docs: Surface cleanup failure as unhealthy state, add post-kill threa…
caohy1988 Feb 22, 2026
3221ac1
docs: Add _healthy guard and post-restart readiness validation
caohy1988 Feb 22, 2026
c3a003d
docs: Document _healthy lifecycle (init, failure, reinit)
caohy1988 Feb 22, 2026
369bba8
docs: Add public reinitialize() method to ContainerCodeExecutor API
caohy1988 Feb 22, 2026
c735183
docs: Check exit_code on post-restart readiness validation
caohy1988 Feb 22, 2026
11f65f0
feat: Add SkillsBench Docker-based evaluation pipeline
caohy1988 Feb 23, 2026
f9a78a6
fix: Add per-command timeout to Docker executor
caohy1988 Feb 23, 2026
8169ce3
Rename ExecuteSkillScriptTool to RunSkillScriptTool
caohy1988 Feb 24, 2026
446f8a6
Rename ExecuteSkillScriptTool to RunSkillScriptTool
caohy1988 Feb 24, 2026
e2445c2
Add optional path and working_dir to execution dataclasses
caohy1988 Feb 24, 2026
8b99a5e
Implement temporary directory for code execution
caohy1988 Feb 24, 2026
c2e92cf
Enhance code execution with working directory support
caohy1988 Feb 24, 2026
c77ac54
Add 'path' field to input files in tests
caohy1988 Feb 24, 2026
e77473d
Add design for skill execution script in ADK
caohy1988 Feb 24, 2026
e81293c
fix: Add Docker build retry logic and skill script test coverage
caohy1988 Feb 24, 2026
58d1feb
fix: Fix tool name mismatch, empty file handling, and args validation
caohy1988 Feb 24, 2026
6cf4522
refactor: Remove allowed_tools resolution and global execution lock
caohy1988 Feb 24, 2026
a9b4f6a
fix: Restore execution lock to prevent concurrent data races
caohy1988 Feb 24, 2026
04198b0
refactor: Only sandbox executor when input_files or working_dir is set
caohy1988 Feb 24, 2026
3532cef
fix: Hold execution lock for both sandbox and plain paths
caohy1988 Feb 24, 2026
b65d353
test: Add concurrency test for mixed sandbox/plain execution
caohy1988 Feb 24, 2026
4e49596
test: Assert plain-path cwd stays original in concurrency test
caohy1988 Feb 24, 2026
0f4165a
docs(code-executor): align design doc with RunSkillScriptTool and cur…
caohy1988 Feb 24, 2026
97397ba
docs(code-executor): prioritize run-script roadmap and tool contract …
caohy1988 Feb 24, 2026
4e69ace
docs(code-executor): fix 18 inaccuracies in design doc vs implementation
caohy1988 Feb 24, 2026
36f0148
docs: add P0 RFC for RunSkillScriptTool production-readiness
caohy1988 Feb 24, 2026
702a77f
docs(rfc): address 6 review findings in RunSkillScriptTool P0 RFC
caohy1988 Feb 24, 2026
992b864
docs(code-executor): address 7 review findings in design doc
caohy1988 Feb 24, 2026
c950d79
docs: address 6 cross-doc review findings in design doc and RFC
caohy1988 Feb 24, 2026
d94527c
docs: fix 4 remaining cross-doc inconsistencies
caohy1988 Feb 24, 2026
faa5fa2
docs: final polish — SemVer policy, missing import, reinit failure mode
caohy1988 Feb 24, 2026
ab1dd28
docs(code-executor): align §6.5 graduation row with SemVer policy
caohy1988 Feb 24, 2026
0272370
feat(benchmarks): add BigQueryBench evaluation pipeline
caohy1988 Feb 24, 2026
b164b87
docs(bigquerybench): add complete walkthrough for new BigQuery skill …
caohy1988 Feb 24, 2026
70a2f3f
refactor(bigquerybench): simplify to trace-based evaluation
caohy1988 Feb 24, 2026
faf852c
feat(bigquerybench): add skill invocation + LLM-as-judge instruction …
caohy1988 Feb 24, 2026
76912e5
feat(bigquerybench): add Vertex AI API key auth, retry backoff, and o…
caohy1988 Feb 24, 2026
337 changes: 337 additions & 0 deletions benchmarks/bigquerybench/README.md
@@ -0,0 +1,337 @@
# BigQueryBench: Skill Invocation & Instruction Adherence Evaluation

## Overview

BigQueryBench evaluates agents built with ADK's `SkillToolset` on
two dimensions:

1. **Skill invocation correctness** (trace-based) — Did the agent
call the right skill tools with the right arguments?
2. **Instruction adherence** (LLM-as-judge) — Did the agent follow
the skill's instructions and produce correct results?

The trace-based checks are deterministic. The instruction adherence
checks use natural-language rubrics evaluated by a judge LLM, making
them easy to write and robust to exact-wording variance.

## Quick Start

```bash
# Skill-only mode (no BigQuery credentials needed):
export GOOGLE_CLOUD_API_KEY=your-vertex-ai-api-key

# Full mode (BigQuery + skills — requires ADC + project):
export GOOGLE_CLOUD_PROJECT=your-project-id
export GOOGLE_CLOUD_API_KEY=your-vertex-ai-api-key

# Run all eval cases
python -m benchmarks.bigquerybench.runner

# Run one case
python -m benchmarks.bigquerybench.runner --filter skill_load

# Dry-run (validate JSON only, no LLM calls)
python -m benchmarks.bigquerybench.runner --dry-run

# Run unit tests (no API keys needed)
pytest tests/unittests/benchmarks/bigquerybench/ -v
```
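
Before spending any LLM calls, `--dry-run` validates the eval-case JSON. As a rough illustration of what such a validation conceptually checks, here is a hypothetical helper (not `runner.py`'s actual code); the required keys mirror the eval case format documented in this README:

```python
# Hypothetical dry-run-style validation of an eval case dict.
# The key names come from the eval case format; the helper itself
# is illustrative, not the real runner.py implementation.
REQUIRED_CASE_KEYS = {"eval_id", "conversation", "rubrics"}


def validate_eval_case(case):
  """Raise ValueError if a required top-level key is missing."""
  missing = REQUIRED_CASE_KEYS - case.keys()
  if missing:
    raise ValueError(f"{case.get('eval_id', '?')}: missing {sorted(missing)}")
  for turn in case["conversation"]:
    # Each turn must carry the trace expectations the metrics read.
    turn["intermediate_data"]["tool_uses"]
  return True
```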

## How It Works

```
eval_sets/bigquerybench_eval.json
    ↓ (user query + expected tool_uses + rubrics)
runner.py
    ↓ runs agent via ADK Runner
    ↓ collects event trace → Invocations
metrics.py
    ├── tool_invocation_score: expected skill tool names ⊆ actual?
    ├── tool_args_score: expected (tool, skill-arg) pairs ⊆ actual?
    └── instruction_adherence_score: LLM judge checks rubrics
PASS if all three scores meet thresholds
```

## Three Metrics

| Metric | Type | What It Checks | Pass Condition |
|--------|------|----------------|----------------|
| `tool_invocation_score` | Trace | Correct skill tools called | Score = 1.0 |
| `tool_args_score` | Trace | Correct skill/resource/script targeted | Score = 1.0 |
| `instruction_adherence_score` | LLM judge | Agent followed instructions, output correct | Score >= 0.75 |

A case **passes** when all three metrics meet their thresholds.
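
The first trace metric reduces to a simple set check: the expected tool names must be a subset of the names in the actual trace. A minimal sketch, assuming call dicts shaped like the `tool_uses` entries in the eval cases (the real `metrics.py` API may differ):

```python
# Sketch of tool_invocation_score: expected skill tool names must
# all appear somewhere in the actual trace (subset check).
def tool_invocation_score(expected_calls, actual_calls):
  """Return 1.0 if every expected tool name was called, else 0.0."""
  expected = {call["name"] for call in expected_calls}
  actual = {call["name"] for call in actual_calls}
  return 1.0 if expected <= actual else 0.0


expected = [{"name": "load_skill"}, {"name": "load_skill_resource"}]
actual = [
    {"name": "load_skill", "args": {"name": "bq-sql-analyst"}},
    {"name": "load_skill_resource", "args": {"skill_name": "bq-sql-analyst"}},
]
print(tool_invocation_score(expected, actual))  # 1.0
```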

## Eval Case Format

Each eval case has two parts:
1. **`conversation`** — user query + expected skill tool calls
2. **`rubrics`** — natural-language assertions checked by the judge

```json
{
  "eval_id": "skill_load_reference",
  "conversation": [
    {
      "invocation_id": "inv-01",
      "user_content": {
        "parts": [{"text": "Load the public datasets reference from bq-sql-analyst."}],
        "role": "user"
      },
      "intermediate_data": {
        "tool_uses": [
          {"name": "load_skill", "args": {"name": "bq-sql-analyst"}},
          {"name": "load_skill_resource", "args": {
            "skill_name": "bq-sql-analyst",
            "path": "references/public-datasets.md"
          }}
        ],
        "tool_responses": [],
        "intermediate_responses": []
      },
      "creation_timestamp": 0.0
    }
  ],
  "rubrics": [
    {
      "rubric_id": "shows_datasets",
      "rubric_content": {
        "text_property": "The response contains information about BigQuery public datasets."
      }
    },
    {
      "rubric_id": "loaded_skill_first",
      "rubric_content": {
        "text_property": "The agent loaded the skill instructions before loading the resource."
      }
    }
  ],
  "creation_timestamp": 0.0
}
```

### Trace Checks (deterministic)

| Field in `args` | Checked? | Why |
|-----------------|----------|-----|
| `name` | Yes | Must load the right skill (`load_skill`) |
| `skill_name` | Yes | Must target the right skill (`load_skill_resource`, `run_skill_script`) |
| `path` | Yes | Must load the right resource (`load_skill_resource`) |
| `script_path` | Yes | Must run the right script (`run_skill_script`) |
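
These checks can be pictured as a pair comparison: every expected (tool, field, value) triple over the checked fields must show up in the actual trace. A hypothetical sketch of `tool_args_score` under that model, not the actual `metrics.py` implementation:

```python
# Sketch of tool_args_score: only the skill-targeting fields in the
# table above are compared; other args the agent passes are ignored.
CHECKED_FIELDS = {"name", "skill_name", "path", "script_path"}


def skill_arg_pairs(calls):
  """Extract (tool, field, value) triples for the checked fields."""
  pairs = set()
  for call in calls:
    for field, value in call.get("args", {}).items():
      if field in CHECKED_FIELDS:
        pairs.add((call["name"], field, value))
  return pairs


def tool_args_score(expected_calls, actual_calls):
  """Fraction of expected skill-arg pairs found in the actual trace."""
  expected = skill_arg_pairs(expected_calls)
  if not expected:
    return 1.0  # nothing to check (e.g. list_skills takes no args)
  actual = skill_arg_pairs(actual_calls)
  return len(expected & actual) / len(expected)
```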

### Rubrics (LLM-as-judge)

Each rubric is a natural-language assertion about the agent's behavior
or output. The judge LLM reads the conversation (user request + tool
trace + final response) and answers yes/no per rubric.

**Example rubrics:**
```json
{"rubric_id": "r1", "rubric_content": {"text_property": "The agent used AI.classify to classify the data."}}
{"rubric_id": "r2", "rubric_content": {"text_property": "The result contains a markdown table with group statistics."}}
{"rubric_id": "r3", "rubric_content": {"text_property": "The agent loaded the skill before running the script."}}
```
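
The judge's per-rubric yes/no verdicts fold into a single adherence score. A sketch of one plausible aggregation, the fraction of rubrics satisfied; the function name and aggregation rule are illustrative, not the actual `metrics.py` contract:

```python
# Illustrative aggregation of judge verdicts into an adherence score.
# ADHERENCE_THRESHOLD matches the pass condition in the metrics table.
ADHERENCE_THRESHOLD = 0.75


def instruction_adherence_score(verdicts):
  """verdicts: mapping of rubric_id -> bool from the judge LLM."""
  if not verdicts:
    return 1.0  # no rubrics means nothing to fail
  return sum(verdicts.values()) / len(verdicts)


verdicts = {"shows_datasets": True, "loaded_skill_first": True}
score = instruction_adherence_score(verdicts)
print(score >= ADHERENCE_THRESHOLD)  # True
```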

## Included Eval Cases

| eval_id | Expected Trace | Rubrics |
|---------|---------------|---------|
| `skill_list_skills` | `list_skills()` | Lists bq-sql-analyst; includes description |
| `skill_load_sql_analyst` | `load_skill(name=bq-sql-analyst)` | Describes capabilities; mentions scripts |
| `skill_load_reference` | `load_skill` → `load_skill_resource` | Shows datasets; loaded skill first |
| `skill_query_with_reference` | `load_skill` → `load_skill_resource` | Consulted reference; summarizes datasets; followed workflow |
| `skill_run_format_script` | `load_skill` → `run_skill_script` | Loaded before run; has table; has columns |

## Example Output

```
========================================================================
BigQueryBench — Skill Evaluation
========================================================================

[1/5] skill_list_skills
-> list_skills()
tools=1.00 args=1.00 adherence=1.00 PASS

[2/5] skill_load_sql_analyst
-> load_skill(name='bq-sql-analyst')
tools=1.00 args=1.00 adherence=1.00 PASS

[3/5] skill_load_reference
-> load_skill(name='bq-sql-analyst')
-> load_skill_resource(skill_name='bq-sql-analyst', path='references/public-datasets.md')
tools=1.00 args=1.00 adherence=1.00 PASS

[4/5] skill_query_with_reference
-> load_skill(name='bq-sql-analyst')
-> load_skill_resource(skill_name='bq-sql-analyst', path='references/public-datasets.md')
tools=1.00 args=1.00 adherence=1.00 PASS

[5/5] skill_run_format_script
-> load_skill(name='bq-sql-analyst')
-> run_skill_script(skill_name='bq-sql-analyst', script_path='scripts/format_results.py')
tools=1.00 args=1.00 adherence=1.00 PASS

Eval Case                     Tools   Args   Adhere   Result
------------------------------------------------------------------------
skill_list_skills              1.00   1.00    1.00    PASS
skill_load_sql_analyst         1.00   1.00    1.00    PASS
skill_load_reference           1.00   1.00    1.00    PASS
skill_query_with_reference     1.00   1.00    1.00    PASS
skill_run_format_script        1.00   1.00    1.00    PASS
------------------------------------------------------------------------

========================================================================
Summary
========================================================================
Cases: 5/5 (100.0%)
Avg Tool Match: 1.00
Avg Args Match: 1.00
Avg Adherence: 1.00
Elapsed: 488.8s
========================================================================
```

## Adding a New Eval Case

### For an existing skill

Add a JSON object to `bigquerybench_eval.json` with both `tool_uses`
(trace expectations) and `rubrics` (instruction adherence assertions).

```json
{
  "eval_id": "skill_explore_usa_names",
  "conversation": [
    {
      "invocation_id": "inv-new-01",
      "user_content": {
        "parts": [{"text": "Use the bq-sql-analyst skill to explore the usa_names dataset."}],
        "role": "user"
      },
      "intermediate_data": {
        "tool_uses": [
          {"name": "load_skill", "args": {"name": "bq-sql-analyst"}},
          {"name": "load_skill_resource", "args": {"skill_name": "bq-sql-analyst", "path": "references/public-datasets.md"}}
        ],
        "tool_responses": [],
        "intermediate_responses": []
      },
      "creation_timestamp": 0.0
    }
  ],
  "rubrics": [
    {
      "rubric_id": "consulted_ref",
      "rubric_content": {"text_property": "The agent consulted the public datasets reference."}
    },
    {
      "rubric_id": "mentions_usa_names",
      "rubric_content": {"text_property": "The response mentions the usa_names dataset and its columns."}
    }
  ],
  "creation_timestamp": 0.0
}
```

### For a new skill

1. Create a skill directory under `skills/`:
```
skills/my-new-skill/
├── SKILL.md
├── references/
└── scripts/
```

2. Register it in `agent.py`:
```python
_SKILL_NAMES = [
    "bq-sql-analyst",
    "my-new-skill",  # ← add here
]
```

3. Add eval cases with trace expectations + rubrics.

### Writing Good Rubrics

**Do:**
- Assert observable behavior: "The agent loaded the skill before running the script."
- Assert output properties: "The response contains a table with columns name and count."
- Assert domain correctness: "The result includes the top 3 Shakespeare works."

**Don't:**
- Assert exact wording: "The response starts with 'Here are the results'."
- Assert implementation details: "The agent called execute_sql with SELECT DISTINCT."
- Use vague assertions: "The response is good."

### When Do You Need Code Changes?

| Scenario | JSON | `metrics.py` | `runner.py` | `agent.py` |
|----------|:----:|:------------:|:-----------:|:----------:|
| New eval case, existing skill | Yes | - | - | - |
| New skill added to `skills/` | Yes | - | - | Yes (add to `_SKILL_NAMES`) |
| Change judge model or threshold | - | Yes | - | - |
| Need entirely new metric | Yes | Yes | Yes | - |
| Agent instruction change | - | - | - | Yes |

## Architecture

```
benchmarks/bigquerybench/
├── __init__.py
├── agent.py # LlmAgent + BigQueryToolset + SkillToolset
├── runner.py # Runs agent, scores trace + rubrics
├── metrics.py # 3 metrics: trace (x2) + LLM-as-judge (x1)
├── eval_sets/
│ └── bigquerybench_eval.json # 5 eval cases with rubrics
└── skills/
└── bq-sql-analyst/
├── SKILL.md
├── references/
│ └── public-datasets.md
└── scripts/
└── format_results.py

tests/unittests/benchmarks/bigquerybench/
└── test_metrics.py # 14 tests (trace + LLM judge + JSON validation)
```

## Retry Backoff

Both the agent model and the LLM judge use exponential backoff with
retry on 429 (rate limit) errors:

- **Agent model**: 5 attempts, 2s initial delay, 2x exponential
backoff (via `HttpRetryOptions`)
- **LLM judge**: 5 attempts, 2s → 4s → 8s → 16s manual backoff,
plus 3 HTTP-level retries per attempt
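
The judge-side manual backoff can be sketched as follows; `call_judge` and `RateLimitError` are stand-ins for the real judge call and the 429 error type, and the retry parameters match the numbers above:

```python
# Sketch of the judge's manual exponential backoff:
# 5 attempts, delays of 2s -> 4s -> 8s -> 16s between them.
import time


class RateLimitError(Exception):
  """Stand-in for a 429 RESOURCE_EXHAUSTED error from the API."""


def call_with_backoff(call_judge, attempts=5, initial_delay=2.0):
  """Retry call_judge on rate-limit errors with doubling delays."""
  for attempt in range(attempts):
    try:
      return call_judge()
    except RateLimitError:
      if attempt == attempts - 1:
        raise  # out of attempts: surface the error
      time.sleep(initial_delay * (2 ** attempt))
```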

## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `GOOGLE_CLOUD_API_KEY` | Yes | Vertex AI API key for agent model + judge |
| `GOOGLE_CLOUD_PROJECT` | No | GCP project for BigQuery API (enables BigQuery toolset) |
| `BQ_EVAL_WRITE_MODE` | No | `blocked` (default) / `protected` / `allowed` |

**Two modes:**
- **Skill-only** (default): Set `GOOGLE_CLOUD_API_KEY` only.
BigQuery toolset is skipped; all 5 skill eval cases run.
- **Full mode**: Set both `GOOGLE_CLOUD_API_KEY` and
`GOOGLE_CLOUD_PROJECT` (+ ADC configured). BigQuery toolset
is enabled alongside skills.
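
The mode selection above can be sketched as a small helper; only the variable names come from the table, and the selection logic itself is illustrative, not `agent.py`'s actual code:

```python
# Illustrative mode selection from an environment mapping.
def select_mode(env):
  """Return 'skill-only' or 'full' given an environment mapping."""
  if not env.get("GOOGLE_CLOUD_API_KEY"):
    raise ValueError("GOOGLE_CLOUD_API_KEY is required in both modes")
  # GOOGLE_CLOUD_PROJECT (plus ADC) is what unlocks the BigQuery toolset.
  return "full" if env.get("GOOGLE_CLOUD_PROJECT") else "skill-only"
```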

## Troubleshooting

| Symptom | Fix |
|---------|-----|
| `tool_invocation_score = 0` | Agent didn't call expected skill tool — check agent instructions |
| `tool_args_score < 1.0` | Agent targeted wrong skill or resource — check user query specificity |
| `adherence < 0.75` | Agent produced wrong output — review rubrics and skill instructions |
| 429 RESOURCE_EXHAUSTED | Rate limit — retry backoff handles this automatically; wait and retry |
| Skill not found | Verify skill dir exists in `skills/` and name is in `_SKILL_NAMES` in `agent.py` |
| Judge LLM fails | Check `GOOGLE_CLOUD_API_KEY` is set correctly |
| `load_skill_resource` fails | Check the `path` arg matches a real file under the skill dir |
13 changes: 13 additions & 0 deletions benchmarks/bigquerybench/__init__.py
@@ -0,0 +1,13 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.