Phase 0 + 0b spikes: OpenAI Agents SDK vs PydanticAI by pofallon · Pull Request #97 · get2knowio/maverick

pofallon · 2026-05-15T15:20:27Z

Summary

Three commits across two spikes investigating the BURR.md migration
decision gate. The original Phase 0 (PR title) closed with "proceed
with caveats"; a sharp question from the reviewer surfaced a deeper
concern (MCP-era tool-use pain) and triggered Phase 0b against
PydanticAI as an alternative substrate.

Findings (both spikes together)

| Probe | OpenAI Agents SDK + LiteLLM | PydanticAI |
| --- | --- | --- |
| Copilot OAuth | PASS via LiteLLM device flow | PASS (reuses Phase 0 creds) |
| codex + tools + typed | PASS (75.8s, V2); needs `strict_json_schema=False` and Copilot IDE headers | PASS (27.7s); clean, no workarounds |
| claude + tools + typed | FAIL: model returns markdown-fenced JSON; with `StopAtTools` the model doesn't call tools at all | FAIL: Copilot Chat Completions response can't parse against the `ChatCompletion` schema (missing `index`) |
| prompt cache | PASS: 97% hit, only `cached_tokens` exposed | PASS: 79% hit, both `cache_read_tokens` and `cache_write_tokens` exposed |
| cost telemetry | partial: tokens yes, `cost_usd` / `cache_write` no | partial: richer surface, `cost_usd` still no |
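Since neither substrate exposes `cost_usd`, the cost would have to be derived from the token counts that are reachable. A minimal sketch of that derivation, assuming a caller-owned price table: the `Usage` shape, the `PRICE_TABLE` entries, and the prices are all hypothetical placeholders, and the hit ratio is assumed here to be cached tokens over input tokens, which may not match how the spikes measured it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Hypothetical token-usage record; field names are illustrative."""
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0  # portion of input_tokens served from cache


# Placeholder USD prices per million tokens. These are NOT real prices.
PRICE_TABLE = {
    "example-model": {"input": 1.00, "cached_input": 0.10, "output": 4.00},
}


def cache_hit_ratio(usage: Usage) -> float:
    """Fraction of input tokens that were served from the prompt cache."""
    if usage.input_tokens == 0:
        return 0.0
    return usage.cached_tokens / usage.input_tokens


def cost_usd(model: str, usage: Usage, prices: dict = PRICE_TABLE) -> float:
    """Price uncached input, cached input, and output tokens separately."""
    rates = prices[model]  # raises KeyError for models missing from the table
    uncached = usage.input_tokens - usage.cached_tokens
    total = (
        uncached * rates["input"]
        + usage.cached_tokens * rates["cached_input"]
        + usage.output_tokens * rates["output"]
    )
    return total / 1_000_000
```

A table like this has to be maintained by hand as prices change, which is the trade-off of computing cost outside the SDK.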

Crucial finding: PydanticAI implements typed output as a hidden
`final_result` function-tool call by default, which is the
architecturally clean pattern. No `strict_json_schema` gotcha; the
pattern I argued for in code review is what PydanticAI ships.
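To make the pattern concrete, here is a stdlib-only sketch of "typed output as a final function-tool call". It is not PydanticAI's implementation: the `ImplementationPayload` type, the tool-call dict shape, and the helper name are assumptions made for illustration. The idea is that the model finishes by calling a reserved tool whose arguments are the structured result, so the output is validated as tool arguments rather than parsed out of free text.

```python
import json
from dataclasses import dataclass, fields

FINAL_TOOL = "final_result"  # reserved tool name, as described in the findings


@dataclass
class ImplementationPayload:
    """Hypothetical output type used only for this illustration."""
    summary: str
    files_changed: list


def extract_typed_output(tool_calls, output_type):
    """Return an output_type instance from the final-result tool call.

    tool_calls is assumed to be a list of {"name": str, "arguments": str}
    dicts, with arguments holding a JSON object. Returns None when the
    model has not called the final-result tool yet.
    """
    expected = {f.name for f in fields(output_type)}
    for call in tool_calls:
        if call["name"] != FINAL_TOOL:
            continue  # ordinary tool call, handled elsewhere
        args = json.loads(call["arguments"])
        if set(args) != expected:
            raise ValueError(f"expected fields {sorted(expected)}, got {sorted(args)}")
        return output_type(**args)
    return None
```

A run loop built this way would keep executing ordinary tools until `extract_typed_output` returns a value. It also shows why a model that never calls tools, as Claude on Copilot did in the `StopAtTools` probe, can never produce a result under this pattern.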

Equally crucial: Claude on Copilot fails on BOTH substrates, for
different reasons. The root cause is upstream of either library —
Copilot's Chat Completions endpoint doesn't behave like OpenAI's for
the Claude family.

Files

- scripts/spike-openai-agents-sdk-litellm.py + results: Phase 0.
- scripts/spike-v3-redo.py + results: function-tool / StopAtTools
  follow-up. Validates that the pattern works for codex (cleaner than
  output_type=) but doesn't fix Claude on Copilot.
- scripts/spike-pydantic-ai.py + results: Phase 0b.
- docs/migration-phase-0-report.md: original Phase 0 report.
- docs/migration-phase-0b-report.md: Phase 0b report with three
  candidate paths (A: stay on OpenAI Agents SDK; B: switch to
  PydanticAI, requires Anthropic key for full validation; C: defer
  the migration).

What we still don't know

- PydanticAI + native Anthropic Messages API + Claude: the
  load-bearing question. Not testable in this dev env without an
  Anthropic API key.
- Whether Copilot's Claude path is salvageable at all through any
  in-process Python SDK, or only via OpenCode's adapter.

Decision pending

Awaiting a call on Path A / B / C. Phase 1 is on hold.

🤖 Generated with Claude Code

Adds the throwaway spike for the BURR.md migration decision gate and a one-page report against the spec's decision-gate matrix.

Five validations against `sample-maverick-project-37n.3`:

- V1 OAuth: PASS. LiteLLM device flow works; OpenCode auth.json cannot bootstrap (stale ghu_ token).
- V2 codex+tools: PASS. SubmitImplementationPayload first try; implementer shipped renderer.py + tests.
- V3 claude+tools: FAIL. Claude on Copilot Chat Completions returns text / markdown-fenced JSON; SDK has no envelope unwrap layer.
- V4 prompt cache: PASS. 97% of seeded prefix served from cache on run 2.
- V5 cost telemetry: PARTIAL. Tokens reachable; cost_usd / cache_write not exposed via openai-agents Usage.

Outcome: row 4 of the decision matrix (pass | pass | partial | * | *) → Proceed to Phase 1.

Phase 1 scope additions captured in the report:

- MaverickCascadingModel injects Copilot IDE-auth headers (openai-agents SDK's User-Agent override breaks Claude path)
- MaverickCascadingModel ships an envelope-unwrap layer (mirrors OpenCode `_unwrap_envelope`) to strip markdown fences
- Agent.__init__ wraps output_type with strict_json_schema=False
- DEFAULT_TIERS routes claude-favoured roles to openai/* first
- cost_usd computed against a Maverick-owned price table

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
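The envelope-unwrap layer listed in the Phase 1 scope could look roughly like the sketch below. This is an assumption-laden illustration, not OpenCode's `_unwrap_envelope` and not code from the spike: the function name `unwrap_envelope` and the exact fence handling are hypothetical. It only handles the case where the whole reply is a single fenced block or plain JSON.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this block readable

# Matches a reply that consists of exactly one fenced block, with an
# optional language tag (for example "json") after the opening fence.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def unwrap_envelope(text: str):
    """Parse model output as JSON, stripping one surrounding markdown fence.

    Raises json.JSONDecodeError (a ValueError) if the remaining text is
    not valid JSON, for example when the model answered in prose.
    """
    match = _FENCED.match(text)
    body = match.group(1) if match else text
    return json.loads(body)
```

A layer like this would address the markdown-fenced JSON failure from V3, but not the case where the model replies in prose and never produces a payload.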

Two follow-up spikes to the original Phase 0:

1. scripts/spike-v3-redo.py: tests the function-tool + StopAtTools output pattern against the OpenAI Agents SDK. Confirms it works cleanly for codex (29s / 5 raw_responses / 22 RunItems, tighter than V2's output_type= pattern at 75.8s / 14 / 37), but claude on Copilot Chat Completions doesn't reliably invoke any tool at all. Asked "What is 17 + 25?" with an explicit add tool, claude returns "Sure! Let me calculate that for you!" and ends the turn, which is exactly the MCP-era failure mode that motivated this whole migration.

2. scripts/spike-pydantic-ai.py: Phase 0b proper. Validates PydanticAI on the same surface that broke Phase 0:

   - V1 codex auth: PASS (1.9s)
   - V2 codex + tools + typed: PASS (27.7s, 2.7x faster than 0a)
   - V3 claude via Copilot: FAIL, but for a different reason
   - V4 prompt cache: PASS (79% hit)
   - V5 cost telemetry: richer than 0a (cache_write exposed)

Key finding: PydanticAI implements typed output as a hidden ``final_result`` function-tool call, the architecturally clean pattern by default. No strict_json_schema gotcha. But Claude on Copilot is unreachable through PydanticAI's strict response validation (Copilot's ChatCompletion response is missing required fields).

Net: neither substrate makes Claude-on-Copilot work. The root cause is upstream of both libraries: Copilot's Chat Completions endpoint doesn't behave like OpenAI's for the Claude family.

Phase 0b report lays out three paths:

A. Continue with OpenAI Agents SDK + LiteLLM, eat the workarounds
B. Switch to PydanticAI, validate against Anthropic-direct first (needs API key)
C. Defer the migration; benefits are smaller than the spec assumed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Paul O'Fallon and others added 2 commits May 15, 2026 15:19

pofallon changed the title ~~Phase 0 spike: OpenAI Agents SDK + LiteLLM end-to-end probe~~ Phase 0 + 0b spikes: OpenAI Agents SDK vs PydanticAI May 15, 2026

pofallon wants to merge 2 commits into main from 047-phase-0-spike
