Phase 0 + 0b spikes: OpenAI Agents SDK vs PydanticAI #97

Open
pofallon wants to merge 2 commits into main from 047-phase-0-spike

@pofallon pofallon commented May 15, 2026

Summary

Two commits across two spikes investigating the BURR.md migration
decision gate. The original Phase 0 (this PR's original title) closed
with "proceed with caveats"; a reviewer's question then surfaced a
deeper concern (MCP-era tool-use pain) and triggered Phase 0b against
PydanticAI as an alternative substrate.

Findings (both spikes together)

| Probe | OpenAI Agents SDK + LiteLLM | PydanticAI |
| --- | --- | --- |
| Copilot OAuth | PASS via LiteLLM device flow | PASS (reuses Phase 0 creds) |
| codex + tools + typed | PASS (75.8s, V2); needs strict_json_schema=False and Copilot IDE headers | PASS (27.7s); clean, no workarounds |
| claude + tools + typed | FAIL: model returns markdown-fenced JSON; with StopAtTools the model doesn't call tools at all | FAIL: Copilot Chat Completions response can't parse against ChatCompletion schema (missing index) |
| prompt cache | PASS: 97% hit, only cached_tokens exposed | PASS: 79% hit, both cache_read_tokens and cache_write_tokens exposed |
| cost telemetry | partial: tokens yes, cost_usd / cache_write no | partial: richer surface, cost_usd still no |

Crucial finding: PydanticAI implements typed output as a hidden
final_result function-tool call by default — the architecturally-
clean pattern. No strict_json_schema gotcha; the pattern I argued
for in code review is what PydanticAI ships.
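
To make the "typed output as a function-tool call" pattern concrete, here is a minimal stdlib-only sketch. It is not PydanticAI's implementation: the `Answer` type, the `run` loop, and the scripted turns are all hypothetical stand-ins for a real model. The only idea taken from the finding above is that the run ends when the model calls a reserved `final_result` tool whose arguments are validated against the output type.

```python
from dataclasses import dataclass, fields


@dataclass
class Answer:
    """Hypothetical typed output; not a type from the spike."""
    total: int
    explanation: str


def add(a: int, b: int) -> int:
    """An ordinary tool the model may call before finishing."""
    return a + b


TOOLS = {"add": add}


def validate(payload: dict) -> Answer:
    """Check the final_result arguments against Answer's fields and types."""
    expected = {f.name: f.type for f in fields(Answer)}
    if set(payload) != set(expected):
        raise ValueError(f"expected fields {sorted(expected)}, got {sorted(payload)}")
    for name, typ in expected.items():
        if not isinstance(payload[name], typ):
            raise TypeError(f"{name} must be {typ.__name__}")
    return Answer(**payload)


def run(turns):
    """Dispatch scripted tool calls until the 'model' calls final_result."""
    observations = []
    for call in turns:
        name, args = call["tool"], call["args"]
        if name == "final_result":
            # The typed output arrives as tool arguments, not as free text.
            return validate(args), observations
        observations.append(TOOLS[name](**args))
    raise RuntimeError("model ended the run without calling final_result")


# Scripted turns standing in for model responses (assumption for illustration).
scripted = [
    {"tool": "add", "args": {"a": 17, "b": 25}},
    {"tool": "final_result", "args": {"total": 42, "explanation": "17 + 25"}},
]
answer, observations = run(scripted)
print(answer)
```

Because the output travels through the same tool-call channel as every other tool, there is no separate JSON-in-text parsing step to go wrong. The flip side is that a model that never emits a tool call never produces a result, which is the `RuntimeError` branch above.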

Equally crucial: Claude on Copilot fails on BOTH substrates, for
different reasons. The root cause is upstream of either library —
Copilot's Chat Completions endpoint doesn't behave like OpenAI's for
the Claude family.

Files

  • scripts/spike-openai-agents-sdk-litellm.py + results — Phase 0.
  • scripts/spike-v3-redo.py + results — function-tool / StopAtTools
    follow-up. Validates that the pattern works for codex (cleaner than
    output_type=) but doesn't fix Claude on Copilot.
  • scripts/spike-pydantic-ai.py + results — Phase 0b.
  • docs/migration-phase-0-report.md — original Phase 0 report.
  • docs/migration-phase-0b-report.md — Phase 0b report with three
    candidate paths (A: stay on OpenAI Agents SDK; B: switch to
    PydanticAI, requires Anthropic key for full validation; C: defer
    the migration).

What we still don't know

  • PydanticAI + native Anthropic Messages API + Claude — the
    load-bearing question. Not testable in this dev env without an
    Anthropic API key.
  • Whether Copilot's Claude path is salvageable at all through any
    in-process Python SDK, or only via OpenCode's adapter.

Decision pending

Awaiting a call on Path A / B / C. Phase 1 is on hold.

🤖 Generated with Claude Code

Paul O'Fallon and others added 2 commits May 15, 2026 15:19

Adds the throwaway spike for the BURR.md migration decision gate and a
one-page report against the spec's decision-gate matrix.

Five validations against `sample-maverick-project-37n.3`:

  V1 OAuth          PASS    LiteLLM device flow works; OpenCode auth.json
                            cannot bootstrap (stale ghu_ token)
  V2 codex+tools    PASS    SubmitImplementationPayload first try;
                            implementer shipped renderer.py + tests
  V3 claude+tools   FAIL    Claude on Copilot Chat Completions returns
                            text / markdown-fenced JSON; SDK has no
                            envelope unwrap layer
  V4 prompt cache   PASS    97% of seeded prefix served from cache run 2
  V5 cost telemetry PARTIAL tokens reachable; cost_usd / cache_write
                            not exposed via openai-agents Usage
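
Since V5 reports token counts but no cost_usd, a cost figure would have to be derived from those counts. A rough sketch of that arithmetic follows. The model key, every rate in `PRICE_TABLE`, and the token counts in the example call are made-up placeholders, and it assumes cached tokens are reported as a subset of prompt tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million tokens. Placeholder values only."""
    input: float
    cached_input: float
    output: float


# Hypothetical table: the model key and all rates are illustrative, not real prices.
PRICE_TABLE = {"example-model": Price(input=1.00, cached_input=0.10, output=4.00)}


def cache_hit_rate(prompt_tokens: int, cached_tokens: int) -> float:
    """Fraction of prompt tokens served from the prompt cache."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


def cost_usd(model: str, prompt_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Price uncached input, cached input, and output tokens separately."""
    price = PRICE_TABLE[model]
    uncached = prompt_tokens - cached_tokens  # assumes cached is a subset of prompt
    total = (
        uncached * price.input
        + cached_tokens * price.cached_input
        + output_tokens * price.output
    )
    return total / 1_000_000


# Illustrative token counts, not measurements from the spike.
rate = cache_hit_rate(prompt_tokens=10_000, cached_tokens=9_700)
cost = cost_usd("example-model", prompt_tokens=10_000, cached_tokens=9_700, output_tokens=500)
print(f"hit rate {rate:.0%}, cost ${cost:.6f}")
```

Without cache_write counts, any cache-write surcharge cannot be priced this way and would be missing from the total.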

Outcome: row 4 of the decision matrix —
  pass | pass | partial | * | * → Proceed to Phase 1

Phase 1 scope additions captured in the report:
  - MaverickCascadingModel injects Copilot IDE-auth headers
    (openai-agents SDK's User-Agent override breaks Claude path)
  - MaverickCascadingModel ships an envelope-unwrap layer (mirrors
    OpenCode `_unwrap_envelope`) to strip markdown fences
  - Agent.__init__ wraps output_type with strict_json_schema=False
  - DEFAULT_TIERS routes claude-favoured roles to openai/* first
  - cost_usd computed against a Maverick-owned price table

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two follow-up spikes to the original Phase 0:

1. scripts/spike-v3-redo.py — tests the function-tool + StopAtTools
   output pattern against the OpenAI Agents SDK. Confirms it works
   cleanly for codex (29s / 5 raw_responses / 22 RunItems — tighter
   than V2's output_type= pattern at 75.8s / 14 / 37) but claude on
   Copilot Chat Completions doesn't reliably invoke any tool at all.
   Asked "What is 17 + 25?" with an explicit add tool, claude returns
   "Sure! Let me calculate that for you!" and ends the turn — exactly
   the MCP-era failure mode that motivated this whole migration.

2. scripts/spike-pydantic-ai.py — Phase 0b proper. Validates PydanticAI
   on the same surface that broke Phase 0:
     V1 codex auth                PASS (1.9s)
     V2 codex + tools + typed     PASS (27.7s — 2.7x faster than 0a)
     V3 claude via Copilot        FAIL — but for a different reason
     V4 prompt cache              PASS (79% hit)
     V5 cost telemetry            richer than 0a (cache_write exposed)

Key finding: PydanticAI implements typed output as a hidden
``final_result`` function-tool call — the architecturally-clean pattern
by default. No strict_json_schema gotcha. But Claude on Copilot is
unreachable through PydanticAI's strict response validation (Copilot's
ChatCompletion response is missing required fields).

Net: neither substrate makes Claude-on-Copilot work. The root cause is
upstream of both libraries — Copilot's Chat Completions endpoint
doesn't behave like OpenAI's for the Claude family.

Phase 0b report lays out three paths:
  A. Continue with OpenAI Agents SDK + LiteLLM, eat the workarounds
  B. Switch to PydanticAI, validate against Anthropic-direct first
     (needs API key)
  C. Defer the migration; benefits are smaller than the spec assumed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pofallon pofallon changed the title from "Phase 0 spike: OpenAI Agents SDK + LiteLLM end-to-end probe" to "Phase 0 + 0b spikes: OpenAI Agents SDK vs PydanticAI" May 15, 2026