feat(responses): image_generation server-tool shim by Menci · Pull Request #24 · Menci/Floway

Menci · 2026-05-29T08:15:14Z

Route the Responses hosted image_generation tool through the gateway's
image-capable upstream (gpt-image-*) instead of forwarding it to a
Responses upstream, gated by the new responses-image-generation-shim
flag. When a Responses request is routed to a non-Responses backend the
shim always runs, since those targets cannot carry the hosted tool.
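
The gating rule above can be read as a small predicate. This is a minimal sketch, not the gateway's actual code: `shouldRunImageShim`, `RoutingContext`, and the `UpstreamKind` values are hypothetical names; only the flag string comes from this PR.

```typescript
// Hypothetical sketch of the gating rule; only the flag name is from the PR.
type UpstreamKind = "responses" | "chat-completions" | "messages";

interface RoutingContext {
  upstreamKind: UpstreamKind;         // protocol spoken by the chosen backend
  enabledFlags: ReadonlySet<string>;  // provider flags active for this request
  hasHostedImageTool: boolean;        // request carries an image_generation tool
}

const SHIM_FLAG = "responses-image-generation-shim";

function shouldRunImageShim(ctx: RoutingContext): boolean {
  // Nothing to shim when the request has no hosted image_generation tool.
  if (!ctx.hasHostedImageTool) return false;
  // Non-Responses targets cannot carry the hosted tool, so the shim always runs.
  if (ctx.upstreamKind !== "responses") return true;
  // Responses targets use the shim only when the flag opts in.
  return ctx.enabledFlags.has(SHIM_FLAG);
}
```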

The hosted image_generation tool is replaced by a generated function
tool the orchestrator model calls with a prompt; the shim drives the
standalone /images/{generations,edits} backend and synthesizes the
native image_generation_call lifecycle. Edit vs. generate is resolved
from request input images and action; an inline input_image_mask is
forwarded to /images/edits. Tool-config validation mirrors the public
Azure-strict surface, reproducing its unknown_parameter / invalid_value
and integer-range rejection codes.
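
To illustrate the validation step, here is a rough sketch of a per-entry check. The `unknown_parameter` and `invalid_value` codes are named in this PR; everything else is an assumption for illustration: the helper name, the allowed-field list (assembled only from fields mentioned in this text, so probably incomplete), the size and format sets, the numeric ranges, and the `integer_below_min_value` / `integer_above_max_value` / `invalid_type` code strings.

```typescript
type Rejection = { code: string; param: string };

// ASSUMPTION: built from fields this PR mentions; the real allow-list may differ.
const ALLOWED_FIELDS = new Set<string>([
  "type", "model", "action", "size", "quality", "background", "output_format",
  "output_compression", "partial_images", "input_fidelity", "input_image_mask",
]);
// ASSUMPTION: illustrative value sets; the PR only says webp and arbitrary sizes fail.
const ALLOWED_FORMATS = new Set<string>(["png", "jpeg"]);
const ALLOWED_SIZES = new Set<string>(["auto", "1024x1024", "1024x1536", "1536x1024"]);
// ASSUMPTION: illustrative inclusive ranges.
const INT_RANGES: Record<string, readonly [number, number]> = {
  output_compression: [0, 100],
  partial_images: [0, 3],
};

function validateImageTool(entry: Record<string, unknown>): Rejection | null {
  // Unknown fields, including a client-supplied `n`, are rejected first.
  for (const key of Object.keys(entry)) {
    if (!ALLOWED_FIELDS.has(key)) return { code: "unknown_parameter", param: key };
  }
  const fmt = entry.output_format;
  if (fmt !== undefined && !ALLOWED_FORMATS.has(fmt as string)) {
    return { code: "invalid_value", param: "output_format" };
  }
  const size = entry.size;
  if (size !== undefined && !ALLOWED_SIZES.has(size as string)) {
    return { code: "invalid_value", param: "size" };
  }
  for (const [param, [min, max]] of Object.entries(INT_RANGES)) {
    const value = entry[param];
    if (value === undefined) continue;
    if (typeof value !== "number" || !Number.isInteger(value)) {
      return { code: "invalid_type", param }; // hypothetical code name
    }
    if (value < min) return { code: "integer_below_min_value", param };
    if (value > max) return { code: "integer_above_max_value", param };
  }
  return null;
}

// Every hosted entry is checked, not just the last; the first rejection wins.
function validateAll(entries: Record<string, unknown>[]): Rejection | null {
  for (const entry of entries) {
    const rejection = validateImageTool(entry);
    if (rejection) return rejection;
  }
  return null;
}
```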

Direct probing of Azure settled two design points. partial_images is a
genuine streaming-only feature: with stream:true the backend emits N
distinct progressively rendered frames before the final image, which a
single non-streaming response cannot reconstruct and no surveyed gateway
fabricates. So partial_images > 0 now drives the backend with stream:true
and relays each real image_generation.partial_image frame as a native
response.image_generation_call.partial_image; partial_images = 0 takes a
single non-streaming call and emits no preview frames. Carrying frames
that arrive over the life of the call required the server-tool slot's
deferred portion to become an async generator that yields intermediate
events and returns the terminal item; web search keeps a buffered
constructor that yields nothing.
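
The change to the deferred portion can be sketched as an async generator that yields lifecycle events and returns the terminal item. The function and type names (`runImageTool`, `runBuffered`, `drain`) and the payload field names are illustrative assumptions; only the event type strings are quoted from this PR.

```typescript
// ASSUMPTION: simplified payload shapes; real events carry more fields.
type BackendEvent =
  | { type: "image_generation.partial_image"; partial_image_index: number; b64_json: string }
  | { type: "image_generation.completed"; b64_json: string };

type LifecycleEvent = {
  type: "response.image_generation_call.partial_image";
  partial_image_index: number;
  partial_image_b64: string;
};

type TerminalItem = { type: "image_generation_call"; status: "completed"; result: string };

// Relays each real backend frame as it arrives, then returns the final item.
async function* runImageTool(
  backend: AsyncIterable<BackendEvent>,
): AsyncGenerator<LifecycleEvent, TerminalItem, void> {
  for await (const ev of backend) {
    if (ev.type === "image_generation.partial_image") {
      yield {
        type: "response.image_generation_call.partial_image",
        partial_image_index: ev.partial_image_index,
        partial_image_b64: ev.b64_json,
      };
    } else {
      return { type: "image_generation_call", status: "completed", result: ev.b64_json };
    }
  }
  throw new Error("backend stream ended without image_generation.completed");
}

// A buffered tool such as web search yields nothing and only returns.
async function* runBuffered<T>(result: Promise<T>): AsyncGenerator<never, T, void> {
  return await result;
}

// Drives a slot manually so the generator's return value is not lost
// (a plain for-await loop would discard it).
async function drain<E, R>(
  gen: AsyncGenerator<E, R, void>,
  emit: (event: E) => void,
): Promise<R> {
  while (true) {
    const step = await gen.next();
    if (step.done) return step.value;
    emit(step.value);
  }
}
```

Returning the terminal item from the generator, rather than yielding it, keeps one uniform slot shape for both streaming and buffered tools.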

The completed item and partial frames also echo the server-RESOLVED
config (background / output_format / quality / size) read straight off
the backend payload rather than inferred from the request, so values the
request left as auto/unset surface what was actually rendered. Probing
confirmed both the standalone response body and its streaming events
carry these resolved fields, and that the native hosted tool's completed
item exposes the same field set the shim synthesizes (the native tool
additionally requires the x-ms-oai-image-generation-deployment header,
which is why the shim drives the standalone endpoint instead).
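
As a rough illustration of reading the echo off the backend payload instead of the request, the sketch below copies only the four resolved fields named above. `pickResolvedEcho` and its types are hypothetical names, and treating every field as a string is an assumption.

```typescript
// The four resolved-config fields named in this PR.
const ECHO_FIELDS = ["background", "output_format", "quality", "size"] as const;
type EchoField = (typeof ECHO_FIELDS)[number];
type ResolvedEcho = Partial<Record<EchoField, string>>;

// Copies resolved values from a backend response body or streaming event.
// The request config is deliberately not consulted, so "auto" or unset
// request values never leak into the echoed item.
function pickResolvedEcho(payload: Record<string, unknown>): ResolvedEcho {
  const echo: ResolvedEcho = {};
  for (const field of ECHO_FIELDS) {
    const value = payload[field];
    if (typeof value === "string") echo[field] = value;
  }
  return echo;
}
```

Each partial frame would call this on its own event, and the completed item on the final payload.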

Verified with typecheck, lint, and the full test suite. A hosted
image_generation request through a real gpt-image-* upstream still needs
manual end-to-end confirmation in a client.

Serve the Responses `image_generation` hosted tool on upstreams that don't support it natively, on top of the generic server-tool framework. Gated by the new `responses-image-generation-shim` flag; mirrors the established hosted-tool shim shape.

Design (from two parallel research sessions + the .research spec): the native `image_generation` tool exposes the orchestrator a single `prompt` parameter; size/quality/background/etc. are read from the client's tools[] config and layered onto the backend call by the runtime, not chosen by the model. The shim does the same: it rewrites the hosted tool into a prompt-only function tool (~50 input tokens vs the native ~2300), resolves an image-capable binding via resolveModelForRequest, and drives the standalone /images/{generations,edits} backend.

Behavior:

- Prompt-only function tool; collision-safe naming and tool_choice rewrite handled by the framework.
- Per-entry, Azure-strict validation before rewrite: reject unknown fields (incl. client-supplied `n`) as unknown_parameter; reject output_format:webp and arbitrary size as invalid_value; integer-range codes for output_compression / partial_images; reject a present-but-invalid model; every hosted entry is validated, not just the last.
- Edit vs generate resolved from input image presence (input_image blocks + full-echo image_generation_call results); all sources attached (order irrelevant: the model picks the target by prompt); input_fidelity and an inline input_image_mask forwarded to /images/edits; file_id masks rejected.
- Synthesized lifecycle: in_progress -> generating -> one honest partial_image -> completed -> output_item.done, with the Azure-superset fields echoed on both the partial event and the item.
- Backend failures become {ok:false,error:{type,code,message,retryable}} tool outputs replayed to the orchestrator (never a downstream response.failed), preserving the upstream error type/code so the model can tell a transient overload from a terminal policy block.
- Output items round-trip through transformItems as function_call + function_call_output pairs so non-Responses upstreams stay coherent; a successful call additionally feeds the generated image back as an input_image message, matching the native flow where the model can see (and iteratively edit) what it produced.
- Per-response cap on image backend dispatches bounds cost when a model loops on retries.
- Best-effort image-backend usage recording on the resolved image model key.

Extends the shared shim's invalid-request envelope with a custom error code, the protocol's image item/event types with the resolved-config echo fields and an error type, and documents at both replay seams what a future stateful response store would and would not need to persist.
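
The failure envelope and the per-response dispatch cap from the behavior list could look roughly like the sketch below. The envelope shape `{ok:false,error:{type,code,message,retryable}}` is from this PR; the helper names, the fallback type and message, and the rule that 429 and 5xx count as retryable are assumptions made for illustration.

```typescript
interface ToolFailure {
  ok: false;
  error: { type: string; code: string | null; message: string; retryable: boolean };
}

interface UpstreamError { type?: string; code?: string | null; message?: string }

// Wraps a backend failure as a tool output for the orchestrator, preserving
// the upstream type/code so the model can distinguish overload from policy.
function toToolFailure(status: number, upstream: UpstreamError): ToolFailure {
  return {
    ok: false,
    error: {
      type: upstream.type ?? "server_error",            // ASSUMPTION: fallback type
      code: upstream.code ?? null,
      message: upstream.message ?? "image backend request failed", // ASSUMPTION
      retryable: status === 429 || status >= 500,       // ASSUMPTION: retry heuristic
    },
  };
}

// Bounds the number of image backend dispatches within one response.
class DispatchBudget {
  private used = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns true and consumes one slot, or false once the cap is reached.
  tryAcquire(): boolean {
    if (this.used >= this.limit) return false;
    this.used += 1;
    return true;
  }
}
```

When `tryAcquire()` returns false, the shim would presumably answer the tool call with a non-retryable failure envelope rather than dispatching again.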

The image_generation shim previously fabricated a single partial_image event from the final bytes. Direct probing of Azure's gpt-image-2 standalone endpoint confirmed partial_images is a genuine streaming-only feature: with stream:true the backend emits N distinct progressively rendered frames (image_generation.partial_image, index 0..N-1) before image_generation.completed carries the final bytes and usage. A single final image cannot reconstruct those frames, and no surveyed gateway fakes one — they either relay real frames or emit none. Drive the backend with stream:true when partial_images > 0 and relay each real frame as a native response.image_generation_call.partial_image; fall back to a single non-streaming call (no preview frames) otherwise. To carry frames that arrive over the life of the call, the server-tool slot's deferred portion becomes an async generator (run) that yields intermediate lifecycle events and returns the terminal item, replacing the single result promise. Web search keeps a buffered constructor that yields nothing.

Direct probing of Azure's native hosted image_generation confirmed the completed item echoes the server-RESOLVED tool config (e.g. a request without background comes back background:"opaque"). Both the standalone images response body and its streaming partial_image/completed events carry the resolved background/output_format/quality/size. Read those fields straight off the backend payload instead of inferring them from the request config, so the partial_image frames and the final item surface what was actually rendered — including server defaults the request left as auto/unset. The success outcome now carries the echo and each partial frame carries the echo from its own event.

Menci added 4 commits May 29, 2026 16:14

Merge remote-tracking branch 'origin/main' into worktree-image-genera…

447995f

…tion-shim # Conflicts: # apps/api/src/data-plane/providers/flags.ts

Menci changed the title ~~feat(responses): image_generation server-tool shim~~ Responses image_generation server-tool shim May 29, 2026

Menci changed the title ~~Responses image_generation server-tool shim~~ feat(responses): image_generation server-tool shim May 29, 2026

Menci marked this pull request as draft May 29, 2026 18:22

Menci marked this pull request as ready for review May 29, 2026 18:23

Menci closed this May 29, 2026

Menci reopened this May 29, 2026

Menci wants to merge 4 commits into main from image-generation-shim
