feat(responses): image_generation server-tool shim #24
Open
Menci wants to merge 4 commits into
Serve the Responses `image_generation` hosted tool on upstreams that don't
support it natively, on top of the generic server-tool framework. Gated by
the new `responses-image-generation-shim` flag; mirrors the established
hosted-tool shim shape.
Design (from two parallel research sessions + the .research spec): the native
`image_generation` tool exposes a single `prompt` parameter to the
orchestrator; size/quality/background/etc. are read from the client's tools[]
config and layered onto the backend call by the runtime, not chosen by the
model. The shim does the same: it rewrites the hosted tool into a prompt-only
function tool (~50 input tokens vs the native ~2300), resolves an
image-capable binding via resolveModelForRequest, and drives the standalone
/images/{generations,edits} backend.
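A minimal sketch of the rewrite described above, to make the split between
what the model sees and what the runtime keeps concrete. The type shapes and
the names `rewriteHostedImageTool` / `buildBackendBody` are hypothetical and
not taken from the PR; only the prompt-only idea comes from the description.

```typescript
// Hypothetical shapes; the real protocol types in the PR may differ.
interface HostedImageTool {
  type: "image_generation";
  size?: string;
  quality?: string;
  background?: string;
  output_format?: string;
}

interface FunctionTool {
  type: "function";
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: { prompt: { type: "string"; description: string } };
    required: ["prompt"];
    additionalProperties: false;
  };
}

interface RewriteResult {
  tool: FunctionTool; // what the orchestrator model sees
  config: HostedImageTool; // kept server-side, layered onto the backend call
}

// Replace the hosted tool with a function tool whose only parameter is `prompt`.
function rewriteHostedImageTool(hosted: HostedImageTool, name: string): RewriteResult {
  return {
    tool: {
      type: "function",
      name,
      description: "Generate or edit an image from a text prompt.",
      parameters: {
        type: "object",
        properties: { prompt: { type: "string", description: "What to render." } },
        required: ["prompt"],
        additionalProperties: false,
      },
    },
    config: hosted,
  };
}

// Merge the client's tool config with the model-chosen prompt for the backend call.
function buildBackendBody(config: HostedImageTool, prompt: string): Record<string, unknown> {
  const body: Record<string, unknown> = { prompt };
  for (const [key, value] of Object.entries(config)) {
    if (key !== "type" && value !== undefined) body[key] = value;
  }
  return body;
}
```

The model never chooses size or quality in this sketch; those fields travel
only through `config`.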
Behavior:
- Prompt-only function tool; collision-safe naming and tool_choice rewrite
handled by the framework.
- Per-entry, Azure-strict validation before rewrite: reject unknown fields
(incl. client-supplied `n`) as unknown_parameter; reject output_format:webp
and arbitrary size as invalid_value; integer-range codes for
output_compression / partial_images; reject a present-but-invalid model;
every hosted entry is validated, not just the last.
- Edit vs generate resolved from input image presence (input_image blocks +
full-echo image_generation_call results); all sources attached (order
irrelevant — the model picks the target by prompt); input_fidelity and an
inline input_image_mask forwarded to /images/edits; file_id masks rejected.
- Synthesized lifecycle: in_progress -> generating -> one honest partial_image
-> completed -> output_item.done, with the Azure-superset fields echoed on
both the partial event and the item.
- Backend failures become {ok:false,error:{type,code,message,retryable}} tool
outputs replayed to the orchestrator (never a downstream response.failed),
preserving the upstream error type/code so the model can tell a transient
overload from a terminal policy block.
- Output items round-trip through transformItems as function_call +
function_call_output pairs so non-Responses upstreams stay coherent; a
successful call additionally feeds the generated image back as an
input_image message, matching the native flow where the model can see (and
iteratively edit) what it produced.
- Per-response cap on image backend dispatches bounds cost when a model loops
on retries.
- Best-effort image-backend usage recording on the resolved image model key.
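The per-entry validation bullet above could look roughly like the following.
This is a sketch under stated assumptions: the known-field list, the size
allow-list, the integer ranges, and the `invalid_type` and
`integer_*_value` code strings are illustrative guesses, not values confirmed
by the PR. Only `unknown_parameter`, `invalid_value`, the webp rejection, and
the rule that every hosted entry is checked come from the description.

```typescript
type ValidationError = { code: string; param: string; message: string };

// Assumed field list; `n` is deliberately absent so it is rejected as unknown.
const KNOWN_FIELDS = new Set([
  "type", "model", "size", "quality", "background", "output_format",
  "output_compression", "partial_images", "input_fidelity", "input_image_mask",
]);
const ALLOWED_SIZES = new Set(["auto", "1024x1024", "1024x1536", "1536x1024"]); // assumed
const ALLOWED_FORMATS = new Set(["png", "jpeg"]); // webp rejected, per the description
const INT_RANGES: Record<string, [number, number]> = {
  output_compression: [0, 100], // assumed range
  partial_images: [0, 3], // assumed range
};

function fail(code: string, index: number, key: string): ValidationError {
  return { code, param: `tools[${index}].${key}`, message: `Invalid '${key}' (${code}).` };
}

function validateEntry(entry: Record<string, unknown>, index: number): ValidationError | null {
  for (const key of Object.keys(entry)) {
    if (!KNOWN_FIELDS.has(key)) return fail("unknown_parameter", index, key);
  }
  const format = entry["output_format"];
  if (format !== undefined && !(typeof format === "string" && ALLOWED_FORMATS.has(format))) {
    return fail("invalid_value", index, "output_format");
  }
  const size = entry["size"];
  if (size !== undefined && !(typeof size === "string" && ALLOWED_SIZES.has(size))) {
    return fail("invalid_value", index, "size");
  }
  for (const [key, [min, max]] of Object.entries(INT_RANGES)) {
    const value = entry[key];
    if (value === undefined) continue;
    if (typeof value !== "number" || !Number.isInteger(value)) return fail("invalid_type", index, key);
    if (value < min) return fail("integer_below_min_value", index, key);
    if (value > max) return fail("integer_above_max_value", index, key);
  }
  return null;
}

// Validate every hosted image_generation entry, not just the last one.
function validateHostedEntries(tools: Array<Record<string, unknown>>): ValidationError | null {
  for (let i = 0; i < tools.length; i++) {
    const tool = tools[i];
    if (tool === undefined || tool["type"] !== "image_generation") continue;
    const error = validateEntry(tool, i);
    if (error) return error;
  }
  return null;
}
```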
Extends the shared shim's invalid-request envelope with a custom error code,
the protocol's image item/event types with the resolved-config echo fields
and an error type, and documents at both replay seams what a future stateful
response store would and would not need to persist.
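For illustration, the `{ok:false,error:{type,code,message,retryable}}` tool
output from the Behavior list might be built as below. The `UpstreamError`
shape, the fallback strings, and the status-based `retryable` rule are
assumptions made for this sketch; the PR only states that the upstream
type/code are preserved.

```typescript
// Hypothetical view of a failed image-backend call.
interface UpstreamError {
  status: number;
  type?: string;
  code?: string;
  message?: string;
}

interface ToolFailure {
  ok: false;
  error: { type: string; code: string; message: string; retryable: boolean };
}

// Convert a backend failure into a tool output the orchestrator can reason about.
function toToolFailure(err: UpstreamError): ToolFailure {
  // Assumption: rate limits and 5xx are transient; everything else is terminal.
  const retryable = err.status === 429 || err.status >= 500;
  return {
    ok: false,
    error: {
      type: err.type ?? "server_error",
      code: err.code ?? "unknown_error",
      message: err.message ?? "Image backend request failed.",
      retryable,
    },
  };
}

// The failure is replayed as the function_call_output payload, not as response.failed.
function toFunctionCallOutput(callId: string, failure: ToolFailure) {
  return { type: "function_call_output" as const, call_id: callId, output: JSON.stringify(failure) };
}
```

Keeping the upstream `type` and `code` intact is what would let the model
tell a transient overload from a terminal policy block.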
The image_generation shim previously fabricated a single partial_image event
from the final bytes. Direct probing of Azure's gpt-image-2 standalone
endpoint confirmed partial_images is a genuine streaming-only feature: with
stream:true the backend emits N distinct progressively rendered frames
(image_generation.partial_image, index 0..N-1) before
image_generation.completed carries the final bytes and usage. A single final
image cannot reconstruct those frames, and no surveyed gateway fakes one;
they either relay real frames or emit none.

Drive the backend with stream:true when partial_images > 0 and relay each
real frame as a native response.image_generation_call.partial_image; fall
back to a single non-streaming call (no preview frames) otherwise.

To carry frames that arrive over the life of the call, the server-tool
slot's deferred portion becomes an async generator (run) that yields
intermediate lifecycle events and returns the terminal item, replacing the
single result promise. Web search keeps a buffered constructor that yields
nothing.
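The generator contract described above can be sketched as follows. The event
and item shapes are simplified, and `buffered` and `drain` are hypothetical
helper names; only the "yield intermediate events, return the terminal item"
shape is taken from the commit message.

```typescript
type LifecycleEvent =
  | { type: "response.image_generation_call.in_progress" }
  | { type: "response.image_generation_call.generating" }
  | { type: "response.image_generation_call.partial_image"; partial_image_index: number; partial_image_b64: string };

interface TerminalItem { type: "image_generation_call"; status: "completed"; result: string }

// Simplified view of the standalone backend's streaming events (assumed field names).
type BackendEvent =
  | { type: "image_generation.partial_image"; partial_image_index: number; b64_json: string }
  | { type: "image_generation.completed"; b64_json: string };

// Deferred portion of the slot: yields lifecycle events, returns the terminal item.
async function* run(backend: AsyncIterable<BackendEvent>): AsyncGenerator<LifecycleEvent, TerminalItem, void> {
  yield { type: "response.image_generation_call.in_progress" };
  yield { type: "response.image_generation_call.generating" };
  for await (const event of backend) {
    if (event.type === "image_generation.partial_image") {
      // Relay each real frame as it arrives.
      yield {
        type: "response.image_generation_call.partial_image",
        partial_image_index: event.partial_image_index,
        partial_image_b64: event.b64_json,
      };
    } else {
      return { type: "image_generation_call", status: "completed", result: event.b64_json };
    }
  }
  throw new Error("backend stream ended without image_generation.completed");
}

// Buffered variant for tools with no intermediate events: yields nothing.
async function* buffered<T>(result: Promise<T>): AsyncGenerator<never, T, void> {
  return await result;
}

// Consumer side: collect yielded events and capture the generator's return value.
async function drain<E, R>(gen: AsyncGenerator<E, R, void>): Promise<{ events: E[]; result: R }> {
  const events: E[] = [];
  for (;;) {
    const step = await gen.next();
    if (step.done) return { events, result: step.value };
    events.push(step.value);
  }
}
```

A plain `for await` loop would discard the return value, which is why the
consumer in this sketch steps the generator with `next()`.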
Direct probing of Azure's native hosted image_generation confirmed the completed item echoes the server-RESOLVED tool config (e.g. a request without background comes back background:"opaque"). Both the standalone images response body and its streaming partial_image/completed events carry the resolved background/output_format/quality/size. Read those fields straight off the backend payload instead of inferring them from the request config, so the partial_image frames and the final item surface what was actually rendered — including server defaults the request left as auto/unset. The success outcome now carries the echo and each partial frame carries the echo from its own event.
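As a rough illustration of reading the echo off the backend payload rather
than the request, assuming the four resolved fields arrive as top-level
string properties (the helper name `readEcho` is hypothetical):

```typescript
interface ResolvedEcho {
  background?: string;
  output_format?: string;
  quality?: string;
  size?: string;
}

const ECHO_FIELDS = ["background", "output_format", "quality", "size"] as const;

// Copy only the resolved-config fields the backend actually reported.
function readEcho(payload: Record<string, unknown>): ResolvedEcho {
  const echo: ResolvedEcho = {};
  for (const field of ECHO_FIELDS) {
    const value = payload[field];
    if (typeof value === "string") echo[field] = value;
  }
  return echo;
}
```

The same helper would apply to each partial_image event and to the completed
payload, so every frame carries the echo from its own event.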
…tion-shim # Conflicts: # apps/api/src/data-plane/providers/flags.ts
Route the Responses hosted image_generation tool through the gateway's
image-capable upstream (gpt-image-*) instead of forwarding it to a
Responses upstream, gated by the new responses-image-generation-shim
flag. When a Responses request is routed to a non-Responses backend, the
shim always runs, since those targets cannot carry the hosted tool.
The hosted image_generation tool is replaced by a generated function
tool the orchestrator model calls with a prompt; the shim drives the
standalone /images/{generations,edits} backend and synthesizes the
native image_generation_call lifecycle. Edit vs. generate is resolved
from request input images and action; an inline input_image_mask is
forwarded to /images/edits. Tool-config validation mirrors the public
Azure-strict surface, reproducing its unknown_parameter / invalid_value
and integer-range rejection codes.
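One possible reading of the edit-vs-generate resolution, as a sketch only.
The `action` values, the precedence between `action` and image presence, and
the error for an edit without images are all assumptions; the description
says only that the mode is resolved from request input images and action.

```typescript
type Action = "auto" | "generate" | "edit"; // assumed value set

type Mode =
  | { endpoint: "/images/generations" }
  | { endpoint: "/images/edits"; images: string[]; mask: string | null };

// Pick the standalone endpoint from the requested action and the input images found.
function resolveMode(action: Action, images: string[], mask?: string): Mode {
  if (action === "edit" && images.length === 0) {
    throw new Error("edit requested without any input image");
  }
  if (action === "generate" || images.length === 0) {
    return { endpoint: "/images/generations" };
  }
  // All sources are attached; an inline mask is forwarded when present.
  return { endpoint: "/images/edits", images, mask: mask ?? null };
}
```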
Direct probing of Azure settled two design points. partial_images is a
genuine streaming-only feature: with stream:true the backend emits N
distinct progressively rendered frames before the final image, which a
single non-streaming response cannot reconstruct and no surveyed gateway
fabricates. So partial_images > 0 now drives the backend with stream:true
and relays each real image_generation.partial_image frame as a native
response.image_generation_call.partial_image; partial_images = 0 takes a
single non-streaming call and emits no preview frames. Carrying frames
that arrive over the life of the call required the server-tool slot's
deferred portion to become an async generator that yields intermediate
events and returns the terminal item; web search keeps a buffered
constructor that yields nothing.
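The dispatch choice above reduces to a small branch on partial_images. A
sketch, assuming a simplified request body and a hypothetical
`buildDispatch` helper:

```typescript
interface BackendRequest {
  prompt: string;
  stream?: true;
  partial_images?: number;
}

// Stream only when preview frames were requested; otherwise make one buffered call.
function buildDispatch(prompt: string, partialImages: number): BackendRequest {
  return partialImages > 0
    ? { prompt, stream: true, partial_images: partialImages }
    : { prompt };
}
```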
The completed item and partial frames also echo the server-RESOLVED
config (background / output_format / quality / size) read straight off
the backend payload rather than inferred from the request, so values the
request left as auto/unset surface what was actually rendered. Probing
confirmed both the standalone response body and its streaming events
carry these resolved fields, and that the native hosted tool's completed
item exposes the same field set the shim synthesizes (the native tool
additionally requires the x-ms-oai-image-generation-deployment header,
which is why the shim drives the standalone endpoint instead).
Verified with typecheck, lint, and the full test suite. A hosted
image_generation request through a real gpt-image-* upstream still needs
manual end-to-end confirmation in a client.