feat(responses): image_generation server-tool shim #24

Open
Menci wants to merge 4 commits into main from image-generation-shim

Menci commented May 29, 2026

Route the Responses hosted image_generation tool through the gateway's
image-capable upstream (gpt-image-*) instead of forwarding it to a
Responses upstream, gated by the new responses-image-generation-shim
flag. When a Responses request is routed to a non-Responses backend, the
shim always runs, since those targets cannot carry the hosted tool.
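
As a rough illustration of the gating rule above, the decision could be expressed as a small predicate. This is only a sketch: `UpstreamKind`, `ShimDecisionInput`, and `shouldRunImageShim` are hypothetical names that do not come from the PR, and the real routing code may weigh more inputs.

```typescript
// Hypothetical names throughout; only the rule itself is taken from the description.
type UpstreamKind = "responses" | "chat-completions" | "messages";

interface ShimDecisionInput {
  hasHostedImageTool: boolean; // tools[] contains an entry with type "image_generation"
  upstreamKind: UpstreamKind; // protocol spoken by the routed target
  flagEnabled: boolean; // state of the responses-image-generation-shim flag
}

function shouldRunImageShim(input: ShimDecisionInput): boolean {
  // Nothing to shim when the request does not use the hosted tool.
  if (!input.hasHostedImageTool) return false;
  // Non-Responses targets cannot carry the hosted tool, so the shim is forced on.
  if (input.upstreamKind !== "responses") return true;
  // On a Responses upstream the flag decides.
  return input.flagEnabled;
}
```

With the flag off, a Responses upstream would receive the hosted tool unchanged, while any other upstream kind would still go through the shim.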

The hosted image_generation tool is replaced by a generated function
tool the orchestrator model calls with a prompt; the shim drives the
standalone /images/{generations,edits} backend and synthesizes the
native image_generation_call lifecycle. Edit vs. generate is resolved
from request input images and action; an inline input_image_mask is
forwarded to /images/edits. Tool-config validation mirrors the public
Azure-strict surface, reproducing its unknown_parameter / invalid_value
and integer-range rejection codes.
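
The validation described above could be sketched along these lines. Everything concrete here is an assumption made for illustration: the field allowlist, the size and format sets, the integer ranges, and the `invalid_type`, `integer_below_min_value`, and `integer_above_max_value` code strings are not confirmed by the PR, which only names `unknown_parameter`, `invalid_value`, and "integer-range" codes.

```typescript
// Illustrative allowlists and ranges; assumptions, not the gateway's actual tables.
interface ToolConfigError { code: string; param: string; message: string }

const KNOWN_FIELDS = new Set([
  "type", "model", "action", "background", "input_fidelity", "input_image_mask",
  "moderation", "output_compression", "output_format", "partial_images", "quality", "size",
]);
const ALLOWED_SIZES = new Set(["auto", "1024x1024", "1024x1536", "1536x1024"]);
const ALLOWED_FORMATS = new Set(["png", "jpeg"]); // webp is rejected, per the description
const INT_RANGES: Array<[string, number, number]> = [
  ["output_compression", 0, 100],
  ["partial_images", 0, 3],
];

function validateEntry(entry: Record<string, unknown>, index: number): ToolConfigError | null {
  const at = (field: string): string => `tools[${index}].${field}`;
  const fail = (code: string, field: string): ToolConfigError =>
    ({ code, param: at(field), message: `Rejected '${at(field)}' (${code}).` });

  // Unknown fields (including a client-supplied n) are rejected first.
  for (const field of Object.keys(entry)) {
    if (!KNOWN_FIELDS.has(field)) return fail("unknown_parameter", field);
  }
  const size = entry["size"];
  if (size !== undefined && !(typeof size === "string" && ALLOWED_SIZES.has(size))) {
    return fail("invalid_value", "size");
  }
  const format = entry["output_format"];
  if (format !== undefined && !(typeof format === "string" && ALLOWED_FORMATS.has(format))) {
    return fail("invalid_value", "output_format");
  }
  for (const [field, min, max] of INT_RANGES) {
    const value = entry[field];
    if (value === undefined) continue;
    if (typeof value !== "number" || !Number.isInteger(value)) return fail("invalid_type", field);
    if (value < min) return fail("integer_below_min_value", field);
    if (value > max) return fail("integer_above_max_value", field);
  }
  return null;
}

// Every hosted entry is checked, not just the last one; the first error wins.
function validateHostedTools(tools: Array<Record<string, unknown>>): ToolConfigError | null {
  for (let i = 0; i < tools.length; i++) {
    const tool = tools[i];
    if (tool === undefined || tool["type"] !== "image_generation") continue;
    const error = validateEntry(tool, i);
    if (error !== null) return error;
  }
  return null;
}
```

Returning the first error with a `tools[i].field` path keeps the rejection attributable to a specific entry, which matters once more than one hosted entry is present.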

Direct probing of Azure settled two design points. partial_images is a
genuine streaming-only feature: with stream:true the backend emits N
distinct progressively rendered frames before the final image, which a
single non-streaming response cannot reconstruct and no surveyed gateway
fabricates. So partial_images > 0 now drives the backend with stream:true
and relays each real image_generation.partial_image frame as a native
response.image_generation_call.partial_image; partial_images = 0 takes a
single non-streaming call and emits no preview frames. Carrying frames
that arrive over the life of the call required the server-tool slot's
deferred portion to become an async generator that yields intermediate
events and returns the terminal item; web search keeps a buffered
constructor that yields nothing.
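
A minimal sketch of that slot shape, assuming hypothetical `LifecycleEvent`, `TerminalItem`, and `ServerToolRun` types (the gateway's real types are not shown in this PR text). It illustrates the general pattern of an async generator that yields intermediate events and returns the terminal item, plus a buffered variant that yields nothing.

```typescript
// Hypothetical shapes for illustration only.
interface LifecycleEvent { type: string; [key: string]: unknown }
interface TerminalItem { type: string; status: "completed" | "failed" }
type ServerToolRun = () => AsyncGenerator<LifecycleEvent, TerminalItem, void>;

// Streaming tool: yields frames as they arrive, then returns the final item.
// The two hard-coded frames stand in for frames read off a backend stream.
async function* imageRun(): AsyncGenerator<LifecycleEvent, TerminalItem, void> {
  for (let index = 0; index < 2; index++) {
    yield { type: "response.image_generation_call.partial_image", partial_image_index: index };
  }
  return { type: "image_generation_call", status: "completed" };
}

// Buffered tool: yields nothing and only returns its terminal item.
async function* webSearchRun(): AsyncGenerator<LifecycleEvent, TerminalItem, void> {
  return { type: "web_search_call", status: "completed" };
}

// Driver: forwards every yielded event, then captures the generator's return value.
async function drive(
  run: ServerToolRun,
  emit: (event: LifecycleEvent) => void,
): Promise<TerminalItem> {
  const generator = run();
  while (true) {
    const step = await generator.next();
    if (step.done) return step.value; // the value passed to `return` inside the generator
    emit(step.value);
  }
}
```

Calling `next()` manually rather than using `for await` is what makes the return value reachable; a `for await` loop discards it.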

The completed item and partial frames also echo the server-RESOLVED
config (background / output_format / quality / size) read straight off
the backend payload rather than inferred from the request, so values the
request left as auto/unset surface what was actually rendered. Probing
confirmed both the standalone response body and its streaming events
carry these resolved fields, and that the native hosted tool's completed
item exposes the same field set the shim synthesizes (the native tool
additionally requires the x-ms-oai-image-generation-deployment header,
which is why the shim drives the standalone endpoint instead).
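
Reading the echo off the backend payload, rather than the request, might look like the sketch below. `ResolvedEcho` and `readResolvedEcho` are hypothetical names; only the four field names come from the description.

```typescript
// Hypothetical helper; the four keys are the ones named in the description.
interface ResolvedEcho {
  background?: string;
  output_format?: string;
  quality?: string;
  size?: string;
}

const ECHO_KEYS = ["background", "output_format", "quality", "size"] as const;

// Copies only string-valued echo fields from the backend payload.
// The request config is deliberately not consulted, so "auto" or unset
// request values never leak into the echoed item.
function readResolvedEcho(payload: Record<string, unknown>): ResolvedEcho {
  const echo: ResolvedEcho = {};
  for (const key of ECHO_KEYS) {
    const value = payload[key];
    if (typeof value === "string") echo[key] = value;
  }
  return echo;
}
```

For a payload that carries `background: "opaque"` even though the request omitted `background`, the echo surfaces `"opaque"`, which is the behavior the probing described above confirmed for the native tool.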

Verified with typecheck, lint, and the full test suite. A hosted
image_generation request through a real gpt-image-* upstream still needs
manual end-to-end confirmation in a client.

Menci added 4 commits May 29, 2026 16:14
Serve the Responses `image_generation` hosted tool on upstreams that don't
support it natively, on top of the generic server-tool framework. Gated by
the new `responses-image-generation-shim` flag; mirrors the established
hosted-tool shim shape.

Design (from two parallel research sessions + the .research spec): the native
`image_generation` tool exposes a single `prompt` parameter to the orchestrator;
size/quality/background/etc. are read from the client's tools[] config and
layered onto the backend call by the runtime, not chosen by the model. The
shim does the same: it rewrites the hosted tool into a prompt-only function
tool (~50 input tokens vs the native ~2300), resolves an image-capable
binding via resolveModelForRequest, and drives the standalone
/images/{generations,edits} backend.
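
The rewrite step described above could be sketched roughly as follows. The function name is passed in because collision-safe naming is handled by the framework; `rewriteTools`, `promptOnlyFunctionTool`, the description strings, and the choice to emit one function tool regardless of how many hosted entries appear are assumptions made for this illustration.

```typescript
// Hypothetical sketch; tool shapes are kept loose on purpose.
type Tool = { type: string; [key: string]: unknown };

function promptOnlyFunctionTool(name: string): Tool {
  return {
    type: "function",
    name,
    description: "Generate or edit an image from a text prompt.",
    strict: true,
    parameters: {
      type: "object",
      properties: {
        prompt: { type: "string", description: "Text description of the desired image." },
      },
      required: ["prompt"],
      additionalProperties: false,
    },
  };
}

// Replaces hosted image_generation entries with a single prompt-only function
// tool and returns the hosted configs separately, so size/quality/background
// can be layered onto the backend call later instead of being model-chosen.
function rewriteTools(tools: Tool[], shimName: string): { tools: Tool[]; hostedConfigs: Tool[] } {
  const rewritten: Tool[] = [];
  const hostedConfigs: Tool[] = [];
  for (const tool of tools) {
    if (tool.type !== "image_generation") {
      rewritten.push(tool);
      continue;
    }
    // Assumption: only the first hosted entry gets a stand-in function tool.
    if (hostedConfigs.length === 0) rewritten.push(promptOnlyFunctionTool(shimName));
    hostedConfigs.push(tool);
  }
  return { tools: rewritten, hostedConfigs };
}
```

Keeping the schema to a single required `prompt` string is what keeps the tool definition small compared with the native tool's footprint.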

Behavior:
- Prompt-only function tool; collision-safe naming and tool_choice rewrite
  handled by the framework.
- Per-entry, Azure-strict validation before rewrite: reject unknown fields
  (incl. client-supplied `n`) as unknown_parameter; reject output_format:webp
  and arbitrary size as invalid_value; integer-range codes for
  output_compression / partial_images; reject a present-but-invalid model;
  every hosted entry is validated, not just the last.
- Edit vs generate resolved from input image presence (input_image blocks +
  full-echo image_generation_call results); all sources attached (order
  irrelevant — the model picks the target by prompt); input_fidelity and an
  inline input_image_mask forwarded to /images/edits; file_id masks rejected.
- Synthesized lifecycle: in_progress -> generating -> one honest partial_image
  -> completed -> output_item.done, with the Azure-superset fields echoed on
  both the partial event and the item.
- Backend failures become {ok:false,error:{type,code,message,retryable}} tool
  outputs replayed to the orchestrator (never a downstream response.failed),
  preserving the upstream error type/code so the model can tell a transient
  overload from a terminal policy block.
- Output items round-trip through transformItems as function_call +
  function_call_output pairs so non-Responses upstreams stay coherent; a
  successful call additionally feeds the generated image back as an
  input_image message, matching the native flow where the model can see (and
  iteratively edit) what it produced.
- Per-response cap on image backend dispatches bounds cost when a model loops
  on retries.
- Best-effort image-backend usage recording on the resolved image model key.
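
The failure-output and dispatch-cap bullets above can be illustrated with a small sketch. The output shape `{ok:false,error:{type,code,message,retryable}}` is taken from the list; the `UpstreamError` input, the fallback strings, the rule that 429 and 5xx count as retryable, and the `DispatchBudget` class are assumptions for illustration.

```typescript
// Output shape from the list above; everything else is hypothetical.
interface ToolFailure {
  ok: false;
  error: { type: string; code: string | null; message: string; retryable: boolean };
}

interface UpstreamError { status: number; type?: string; code?: string; message?: string }

// Preserves the upstream type/code so the orchestrator model can tell a
// transient overload from a terminal policy block.
function toToolFailure(upstream: UpstreamError): ToolFailure {
  // Assumption: rate limits and server errors are transient, the rest terminal.
  const retryable = upstream.status === 429 || upstream.status >= 500;
  return {
    ok: false,
    error: {
      type: upstream.type ?? "upstream_error",
      code: upstream.code ?? null,
      message: upstream.message ?? "Image backend request failed.",
      retryable,
    },
  };
}

// Per-response cap on backend dispatches, bounding cost when a model loops on retries.
class DispatchBudget {
  private used = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns false once the cap is reached; the caller should then answer the
  // tool call with a failure output instead of dispatching again.
  tryAcquire(): boolean {
    if (this.used >= this.limit) return false;
    this.used += 1;
    return true;
  }
}
```

The failure object is serialized as the function tool's output and replayed to the orchestrator, so the downstream response itself does not fail.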

Extends the shared shim's invalid-request envelope with a custom error code,
the protocol's image item/event types with the resolved-config echo fields
and an error type, and documents at both replay seams what a future stateful
response store would and would not need to persist.

The image_generation shim previously fabricated a single partial_image
event from the final bytes. Direct probing of Azure's gpt-image-2
standalone endpoint confirmed partial_images is a genuine streaming-only
feature: with stream:true the backend emits N distinct progressively
rendered frames (image_generation.partial_image, index 0..N-1) before
image_generation.completed carries the final bytes and usage. A single
final image cannot reconstruct those frames, and no surveyed gateway
fakes one — they either relay real frames or emit none.

Drive the backend with stream:true when partial_images > 0 and relay
each real frame as a native response.image_generation_call.partial_image;
fall back to a single non-streaming call (no preview frames) otherwise.
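
A minimal sketch of the two halves of that change, under stated assumptions: `buildBackendRequest` and `relayFrame` are hypothetical helpers, and the field names on both event shapes (`b64_json`, `partial_image_index`, `partial_image_b64`, `item_id`, `output_index`) are recalled from the public API shapes rather than taken from this PR.

```typescript
// Hypothetical helpers; event field names are assumptions.
interface BackendFrame { type: string; partial_image_index?: number; b64_json?: string }

interface NativePartialEvent {
  type: "response.image_generation_call.partial_image";
  item_id: string;
  output_index: number;
  partial_image_index: number;
  partial_image_b64: string;
}

// partial_images > 0 switches the backend call to streaming; otherwise a
// single non-streaming call is made and no preview frames exist to relay.
function buildBackendRequest(prompt: string, partialImages: number): Record<string, unknown> {
  return partialImages > 0
    ? { prompt, stream: true, partial_images: partialImages }
    : { prompt };
}

// Maps one real backend frame to the native event; anything else is ignored
// here (the completed event would be handled by the caller).
function relayFrame(frame: BackendFrame, itemId: string, outputIndex: number): NativePartialEvent | null {
  if (frame.type !== "image_generation.partial_image") return null;
  return {
    type: "response.image_generation_call.partial_image",
    item_id: itemId,
    output_index: outputIndex,
    partial_image_index: frame.partial_image_index ?? 0,
    partial_image_b64: frame.b64_json ?? "",
  };
}
```

Because each native event is derived from exactly one backend frame, the shim never has to invent a preview from the final bytes.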

To carry frames that arrive over the life of the call, the server-tool
slot's deferred portion becomes an async generator (run) that yields
intermediate lifecycle events and returns the terminal item, replacing
the single result promise. Web search keeps a buffered constructor that
yields nothing.

Direct probing of Azure's native hosted image_generation confirmed the
completed item echoes the server-RESOLVED tool config (e.g. a request
without background comes back background:"opaque"). Both the standalone
images response body and its streaming partial_image/completed events
carry the resolved background/output_format/quality/size.

Read those fields straight off the backend payload instead of inferring
them from the request config, so the partial_image frames and the final
item surface what was actually rendered — including server defaults the
request left as auto/unset. The success outcome now carries the echo and
each partial frame carries the echo from its own event.

…tion-shim

# Conflicts:
#	apps/api/src/data-plane/providers/flags.ts
Menci changed the title from "feat(responses): image_generation server-tool shim" to "Responses image_generation server-tool shim" May 29, 2026
Menci changed the title from "Responses image_generation server-tool shim" to "feat(responses): image_generation server-tool shim" May 29, 2026
Menci marked this pull request as draft May 29, 2026 18:22
Menci marked this pull request as ready for review May 29, 2026 18:23
Menci closed this May 29, 2026
Menci reopened this May 29, 2026