Support Realtime custom voice objects #3473
return normalized

def _create_realtime_audio_output(audio_output_args: dict[str, Any]) -> Any:
If we upgrade the openai package to openai>=2.36.0, this workaround is no longer necessary, whereas _normalize_custom_voice_for_server_event_validation is still required even with the latest version.
Can you add brief TODO comments to these internal workarounds explaining why each one exists and when it can be removed?
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 393c53087a
if "previous_item_id" in event and event["previous_item_id"] is None:
    event["previous_item_id"] = ""  # TODO (rm) remove
parsed: AllRealtimeServerEvents = self._server_event_type_adapter.validate_python(event)
validation_event = _normalize_custom_voice_for_server_event_validation(event)
Limit voice normalization to events that can contain voice objects
_normalize_custom_voice_for_server_event_validation is applied to every inbound WebSocket event before validation, including high-frequency streaming events like response.output_audio.delta. In long audio turns this adds an extra full recursive walk/allocation per event even when no voice field exists, which can unnecessarily increase CPU/GC pressure and degrade realtime playback latency. Since the workaround is only needed for server events carrying session/response voice settings, scope it to those event types (or fast-path when no voice key is present).
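A minimal sketch of the fast path suggested above. It is not the PR's code: the set of voice-carrying event types, the helper names, and the normalization rule (replacing a {"id": ...} voice object with its id string) are all assumptions made for illustration.

```python
from typing import Any

# Assumption: only these server event types carry session/response voice
# settings. The real list should come from the Realtime API event schemas.
VOICE_EVENT_TYPES = frozenset(
    {"session.created", "session.updated", "response.created", "response.done"}
)


def normalize_voice(value: Any) -> Any:
    """Return a copy in which {"id": "..."} voice objects become plain strings.

    Hypothetical normalization rule, used only to make the sketch concrete.
    """
    if isinstance(value, dict):
        out: dict[str, Any] = {}
        for key, item in value.items():
            if key == "voice" and isinstance(item, dict) and isinstance(item.get("id"), str):
                out[key] = item["id"]
            else:
                out[key] = normalize_voice(item)
        return out
    if isinstance(value, list):
        return [normalize_voice(item) for item in value]
    return value


def normalize_for_validation(event: dict[str, Any]) -> dict[str, Any]:
    """Skip the recursive walk for events that cannot contain a voice field."""
    if event.get("type") not in VOICE_EVENT_TYPES:
        # Fast path: return the same object, no copy and no traversal.
        return event
    return normalize_voice(event)
```

With this shape, high-frequency events such as response.output_audio.delta cost a single set lookup, and the original event is never mutated on the slow path.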
Summary
This PR fixes Realtime custom voice handling in the Agents SDK.
Realtime sessions can receive and send structured custom voice objects such as {"id": "voice_..."}, but the SDK previously typed voice settings as strings and validated inbound server events before updating response lifecycle state. If a server event such as response.created or response.done contained a structured voice object that failed validation, the SDK could skip response state updates and leave the response-create sequencer blocked. That could prevent the next response.create from being sent after tool output.
The change adds typed support for custom voice objects in Realtime session settings, preserves structured voices when building outbound session.update payloads, and adds a validation fallback for inbound server events so custom voice objects do not break response lifecycle tracking.
Tests
- make format
- make lint
- uv run pytest -q tests/realtime/test_openai_realtime.py tests/realtime/test_realtime_model_settings.py
- uv run pytest -q tests/realtime/test_session.py -k "handoff_session_update_preserves_custom_voice or handoff_tool_handling"
- uv run mypy src/agents/realtime/config.py src/agents/realtime/openai_realtime.py tests/realtime/test_openai_realtime.py
- uv run pyright src/agents/realtime/config.py src/agents/realtime/openai_realtime.py tests/realtime/test_openai_realtime.py
- uv run mypy tests/realtime/test_session.py
- uv run pyright tests/realtime/test_session.py
Full make tests / make typecheck were not completed locally because optional dependency installation was blocked by a socket-firewall tunnel failure while downloading docstring-parser==0.18.0.
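To illustrate the failure mode described in the Summary, the sketch below shows one way a validation failure can be kept from blocking lifecycle tracking. It is a simplified illustration, not the SDK's implementation: ResponseSequencer, strict_validate, and handle_server_event are hypothetical names, and the validator merely imitates a schema that accepts only string voices.

```python
from typing import Any, Callable, Optional


class ResponseSequencer:
    """Hypothetical stand-in for the response-create sequencer."""

    def __init__(self) -> None:
        self.in_flight = False

    def on_event_type(self, event_type: str) -> None:
        if event_type == "response.created":
            self.in_flight = True
        elif event_type == "response.done":
            # Unblocks the next response.create.
            self.in_flight = False


def strict_validate(event: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical validator that rejects structured voice objects."""
    voice = event.get("response", {}).get("voice")
    if isinstance(voice, dict):
        raise ValueError("voice must be a string")
    return event


def handle_server_event(
    event: dict[str, Any],
    validate: Callable[[dict[str, Any]], dict[str, Any]],
    sequencer: ResponseSequencer,
) -> Optional[dict[str, Any]]:
    """Validate the event, but update lifecycle state even if validation fails."""
    try:
        parsed: Optional[dict[str, Any]] = validate(event)
    except ValueError:
        parsed = None  # Fallback: no typed event, but keep tracking state.
    sequencer.on_event_type(str(event.get("type", "")))
    return parsed
```

If the state update were instead placed inside the try block after validate, a response.done event carrying {"id": "voice_..."} would leave in_flight stuck at True, which is the blocked-sequencer behavior the Summary describes.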