feat(desktop): real-time voice dictation in composer #1511
Adds dictation support using OpenAI's Realtime API over WebRTC:
Relay:
- New /transcribe/status and /transcribe/session endpoints
- BUZZ_OPENAI_API_KEY env var gates the feature (hidden when absent)
- Proxies ephemeral client-secret minting from OpenAI
Desktop:
- New features/dictation module with:
  - AudioWorklet for 24kHz PCM capture + buffering
  - WebRTC peer connection to OpenAI Realtime API
  - Real-time transcript merging into composer
  - Auto-submit on trigger phrase ('submit')
- Mic button in composer toolbar (red pulse when recording)
- Integrated into MessageComposer via useComposerDictation hook
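The auto-submit item above can be illustrated with a small pure function. This is only a sketch: the name `splitTriggerPhrase` and the matching rule (the trigger must be the final word, compared case-insensitively, with trailing punctuation ignored) are assumptions and are not taken from the PR.

```typescript
// Hypothetical helper, not part of the PR. Assumes the trigger phrase must be
// the last word of the transcript.
interface TriggerResult {
  text: string; // transcript with the trigger phrase removed
  shouldSubmit: boolean;
}

function splitTriggerPhrase(transcript: string, trigger = "submit"): TriggerResult {
  // Drop trailing whitespace and sentence punctuation added by the transcriber.
  const trimmed = transcript.replace(/[\s.,!?]+$/u, "");
  const phrase = trigger.toLowerCase();

  if (!trimmed.toLowerCase().endsWith(phrase)) {
    return { text: transcript, shouldSubmit: false };
  }
  // Require a word boundary so that "resubmit" does not fire the trigger.
  const before = trimmed.slice(0, trimmed.length - phrase.length);
  if (before.length > 0 && !/\s$/u.test(before)) {
    return { text: transcript, shouldSubmit: false };
  }
  // Remove the whitespace or comma that separated the trigger from the message.
  return { text: before.replace(/[\s,]+$/u, ""), shouldSubmit: true };
}
```

Keeping this logic separate from the hook makes the word-boundary behaviour easy to unit test without a microphone or a WebRTC session.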
New public API needs doc comments — clippy runs with -D missing-docs, so TranscribeStatus and TranscribeSession were failing the Rust Lint gate.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6c12132e30
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Both /transcribe/status and /transcribe/session now require NIP-98 authentication and relay membership (with NIP-OA fallback), matching the security posture of /events, /query, and /count. Promotes verify_bridge_auth, check_nip98_replay, and nip98_expected_url to pub(crate) so the transcribe module can reuse them without duplication.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 195d741e65
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e874a53dbf
- Add nonce tag to NIP-98 auth events to prevent replay rejection when
multiple components call /transcribe/status in the same second.
- Wire dictation text into both the Tiptap editor and contentRef via
setComposerContent + setEditorContentRef, so dictated text actually
appears in the composer and is serialized on submit.
- Call submitMessageRef.current() synchronously in onSend instead of via
queueMicrotask, ensuring the editor content is consumed before the
subsequent setText('') clears it.
- Replace naive append-based transcript merging with segment-aware state
tracking (TranscriptSegmentState). Delta events accumulate into
pendingDelta; completed events replace accumulated deltas with the
finalized text, preventing duplication.
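The nonce change in the first item above can be sketched as follows. NIP-98 auth events use kind 27235 with `u` and `method` tags; the `nonce` tag name comes from the commit message, while the helper name `buildNip98Template`, the example URL in the usage note, and the explanation of why same-second events collide are assumptions for illustration. Signing is left out.

```typescript
// Unsigned NIP-98 HTTP auth event template (kind 27235).
interface Nip98Template {
  kind: 27235;
  created_at: number; // unix seconds
  content: "";
  tags: string[][];
}

// Hypothetical helper, not the PR's code. The caller supplies the nonce so the
// function stays pure and testable.
function buildNip98Template(
  url: string,
  method: string,
  nonce: string,
  nowMs: number = Date.now(),
): Nip98Template {
  return {
    kind: 27235,
    created_at: Math.floor(nowMs / 1000),
    content: "",
    tags: [
      ["u", url],
      ["method", method.toUpperCase()],
      // Assumption: without this tag, two requests from the same key in the
      // same second serialize identically, so a replay check keyed on the
      // event would reject the second one.
      ["nonce", nonce],
    ],
  };
}
```

A caller in the desktop app could pass `crypto.randomUUID()` as the nonce, for example `buildNip98Template("https://relay.example/transcribe/status", "GET", crypto.randomUUID())`, where the host name is a placeholder.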
Force-pushed from e874a53 to ebcd42e.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ebcd42e0b5
- Switch relay from /v1/realtime/sessions to /v1/realtime/client_secrets
with the wrapped { session: { ... } } request shape per OpenAI's current
WebRTC guide. The old endpoint returns non-2xx, breaking dictation.
- Redesign TranscriptSegmentState to track per-item segments keyed by
item_id. Completed events for different turns can arrive out of order;
reconciling by item_id preserves utterance ordering and prevents text
reordering or partial-turn drops during fast consecutive speech.
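The per-item reconciliation described above might look roughly like this. It is a sketch, not the PR's `TranscriptSegmentState`: the class name `SegmentTracker`, the simplified event shapes, and the choice to order utterances by the first event seen for each `item_id` are all assumptions.

```typescript
// Simplified stand-ins for the Realtime transcription delta/completed events.
type TranscriptEvent =
  | { kind: "delta"; itemId: string; delta: string }
  | { kind: "completed"; itemId: string; transcript: string };

interface Segment {
  pendingDelta: string;     // accumulated partial text
  finalText: string | null; // set once the completed event arrives
}

class SegmentTracker {
  // Map keeps insertion order, so the first event seen for an item fixes
  // that utterance's position in the merged text.
  private segments = new Map<string, Segment>();

  apply(event: TranscriptEvent): void {
    let seg = this.segments.get(event.itemId);
    if (!seg) {
      seg = { pendingDelta: "", finalText: null };
      this.segments.set(event.itemId, seg);
    }
    if (event.kind === "delta") {
      // Ignore late deltas for an item that has already been finalized.
      if (seg.finalText === null) seg.pendingDelta += event.delta;
    } else {
      // The finalized text replaces the accumulated deltas instead of appending.
      seg.finalText = event.transcript;
      seg.pendingDelta = "";
    }
  }

  text(): string {
    return Array.from(this.segments.values())
      .map((s) => (s.finalText ?? s.pendingDelta).trim())
      .filter((t) => t.length > 0)
      .join(" ");
  }
}
```

One limitation of this sketch: if the very first event for a later turn arrives before any event for an earlier turn, first-seen ordering would still be wrong, so a real implementation may need an explicit ordering signal.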
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9cc335a8e
"session": {
    "model": "gpt-4o-mini-realtime-preview",
    "modalities": ["text"],
    "input_audio_transcription": {
        "model": model,
Create a typed transcription session
After the endpoint fix, the payload still uses the legacy realtime fields here (model/modalities/input_audio_transcription). OpenAI's current client-secrets reference defines session as either a typed realtime or transcription session, with transcription config under audio.input.transcription (https://developers.openai.com/api/reference/resources/realtime/subresources/client_secrets/methods/create), and the realtime transcription guide uses type: "transcription". With the current shape, configured relays will either get a rejected client-secret request or a realtime session with transcription off, so starting the mic never produces transcript events.
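For illustration, the request body this comment asks for could be built as below. The shape is inferred only from the comment itself (`type: "transcription"` and `audio.input.transcription`); any other required fields are not covered, the helper name `buildClientSecretRequest` is hypothetical, and the relay itself is a Rust crate, so this TypeScript version only shows the JSON structure. Check the linked reference before relying on it.

```typescript
// Request shape inferred from the review comment; not verified against the API.
interface ClientSecretRequest {
  session: {
    type: "transcription";
    audio: { input: { transcription: { model: string } } };
  };
}

// Hypothetical builder. The transcription model name is supplied by the caller
// because the diff hunk above does not show its value.
function buildClientSecretRequest(transcriptionModel: string): ClientSecretRequest {
  return {
    session: {
      type: "transcription",
      audio: { input: { transcription: { model: transcriptionModel } } },
    },
  };
}
```

Compared with the hunk above, the legacy top-level `model`, `modalities`, and `input_audio_transcription` fields are gone and the transcription model moves under `audio.input.transcription`.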
  submitMessageRef,
}: UseComposerDictationOptions) {
  return useDictation({
    text: contentRef.current,
Sync editor content before merging dictation
In normal typing, MessageComposer only updates the empty/non-empty state from Tiptap onUpdate; contentRef.current is synced lazily for sends/drafts. Passing that stale ref as the dictation source means a user who types or pastes text and then starts dictation has the first transcript merged against the old ref value, and setEditorContentRef.current(text) replaces the editor, dropping the manually entered prefix.
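One possible way to address this is to refresh the ref from the live editor before each merge. The sketch below uses hypothetical stand-ins (`ComposerHandles`, `getEditorText`, `mergeDictation`) rather than the composer's real refs, and it merges a single finalized transcript chunk by appending it to whatever the editor currently holds.

```typescript
// Hypothetical stand-ins for the composer's refs and editor accessors.
interface ComposerHandles {
  getEditorText: () => string; // reads the live editor content
  contentRef: { current: string }; // lazily synced copy, may be stale
  setEditorContent: (text: string) => void;
}

function mergeDictation(handles: ComposerHandles, transcript: string): string {
  // Refresh the stale ref from the live editor before using it as the base.
  handles.contentRef.current = handles.getEditorText();
  const base = handles.contentRef.current;

  // Insert a space only when the typed prefix does not already end in one.
  const separator = base.length > 0 && !/\s$/.test(base) ? " " : "";
  const merged = base + separator + transcript;

  handles.contentRef.current = merged;
  handles.setEditorContent(merged);
  return merged;
}
```

With a stale `contentRef.current` of `""` and live editor text `"typed prefix"`, merging `"dictated"` keeps the typed prefix instead of replacing it.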
    // holds the dictated text.
    submitMessageRef.current();
  },
  sendDisabled: disabled || isSending,
Respect blocked sends before auto-clearing
Even with the synchronous submit fix, sendDisabled only reflects disabled || isSending here. If the user says “submit” while an attachment is still uploading, MessageComposer.submitMessage returns early because isUploadingRef.current is true, but useDictation then clears the composer after onSend; include the same send blockers (at least uploads/mention preparation) or only clear after the send actually proceeds.
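The second option in this comment (clear only after the send actually proceeds) can be sketched as below. The names `SendGate`, `isSendBlocked`, and `submitAndMaybeClear` are hypothetical, and the set of blockers (uploads and mention preparation) follows the comment rather than the actual composer code.

```typescript
// Hypothetical snapshot of the conditions that block a send.
interface SendGate {
  disabled: boolean;
  isSending: boolean;
  isUploading: boolean;
  isPreparingMentions: boolean;
}

function isSendBlocked(gate: SendGate): boolean {
  return gate.disabled || gate.isSending || gate.isUploading || gate.isPreparingMentions;
}

// Returns true only when the send proceeded; the composer is cleared only then.
function submitAndMaybeClear(gate: SendGate, send: () => void, clear: () => void): boolean {
  if (isSendBlocked(gate)) {
    // Leave the dictated text in place so the user can retry once unblocked.
    return false;
  }
  send();
  clear();
  return true;
}
```

The same `isSendBlocked` result could also feed `sendDisabled`, so the trigger phrase and the send button share one definition of "blocked".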
Summary
Adds real-time voice dictation to the message composer using OpenAI's Realtime API over WebRTC.
How it works
POST /transcribe/session)
Relay changes (crates/buzz-relay)
- POST /transcribe/session: mints an ephemeral OpenAI Realtime client secret
- GET /transcribe/status: returns whether transcription is configured
- BUZZ_OPENAI_API_KEY env var: no key = mic button hidden (graceful degradation)
- reqwest as a direct dependency for the upstream HTTP call
Desktop changes (desktop/src/features/dictation/)
- lib/realtimeBufferWorklet.ts
- lib/realtimeAudio.ts
- lib/voiceInput.ts
- api/transcribeSession.ts
- hooks/useRealtimeDictation.ts
- hooks/useDictation.ts
- hooks/useComposerDictation.ts
- ui/DictationButton.tsx
Integrated into MessageComposer via the toolbar extraActions slot.
Configuration
Design decisions
reqwest workspace dep.