feat(desktop): real-time voice dictation in composer #1511

Open
klopez4212 wants to merge 6 commits into main from kennylopez-dictation
@klopez4212

Summary

Adds real-time voice dictation to the message composer using OpenAI's Realtime API over WebRTC.

How it works

  1. User clicks the mic button in the composer toolbar
  2. Mic audio is captured immediately via an AudioWorklet (24kHz PCM)
  3. Desktop requests an ephemeral client secret from the relay (POST /transcribe/session)
  4. WebRTC peer connection streams audio directly to OpenAI
  5. Transcript deltas stream back and merge into the composer in real-time
  6. User clicks mic again to stop, or says "submit" to auto-send
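
Step 2 can be illustrated with a minimal sketch of the sample conversion. It assumes linear interpolation for resampling, which this PR description does not confirm; realtimeBufferWorklet.ts may use a different method. The helper names `resampleLinear` and `floatToPcm16` are hypothetical.

```typescript
// Hypothetical helpers, not the PR's actual worklet code.
const TARGET_RATE = 24000;

/** Resample `input` from `fromRate` to `toRate` using linear interpolation (an assumption). */
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = TARGET_RATE,
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two neighbouring source samples.
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

/** Convert float samples in [-1, 1] to signed 16-bit PCM, clamping out-of-range values. */
function floatToPcm16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return pcm;
}
```

In a worklet, something like `floatToPcm16(resampleLinear(channelData, sampleRate))` would run once per render quantum; that wiring is left out here because it depends on browser-only APIs.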

Relay changes (crates/buzz-relay)

  • POST /transcribe/session — mints an ephemeral OpenAI Realtime client secret
  • GET /transcribe/status — returns whether transcription is configured
  • Gated by BUZZ_OPENAI_API_KEY env var — no key = mic button hidden (graceful degradation)
  • Added reqwest as a direct dependency for the upstream HTTP call

Desktop changes (desktop/src/features/dictation/)

| File | Purpose |
| --- | --- |
| lib/realtimeBufferWorklet.ts | AudioWorklet: resample mic → 24kHz 16-bit PCM |
| lib/realtimeAudio.ts | WebRTC peer connection, audio buffer flush, transcript merge |
| lib/voiceInput.ts | Text merging logic, auto-submit phrase detection |
| api/transcribeSession.ts | HTTP client for relay transcribe endpoints |
| hooks/useRealtimeDictation.ts | Core WebRTC dictation hook |
| hooks/useDictation.ts | Higher-level hook with auto-submit |
| hooks/useComposerDictation.ts | Thin wrapper pre-wired for MessageComposer state |
| ui/DictationButton.tsx | Mic button (rounded-full, red pulse when recording) |

Integrated into MessageComposer via the toolbar extraActions slot.
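
For the auto-submit phrase detection listed under lib/voiceInput.ts, a rough sketch follows. It assumes the trigger only counts when it is the final word of the transcript, matched case-insensitively and ignoring trailing punctuation; the PR does not spell out these rules, and `detectAutoSubmit` is a hypothetical name.

```typescript
interface AutoSubmitResult {
  /** Transcript with the trigger phrase removed (unchanged when no trigger was found). */
  text: string;
  shouldSubmit: boolean;
}

/** Detect a trailing trigger phrase such as "submit" and strip it from the text. */
function detectAutoSubmit(transcript: string, phrase: string = "submit"): AutoSubmitResult {
  // Escape regex metacharacters so an arbitrary phrase is matched literally.
  const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // The phrase must be a whole word at the end, optionally followed by punctuation.
  const pattern = new RegExp(`(^|\\s)${escaped}[\\s.!?,]*$`, "i");
  const match = pattern.exec(transcript);
  if (!match) return { text: transcript, shouldSubmit: false };
  return { text: transcript.slice(0, match.index).trimEnd(), shouldSubmit: true };
}
```

Requiring a word boundary before the phrase keeps words like "resubmit" from triggering a send, and anchoring to the end keeps "submit" in mid-sentence as ordinary dictated text.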

Configuration

# .env (relay)
BUZZ_OPENAI_API_KEY=sk-...          # required — enables dictation
BUZZ_TRANSCRIPTION_MODEL=whisper-1  # optional — defaults to whisper-1
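
The gating described above can be modelled in a few lines. This is a TypeScript illustration only (the relay itself is Rust), and it assumes blank values are treated as unset, which the PR does not state. `resolveTranscribeConfig` is a hypothetical name.

```typescript
interface TranscribeConfig {
  /** True when an API key is present, i.e. the mic button should be shown. */
  enabled: boolean;
  model: string;
}

const DEFAULT_MODEL = "whisper-1";

/** Derive the dictation config from environment variables. */
function resolveTranscribeConfig(env: Record<string, string | undefined>): TranscribeConfig {
  const key = env["BUZZ_OPENAI_API_KEY"]?.trim();
  const model = env["BUZZ_TRANSCRIPTION_MODEL"]?.trim();
  return {
    enabled: Boolean(key), // no key means the feature stays hidden
    model: model ? model : DEFAULT_MODEL,
  };
}
```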

Design decisions

  • Relay-proxied secrets — the relay holds the API key and mints short-lived client secrets. The frontend never sees the real key.
  • Audio buffering — PCM is buffered during the ~1-2s WebRTC setup so no audio is lost.
  • OSS-friendly — no Block-specific URLs. Self-hosters configure their own key; absent key = feature hidden.
  • No new crates: reqwest is already a workspace dependency, so the relay only adds it as a direct dependency.
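
The audio buffering decision can be sketched as a queue that holds chunks until the connection is ready. This is a simplified illustration under the assumption that buffered chunks are flushed in capture order through a single send callback; `PendingAudioBuffer` and `SendChunk` are hypothetical names, and how the PR actually delivers buffered PCM is not shown here.

```typescript
type SendChunk = (chunk: Int16Array) => void;

/** Holds PCM chunks captured before the connection is ready, then passes chunks straight through. */
class PendingAudioBuffer {
  private queue: Int16Array[] = [];
  private send: SendChunk | null = null;

  /** Called for every captured chunk, whether or not the connection is up yet. */
  push(chunk: Int16Array): void {
    if (this.send) {
      this.send(chunk);
    } else {
      this.queue.push(chunk);
    }
  }

  /** Called once the connection is ready: flush buffered chunks in capture order. */
  connect(send: SendChunk): void {
    for (const chunk of this.queue) send(chunk);
    this.queue = [];
    this.send = send;
  }
}
```

A real implementation would probably also cap the queue size so a stalled connection cannot grow memory without bound.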

klopez4212 and others added 3 commits July 4, 2026 15:02
Adds dictation support using OpenAI's Realtime API over WebRTC:

Relay:
- New /transcribe/status and /transcribe/session endpoints
- BUZZ_OPENAI_API_KEY env var gates the feature (hidden when absent)
- Proxies ephemeral client-secret minting from OpenAI

Desktop:
- New features/dictation module with:
  - AudioWorklet for 24kHz PCM capture + buffering
  - WebRTC peer connection to OpenAI Realtime API
  - Real-time transcript merging into composer
  - Auto-submit on trigger phrase ('submit')
  - Mic button in composer toolbar (red pulse when recording)
- Integrated into MessageComposer via useComposerDictation hook

New public API needs doc comments: clippy runs with -D missing-docs, so
TranscribeStatus and TranscribeSession were failing the Rust Lint gate.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c12132e30

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/buzz-relay/src/api/transcribe.rs Outdated
Comment thread desktop/src/features/dictation/hooks/useComposerDictation.ts Outdated
Comment thread desktop/src/features/dictation/hooks/useDictation.ts

@klopez4212 klopez4212 left a comment

(Review withdrawn — findings are being addressed directly on the branch.)

Both /transcribe/status and /transcribe/session now require NIP-98
authentication and relay membership (with NIP-OA fallback), matching
the security posture of /events, /query, and /count.

Promotes verify_bridge_auth, check_nip98_replay, and nip98_expected_url
to pub(crate) so the transcribe module can reuse them without duplication.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 195d741e65

Comment thread desktop/src/features/dictation/api/transcribeSession.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e874a53dbf

Comment thread desktop/src/features/dictation/api/transcribeSession.ts
Comment thread desktop/src/features/dictation/lib/realtimeAudio.ts Outdated
- Add nonce tag to NIP-98 auth events to prevent replay rejection when
  multiple components call /transcribe/status in the same second.

- Wire dictation text into both the Tiptap editor and contentRef via
  setComposerContent + setEditorContentRef, so dictated text actually
  appears in the composer and is serialized on submit.

- Call submitMessageRef.current() synchronously in onSend instead of via
  queueMicrotask, ensuring the editor content is consumed before the
  subsequent setText('') clears it.

- Replace naive append-based transcript merging with segment-aware state
  tracking (TranscriptSegmentState). Delta events accumulate into
  pendingDelta; completed events replace accumulated deltas with the
  finalized text, preventing duplication.
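
The nonce fix in the first bullet can be sketched as follows. NIP-98 auth events use kind 27235 with `u` and `method` tags; the `nonce` tag is this PR's addition. The sketch builds only the unsigned event, assumes the relay's replay check keys on the event id (so two otherwise identical events in the same second would collide), and uses hypothetical names (`buildAuthEvent`, `randomNonce`) and a placeholder URL.

```typescript
interface UnsignedEvent {
  kind: number;
  created_at: number;
  tags: string[][];
  content: string;
}

const NIP98_KIND = 27235;

/** Illustrative nonce; a real client would likely use a cryptographic random source. */
function randomNonce(): string {
  return Math.random().toString(36).slice(2) + Date.now().toString(36);
}

/** Build an unsigned NIP-98 auth event with an extra nonce tag. Signing is out of scope here. */
function buildAuthEvent(
  url: string,
  method: string,
  nowMs: number = Date.now(),
  nonce: string = randomNonce(),
): UnsignedEvent {
  return {
    kind: NIP98_KIND,
    created_at: Math.floor(nowMs / 1000), // NIP-98 timestamps are in seconds
    tags: [
      ["u", url],
      ["method", method.toUpperCase()],
      ["nonce", nonce], // makes same-second events for the same URL distinct
    ],
    content: "",
  };
}
```

Because the event id is a hash over the serialized event, including its tags, a differing nonce yields a differing id even when `created_at`, URL, and method match.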
@klopez4212 force-pushed the kennylopez-dictation branch from e874a53 to ebcd42e on July 4, 2026 15:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebcd42e0b5

Comment thread crates/buzz-relay/src/api/transcribe.rs Outdated
Comment thread desktop/src/features/dictation/lib/realtimeAudio.ts Outdated
- Switch relay from /v1/realtime/sessions to /v1/realtime/client_secrets
  with the wrapped { session: { ... } } request shape per OpenAI's current
  WebRTC guide. The old endpoint returns non-2xx, breaking dictation.

- Redesign TranscriptSegmentState to track per-item segments keyed by
  item_id. Completed events for different turns can arrive out of order;
  reconciling by item_id preserves utterance ordering and prevents text
  reordering or partial-turn drops during fast consecutive speech.
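
A rough sketch of per-item reconciliation is shown below. It assumes utterance order is the order in which each item_id is first seen, and that deltas arriving after an item's completed event can be ignored; neither detail is confirmed by the commit message. `TranscriptSegments` is a hypothetical stand-in for TranscriptSegmentState.

```typescript
interface Segment {
  pendingDelta: string;
  finalText: string | null;
}

class TranscriptSegments {
  // Map keeps insertion order, so segments render in first-seen order.
  private segments = new Map<string, Segment>();

  private segmentFor(itemId: string): Segment {
    let seg = this.segments.get(itemId);
    if (!seg) {
      seg = { pendingDelta: "", finalText: null };
      this.segments.set(itemId, seg);
    }
    return seg;
  }

  /** Accumulate a partial transcript for one item. */
  applyDelta(itemId: string, delta: string): void {
    const seg = this.segmentFor(itemId);
    if (seg.finalText === null) seg.pendingDelta += delta;
  }

  /** Replace the accumulated deltas for one item with its finalized text. */
  applyCompleted(itemId: string, transcript: string): void {
    const seg = this.segmentFor(itemId);
    seg.finalText = transcript;
    seg.pendingDelta = "";
  }

  /** Join all segments in utterance order. */
  text(): string {
    const parts: string[] = [];
    this.segments.forEach((seg) => {
      const part = (seg.finalText ?? seg.pendingDelta).trim();
      if (part) parts.push(part);
    });
    return parts.join(" ");
  }
}
```

Because each completed event only touches its own item_id, a later turn finishing first cannot overwrite or reorder an earlier turn that is still streaming deltas.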

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9cc335a8e

Comment on lines +84 to +88
"session": {
    "model": "gpt-4o-mini-realtime-preview",
    "modalities": ["text"],
    "input_audio_transcription": {
        "model": model,

P1: Create a typed transcription session

After the endpoint fix, the payload still uses the legacy realtime fields here (model/modalities/input_audio_transcription). OpenAI's current client-secrets reference defines session as either a typed realtime or transcription session, with transcription config under audio.input.transcription (https://developers.openai.com/api/reference/resources/realtime/subresources/client_secrets/methods/create), and the realtime transcription guide uses type: "transcription". With the current shape, configured relays will either get a rejected client-secret request or a realtime session with transcription off, so starting the mic never produces transcript events.
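
A sketch of the request body this comment asks for, written in TypeScript for illustration (the relay builds it in Rust). The field names are taken only from the comment above and have not been checked against OpenAI's reference here; other fields may be required. `buildClientSecretRequest` is a hypothetical name.

```typescript
/** Shape implied by the review comment; treat it as an assumption, not the full schema. */
interface ClientSecretRequest {
  session: {
    type: "transcription";
    audio: {
      input: {
        transcription: { model: string };
      };
    };
  };
}

/** Wrap the transcription model in a typed transcription session. */
function buildClientSecretRequest(model: string): ClientSecretRequest {
  return {
    session: {
      type: "transcription",
      audio: { input: { transcription: { model } } },
    },
  };
}
```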

  submitMessageRef,
}: UseComposerDictationOptions) {
  return useDictation({
    text: contentRef.current,

P2: Sync editor content before merging dictation

In normal typing, MessageComposer only updates the empty/non-empty state from Tiptap onUpdate; contentRef.current is synced lazily for sends/drafts. Passing that stale ref as the dictation source means a user who types or pastes text and then starts dictation has the first transcript merged against the old ref value, and setEditorContentRef.current(text) replaces the editor, dropping the manually entered prefix.
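
One possible shape for the fix is to read the live editor text when dictation starts instead of passing a ref that may be stale. The sketch below assumes a getter for the current editor text is available; `DictationSession`, `mergeDictation`, and `getEditorText` are hypothetical names, not part of the PR.

```typescript
/** Append dictated text to the text that was in the editor when dictation began. */
function mergeDictation(baseText: string, transcript: string): string {
  const base = baseText.trimEnd();
  const spoken = transcript.trim();
  if (!spoken) return baseText;
  return base ? `${base} ${spoken}` : spoken;
}

class DictationSession {
  private baseText = "";
  private readonly getEditorText: () => string;

  constructor(getEditorText: () => string) {
    this.getEditorText = getEditorText;
  }

  /** Snapshot the live editor content at the moment the mic is turned on. */
  start(): void {
    this.baseText = this.getEditorText();
  }

  /** Text to write back into the editor for the current transcript. */
  render(transcript: string): string {
    return mergeDictation(this.baseText, transcript);
  }
}
```

Snapshotting at `start()` means typed or pasted text that never reached the lazily synced ref is still used as the prefix.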

    // holds the dictated text.
    submitMessageRef.current();
  },
  sendDisabled: disabled || isSending,

P2: Respect blocked sends before auto-clearing

Even with the synchronous submit fix, sendDisabled only reflects disabled || isSending here. If the user says “submit” while an attachment is still uploading, MessageComposer.submitMessage returns early because isUploadingRef.current is true, but useDictation still clears the composer after onSend, so the dictated text is lost without being sent. Include the same send blockers (at least uploads and mention preparation), or only clear after the send actually proceeds.
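
The second option, clearing only when the send actually proceeds, could look roughly like this. The blocker set is an assumption based on this comment (mention preparation is omitted for brevity), and `SendBlockers`, `canSend`, and `autoSubmit` are hypothetical names.

```typescript
interface SendBlockers {
  disabled: boolean;
  isSending: boolean;
  isUploading: boolean;
}

/** True only when nothing is blocking a send. */
function canSend(blockers: SendBlockers): boolean {
  return !blockers.disabled && !blockers.isSending && !blockers.isUploading;
}

/**
 * Attempt an auto-submit and return the text the composer should hold afterwards:
 * empty when the message was sent, unchanged when the send was blocked.
 */
function autoSubmit(text: string, blockers: SendBlockers, send: (text: string) => void): string {
  if (!canSend(blockers)) return text; // keep the dictated text; nothing was sent
  send(text);
  return "";
}
```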
