From 9188c044fe46bc51ce8e17dbb884576e919aa4bb Mon Sep 17 00:00:00 2001 From: Lorna Armstrong Date: Wed, 18 Mar 2026 10:48:09 +0000 Subject: [PATCH 1/8] Restructure and add missing messages --- docs/private/voice-agent-api.mdx | 514 +++++++++++++++++++------------ 1 file changed, 311 insertions(+), 203 deletions(-) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 869444a..08a5e3c 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -14,133 +14,140 @@ description: Early access to the Voice Agent API — a turn-based API built for ## Introduction -The Voice Agent API is a turn-based API built for voice agents. It is designed for developers building low-latency integrations between speech and LLMs — with turn detection, speaker awareness, and segment-based output built in so you can focus on your agent logic. +The Voice Agent API is a WebSocket API for building voice agents. Stream audio in and receive speaker-labelled, turn-based transcription back — clean, punctuated, and ready to pass directly to an LLM. ---- - -## What it does - -The Voice Agent API is a turn-based API. Rather than a stream of word-level events, speech is grouped into segments — and turn detection determines when a speaker has finished, triggering fast finalisation of those segments. - -You receive: - -- `StartOfTurn` — when a speaker begins talking -- `AddPartialSegment` — interim transcript updates as they speak -- `AddSegment` — the final, complete transcript for that turn -- `EndOfTurn` — when the turn is complete - -When a turn ends, you receive an `AddSegment` containing the finalised utterance. 
In multi-speaker scenarios, a single message may contain segments from multiple speakers, returned in time order: - -```json -{ - "message": "AddSegment", - "segments": [ - { - "speaker_id": "S1", - "is_active": true, - "timestamp": "2025-01-01T12:00:00.000+00:00", - "language": "en", - "text": "Welcome to Speechmatics.", - "is_eou": true, - "metadata": { - "start_time": 0.84, - "end_time": 1.56 - } - }, - { - "speaker_id": "S2", - "is_active": true, - "timestamp": "2025-01-01T12:00:02.000+00:00", - "language": "en", - "text": "Thank you for testing the Voice Agent API.", - "is_eou": true, - "metadata": { - "start_time": 2.10, - "end_time": 3.80 - } - } - ], - "metadata": { - "start_time": 0.84, - "end_time": 3.80, - "processing_time": 0.25 - } -} -``` +Turn detection runs server-side. Choose a [profile](#profiles) based on your use case and the API handles when to finalise each speaker's turn. -Each segment's `text` field is clean, punctuated, and ready to use. When a message contains multiple segments, you'll need to concatenate them. The SDK reconstructs the exchange using `speaker_id` and `is_active` — non-active speakers (outside your focus list) are marked as `[background]`: - -```python -' '.join([f"@{s.speaker_id}{'' if s.is_active else ' [background]'}: {s.text}" for s in segments]) -``` - -Which produces: - -``` -@S1: Hello there. @S2 [background]: It was yesterday. @S1: How are you getting on? -``` - -No accumulating partials, no stitching words together, no guessing when the speaker has finished. The turn detection handles all of that, so your agent can respond as fast as possible. +To jump straight into code, see working examples in [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer) for both Python and JavaScript. --- ## Profiles -Profiles are pre-tuned configurations for voice agents. 
Each profile sets the right defaults for turn detection, latency, and endpointing — no need to configure the API settings yourself. +Profiles are pre-configured turn detection modes. Each profile sets the right defaults for your use case — you choose one when connecting, and the server handles the rest. -Choose the profile that best fits your use case: +| Profile | Turn detection | Best for | +|---------|---------------|----------| +| `agile` | VAD-based silence detection | Speed-first use cases | +| `adaptive` | Adapts to speaker pace and hesitation | General conversational agents | +| `smart` | `adaptive` + ML acoustic turn prediction | High-stakes conversations | +| `external` | Manual — you trigger turn end | Push-to-talk, custom VAD, LLM-driven | ### `agile` **Endpoint:** `/v2/agent/agile` -Lowest end-of-speech to final latency. Uses voice activity detection to finalise turns as quickly as possible. +Uses voice activity detection (VAD) to detect silence and finalise turns as quickly as possible. The lowest latency profile. -**Best for:** Use cases where response speed is the top priority. +**Best for:** Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable. -**Trade-off:** May produce more finalised segments mid-speaker, which can result in additional downstream LLM calls. - ---- +**Trade-off:** Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls. ### `adaptive` **Endpoint:** `/v2/agent/adaptive` -Adapts to each speaker over the course of a conversation. Waits longer for slow speakers or those who hesitate frequently. Works with all languages. +Adapts to each speaker's pace over the course of a conversation. It adjusts the turn-end threshold based on speech rate and disfluencies (e.g. hesitations, filler words), waiting longer for speakers who tend to pause mid-thought. 
**Best for:** General conversational voice agents. -**Trade-off:** Latency is not consistently the fastest. Disfluency/hesitation detection is English-only — other languages use speech-rate adaptation only. - ---- +**Trade-off:** Latency varies by speaker. Disfluency detection is English-only — other languages fall back to speech-rate adaptation. ### `smart` **Endpoint:** `/v2/agent/smart` -Builds on `adaptive` and additionally analyses vocal tone to improve turn completion. The most conservative profile. +Builds on `adaptive` with an additional ML model that analyses acoustic cues to predict whether a speaker has genuinely finished their turn. The most conservative profile — least likely to interrupt. -**Best for:** High-stakes conversations where interrupting the user is costly (finance, healthcare, legal). +**Best for:** High-stakes conversations where cutting off the user is costly — finance, healthcare, legal. **Trade-off:** Higher latency than `adaptive`. Supported languages: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese. ---- - ### `external` **Endpoint:** `/v2/agent/external` -You control when a turn ends. Send a `ForceEndOfUtterance` message to trigger finalisation — the server will return a combined segment of everything spoken up to that point. +Turn detection is fully manual. The server accumulates audio and transcript until you send a `ForceEndOfUtterance` message, at which point it finalises everything spoken up to that point and emits an `AddSegment`. + +**Best for:** Push-to-talk interfaces, custom VAD pipelines, or setups where an LLM decides when to respond. + +**Trade-off:** You are responsible for all turn detection logic. -**Best for:** Push-to-talk, custom VAD, or LLM-driven turn detection. 
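The profile is selected purely by the endpoint path. A small helper for building the connection URL — the base URL is the preview endpoint from this page; the helper name and the validation are our own:

```python
PREVIEW_BASE = "wss://preview.rt.speechmatics.com/v2/agent"

# The four profiles documented above; the final path segment is the profile name.
PROFILES = ("agile", "adaptive", "smart", "external")

def endpoint_for(profile: str) -> str:
    """Return the WebSocket URL for the chosen turn-detection profile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return f"{PREVIEW_BASE}/{profile}"

print(endpoint_for("adaptive"))  # wss://preview.rt.speechmatics.com/v2/agent/adaptive
```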
+--- + +## Session Flow + +Every session follows the same structure: connect, start recognition, stream audio, receive turn events, close. + +```mermaid +sequenceDiagram + participant C as Client + participant S as Server + + C->>S: Connect to endpoint with profile via WebSocket + C->>S: StartRecognition + S-->>C: RecognitionStarted + + loop Audio Stream + C->>S: Audio frames (binary) + S-->>C: AudioAdded + S-->>C: StartOfTurn + S-->>C: AddPartialSegment + S-->>C: AddSegment + S-->>C: EndOfTurn + + opt Optional — speaker events + S-->>C: SpeakerStarted / SpeakerEnded + S-->>C: SessionMetrics / SpeakerMetrics + end + + opt Optional — mid-session controls + C->>S: ForceEndOfUtterance (external profile only) + C->>S: UpdateSpeakerFocus + C->>S: GetSpeakers + S-->>C: SpeakersResult + end + end + + C->>S: EndOfStream + S-->>C: EndOfTranscript +``` + +**Client → Server** + +| Message | When to send | +|---------|-------------| +| [`StartRecognition`](#startrecognition) | First message after connecting. Starts the session and passes configuration. | +| Audio frames | Binary WebSocket frames containing raw PCM audio, sent continuously while audio is available. | +| [`ForceEndOfUtterance`](#forceendofutterance) | `external` profile only. Signals that the current turn is complete. | +| [`UpdateSpeakerFocus`](#updatespeakerfocus) | Any time during the session. Changes which speakers are in focus. | +| [`GetSpeakers`](#getspeakers) | Any time during the session. Requests voice identifiers for enrolled speakers. | +| [`EndOfStream`](#endofstream) | When there is no more audio to send. | -**Trade-off:** Most complex to implement — you are responsible for turn detection logic. +**Server → Client** + +| Message | When it's emitted | +|---------|------------------| +| [`RecognitionStarted`](#standard-messages) | Session is ready. | +| [`StartOfTurn`](#startofturn) | A speaker begins a new turn. | +| [`AddPartialSegment`](#addpartialsegment) | Interim transcript update. 
Replaces the previous partial. | +| [`AddSegment`](#addsegment) | Final transcript for the turn. Send this to your LLM. | +| [`EndOfTurn`](#endofturn) | Turn is complete. Your agent can now respond. | +| [`EndOfTranscript`](#standard-messages) | All audio processed. Emitted after `EndOfStream`. | --- -## Getting started +## Getting Started + +### 1. Connect -### Authentication +Open a WebSocket connection to the preview endpoint. To do this, you must specify the [profile](#profiles) to use: + +``` +wss://preview.rt.speechmatics.com/v2/agent/ +``` + +### 2. Authenticate Authenticate every connection using one of the following: @@ -153,113 +160,171 @@ Authenticate every connection using one of the following: See [Authentication](/get-started/authentication) for details including temporary keys. -### Endpoint +### 3. Start the session -The Voice Agent API is available at the preview endpoint. Choose a [profile](#profiles) based on your use case: +Send [`StartRecognition`](#startrecognition) as your first message: +```json +{ + "message": "StartRecognition", + "transcription_config": { + "language": "en" + } +} ``` -wss://preview.rt.speechmatics.com/v2/agent/ -``` +For all configuration options, see [Configuration](#configuration). +The server responds with `RecognitionStarted` when the session is ready. You should wait for this message before sending audio. -For example, to use the `adaptive` profile: -``` -wss://preview.rt.speechmatics.com/v2/agent/adaptive -``` +### 4. Stream audio and handle responses +Send audio as binary WebSocket frames. Turn events will arrive in real time as the API processes speech — see [Session Flow](#session-flow) for the full message sequence. + +--- -### Session flow +## Configuration -1. Open the WebSocket connection. -2. Send `StartRecognition` as the first JSON message. -3. Stream raw PCM audio as binary frames. -4. Send `EndOfStream` when audio is finished. -5. Read server messages until `EndOfTranscript`. -6. Close the connection. 
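The steps above can be sketched end to end. This is illustrative, not a full client: it assumes the third-party `websockets` package (any WebSocket client works), and note that older `websockets` releases name the header keyword `extra_headers` rather than `additional_headers`:

```python
import json

def start_recognition(language: str = "en", sample_rate: int = 16000) -> str:
    """Build the first message of every session: audio format plus transcription config."""
    return json.dumps({
        "message": "StartRecognition",
        "audio_format": {"type": "raw", "encoding": "pcm_s16le", "sample_rate": sample_rate},
        "transcription_config": {"language": language},
    })

async def run_session(url: str, api_key: str, pcm_chunks):
    import websockets  # third-party: pip install websockets

    headers = {"Authorization": f"Bearer {api_key}"}
    # NB: on older websockets versions this keyword is `extra_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(start_recognition())          # start the session
        await ws.recv()                             # wait for RecognitionStarted (sketch: assumes it arrives first)
        seq_no = 0
        for chunk in pcm_chunks:                    # stream raw PCM as binary frames
            await ws.send(chunk)
            seq_no += 1
        await ws.send(json.dumps({"message": "EndOfStream", "last_seq_no": seq_no}))
        async for raw in ws:                        # read until EndOfTranscript
            if json.loads(raw).get("message") == "EndOfTranscript":
                break                               # context manager closes the socket
```

In a real agent you would handle the turn messages inside the receive loop rather than discarding them.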
+Configuration is passed in [`StartRecognition`](#startrecognition) and is split across two levels of the payload: `audio_format` (top-level) and `transcription_config`. + +**`audio_format`** + +| Field | Notes | +|-------|-------| +| `type` | Must be `raw` | +| `encoding` | Must be `pcm_s16le` (16-bit signed little-endian PCM) | +| `sample_rate` | Must be `8000` or `16000` | + +**`transcription_config`** + +| Field | Default | Notes | +|-------|---------|-------| +| `language` | `en` | All supported languages | +| `output_locale` | — | Output locale (e.g. `en-US`) | +| `additional_vocab` | — | Custom vocabulary entries | +| `punctuation_overrides` | — | Custom punctuation rules | +| `domain` | — | Domain-specific model (e.g. `medical`) | +| `enable_entities` | `false` | Entity detection | +| `enable_partials` | `true` | Emit partial segments during speech | +| `diarization` | `speaker` | Speaker diarization; `none` to disable | +| `volume_threshold` | — | Minimum audio volume to process | + +**`transcription_config.speaker_diarization_config`** + +Note: The following require `diarization: speaker` to be set. +| Field | Default | Notes | +|-------|---------|-------| +| `max_speakers` | — | Maximum number of speakers to track | +| `speaker_sensitivity` | — | Sensitivity of speaker separation | +| `prefer_current_speaker` | — | Bias toward the most recently active speaker | +| `known_speakers` | — | Pre-enrolled speaker identifiers for cross-session recognition (see [Speaker ID](#speaker-id)) | + +**Not supported — will be rejected if present** + +| Field | Notes | +|-------|-------| +| `translation_config` | Not supported on this endpoint | +| `audio_events_config` | Not supported on this endpoint | + +--- + +## API Reference - Client Messages ### StartRecognition -Send this as the first message after connecting: +The first message you send after connecting. Starts the recognition session and passes configuration. The server responds with `RecognitionStarted`. 
```json { "message": "StartRecognition", + "audio_format": { + "type": "raw", + "encoding": "pcm_s16le", + "sample_rate": 16000 + }, "transcription_config": { "language": "en" } } ``` -### Configuration reference +For all configuration options, see [Configuration](#configuration). -**Configurable settings (`transcription_config`)** +### EndOfStream -| Setting | Default | Notes | -|---------|---------|-------| -| `language` | `en` | All supported languages | -| `output_locale` | - | Client can specify an output locale (e.g. `en-US`) | -| `additional_vocab` | - | Custom vocabulary entries | -| `punctuation_overrides` | - | Punctuation overrides | -| `domain` | - | Client can specify a domain (e.g. `medical`) | -| `enable_entities` | `false` | Enable entity detection | -| `enable_partials` | `true` | Enable partials in output | -| `diarization` | `speaker` | Supports `none` or `speaker` only | -| `speaker_diarization_config.max_speakers` | - | Limit speaker count | -| `speaker_diarization_config.speaker_sensitivity` | - | Diarization sensitivity | -| `speaker_diarization_config.prefer_current_speaker` | - | Hold on to current speaker | -| `speaker_diarization_config.speakers` | - | Known speakers | -| `volume_threshold` | - | Audio filtering | - -**Not configurable (`transcription_config`)** - -| Setting | Notes | -|---------|-------| -| `operating_point` | Managed per profile | -| `max_delay` | Managed per profile | -| `max_delay_mode` | Managed per profile | -| `streaming_mode` | Always enabled | -| `conversation_config` | Managed by profile / Voice SDK | -| `audio_filtering_config` | Managed by profile | -| `transcript_filtering_config` | Managed by profile | -| `channel_diarization_labels` | Not available | - -**Payload-level settings** - -| Setting | Configurable? 
| Notes | -|---------|--------------|-------| -| `audio_format` | Yes | Client declares encoding and sample rate | -| `translation_config` | No* | Not supported — rejected if present in the payload | -| `audio_events_config` | No* | Not supported — rejected if present in the payload | -| `message_control` | No | Adjust which messages are forwarded (hidden) | - -### Code examples - -Full working examples in Python and JavaScript are available in the [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer). +Send when you have finished streaming audio. The server finalises any remaining transcript and then emits `EndOfTranscript`. + +`last_seq_no` is the sequence number of the last audio frame you sent. +```json +{ + "message": "EndOfStream", + "last_seq_no": 1234 +} +``` + +### ForceEndOfUtterance + +Only applies to the `external` profile. Immediately ends the current turn — the server finalises all audio received so far and emits a single `AddSegment` containing the complete transcript for that turn, followed by `EndOfTurn`. + +Use this wherever your application decides a turn is complete: on button release (push-to-talk), on VAD silence, or on an LLM signal. + +```json +{ + "message": "ForceEndOfUtterance" +} +``` + +### UpdateSpeakerFocus + +Updates which speakers are in focus, mid-session. Takes effect immediately. See [Speaker Focus](#speaker-focus) for full details. + +```json +{ + "message": "UpdateSpeakerFocus", + "speaker_focus": { + "focus_speakers": ["S1"], + "ignore_speakers": [], + "focus_mode": "retain" + } +} +``` + +### GetSpeakers + +Requests voice identifiers for all speakers diarized so far in the session. The server responds with a `SpeakersResult` message. See [Speaker ID](#speaker-id) for full details. 
```json
{
  "message": "GetSpeakers"
}
```

---

## API Reference - Server Messages

### Standard messages

The following standard RT messages are emitted alongside Voice Agent API messages. See the [API reference](/api-ref) for full payload details.

| Message | When it's emitted |
|---------|------------------|
| `AudioAdded` | Acknowledges receipt of an audio frame; carries the frame's `seq_no` |
| `RecognitionStarted` | Session is ready; emitted in response to `StartRecognition` |
| `AddPartialTranscript` | Word-level partial transcript update (lower-level than `AddPartialSegment`) |
| `AddTranscript` | Word-level final transcript (lower-level than `AddSegment`) |
| `EndOfUtterance` | Silence threshold reached; precedes turn finalisation |
| `EndOfTranscript` | All audio processed; emitted after `EndOfStream` |
| `Info` | Non-critical informational message from the server |
| `Warning` | Non-fatal issue (e.g. unsupported config ignored) |
| `Error` | Fatal error; the connection will close |

### Voice Agent API messages

These messages are only emitted when using a voice agent profile (`/v2/agent/<profile>`).

#### `StartOfTurn`

Emitted when a speaker begins a new turn. Use this to signal to your agent that it should stop speaking if it currently is.

```json
{
  "message": "StartOfTurn",
  "turn_id": 1
}
```

**Fields:**
- `turn_id` — monotonically increasing integer; pairs with the corresponding `EndOfTurn`

#### `EndOfTurn`

Emitted when turn detection decides the speaker has finished. This is the trigger for your agent to respond.
The finalised transcript for the turn is in the preceding `AddSegment`. ```json { @@ -283,9 +351,13 @@ Emitted when a turn is complete. } ``` +**Fields:** +- `turn_id` — matches the `StartOfTurn` for this turn +- `metadata.start_time` / `metadata.end_time` — audio time range for the turn, in seconds from session start + #### `AddPartialSegment` -Interim transcript updates emitted as the speaker talks. Each new partial replaces the previous one. +Interim transcript update emitted continuously while the speaker is talking. Each new `AddPartialSegment` replaces the previous one — do not concatenate them. ```json { @@ -314,7 +386,9 @@ Interim transcript updates emitted as the speaker talks. Each new partial replac #### `AddSegment` -The final, complete transcript for a turn. Emitted at `EndOfTurn`. This is the stable output to send to your LLM. +The final, complete transcript for a turn. Emitted just before `EndOfTurn`. This is the stable output to pass to your LLM — do not use `AddPartialSegment` for this. + +In multi-speaker scenarios, a single `AddSegment` may contain segments from multiple speakers, returned in time order. ```json { @@ -341,22 +415,25 @@ The final, complete transcript for a turn. Emitted at `EndOfTurn`. This is the s } ``` -**Key fields:** -- `speaker_id` — speaker label (e.g. `S1`, `S2`) -- `is_active` — whether this speaker is in your focus list (see [Speaker focus](#speaker-focus)) -- `is_eou` — `true` on final segments -- `start_time` / `end_time` — time in seconds relative to session start -- `processing_time` (message-level `metadata`) — transcription latency in seconds +**Segment fields:** +- `speaker_id` — speaker label (e.g. 
`S1`, `S2`, or a custom label if using [Speaker ID](#speaker-id)) +- `is_active` — `true` if this speaker is in your current focus list; `false` if they are a background speaker (see [Speaker Focus](#speaker-focus)) +- `is_eou` — `true` on final segments, `false` on partials +- `text` — clean, punctuated transcript text +- `metadata.start_time` / `metadata.end_time` — time range of this segment in seconds from session start + +**Message-level fields:** +- `metadata.processing_time` — transcription latency in seconds for this message #### `SpeakerStarted` / `SpeakerEnded` -Emitted when a specific speaker starts or stops speaking. Useful for multi-party conversations. +Emitted when a specific speaker starts or stops being heard. These are voice activity events — they fire based on detected speech, independently of turn boundaries. ```json { "message": "SpeakerStarted", - "is_active": true, "speaker_id": "S1", + "is_active": true, "time": 0.84, "metadata": { "start_time": 0.84, "end_time": 0.84 } } @@ -365,21 +442,23 @@ Emitted when a specific speaker starts or stops speaking. 
Useful for multi-party ```json { "message": "SpeakerEnded", - "is_active": false, "speaker_id": "S1", + "is_active": true, "time": 3.24, "metadata": { "start_time": 0.84, "end_time": 3.24 } } ``` -**Key fields:** -- `time` — seconds of audio from session start when the speaker activity occurred -- `metadata.start_time` — when that speaker started their current speaking interval -- `metadata.end_time` (`SpeakerEnded` only) — when that speaker stopped speaking +**Fields:** +- `speaker_id` — the speaker whose activity changed +- `is_active` — whether this speaker is in your current focus list +- `time` — seconds from session start when the activity was detected +- `metadata.start_time` — when this speaker started their current speaking interval +- `metadata.end_time` — when this speaker stopped speaking (`SpeakerEnded` only) -#### `SessionMetrics` / `SpeakerMetrics` +#### `SessionMetrics` -`SessionMetrics` is emitted every 5 seconds and at the end of the session. `SpeakerMetrics` is emitted each time a speaker speaks a word. +Emitted every 5 seconds and once at the end of the session. ```json { @@ -391,6 +470,10 @@ Emitted when a specific speaker starts or stops speaking. Useful for multi-party } ``` +#### `SpeakerMetrics` + +Emitted each time a speaker produces a recognised word. + ```json { "message": "SpeakerMetrics", @@ -405,49 +488,67 @@ Emitted when a specific speaker starts or stops speaking. Useful for multi-party } ``` +#### SpeakersResult + +Emitted as a response to a `GetSpeakers` message. + +```json +{ + "message": "SpeakersResult", + "speakers": [ + { "label": "S1", "speaker_identifiers": [""] }, + { "label": "S2", "speaker_identifiers": [""] } + ] +} +``` + + --- -## Speaker focus +## Features + +The Voice Agent API introduces key features built with voice agents in mind. These include: +### **Speaker Focus** +This lets you control which speakers' output your agent acts on. 
By default, all detected speakers are active and their transcripts are included in `AddSegment` output.

Speaker IDs (`S1`, `S2`, etc.) are assigned automatically when diarization is enabled, and persist for the lifetime of the session. Send `UpdateSpeakerFocus` at any point during the session to change who is in focus — the new config takes effect immediately and replaces the previous one.

```json
{
  "message": "UpdateSpeakerFocus",
  "speaker_focus": {
    "focus_speakers": ["S1"],
    "ignore_speakers": ["S3"],
    "focus_mode": "retain"
  }
}
```

**Fields:**

- `focus_speakers` — speaker IDs to treat as active. Their segments appear with `is_active: true`.
- `ignore_speakers` — speaker IDs to exclude entirely. Their speech is dropped and does not affect turn detection.
+- `focus_mode` — what happens to speakers who are neither in `focus_speakers` nor `ignore_speakers`: + - `retain` — they remain in the output as passive speakers (`is_active: false`) + - `ignore` — they are excluded from the output entirely ---- +### Speaker ID -## Speaker ID +Speaker ID lets you recognise the same person across separate sessions. At the end of a session, you can retrieve voice identifiers for each speaker and store them. In future sessions, pass those identifiers into `StartRecognition` and the system will tag matching speakers with a consistent label rather than a generic `S1`, `S2`. -Speaker identifiers let you recognise known speakers across sessions. Once you have identifiers for a speaker, you can pass them into future sessions so the system tags them with a consistent label rather than a generic `S1`, `S2`. +#### Getting identifiers -### Getting identifiers — `GetSpeakers` +Send `GetSpeakers` at any point during a session to retrieve identifiers for all diarized speakers so far. The server responds with a `SpeakersResult` message. -Send `GetSpeakers` during a session to request identifiers for all diarized speakers so far: - -```json -{ - "message": "GetSpeakers" -} -``` - -The server responds with a `SpeakersResult` message: +`SpeakersResult` response: ```json { @@ -459,11 +560,11 @@ The server responds with a `SpeakersResult` message: } ``` -Store the `speaker_identifiers` values — these are opaque tokens that represent the speaker's voice profile. +Store the `speaker_identifiers` values. These are opaque tokens tied to a speaker's voice profile — treat them as credentials and store them securely. -### Using identifiers in future sessions +#### Using identifiers in future sessions -Pass stored identifiers into `StartRecognition` via `known_speakers`. You can assign any label you like: +Pass stored identifiers into `StartRecognition` via `transcription_config.known_speakers`. 
You can assign any label:

```json
{
  "message": "StartRecognition",
  "transcription_config": {
    "language": "en",
    "speaker_diarization_config": {
      "known_speakers": [
        { "label": "Alice", "speaker_identifiers": ["<identifier>"] },
        { "label": "Bob", "speaker_identifiers": ["<identifier>"] }
      ]
    }
  }
}
```

When those speakers are detected, their segments will carry `"Alice"` or `"Bob"` as the `speaker_id` instead of generic labels. Any unrecognised speakers are still assigned generic labels (`S1`, `S2`, etc.).

---

## Code Examples

For working code examples in Python and JavaScript, see the [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer).

---

## Feedback

This is a preview, and your feedback shapes what goes to GA (General Availability). We'd love to hear from you: tell us what works well, which features you use, what didn't work as expected, which profiles behaved differently than you anticipated, and which features you'd want before we ship this more broadly.

Specific areas of interest:

- Integration experience (documentation, SDKs, API messages/metadata)
- Accuracy and latency (including data capture where it's relevant, e.g.
phone numbers, spell outs of names/account numbers) - Turn detection and experience with different profiles - Any missing capabilities which would make your product better - What would stop you using this in production From 1dd76a54da7d456f85ee6ee60beff8dbbaa2ee49 Mon Sep 17 00:00:00 2001 From: Lorna Armstrong Date: Thu, 19 Mar 2026 08:03:25 +0000 Subject: [PATCH 2/8] Restructure and Expand Message Coverage --- docs/private/voice-agent-api.mdx | 211 ++++++++++++++++++++----------- 1 file changed, 134 insertions(+), 77 deletions(-) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 08a5e3c..04bc085 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -18,31 +18,21 @@ The Voice Agent API is a WebSocket API for building voice agents. Stream audio i Turn detection runs server-side. Choose a [profile](#profiles) based on your use case and the API handles when to finalise each speaker's turn. -To jump straight into code, see working examples in [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer) for both Python and JavaScript. +**Looking for code examples?** See working examples in [Speechmatics Academy](https://github.com/speechmatics/speechmatics-academy/tree/main/basics/11-voice-api-explorer) for Python and JavaScript. --- ## Profiles -Profiles are pre-configured turn detection modes. Each profile sets the right defaults for your use case — you choose one when connecting, and the server handles the rest. +Profiles are pre-configured turn detection modes. Each profile sets the right defaults for your use case — you choose one when connecting, include it in your endpoint URL, and the server handles the rest. 
| Profile | Turn detection | Best for | |---------|---------------|----------| -| `agile` | VAD-based silence detection | Speed-first use cases | | `adaptive` | Adapts to speaker pace and hesitation | General conversational agents | +| `agile` | VAD-based silence detection | Speed-first use cases | | `smart` | `adaptive` + ML acoustic turn prediction | High-stakes conversations | | `external` | Manual — you trigger turn end | Push-to-talk, custom VAD, LLM-driven | -### `agile` - -**Endpoint:** `/v2/agent/agile` - -Uses voice activity detection (VAD) to detect silence and finalise turns as quickly as possible. The lowest latency profile. - -**Best for:** Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable. - -**Trade-off:** Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls. - ### `adaptive` **Endpoint:** `/v2/agent/adaptive` @@ -53,6 +43,16 @@ Adapts to each speaker's pace over the course of a conversation. It adjusts the **Trade-off:** Latency varies by speaker. Disfluency detection is English-only — other languages fall back to speech-rate adaptation. +### `agile` + +**Endpoint:** `/v2/agent/agile` + +Uses voice activity detection (VAD) to detect silence and finalise turns as quickly as possible. The lowest latency profile. + +**Best for:** Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable. + +**Trade-off:** Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls. 
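If you stay on `agile` for latency, one common mitigation is to debounce finals on the client: hold each finalised turn briefly and merge it with any follow-on final that arrives quickly, before calling the LLM. A sketch — the class and the 0.5 s window are our own choices, not API behaviour:

```python
import time

class TurnDebouncer:
    """Merge finalised turns that arrive in quick succession before the LLM call."""

    def __init__(self, window_s: float = 0.5, clock=time.monotonic):
        self.window_s = window_s   # illustrative value; tune for your agent
        self.clock = clock         # injectable for testing
        self.pending: list[str] = []
        self.last_final = 0.0

    def add_final(self, text: str) -> None:
        """Call on every finalised segment instead of invoking the LLM immediately."""
        self.pending.append(text)
        self.last_final = self.clock()

    def flush_if_quiet(self):
        """Poll this; returns the merged turn once the window passes with no new final."""
        if self.pending and self.clock() - self.last_final >= self.window_s:
            merged = " ".join(self.pending)
            self.pending = []
            return merged
        return None
```

The trade-off moves to the client: you re-introduce up to one window of latency in exchange for fewer split turns and fewer LLM calls.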
+ ### `smart` **Endpoint:** `/v2/agent/smart` @@ -91,17 +91,26 @@ sequenceDiagram loop Audio Stream C->>S: Audio frames (binary) S-->>C: AudioAdded + + S-->>C: SpeechStarted S-->>C: StartOfTurn S-->>C: AddPartialSegment + + opt Turn prediction (adaptive, smart profiles) + S-->>C: EndOfTurnPrediction + S-->>C: SmartTurnPrediction (smart only) + end + S-->>C: AddSegment S-->>C: EndOfTurn + S-->>C: SpeechEnded - opt Optional — speaker events + opt Speaker activity S-->>C: SpeakerStarted / SpeakerEnded S-->>C: SessionMetrics / SpeakerMetrics end - opt Optional — mid-session controls + opt Mid-session controls C->>S: ForceEndOfUtterance (external profile only) C->>S: UpdateSpeakerFocus C->>S: GetSpeakers @@ -113,27 +122,7 @@ sequenceDiagram S-->>C: EndOfTranscript ``` -**Client → Server** - -| Message | When to send | -|---------|-------------| -| [`StartRecognition`](#startrecognition) | First message after connecting. Starts the session and passes configuration. | -| Audio frames | Binary WebSocket frames containing raw PCM audio, sent continuously while audio is available. | -| [`ForceEndOfUtterance`](#forceendofutterance) | `external` profile only. Signals that the current turn is complete. | -| [`UpdateSpeakerFocus`](#updatespeakerfocus) | Any time during the session. Changes which speakers are in focus. | -| [`GetSpeakers`](#getspeakers) | Any time during the session. Requests voice identifiers for enrolled speakers. | -| [`EndOfStream`](#endofstream) | When there is no more audio to send. | - -**Server → Client** - -| Message | When it's emitted | -|---------|------------------| -| [`RecognitionStarted`](#standard-messages) | Session is ready. | -| [`StartOfTurn`](#startofturn) | A speaker begins a new turn. | -| [`AddPartialSegment`](#addpartialsegment) | Interim transcript update. Replaces the previous partial. | -| [`AddSegment`](#addsegment) | Final transcript for the turn. Send this to your LLM. | -| [`EndOfTurn`](#endofturn) | Turn is complete. 
Your agent can now respond. | -| [`EndOfTranscript`](#standard-messages) | All audio processed. Emitted after `EndOfStream`. | +For a full reference of all messages, see [Messages Overview](#messages-overview). --- @@ -173,6 +162,7 @@ Send [`StartRecognition`](#startrecognition) as your first message: } ``` For all configuration options, see [Configuration](#configuration). + The server responds with `RecognitionStarted` when the session is ready. You should wait for this message before sending audio. @@ -227,11 +217,81 @@ Note: The following require `diarization: speaker` to be set. --- +## Messages Overview + +All messages exchanged during a Voice Agent API session. For payload details, see the API Reference sections. + +### Client → Server + +| Message | When to send | +|---------|-------------| +| [`StartRecognition`](#startrecognition) | First message after connecting. Starts the session and passes configuration. | +| Audio frames | Binary WebSocket frames containing raw PCM audio, sent continuously. | +| [`ForceEndOfUtterance`](#forceendofutterance) | `external` profile only. Triggers immediate turn finalisation. | +| [`UpdateSpeakerFocus`](#updatespeakerfocus) | Any time during the session. Changes which speakers are in focus. | +| [`GetSpeakers`](#getspeakers) | Any time during the session. Requests voice identifiers for diarized speakers. | +| [`EndOfStream`](#endofstream) | When there is no more audio to send. 
| + +### Server → Client + +**Core turn events** — the messages your agent logic acts on + +| Message | Profile | When it's emitted | +|---------|---------|------------------| +| [`StartOfTurn`](#startofturn) | All | A speaker begins a new turn | +| [`AddPartialSegment`](#addpartialsegment) | All | Interim transcript update; each replaces the previous | +| [`AddSegment`](#addsegment) | All | Final transcript for the turn — pass this to your LLM | +| [`EndOfTurn`](#endofturn) | All | Turn complete; your agent can now respond | + +**Turn prediction** — early signals you can use to prepare a response + +| Message | Profile | When it's emitted | +|---------|---------|------------------| +| [`EndOfTurnPrediction`](#endofturnprediction) | `adaptive`, `smart` | The model predicts the current turn will end soon | +| [`SmartTurnPrediction`](#smartturnprediction) | `smart` only | High-confidence acoustic prediction of turn completion | + +**Speech and speaker activity** + +| Message | Profile | When it's emitted | +|---------|---------|------------------| +| [`SpeechStarted`](#speechstarted--speechended) | All | Voice activity detected in the audio stream | +| [`SpeechEnded`](#speechstarted--speechended) | All | Voice activity stopped | +| [`SpeakerStarted`](#speakerstarted--speakerended) | All | A specific diarized speaker began talking | +| [`SpeakerEnded`](#speakerstarted--speakerended) | All | A specific diarized speaker stopped talking | +| [`SpeakersResult`](#speakersresult) | All | Response to `GetSpeakers` | + +**Session lifecycle** + +| Message | When it's emitted | +|---------|------------------| +| `RecognitionStarted` | Session ready; emitted in response to `StartRecognition` | +| `AudioAdded` | Audio frame acknowledged | +| `EndOfTranscript` | Session closing; emitted by the proxy after `EndOfStream` | + +**Metrics and diagnostics** + +| Message | When it's emitted | +|---------|------------------| +| [`SessionMetrics`](#sessionmetrics) | Session stats; emitted 
every 5 seconds and at session end |
+| [`SpeakerMetrics`](#speakermetrics) | Per-speaker word count and volume; emitted on each recognised word |
+
+**Shared messages with the RT API** - see the [RT API Reference](/api-ref) for full payload details.
+
+| Message | When it's emitted |
+|---------|------------------|
+| `EndOfUtterance` | Silence threshold reached; precedes turn finalisation |
+| `Info` | Non-critical informational message |
+| `Warning` | Non-fatal issue (e.g. unsupported config field ignored) |
+| `Error` | Fatal error; connection will close |
+
+---
+
 ## API Reference - Client Messages

-### StartRecognition
+#### StartRecognition

-The first message you send after connecting. Starts the recognition session and passes configuration. The server responds with `RecognitionStarted`.
+The first message you send after connecting. Starts the recognition session and passes configuration.
+The server responds with `RecognitionStarted`.

 ```json
 {
@@ -249,7 +309,7 @@ The first message you send after connecting. Starts the recognition session and

 For all configuration options, see [Configuration](#configuration).

-### EndOfStream
+#### EndOfStream

 Send when you have finished streaming audio. The server finalises any remaining transcript and then emits `EndOfTranscript`.

@@ -261,7 +321,7 @@ Send when you have finished streaming audio. The server finalises any remaining
 }
 ```

-### ForceEndOfUtterance
+#### ForceEndOfUtterance

 Only applies to the `external` profile. Immediately ends the current turn — the server finalises all audio received so far and emits a single `AddSegment` containing the complete transcript for that turn, followed by `EndOfTurn`.

@@ -273,7 +333,7 @@ Use this wherever your application decides a turn is complete: on button release
 }
 ```

-### UpdateSpeakerFocus
+#### UpdateSpeakerFocus

 Updates which speakers are in focus, mid-session. Takes effect immediately.
See [Speaker Focus](#speaker-focus) for full details. @@ -288,7 +348,7 @@ Updates which speakers are in focus, mid-session. Takes effect immediately. See } ``` -### GetSpeakers +#### GetSpeakers Requests voice identifiers for all speakers diarized so far in the session. The server responds with a `SpeakersResult` message. See [Speaker ID](#speaker-id) for full details. @@ -302,27 +362,9 @@ Requests voice identifiers for all speakers diarized so far in the session. The ## API Reference - Server Messages -### Standard messages - -The following standard RT messages are emitted alongside Voice Agent API messages. See the [API reference](/api-ref) for full payload details. +This section covers Voice Agent API-specific messages only. For shared messages (`RecognitionStarted`, `AudioAdded`, `AddPartialTranscript`, `AddTranscript`, `EndOfUtterance`, `EndOfTranscript`, `Info`, `Warning`, `Error`), see the [RT API reference](/api-ref). -| Message | When it's emitted | -|---------|------------------| -| `AudioAdded` | | -| `RecognitionStarted` | Session is ready; emitted in response to `StartRecognition` | -| `AddPartialTranscript` | Word-level partial transcript update (lower-level than `AddPartialSegment`) | -| `AddTranscript` | Word-level final transcript (lower-level than `AddSegment`) | -| `EndOfUtterance` | Silence threshold reached; precedes turn finalisation | -| `EndOfTranscript` | All audio processed; emitted after `EndOfStream` | -| `Info` | Non-critical informational message from the server | -| `Warning` | Non-fatal issue (e.g. unsupported config ignored) | -| `Error` | Fatal error; connection will close | - -### Voice Agent API messages - -These messages are only emitted when using a voice agent profile (`/v2/agent/`). - -#### `StartOfTurn` +#### StartOfTurn Emitted when a speaker begins a new turn. Use this to signal to your agent that it should stop speaking if it currently is. @@ -336,7 +378,7 @@ Emitted when a speaker begins a new turn. 
Use this to signal to your agent that **Fields:** - `turn_id` — monotonically increasing integer; pairs with the corresponding `EndOfTurn` -#### `EndOfTurn` +#### EndOfTurn Emitted when turn detection decides the speaker has finished. This is the trigger for your agent to respond. The finalised transcript for the turn is in the preceding `AddSegment`. @@ -355,7 +397,7 @@ Emitted when turn detection decides the speaker has finished. This is the trigge - `turn_id` — matches the `StartOfTurn` for this turn - `metadata.start_time` / `metadata.end_time` — audio time range for the turn, in seconds from session start -#### `AddPartialSegment` +#### AddPartialSegment Interim transcript update emitted continuously while the speaker is talking. Each new `AddPartialSegment` replaces the previous one — do not concatenate them. @@ -384,7 +426,7 @@ Interim transcript update emitted continuously while the speaker is talking. Eac } ``` -#### `AddSegment` +#### AddSegment The final, complete transcript for a turn. Emitted just before `EndOfTurn`. This is the stable output to pass to your LLM — do not use `AddPartialSegment` for this. @@ -425,7 +467,7 @@ In multi-speaker scenarios, a single `AddSegment` may contain segments from mult **Message-level fields:** - `metadata.processing_time` — transcription latency in seconds for this message -#### `SpeakerStarted` / `SpeakerEnded` +#### SpeakerStarted / SpeakerEnded Emitted when a specific speaker starts or stops being heard. These are voice activity events — they fire based on detected speech, independently of turn boundaries. @@ -456,7 +498,7 @@ Emitted when a specific speaker starts or stops being heard. These are voice act - `metadata.start_time` — when this speaker started their current speaking interval - `metadata.end_time` — when this speaker stopped speaking (`SpeakerEnded` only) -#### `SessionMetrics` +#### SessionMetrics Emitted every 5 seconds and once at the end of the session. 
@@ -470,7 +512,7 @@ Emitted every 5 seconds and once at the end of the session. } ``` -#### `SpeakerMetrics` +#### SpeakerMetrics Emitted each time a speaker produces a recognised word. @@ -490,7 +532,7 @@ Emitted each time a speaker produces a recognised word. #### SpeakersResult -Emitted as a response to a `GetSpeakers` message. +Emitted in response to `GetSpeakers`. Contains voice identifiers for all diarized speakers so far. See [Speaker ID](#speaker-id) for how to store and use these. ```json { @@ -502,18 +544,33 @@ Emitted as a response to a `GetSpeakers` message. } ``` +#### EndOfTurnPrediction ---- +Emitted by `adaptive` and `smart` profiles when the model predicts the current turn is about to end. Can be used to begin preparing a response before `EndOfTurn` arrives, reducing perceived latency. -## Features +:::note +todo - payload details. +::: + +#### SmartTurnPrediction + +Emitted by the `smart` profile only. A higher-confidence acoustic prediction of turn completion, based on the ML model that analyses vocal cues. + +:::note +todo - payload details. +::: + +#### SpeechStarted / SpeechEnded -The Voice Agent API introduces key features built with voice agents in mind. These include: -### **Speaker Focus** -This lets you control which speakers' output your agent acts on. By default, all detected speakers are active and and their transcripts are included in `AddSegment` output. - - Speaker IDs (`S1`, `S2`, etc.) are assigned automatically when diarization is enabled and persist for the lifetime of the session. - Send `UpdateSpeakerFocus` at any point during the session to change who is in focus - the new config takes place immediately and replaces the previous one. +Voice activity detection events. Emitted when speech is first detected in the audio stream (`SpeechStarted`) or stops (`SpeechEnded`). These fire independently of speaker identity and turn boundaries. +:::note +todo - payload details. 
+::: + +--- + +## Features ### Speaker Focus From 29ad2f5fa0a5482e1bb55df865d86ba192757107 Mon Sep 17 00:00:00 2001 From: Archie McMullan Date: Tue, 24 Mar 2026 15:32:01 +0000 Subject: [PATCH 3/8] Add audio format warning and example to Voice Agent API docs Clarifies that only pcm_s16le at 8000/16000 Hz is supported. Other formats may be silently accepted but will not produce correct output. --- docs/private/voice-agent-api.mdx | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 04bc085..a68df4e 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -178,12 +178,18 @@ Configuration is passed in [`StartRecognition`](#startrecognition) and is split **`audio_format`** +:::warning +Only `pcm_s16le` at `8000` or `16000` Hz is supported. Other encodings (e.g. `pcm_f32le`, `mulaw`) and sample rates (e.g. `44100`) may be silently accepted by the API but will not produce correct output. 
+::: + | Field | Notes | |-------|-------| | `type` | Must be `raw` | | `encoding` | Must be `pcm_s16le` (16-bit signed little-endian PCM) | | `sample_rate` | Must be `8000` or `16000` | +Example: `{"type":"raw","encoding":"pcm_s16le","sample_rate":16000}` + **`transcription_config`** | Field | Default | Notes | From 85e3cf42bcbec5c4a64b35da7c918cc53a325de1 Mon Sep 17 00:00:00 2001 From: Archie McMullan Date: Tue, 24 Mar 2026 16:29:23 +0000 Subject: [PATCH 4/8] Fill in todo payloads and fix SmartTurnResult message name - Add real payloads and field descriptions for EndOfTurnPrediction, SmartTurnResult, SpeechStarted and SpeechEnded - Rename SmartTurnPrediction to SmartTurnResult throughout (mermaid diagram, messages table, section header) - Add audio format warning and example to audio_format section --- docs/private/voice-agent-api.mdx | 86 +++++++++++++++++++++++++++----- 1 file changed, 74 insertions(+), 12 deletions(-) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index a68df4e..56c0824 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -98,7 +98,7 @@ sequenceDiagram opt Turn prediction (adaptive, smart profiles) S-->>C: EndOfTurnPrediction - S-->>C: SmartTurnPrediction (smart only) + S-->>C: SmartTurnResult (smart only) end S-->>C: AddSegment @@ -254,7 +254,7 @@ All messages exchanged during a Voice Agent API session. For payload details, se | Message | Profile | When it's emitted | |---------|---------|------------------| | [`EndOfTurnPrediction`](#endofturnprediction) | `adaptive`, `smart` | The model predicts the current turn will end soon | -| [`SmartTurnPrediction`](#smartturnprediction) | `smart` only | High-confidence acoustic prediction of turn completion | +| [`SmartTurnResult`](#smartturnresult) | `smart` only | High-confidence acoustic prediction of turn completion | **Speech and speaker activity** @@ -554,25 +554,87 @@ Emitted in response to `GetSpeakers`. 
Contains voice identifiers for all diarize Emitted by `adaptive` and `smart` profiles when the model predicts the current turn is about to end. Can be used to begin preparing a response before `EndOfTurn` arrives, reducing perceived latency. -:::note -todo - payload details. -::: +```json +{ + "message": "EndOfTurnPrediction", + "turn_id": 2, + "predicted_wait": 0.73, + "metadata": { + "ttl": 0.73, + "reasons": ["not__ends_with_eos"] + } +} +``` + +**Fields:** +- `turn_id` — the turn this prediction applies to +- `predicted_wait` — estimated seconds until the turn ends +- `metadata.ttl` — time to live; how long this prediction remains valid +- `metadata.reasons` — internal signals that contributed to the prediction -#### SmartTurnPrediction +#### SmartTurnResult Emitted by the `smart` profile only. A higher-confidence acoustic prediction of turn completion, based on the ML model that analyses vocal cues. -:::note -todo - payload details. -::: +```json +{ + "message": "SmartTurnResult", + "prediction": { + "prediction": true, + "probability": 0.979, + "processing_time": 0.128 + }, + "metadata": { + "start_time": 0.0, + "end_time": 2.2, + "language": "en", + "speaker_id": "S1", + "total_time": 2.2 + } +} +``` + +**Fields:** +- `prediction.prediction` — `true` if the model predicts the turn is complete +- `prediction.probability` — confidence score (0–1) +- `prediction.processing_time` — time taken by the ML model in seconds +- `metadata.start_time` / `metadata.end_time` — audio window analysed +- `metadata.total_time` — total session time at point of prediction +- `metadata.speaker_id` — speaker being analysed (`null` if not yet identified) #### SpeechStarted / SpeechEnded Voice activity detection events. Emitted when speech is first detected in the audio stream (`SpeechStarted`) or stops (`SpeechEnded`). These fire independently of speaker identity and turn boundaries. -:::note -todo - payload details. 
-::: +```json +{ + "message": "SpeechStarted", + "probability": 0.508, + "transition_duration_ms": 192.0, + "metadata": { + "start_time": 2.1, + "end_time": 2.1 + } +} +``` + +```json +{ + "message": "SpeechEnded", + "probability": 0.307, + "transition_duration_ms": 192.0, + "metadata": { + "start_time": 0.4, + "end_time": 2.5 + } +} +``` + +**Fields:** +- `probability` — VAD confidence score (0–1) +- `transition_duration_ms` — duration of the speech/silence transition in milliseconds +- `metadata.start_time` — when speech began (`SpeechStarted`: same as `end_time`; `SpeechEnded`: when the speaking interval started) +- `metadata.end_time` — when the event was detected --- From b064ebc462b5ea09ad7236c2444b08fec2d775db Mon Sep 17 00:00:00 2001 From: Archie McMullan Date: Tue, 24 Mar 2026 17:09:11 +0000 Subject: [PATCH 5/8] =?UTF-8?q?Add=20preview=20warning=20to=20SmartTurnRes?= =?UTF-8?q?ult=20=E2=80=94=20will=20be=20renamed=20at=20GA?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/private/voice-agent-api.mdx | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 56c0824..ec3e8b8 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -574,6 +574,10 @@ Emitted by `adaptive` and `smart` profiles when the model predicts the current t #### SmartTurnResult +:::warning +This message is currently emitted as `SmartTurnResult` during preview. It will be renamed to `SmartTurnPrediction` at GA. +::: + Emitted by the `smart` profile only. A higher-confidence acoustic prediction of turn completion, based on the ML model that analyses vocal cues. 
```json From 99a674b604d1898f8c76a514b9988b696da4cb1d Mon Sep 17 00:00:00 2001 From: Archie McMullan Date: Wed, 25 Mar 2026 11:10:57 +0000 Subject: [PATCH 6/8] Remove duplicate SpeakersResult payload from Speaker ID feature section The payload is already documented in the SpeakersResult server message section. Replace with links to the API reference entries for GetSpeakers and SpeakersResult. --- docs/private/voice-agent-api.mdx | 16 ++-------------- 1 file changed, 2 insertions(+), 14 deletions(-) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index ec3e8b8..38c1e24 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -675,21 +675,9 @@ Speaker ID lets you recognise the same person across separate sessions. At the e #### Getting identifiers -Send `GetSpeakers` at any point during a session to retrieve identifiers for all diarized speakers so far. The server responds with a `SpeakersResult` message. +Send [`GetSpeakers`](#getspeakers) at any point during a session to retrieve identifiers for all diarized speakers so far. The server responds with a [`SpeakersResult`](#speakersresult) message. -`SpeakersResult` response: - -```json -{ - "message": "SpeakersResult", - "speakers": [ - { "label": "S1", "speaker_identifiers": [""] }, - { "label": "S2", "speaker_identifiers": [""] } - ] -} -``` - -Store the `speaker_identifiers` values. These are opaque tokens tied to a speaker's voice profile — treat them as credentials and store them securely. +Store the `speaker_identifiers` values from the response. These are opaque tokens tied to a speaker's voice profile — treat them as credentials and store them securely. 
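As a sketch, the `SpeakersResult` payload can be flattened into a label-to-identifiers map before handing it to your secure store (the storage layer itself is out of scope here and the payload shape is taken from the `SpeakersResult` reference above):

```python
def extract_speaker_identifiers(speakers_result):
    """Flatten a SpeakersResult payload into {label: [identifiers]}.
    Treat the returned values as credentials: encrypt them at rest
    and never write them to logs."""
    return {
        entry["label"]: entry["speaker_identifiers"]
        for entry in speakers_result.get("speakers", [])
    }
```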
#### Using identifiers in future sessions From 0573df8eeb041192ece3552e7a180be5388bc4a6 Mon Sep 17 00:00:00 2001 From: Archie McMullan Date: Wed, 25 Mar 2026 14:38:39 +0000 Subject: [PATCH 7/8] =?UTF-8?q?Fix=20session=20flow=20diagram=20=E2=80=94?= =?UTF-8?q?=20correct=20message=20ordering=20and=20remove=20opt=20blocks?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix ordering: SpeakerStarted after StartOfTurn, SpeechEnded/EndOfUtterance/ SpeakerEnded before AddSegment, EndOfTurn last - Add missing EndOfUtterance - Remove incorrect Speaker activity opt block — speaker events now in correct position in the main flow - Add profile annotations inline (adaptive/smart, smart only, external only) - Move ForceEndOfUtterance into mid-session controls opt block - Add SessionMetrics note below diagram --- docs/private/voice-agent-api.mdx | 26 +++++++++++--------------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 38c1e24..78f9c98 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -91,27 +91,21 @@ sequenceDiagram loop Audio Stream C->>S: Audio frames (binary) S-->>C: AudioAdded - S-->>C: SpeechStarted S-->>C: StartOfTurn - S-->>C: AddPartialSegment - - opt Turn prediction (adaptive, smart profiles) - S-->>C: EndOfTurnPrediction - S-->>C: SmartTurnResult (smart only) - end - + S-->>C: SpeakerStarted + S-->>C: AddPartialSegment (repeating) + S-->>C: SpeakerMetrics (repeating) + S-->>C: EndOfTurnPrediction (adaptive, smart) + S-->>C: SmartTurnResult (smart only) + S-->>C: SpeechEnded + S-->>C: EndOfUtterance + S-->>C: SpeakerEnded S-->>C: AddSegment S-->>C: EndOfTurn - S-->>C: SpeechEnded - - opt Speaker activity - S-->>C: SpeakerStarted / SpeakerEnded - S-->>C: SessionMetrics / SpeakerMetrics - end opt Mid-session controls - C->>S: ForceEndOfUtterance (external profile only) + C->>S: ForceEndOfUtterance 
(external only) C->>S: UpdateSpeakerFocus C->>S: GetSpeakers S-->>C: SpeakersResult @@ -122,6 +116,8 @@ sequenceDiagram S-->>C: EndOfTranscript ``` +`SessionMetrics` is emitted every 5 seconds independently of turn boundaries. + For a full reference of all messages, see [Messages Overview](#messages-overview). --- From c39297ef3b26ee8373cdb2b64f21dab3f1fc8248 Mon Sep 17 00:00:00 2001 From: Archie McMullan Date: Wed, 25 Mar 2026 14:45:15 +0000 Subject: [PATCH 8/8] Add Languages field to each profile section Move language info out of Trade-off text into a dedicated Languages field, consistent with Best for / Trade-off pattern. All profiles now explicitly state language support. --- docs/private/voice-agent-api.mdx | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/docs/private/voice-agent-api.mdx b/docs/private/voice-agent-api.mdx index 78f9c98..0ec355a 100644 --- a/docs/private/voice-agent-api.mdx +++ b/docs/private/voice-agent-api.mdx @@ -41,7 +41,9 @@ Adapts to each speaker's pace over the course of a conversation. It adjusts the **Best for:** General conversational voice agents. -**Trade-off:** Latency varies by speaker. Disfluency detection is English-only — other languages fall back to speech-rate adaptation. +**Languages:** All supported languages. Disfluency detection is English-only — other languages fall back to speech-rate adaptation. + +**Trade-off:** Latency varies by speaker. ### `agile` @@ -51,6 +53,8 @@ Uses voice activity detection (VAD) to detect silence and finalise turns as quic **Best for:** Use cases where response speed is the top priority and occasional mid-speech finalisations are acceptable. +**Languages:** All supported languages. + **Trade-off:** Because it relies on silence, it may finalise a turn while the speaker is still mid-sentence — for example, during a natural pause. This can result in additional downstream LLM calls. 
### `smart` @@ -61,7 +65,9 @@ Builds on `adaptive` with an additional ML model that analyses acoustic cues to **Best for:** High-stakes conversations where cutting off the user is costly — finance, healthcare, legal. -**Trade-off:** Higher latency than `adaptive`. Supported languages: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese. +**Languages:** Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese. + +**Trade-off:** Higher latency than `adaptive`. ### `external` @@ -71,6 +77,8 @@ Turn detection is fully manual. The server accumulates audio and transcript unti **Best for:** Push-to-talk interfaces, custom VAD pipelines, or setups where an LLM decides when to respond. +**Languages:** All supported languages. + **Trade-off:** You are responsible for all turn detection logic. ---