3 changes: 3 additions & 0 deletions README.md
@@ -144,6 +144,9 @@ A detailed setup guide for Windows, macOS, and Linux can be found in the Agent Z
- The Web UI output is very clean, fluid, colorful, readable, and interactive; nothing is hidden.
- You can load or save chats directly within the Web UI.
- The same output you see in the terminal is automatically saved to an HTML file in **logs/** folder for every session.
+ - Voice is provided by the built-in `_kokoro_tts` and `_whisper_stt` plugins.
+ - Docker/bootstrap remains responsible for installing Kokoro, Whisper, `ffmpeg`, and related speech dependencies.
+ - If `_kokoro_tts` is disabled, spoken output falls back to the browser's native speech synthesis.
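The dependency bullet above can be sanity-checked at runtime. A minimal sketch — the module names checked here are assumptions for illustration, not Agent Zero's actual bootstrap code:

```python
import importlib.util
import shutil

def speech_deps_status() -> dict[str, bool]:
    """Report which speech dependencies the Docker/bootstrap layer has installed."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,  # CLI binary on PATH
        "whisper": importlib.util.find_spec("whisper") is not None,
        "soundfile": importlib.util.find_spec("soundfile") is not None,
    }

missing = [name for name, ok in speech_deps_status().items() if not ok]
print("missing speech deps:", missing or "none")
```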

![Time example](/docs/res/time_example.jpg)

96 changes: 0 additions & 96 deletions api/synthesize.py

This file was deleted.

18 changes: 0 additions & 18 deletions api/transcribe.py

This file was deleted.

31 changes: 19 additions & 12 deletions docs/guides/usage.md
@@ -748,20 +748,27 @@ If you encounter issues with the tunnel feature:
> Combine tunneling with authentication for secure remote access to your Agent Zero instance from any device, including mobile phones and tablets.

## Voice Interface
- Agent Zero provides both Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities for natural voice interaction:
+ Agent Zero provides both Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities for natural voice interaction through built-in plugins:

+ - `_kokoro_tts` handles server-side Kokoro speech synthesis when enabled
+ - `_whisper_stt` handles server-side Whisper transcription and injects the microphone UI when enabled
+ - Browser-native `speechSynthesis` remains the fallback output path when `_kokoro_tts` is disabled
+
+ Use the Agent Plugins section in Settings to enable or disable either plugin independently.

### Text-to-Speech
Enable voice responses from agents:
* Toggle the "Speech" switch in the Preferences section of the sidebar
- * Agents will use your system's built-in voice synthesizer to speak their messages
+ * If `_kokoro_tts` is enabled, agents will use Kokoro for spoken output
+ * If `_kokoro_tts` is disabled, agents will use your browser's built-in voice synthesizer
* Click the "Stop Speech" button above the input area to immediately stop any ongoing speech
* You can also click the speech button when hovering over messages to speak individual messages or their parts
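The fallback behavior in the bullets above amounts to a simple dispatch on the plugin flag. A hedged sketch — the backend names are illustrative, not the real plugin API:

```python
def pick_tts_backend(kokoro_enabled: bool) -> str:
    """Return which speech path handles output, mirroring the TTS bullets above."""
    # Kokoro runs server-side when its plugin is enabled;
    # otherwise the browser's native speechSynthesis takes over.
    return "kokoro" if kokoro_enabled else "browser_speech_synthesis"

print(pick_tts_backend(True))   # kokoro
print(pick_tts_backend(False))  # browser_speech_synthesis
```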

![TTS Stop Speech](../res/usage/ui-tts-stop-speech1.png)

- The interface allows users to stop speech at any time if a response is too lengthy or if they wish to intervene during the conversation.

- The TTS uses a standard voice interface provided by modern browsers, which may sound robotic but is effective and does not require complex AI models. This ensures low latency and quick responses across various platforms, including mobile devices.
+ Kokoro gives you a local container-side TTS path when the plugin is enabled. When it is disabled, Agent Zero falls back to the browser voice stack, which is lower-friction and works well across devices.


> [!TIP]
@@ -771,19 +778,20 @@ The TTS uses a standard voice interface provided by modern browsers, which may s
> - Creating a more interactive experience

### Speech-to-Text
- Send voice messages to agents using OpenAI's Whisper model (does not require OpenAI API key!):
+ Send voice messages to agents using Whisper (does not require an OpenAI API key):

1. Click the microphone button in the input area to start recording
+    - The microphone button only appears when `_whisper_stt` is enabled
2. The button color indicates the current status:
- Grey: Inactive
-    - Red: Listening
-    - Green: Recording
-    - Teal: Waiting
-    - Cyan (pulsing): Processing
+    - Teal: Listening
+    - Red: Recording
+    - Amber: Waiting
+    - Purple: Processing or activating
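The new status colors map naturally onto a small state enum. An illustrative sketch — the state names are ours, not the UI's internal identifiers:

```python
from enum import Enum

class MicState(Enum):
    """Microphone button states and the colors the Web UI shows for them."""
    INACTIVE = "grey"
    LISTENING = "teal"
    RECORDING = "red"
    WAITING = "amber"
    PROCESSING = "purple"  # also shown while the plugin is activating

def button_color(state: MicState) -> str:
    return state.value

print(button_color(MicState.RECORDING))  # red
```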

You can adjust settings such as the silence threshold and message duration before sending to fine-tune voice input.

- Configure STT settings in the Settings page:
+ Configure Whisper STT from the plugin settings screen in the Voice section or from Agent Plugins:
* **Model Size:** Choose Base (74M, English) or another model size
- Note: Only Large and Turbo models support multiple languages
* **Language Code:** Set your preferred language (e.g., 'en', 'fr', 'it', 'cs')
@@ -795,9 +803,8 @@ Configure STT settings in the Settings page:
![Speech to Text Settings](../res/usage/ui-settings-5-speech-to-text.png)

> [!IMPORTANT]
- > All STT and TTS functionalities operate locally within the Docker container,
- > ensuring that no data is transmitted to external servers or OpenAI APIs. This
- > enhances user privacy while maintaining functionality.
+ > Whisper STT and Kokoro TTS operate locally within the Docker/container runtime when their plugins are enabled.
+ > Browser fallback TTS runs locally in the browser. No voice path requires OpenAI APIs.

## Mathematical Expressions
* **Complex Mathematics:** Supports full KaTeX syntax for:
Expand Down
10 changes: 6 additions & 4 deletions docs/setup/installation.md
@@ -352,11 +352,13 @@ Use `claude-sonnet-4-5` for Anthropic, but use `anthropic/claude-sonnet-4-5` for
> [!NOTE]
> Agent Zero uses a local embedding model by default (runs on CPU), but you can switch to OpenAI embeddings like `text-embedding-3-small` or `text-embedding-3-large` if preferred.

- ### Speech to Text Options
+ ### Built-in Voice Plugins

- - **Model Size:** Choose the speech recognition model size
- - **Language Code:** Set the primary language for voice recognition
- - **Silence Settings:** Configure silence threshold, duration, and timeout parameters for voice input
+ - Agent Zero ships Whisper STT as the built-in `_whisper_stt` plugin and Kokoro TTS as the built-in `_kokoro_tts` plugin.
+ - Docker/bootstrap remains responsible for installing the required speech dependencies such as `ffmpeg`, Kokoro, Whisper, and `soundfile`.
+ - Both plugins can be enabled or disabled independently from the Agent Plugins section in the Web UI.
+ - Whisper model size, language, and silence behavior are configured from the plugin settings screen.
+ - If `_kokoro_tts` is disabled, spoken output falls back to the browser's native speech synthesis instead of the container runtime.
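Plugin settings of this shape are naturally modeled as a defaults dict with user overrides. The field names below are illustrative, not the plugin's actual schema:

```python
# Hypothetical defaults; real field names may differ in the plugin settings screen.
DEFAULT_WHISPER_SETTINGS = {
    "model_size": "base",       # e.g. "base", "large", "turbo"
    "language": "en",           # ISO 639-1 code
    "silence_threshold": 0.30,  # hypothetical tuning fields
    "silence_duration": 1.0,
}

def effective_settings(user_overrides: dict) -> dict:
    """Overlay user-configured values on the defaults."""
    merged = dict(DEFAULT_WHISPER_SETTINGS)
    merged.update(user_overrides)
    return merged

print(effective_settings({"model_size": "turbo"})["model_size"])  # turbo
```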

### API Keys

Expand Down
127 changes: 0 additions & 127 deletions helpers/kokoro_tts.py

This file was deleted.

27 changes: 25 additions & 2 deletions helpers/mcp_handler.py
Expand Up @@ -42,6 +42,7 @@
from helpers import dirty_json
from helpers.print_style import PrintStyle
from helpers.tool import Tool, Response
+ from helpers.extension import call_extensions_async


def normalize_name(name: str) -> str:
@@ -1105,10 +1106,21 @@ async def _create_stdio_transport(
# Check if this is a streaming HTTP type
if _is_streaming_http_type(server.type):
# Use streamable HTTP client
+ # Before passing headers to httpx, allow extensions to resolve placeholders
+ resolved_headers = await call_extensions_async(
+     "resolve_mcp_server_headers",
+     agent=None,
+     server_name=server.name,
+     headers=dict(server.headers or {}),
+ )
+ if resolved_headers is not None:
+     headers_to_use = resolved_headers
+ else:
+     headers_to_use = server.headers
transport_result = await current_exit_stack.enter_async_context(
streamablehttp_client(
url=server.url,
- headers=server.headers,
+ headers=headers_to_use,
timeout=timedelta(seconds=init_timeout),
sse_read_timeout=timedelta(seconds=tool_timeout),
httpx_client_factory=client_factory,
@@ -1123,10 +1135,21 @@ return read_stream, write_stream
return read_stream, write_stream
else:
# Use traditional SSE client (default behavior)
+ # Before passing headers to httpx, allow extensions to resolve placeholders
+ resolved_headers = await call_extensions_async(
+     "resolve_mcp_server_headers",
+     agent=None,
+     server_name=server.name,
+     headers=dict(server.headers or {}),
+ )
+ if resolved_headers is not None:
+     headers_to_use = resolved_headers
+ else:
+     headers_to_use = server.headers
stdio_transport = await current_exit_stack.enter_async_context(
sse_client(
url=server.url,
- headers=server.headers,
+ headers=headers_to_use,
timeout=init_timeout,
sse_read_timeout=tool_timeout,
httpx_client_factory=client_factory,
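The new `resolve_mcp_server_headers` extension point lets an extension rewrite header values before httpx ever sees them. A sketch of one possible extension — the `{{ENV:VAR}}` placeholder syntax is an assumption for illustration, not part of this PR:

```python
import os
import re

# Hypothetical placeholder syntax: {{ENV:SOME_VAR}} inside a header value.
_PLACEHOLDER = re.compile(r"\{\{ENV:([A-Za-z_][A-Za-z0-9_]*)\}\}")

async def resolve_mcp_server_headers(agent, server_name: str, headers: dict) -> dict:
    """Replace {{ENV:VAR}} placeholders in header values with environment variables.

    Unknown variables are left untouched so the failure stays visible downstream.
    """
    return {
        key: _PLACEHOLDER.sub(
            lambda m: os.environ.get(m.group(1), m.group(0)), str(value)
        )
        for key, value in headers.items()
    }
```

An extension like this keeps secrets out of the stored MCP server config while the connection code above stays unchanged.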