3 changes: 3 additions & 0 deletions README.md
@@ -144,6 +144,9 @@ A detailed setup guide for Windows, macOS, and Linux can be found in the Agent Z
- The Web UI output is very clean, fluid, colorful, readable, and interactive; nothing is hidden.
- You can load or save chats directly within the Web UI.
- The same output you see in the terminal is automatically saved to an HTML file in **logs/** folder for every session.
+ - Voice is provided by the built-in `_kokoro_tts` and `_whisper_stt` plugins.
+ - Docker/bootstrap remains responsible for installing Kokoro, Whisper, `ffmpeg`, and related speech dependencies.
+ - If `_kokoro_tts` is disabled, spoken output falls back to the browser's native speech synthesis.
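The dependency bullet above can be sanity-checked at runtime. A minimal sketch — the module names checked here are assumptions for illustration, not Agent Zero's actual bootstrap code:

```python
import importlib.util
import shutil

def speech_deps_status() -> dict[str, bool]:
    """Report which speech dependencies the Docker/bootstrap layer has installed."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,  # CLI binary on PATH
        "whisper": importlib.util.find_spec("whisper") is not None,
        "soundfile": importlib.util.find_spec("soundfile") is not None,
    }

missing = [name for name, ok in speech_deps_status().items() if not ok]
print("missing speech deps:", missing or "none")
```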

![Time example](/docs/res/time_example.jpg)

96 changes: 0 additions & 96 deletions api/synthesize.py

This file was deleted.

18 changes: 0 additions & 18 deletions api/transcribe.py

This file was deleted.

31 changes: 19 additions & 12 deletions docs/guides/usage.md
@@ -748,20 +748,27 @@ If you encounter issues with the tunnel feature:
> Combine tunneling with authentication for secure remote access to your Agent Zero instance from any device, including mobile phones and tablets.

## Voice Interface
- Agent Zero provides both Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities for natural voice interaction:
+ Agent Zero provides both Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities for natural voice interaction through built-in plugins:

+ - `_kokoro_tts` handles server-side Kokoro speech synthesis when enabled
+ - `_whisper_stt` handles server-side Whisper transcription and injects the microphone UI when enabled
+ - Browser-native `speechSynthesis` remains the fallback output path when `_kokoro_tts` is disabled
+
+ Use the Agent Plugins section in Settings to enable or disable either plugin independently.

### Text-to-Speech
Enable voice responses from agents:
* Toggle the "Speech" switch in the Preferences section of the sidebar
- * Agents will use your system's built-in voice synthesizer to speak their messages
+ * If `_kokoro_tts` is enabled, agents will use Kokoro for spoken output
+ * If `_kokoro_tts` is disabled, agents will use your browser's built-in voice synthesizer
* Click the "Stop Speech" button above the input area to immediately stop any ongoing speech
* You can also click the speech button when hovering over messages to speak individual messages or their parts
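The fallback behavior in the bullets above amounts to a simple dispatch on the plugin flag. A hedged sketch — the backend names are illustrative, not the real plugin API:

```python
def pick_tts_backend(kokoro_enabled: bool) -> str:
    """Return which speech path handles output, mirroring the TTS bullets above."""
    # Kokoro runs server-side when its plugin is enabled;
    # otherwise the browser's native speechSynthesis takes over.
    return "kokoro" if kokoro_enabled else "browser_speech_synthesis"

print(pick_tts_backend(True))   # kokoro
print(pick_tts_backend(False))  # browser_speech_synthesis
```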

![TTS Stop Speech](../res/usage/ui-tts-stop-speech1.png)

- The interface allows users to stop speech at any time if a response is too lengthy or if they wish to intervene during the conversation.

- The TTS uses a standard voice interface provided by modern browsers, which may sound robotic but is effective and does not require complex AI models. This ensures low latency and quick responses across various platforms, including mobile devices.
+ Kokoro gives you a local container-side TTS path when the plugin is enabled. When it is disabled, Agent Zero falls back to the browser voice stack, which is lower-friction and works well across devices.


> [!TIP]
@@ -771,19 +778,20 @@ The TTS uses a standard voice interface provided by modern browsers, which may s
> - Creating a more interactive experience

### Speech-to-Text
- Send voice messages to agents using OpenAI's Whisper model (does not require OpenAI API key!):
+ Send voice messages to agents using Whisper (does not require an OpenAI API key):

1. Click the microphone button in the input area to start recording
+    - The microphone button only appears when `_whisper_stt` is enabled
2. The button color indicates the current status:
- Grey: Inactive
-    - Red: Listening
-    - Green: Recording
-    - Teal: Waiting
-    - Cyan (pulsing): Processing
+    - Teal: Listening
+    - Red: Recording
+    - Amber: Waiting
+    - Purple: Processing or activating
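The new status colors map naturally onto a small state enum. An illustrative sketch — the state names are ours, not the UI's internal identifiers:

```python
from enum import Enum

class MicState(Enum):
    """Microphone button states and the colors the Web UI shows for them."""
    INACTIVE = "grey"
    LISTENING = "teal"
    RECORDING = "red"
    WAITING = "amber"
    PROCESSING = "purple"  # also shown while the plugin is activating

def button_color(state: MicState) -> str:
    return state.value

print(button_color(MicState.RECORDING))  # red
```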

You can adjust settings such as the silence threshold and message duration before sending to fine-tune voice input.

- Configure STT settings in the Settings page:
+ Configure Whisper STT from the plugin settings screen in the Voice section or from Agent Plugins:
* **Model Size:** Choose Base (74M, English) or another model size
- Note: Only Large and Turbo models support multiple languages
* **Language Code:** Set your preferred language (e.g., 'en', 'fr', 'it', 'cs')
@@ -795,9 +803,8 @@ Configure STT settings in the Settings page:
![Speech to Text Settings](../res/usage/ui-settings-5-speech-to-text.png)

> [!IMPORTANT]
- > All STT and TTS functionalities operate locally within the Docker container,
- > ensuring that no data is transmitted to external servers or OpenAI APIs. This
- > enhances user privacy while maintaining functionality.
+ > Whisper STT and Kokoro TTS operate locally within the Docker/container runtime when their plugins are enabled.
+ > Browser fallback TTS runs locally in the browser. No voice path requires OpenAI APIs.

## Mathematical Expressions
* **Complex Mathematics:** Supports full KaTeX syntax for:
Expand Down
10 changes: 6 additions & 4 deletions docs/setup/installation.md
@@ -352,11 +352,13 @@ Use `claude-sonnet-4-5` for Anthropic, but use `anthropic/claude-sonnet-4-5` for
> [!NOTE]
> Agent Zero uses a local embedding model by default (runs on CPU), but you can switch to OpenAI embeddings like `text-embedding-3-small` or `text-embedding-3-large` if preferred.

- ### Speech to Text Options
+ ### Built-in Voice Plugins

- - **Model Size:** Choose the speech recognition model size
- - **Language Code:** Set the primary language for voice recognition
- - **Silence Settings:** Configure silence threshold, duration, and timeout parameters for voice input
+ - Agent Zero ships Whisper STT as the built-in `_whisper_stt` plugin and Kokoro TTS as the built-in `_kokoro_tts` plugin.
+ - Docker/bootstrap remains responsible for installing the required speech dependencies such as `ffmpeg`, Kokoro, Whisper, and `soundfile`.
+ - Both plugins can be enabled or disabled independently from the Agent Plugins section in the Web UI.
+ - Whisper model size, language, and silence behavior are configured from the plugin settings screen.
+ - If `_kokoro_tts` is disabled, spoken output falls back to the browser's native speech synthesis instead of the container runtime.
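Plugin settings of this shape are naturally modeled as a defaults dict with user overrides. The field names below are illustrative, not the plugin's actual schema:

```python
# Hypothetical defaults; real field names may differ in the plugin settings screen.
DEFAULT_WHISPER_SETTINGS = {
    "model_size": "base",       # e.g. "base", "large", "turbo"
    "language": "en",           # ISO 639-1 code
    "silence_threshold": 0.30,  # hypothetical tuning fields
    "silence_duration": 1.0,
}

def effective_settings(user_overrides: dict) -> dict:
    """Overlay user-configured values on the defaults."""
    merged = dict(DEFAULT_WHISPER_SETTINGS)
    merged.update(user_overrides)
    return merged

print(effective_settings({"model_size": "turbo"})["model_size"])  # turbo
```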

### API Keys

Expand Down
127 changes: 0 additions & 127 deletions helpers/kokoro_tts.py

This file was deleted.

27 changes: 25 additions & 2 deletions helpers/mcp_handler.py
Expand Up @@ -42,6 +42,7 @@
from helpers import dirty_json
from helpers.print_style import PrintStyle
from helpers.tool import Tool, Response
+ from helpers.extension import call_extensions_async


def normalize_name(name: str) -> str:
@@ -1105,10 +1106,21 @@ async def _create_stdio_transport(
# Check if this is a streaming HTTP type
if _is_streaming_http_type(server.type):
# Use streamable HTTP client
+ # Before passing headers to httpx, allow extensions to resolve placeholders
+ resolved_headers = await call_extensions_async(
+     "resolve_mcp_server_headers",
+     agent=None,
+     server_name=server.name,
+     headers=dict(server.headers or {}),
+ )
+ if resolved_headers is not None:
+     headers_to_use = resolved_headers
+ else:
+     headers_to_use = server.headers
transport_result = await current_exit_stack.enter_async_context(
streamablehttp_client(
url=server.url,
- headers=server.headers,
+ headers=headers_to_use,
timeout=timedelta(seconds=init_timeout),
sse_read_timeout=timedelta(seconds=tool_timeout),
httpx_client_factory=client_factory,
@@ -1123,10 +1135,21 @@ return read_stream, write_stream
return read_stream, write_stream
else:
# Use traditional SSE client (default behavior)
+ # Before passing headers to httpx, allow extensions to resolve placeholders
+ resolved_headers = await call_extensions_async(
+     "resolve_mcp_server_headers",
+     agent=None,
+     server_name=server.name,
+     headers=dict(server.headers or {}),
+ )
+ if resolved_headers is not None:
+     headers_to_use = resolved_headers
+ else:
+     headers_to_use = server.headers
stdio_transport = await current_exit_stack.enter_async_context(
sse_client(
url=server.url,
- headers=server.headers,
+ headers=headers_to_use,
timeout=init_timeout,
sse_read_timeout=tool_timeout,
httpx_client_factory=client_factory,
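The new `resolve_mcp_server_headers` extension point lets an extension rewrite header values before httpx ever sees them. A sketch of one possible extension — the `{{ENV:VAR}}` placeholder syntax is an assumption for illustration, not part of this PR:

```python
import os
import re

# Hypothetical placeholder syntax: {{ENV:SOME_VAR}} inside a header value.
_PLACEHOLDER = re.compile(r"\{\{ENV:([A-Za-z_][A-Za-z0-9_]*)\}\}")

async def resolve_mcp_server_headers(agent, server_name: str, headers: dict) -> dict:
    """Replace {{ENV:VAR}} placeholders in header values with environment variables.

    Unknown variables are left untouched so the failure stays visible downstream.
    """
    return {
        key: _PLACEHOLDER.sub(
            lambda m: os.environ.get(m.group(1), m.group(0)), str(value)
        )
        for key, value in headers.items()
    }
```

An extension like this keeps secrets out of the stored MCP server config while the connection code above stays unchanged.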