Refactor speech stack into built-in Kokoro TTS and Whisper STT plugins #1371

Open
3clyp50 wants to merge 3 commits into agent0ai:ready from 3clyp50:tts_stt

3clyp50 commented Mar 29, 2026

Split the legacy core speech stack into two built-in, independently toggleable plugins: _kokoro_tts for TTS and _whisper_stt for STT.

This extraction keeps dependency installation and bootstrap concerns in Docker/bootstrap/preload, while moving speech-specific tooling, APIs, prompts, UI, and runtime behavior into the plugins. Core now exposes engine-agnostic tts-service and stt-service brokers, with browser-native TTS preserved as the fallback when Kokoro is disabled.

Included in this change:

  • add built-in _kokoro_tts plugin with plugin-owned synth API, config, status UI, and provider registration
  • add built-in _whisper_stt plugin with plugin-owned transcribe API, mic runtime, device UI, prompt injection, and provider registration
  • remove legacy core speech APIs/helpers/settings/UI and delete unused webui/js/speech_browser.js
  • replace the old hardcoded speech settings section with a generic voice surface backed by plugin extensions
  • update preload/docs/tests to match the new plugin-owned speech architecture

Behavioral intent:

  • both plugins are built-in but not always_enabled
  • users can now hot-switch TTS and STT independently
  • browser TTS remains available when _kokoro_tts is off
  • Whisper mic UI only appears when _whisper_stt is enabled
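
To illustrate the broker-plus-fallback behavior described above, here is a minimal sketch. It is not the code from this PR: the class `TtsBroker`, its methods, and the string-returning synth callables are all hypothetical, and the real `tts-service` broker may live in the web UI rather than in Python. The sketch only shows the routing rule: use an enabled provider if one exists, otherwise fall back to the browser engine.

```python
from typing import Callable, Dict, Set

# Hypothetical type: a synth function takes text and returns a label for the audio.
Synth = Callable[[str], str]


class TtsBroker:
    """Engine-agnostic broker (illustrative only, not the PR's actual API)."""

    def __init__(self, fallback_name: str, fallback: Synth) -> None:
        self._fallback_name = fallback_name
        self._fallback = fallback
        self._providers: Dict[str, Synth] = {}
        self._enabled: Set[str] = set()

    def register(self, name: str, synth: Synth) -> None:
        # Registering a provider does not enable it; plugins are not always_enabled.
        self._providers[name] = synth

    def set_enabled(self, name: str, enabled: bool) -> None:
        # Hot-switch a provider on or off without touching the others.
        if enabled:
            self._enabled.add(name)
        else:
            self._enabled.discard(name)

    def active(self) -> str:
        # The first registered provider that is enabled wins; otherwise the fallback.
        for name in self._providers:
            if name in self._enabled:
                return name
        return self._fallback_name

    def speak(self, text: str) -> str:
        synth = self._providers.get(self.active(), self._fallback)
        return synth(text)


# Usage: the browser engine serves requests until the Kokoro plugin is enabled.
broker = TtsBroker("browser", lambda text: f"browser:{text}")
broker.register("_kokoro_tts", lambda text: f"kokoro:{text}")
```

An STT broker could follow the same shape, except that the PR describes no fallback for STT: the Whisper mic UI simply does not appear when `_whisper_stt` is disabled.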

3clyp50 force-pushed the tts_stt branch 3 times, most recently from 7bd9eb6 to 19c8c60 (April 2, 2026 14:27)
3clyp50 changed the base branch from development to ready (April 2, 2026 14:27)
Deimos-Agent and others added 3 commits April 2, 2026 17:46
…gs (local PR-B+C shape)

Adds resolve_mcp_server_headers async extension point at both MCP transport
paths in mcp_handler.py (streamablehttp + sse). Enables plugins to resolve
credential placeholders at header construction time without monkey-patching.

Adds @extensible to set_settings() and set_settings_delta() in settings.py.
Enables plugins to intercept settings writes for credential scanning.

Local patch shape for upstream PR-B+C submission.
Ref: deimos_openbao_secrets IMPLEMENTATION_PLAN.md Step 1

Adds sidebar-chat-item-start and sidebar-chat-item-end x-extension
points inside the x-for loop in chats-list.html.

Previously only sidebar-chats-list-start/end existed, both outside
the x-for loop. This forced plugins that need per-chat-row UI (e.g.
status indicators, labels, badges) to resort to MutationObserver +
index-based DOM scanning and monkey-patching internal store methods.

With these new extension points, plugins can inject content into
each chat row with access to the reactive Alpine context object
(context.id, context.name, context.running, context.project, etc.)
entirely through declarative Alpine bindings — no DOM scanning,
no method patching, no index arithmetic.