
fix(tts): sync canonical/alias fields to prevent voice/model/base URL rollback#968

Closed
dbmizrahi wants to merge 13 commits into nextlevelbuilder:dev from dbmizrahi:fix/voice-vs-voice_id-ignored-for-tts

Conversation

Contributor

@dbmizrahi dbmizrahi commented Apr 19, 2026

Summary

This PR fixes a TTS settings persistence bug where updated provider values (voice/model/base URL) could revert after page reload. The fix keeps canonical and compatibility alias fields synchronized in the UI payload so backend resolution cannot persist stale values.

Type

  • Feature
  • Bug fix
  • Hotfix (targeting main)
  • Refactor
  • Docs
  • CI/CD

Target Branch

dev

Checklist

  • go build ./... passes
  • go build -tags sqliteonly ./... passes (if Go changes)
  • go vet ./... passes
  • Tests pass: go test -race ./...
  • Web UI builds: cd ui/web && pnpm build (if UI changes)
  • No hardcoded secrets or credentials
  • SQL queries use parameterized $1, $2 (no string concat)
  • New user-facing strings added to all 3 locales (en/vi/zh)
  • Migration version bumped in internal/upgrade/version.go (if new migration)

Test Plan

  1. Open TTS settings page.
  2. Select provider elevenlabs and change:
    • voice
    • model
    • base URL
  3. Save settings.
  4. Reload the page.
  5. Verify values remain unchanged (no rollback).
  6. Repeat with provider minimax for voice/model.

Validation performed in this fix:

  • ui/web/src/pages/tts/tts-page.tsx lint: no errors/warnings.
  • ui/web/src/pages/tts/sections/credentials-section.tsx lint: no errors/warnings.

Implementation notes covered by this PR:

  • voice and voice_id are updated together for elevenlabs and minimax.
  • model and model_id are updated together for elevenlabs and minimax.
  • api_base and base_url are updated together for elevenlabs.
  • Read-side fallback now tolerates either canonical or alias field when populating selectors.
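
The read-side fallback described above can be sketched as a first-non-empty pick across the canonical and alias fields. This is an illustrative Go sketch, not the PR's actual helper (the real fix lives in the TypeScript UI payload code):

```go
package main

import "fmt"

// firstNonEmpty returns the first non-empty value, mirroring the read-side
// fallback: prefer the canonical field, tolerate the compatibility alias.
// The helper name and call shape are hypothetical.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	cfg := map[string]string{"voice_id": "Rachel"} // only the alias is persisted
	voice := firstNonEmpty(cfg["voice"], cfg["voice_id"])
	fmt.Println(voice) // prints Rachel
}
```

Writing both fields on save plus tolerating either field on read is what prevents the rollback: a stale alias can no longer win over a fresh canonical value, or vice versa.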

@dbmizrahi dbmizrahi force-pushed the fix/voice-vs-voice_id-ignored-for-tts branch from aaa892a to f4ec1c4 Compare April 19, 2026 20:31
Contributor Author

Since I don't have permissions to re-run the PR checks, I can only confirm that the integration test that failed in the pipeline passes on my machine:

go test -race -tags integration ./tests/integration -run TestOpenAIChat_AbortContextCancel -count=1
ok      github.com/nextlevelbuilder/goclaw/tests/integration    2.072s

dbmizrahi and others added 12 commits April 20, 2026 10:57
…xtlevelbuilder#1002)

* refactor(providers): migrate ToolDefinition.Function to pointer + add image response fields

ToolDefinition.Function becomes *ToolFunctionSchema with omitempty so native tool types (image_generation, web_search, etc.) can be declared without a function body. All 9 internal construction sites updated. CleanToolSchemas refactored — function-shape cleaning extracted into cleanFunctionSchema helper, outer pass-through handles native tools.

Added image response fields needed by the next commits: ChatResponse.Images, StreamChunk.Images, ImageContent.Partial (distinguishes partial frames from final images).

* feat(providers): native image_generation for Codex + OpenAI-compat tracks

Codex native (POST /codex/responses): emit image_generation tool object in request tools[] (type, action, model, output_format, partial_images). Handle SSE events response.image_generation_call.partial_image + response.output_item.done (type image_generation_call) + response.completed output[] walk for non-stream. Dedup per item_id. Extend codexSSEEvent/codexItem with output_format, result, partial_image_b64, partial_image_index.

OpenAI-compat (/v1/chat/completions): serialize ToolDefinition{Type:'image_generation'} as {type:'image_generation'} pass-through. Parse choices[0].message.images[] + delta.images[] (data URLs) via new parseDataURL helper; append to ChatResponse.Images.

ProviderCapabilities.ImageGeneration flag; Codex provider + adapter set true. Other providers default false.

* feat(agent,http,store): persist assistant images + tri-level image_generation gate

Agent loop tri-level gate: (provider capability) AND (AgentConfig.AllowImageGeneration, default true, stored in other_config.allow_image_generation) AND (request lacks x-goclaw-no-image-gen header). Gate in loop_tool_filter.go appends ToolDefinition{Type:'image_generation'} only when all three pass. Per-request opt-out parsed in chat_completions.go and propagated via RunRequest.NoImageGen.

Media persistence: persistAssistantImages writes final images (Partial:false) to {workspace}/media/{sha256}.{ext}, returns MediaRef entries, clears inline Images[] from the message. Idempotent on hash, traversal-safe, symlink-guarded. Invoked from pipeline.FinalizeStage via new Deps.PersistAssistantImages callback — covers both stream-final and non-stream paths.

Agent store reads AllowImageGeneration from other_config JSONB with absent/nil = true default (matches V3Flags pattern). No DB migration — code-only default.

* feat(ui/web): image_generation toggle + streaming placeholder + download filename

Composer chip 'Images' visible only when active agent's provider has ImageGeneration capability. Per-agent localStorage persistence via useImageGenToggle hook. When off, sends noImageGen:true to WS chat method (maps to x-goclaw-no-image-gen on upstream HTTP call path).

ActiveRunZone renders a skeleton placeholder while streaming partial_image frames arrive. MediaGallery assigns generated-YYYYMMDD-HHmmss.png as the download filename for hex/UUID PNGs.

i18n keys added to en/vi/zh chat.json: imageGenToggle, imageGenGenerating, imageGenDownloadName. 8 vitest tests for the toggle hook.

* docs: add Image Generation section to codebase-summary + changelog entry

Documents the new native image_generation pipeline across providers layer (Codex + OpenAI-compat), agent gate, media persistence, and web UI surface.

* fix(ui/web): match Codex-routed providers for image_generation toggle

Image-gen toggle visibility was hard-coded to provider id 'chatgpt_oauth' but real Codex-routed agents in production use provider ids like 'cliproxy-codex'. The toggle never rendered.

Replace the Set-has check with a small helper that accepts the literal ids plus any provider string containing 'codex' (case-insensitive). Same logic applied in both chat-input.tsx (composer chip) and chat-page.tsx (streaming placeholder gate).

Verified against a live Codex-routed agent: toggle now renders, noImageGen:true propagates on toggle-off.

* docs(pr-1002): targeted-mode UX evidence report

Captures the UI integration surface for native image_generation against a live Codex-routed agent on the remote dev backend.

Includes: composer toggle chip (rendered), streaming skeleton placeholder, and honest failure-path capture showing the legacy create_image builtin fallback. Self-contained HTML report in .github/pr-assets/1002/index.html.

* fix(permissions): classify sessions.compact as write method

CI RBAC-drift test (TestMethodRole_DriftCoverage_AllProtocolMethodsClassified) was failing because the new sessions.compact method added upstream was not classified in any of isPublicMethod / isAdminMethod / isWriteMethod / isReadMethod.

Sessions compaction mutates session history (compacts messages into summaries), so it belongs with the other sessions.* write methods.

* fix(tests): remove duplicate contains() in integration package

Both tts_gemini_live_test.go and mcp_grant_revoke_test.go declared a file-local func contains(s, substr string) bool with identical bodies, causing 'contains redeclared in this block' at compile time in the integration job.

Replace all call sites with strings.Contains (same semantics, stdlib) and drop the duplicates. No behavior change.

* feat(providers): NativeImageProvider interface + Codex implementation

Defines a provider-level contract (NativeImageProvider.GenerateImage) that OAuth-backed providers can implement to serve image generation without exposing static API credentials. Re-uses the PR's Track A native wire format (POST /codex/responses with image_generation tool, item.result decoding, SSE fallback).

CodexProvider + CodexAdapter implement it. Also adds MediaRef.Prompt field so downstream layers can propagate the generating prompt alongside the asset.

* feat(tools): route create_image through NativeImageProvider for OAuth providers

create_image.callProvider now checks for a NativeImageProvider implementation before the credentialProvider interface. When the provider chain points at a Codex-family provider (no static API key), the tool delegates to the provider's GenerateImage which executes the native ChatGPT Responses API image_generation flow.

Resolves 'provider X does not expose API credentials required for image generation' errors for openai-codex / cliproxy-codex chains. On success the tool embeds the user's prompt as a PNG tEXt 'Description' chunk (file-local pngEmbedPrompt helper to avoid tools→agent import cycle), writes the image to /tmp, and threads the prompt through result.MediaPrompts for downstream MediaRef propagation.

* feat(agent,pipeline): propagate image prompt through MediaRef + PNG tEXt embed helper

Adds EmbedPNGPrompt public helper in internal/agent/png_metadata.go that inserts a tEXt 'Description' chunk (plus 'Software: goclaw') into PNG byte streams before the IEND chunk. Non-PNG inputs are passed through unchanged — the helper never errors on unknown formats.

FinalizeStage wires MediaResult.Prompt (from create_image tool output) onto MediaRef.Prompt so the UI can render the generating prompt alongside the image. Per-image prompt list threaded via pipeline RunState.

* feat(ui/web): show generating prompt as caption under each image in MediaGallery

When a MediaRef carries a prompt, MediaGallery renders it as a muted italic caption (line-clamp-2) beneath the image with the full text in the title tooltip. Caption is hidden when the prompt is absent so non-assistant images (user uploads, legacy data) look unchanged.

MediaItem + session media_refs types extended with an optional prompt field; the chat-message adapter threads ref.prompt through when converting WS payloads to UI state.

* fix(providers/codex): stream:true + instructions for native image_generation

The ChatGPT Responses API on /codex/responses rejects two things hard:

- stream:false → HTTP 400 "Stream must be set to true"

- missing instructions → HTTP 400 "Instructions are required"

buildNativeImageRequestBody now sets stream:true and a purpose-specific instructions string ("Generate an image matching the user's description using the image_generation tool. Return only the image; do not describe it in text."). The existing parseNativeImageSSE path was already in place for stream parsing; routing changed from the non-stream branch to the SSE branch.

Regression assertions added to TestCodexGenerateImage_BuildsNativeRequest so these two fields can't silently regress.

* feat(providers,tools,ui): image_model whitelist (gpt-image-2 default, gpt-image-1.5 legacy)

Replaces the hardcoded "gpt-image-2" literal in buildNativeImageRequestBody with a user-configurable field threaded through NativeImageRequest.ImageModel. The whitelist is enforced by ValidateImageModel which rejects anything outside {gpt-image-2, gpt-image-1.5} with a clear error — prevents silent upstream 400s from model names the Responses API would reject.

create_image.callProvider reads params.image_model from the chain entry and threads it through. Empty / absent falls back to DefaultImageModel (gpt-image-2).
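
The whitelist-plus-default behavior can be sketched as below. Names are illustrative stand-ins for the real ValidateImageModel/DefaultImageModel in the providers package:

```go
package main

import "fmt"

const defaultImageModel = "gpt-image-2"

// Closed whitelist: anything else is rejected up front with a clear error
// instead of surfacing as an opaque upstream 400.
var allowedImageModels = map[string]bool{
	"gpt-image-2":   true,
	"gpt-image-1.5": true,
}

func validateImageModel(model string) (string, error) {
	if model == "" {
		return defaultImageModel, nil // absent falls back to the default
	}
	if !allowedImageModels[model] {
		return "", fmt.Errorf("image model %q not supported; use gpt-image-2 or gpt-image-1.5", model)
	}
	return model, nil
}

func main() {
	m, _ := validateImageModel("")
	fmt.Println(m) // prints gpt-image-2
}
```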

UI: added an 'Image model' select inside the existing openai-codex Settings panel on the Create Image Provider Chain dialog. Options: Default · gpt-image-2 (recommended) and Legacy · gpt-image-1.5. i18n keys in en/vi/zh tools.json under builtin.mediaChain.

Tests: TestCodexGenerateImage covers default/legacy/rejected model cases; TestCreateImageTool_ThreadsImageModel covers params→request threading with empty/legacy/explicit sub-cases.

* fix(tools): raise media chain default timeout to 600s/1 retry for image gen

gpt-image-2 on complex prompts (dense Vietnamese text, infographic layouts) legitimately takes 4–8 minutes to complete. The previous default of 120s × 2 retries routinely died mid-generation with 'context deadline exceeded' — the upstream run was still producing bytes when our ctx cancelled.

Default is now Timeout: 600 / MaxRetries: 1. Retries dropped to 1 because image generation is stateful per upstream run: a mid-flight timeout leaves orphan server work, and retrying a fresh generation doubles cost for no gain. Surface the failure fast so operators can widen the timeout instead.

Operators can still set a tighter value explicitly via the Chain dialog.

* refactor: remove user-facing Images toggle, keep admin-level AllowImageGeneration

The per-request opt-out toggle (composer chip + streaming placeholder + noImageGen header plumbing) was a support footgun — users toggle OFF, forget, then can't generate images and think it's broken. Removed in full.

Kept: AgentConfig.AllowImageGeneration (admin kill-switch, stored in other_config.allow_image_generation, default true). Tri-level gate simplifies to two tiers: provider capability AND agent config allows.

Removed: useImageGenToggle hook, IMAGE_GEN_PROVIDER_IDS set in chat-input, supportsImageGenProvider helper, agentProvider/agentKey props on ChatInput, showImageGenPlaceholder prop on MessageBubble/ActiveRunZone/ChatThread, noImageGen param on use-chat-send, parseNoImageGen in chat_completions.go, NoImageGen on RunRequest, no_image_gen_header_test.go, imageGenToggle/imageGenGenerating i18n keys. Kept imageGenDownloadName — used by MediaGallery for generated-*.png filename resolution.

* docs(pr-1002): refreshed UX trace + updated codebase notes

Replaces the earlier stealth-state evidence with a clean three-capture trace from a real successful run: inline image + prompt caption, MediaGallery lightbox expansion, Chain dialog with the new Image model dropdown open. Skill-routing rows scrubbed from the capture — they reflect per-agent skill setup, not anything this PR introduces.

codebase-summary + changelog: reflect final state (toggle removed, image_model selector, 600s default chain timeout, gpt-image-2 as quality baseline).

* fix(pipeline): preserve mid-loop image_generation output across iterations

FinalizeStage previously read state.Think.LastResponse.Images, which holds
only the last iteration's response. If the LLM emitted image_generation_call
in iteration N alongside a function_call, then responded text-only in N+1,
the image from N was silently dropped on finalize.

Accumulate final (non-partial) images into state.Observe.AssistantImages
across every iteration via ObserveStage, and source FinalizeStage from the
accumulator instead of LastResponse. Partial streaming frames are filtered
defensively; response.Images is cleared on drain to prevent double-counting
on re-exec.
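
The accumulate-and-drain fix can be sketched with simplified types. These structs are illustrative stand-ins for the pipeline's real Observe/Finalize state:

```go
package main

import "fmt"

type Image struct {
	Data    []byte
	Partial bool
}

type RunState struct {
	AssistantImages []Image // accumulator, survives across loop iterations
}

// observeImages drains final (non-partial) images from one iteration's
// response into the accumulator, clearing the source so re-executing the
// stage cannot double-count.
func observeImages(state *RunState, responseImages *[]Image) {
	for _, img := range *responseImages {
		if img.Partial {
			continue // defensive: drop partial streaming frames
		}
		state.AssistantImages = append(state.AssistantImages, img)
	}
	*responseImages = nil
}

func main() {
	state := &RunState{}
	iter1 := []Image{{Data: []byte("img")}, {Partial: true}}
	iter2 := []Image{} // text-only final iteration: LastResponse has no images
	observeImages(state, &iter1)
	observeImages(state, &iter2)
	fmt.Println(len(state.AssistantImages)) // prints 1: the mid-loop image survives
}
```

Finalize then reads the accumulator rather than the last response, which is exactly why the text-only final iteration no longer drops the image.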

* test(pipeline): regression coverage for mid-loop image accumulation

Six ObserveStage cases covering image accumulation semantics:
- single-iter image-only, image+tool_call same iter, mid-loop image
  surviving text-only final iter, multiple images across iters, partial
  frame filtering, nil-response safety.

Two FinalizeStage cases verifying accumulator is the source of truth:
- PersistsFromObserveAccumulator: image in Observe + empty LastResponse
  must still be persisted via PersistAssistantImages.
- NoPersistWhenAccumulatorEmpty: no call when no images were emitted.

---------

Co-authored-by: viettranx <viettranx@gmail.com>
- new send_file(path, caption?) tool with DenyPaths guard and duplicate-delivery check
- patch message(MEDIA:) to mark DeliveredMedia on send success (closes cross-tool dup gap)
- register in gateway wiring + builtin seed
- add to systemprompt coreToolSummaries; clarify write_file deliver=true description
- 16 tests green (PG + SQLite builds clean, invariants green)
Agents previously saw their own Telegram handle (e.g. "@viet_super_bot")
in user messages and mistook it for a different bot, replying NO_REPLY.
The username was only used for the mention gate, never removed from the
content passed to the LLM.

Slack and Feishu already strip their own bot mentions (handlers_mention.go
stripBotMention, bot_parse.go resolveMentions); Telegram was the odd one out.

Implementation:
- Add stripBotMention helper with leading/trailing word-boundary anchors so
  inline matches inside words (e.g. contact@viet_super_bot.com) are not
  falsely stripped.
- Apply in handleMessage right after the mention gate, before pairing/media
  processing, so history recording for unmentioned messages keeps raw text.
- Restore "[empty message]" placeholder when a message consisting only of
  "@botName" becomes empty after stripping.
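
The anchoring behavior can be sketched with a regexp: `(^|\s)` and `(\s|$)` ensure the mention is only stripped when it stands alone as a word. This is an illustrative sketch of the helper, not the exact implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stripBotMention removes the bot's own @mention only at word boundaries,
// so inline occurrences like contact@viet_super_bot.com survive intact.
func stripBotMention(content, username string) string {
	re := regexp.MustCompile(`(?i)(^|\s)@` + regexp.QuoteMeta(username) + `(\s|$)`)
	out := strings.TrimSpace(re.ReplaceAllString(content, " "))
	if out == "" {
		return "[empty message]" // message consisted only of the mention
	}
	return out
}

func main() {
	fmt.Println(stripBotMention("@viet_super_bot do X", "viet_super_bot"))          // prints do X
	fmt.Println(stripBotMention("mail contact@viet_super_bot.com", "viet_super_bot")) // unchanged
	fmt.Println(stripBotMention("@viet_super_bot", "viet_super_bot"))               // prints [empty message]
}
```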
After stripBotMention removes the bot's own @mention from content, the LLM
still lacks knowledge of its own platform handle. In multi-bot groups (e.g.
"@bota @botb do X") the other bot's mention remains in the content and the
agent incorrectly treats it as the intended target, replying NO_REPLY.

Capture the bot's first_name from GetMe during channel Start and expose it
alongside the username via a new MetaChannelSelfIdentity metadata key. The
consumer appends the formatted hint ("You are @{username} ({display_name})
on this Telegram channel.") to the agent's extraSystemPrompt so the LLM can
reliably identify itself across single- and multi-bot scenarios.

Falls back to "You are @{username} on this Telegram channel." when the
display name is not available, and no-ops when the username has not been
resolved yet (startup race).
- HTTP synthesize + test-connection now read tenant tts.timeout_ms
  (default 120s, was hardcoded 15s/10s). Gemini client default also
  bumped 30s→120s so both layers align when tenant config unset.
- Inline prefix "Speak naturally: " prepended to single-voice text;
  multi-speaker transcripts pass through unchanged.
- ErrTextOnlyResponse sentinel for 400 "text generation" bodies;
  single-voice retries once with stronger prefix. Narrowed needle
  list avoids false positives on unrelated 400s.
- SynthesizeWithFallbackAdapted now returns errors.Join so sentinel
  survives fallback chain; HTTP 422 mapping + locale-translated
  ForLLM in agent tool (EN/VI/ZH catalogs).
- Default Gemini model bumped to gemini-3.1-flash-tts-preview.
…ct max_tokens

- Add TokenCounter.CountToolSchemas() to measure JSON schema size for all tools
- Include tool schemas in OverheadTokens calculation for accurate context usage
- Implement dynamic max_tokens: in/25 clamp [1024, 8192] for compaction
- Add characterization tests: count_tool_schemas_test.go
- Add overhead verification tests: context_stage_overhead_test.go, context_stage_tool_overhead_test.go
- Add integration tests: context_stage_integration_test.go
- Add compact tests: loop_compact_dynamic_max_test.go, loop_compact_max_tokens_test.go
- Add sanitize tests: loop_history_sanitize_max_tokens_test.go
- Add integration test: loop_compact_integration_test.go
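
The in/25 clamp described above can be sketched in a few lines. The function name is illustrative:

```go
package main

import "fmt"

// dynamicMaxTokens scales the compaction request's max_tokens with the
// input size (input/25), clamped to the [1024, 8192] range.
func dynamicMaxTokens(inputTokens int) int {
	out := inputTokens / 25
	if out < 1024 {
		return 1024
	}
	if out > 8192 {
		return 8192
	}
	return out
}

func main() {
	fmt.Println(dynamicMaxTokens(10000))  // prints 1024 (clamped up)
	fmt.Println(dynamicMaxTokens(50000))  // prints 2000 (50000/25)
	fmt.Println(dynamicMaxTokens(400000)) // prints 8192 (clamped down)
}
```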
…rate UI display

- Store last_prompt_tokens in sessions.metadata JSONB (PostgreSQL + SQLite)
- Update SessionsList queries to retrieve metadata and provide token display values
- Add fallback heuristic for sessions without metadata (estimated from history)
- Add tests: sessions_list_heuristic_test.go, sessions_list_metadata_tokens_test.go
- Add integration test: sessions_display_tokens_integration_test.go
…t-quality fixes

- Update docs/00-architecture-overview.md with TokenCounter and sessions.metadata details
- Update docs/codebase-summary.md with overhead accounting and compaction logic
- Update docs/project-changelog.md with v3.11.0 context-tokens accuracy fixes
# Conflicts:
#	ui/web/src/pages/tts/tts-page.tsx
Contributor Author

Closing this since it's heavily outdated.

@dbmizrahi dbmizrahi closed this Apr 23, 2026
