fix(tts): sync canonical/alias fields to prevent voice/model/base URL rollback #968

Closed

dbmizrahi wants to merge 13 commits into nextlevelbuilder:dev from
Conversation
Force-pushed from aaa892a to f4ec1c4
dbmizrahi (Contributor, Author):

Since I don't have permissions to re-run the PR checks, I can only confirm that the integration test that failed in the pipeline has passed on my machine.
… method sessions.compact as a write method
…n name was fixed by renaming the function
…xtlevelbuilder#1002)

* refactor(providers): migrate ToolDefinition.Function to pointer + add image response fields

  ToolDefinition.Function becomes *ToolFunctionSchema with omitempty so native tool types (image_generation, web_search, etc.) can be declared without a function body. All 9 internal construction sites updated. CleanToolSchemas refactored — function-shape cleaning extracted into cleanFunctionSchema helper, outer pass-through handles native tools. Added image response fields needed by the next commits: ChatResponse.Images, StreamChunk.Images, ImageContent.Partial (distinguishes partial frames from final images).

* feat(providers): native image_generation for Codex + OpenAI-compat tracks

  Codex native (POST /codex/responses): emit image_generation tool object in request tools[] (type, action, model, output_format, partial_images). Handle SSE events response.image_generation_call.partial_image + response.output_item.done (type image_generation_call) + response.completed output[] walk for non-stream. Dedup per item_id. Extend codexSSEEvent/codexItem with output_format, result, partial_image_b64, partial_image_index.

  OpenAI-compat (/v1/chat/completions): serialize ToolDefinition{Type:'image_generation'} as {type:'image_generation'} pass-through. Parse choices[0].message.images[] + delta.images[] (data URLs) via new parseDataURL helper; append to ChatResponse.Images. ProviderCapabilities.ImageGeneration flag; Codex provider + adapter set true. Other providers default false.

* feat(agent,http,store): persist assistant images + tri-level image_generation gate

  Agent loop tri-level gate: (provider capability) AND (AgentConfig.AllowImageGeneration, default true, stored in other_config.allow_image_generation) AND (request lacks x-goclaw-no-image-gen header). Gate in loop_tool_filter.go appends ToolDefinition{Type:'image_generation'} only when all three pass. Per-request opt-out parsed in chat_completions.go and propagated via RunRequest.NoImageGen.

  Media persistence: persistAssistantImages writes final images (Partial:false) to {workspace}/media/{sha256}.{ext}, returns MediaRef entries, clears inline Images[] from the message. Idempotent on hash, traversal-safe, symlink-guarded. Invoked from pipeline.FinalizeStage via new Deps.PersistAssistantImages callback — covers both stream-final and non-stream paths. Agent store reads AllowImageGeneration from other_config JSONB with absent/nil = true default (matches V3Flags pattern). No DB migration — code-only default.

* feat(ui/web): image_generation toggle + streaming placeholder + download filename

  Composer chip 'Images' visible only when the active agent's provider has ImageGeneration capability. Per-agent localStorage persistence via useImageGenToggle hook. When off, sends noImageGen:true to WS chat method (maps to x-goclaw-no-image-gen on the upstream HTTP call path). ActiveRunZone renders a skeleton placeholder while streaming partial_image frames arrive. MediaGallery assigns generated-YYYYMMDD-HHmmss.png as the download filename for hex/UUID PNGs. i18n keys added to en/vi/zh chat.json: imageGenToggle, imageGenGenerating, imageGenDownloadName. 8 vitest tests for the toggle hook.

* docs: add Image Generation section to codebase-summary + changelog entry

  Documents the new native image_generation pipeline across the providers layer (Codex + OpenAI-compat), agent gate, media persistence, and web UI surface.

* fix(ui/web): match Codex-routed providers for image_generation toggle

  Image-gen toggle visibility was hard-coded to provider id 'chatgpt_oauth', but real Codex-routed agents in production use provider ids like 'cliproxy-codex'. The toggle never rendered. Replace the Set-has check with a small helper that accepts the literal ids plus any provider string containing 'codex' (case-insensitive). Same logic applied in both chat-input.tsx (composer chip) and chat-page.tsx (streaming placeholder gate). Verified against a live Codex-routed agent: toggle now renders, noImageGen:true propagates on toggle-off.

* docs(pr-1002): targeted-mode UX evidence report

  Captures the UI integration surface for native image_generation against a live Codex-routed agent on the remote dev backend. Includes: composer toggle chip (rendered), streaming skeleton placeholder, and honest failure-path capture showing the legacy create_image builtin fallback. Self-contained HTML report in .github/pr-assets/1002/index.html.

* fix(permissions): classify sessions.compact as write method

  CI RBAC-drift test (TestMethodRole_DriftCoverage_AllProtocolMethodsClassified) was failing because the new sessions.compact method added upstream was not classified in any of isPublicMethod / isAdminMethod / isWriteMethod / isReadMethod. Sessions compaction mutates session history (compacts messages into summaries), so it belongs with the other sessions.* write methods.

* fix(tests): remove duplicate contains() in integration package

  Both tts_gemini_live_test.go and mcp_grant_revoke_test.go declared a file-local func contains(s, substr string) bool with identical bodies, causing 'contains redeclared in this block' at compile time in the integration job. Replace all call sites with strings.Contains (same semantics, stdlib) and drop the duplicates. No behavior change.

* feat(providers): NativeImageProvider interface + Codex implementation

  Defines a provider-level contract (NativeImageProvider.GenerateImage) that OAuth-backed providers can implement to serve image generation without exposing static API credentials. Re-uses the PR's Track A native wire format (POST /codex/responses with image_generation tool, item.result decoding, SSE fallback). CodexProvider + CodexAdapter implement it. Also adds MediaRef.Prompt field so downstream layers can propagate the generating prompt alongside the asset.

* feat(tools): route create_image through NativeImageProvider for OAuth providers

  create_image.callProvider now checks for a NativeImageProvider implementation before the credentialProvider interface. When the provider chain points at a Codex-family provider (no static API key), the tool delegates to the provider's GenerateImage, which executes the native ChatGPT Responses API image_generation flow. Resolves 'provider X does not expose API credentials required for image generation' errors for openai-codex / cliproxy-codex chains. On success the tool embeds the user's prompt as a PNG tEXt 'Description' chunk (file-local pngEmbedPrompt helper to avoid a tools→agent import cycle), writes the image to /tmp, and threads the prompt through result.MediaPrompts for downstream MediaRef propagation.

* feat(agent,pipeline): propagate image prompt through MediaRef + PNG tEXt embed helper

  Adds EmbedPNGPrompt public helper in internal/agent/png_metadata.go that inserts a tEXt 'Description' chunk (plus 'Software: goclaw') into PNG byte streams before the IEND chunk. Non-PNG inputs are passed through unchanged — the helper never errors on unknown formats. FinalizeStage wires MediaResult.Prompt (from create_image tool output) onto MediaRef.Prompt so the UI can render the generating prompt alongside the image. Per-image prompt list threaded via pipeline RunState.

* feat(ui/web): show generating prompt as caption under each image in MediaGallery

  When a MediaRef carries a prompt, MediaGallery renders it as a muted italic caption (line-clamp-2) beneath the image, with the full text in the title tooltip. The caption is hidden when the prompt is absent so non-assistant images (user uploads, legacy data) look unchanged. MediaItem + session media_refs types extended with an optional prompt field; the chat-message adapter threads ref.prompt through when converting WS payloads to UI state.

* fix(providers/codex): stream:true + instructions for native image_generation

  The ChatGPT Responses API on /codex/responses rejects two things hard:

  - stream:false → HTTP 400 "Stream must be set to true"
  - missing instructions → HTTP 400 "Instructions are required"

  buildNativeImageRequestBody now sets stream:true and a purpose-specific instructions string ("Generate an image matching the user's description using the image_generation tool. Return only the image; do not describe it in text."). The existing parseNativeImageSSE path was already in place for stream parsing; routing changed from the non-stream branch to the SSE branch. Regression assertions added to TestCodexGenerateImage_BuildsNativeRequest so these two fields can't silently regress.

* feat(providers,tools,ui): image_model whitelist (gpt-image-2 default, gpt-image-1.5 legacy)

  Replaces the hardcoded "gpt-image-2" literal in buildNativeImageRequestBody with a user-configurable field threaded through NativeImageRequest.ImageModel. The whitelist is enforced by ValidateImageModel, which rejects anything outside {gpt-image-2, gpt-image-1.5} with a clear error — prevents silent upstream 400s from model names the Responses API would reject. create_image.callProvider reads params.image_model from the chain entry and threads it through. Empty / absent falls back to DefaultImageModel (gpt-image-2). UI: added an 'Image model' select inside the existing openai-codex Settings panel on the Create Image Provider Chain dialog. Options: Default · gpt-image-2 (recommended) and Legacy · gpt-image-1.5. i18n keys in en/vi/zh tools.json under builtin.mediaChain. Tests: TestCodexGenerateImage covers default/legacy/rejected model cases; TestCreateImageTool_ThreadsImageModel covers params→request threading with empty/legacy/explicit sub-cases.

* fix(tools): raise media chain default timeout to 600s/1 retry for image gen

  gpt-image-2 on complex prompts (dense Vietnamese text, infographic layouts) legitimately takes 4–8 minutes to complete. The previous default of 120s × 2 retries routinely died mid-generation with 'context deadline exceeded' — the upstream run was still producing bytes when our ctx cancelled. Default is now Timeout: 600 / MaxRetries: 1. Retries dropped to 1 because image generation is stateful per upstream run: a mid-flight timeout leaves orphan server work, and retrying a fresh generation doubles cost for no gain. Surface the failure fast so operators can widen the timeout instead. Operators can still set a tighter value explicitly via the Chain dialog.

* refactor: remove user-facing Images toggle, keep admin-level AllowImageGeneration

  The per-request opt-out toggle (composer chip + streaming placeholder + noImageGen header plumbing) was a support footgun — users toggle OFF, forget, then can't generate images and think it's broken. Removed in full. Kept: AgentConfig.AllowImageGeneration (admin kill-switch, stored in other_config.allow_image_generation, default true). Tri-level gate simplifies to two tiers: provider capability AND agent config allows. Removed: useImageGenToggle hook, IMAGE_GEN_PROVIDER_IDS set in chat-input, supportsImageGenProvider helper, agentProvider/agentKey props on ChatInput, showImageGenPlaceholder prop on MessageBubble/ActiveRunZone/ChatThread, noImageGen param on use-chat-send, parseNoImageGen in chat_completions.go, NoImageGen on RunRequest, no_image_gen_header_test.go, imageGenToggle/imageGenGenerating i18n keys. Kept imageGenDownloadName — used by MediaGallery for generated-*.png filename resolution.

* docs(pr-1002): refreshed UX trace + updated codebase notes

  Replaces the earlier stealth-state evidence with a clean three-capture trace from a real successful run: inline image + prompt caption, MediaGallery lightbox expansion, Chain dialog with the new Image model dropdown open. Skill-routing rows scrubbed from the capture — they reflect per-agent skill setup, not anything this PR introduces. codebase-summary + changelog: reflect final state (toggle removed, image_model selector, 600s default chain timeout, gpt-image-2 as quality baseline).

* fix(pipeline): preserve mid-loop image_generation output across iterations

  FinalizeStage previously read state.Think.LastResponse.Images, which holds only the last iteration's response. If the LLM emitted image_generation_call in iteration N alongside a function_call, then responded text-only in N+1, the image from N was silently dropped on finalize. Accumulate final (non-partial) images into state.Observe.AssistantImages across every iteration via ObserveStage, and source FinalizeStage from the accumulator instead of LastResponse. Partial streaming frames are filtered defensively; response.Images is cleared on drain to prevent double-counting on re-exec.

* test(pipeline): regression coverage for mid-loop image accumulation

  Six ObserveStage cases covering image accumulation semantics: single-iter image-only, image+tool_call same iter, mid-loop image surviving text-only final iter, multiple images across iters, partial frame filtering, nil-response safety. Two FinalizeStage cases verifying the accumulator is the source of truth:

  - PersistsFromObserveAccumulator: image in Observe + empty LastResponse must still be persisted via PersistAssistantImages.
  - NoPersistWhenAccumulatorEmpty: no call when no images were emitted.

---------

Co-authored-by: viettranx <viettranx@gmail.com>
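The image_model whitelist described in that commit can be sketched roughly as follows. The names ValidateImageModel and DefaultImageModel come from the commit message; the exact signatures and package layout in the repo are assumptions.

```go
package main

import "fmt"

// DefaultImageModel is used when the chain entry omits image_model.
const DefaultImageModel = "gpt-image-2"

// allowedImageModels mirrors the whitelist from the commit message:
// gpt-image-2 (default) and gpt-image-1.5 (legacy).
var allowedImageModels = map[string]bool{
	"gpt-image-2":   true,
	"gpt-image-1.5": true,
}

// ValidateImageModel resolves the effective model name: empty input
// falls back to the default, and anything off the whitelist fails
// locally with a clear error instead of as a silent upstream HTTP 400.
func ValidateImageModel(model string) (string, error) {
	if model == "" {
		return DefaultImageModel, nil
	}
	if !allowedImageModels[model] {
		return "", fmt.Errorf("unsupported image model %q (allowed: gpt-image-2, gpt-image-1.5)", model)
	}
	return model, nil
}

func main() {
	m, _ := ValidateImageModel("")
	fmt.Println(m) // empty input falls back to gpt-image-2
	_, err := ValidateImageModel("dall-e-3")
	fmt.Println(err != nil) // off-whitelist model is rejected: true
}
```

Validating before the request is built is what turns a cryptic Responses API 400 into an actionable local error for the operator.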
- new send_file(path, caption?) tool with DenyPaths guard and duplicate-delivery check
- patch message(MEDIA:) to Mark DeliveredMedia on send success (closes cross-tool dup gap)
- register in gateway wiring + builtin seed
- add to systemprompt coreToolSummaries; clarify write_file deliver=true description
- 16 tests green (PG + SQLite builds clean, invariants green)

Agents previously saw their own Telegram handle (e.g. "@viet_super_bot") in user messages and mistook it for a different bot, replying NO_REPLY. The username was only used for the mention gate, never removed from the content passed to the LLM. Slack and Feishu already strip their own bot mentions (handlers_mention.go stripBotMention, bot_parse.go resolveMentions); Telegram was the odd one out.

Implementation:

- Add stripBotMention helper with leading/trailing word-boundary anchors so inline matches inside words (e.g. contact@viet_super_bot.com) are not falsely stripped.
- Apply in handleMessage right after the mention gate, before pairing/media processing, so history recording for unmentioned messages keeps raw text.
- Restore "[empty message]" placeholder when a message consisting only of "@botName" becomes empty after stripping.
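A minimal Go sketch of the word-boundary stripping behavior described above — the helper name matches the commit, but the regex and wiring are illustrative, not the repo's actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stripBotMention removes the bot's own @username from message text, but
// only when the mention stands alone as a word: (^|\s) and ($|\s) act as
// the leading/trailing anchors, so inline matches inside longer tokens
// (e.g. an email address) are left untouched.
func stripBotMention(text, username string) string {
	re := regexp.MustCompile(`(?i)(^|\s)@` + regexp.QuoteMeta(username) + `($|\s)`)
	out := strings.TrimSpace(re.ReplaceAllString(text, " "))
	if out == "" {
		// A message consisting only of "@botName" becomes empty after
		// stripping; restore the placeholder per the commit message.
		return "[empty message]"
	}
	return out
}

func main() {
	fmt.Println(stripBotMention("@viet_super_bot hello", "viet_super_bot"))
	fmt.Println(stripBotMention("mail contact@viet_super_bot.com", "viet_super_bot"))
	fmt.Println(stripBotMention("@viet_super_bot", "viet_super_bot"))
}
```

The third case is the one the commit calls out explicitly: a mention-only message must become "[empty message]" rather than an empty string.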
After stripBotMention removes the bot's own @mention from content, the LLM still lacks knowledge of its own platform handle. In multi-bot groups (e.g. "@bota @botb do X") the other bot's mention remains in the content and the agent incorrectly treats it as the intended target, replying NO_REPLY. Capture the bot's first_name from GetMe during channel Start and expose it alongside the username via a new MetaChannelSelfIdentity metadata key. The consumer appends the formatted hint ("You are @{username} ({display_name}) on this Telegram channel.") to the agent's extraSystemPrompt so the LLM can reliably identify itself across single- and multi-bot scenarios. Falls back to "You are @{username} on this Telegram channel." when the display name is not available, and no-ops when the username has not been resolved yet (startup race).
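The self-identity hint with its fallback and startup no-op could look roughly like this. The helper name is hypothetical; the two output strings and the empty-username behavior come from the commit message.

```go
package main

import "fmt"

// selfIdentityHint formats the extraSystemPrompt line described above:
// full form when a display name is known, shorter form otherwise, and a
// no-op empty string when the username has not been resolved yet
// (startup race).
func selfIdentityHint(username, displayName string) string {
	if username == "" {
		return ""
	}
	if displayName == "" {
		return fmt.Sprintf("You are @%s on this Telegram channel.", username)
	}
	return fmt.Sprintf("You are @%s (%s) on this Telegram channel.", username, displayName)
}

func main() {
	fmt.Println(selfIdentityHint("bota", "Bot A"))
	fmt.Println(selfIdentityHint("bota", ""))
	fmt.Printf("%q\n", selfIdentityHint("", "Bot A")) // startup race: ""
}
```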
- HTTP synthesize + test-connection now read tenant tts.timeout_ms (default 120s, was hardcoded 15s/10s). Gemini client default also bumped 30s→120s so both layers align when tenant config is unset.
- Inline prefix "Speak naturally: " prepended to single-voice text; multi-speaker transcripts pass through unchanged.
- ErrTextOnlyResponse sentinel for 400 "text generation" bodies; single-voice retries once with a stronger prefix. Narrowed needle list avoids false positives on unrelated 400s.
- SynthesizeWithFallbackAdapted now returns errors.Join so the sentinel survives the fallback chain; HTTP 422 mapping + locale-translated ForLLM in agent tool (EN/VI/ZH catalogs).
- Default Gemini model bumped to gemini-3.1-flash-tts-preview.

…ct max_tokens

- Add TokenCounter.CountToolSchemas() to measure JSON schema size for all tools
- Include tool schemas in OverheadTokens calculation for accurate context usage
- Implement dynamic max_tokens: in/25 clamp [1024, 8192] for compaction
- Add characterization tests: count_tool_schemas_test.go
- Add overhead verification tests: context_stage_overhead_test.go, context_stage_tool_overhead_test.go
- Add integration tests: context_stage_integration_test.go
- Add compact tests: loop_compact_dynamic_max_test.go, loop_compact_max_tokens_test.go
- Add sanitize tests: loop_history_sanitize_max_tokens_test.go
- Add integration test: loop_compact_integration_test.go

…rate UI display

- Store last_prompt_tokens in sessions.metadata JSONB (PostgreSQL + SQLite)
- Update SessionsList queries to retrieve metadata and provide token display values
- Add fallback heuristic for sessions without metadata (estimated from history)
- Add tests: sessions_list_heuristic_test.go, sessions_list_metadata_tokens_test.go
- Add integration test: sessions_display_tokens_integration_test.go

…t-quality fixes

- Update docs/00-architecture-overview.md with TokenCounter and sessions.metadata details
- Update docs/codebase-summary.md with overhead accounting and compaction logic
- Update docs/project-changelog.md with v3.11.0 context-tokens accuracy fixes
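The "in/25 clamp [1024, 8192]" rule above can be sketched directly; the function name is illustrative, the arithmetic is from the commit message:

```go
package main

import "fmt"

// dynamicCompactionMaxTokens derives the compaction max_tokens budget
// from the input size: input/25, clamped to the range [1024, 8192].
func dynamicCompactionMaxTokens(inputTokens int) int {
	v := inputTokens / 25
	if v < 1024 {
		return 1024
	}
	if v > 8192 {
		return 8192
	}
	return v
}

func main() {
	fmt.Println(dynamicCompactionMaxTokens(10_000))  // 400 → clamped up to 1024
	fmt.Println(dynamicCompactionMaxTokens(100_000)) // 4000, within range
	fmt.Println(dynamicCompactionMaxTokens(300_000)) // 12000 → clamped down to 8192
}
```

The lower bound keeps small sessions from getting a uselessly tiny summary budget; the upper bound caps the cost of compacting very long histories.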
dbmizrahi (Contributor, Author):

Closing this since it is heavily outdated.
Summary
This PR fixes a TTS settings persistence bug where updated provider values (voice/model/base URL) could revert after page reload. The fix keeps canonical and compatibility alias fields synchronized in the UI payload so backend resolution cannot persist stale values.
Type

- fix (…main)

Target Branch

- dev

Checklist

- [x] go build ./... passes
- [x] go build -tags sqliteonly ./... passes (if Go changes)
- [x] go vet ./... passes
- [x] go test -race ./... passes
- [x] cd ui/web && pnpm build (if UI changes)
- [x] SQL placeholders $1, $2 (no string concat)
- [x] internal/upgrade/version.go (if new migration)

Test Plan
- Open provider elevenlabs and change voice/model; repeat for minimax.

Validation performed in this fix:

- ui/web/src/pages/tts/tts-page.tsx lint: no errors/warnings.
- ui/web/src/pages/tts/sections/credentials-section.tsx lint: no errors/warnings.

Implementation notes covered by this PR:

- voice and voice_id are updated together for elevenlabs and minimax.
- model and model_id are updated together for elevenlabs and minimax.
- api_base and base_url are updated together for elevenlabs.
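The sync invariant behind those notes can be sketched as follows. The actual fix lives in the TypeScript UI payload code; this Go sketch only illustrates the canonical/alias pairing rule, with the pair tables taken from the bullets above and all names otherwise assumed.

```go
package main

import "fmt"

// aliasPairs lists canonical field → compatibility alias per provider,
// mirroring the implementation notes above.
var aliasPairs = map[string][][2]string{
	"elevenlabs": {{"voice", "voice_id"}, {"model", "model_id"}, {"api_base", "base_url"}},
	"minimax":    {{"voice", "voice_id"}, {"model", "model_id"}},
}

// syncAliases copies the canonical value onto its alias (or backfills
// the canonical field when only the alias is set) so backend resolution
// cannot pick up a stale field and roll the setting back after reload.
func syncAliases(provider string, payload map[string]string) {
	for _, pair := range aliasPairs[provider] {
		canonical, alias := pair[0], pair[1]
		switch {
		case payload[canonical] != "":
			payload[alias] = payload[canonical]
		case payload[alias] != "":
			payload[canonical] = payload[alias]
		}
	}
}

func main() {
	// The rollback scenario: user edits voice, but voice_id still holds
	// the old value in the payload.
	payload := map[string]string{"voice": "new-voice", "voice_id": "stale-voice"}
	syncAliases("elevenlabs", payload)
	fmt.Println(payload["voice_id"]) // stale alias overwritten: new-voice
}
```

The canonical field always wins when both are set, which is exactly the rollback case the PR fixes: a stale alias can no longer shadow the value the user just saved.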