fix(telegram): surface agent errors + balance markdown/HTML across chunks #996

Open
minhdang03 wants to merge 1 commit into nextlevelbuilder:dev from minhdang03:fix/telegram-agent-bugs-2026-04-22

Summary

Infrastructure fixes for two Telegram-agent bugs observed while debugging a Vietnamese rideshare bot ("Gia Hân"). Both bugs affect any agent on an external channel, not just this one.

  • Silent error suppression: agent-run errors on external channels (Telegram, Facebook, Zalo, …) were turned into empty outbound messages. The typing indicator stopped, no reply was sent, and users could not tell whether the bot had died or was still thinking. The gateway now sends a localized generic fallback (vi/en/zh).
  • Cut output / unclosed markdown: when an LLM response was truncated mid-bold (max_tokens hit, finish_reason=length, stream cut), the raw text ended with **70 and the non-greedy regex silently passed it through as literal **. Users saw • Anh thực nhận: **70 (English: "You actually receive: **70"). The fix pre-sanitizes the raw text, stripping unpaired **/__ before formatting.
  • Cross-chunk HTML tag balance: chunkHTML previously avoided cuts inside a tag but did not balance inline formatting (<b>/<i>/<u>/<s>/<code>) across chunk boundaries. A long <b>foo\n\nbar</b> split at \n\n produced chunk 1 with an unclosed <b> (Telegram rejects it) and chunk 2 with an orphan </b>. Now each chunk auto-closes unclosed tags and the next chunk re-opens them.
  • Observability: ThinkStage now logs a warning when finish_reason=length hits a text-only response (no tool calls). This truncation was previously invisible to operators, even though it reaches the user as a cut message.
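The pre-sanitize step for unpaired markers can be sketched as below. This is a minimal illustration, not the PR's code: `stripUnpairedMarker` and `sanitizeTruncated` are hypothetical names, and the odd-count heuristic (drop the last marker when the count is odd) is an assumption about how `balanceMarkdownMarker` might behave.

```go
package main

import (
	"fmt"
	"strings"
)

// stripUnpairedMarker removes the last occurrence of marker when the text
// contains an odd number of them, so every remaining marker has a partner.
// Hypothetical helper: the PR's balanceMarkdownMarker may work differently.
func stripUnpairedMarker(text, marker string) string {
	if strings.Count(text, marker)%2 == 0 {
		return text // already balanced (or marker absent)
	}
	i := strings.LastIndex(text, marker)
	return text[:i] + text[i+len(marker):]
}

// sanitizeTruncated applies the heuristic to both bold (**) and
// underline (__) markers before any markdown-to-HTML formatting runs.
func sanitizeTruncated(text string) string {
	for _, marker := range []string{"**", "__"} {
		text = stripUnpairedMarker(text, marker)
	}
	return text
}

func main() {
	// A response cut off right after an opening bold marker.
	fmt.Println(sanitizeTruncated("• Anh thực nhận: **70"))
	// A balanced response is left untouched.
	fmt.Println(sanitizeTruncated("**Total:** 70"))
}
```

Dropping the last marker is a deliberate simplification: a truncated response most likely lost its closing marker, so the dangling opener is the one nearest the end of the text.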

Why it matters

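From a user bug report reproduced in production on the "Gia Hân" Telegram agent:

Gia Hân đang typing... nhưng không gửi được reply (English: "Gia Hân is typing... but no reply gets sent")
• Tên liên hệ: ** (English: "Contact name: **")
• Anh thực nhận: **70 (English: "You actually receive: **70")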

Both symptoms trace back to the code bugs above. Fixing the markdown/HTML balance removes the literal ** leak, and fixing the silent-error path gives the user an actionable fallback instead of a stopped typing indicator with no message.

Files changed

| Path | What |
| --- | --- |
| `cmd/gateway_consumer_normal.go` | Replace silent `errContent=""` with localized `i18n.T(locale, MsgAgentErrorGeneric)` for external channels. Thread `locale` into the result goroutine. |
| `internal/i18n/{keys,catalog_en,catalog_vi,catalog_zh}.go` | Add `MsgAgentErrorGeneric` with en/vi/zh translations. |
| `internal/channels/telegram/format.go` | `balanceMarkdownMarker` strips unpaired `**`/`__` before format. `balanceChunkTags` + `closeUnclosedInlineTags` auto-close/reopen inline HTML tags across chunks. |
| `internal/pipeline/think_stage.go` | Warn log on `finish_reason=length` for text-only responses. |
| `internal/channels/telegram/format_balance_test.go` | New file: 9 cases covering unclosed markers, cross-chunk tag balance, and an exact reproduction of the Gia Hân `**70` bug. |

Test plan

  • `go build ./...` (PG variant)
  • `go build -tags sqliteonly ./...` (desktop variant)
  • `go vet ./...`
  • `go test ./internal/channels/telegram/... ./internal/i18n/... ./internal/pipeline/...` — all pass, 9 new cases included
  • Manual: trigger an agent error on a Telegram chat (e.g. revoke provider key) and confirm user receives the localized fallback instead of silence
  • Manual: send a long bold paragraph that spans two Telegram chunks, confirm both render without broken markdown
  • Manual: trigger a `max_tokens`-capped response and confirm no literal `**` reaches the message

Out of scope (flagged for follow-up)

The companion bugs in the user's report about "Gia Hân forgets the $125 it just acknowledged" and "fills customer data from an earlier trip into a new parse" are prompt-engineering issues on that specific agent, not gateway bugs: the history is passed correctly and USER.md is loaded. They need prompt-level rule changes applied in the agent's context files, not code changes.

Possible follow-up code changes worth discussing separately:

  • Reduce `defaultContextCacheTTL` in `internal/tools/context_file_interceptor.go` from 5m → 1m so USER.md writes become visible faster.
  • Bump `DefaultMaxTokens` in `internal/config/defaults.go` from 8192 to e.g. 16384 (cross-tenant impact, needs agreement).
  • Re-start the typing indicator when a queued message (maxConcurrent=1 DM) actually dequeues, instead of only at handler ingress.

🤖 Generated with Claude Code

External channels (Telegram, FB, Zalo) silently suppressed agent run errors
to empty content, leaving users staring at a stopped typing indicator with no
clue whether the bot died or was still thinking. Send a localized generic
fallback (vi/en/zh) instead so users know to retry.

LLM output cut mid-bold (max_tokens truncation, finish_reason=length, or
stream termination) leaked literal `**` markers to Telegram: the non-greedy
bold regex never matched the unpaired marker, so it passed through untouched.
Pre-sanitize raw content to strip unpaired `**`/`__` before formatting.

chunkHTML's tag-safety check prevented cuts inside a tag but did NOT balance
inline formatting (`<b>/<i>/<u>/<s>/<code>`) that legitimately spans chunk
boundaries — e.g. `<b>foo\n\nbar</b>` split at `\n\n` produced chunk 1 with
unclosed `<b>` (Telegram rejects) and chunk 2 with orphan `</b>`. Now each
chunk auto-closes unclosed tags and the next chunk re-opens them.

Also log a warning when ThinkStage sees `finish_reason=length` with no tool
calls — text-only truncation still reaches the user (as a cut message) but
was previously invisible to operators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>