feat(provider): AGENTMEMORY_DISABLE_THINKING to recover structured output on Qwen3/GLM/Kimi #569
efenex wants to merge 1 commit into
…mode
When AGENTMEMORY_DISABLE_THINKING=true, the OpenAI-compatible provider forces thinking mode OFF on hybrid-reasoning models (Qwen3 family, GLM, Kimi, DeepSeek V4-Flash). Without this, every call burns tokens on a <think>...</think> block before the actual answer, and structured-output prompts (graph extraction, XML/JSON-mode summarization, etc.) often truncate inside the thinking block, yielding empty `content` and a meandering `reasoning` field that parsers can't recover.
Belt-and-suspenders: send `chat_template_kwargs.enable_thinking=false` as the server-side signal AND prefix `/no_think` to the system message as the client-side fallback (same pattern as gitops-assistant's llm_engine.py:6207-6260, which has a documented "$7 Qwen3-32B incident" from missing this signal).
The env var is opt-in (default off), so existing setups with thinking-required models are unaffected. Operators running Qwen3.x / GLM / Kimi behind an OpenAI-compatible endpoint (vLLM, LM Studio, Ollama, etc.) can set this to recover deterministic structured-output behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@efenex is attempting to deploy a commit to the rohitg00's projects Team on Vercel. A member of the Team first needs to authorize it.
📝 Walkthrough
OpenAI provider now conditionally disables "thinking" for hybrid reasoning models. When …
Changes
Thinking Disable Configuration
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 5 passed
🧹 Nitpick comments (2)
src/providers/openai.ts (2)
93-94: ⚡ Quick win: Use `getEnvVar()` for consistency with rest of file.
Line 94 accesses the env var directly via `process.env`, while all other env var reads in this file use the `getEnvVar()` helper (lines 62, 63, 66, 171, 175). This breaks the abstraction and bypasses any normalization or mocking that `getEnvVar` may provide.
♻️ Proposed fix
  const disableThinking =
-   (process.env["AGENTMEMORY_DISABLE_THINKING"] || "").toLowerCase() === "true";
+   (getEnvVar("AGENTMEMORY_DISABLE_THINKING") || "").toLowerCase() === "true";
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/providers/openai.ts` around lines 93 - 94, The code reads AGENTMEMORY_DISABLE_THINKING directly via process.env for the disableThinking constant; replace that direct access with the existing helper getEnvVar to keep env normalization/mocking consistent (use getEnvVar("AGENTMEMORY_DISABLE_THINKING", "") and then apply the .toLowerCase() === "true" check) so the disableThinking assignment uses getEnvVar instead of process.env.
80-92: 💤 Low value: Consider trimming the extended comment.
The comment provides valuable context about why this feature exists (truncation issues, the referenced incident). However, the coding guidelines suggest avoiding comments explaining WHAT in favor of clear naming. The first half (lines 80-86) documents the problem well; lines 87-92 about "belt-and-suspenders" could be shortened since the code itself shows both signals are applied.
As per coding guidelines: "Avoid code comments explaining WHAT — use clear naming instead".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/providers/openai.ts` around lines 80 - 92, Trim the extended block comment to keep the concise explanation of the problem and removal of redundant detail: keep the first sentence(s) that describe why AGENTMEMORY_DISABLE_THINKING exists (token waste / truncation) and remove the extra "belt-and-suspenders" paragraph; ensure the remaining comment still mentions the two signals by name (AGENTMEMORY_DISABLE_THINKING, chat_template_kwargs.enable_thinking=false and the '/no_think' system prefix) so the intent is clear from naming and the code (no additional incident history or long rationale).
Summary
Adds an opt-in env var `AGENTMEMORY_DISABLE_THINKING=true` that forces thinking mode OFF on hybrid-reasoning models served behind OpenAI-compatible endpoints (Qwen3 family, GLM, Kimi, DeepSeek V4-Flash, etc.).
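As a rough illustration of the opt-in check, the sketch below mirrors the comparison quoted in the review (case-insensitive match on the literal string "true"). The `EnvReader` type and `isThinkingDisabled` name are hypothetical; the reader is passed in as a parameter so a stand-in for the project's `getEnvVar()` helper, whose real signature is not shown here, can be swapped in or mocked.

```typescript
// Hypothetical reader type: any function that maps a variable name to its
// value. In the provider this role would be played by getEnvVar(), whose
// actual signature is an assumption here.
type EnvReader = (name: string) => string | undefined;

// True only when AGENTMEMORY_DISABLE_THINKING is the string "true"
// (any letter case). Unset, empty, "1", "yes", etc. all leave it off.
function isThinkingDisabled(readEnv: EnvReader): boolean {
  return (readEnv("AGENTMEMORY_DISABLE_THINKING") || "").toLowerCase() === "true";
}

// Usage with an in-memory map instead of the real environment.
const fakeEnv: Record<string, string | undefined> = {
  AGENTMEMORY_DISABLE_THINKING: "True",
};
const enabled = isThinkingDisabled((name) => fakeEnv[name]);
console.log(enabled);
```

Because only the exact word "true" enables the flag, values such as "1" or "yes" keep the default (off) behaviour.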
Motivation
When the configured `OPENAI_BASE_URL` points at a hybrid-reasoning model (e.g. Qwen3.6-35B-A3B-FP8 via vLLM, or a serverless provider hosting Qwen3-32B), every chat completion burns tokens on a `<think>...</think>` block before the actual answer. Worse, structured-output prompts (graph extraction, XML/JSON-mode summarization, `mem::compress`, `mem::summarize`'s XML schema) frequently truncate inside the thinking block: `message.content` comes back empty and `message.reasoning` carries the half-finished meandering, which the existing parsers can't recover into a valid summary/graph entity.
I hit this concretely while bulk-rebuilding the graph (~8k nodes / ~5k edges across observations + memories) against a local Qwen3.6 vLLM, where ~30% of chunks failed to parse due to mid-thinking truncation. The same pattern is documented in another OSS project's `llm_engine.py:6207-6260` as the "$7 Qwen3-32B incident".
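The failure shape described above (empty `message.content`, populated `message.reasoning`) can be told apart from a genuinely empty reply with a small check. This is a minimal sketch, not code from this PR: the `AssistantMessage`, `ChatChoice`, `ExtractResult`, and `extractAnswer` names are hypothetical, and the `reasoning` field name is taken from the description above (other servers may expose it under a different key).

```typescript
// Hypothetical response shapes, reduced to the fields discussed above.
interface AssistantMessage {
  content?: string | null;
  reasoning?: string | null; // assumption: name as used in this description
}

interface ChatChoice {
  message: AssistantMessage;
}

type ExtractResult =
  | { ok: true; text: string }
  | { ok: false; reason: "thinking_truncated" | "empty" };

// Returns the answer text, or a reason explaining why none is available.
function extractAnswer(choice: ChatChoice): ExtractResult {
  const text = (choice.message.content ?? "").trim();
  if (text.length > 0) return { ok: true, text };

  // No answer text: if reasoning is present, the reply most likely stopped
  // inside the thinking block before any answer was produced.
  const reasoning = (choice.message.reasoning ?? "").trim();
  if (reasoning.length > 0) return { ok: false, reason: "thinking_truncated" };

  return { ok: false, reason: "empty" };
}
```

A caller could count `thinking_truncated` results per batch to decide whether setting the env var is worthwhile for a given endpoint.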
What it does
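Two signals, belt-and-suspenders:
- Server-side signal: send `chat_template_kwargs.enable_thinking=false`.
- Client-side fallback: prefix `/no_think` to the system message.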
Both fire only when `AGENTMEMORY_DISABLE_THINKING=true` is set. Default off; existing setups with non-thinking models are unaffected.
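A minimal sketch of how the two signals might be applied to an outgoing request follows. It is not the PR's actual diff: the `ChatMessage` and `ChatRequestBody` types and the `applyDisableThinking` function are hypothetical, the newline separator after `/no_think` is an assumption, and so is the choice to add a system message when none exists.

```typescript
// Hypothetical request shapes; the real provider's types may differ.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequestBody {
  model: string;
  messages: ChatMessage[];
  chat_template_kwargs?: { enable_thinking?: boolean };
}

const NO_THINK_PREFIX = "/no_think";

// Returns a copy of `body` with both "thinking off" signals applied.
// When `disableThinking` is false the body is returned untouched.
function applyDisableThinking(
  body: ChatRequestBody,
  disableThinking: boolean,
): ChatRequestBody {
  if (!disableThinking) return body;

  // Copy the messages so the caller's objects are not mutated.
  const messages: ChatMessage[] = body.messages.map((m) => ({ ...m }));
  const system = messages.find((m) => m.role === "system");
  if (system) {
    // Client-side fallback: prefix the tag once (separator is an assumption).
    if (!system.content.startsWith(NO_THINK_PREFIX)) {
      system.content = `${NO_THINK_PREFIX}\n${system.content}`;
    }
  } else {
    // Assumption: with no system message, add one carrying only the tag.
    messages.unshift({ role: "system", content: NO_THINK_PREFIX });
  }

  return {
    ...body,
    messages,
    // Server-side signal read by the chat template.
    chat_template_kwargs: { ...body.chat_template_kwargs, enable_thinking: false },
  };
}
```

Returning a copy rather than mutating the input keeps the default path (flag off) byte-for-byte identical to the existing behaviour, which matches the opt-in intent.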
Test plan
Related
🤖 Generated with Claude Code