
feat(provider): AGENTMEMORY_DISABLE_THINKING to recover structured output on Qwen3/GLM/Kimi#569

Open
efenex wants to merge 1 commit into rohitg00:main from efenex:feat/v3-e-disable-thinking


efenex (Contributor) commented May 20, 2026

Summary

Adds an opt-in env var `AGENTMEMORY_DISABLE_THINKING=true` that forces thinking mode OFF on hybrid-reasoning models served behind OpenAI-compatible endpoints (Qwen3 family, GLM, Kimi, DeepSeek V4-Flash, etc.).

Motivation

When the configured `OPENAI_BASE_URL` points at a hybrid-reasoning model (e.g. Qwen3.6-35B-A3B-FP8 via vLLM, or a serverless provider hosting Qwen3-32B), every chat completion spends output tokens on a `<think>...</think>` block before the actual answer. Worse, structured-output prompts (graph extraction, XML/JSON-mode summarization, `mem::compress`, `mem::summarize`'s XML schema) frequently truncate inside the thinking block: `message.content` comes back empty and `message.reasoning` holds the half-finished reasoning, which the existing parsers can't turn into a valid summary or graph entity.
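The failure mode above can be recognized before handing a response to a parser. The sketch below is illustrative only: the `ChatChoice` shape and the `truncatedInsideThinking` helper are hypothetical, and it assumes the server reports `finish_reason: "length"` when the token budget runs out, as OpenAI-compatible endpoints conventionally do. Some servers expose the reasoning text as `reasoning_content` rather than `reasoning`.

```typescript
// Hypothetical, minimal shape of one choice in an OpenAI-compatible
// chat completion response; real servers return more fields.
interface ChatChoice {
  finish_reason: string | null;
  message: {
    content: string | null;
    // Some servers name this field reasoning_content instead.
    reasoning?: string | null;
  };
}

// True when the model ran out of tokens while still "thinking":
// the visible answer is empty but reasoning text was produced.
function truncatedInsideThinking(choice: ChatChoice): boolean {
  const content = (choice.message.content ?? "").trim();
  const reasoning = (choice.message.reasoning ?? "").trim();
  return choice.finish_reason === "length" && content === "" && reasoning !== "";
}

const failed: ChatChoice = {
  finish_reason: "length",
  message: { content: "", reasoning: "Okay, let me think about the schema..." },
};
const healthy: ChatChoice = {
  finish_reason: "stop",
  message: { content: "<summary>...</summary>", reasoning: null },
};

console.log(truncatedInsideThinking(failed)); // true
console.log(truncatedInsideThinking(healthy)); // false
```

A check like this only flags the problem; it cannot recover a valid summary, which is why the PR prevents the thinking block instead.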

I hit this while bulk-rebuilding the graph (~8k nodes / ~5k edges across observations + memories) against a local Qwen3.6 model served by vLLM, where ~30% of chunks failed to parse because of mid-thinking truncation. The same pattern is documented in gitops-assistant's `llm_engine.py:6207-6260` as the "$7 Qwen3-32B incident".

What it does

Two signals, belt-and-suspenders:

  1. Server-side: send `chat_template_kwargs.enable_thinking=false` in the chat completions request body. Honored by vLLM, SGLang, llama.cpp's OpenAI-compatible server, and most thinking-aware hosting layers.
  2. Client-side fallback: prefix `/no_think` to the system message. This is the soft switch that Qwen3-family models recognize even when the server doesn't pass `chat_template_kwargs` through to the chat template.
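A minimal TypeScript sketch of how the two signals could be combined when building a chat completions body. It is not the code in `src/providers/openai.ts`; `buildRequestBody`, `parseDisableThinking`, and the message types are hypothetical, and the newline between `/no_think` and the original system prompt is an assumed separator.

```typescript
// Hypothetical message and body shapes for an OpenAI-compatible /chat/completions request.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequestBody {
  model: string;
  messages: ChatMessage[];
  // Server-side signal; only effective where the server forwards these kwargs to the chat template.
  chat_template_kwargs?: { enable_thinking: boolean };
}

// Opt-in parsing: only "true" (case-insensitive) disables thinking; unset means off.
function parseDisableThinking(raw: string | undefined): boolean {
  return (raw ?? "").toLowerCase() === "true";
}

function buildRequestBody(
  model: string,
  systemPrompt: string,
  userPrompt: string,
  disableThinking: boolean,
): ChatRequestBody {
  // Client-side fallback: prefix the soft switch to the system message.
  const effectiveSystem = disableThinking ? `/no_think\n${systemPrompt}` : systemPrompt;
  const body: ChatRequestBody = {
    model,
    messages: [
      { role: "system", content: effectiveSystem },
      { role: "user", content: userPrompt },
    ],
  };
  if (disableThinking) {
    body.chat_template_kwargs = { enable_thinking: false };
  }
  // With the flag off, no field is added and the prompt is left unchanged.
  return body;
}

const on = buildRequestBody("qwen3-model", "Summarize as XML.", "observations...", parseDisableThinking("TRUE"));
console.log(JSON.stringify(on.messages[0]?.content)); // "/no_think\nSummarize as XML."
console.log(JSON.stringify(on.chat_template_kwargs)); // {"enable_thinking":false}

const off = buildRequestBody("qwen3-model", "Summarize as XML.", "observations...", parseDisableThinking(undefined));
console.log("chat_template_kwargs" in off); // false
```

Applying both signals is redundant on servers that honor `chat_template_kwargs`, but harmless, and the prefix still helps when the kwargs are silently dropped.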

Both fire only when `AGENTMEMORY_DISABLE_THINKING=true` is set. Default off; existing setups with non-thinking models are unaffected.

Test plan

  • `npm test` passes (existing OpenAI provider tests still green; no new tests added, since the change only adds a request field and a system-prompt prefix when the flag is set)
  • With `AGENTMEMORY_DISABLE_THINKING=true` and a Qwen3 endpoint, `mem::compress` returns parseable XML on the first attempt instead of intermittently truncating mid-thinking
  • With the env var unset, request body is unchanged (verified by inspection)

Related

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added the `AGENTMEMORY_DISABLE_THINKING` environment variable, which turns off reasoning (thinking) mode on hybrid-reasoning models without code changes.


…mode

When AGENTMEMORY_DISABLE_THINKING=true, the OpenAI-compatible provider
forces thinking mode OFF on hybrid-reasoning models (Qwen3 family,
GLM, Kimi, DeepSeek V4-Flash). Without this, every call burns tokens
on a <think>...</think> block before the actual answer, and structured-
output prompts (graph extraction, XML/JSON-mode summarization, etc.)
often truncate inside the thinking block — yielding empty `content`
and a meandering `reasoning` field that parsers can't recover.

Belt-and-suspenders: send `chat_template_kwargs.enable_thinking=false`
as the server-side signal AND prefix `/no_think` to the system message
as the client-side fallback (same pattern as gitops-assistant's
llm_engine.py:6207-6260, which has a documented "$7 Qwen3-32B
incident" from missing this signal).

The env var is opt-in (default off), so existing setups with thinking-
required models are unaffected. Operators running Qwen3.x / GLM / Kimi
behind an OpenAI-compatible endpoint (vLLM, LM Studio, Ollama, etc.)
can set this to recover deterministic structured-output behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 20, 2026

@efenex is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented May 20, 2026

📝 Walkthrough


The OpenAI provider now conditionally disables "thinking" for hybrid-reasoning models. When AGENTMEMORY_DISABLE_THINKING=true, the system prompt is prefixed with /no_think and chat_template_kwargs.enable_thinking=false is added to the request body; otherwise, the original prompt and request body are sent unchanged.

Changes

Thinking Disable Configuration

| Layer / File(s) | Summary |
|---|---|
| Conditional thinking disable in request construction (`src/providers/openai.ts`) | The `call()` method reads AGENTMEMORY_DISABLE_THINKING and computes an effective system prompt. When the flag is true, it prefixes /no_think and adds chat_template_kwargs.enable_thinking=false; otherwise it uses the original prompt. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • rohitg00/agentmemory#307: Also updates src/providers/openai.ts with conditional thinking disable logic for the same OpenAI provider request-building path.

Poem

A thought blocker springs to life,
/no_think calms the strife.
When flags flip true, minds rest—
Reasoning models run their best. 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: introducing AGENTMEMORY_DISABLE_THINKING feature for OpenAI provider to fix structured output issues on specific hybrid-reasoning models. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



coderabbitai Bot left a comment


🧹 Nitpick comments (2)
src/providers/openai.ts (2)

93-94: ⚡ Quick win

Use getEnvVar() for consistency with rest of file.

Line 94 accesses the env var directly via process.env, while all other env var reads in this file use the getEnvVar() helper (lines 62, 63, 66, 171, 175). This breaks the abstraction and bypasses any normalization or mocking that getEnvVar may provide.

♻️ Proposed fix

```diff
     const disableThinking =
-      (process.env["AGENTMEMORY_DISABLE_THINKING"] || "").toLowerCase() === "true";
+      (getEnvVar("AGENTMEMORY_DISABLE_THINKING") || "").toLowerCase() === "true";
```
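To illustrate why routing the read through a helper matters for testing, here is a sketch that injects the env reader as a parameter. The real `getEnvVar` signature is not shown in this review, so `EnvReader`, `isThinkingDisabled`, and `fakeEnv` are hypothetical stand-ins.

```typescript
// Hypothetical stand-in for the provider's getEnvVar helper;
// its real signature is not shown in this review.
type EnvReader = (name: string) => string | undefined;

// Reads the flag through the injected reader instead of process.env,
// so tests can supply values without mutating the global environment.
function isThinkingDisabled(getEnvVar: EnvReader): boolean {
  return (getEnvVar("AGENTMEMORY_DISABLE_THINKING") || "").toLowerCase() === "true";
}

// Test double: serves variables from a plain object.
const fakeEnv = (vars: Record<string, string>): EnvReader => (name) => vars[name];

console.log(isThinkingDisabled(fakeEnv({ AGENTMEMORY_DISABLE_THINKING: "TRUE" }))); // true
console.log(isThinkingDisabled(fakeEnv({ AGENTMEMORY_DISABLE_THINKING: "1" }))); // false
console.log(isThinkingDisabled(fakeEnv({}))); // false
```

Only the literal string `true` (any case) enables the flag; values such as `1` or `yes` leave thinking on, matching the comparison in the proposed fix.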
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/providers/openai.ts` around lines 93 - 94, the code reads
AGENTMEMORY_DISABLE_THINKING directly via process.env for the disableThinking
constant; replace that direct access with the existing helper getEnvVar to keep
env normalization/mocking consistent (use
getEnvVar("AGENTMEMORY_DISABLE_THINKING", "") and then apply the .toLowerCase()
=== "true" check) so the disableThinking assignment uses getEnvVar instead of
process.env.

80-92: 💤 Low value

Consider trimming the extended comment.

The comment provides valuable context about why this feature exists (truncation issues, the referenced incident). However, the coding guidelines suggest avoiding comments explaining WHAT in favor of clear naming. The first half (lines 80-86) documents the problem well; lines 87-92 about "belt-and-suspenders" could be shortened since the code itself shows both signals are applied.

As per coding guidelines: "Avoid code comments explaining WHAT — use clear naming instead".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/providers/openai.ts` around lines 80 - 92, trim the extended block
comment: keep the concise explanation of the problem and remove redundant
detail. Keep the first sentence(s) that describe why
AGENTMEMORY_DISABLE_THINKING exists (token waste / truncation) and remove the
extra "belt-and-suspenders" paragraph; ensure the remaining comment still
mentions the flag (AGENTMEMORY_DISABLE_THINKING) and both signals by name
(chat_template_kwargs.enable_thinking=false and the '/no_think' system prefix) so
the intent is clear from naming and the code (no additional incident history or
long rationale).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 86909026-1531-4b08-aed6-3dc9e8f625b1

📥 Commits

Reviewing files that changed from the base of the PR and between 93d1bdd and ed03f59.

📒 Files selected for processing (1)
  • src/providers/openai.ts
