feat(engine): env override for max_output_tokens on tight providers by h3c-hexin · Pull Request #2147 · Hmbown/CodeWhale

h3c-hexin · 2026-05-26T01:43:23Z

Problem

effective_max_output_tokens decides the max_tokens we send in each API request. For models not in the context_window_for_model table it falls back to a 128K window assumption and returns API_MAX_OUTPUT_TOKENS (64K).

That heuristic is fine for the hosted CodeWhale/DeepSeek API but immediately HTTP 400s on self-hosted providers with a tight --max-model-len. Concrete failure:

vLLM serving Qwen3.6 with --max-model-len 65536
Request input ≈ 1.5K tokens + the requested 64K output ⇒ provider rejects (total > 65536 by 1 token).

Operators today have no way to lower the requested output without forking the engine, because the cap is a code-internal constant rather than config.

Fix

A short escape hatch at the top of effective_max_output_tokens: if DEEPSEEK_MAX_OUTPUT_TOKENS is set to a positive integer, return it directly. Otherwise existing behavior is unchanged.

if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
    && let Ok(n) = raw.trim().parse::<u32>()
    && n > 0
{
    return n;
}

Typical use: DEEPSEEK_MAX_OUTPUT_TOKENS=16384 for a 64K-window vLLM deployment.

Scope

One function in crates/tui/src/core/engine/context.rs, +14/-0 of behavior.
No new config struct field. Env-only override keeps the public API unchanged, matching the existing DEEPSEEK_* knob convention (DEEPSEEK_CAPACITY_*, DEEPSEEK_BASE_URL, ...).
Composes with fix(engine): keep auto-compaction working on sub-500K self-hosted windows #2060: that PR fixed the input budget math for sub-500K windows; this one lets operators cap the output request to fit the provider's hard limit.
Does not touch the trust-boundary surface (no auth / sandbox / publishing / prompts).

Tests

Two new unit tests in crates/tui/src/core/engine/tests.rs:

effective_max_output_tokens_env_override_returns_positive_value — env=16384 applies regardless of model (V4 hosted, V4 flash, sub-500K self-hosted).
effective_max_output_tokens_env_override_rejects_zero_and_invalid — env=`0` / `abc` / `` / ` ` / `-1` all fall through to the heuristic (no silent zero cap on typo'd config).

Three pre-existing tests that internally call effective_max_output_tokens now acquire lock_test_env() so they stay isolated from the new env-mutating tests:

context_budget_reserves_output_and_headroom
effective_max_output_tokens_caps_api_request_for_large_window_models
internal_context_budget_tiers_reserved_output_by_window

```
cargo test -p codewhale-tui --bin codewhale-tui -- effective_max_output_tokens context_budget context_input_budget internal_context_budget
```

All 6 pass. Full `cargo test --workspace --all-features` is green except for one pre-existing failure (`prompts::tests::system_prompt_skips_locale_preamble_for_english`) that also reproduces on a clean `origin/main` checkout and is unrelated to this change.

🤖 Generated with Claude Code

Greptile Summary

This PR adds a runtime escape hatch — a new env var — that lets operators cap max_tokens in API requests when running against self-hosted providers (vLLM/SGLang) whose --max-model-len is tighter than the engine's default 64 K output assumption. Existing behavior is completely unchanged when the var is unset.

effective_max_output_tokens gains an early-return for positive-integer values of the env var, with the existing heuristic as fallback for zero, non-numeric, or unset values.
Three pre-existing tests are updated to hold the process-wide env lock for isolation; two new tests cover the override and the rejection of zero/invalid values.

Confidence Score: 4/5

Safe to merge with one fix: the env override is not clamped, so a misconfigured value above the model's context window can silently disable compaction for sub-500K deployments.

The override correctly falls through on zero/invalid input and leaves V4 models unaffected. The risk is that context_input_budget reuses effective_max_output_tokens for sub-500K models, so an env value larger than the window causes a checked_sub underflow, returning None and disabling compaction without any error. A one-line .min(API_MAX_OUTPUT_TOKENS) clamp would close this gap.

crates/tui/src/core/engine/context.rs — the new early-return in effective_max_output_tokens and its downstream effect on context_input_budget

Important Files Changed

Filename	Overview
crates/tui/src/core/engine/context.rs	Adds DEEPSEEK_MAX_OUTPUT_TOKENS env-var early-exit to effective_max_output_tokens; no upper-bound guard against values that exceed the model's context window
crates/tui/src/core/engine/tests.rs	Adds ScopedDeepSeekMaxOutputTokens RAII guard with proper lock-serialised env cleanup; three existing tests acquire lock_test_env() for isolation; two new tests cover the override path

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["effective_max_output_tokens(model)"] --> B{"env var set and > 0?"}
    B -- Yes --> C["return env value (new path)"]
    B -- No --> D["context_window_for_model(model)"]
    D -- Some --> E{"window >= 500K?"}
    D -- None --> F["fallback: 128K window"]
    F --> E
    E -- Yes --> G["return API_MAX_OUTPUT_TOKENS (64K)"]
    E -- No --> H["return min(window/2, 64K)"]
    C --> I["API request: max_tokens = env value"]
    G --> I
    H --> I
    J["context_input_budget(model)"] --> K{"window >= 500K?"}
    K -- Yes --> L["reserved = TURN_MAX_OUTPUT_TOKENS (262K)"]
    K -- No --> M["reserved = effective_max_output_tokens(model) -- affected by env var"]
    L --> N["budget = window - reserved - 1024"]
    M --> N
    N --> O{"underflow?"}
    O -- Yes --> P["None: compaction disabled"]
    O -- No --> Q["Some(budget): compaction active"]

_{Reviews (1): Last reviewed commit: "feat(engine): allow DEEPSEEK_MAX_OUTPUT_..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

…-context providers The `effective_max_output_tokens` heuristic defaults to 64K for any model not in the known-context-window table. This is fine for DeepSeek's hosted API (1M context) but causes immediate HTTP 400s on self-hosted providers with tight `max-model-len`. Example: vLLM serving Qwen3.6 with `--max-model-len 65536` rejects requests because 64000 (output) + ~1500 (input) exceeds the limit by 1 token. This change lets the operator set `DEEPSEEK_MAX_OUTPUT_TOKENS=16384` (or whatever fits their deployment) to override the heuristic. The env var takes precedence over the model-table lookup when set to a positive integer; otherwise the existing behavior is preserved. No new config struct field — env-only override keeps the public API unchanged. Useful for embedded users (e.g. pinvou3) who need to control output budget without forking the engine config schema.

gemini-code-assist

Code Review

This pull request introduces an environment variable override DEEPSEEK_MAX_OUTPUT_TOKENS to allow manual configuration of the maximum output tokens, which is particularly useful for self-hosted providers. It also adds corresponding unit tests that safely manage the environment state. The reviewer identified a potential issue where setting this override to a value larger than the model's context window could cause an underflow in the input budget calculation, and suggested capping the override value to prevent this safety risk.

gemini-code-assist · 2026-05-26T01:44:17Z

+    if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
+        && let Ok(n) = raw.trim().parse::<u32>()
+        && n > 0
+    {
+        return n;
+    }
    let window = context_window_for_model(model).unwrap_or(128_000);


If DEEPSEEK_MAX_OUTPUT_TOKENS is configured to a value that is too large relative to the model's context window, it will cause context_input_budget to underflow and return None. This silently disables all preflight and emergency recovery paths, which is a significant safety risk.

To prevent this, we should retrieve the model's context window first, and then cap the user-provided override to ensure we always leave enough room for the headroom (CONTEXT_HEADROOM_TOKENS) and at least 1 token for input.

let window = context_window_for_model(model).unwrap_or(128_000); if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS") && let Ok(n) = raw.trim().parse::<u32>() && n > 0 { let max_allowed = window.saturating_sub(CONTEXT_HEADROOM_TOKENS as u32 + 1); return n.min(max_allowed).max(1); }

greptile-apps · 2026-05-26T01:47:23Z

+    if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
+        && let Ok(n) = raw.trim().parse::<u32>()
+        && n > 0
+    {
+        return n;


DEEPSEEK_* prefix misleads non-DeepSeek operators

The env var controls max_tokens for any model routed through the engine — a Qwen3 or Llama operator using vLLM would not intuitively reach for DEEPSEEK_MAX_OUTPUT_TOKENS to solve their tight-context problem. The DEEPSEEK_* convention makes sense for vendor-specific keys (DEEPSEEK_BASE_URL, DEEPSEEK_CAPACITY_*), but this knob is provider-agnostic. Was a more generic name like CODEWHALE_MAX_OUTPUT_TOKENS considered?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-05-26T01:47:24Z

+    if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
+        && let Ok(n) = raw.trim().parse::<u32>()
+        && n > 0
+    {
+        return n;
+    }


Missing upper-bound guard can silently disable compaction

context_input_budget calls effective_max_output_tokens as the reserved_output term for sub-500K models. If the env override is set above window − CONTEXT_HEADROOM_TOKENS (e.g. 70000 on a 65 536-token vLLM deployment), checked_sub underflows to None, silently disabling all compaction and preflight checks for that session. Adding .min(API_MAX_OUTPUT_TOKENS) before returning n caps the override at the same ceiling already used by the heuristic path.

gemini-code-assist Bot reviewed May 26, 2026

View reviewed changes

greptile-apps Bot reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(engine): env override for max_output_tokens on tight providers#2147

feat(engine): env override for max_output_tokens on tight providers#2147
h3c-hexin wants to merge 1 commit into
Hmbown:mainfrom
h3c-hexin:feat/max-output-tokens-env

h3c-hexin commented May 26, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

greptile-apps Bot May 26, 2026

Uh oh!

greptile-apps Bot May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

h3c-hexin commented May 26, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Scope

Tests

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

h3c-hexin commented May 26, 2026 •

edited by greptile-apps Bot

Loading