feat(engine): env override for max_output_tokens on tight providers#2147
feat(engine): env override for max_output_tokens on tight providers#2147h3c-hexin wants to merge 1 commit into
Conversation
…-context providers The `effective_max_output_tokens` heuristic defaults to 64K for any model not in the known-context-window table. This is fine for DeepSeek's hosted API (1M context) but causes immediate HTTP 400s on self-hosted providers with tight `max-model-len`. Example: vLLM serving Qwen3.6 with `--max-model-len 65536` rejects requests because 64000 (output) + ~1500 (input) exceeds the limit by 1 token. This change lets the operator set `DEEPSEEK_MAX_OUTPUT_TOKENS=16384` (or whatever fits their deployment) to override the heuristic. The env var takes precedence over the model-table lookup when set to a positive integer; otherwise the existing behavior is preserved. No new config struct field — env-only override keeps the public API unchanged. Useful for embedded users (e.g. pinvou3) who need to control output budget without forking the engine config schema.
There was a problem hiding this comment.
Code Review
This pull request introduces an environment variable override DEEPSEEK_MAX_OUTPUT_TOKENS to allow manual configuration of the maximum output tokens, which is particularly useful for self-hosted providers. It also adds corresponding unit tests that safely manage the environment state. The reviewer identified a potential issue where setting this override to a value larger than the model's context window could cause an underflow in the input budget calculation, and suggested capping the override value to prevent this safety risk.
| if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS") | ||
| && let Ok(n) = raw.trim().parse::<u32>() | ||
| && n > 0 | ||
| { | ||
| return n; | ||
| } | ||
| let window = context_window_for_model(model).unwrap_or(128_000); |
There was a problem hiding this comment.
If DEEPSEEK_MAX_OUTPUT_TOKENS is configured to a value that is too large relative to the model's context window, it will cause context_input_budget to underflow and return None. This silently disables all preflight and emergency recovery paths, which is a significant safety risk.
To prevent this, we should retrieve the model's context window first, and then cap the user-provided override to ensure we always leave enough room for the headroom (CONTEXT_HEADROOM_TOKENS) and at least 1 token for input.
let window = context_window_for_model(model).unwrap_or(128_000);
if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
&& let Ok(n) = raw.trim().parse::<u32>()
&& n > 0
{
let max_allowed = window.saturating_sub(CONTEXT_HEADROOM_TOKENS as u32 + 1);
return n.min(max_allowed).max(1);
}| if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS") | ||
| && let Ok(n) = raw.trim().parse::<u32>() | ||
| && n > 0 | ||
| { | ||
| return n; |
There was a problem hiding this comment.
DEEPSEEK_* prefix misleads non-DeepSeek operators
The env var controls max_tokens for any model routed through the engine — a Qwen3 or Llama operator using vLLM would not intuitively reach for DEEPSEEK_MAX_OUTPUT_TOKENS to solve their tight-context problem. The DEEPSEEK_* convention makes sense for vendor-specific keys (DEEPSEEK_BASE_URL, DEEPSEEK_CAPACITY_*), but this knob is provider-agnostic. Was a more generic name like CODEWHALE_MAX_OUTPUT_TOKENS considered?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS") | ||
| && let Ok(n) = raw.trim().parse::<u32>() | ||
| && n > 0 | ||
| { | ||
| return n; | ||
| } |
There was a problem hiding this comment.
Missing upper-bound guard can silently disable compaction
context_input_budget calls effective_max_output_tokens as the reserved_output term for sub-500K models. If the env override is set above window − CONTEXT_HEADROOM_TOKENS (e.g. 70000 on a 65 536-token vLLM deployment), checked_sub underflows to None, silently disabling all compaction and preflight checks for that session. Adding .min(API_MAX_OUTPUT_TOKENS) before returning n caps the override at the same ceiling already used by the heuristic path.
Problem
effective_max_output_tokensdecides themax_tokenswe send in each API request. For models not in thecontext_window_for_modeltable it falls back to a 128K window assumption and returnsAPI_MAX_OUTPUT_TOKENS(64K).That heuristic is fine for the hosted CodeWhale/DeepSeek API but immediately HTTP 400s on self-hosted providers with a tight
--max-model-len. Concrete failure:--max-model-len 65536Operators today have no way to lower the requested output without forking the engine, because the cap is a code-internal constant rather than config.
Fix
A short escape hatch at the top of
effective_max_output_tokens: ifDEEPSEEK_MAX_OUTPUT_TOKENSis set to a positive integer, return it directly. Otherwise existing behavior is unchanged.Typical use:
DEEPSEEK_MAX_OUTPUT_TOKENS=16384for a 64K-window vLLM deployment.Scope
crates/tui/src/core/engine/context.rs, +14/-0 of behavior.DEEPSEEK_*knob convention (DEEPSEEK_CAPACITY_*,DEEPSEEK_BASE_URL, ...).Tests
Two new unit tests in
crates/tui/src/core/engine/tests.rs:effective_max_output_tokens_env_override_returns_positive_value— env=16384 applies regardless of model (V4 hosted, V4 flash, sub-500K self-hosted).effective_max_output_tokens_env_override_rejects_zero_and_invalid— env=`0` / `abc` / `` / ` ` / `-1` all fall through to the heuristic (no silent zero cap on typo'd config).Three pre-existing tests that internally call
effective_max_output_tokensnow acquirelock_test_env()so they stay isolated from the new env-mutating tests:context_budget_reserves_output_and_headroomeffective_max_output_tokens_caps_api_request_for_large_window_modelsinternal_context_budget_tiers_reserved_output_by_window```
cargo test -p codewhale-tui --bin codewhale-tui -- effective_max_output_tokens context_budget context_input_budget internal_context_budget
```
All 6 pass. Full `cargo test --workspace --all-features` is green except for one pre-existing failure (`prompts::tests::system_prompt_skips_locale_preamble_for_english`) that also reproduces on a clean `origin/main` checkout and is unrelated to this change.
🤖 Generated with Claude Code
Greptile Summary
This PR adds a runtime escape hatch — a new env var — that lets operators cap
max_tokensin API requests when running against self-hosted providers (vLLM/SGLang) whose--max-model-lenis tighter than the engine's default 64 K output assumption. Existing behavior is completely unchanged when the var is unset.effective_max_output_tokensgains an early-return for positive-integer values of the env var, with the existing heuristic as fallback for zero, non-numeric, or unset values.Confidence Score: 4/5
Safe to merge with one fix: the env override is not clamped, so a misconfigured value above the model's context window can silently disable compaction for sub-500K deployments.
The override correctly falls through on zero/invalid input and leaves V4 models unaffected. The risk is that
context_input_budgetreuseseffective_max_output_tokensfor sub-500K models, so an env value larger than the window causes achecked_subunderflow, returningNoneand disabling compaction without any error. A one-line.min(API_MAX_OUTPUT_TOKENS)clamp would close this gap.crates/tui/src/core/engine/context.rs — the new early-return in
effective_max_output_tokensand its downstream effect oncontext_input_budgetImportant Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A["effective_max_output_tokens(model)"] --> B{"env var set and > 0?"} B -- Yes --> C["return env value (new path)"] B -- No --> D["context_window_for_model(model)"] D -- Some --> E{"window >= 500K?"} D -- None --> F["fallback: 128K window"] F --> E E -- Yes --> G["return API_MAX_OUTPUT_TOKENS (64K)"] E -- No --> H["return min(window/2, 64K)"] C --> I["API request: max_tokens = env value"] G --> I H --> I J["context_input_budget(model)"] --> K{"window >= 500K?"} K -- Yes --> L["reserved = TURN_MAX_OUTPUT_TOKENS (262K)"] K -- No --> M["reserved = effective_max_output_tokens(model) -- affected by env var"] L --> N["budget = window - reserved - 1024"] M --> N N --> O{"underflow?"} O -- Yes --> P["None: compaction disabled"] O -- No --> Q["Some(budget): compaction active"]Reviews (1): Last reviewed commit: "feat(engine): allow DEEPSEEK_MAX_OUTPUT_..." | Re-trigger Greptile