Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions crates/tui/src/core/engine/context.rs
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,21 @@ const API_MAX_OUTPUT_TOKENS: u32 = 65_536;
/// model. Uses `API_MAX_OUTPUT_TOKENS` (64K) which fits within common provider
/// limits (128K+ total). For non-V4 models with smaller context windows, caps
/// at half the context window.
///
/// Override: when the env var `DEEPSEEK_MAX_OUTPUT_TOKENS` is set to a positive
/// integer, this function returns that value directly. Use this for self-hosted
/// providers (vLLM/SGLang) whose `max-model-len` is tight and where the
/// model-table heuristic above would over-allocate. Example: vLLM serving
/// Qwen3.6 with `--max-model-len 65536` should set
/// `DEEPSEEK_MAX_OUTPUT_TOKENS=16384` so input + output stays well under the
/// provider's hard limit.
pub(super) fn effective_max_output_tokens(model: &str) -> u32 {
if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
&& let Ok(n) = raw.trim().parse::<u32>()
&& n > 0
{
return n;
Comment on lines +40 to +44
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 DEEPSEEK_* prefix misleads non-DeepSeek operators

The env var controls max_tokens for any model routed through the engine — a Qwen3 or Llama operator using vLLM would not intuitively reach for DEEPSEEK_MAX_OUTPUT_TOKENS to solve their tight-context problem. The DEEPSEEK_* convention makes sense for vendor-specific keys (DEEPSEEK_BASE_URL, DEEPSEEK_CAPACITY_*), but this knob is provider-agnostic. Was a more generic name like CODEWHALE_MAX_OUTPUT_TOKENS considered?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Codex Fix in Claude Code Fix in Cursor

}
Comment on lines +40 to +45
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing upper-bound guard can silently disable compaction

context_input_budget calls effective_max_output_tokens as the reserved_output term for sub-500K models. If the env override is set above window − CONTEXT_HEADROOM_TOKENS (e.g. 70000 on a 65 536-token vLLM deployment), checked_sub underflows to None, silently disabling all compaction and preflight checks for that session. Adding .min(API_MAX_OUTPUT_TOKENS) before returning n caps the override at the same ceiling already used by the heuristic path.

Fix in Codex Fix in Claude Code Fix in Cursor

let window = context_window_for_model(model).unwrap_or(128_000);
Comment on lines +40 to 46
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If DEEPSEEK_MAX_OUTPUT_TOKENS is configured to a value that is too large relative to the model's context window, it will cause context_input_budget to underflow and return None. This silently disables all preflight and emergency recovery paths, which is a significant safety risk.

To prevent this, we should retrieve the model's context window first, and then cap the user-provided override to ensure we always leave enough room for the headroom (CONTEXT_HEADROOM_TOKENS) and at least 1 token for input.

    let window = context_window_for_model(model).unwrap_or(128_000);
    if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
        && let Ok(n) = raw.trim().parse::<u32>()
        && n > 0
    {
        let max_allowed = window.saturating_sub(CONTEXT_HEADROOM_TOKENS as u32 + 1);
        return n.min(max_allowed).max(1);
    }

if window >= 500_000 {
// V4-class models on large-context providers: use 64K which is safe
Expand Down
82 changes: 82 additions & 0 deletions crates/tui/src/core/engine/tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -915,6 +915,9 @@ fn detects_context_length_errors_from_provider_payloads() {

#[test]
fn context_budget_reserves_output_and_headroom() {
// Serialize with other tests that mutate DEEPSEEK_MAX_OUTPUT_TOKENS so
// the internal effective_max_output_tokens() call sees a stable env.
let _lock = lock_test_env();
// V4 has a 1M context window — the only family that comfortably hosts
// a 256K output reservation without saturating the input budget to 0.
let budget = context_input_budget("deepseek-v4-pro")
Expand All @@ -926,6 +929,9 @@ fn context_budget_reserves_output_and_headroom() {

#[test]
fn effective_max_output_tokens_caps_api_request_for_large_window_models() {
// Serialize with other tests that mutate DEEPSEEK_MAX_OUTPUT_TOKENS so
// v4_cap and flash_cap below see the same env state.
let _lock = lock_test_env();
// V4 models have a 1M context window but the API request cap must stay
// well below common provider limits (e.g., 131K total on self-hosted
// vLLM/SGLang). The cap should never exceed 65K.
Expand All @@ -943,8 +949,84 @@ fn effective_max_output_tokens_caps_api_request_for_large_window_models() {
assert_eq!(v4_cap, flash_cap);
}

struct ScopedDeepSeekMaxOutputTokens {
previous: Option<OsString>,
}

impl ScopedDeepSeekMaxOutputTokens {
fn set(value: &str) -> Self {
let previous = std::env::var_os("DEEPSEEK_MAX_OUTPUT_TOKENS");
// Safety: tests using this helper serialize with lock_test_env() and
// restore the original value in Drop.
unsafe {
std::env::set_var("DEEPSEEK_MAX_OUTPUT_TOKENS", value);
}
Self { previous }
}

fn unset() -> Self {
let previous = std::env::var_os("DEEPSEEK_MAX_OUTPUT_TOKENS");
// Safety: see set().
unsafe {
std::env::remove_var("DEEPSEEK_MAX_OUTPUT_TOKENS");
}
Self { previous }
}
}

impl Drop for ScopedDeepSeekMaxOutputTokens {
fn drop(&mut self) {
// Safety: tests using this helper serialize with lock_test_env().
unsafe {
if let Some(previous) = self.previous.take() {
std::env::set_var("DEEPSEEK_MAX_OUTPUT_TOKENS", previous);
} else {
std::env::remove_var("DEEPSEEK_MAX_OUTPUT_TOKENS");
}
}
}
}

#[test]
fn effective_max_output_tokens_env_override_returns_positive_value() {
let _lock = lock_test_env();
let _guard = ScopedDeepSeekMaxOutputTokens::set("16384");

// Override applies regardless of model — V4 hosted, V4 flash, sub-500K
// self-hosted all return the env value verbatim.
assert_eq!(effective_max_output_tokens("deepseek-v4-pro"), 16_384);
assert_eq!(effective_max_output_tokens("deepseek-v4-flash"), 16_384);
assert_eq!(effective_max_output_tokens("qwen3-32b-256k"), 16_384);
}

#[test]
fn effective_max_output_tokens_env_override_rejects_zero_and_invalid() {
let _lock = lock_test_env();
// Establish the heuristic baseline with the env unset.
let baseline = {
let _guard = ScopedDeepSeekMaxOutputTokens::unset();
effective_max_output_tokens("deepseek-v4-pro")
};
assert!(baseline > 0);

// 0, non-numeric, and empty values must all fall through to the heuristic
// rather than producing a zero/garbage cap that would silently break
// request budgeting.
for raw in ["0", "abc", "", " ", "-1"] {
let _guard = ScopedDeepSeekMaxOutputTokens::set(raw);
assert_eq!(
effective_max_output_tokens("deepseek-v4-pro"),
baseline,
"env={raw:?} should fall through to heuristic"
);
}
}

#[test]
fn internal_context_budget_tiers_reserved_output_by_window() {
// Serialize with other tests that mutate DEEPSEEK_MAX_OUTPUT_TOKENS so
// both branches below see a stable env.
let _lock = lock_test_env();
// Large-context (>=500K) models reserve the full TURN_MAX_OUTPUT_TOKENS
// headroom so long V4 sessions don't compact prematurely.
let internal_budget =
Expand Down
Loading