-
Notifications
You must be signed in to change notification settings - Fork 3k
feat(engine): env override for max_output_tokens on tight providers #2147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -28,7 +28,21 @@ const API_MAX_OUTPUT_TOKENS: u32 = 65_536; | |
| /// model. Uses `API_MAX_OUTPUT_TOKENS` (64K) which fits within common provider | ||
| /// limits (128K+ total). For non-V4 models with smaller context windows, caps | ||
| /// at half the context window. | ||
| /// | ||
| /// Override: when the env var `DEEPSEEK_MAX_OUTPUT_TOKENS` is set to a positive | ||
| /// integer, this function returns that value directly. Use this for self-hosted | ||
| /// providers (vLLM/SGLang) whose `max-model-len` is tight and where the | ||
| /// model-table heuristic above would over-allocate. Example: vLLM serving | ||
| /// Qwen3.6 with `--max-model-len 65536` should set | ||
| /// `DEEPSEEK_MAX_OUTPUT_TOKENS=16384` so input + output stays well under the | ||
| /// provider's hard limit. | ||
| pub(super) fn effective_max_output_tokens(model: &str) -> u32 { | ||
| if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS") | ||
| && let Ok(n) = raw.trim().parse::<u32>() | ||
| && n > 0 | ||
| { | ||
| return n; | ||
| } | ||
|
Comment on lines
+40
to
+45
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| let window = context_window_for_model(model).unwrap_or(128_000); | ||
|
Comment on lines
+40
to
46
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If To prevent this, we should retrieve the model's context window first, and then cap the user-provided override to ensure we always leave enough room for the headroom ( let window = context_window_for_model(model).unwrap_or(128_000);
if let Ok(raw) = std::env::var("DEEPSEEK_MAX_OUTPUT_TOKENS")
&& let Ok(n) = raw.trim().parse::<u32>()
&& n > 0
{
let max_allowed = window.saturating_sub(CONTEXT_HEADROOM_TOKENS as u32 + 1);
return n.min(max_allowed).max(1);
} |
||
| if window >= 500_000 { | ||
| // V4-class models on large-context providers: use 64K which is safe | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DEEPSEEK_*prefix misleads non-DeepSeek operatorsThe env var controls
max_tokensfor any model routed through the engine — a Qwen3 or Llama operator using vLLM would not intuitively reach forDEEPSEEK_MAX_OUTPUT_TOKENSto solve their tight-context problem. TheDEEPSEEK_*convention makes sense for vendor-specific keys (DEEPSEEK_BASE_URL,DEEPSEEK_CAPACITY_*), but this knob is provider-agnostic. Was a more generic name likeCODEWHALE_MAX_OUTPUT_TOKENSconsidered?Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!