From d5b8aa4288fb25a4b6bdfe99c811e45f44a155a1 Mon Sep 17 00:00:00 2001
From: AlinsRan
Date: Fri, 12 Jun 2026 09:05:51 +0800
Subject: [PATCH] docs(ai-proxy): restore max_stream_duration_ms and max_response_bytes

These two streaming/response safeguard fields are defined in the plugin
schema but were dropped from the attribute table, leaving the guardrails
undocumented. Restore both rows (en + zh) and the per-socket clarification
on the `timeout` description.

Signed-off-by: AlinsRan
---
 docs/en/latest/plugins/ai-proxy.md | 4 +++-
 docs/zh/latest/plugins/ai-proxy.md | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/docs/en/latest/plugins/ai-proxy.md b/docs/en/latest/plugins/ai-proxy.md
index 802b495939f7..14906b20c2f5 100644
--- a/docs/en/latest/plugins/ai-proxy.md
+++ b/docs/en/latest/plugins/ai-proxy.md
@@ -93,7 +93,9 @@ When `provider` is set to `bedrock`, the Plugin expects requests in the [Bedrock
 | logging | object | False | | | Logging configurations. Does not affect `error.log`. |
 | logging.summaries | boolean | False | false | | If true, logs request LLM model, duration, request, and response tokens. |
 | logging.payloads | boolean | False | false | | If true, logs request and response payload. |
-| timeout | integer | False | 30000 | 1 - 600000 | Request timeout in milliseconds when requesting the LLM service. |
+| timeout | integer | False | 30000 | 1 - 600000 | Request timeout in milliseconds when requesting the LLM service. Applied per socket operation (connect / send / read block); does not cap the total duration of a streaming response. |
+| max_stream_duration_ms | integer | False | | ≥ 1 | Maximum wall-clock duration (in milliseconds) for a streaming AI response. If the upstream keeps sending data past this deadline, the gateway closes the connection. Unset means no cap. Use this to protect the gateway from upstream bugs that produce tokens indefinitely. When the limit is hit mid-stream, the downstream SSE stream is truncated (no protocol-specific terminator such as `[DONE]`, `message_stop`, or `response.completed`); well-behaved clients should treat a missing terminator as an incomplete response. |
+| max_response_bytes | integer | False | | ≥ 1 | Maximum total bytes read from the upstream for a single AI response (streaming or non-streaming). If exceeded, the gateway closes the connection. For non-streaming responses with `Content-Length`, the check is performed before reading the body; for chunked (no-`Content-Length`) non-streaming responses and for streaming responses, the cap is enforced incrementally as bytes are received. Unset means no cap. |
 | max_req_body_size | integer | False | 67108864 | >= 1 | Maximum request body size in bytes that the plugin reads into memory. Requests whose body exceeds this limit are rejected with `413`. Prevents unbounded memory buffering of large request bodies. |
 | keepalive | boolean | False | true | | If true, keeps the connection alive when requesting the LLM service. |
 | keepalive_timeout | integer | False | 60000 | ≥ 1000 | Keepalive timeout in milliseconds when connecting to the LLM service. |
diff --git a/docs/zh/latest/plugins/ai-proxy.md b/docs/zh/latest/plugins/ai-proxy.md
index 8d7b18102eb3..78d0fecb73ae 100644
--- a/docs/zh/latest/plugins/ai-proxy.md
+++ b/docs/zh/latest/plugins/ai-proxy.md
@@ -93,7 +93,9 @@ import TabItem from '@theme/TabItem';
 | logging | object | 否 | | | 日志配置。不影响 `error.log`。 |
 | logging.summaries | boolean | 否 | false | | 如果为 true,记录请求 LLM 模型、持续时间、请求和响应令牌。 |
 | logging.payloads | boolean | 否 | false | | 如果为 true,记录请求和响应负载。 |
-| timeout | integer | 否 | 30000 | 1 - 600000 | 请求 LLM 服务时的请求超时时间(毫秒)。 |
+| timeout | integer | 否 | 30000 | 1 - 600000 | 请求 LLM 服务时的请求超时时间(毫秒)。按单次 socket 操作(连接 / 发送 / 读取数据块)计算,不限制流式响应的总时长。 |
+| max_stream_duration_ms | integer | 否 | | ≥ 1 | 流式 AI 响应的最大墙钟时长(毫秒)。如果上游在该截止时间后仍持续发送数据,网关会关闭连接。不设置表示不限制。用于防止上游异常无限产出 token。当在流式过程中触发该限制时,下游 SSE 流会被截断(不会发送 `[DONE]`、`message_stop`、`response.completed` 等协议终止标记);行为正常的客户端应将缺少终止标记视为不完整的响应。 |
+| max_response_bytes | integer | 否 | | ≥ 1 | 单次 AI 响应(流式或非流式)从上游读取的最大总字节数。超过则网关关闭连接。对于带 `Content-Length` 的非流式响应,在读取响应体前进行检查;对于分块(无 `Content-Length`)的非流式响应以及流式响应,则在接收字节的过程中增量地强制执行该上限。不设置表示不限制。 |
 | keepalive | boolean | 否 | true | | 如果为 true,在请求 LLM 服务时保持连接活跃。 |
 | keepalive_timeout | integer | 否 | 60000 | ≥ 1000 | 连接到 LLM 服务时的保活超时时间(毫秒)。 |
 | keepalive_pool | integer | 否 | 30 | ≥ 1 | LLM 服务连接的保活池大小。 |
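
Reviewer note: the `max_stream_duration_ms` row asks clients to treat a missing terminator as an incomplete response. A minimal client-side sketch of that check is below. It is illustrative only and not part of the plugin: the function name `stream_completed` is hypothetical, and the assumed wire shapes (a literal `data: [DONE]` line, an `event:` line naming the terminator, or a JSON `data:` payload whose `type` field names it) are assumptions about how the three terminators in the row appear on the wire.

```python
import json
from typing import Iterable

# Terminator event names taken from the max_stream_duration_ms description.
# How they appear on the wire is an assumption of this sketch.
TERMINAL_EVENT_TYPES = {"message_stop", "response.completed"}


def stream_completed(lines: Iterable[str]) -> bool:
    """Return True if the SSE lines contain a protocol terminator.

    A False result means the stream ended without one, which a client
    should treat as a truncated (incomplete) response.
    """
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            # Terminator announced through the SSE event name.
            if line[len("event:"):].strip() in TERMINAL_EVENT_TYPES:
                return True
            continue
        if not line.startswith("data:"):
            continue  # blank separators, comments, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return True
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            continue  # partial or non-JSON chunk, keep scanning
        if isinstance(obj, dict) and obj.get("type") in TERMINAL_EVENT_TYPES:
            return True
    return False
```

A client would collect the lines it received before the connection closed, call `stream_completed` on them, and discard or retry the response when the result is `False`.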