The current `OpenAISSEAccumulator` design appears to assume that each streamed SSE delta belongs to only one token channel at a time: content, reasoning, or tool-call data. OpenAI-compatible serving frameworks do not guarantee that invariant, especially when stream batching or a larger stream interval is used.
Concrete example:
If one streamed delta contains both content and reasoning:
```python
from inference_endpoint.openai.accumulator import OpenAISSEAccumulator
from inference_endpoint.openai.types import SSEChoice, SSEDelta

acc = OpenAISSEAccumulator("qid", stream_all_chunks=True)

# A single delta that carries content, reasoning, and a tool call together.
acc.add_chunk(
    SSEChoice(
        delta=SSEDelta(
            content="answer",
            reasoning_content="think",
            tool_calls=[
                {
                    "index": 0,
                    "id": "call_1",
                    "type": "function",
                    "function": {"name": "f", "arguments": "{}"},
                }
            ],
        )
    )
)

result = acc.get_final_output()
print(result.response_output.as_message_parts())
```
Observed result:
("answer", None, (...tool_calls...))
The `reasoning_content="think"` token data is silently dropped because the `content` branch wins and the `reasoning_content` branch is skipped.
This may be less likely with `stream_interval=1`, but it is not a safe design invariant to rely on. Multiple token channels in the same streamed chunk could cause incorrect final outputs, missing reasoning data, and potentially incorrect downstream token accounting.
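The failure mode can be illustrated without the library. The sketch below is a simplified stand-in, not the project's actual `add_chunk()`: the function names `ingest_exclusive` and `ingest_independent` and the dict-shaped delta are hypothetical, chosen only to contrast mutually exclusive routing with independent routing.

```python
def ingest_exclusive(delta, content, reasoning):
    """Mutually exclusive routing: at most one text channel is kept per delta."""
    if delta.get("content"):
        content.append(delta["content"])
    elif delta.get("reasoning_content"):
        # Never reached when the same delta also carries content.
        reasoning.append(delta["reasoning_content"])


def ingest_independent(delta, content, reasoning):
    """Independent routing: every channel present in the delta is kept."""
    if delta.get("content"):
        content.append(delta["content"])
    if delta.get("reasoning_content"):
        reasoning.append(delta["reasoning_content"])


# Hypothetical delta carrying two channels in one frame.
delta = {"content": "answer", "reasoning_content": "think"}

content_a, reasoning_a = [], []
ingest_exclusive(delta, content_a, reasoning_a)

content_b, reasoning_b = [], []
ingest_independent(delta, content_b, reasoning_b)

print(content_a, reasoning_a)  # ['answer'] []
print(content_b, reasoning_b)  # ['answer'] ['think']
```

With exclusive routing the reasoning text never reaches its list, which matches the observed `None` in the reasoning slot.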
Potential solution:
Represent each streamed SSE frame as a structured chunk object that can carry all token-channel deltas present in that frame, e.g. content delta, reasoning delta, and tool-call delta together. The accumulator can then maintain a list of these structured chunks and build the final `TextModelOutput` by independently folding each channel across all chunks, instead of using mutually exclusive branch logic during ingestion.
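A minimal sketch of that idea follows, under stated assumptions. `StreamChunk` and `ChunkAccumulator` are hypothetical names, not existing classes in the codebase, and the returned tuple only mirrors the `(content, reasoning, tool_calls)` shape shown in the observed result above. Tool-call fragments are collected as-is here; a real implementation would likely also need to merge fragments that share an `index`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class StreamChunk:
    """One SSE frame; any subset of the channels may be populated (hypothetical type)."""
    content: Optional[str] = None
    reasoning: Optional[str] = None
    tool_calls: Tuple[dict, ...] = ()


@dataclass
class ChunkAccumulator:
    """Stores frames verbatim and folds each channel only when asked (hypothetical type)."""
    chunks: List[StreamChunk] = field(default_factory=list)

    def add_chunk(self, chunk: StreamChunk) -> None:
        # Ingestion keeps the whole frame; no channel takes priority over another.
        self.chunks.append(chunk)

    def as_message_parts(self):
        # Fold each channel independently across all stored chunks.
        content = "".join(c.content for c in self.chunks if c.content)
        reasoning = "".join(c.reasoning for c in self.chunks if c.reasoning)
        tool_calls = tuple(tc for c in self.chunks for tc in c.tool_calls)
        return (content or None, reasoning or None, tool_calls)


acc = ChunkAccumulator()
acc.add_chunk(StreamChunk(content="ans", reasoning="thi"))
acc.add_chunk(
    StreamChunk(content="wer", reasoning="nk", tool_calls=({"index": 0, "id": "call_1"},))
)
parts = acc.as_message_parts()
print(parts)  # ('answer', 'think', ({'index': 0, 'id': 'call_1'},))
```

Because ingestion never chooses between channels, a frame that mixes content, reasoning, and tool-call data loses nothing, and per-channel accounting can be derived from the stored chunks afterwards.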
Code pointers:
- src/inference_endpoint/openai/accumulator.py#L52-L76: `tool_calls` are appended independently, but `content`, `reasoning_content`/`reasoning`, and pure tool-call sentinel handling are selected through an `if`/`elif` chain.
- src/inference_endpoint/openai/accumulator.py#L104-L130: the final output is built from `output_chunks`, `reasoning_chunks`, and `_tool_calls`, so anything skipped during `add_chunk()` is permanently lost.
- src/inference_endpoint/core/types.py#L179-L195: `TextModelOutput.as_message_parts()` can represent content, reasoning, and tool calls together, so the final model shape can support multiple channels if the accumulator preserves them.