OpenAI SSE accumulator assumes one token channel per streamed chunk #325

The current `OpenAISSEAccumulator` design appears to assume that each streamed SSE delta carries data for only one token channel at a time: content, reasoning, or tool-call data. OpenAI-compatible serving frameworks do not guarantee that invariant, especially when stream batching or larger stream intervals are used.

Code pointers:

- [`src/inference_endpoint/openai/accumulator.py#L52-L76`](https://github.com/mlcommons/endpoints/blob/main/src/inference_endpoint/openai/accumulator.py#L52-L76)
  - `tool_calls` are appended independently, but `content`, `reasoning_content` / `reasoning`, and pure tool-call sentinel handling are selected through an `if` / `elif` chain.
- [`src/inference_endpoint/openai/accumulator.py#L104-L130`](https://github.com/mlcommons/endpoints/blob/main/src/inference_endpoint/openai/accumulator.py#L104-L130)
  - Final output reconstruction is based on separate accumulated `output_chunks`, `reasoning_chunks`, and `_tool_calls`, so anything skipped during `add_chunk()` is permanently lost.
- [`src/inference_endpoint/core/types.py#L179-L195`](https://github.com/mlcommons/endpoints/blob/main/src/inference_endpoint/core/types.py#L179-L195)
  - `TextModelOutput.as_message_parts()` can represent content, reasoning, and tool calls together, so the final model shape can support multiple channels if the accumulator preserves them.
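
To make the branching issue concrete, here is a minimal sketch of the pattern described in the first pointer. It is not the code from `accumulator.py`: `Delta`, `ingest_exclusive`, and `ingest_independent` are hypothetical names used only for illustration, and tool-call handling is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Delta:
    """Hypothetical stand-in for an SSE delta (not the project's SSEDelta)."""
    content: Optional[str] = None
    reasoning_content: Optional[str] = None


def ingest_exclusive(delta: Delta, output: list, reasoning: list) -> None:
    # Mirrors the described if/elif shape: at most one branch runs per delta.
    if delta.content:
        output.append(delta.content)
    elif delta.reasoning_content:
        reasoning.append(delta.reasoning_content)


def ingest_independent(delta: Delta, output: list, reasoning: list) -> None:
    # Each channel is checked on its own, so no channel can shadow another.
    if delta.content:
        output.append(delta.content)
    if delta.reasoning_content:
        reasoning.append(delta.reasoning_content)
```

With a delta that sets both fields, the exclusive version records only the content, while the independent version records both.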

Concrete example:

If a single streamed delta carries content, reasoning, and a tool call at the same time:

```python
from inference_endpoint.openai.accumulator import OpenAISSEAccumulator
from inference_endpoint.openai.types import SSEChoice, SSEDelta

acc = OpenAISSEAccumulator("qid", stream_all_chunks=True)
acc.add_chunk(
    SSEChoice(
        delta=SSEDelta(
            content="answer",
            reasoning_content="think",
            tool_calls=[
                {
                    "index": 0,
                    "id": "call_1",
                    "type": "function",
                    "function": {"name": "f", "arguments": "{}"},
                }
            ],
        )
    )
)

result = acc.get_final_output()
print(result.response_output.as_message_parts())
```

Observed result:

```python
("answer", None, (...tool_calls...))
```

The `reasoning_content="think"` token data is silently dropped because the `content` branch wins and the `reasoning_content` branch is skipped.

This may be less likely with `stream_interval=1`, but it is not a safe design invariant to rely on. Multiple token channels in the same streamed chunk could cause incorrect final outputs, missing reasoning data, and potentially incorrect downstream token accounting.

Potential solution:

Represent each streamed SSE frame as a structured chunk object that can carry all token-channel deltas present in that frame, e.g. content delta, reasoning delta, and tool-call delta together. The accumulator can then maintain a list of these structured chunks and build the final `TextModelOutput` by independently folding each channel across all chunks, instead of using mutually exclusive branch logic during ingestion.
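
A rough sketch of what that could look like, assuming tool-call deltas arrive as dicts keyed by `index` with `arguments` sent as string fragments (the usual OpenAI streaming shape). `ChannelChunk` and `MultiChannelAccumulator` are hypothetical names, not existing classes in this repository, and the returned tuple only approximates what `TextModelOutput` would need.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChannelChunk:
    """Hypothetical per-frame record; every channel is optional."""
    content: Optional[str] = None
    reasoning: Optional[str] = None
    tool_calls: tuple = ()  # raw tool-call delta dicts, as received


@dataclass
class MultiChannelAccumulator:
    """Hypothetical accumulator that stores frames without picking a channel."""
    chunks: list = field(default_factory=list)

    def add_chunk(self, content=None, reasoning=None, tool_calls=None):
        # Ingestion keeps everything present in the frame; no branching.
        self.chunks.append(
            ChannelChunk(content, reasoning, tuple(tool_calls or ()))
        )

    def fold(self):
        # Each channel is folded independently across all stored frames.
        content = "".join(c.content for c in self.chunks if c.content)
        reasoning = "".join(c.reasoning for c in self.chunks if c.reasoning)
        calls = {}
        for chunk in self.chunks:
            for tc in chunk.tool_calls:
                slot = calls.setdefault(
                    tc["index"], {"id": None, "name": None, "arguments": ""}
                )
                fn = tc.get("function") or {}
                slot["id"] = tc.get("id") or slot["id"]
                slot["name"] = fn.get("name") or slot["name"]
                slot["arguments"] += fn.get("arguments") or ""
        ordered = tuple(calls[i] for i in sorted(calls))
        return content or None, reasoning or None, ordered
```

Because ingestion only appends, a frame that mixes channels cannot lose data; any policy about ordering or precedence moves into `fold()`, where all frames are visible at once.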

