# Add structured, channel-aware streaming architecture to LLMEngine #24
Draft
Co-authored-by: SerialKicked <1781563+SerialKicked@users.noreply.github.com>
Copilot (AI) changed the title from **[WIP] Introduce structured streaming architecture for inference events** to **Add structured, channel-aware streaming architecture to LLMEngine** on Feb 26, 2026.
The existing `OnInferenceStreamed`/`OnInferenceEnded` events emit flat strings, making it impossible to cleanly separate thinking/CoT content from response text at the stream level — and entirely inadequate for future tool/function calling, where the LLM response is structured data, not text.

## New types — `LLM/InferenceStream.cs`

- `InferenceChannel` enum: `Text`, `Thinking`, `ToolCall`, `ToolResult`, `System`
- `InferenceSegment`: channel-tagged streaming chunk with a `Text`/`ToolCall`/`ToolResult` payload and an `IsComplete` flag
- `InferenceResult`: final structured result with `Response` (thinking stripped), `ThinkingContent`, `ToolCalls`, `FinishReason`
- `ToolCallInfo`, `ToolResultInfo`, `ToolCallRecord`: data classes for future tool-call plumbing

## New events on `LLMEngine`

Old events are marked `[Obsolete]` but continue to fire unchanged for backward compatibility.

## Streaming state management

- `_currentChannel`, `_thinkingBuffer`, and `_textBuffer` track per-generation state
- `ResetStreamingState()` clears all state; it is called at the start of `StartGeneration`, `RerollLastMessage`, and `SimpleQuery`

## Streaming

- `Client_StreamingMessageReceived` uses the existing `Instruct.IsThinkingPrompt()` for robust detection — no hardcoded `<think>` tags
- `InferenceResult.Response` is derived from the plugin-processed response via `RemoveThinkingBlocks()`
- `FinishReason == "tool_calls"` emits a terminal `ToolCall` segment as a hook for future implementation

## Usage
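A minimal consumer might look like the sketch below. `OnInferenceCompleted` is named in the spec; the segment event name (`OnInferenceSegment`), the `engine` variable, and the handler signatures are assumptions for illustration:

```csharp
// Illustrative sketch only: subscribing to the new structured events.
// Event and type names follow the PR description; exact delegate
// signatures are assumptions and may differ from the shipped API.
engine.OnInferenceSegment += (sender, segment) =>
{
    switch (segment.Channel)
    {
        case InferenceChannel.Text:
            Console.Write(segment.Text);   // visible response stream
            break;
        case InferenceChannel.Thinking:
            // CoT content arrives on its own channel, e.g. render it collapsed
            break;
        case InferenceChannel.ToolCall:
            // hook for future function calling
            break;
    }
};

engine.OnInferenceCompleted += (sender, result) =>
{
    Console.WriteLine($"\nFinish reason: {result.FinishReason}");
    // result.Response has thinking stripped;
    // result.ThinkingContent holds the CoT block, if any
};
```

Legacy subscribers to `OnInferenceStreamed`/`OnInferenceEnded` keep working unchanged; only text-channel tokens reach them.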
## Docs

The Events section of `Docs/LLMSYSTEM.md` was updated with structured event examples and a legacy-event deprecation note.

## Original prompt
### Problem

The current streaming event system in `LLMEngine` is built around flat `string` events (`OnInferenceStreamed`/`OnInferenceEnded`). This was fine when the only content was plain text, but it's already becoming awkward with thinking/CoT models (thinking blocks are mashed into the same string and stripped out post-hoc in multiple places), and it won't scale at all to function calling / tool use, where the LLM response isn't text — it's structured tool-call data.

The goal of this PR is to introduce a structured, channel-aware streaming architecture that cleanly separates different types of inference content (text, thinking, tool calls, tool results, errors) at the stream level, while maintaining full backward compatibility with the existing `string`-based events.

This is the foundational infrastructure needed before function calling can be implemented.
### What to implement

#### 1. New types — `LLM/InferenceStream.cs` (new file)

Create a new file `LLM/InferenceStream.cs` with these types:

`InferenceChannel` enum:

- `Text` — normal visible text response
- `Thinking` — chain-of-thought / thinking block content
- `ToolCall` — the LLM is requesting a tool/function call
- `ToolResult` — result being fed back after tool execution
- `System` — error or system-level message

`InferenceSegment` class:

- `InferenceChannel Channel` — what kind of content this is
- `string? Text` — the text delta (for Text/Thinking channels)
- `ToolCallInfo? ToolCall` — tool call data (for ToolCall channel)
- `ToolResultInfo? ToolResult` — tool result data (for ToolResult channel)
- `bool IsComplete` — whether this is the final chunk in its channel

`ToolCallInfo` class:

- `string CallId`
- `string FunctionName`
- `string ArgumentsJson` — raw JSON arguments

`ToolResultInfo` class:

- `string CallId`
- `string FunctionName`
- `bool Success`
- `string ResultJson`
- `string? Error`

`InferenceResult` class (final structured result of a complete inference cycle):

- `string Response` — the final visible text response
- `string? ThinkingContent` — the thinking/CoT block, if any
- `List<ToolCallRecord> ToolCalls` — all tool calls made during this inference
- `string? FinishReason`

`ToolCallRecord` class:

- `string CallId`
- `string FunctionName`
- `string ArgumentsJson`
- `string ResultJson`
- `bool Success`
- `TimeSpan Duration`
#### 2. New events on `LLMEngine` (in `LLM/LLMEngine.cs`)

Add these new events alongside the existing ones:
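The exact declarations aren't shown in this prompt; a plausible shape follows. `OnInferenceCompleted` is named elsewhere in the spec, while the segment event name and the use of `EventHandler<T>` are assumptions:

```csharp
// Sketch: structured events on LLMEngine (delegate types assumed)
public event EventHandler<InferenceSegment>? OnInferenceSegment;
public event EventHandler<InferenceResult>? OnInferenceCompleted;
```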
Add private raise methods:
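A sketch of the raise helpers, assuming they mirror the existing `RaiseOnInferenceStreamed` naming pattern and the event declarations above:

```csharp
// Sketch: private raise helpers (names and shapes assumed)
private void RaiseOnInferenceSegment(InferenceSegment segment)
    => OnInferenceSegment?.Invoke(this, segment);

private void RaiseOnInferenceCompleted(InferenceResult result)
    => OnInferenceCompleted?.Invoke(this, result);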
Mark the old events as obsolete (but do NOT remove them):
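For example (the `[Obsolete]` messages and the `EventHandler<string>` delegate type are assumptions; the attribute only warns at compile time, so the events keep firing):

```csharp
// Sketch: old events stay and keep firing unchanged
[Obsolete("Use OnInferenceSegment instead; this flat-string event may be removed in a future version.")]
public event EventHandler<string>? OnInferenceStreamed;

[Obsolete("Use OnInferenceCompleted instead.")]
public event EventHandler<string>? OnInferenceEnded;
```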
#### 3. Refactor `Client_StreamingMessageReceived` in `LLM/LLMEngine.cs`

This is the core change. The current method (around line 635) does flat string accumulation. It needs to:
- Track `_currentChannel` (default `InferenceChannel.Text`)
- Accumulate thinking content into a `StringBuilder _thinkingBuffer`
- Detect the `ThinkingStart`/`ThinkingEnd` tags from `Instruct`
- Emit `InferenceSegment` events tagged with the correct channel
- For a `Text`-channel segment, also call `RaiseOnInferenceStreamed(segment.Text)`; when completing, also call `RaiseOnInferenceEnded(response)`
- On completion (`e.IsComplete`), build an `InferenceResult` with the separated text and thinking content, and raise `OnInferenceCompleted`

Key behavior:

- Thinking content accumulates in `_thinkingBuffer`; text content still goes into the existing `StreamingTextProgress` (for backward compat), but only text-channel tokens go there
- On `e.IsComplete`, build an `InferenceResult` with `Response` (text only) and `ThinkingContent` (if any)
- `FinishReason` from the event args should be carried through to `InferenceResult.FinishReason`

This pull request was created from Copilot chat.
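Putting section 3 together, the channel-switching logic might look like the sketch below. The `StreamingEventArgs` shape (`Token`, `IsComplete`, `FinishReason`) and the raise-helper names are hypothetical; the real method around line 635 will differ:

```csharp
// Sketch: channel-aware streaming handler (event-args shape assumed)
private void Client_StreamingMessageReceived(object? sender, StreamingEventArgs e)
{
    var token = e.Token;

    // Channel switching via the Instruct-configured tags (no hardcoded <think>)
    if (token == Instruct.ThinkingStart)
    {
        _currentChannel = InferenceChannel.Thinking;
        return;
    }
    if (token == Instruct.ThinkingEnd)
    {
        _currentChannel = InferenceChannel.Text;
        return;
    }

    // Accumulate per-channel state
    if (_currentChannel == InferenceChannel.Thinking)
        _thinkingBuffer.Append(token);
    else
        _textBuffer.Append(token);

    // Emit the structured segment, tagged with the current channel
    RaiseOnInferenceSegment(new InferenceSegment
    {
        Channel = _currentChannel,
        Text = token,
        IsComplete = e.IsComplete
    });

    // Backward compat: only text-channel tokens reach the legacy event
    if (_currentChannel == InferenceChannel.Text)
        RaiseOnInferenceStreamed(token);

    if (e.IsComplete)
    {
        var result = new InferenceResult
        {
            Response = _textBuffer.ToString(),
            ThinkingContent = _thinkingBuffer.Length > 0 ? _thinkingBuffer.ToString() : null,
            FinishReason = e.FinishReason   // carried through from the event args
        };
        RaiseOnInferenceEnded(result.Response);  // legacy completion event
        RaiseOnInferenceCompleted(result);
    }
}
```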