
Add structured, channel-aware streaming architecture to LLMEngine #24

Draft
Copilot wants to merge 2 commits into master from copilot/add-structured-streaming-architecture


Conversation


Copilot AI commented Feb 26, 2026

The existing OnInferenceStreamed/OnInferenceEnded events emit flat strings, which makes it impossible to cleanly separate thinking/CoT content from response text at the stream level — and entirely inadequate for future tool/function calling, where the LLM response is structured data rather than text.

New types — LLM/InferenceStream.cs

  • InferenceChannel enum: Text, Thinking, ToolCall, ToolResult, System
  • InferenceSegment: channel-tagged streaming chunk with Text/ToolCall/ToolResult payload and IsComplete flag
  • InferenceResult: final structured result with Response (thinking stripped), ThinkingContent, ToolCalls, FinishReason
  • ToolCallInfo, ToolResultInfo, ToolCallRecord: data classes for future tool-call plumbing

New events on LLMEngine

// Per-token, channel-tagged — replaces OnInferenceStreamed
public static event EventHandler<InferenceSegment>? OnInferenceSegment;

// Structured completion — replaces OnInferenceEnded
public static event EventHandler<InferenceResult>? OnInferenceCompleted;

Old events marked [Obsolete] but continue to fire unchanged for backward compatibility.

Streaming state management

  • Private fields _currentChannel, _thinkingBuffer, _textBuffer track per-generation state
  • ResetStreamingState() clears all state; called at the start of StartGeneration, RerollLastMessage, and SimpleQueryStreaming
  • Channel transitions in Client_StreamingMessageReceived use the existing Instruct.IsThinkingPrompt() for robust detection — no hardcoded <think> tags
  • InferenceResult.Response is derived from the plugin-processed response via RemoveThinkingBlocks(); FinishReason == "tool_calls" emits a terminal ToolCall segment as a hook for future implementation
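The per-generation state described above might look like this (a minimal sketch: the field and method names come from this PR, but the StringBuilder types and static-class shape are assumptions):

```csharp
using System.Text;

// Per-generation streaming state, reset before each inference cycle.
private static InferenceChannel _currentChannel = InferenceChannel.Text;
private static readonly StringBuilder _thinkingBuffer = new();
private static readonly StringBuilder _textBuffer = new();

// Called at the start of StartGeneration, RerollLastMessage,
// and SimpleQueryStreaming so stale state never leaks across runs.
private static void ResetStreamingState()
{
    _currentChannel = InferenceChannel.Text;
    _thinkingBuffer.Clear();
    _textBuffer.Clear();
}
```

Clearing the buffers at every entry point (rather than on completion) keeps the state correct even when a previous generation was cancelled mid-stream.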

Usage

LLMEngine.OnInferenceSegment += (_, segment) =>
{
    switch (segment.Channel)
    {
        case InferenceChannel.Text:    Console.Write(segment.Text); break;
        case InferenceChannel.Thinking: /* suppress or show CoT */ break;
    }
};

LLMEngine.OnInferenceCompleted += (_, result) =>
{
    // result.Response has thinking blocks already stripped
    // result.ThinkingContent holds the CoT block if present
    LLMEngine.History.LogMessage(AuthorRole.Assistant, result.Response, ...);
};

Docs

Docs/LLMSYSTEM.md Events section updated with structured event examples and legacy event deprecation note.

Original prompt

Problem

The current streaming event system in LLMEngine is built around flat string events (OnInferenceStreamed / OnInferenceEnded). This was fine when the only content was plain text, but it is already awkward with thinking/CoT models (thinking blocks are mashed into the same string and stripped out post hoc in multiple places), and it won't scale at all to function calling / tool use, where the LLM response isn't text — it's structured tool-call data.

The goal of this PR is to introduce a structured, channel-aware streaming architecture that cleanly separates different types of inference content (text, thinking, tool calls, tool results, errors) at the stream level, while maintaining full backward compatibility with the existing string-based events.

This is the foundational infrastructure needed before function calling can be implemented.

What to implement

1. New types — LLM/InferenceStream.cs (new file)

Create a new file LLM/InferenceStream.cs with these types:

InferenceChannel enum:

  • Text — Normal visible text response
  • Thinking — Chain-of-thought / thinking block content
  • ToolCall — The LLM is requesting a tool/function call
  • ToolResult — Result being fed back after tool execution
  • System — Error or system-level message

InferenceSegment class:

  • InferenceChannel Channel — What kind of content this is
  • string? Text — The text delta (for Text/Thinking channels)
  • ToolCallInfo? ToolCall — Tool call data (for ToolCall channel)
  • ToolResultInfo? ToolResult — Tool result data (for ToolResult channel)
  • bool IsComplete — Whether this is the final chunk in its channel

ToolCallInfo class:

  • string CallId
  • string FunctionName
  • string ArgumentsJson — Raw JSON arguments

ToolResultInfo class:

  • string CallId
  • string FunctionName
  • bool Success
  • string ResultJson
  • string? Error

InferenceResult class (final structured result of a complete inference cycle):

  • string Response — The final visible text response
  • string? ThinkingContent — The thinking/CoT block, if any
  • List<ToolCallRecord> ToolCalls — All tool calls made during this inference
  • string? FinishReason

ToolCallRecord class:

  • string CallId
  • string FunctionName
  • string ArgumentsJson
  • string ResultJson
  • bool Success
  • TimeSpan Duration
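Taken together, the specifications above could be sketched as follows (property shapes — auto-properties and default values — are assumptions; only the member names and types come from the spec):

```csharp
using System;
using System.Collections.Generic;

// LLM/InferenceStream.cs — minimal sketch of the new types.
public enum InferenceChannel { Text, Thinking, ToolCall, ToolResult, System }

public class InferenceSegment
{
    public InferenceChannel Channel { get; set; } = InferenceChannel.Text;
    public string? Text { get; set; }               // delta for Text/Thinking channels
    public ToolCallInfo? ToolCall { get; set; }     // for ToolCall channel
    public ToolResultInfo? ToolResult { get; set; } // for ToolResult channel
    public bool IsComplete { get; set; }            // final chunk in its channel
}

public class ToolCallInfo
{
    public string CallId { get; set; } = string.Empty;
    public string FunctionName { get; set; } = string.Empty;
    public string ArgumentsJson { get; set; } = string.Empty; // raw JSON arguments
}

public class ToolResultInfo
{
    public string CallId { get; set; } = string.Empty;
    public string FunctionName { get; set; } = string.Empty;
    public bool Success { get; set; }
    public string ResultJson { get; set; } = string.Empty;
    public string? Error { get; set; }
}

public class InferenceResult
{
    public string Response { get; set; } = string.Empty; // final visible text
    public string? ThinkingContent { get; set; }         // CoT block, if any
    public List<ToolCallRecord> ToolCalls { get; set; } = new();
    public string? FinishReason { get; set; }
}

public class ToolCallRecord
{
    public string CallId { get; set; } = string.Empty;
    public string FunctionName { get; set; } = string.Empty;
    public string ArgumentsJson { get; set; } = string.Empty;
    public string ResultJson { get; set; } = string.Empty;
    public bool Success { get; set; }
    public TimeSpan Duration { get; set; }
}
```

Plain mutable classes keep the sketch close to the spec wording; records or init-only setters would work equally well if the codebase prefers immutability.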

2. New events on LLMEngine (in LLM/LLMEngine.cs)

Add these new events alongside the existing ones:

/// <summary> Called during inference with typed, channel-tagged segments. Provides richer information than OnInferenceStreamed. </summary>
public static event EventHandler<InferenceSegment>? OnInferenceSegment;

/// <summary> Called when a complete inference cycle finishes. Provides structured results including thinking content and tool call records. </summary>
public static event EventHandler<InferenceResult>? OnInferenceCompleted;

Add private raise methods:

private static void RaiseInferenceSegment(InferenceSegment segment) => OnInferenceSegment?.Invoke(Bot, segment);
private static void RaiseInferenceCompleted(InferenceResult result) => OnInferenceCompleted?.Invoke(Bot, result);

Mark the old events as obsolete (but do NOT remove them):

[Obsolete("Use OnInferenceSegment for channel-aware streaming. This event only receives Text channel content.")]
public static event EventHandler<string>? OnInferenceStreamed;

[Obsolete("Use OnInferenceCompleted for structured results including thinking content and tool calls.")]
public static event EventHandler<string>? OnInferenceEnded;

3. Refactor Client_StreamingMessageReceived in LLM/LLMEngine.cs

This is the core change. The current method (around line 635) does flat string accumulation. It needs to:

  1. Track which channel we're in using a private field _currentChannel (default InferenceChannel.Text)
  2. Track thinking content separately using a private StringBuilder _thinkingBuffer
  3. Detect channel transitions based on ThinkingStart/ThinkingEnd tags from Instruct
  4. Emit InferenceSegment events tagged with the correct channel
  5. Bridge to old events: when emitting a Text channel segment, also call RaiseOnInferenceStreamed(segment.Text). When completing, also call RaiseOnInferenceEnded(response).
  6. On completion (e.IsComplete), build an InferenceResult with the separated text and thinking content, and raise OnInferenceCompleted.

Key behavior:

  • During streaming, detect if the incoming token transitions us into/out of a thinking block
  • Thinking content accumulates in _thinkingBuffer, text content in existing StreamingTextProgress (for backward compat) but only text-channel tokens go there
  • When e.IsComplete, build InferenceResult with Response (text only) and ThinkingContent (if any)
  • The FinishReason from the event args should be carried through to InferenceResult.FinishReason
  • For `Finish...
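The steps above can be sketched as a single handler (hedged heavily: the event-args shape — e.Text, e.IsComplete, e.FinishReason — and the Instruct.ThinkingStart/ThinkingEnd tag properties are assumptions drawn from this description, not the real API, and real tag detection should go through Instruct rather than naive Contains checks):

```csharp
private static void Client_StreamingMessageReceived(object? sender, StreamingEventArgs e)
{
    if (!string.IsNullOrEmpty(e.Text))
    {
        if (e.Text.Contains(Instruct.ThinkingStart))
        {
            _currentChannel = InferenceChannel.Thinking;   // entering CoT block
        }
        else if (e.Text.Contains(Instruct.ThinkingEnd))
        {
            _currentChannel = InferenceChannel.Text;       // leaving CoT block
            RaiseInferenceSegment(new InferenceSegment
                { Channel = InferenceChannel.Thinking, IsComplete = true });
        }
        else if (_currentChannel == InferenceChannel.Thinking)
        {
            _thinkingBuffer.Append(e.Text);
            RaiseInferenceSegment(new InferenceSegment
                { Channel = InferenceChannel.Thinking, Text = e.Text });
        }
        else
        {
            _textBuffer.Append(e.Text);
            RaiseInferenceSegment(new InferenceSegment
                { Channel = InferenceChannel.Text, Text = e.Text });
            RaiseOnInferenceStreamed(e.Text);              // legacy bridge
        }
    }

    if (e.IsComplete)
    {
        var result = new InferenceResult
        {
            Response = _textBuffer.ToString(),
            ThinkingContent = _thinkingBuffer.Length > 0 ? _thinkingBuffer.ToString() : null,
            FinishReason = e.FinishReason,
        };
        RaiseInferenceCompleted(result);
        RaiseOnInferenceEnded(result.Response);            // legacy bridge
    }
}
```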

This pull request was created from Copilot chat.



Co-authored-by: SerialKicked <1781563+SerialKicked@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Introduce structured streaming architecture for inference events" to "Add structured, channel-aware streaming architecture to LLMEngine" on Feb 26, 2026.