llm_runner: add Engine and Session interfaces by mergennachin · Pull Request #20243 · pytorch/executorch

mergennachin · 2026-06-12T20:13:58Z

The LLM serving path needs a stable contract between generic serving code and model-specific execution code. TextLLM, Qwen, Gemma, CUDA, and future backends all differ in how they own weights and mutable state, but the server should not know those details or grow a new Python binding for every model.

This introduces the minimal runner-level split needed for that contract. LLMEngine represents the loaded physical model and its serving capacity; LLMSession represents one logical conversation state and exposes reset/prefill/decode-style operations. That shape lets a worker drive different model implementations through one interface while keeping KV/recurrent/cache ownership inside C++.

This commit is only the interface and build export. It deliberately does not add a concrete adapter or change existing runner behavior, so model migrations and serving can be reviewed as downstream uses of the contract rather than hidden side effects.

#20001

pytorch-bot · 2026-06-12T20:14:02Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20243

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Pending

As of commit b23f126 with merge base d7ca5db ():

NEW FAILURES - The following jobs have failed:

pull / test-lora-linux / linux-job (gh)
RuntimeError: Command docker exec -t 24bbf2fb2a53e2a14e83ae944fdfff06c1cc400ee4d7d0ee26b2622dc7b7668f /exec failed with exit code 1
pull / unittest-editable / windows / windows-job (gh)
examples/models/test/test_export.py::ExportTest::test_efficient_sam_export_to_executorch

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-12T20:14:58Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

This PR introduces a new, model-agnostic C++ contract for the LLM serving path by defining LLMEngine (loaded model/resources + serving capacity) and LLMSession (per-conversation mutable state with reset/prefill/decode operations). This establishes an interface boundary so serving/worker code can depend on a stable API while backends keep ownership of KV/cache/state internally.

Changes:

Adds new LLMEngine / LLMSession interfaces plus supporting structs (SamplingConfig, DecodeResult, LLMServingCapacity).
Exports the new header via the Bazel runner library target.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
extension/llm/runner/llm_session.h	Adds new engine/session interfaces and associated data types for a stable serving contract.
extension/llm/runner/targets.bzl	Exports the new header in the runner library’s public headers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mergennachin · 2026-06-12T20:56:22Z

@claude Review this PR

claude · 2026-06-12T20:57:18Z

Claude finished @mergennachin's task in 1m 16s —— View job

PR Review: `llm_runner: add Engine and Session interfaces`

Read changed files (llm_session.h, targets.bzl)
Analyze interface design & contract consistency
Check against Copilot review comments
Post review feedback

Reviewed the two changed files. This is a clean, well-scoped interface-only PR — no concrete adapters, no behavior change. The documentation is unusually thorough for a header, and the Engine/Session split (immutable model resources vs. per-conversation mutable state) is a sound contract for the serving path. Overall looks good to merge; comments below are minor.

On the Copilot comments — all already addressed ✅

The Copilot review was generated at an earlier snapshot. The current head of the branch already resolves every one of its six comments:

SamplingConfig, DecodeResult, LLMServingCapacity are all marked ET_EXPERIMENTAL (lines 31, 51, 62).
The cooperative-stop() terminal-step contract is now fully specified in three consistent places: DecodeResult (lines 47–50: "token_id must be 0, text_piece must be empty, position() must not advance"), decode_one() (lines 108–110), and stop() (lines 129–135).
The decode_one() doc no longer contradicts stop() — it explicitly states the short-circuit path.

No action needed there.

Minor observations

DecodeResult has no default member initializers (lines 52–55). The other three structs (SamplingConfig, LLMServingCapacity) initialize every field, but DecodeResult's four members are left uninitialized. An adapter that forgets a field yields a garbage is_eos/is_terminal — a nasty bug since is_terminal drives the decode loop. Recommend defaulting for consistency and safety:
```
struct ET_EXPERIMENTAL DecodeResult {
  uint64_t token_id = 0;
  std::string text_piece;
  bool is_eos = false;
  bool is_terminal = false;
};
```
Namespace style differs from the sibling header. This file uses C++17 nested namespace executorch::extension::llm { (line 25), while irunner.h in the same directory uses the older namespace executorch { namespace extension { namespace llm { form. Both compile fine; the new form is cleaner. Just flagging the inconsistency within one directory in case the codebase has a convention.
Relationship to GenerationConfig/IRunner is worth a one-line note. SamplingConfig overlaps conceptually with GenerationConfig::temperature from irunner.h, but operates at a different altitude (per-decode-step vs. per-generation). The header comment (lines 9–12) explains the adapter relationship well; a sentence clarifying that this Session contract sits below the existing IRunner/GenerationConfig layer (rather than replacing it) would help future readers who land on both files.
metadata() returning const std::unordered_map<std::string, int64_t>& ties the engine contract to a specific container type. That's consistent with how runner metadata is passed elsewhere in this codebase, so it's reasonable — just noting it constrains adapters to store metadata in exactly that shape.

None of these are blocking. The only one I'd actively recommend acting on is #1 (default-initialize DecodeResult) since uninitialized is_terminal is a real footgun. Fix this →

targets.bzl: the single-line addition correctly exports llm_session.h from the runner_lib target alongside the other public headers. ✓
· branch llm-pr-c-llm-engine-session-interface

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Gasoonjia · 2026-06-12T22:26:37Z

+  /// NotSupported for models whose state cannot be safely rewound (for example,
+  /// non-KV-cache, sliding-window, or recurrent-state models); callers should
+  /// fall back to reset() + full prefill.
+  virtual ::executorch::runtime::Error seek(int64_t pos) = 0;


why we have to have seek function in the top-level llm_session? For speculative decoding?

Good point, for now, we won't need it anymore. Originally it was needed for cache warm up, but Qwen/Gemma hybrid state does not support arbitrary rewind cleanly.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

+  /// caller must call reset() (and only proceed once it returns Ok) before any
+  /// further prefill/decode; it must NOT retry the failed call.
+  virtual ::executorch::runtime::Error prefill_tokens(
+      std::vector<uint64_t> tokens,


+  /// POISONED -- position() may no longer agree with resident state. The
+  /// caller must call reset() (and only proceed once it returns Ok) before any
+  /// further prefill/decode; it must NOT retry the failed call.
+  virtual ::executorch::runtime::Error prefill_tokens(


+
+  /// Build a new session that reuses this engine's model resources and owns
+  /// its own mutable generation state.
+  virtual ::executorch::runtime::Result<std::unique_ptr<LLMSession>>


The LLM serving path needs a stable contract between generic serving code and model-specific execution code. TextLLM, Qwen, Gemma, CUDA, and future backends all differ in how they own weights and mutable state, but the server should not know those details or grow a new Python binding for every model. This introduces the minimal runner-level split needed for that contract. LLMEngine represents the loaded physical model and its serving capacity; LLMSession represents one logical conversation state and exposes reset/prefill/decode-style operations. That shape lets a worker drive different model implementations through one interface while keeping KV/recurrent/cache ownership inside C++. This commit is only the interface and build export. It deliberately does not add a concrete adapter or change existing runner behavior, so model migrations and serving can be reviewed as downstream uses of the contract rather than hidden side effects.

Copilot AI review requested due to automatic review settings June 12, 2026 20:13

mergennachin requested a review from larryliu0820 as a code owner June 12, 2026 20:13

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026

mergennachin temporarily deployed to cadence June 12, 2026 20:14 — with GitHub Actions Inactive

Copilot started reviewing on behalf of mergennachin June 12, 2026 20:14 View session

mergennachin requested review from kirklandsign and psiddh June 12, 2026 20:14

Copilot AI reviewed Jun 12, 2026

View reviewed changes

mergennachin force-pushed the llm-pr-c-llm-engine-session-interface branch from 650b123 to 5201719 Compare June 12, 2026 20:55

mergennachin had a problem deploying to cadence June 12, 2026 20:55 — with GitHub Actions Error

Copilot AI review requested due to automatic review settings June 12, 2026 21:12

mergennachin force-pushed the llm-pr-c-llm-engine-session-interface branch from 5201719 to 6b74ce3 Compare June 12, 2026 21:12

Copilot started reviewing on behalf of mergennachin June 12, 2026 21:12 View session

mergennachin had a problem deploying to cadence June 12, 2026 21:13 — with GitHub Actions Error

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread extension/llm/runner/llm_session.h

Comment thread extension/llm/runner/targets.bzl

mergennachin force-pushed the llm-pr-c-llm-engine-session-interface branch from 6b74ce3 to e94d710 Compare June 12, 2026 21:23

mergennachin temporarily deployed to cadence June 12, 2026 21:24 — with GitHub Actions Inactive

Gasoonjia reviewed Jun 12, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 15, 2026 14:48

mergennachin force-pushed the llm-pr-c-llm-engine-session-interface branch from e94d710 to 9f5b657 Compare June 15, 2026 14:48

mergennachin had a problem deploying to cadence June 15, 2026 14:48 — with GitHub Actions Error

Copilot started reviewing on behalf of mergennachin June 15, 2026 14:48 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

mergennachin force-pushed the llm-pr-c-llm-engine-session-interface branch from 9f5b657 to b23f126 Compare June 15, 2026 15:01

mergennachin temporarily deployed to cadence June 15, 2026 15:03 — with GitHub Actions Inactive

Conversation

mergennachin commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20243

❌ 2 New Failures, 4 Pending

Uh oh!

github-actions Bot commented Jun 12, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mergennachin commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: llm_runner: add Engine and Session interfaces

On the Copilot comments — all already addressed ✅

Minor observations

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Gasoonjia Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

mergennachin Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mergennachin commented Jun 12, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 12, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 12, 2026 •

edited

Loading

PR Review: `llm_runner: add Engine and Session interfaces`