llm_runner: add Engine and Session interfaces #20243
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20243
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 4 Pending as of commit b23f126 with merge base d7ca5db.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This PR introduces a new, model-agnostic C++ contract for the LLM serving path by defining LLMEngine (loaded model/resources + serving capacity) and LLMSession (per-conversation mutable state with reset/prefill/decode operations). This establishes an interface boundary so serving/worker code can depend on a stable API while backends keep ownership of KV/cache/state internally.
Changes:
- Adds new LLMEngine/LLMSession interfaces plus supporting structs (SamplingConfig, DecodeResult, LLMServingCapacity).
- Exports the new header via the Bazel runner library target.
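To make the engine/session boundary concrete, here is a minimal sketch of how a worker could drive two different backends through one interface. It is not the contents of llm_session.h: the `sketch` namespace, the `Error` enum (a stand-in for `::executorch::runtime::Error`), the `decode_one` method, the `DecodeResult` fields, and both toy backends are assumptions made for illustration. SamplingConfig and LLMServingCapacity are left out to keep it short.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {

// Stand-in for ::executorch::runtime::Error (assumption, not the real type).
enum class Error { Ok, Internal };

// Hypothetical fields; this page names the struct but does not show its members.
struct DecodeResult { uint64_t token = 0; bool is_eos = false; };

// Per-conversation mutable state: reset/prefill/decode-style operations.
class LLMSession {
 public:
  virtual ~LLMSession() = default;
  virtual Error reset() = 0;
  virtual Error prefill_tokens(std::vector<uint64_t> tokens) = 0;
  virtual Error decode_one(DecodeResult* out) = 0;  // hypothetical name
};

// Loaded model; hands out sessions.
class LLMEngine {
 public:
  virtual ~LLMEngine() = default;
  virtual std::unique_ptr<LLMSession> create_session() = 0;
};

// Toy backend A: state is one integer; counts up from the last prompt token.
class CounterSession final : public LLMSession {
 public:
  Error reset() override { last_ = 0; return Error::Ok; }
  Error prefill_tokens(std::vector<uint64_t> tokens) override {
    if (tokens.empty()) return Error::Internal;
    last_ = tokens.back();
    return Error::Ok;
  }
  Error decode_one(DecodeResult* out) override {
    out->token = ++last_;
    out->is_eos = false;
    return Error::Ok;
  }
 private:
  uint64_t last_ = 0;
};

// Toy backend B: keeps the whole prompt and replays it in reverse, then EOS.
class ReverseSession final : public LLMSession {
 public:
  Error reset() override { pending_.clear(); return Error::Ok; }
  Error prefill_tokens(std::vector<uint64_t> tokens) override {
    pending_.insert(pending_.end(), tokens.begin(), tokens.end());
    return Error::Ok;
  }
  Error decode_one(DecodeResult* out) override {
    out->is_eos = pending_.empty();
    out->token = out->is_eos ? 0 : pending_.back();
    if (!out->is_eos) pending_.pop_back();
    return Error::Ok;
  }
 private:
  std::vector<uint64_t> pending_;
};

template <typename SessionT>
class SimpleEngine final : public LLMEngine {
 public:
  std::unique_ptr<LLMSession> create_session() override {
    return std::make_unique<SessionT>();
  }
};

// Worker code: sees only the abstract interface, never the backend's state.
std::vector<uint64_t> generate(LLMEngine& engine, std::vector<uint64_t> prompt,
                               int max_new_tokens) {
  std::vector<uint64_t> out;
  auto session = engine.create_session();
  if (!session || session->prefill_tokens(std::move(prompt)) != Error::Ok) return out;
  for (int i = 0; i < max_new_tokens; ++i) {
    DecodeResult r;
    if (session->decode_one(&r) != Error::Ok || r.is_eos) break;
    out.push_back(r.token);
  }
  return out;
}

}  // namespace sketch
```

`generate` never learns whether the backend keeps a single counter or a full token history, which is the kind of state ownership the overview says should stay inside the backend.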
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| extension/llm/runner/llm_session.h | Adds new engine/session interfaces and associated data types for a stable serving contract. |
| extension/llm/runner/targets.bzl | Exports the new header in the runner library’s public headers. |
Force-pushed from 650b123 to 5201719.
@claude Review this PR
Claude finished @mergennachin's task in 1m 16s.
Force-pushed from 5201719 to 6b74ce3.
Force-pushed from 6b74ce3 to e94d710.
    /// NotSupported for models whose state cannot be safely rewound (for example,
    /// non-KV-cache, sliding-window, or recurrent-state models); callers should
    /// fall back to reset() + full prefill.
    virtual ::executorch::runtime::Error seek(int64_t pos) = 0;
Why do we need a seek function in the top-level llm_session? Is it for speculative decoding?
Good point. We don't need it for now. It was originally needed for cache warm-up, but the Qwen/Gemma hybrid state does not support arbitrary rewind cleanly.
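Without seek(), a caller that wants to go back to an earlier position can still use the fallback the doc comment describes: reset() followed by a prefill of the prefix it wants to keep. A minimal sketch, assuming the caller retains the token history; `ToySession`, `rewind_by_replay`, and the local `Error` enum are hypothetical and only mimic the reset/prefill_tokens/position shape seen in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for ::executorch::runtime::Error (assumption, not the real type).
enum class Error { Ok, InvalidArgument };

// Hypothetical session that only records how many tokens are resident.
class ToySession {
 public:
  Error reset() { pos_ = 0; return Error::Ok; }
  Error prefill_tokens(std::vector<uint64_t> tokens) {
    pos_ += static_cast<int64_t>(tokens.size());
    return Error::Ok;
  }
  int64_t position() const { return pos_; }
 private:
  int64_t pos_ = 0;
};

// Rewind to `pos` by replaying history[0, pos) into a freshly reset session.
// Works for any backend, including ones whose state cannot be rewound in place.
template <typename Session>
Error rewind_by_replay(Session& session, const std::vector<uint64_t>& history,
                       int64_t pos) {
  if (pos < 0 || static_cast<std::size_t>(pos) > history.size()) {
    return Error::InvalidArgument;  // rejected before touching session state
  }
  Error err = session.reset();
  if (err != Error::Ok) return err;
  if (pos == 0) return Error::Ok;  // nothing to replay
  std::vector<uint64_t> prefix(history.begin(), history.begin() + pos);
  return session.prefill_tokens(std::move(prefix));
}
```

The cost is a full re-prefill of the prefix, which is the trade-off for not requiring every backend to support arbitrary rewind.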
Force-pushed from e94d710 to 9f5b657.
    /// POISONED -- position() may no longer agree with resident state. The
    /// caller must call reset() (and only proceed once it returns Ok) before any
    /// further prefill/decode; it must NOT retry the failed call.
    virtual ::executorch::runtime::Error prefill_tokens(
    std::vector<uint64_t> tokens,
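The poisoned-state rule in this doc comment can be sketched from the caller's side as follows. This is an illustration under assumptions, not code from the PR: `FlakySession`, `kBadToken`, `Outcome`, `prefill_or_recover`, and the local `Error` enum are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for ::executorch::runtime::Error (assumption, not the real type).
enum class Error { Ok, Internal };

// Hypothetical sentinel that makes the toy session fail mid-prefill.
constexpr uint64_t kBadToken = 0xDEAD;

class FlakySession {
 public:
  Error reset() { resident_ = 0; pos_ = 0; return Error::Ok; }
  Error prefill_tokens(std::vector<uint64_t> tokens) {
    ++prefill_calls_;
    for (uint64_t t : tokens) {
      if (t == kBadToken) {
        // Simulate poisoning: position() advances past what is resident.
        ++pos_;
        return Error::Internal;
      }
      ++resident_;
      ++pos_;
    }
    return Error::Ok;
  }
  int64_t position() const { return pos_; }
  int prefill_calls() const { return prefill_calls_; }
 private:
  int64_t resident_ = 0;
  int64_t pos_ = 0;
  int prefill_calls_ = 0;
};

enum class Outcome { Ok, ResetAndEmpty, Unusable };

// On failure: never retry the failed call. Reset first, and treat the session
// as usable again only if reset() itself returned Ok.
template <typename Session>
Outcome prefill_or_recover(Session& session, std::vector<uint64_t> tokens) {
  if (session.prefill_tokens(std::move(tokens)) == Error::Ok) return Outcome::Ok;
  if (session.reset() != Error::Ok) return Outcome::Unusable;
  return Outcome::ResetAndEmpty;  // caller must rebuild context from scratch
}
```

After `ResetAndEmpty` the session holds no context, so the caller decides whether to re-prefill the whole conversation or surface the error.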
    /// Build a new session that reuses this engine's model resources and owns
    /// its own mutable generation state.
    virtual ::executorch::runtime::Result<std::unique_ptr<LLMSession>>
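One way to read "reuses this engine's model resources and owns its own mutable generation state" is sketched below. Everything here is an assumption made for illustration: the `max_sessions` field, the `ModelResources` struct, the live-session counter, and returning `nullptr` when capacity is exhausted (the real method returns a `Result<std::unique_ptr<LLMSession>>`, whose error behavior this page does not show).

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical field; the PR names LLMServingCapacity but its members are not shown.
struct LLMServingCapacity { int32_t max_sessions = 1; };

// Immutable data shared by every session (stands in for loaded weights).
struct ModelResources { std::vector<float> weights; };

class Session {
 public:
  Session(std::shared_ptr<const ModelResources> res, std::shared_ptr<int32_t> live)
      : res_(std::move(res)), live_(std::move(live)) { ++*live_; }
  ~Session() { --*live_; }  // frees a capacity slot
  Session(const Session&) = delete;
  Session& operator=(const Session&) = delete;

  void prefill_tokens(const std::vector<uint64_t>& tokens) {
    pos_ += static_cast<int64_t>(tokens.size());
  }
  int64_t position() const { return pos_; }
  const ModelResources* resources() const { return res_.get(); }

 private:
  std::shared_ptr<const ModelResources> res_;  // shared, read-only
  std::shared_ptr<int32_t> live_;              // live-session counter
  int64_t pos_ = 0;                            // per-session mutable state
};

class Engine {
 public:
  Engine(std::vector<float> weights, LLMServingCapacity cap)
      : res_(std::make_shared<ModelResources>(ModelResources{std::move(weights)})),
        cap_(cap),
        live_(std::make_shared<int32_t>(0)) {}

  LLMServingCapacity capacity() const { return cap_; }

  // Returns nullptr when all capacity slots are taken (assumed behavior).
  std::unique_ptr<Session> create_session() {
    if (*live_ >= cap_.max_sessions) return nullptr;
    return std::make_unique<Session>(res_, live_);
  }

 private:
  std::shared_ptr<const ModelResources> res_;
  LLMServingCapacity cap_;
  std::shared_ptr<int32_t> live_;
};

}  // namespace sketch
```

The sketch is single-threaded; a serving implementation would need synchronization around the session counter.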
The LLM serving path needs a stable contract between generic serving code and model-specific execution code. TextLLM, Qwen, Gemma, CUDA, and future backends all differ in how they own weights and mutable state, but the server should not know those details or grow a new Python binding for every model.
This introduces the minimal runner-level split needed for that contract. LLMEngine represents the loaded physical model and its serving capacity; LLMSession represents one logical conversation state and exposes reset/prefill/decode-style operations. That shape lets a worker drive different model implementations through one interface while keeping KV/recurrent/cache ownership inside C++.
This commit is only the interface and build export. It deliberately does not add a concrete adapter or change existing runner behavior, so model migrations and serving can be reviewed as downstream uses of the contract rather than hidden side effects.
Force-pushed from 9f5b657 to b23f126.
#20001