Skip to content

cuda: add per-session mutable state rebinding#20241

Merged
mergennachin merged 1 commit into
mainfrom
llm-pr-a-cuda-mutable-state
Jun 15, 2026
Merged

cuda: add per-session mutable state rebinding#20241
mergennachin merged 1 commit into
mainfrom
llm-pr-a-cuda-mutable-state

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Local agent serving needs to host multiple logical conversations on one CUDA-resident model without multiplying the model weights. Loading one AOTI module per conversation is not viable for large local models, while sharing the default mutable state across conversations would let KV/recurrent/conv buffers bleed between users.

This adds the CUDA-private foundation for separating those concerns: weights remain owned by the loaded AOTI container, while mutable buffer FQNs can be registered as per-session state and rebound before execution. The path is fail-closed and dormant until a model opts in by creating a mutable-state context and validating coverage, so existing CUDA models keep their current behavior.

The branch also wires the new source and fall-closed unit test into both Buck and CMake so the primitive can land independently before any model-specific engine consumes it.

#20001

Copilot AI review requested due to automatic review settings June 12, 2026 20:11
@pytorch-bot

pytorch-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20241

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 26 Pending, 1 Unrelated Failure

As of commit 9f52f6a with merge base d7ca5db (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026
@mergennachin mergennachin requested a review from Gasoonjia June 12, 2026 20:11
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds CUDA-private infrastructure to support per-session rebinding of mutable model state (e.g., KV/conv/recurrent buffers) while keeping model weights shared across logical conversations, enabling multi-session local serving without duplicating large CUDA-resident weights.

Changes:

  • Introduces cuda_mutable_state context/session manager with descriptor capture and per-execute rebinding hooks.
  • Integrates mutable-state hooks into CudaBackend::init (handle noting) and CudaBackend::execute (rebind-before-run).
  • Adds a fail-closed unit test and wires it into both Buck and CMake builds.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
backends/cuda/runtime/cuda_mutable_state.h Declares CUDA-private context/session API for mutable-state rebinding.
backends/cuda/runtime/cuda_mutable_state.cpp Implements context/session tracking, template capture, buffer allocation, and rebind logic.
backends/cuda/runtime/cuda_backend.cpp Calls into mutable-state hooks during init and before execute.
backends/cuda/runtime/test/test_cuda_mutable_state.cpp Adds unit test covering fail-closed defaults.
backends/cuda/runtime/TARGETS Adds new source/header to CUDA backend and registers Buck unit test target.
backends/cuda/CMakeLists.txt Adds new source to CMake backend build and registers CMake unit test target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp Outdated
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
@Gasoonjia

Copy link
Copy Markdown
Contributor

@claude review the PR

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @Gasoonjia's task in 1m 53s —— View job


Code Review: cuda — per-session mutable state rebinding

Reviewed all 6 changed files against main. This is a clean, well-isolated foundational primitive. The design is genuinely fail-closed: it stays dormant unless an engine opts in via create_context + begin_load/end_load + register_fqns + validate_coverage, and every error path returns a non-Ok Error rather than silently allocating or binding.

The three earlier Copilot comments are already resolved in the current revision, each with a matching regression test:

  • mutable_state_rebind_for_execute validates the token against c.sessions before allocating (cuda_mutable_state.cpp:495), covered by RebindRejectsUncreatedSessionToken.
  • mutable_state_begin_load guards against nested scopes and fails closed by poisoning build_error (:303), covered by NestedBeginLoadFailsClosed.
  • mutable_state_forget_handle exists and is called from CudaBackend::destroy (cuda_backend.cpp:840), covered by ForgetHandleDropsAssociation.

Nice work tightening those.


Findings

1. Shared per-FQN template/buffer can be under-sized if two methods declare the same mutable FQN with different shapescuda_mutable_state.cpp:165 / :216

The template (template_ptr/template_nbytes) and the session buffer are keyed by FQN and captured/sized once, from the first handle that discovers the FQN (:165). But the wrapper in ensure_bound uses this handle's Desc sizes/strides/dtype (:233-238) over that shared buffer. If a later method's container exposes the same FQN with a larger nbytes, the SlimTensor describes more bytes than were allocated → out-of-bounds device access / silent corruption, with no diagnostic.

Per-session KV/conv FQNs likely have consistent shapes across methods today, so this is latent rather than active, but it's exactly the kind of mismatch this fail-closed layer should catch. A cheap guard in ensure_bound would close it:

if (d.nbytes > c.template_nbytes[fqn]) {
  ET_LOG(Error,
      "mutable_state: descriptor '%s' (%zu B) exceeds shared template buffer (%zu B)",
      fqn.c_str(), d.nbytes, c.template_nbytes[fqn]);
  return Error::Internal;
}

Fix this →

2. mutable_state_destroy_context leaves stale thread-local active sessioncuda_mutable_state.cpp:275

If a thread has set_active(ctx, token) and the context is destroyed (e.g. teardown races, or a buggy engine), tl_active_ctx/tl_active_token stay set. This is not a memory-safety bug — the next rebind_for_execute fails closed at the handle_ctx lookup (the handle's association was erased during destroy → Error::Internal at :463). Worth confirming this is the intended contract; the header already states the engine owns lifecycle/serialization, so a one-line note that destroy does not reset other threads' active selection would help future callers.

3. Multi-session is serialized, not concurrent — make sure this is loud enoughcuda_mutable_state.h:84

Because the AOTI container is shared and update_user_managed_constant_buffer_pairs rebinds it globally, two threads executing different sessions on the same container would race on the container's constant pointers. The header does document "the engine must hold its serialization lock across set + execute + read-out," which is the right contract. This is the key architectural constraint for the consuming engine; flagging it here so reviewers of the future engine PR don't miss it. No change needed in this PR.


Minor / non-blocking

  • mutable_state_rebind_for_execute calls update_user_managed_constant_buffer_pairs on every execute (not just first bind). This is correct for true multi-session interleaving (the container may have been rebound by another session since), but adds a per-execute rebind cost even in the common single-active-session case. Fine for a foundation; a future optimization could skip the rebind when the container is already pointed at this session's buffers.
  • GPU allocation paths (build_descriptors, ensure_session_buffers, ensure_bound with non-empty templates) are not exercised by the unit test — the fake handle reports get_num_constants == 0. The fail-closed control flow is well covered; the actual clone/rebind logic will only get coverage once a real model or a fake with non-zero constants drives it. Reasonable for a primitive landing ahead of its consumer, but worth a follow-up.
  • Build wiring (CMake :242-250, Buck :142-156) looks correct and symmetric; the test is registered in both with CUDA_AVAILABLE=1.

Overall this is solid, defensive, and well-documented. Finding #1 is the only one I'd consider addressing before the consuming engine lands; the rest are confirmations/notes.
• Branch: llm-pr-a-cuda-mutable-state

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp Outdated
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/test/test_cuda_mutable_state.cpp Outdated
@mergennachin mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from a9c517c to 0be0912 Compare June 12, 2026 20:50
@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude Review the PR

Double check if it addressed previous feedbacks from Copilot and Claude

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 1m 19s —— View job


Re-review: did the PR address prior Copilot & Claude feedback?

Checked all 6 changed files at 0be0912 against every prior review comment. Verdict: all 7 prior findings are resolved, each with a fail-closed code path and (where applicable) a matching regression test.

Copilot round 1 (2026-06-12T20:16)

  • Implicit session_buf[token] allocation for uncreated tokensrebind_for_execute now validates c.sessions.find(token) and returns InvalidArgument before any allocation (cuda_mutable_state.cpp:509-512). Test: RebindRejectsUncreatedSessionToken.
  • begin_load overwrites tl_loading_ctx unconditionally (nesting) → guarded; nested entry poisons both contexts' build_error = InvalidState, logs, and clears the thread-local (:314-329). Test: NestedBeginLoadFailsClosed.
  • No way to unregister a destroyed handlemutable_state_forget_handle added (:457-470), called from CudaBackend::destroy (cuda_backend.cpp:840). Test: ForgetHandleDropsAssociation.

Copilot round 2 (2026-06-12T20:44)

  • Rebinding under CUDA graph capture/replay silently no-ops → rejected with NotSupported when cuda_graph_state.phase != Disabled (:513-518). Test: RebindRejectsCudaGraphHandle (the test Copilot explicitly asked for).
  • create_session succeeds despite earlier build_error → now returns build_error first (:402-404), matching validate_coverage.
  • available() ignores build_error → now folds it in: build_error == Error::Ok && symbols_available (:355-356).
  • Missing CUDA-graph regression test → added (see above).

Claude round 1 (2026-06-12T20:38) — Finding #1 (the only one flagged as worth fixing pre-merge)

  • Shared per-FQN buffer under-sized if a later method declares the same FQN with a larger nbytesensure_bound now guards d.nbytes > template_nbytes[fqn] (and missing-template), returning Error::Internal with a diagnostic before wrapping the buffer (:232-243). This closes the out-of-bounds GPU access path.

Findings #2 (stale thread-local active session after destroy) and #3 (multi-session serialization contract) were notes/confirmations, not change requests — both remain correctly fail-closed at the handle_ctx/build_error lookups.

Build wiring

  • CMake: source CMakeLists.txt:188, test target :246-249 with CUDA_AVAILABLE=1.
  • Buck: source TARGETS:109/113, test target :143-145. Symmetric with CMake.

Remaining observations (non-blocking, no prior comment)

  1. cudaMalloc template leak on partial build_descriptors failure:165-181: each FQN's template is cudaMalloc'd and recorded in c.template_ptr as the loop proceeds. If a later FQN's cudaMalloc/cudaMemcpy fails and returns Error::Internal, the already-allocated earlier templates are only freed later by mutable_state_destroy_context. Since the engine is contractually required to destroy the context on a non-Ok build, this is reclaimed — but worth a one-line note that cleanup is deferred to destroy_context, not immediate.

  2. GPU clone/rebind logic still has no positive-path coverage — the fake handle reports get_num_constants == 0, so build_descriptors/ensure_session_buffers/ensure_bound (including the new size guard at :232-243) never execute their allocation bodies in tests. The fail-closed control flow is thoroughly covered; the actual copy-and-bind path will only get exercised once a real model or a fake with non-zero constants drives it. Reasonable for a primitive landing ahead of its consumer — flagging for a follow-up. Add coverage →

Summary: every actionable item from both Copilot rounds and Claude's blocking Finding #1 is addressed in code with corresponding tests. No regressions introduced. The two observations above are follow-ups, not merge blockers.
• Branch: llm-pr-a-cuda-mutable-state

@mergennachin mergennachin requested a review from shoumikhin June 12, 2026 20:53
Copilot AI review requested due to automatic review settings June 12, 2026 21:00
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp Outdated
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp Outdated
}
auto& m = mgr();
std::lock_guard<std::mutex> g(m.mu);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This holds one process-wide lock across the external rebind call and the GPU allocations below. That forces every model in the process to take turns on the hottest part of execute, and it can stall everything if the rebind call blocks or re-enters. Use a per-context lock instead, and copy out the values you need (the pairs, container handle, function pointer) under the lock, then release it before making the rebind call and the allocations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a correctness blocker; moving to per-context locks/copy-out safely is a follow-up refactor.

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/test/test_cuda_mutable_state.cpp
@mergennachin

Copy link
Copy Markdown
Contributor Author

Thanks for the reviews @shoumikhin and @Gasoonjia

@shoumikhin, I addressed all your comments except the per-context mutex. I'll do that in follow-up PR

@Gasoonjia

I've one question regarding kv cache sharing across methods: like prefill and decode are separate methods (each its own AOTI container / CudaDelegateHandle) exporting the same KV-cache FQN: since session_buf is keyed (token → fqn → buffer) and ensure_bound points every handle's wrapper at the same session_buf[token][fqn], can you confirm the intended behavior is that, under one active session token, both handles rebind to the same GPU buffer so decode continues exactly what prefill wrote? Or is the V1 primitive only meant for single-method per-session isolation? Any test you can add to guard the target behavoir?

Yes. The intended behavior is that prefill and decode share the same KV buffer for the same logical session. So if both methods declare the same mutable buffer FQN, then under one active session:

  • prefill writes into that session’s buffer
  • decode is rebind to that same buffer
  • decode continues from the state prefill produced

Different sessions still get different buffers, so state does not bleed across requests.

I added a unit test.

Another thing im worried about whether the session support impact cuda graph inference, since cuda graph inference needs to capture a static path but now the path may not be static since we are gonna change session, maybe we should move the cuda graph infortaion from CudaBackendHandle to cuda_mutable_state.

Yes valid concern. CUDA graph path remains unchanged for normal single-session inference. But I reject cuda_graph during multi session. A future optimization could support one captured graph per session or recapture after rebinding.

Copilot AI review requested due to automatic review settings June 15, 2026 14:56
@mergennachin mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from cb55702 to 5fb0034 Compare June 15, 2026 14:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment on lines +301 to +320
auto* t = reinterpret_cast<SlimTensor*>(it_t->second);
Desc d;
d.internal_name = it_name->second;
d.sizes.assign(t->sizes().begin(), t->sizes().end());
d.strides.assign(t->strides().begin(), t->strides().end());
d.dtype = t->dtype();
d.device = t->device();
d.nbytes = t->nbytes();
table.emplace(fqn, std::move(d));
c.discovered_fqns.insert(fqn);

if (c.template_ptr.find(fqn) == c.template_ptr.end()) {
// If a later FQN fails during this build, already captured templates are
// released by mutable_state_destroy_context().
auto device_res = tensor_cuda_device_index(*t);
ET_CHECK_OK_OR_RETURN_ERROR(device_res.error());
const int device = device_res.get();
CudaDeviceGuard guard;
ET_CHECK_OK_OR_RETURN_ERROR(guard.set(device));

Comment on lines +181 to +197
err = cudaMemcpy(
*device_ptr,
values.data(),
values.size() * sizeof(float),
cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
ADD_FAILURE() << "cudaMemcpy failed: " << cudaGetErrorString(err);
cudaFree(*device_ptr);
*device_ptr = nullptr;
return nullptr;
}
return std::make_unique<slim::SlimTensor>(slim::from_blob(
*device_ptr,
{static_cast<int64_t>(values.size())},
slimc10::ScalarType::Float,
slimc10::Device(slimc10::DeviceType::CUDA, 0)));
}
Comment thread backends/cuda/runtime/cuda_backend.cpp Outdated
Comment on lines +546 to +548
// If a mutable-state session is active on this thread, rebind this
// container's registered mutable buffers before running.
ET_CHECK_OK_OR_RETURN_ERROR(mutable_state_rebind_for_execute(handle));

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment on lines +488 to +500
void mutable_state_register_fqns(
MutableStateContext ctx,
const std::vector<std::string>& fqns) {
auto& m = mgr();
std::lock_guard<std::mutex> g(m.mu);
auto it = m.contexts.find(ctx);
if (it == m.contexts.end()) {
return;
}
it->second.fqns = fqns;
it->second.fqn_set.clear();
it->second.fqn_set.insert(fqns.begin(), fqns.end());
}
Comment on lines +561 to +568
if (!c.symbols_available) {
ET_LOG(
Error, "mutable_state: rebinding unavailable; cannot create session");
return Error::NotSupported;
}
int token = c.next_token++;
c.sessions.insert(token);
return token;
Comment on lines +76 to +83
ActiveSessionScope(MutableStateContext ctx, int token) {
detail::mutable_state_set_active(ctx, token);
}

~ActiveSessionScope() {
detail::mutable_state_set_active(
kInvalidMutableContext, kNoMutableSession);
}
}
}

ET_CHECK_OK_OR_RETURN_ERROR(mutable_state_rebind_for_execute(handle));
Local agent serving needs to host multiple logical conversations on one CUDA-resident model without multiplying the model weights. Loading one AOTI module per conversation is not viable for large local models, while sharing the default mutable state across conversations would let KV/recurrent/conv buffers bleed between users.

This adds the CUDA-private foundation for separating those concerns: weights remain owned by the loaded AOTI container, while mutable buffer FQNs can be registered as per-session state and rebound before execution. The path is fail-closed and dormant until a model opts in by creating a mutable-state context and validating coverage, so existing CUDA models keep their current behavior.

The branch also wires the new source and unit coverage into both Buck and CMake so the primitive can land independently before any model-specific engine consumes it.
@mergennachin mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from 35388a3 to 9f52f6a Compare June 15, 2026 18:56
@mergennachin mergennachin merged commit 182be0e into main Jun 15, 2026
288 of 293 checks passed
@mergennachin mergennachin deleted the llm-pr-a-cuda-mutable-state branch June 15, 2026 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants