feat: TaskScope ownership and Task Registry introspection for spawned tasks#905
Open
qdot wants to merge 20 commits into
Open
feat: TaskScope ownership and Task Registry introspection for spawned tasks#905qdot wants to merge 20 commits into
qdot wants to merge 20 commits into
Conversation
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Introduces the Task Registry in buttplug_core::util::task -- the global record of every live task. Provides TaskId, TaskInfo, TaskEvent, and TaskOutcome types, segment-aware prefix matching (live_count_under), broadcast lifecycle events, and wait_empty_under for shutdown synchronization. The scope module re-export is commented out until Task 2 lands TaskScope. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds spawn_detached() to buttplug_core::util::task for spawning tasks
with no owning scope. Registered under "detached/{name}" so they remain
visible in registry snapshots, but nothing can cancel them. Provides
Send and non-Send (wasm) cfg variants for parity.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…haviour - TaskId: document that u64 counter wrap is not a practical concern at any realistic task spawn rate within a single process lifetime. - TaskRegistry: document that DashMap shard capacity is not reclaimed after deregister, making peak concurrent-task count a memory high-water mark for the process lifetime. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Task 4: ServerDeviceManager now owns a device-manager root TaskScope instead of a bare CancellationToken. The event loop is spawned through the scope (receiving a child token), and a 'devices' child scope is threaded into the event loop so every device gets its own per-device scope at bringup. Shutdown cancels the scope and drains the subtree via registry().wait_empty_under(path) since TaskScope is not Clone. Task 5: DeviceHandle carries an Arc<TaskScope> (it is Clone, so the subtree cancels when the last clone drops). build_device_handle takes the per-device scope, wraps it in an Arc, and spawns both the device IO task and the event-forwarding task through it. run_device_task and the forwarding loop each gain a token.cancelled() select arm (the device task's arm sits directly after 'biased;' so cancellation wins over new work). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Rewrite PingTimer to be owned by a TaskScope: the timer task selects on its cancellation token and exits cooperatively, so the PingMessage::End variant and the entire impl Drop for PingTimer (the Drop-spawn hack) are deleted. Dropping the timer now drops its scope, which cancels the task. server_builder creates the per-instance "server" root scope. The ping-timeout FnOnce callback moves a child scope in and uses spawn_and_hold for the stop-devices task. PingTimer::new and ButtplugServer::new gain the scope parameter; the server stores it. ButtplugServer::shutdown cancels its own scope and awaits only its own subtree via registry().wait_empty_under -- not the device manager's subtree, since a shared device manager outlives the server and handles its own shutdown. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add an owned task_scope: TaskScope parameter to
ProtocolHandler::handle_input_subscribe_cmd. DeviceHandle's wrapper
hands each handler a self.task_scope.child("subscription") scope so
subscription event-handler tasks are owned by the per-device scope and
get cancelled on device removal.
kgoal_boost's internal subscription spawn becomes
task_scope.spawn_and_hold with a biased token.cancelled() select arm
wrapping the hardware event recv loop, so a blocked recv is interrupted
on cancellation. The existing receiver_count/subscribed_sensors
early-exit is retained as a nobody-listening optimization.
galaku overrides the trait but performs no internal spawn, so it only
gains the (unused) parameter. kiiroo_v21's override is entirely inside
a block comment and is not a compiled overrider, so it is untouched.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Brings a real test-harness device online (which spawns per-device io and event-forwarding tasks plus the device-manager event loop), then asserts the task registry drains back to baseline once shutdown() returns -- before dropping the server, so ordinary channel-drop teardown cannot mask a missing cooperative-cancellation arm. A second post-drop check confirms nothing is stranded. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Bringup-during-shutdown race (server_device_manager_event_loop.rs):
Replace raw `async_manager::spawn` for device bringup with
`devices_scope.spawn("bringup-{address}", ...)` so the bringup task
is registered in the Task Registry for its full duration.
Ordering guarantee: io and event-forwarding tasks register inside
build_device_handle (synchronously, before it returns) and therefore
before the bringup wrapper deregisters — the registry count never
momentarily hits zero while work remains.
- Drop comment contradicts code (server_device_manager.rs:403-409):
Fix misleading comment that said "we only need to log here" while
the code still called self.task_scope.cancel() explicitly. Updated
comment now correctly states the explicit cancel is intentional and
fires eagerly before field drop order takes effect.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add emit_task_events flag to EngineOptions/EngineOptionsExternal/EngineOptionsBuilder mirroring the emit_output_observations pattern. Add TaskStarted, TaskEnded, TaskList EngineMessage variants, TaskListEntry struct, and RequestTaskList IntifaceMessage variant. Add stub match arm in frontend/mod.rs (Task 10 will wire the real logic). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
When emit_task_events is enabled in EngineOptions, the frontend server event loop subscribes to buttplug_core::util::task::registry().event_stream() before entering the select loop and forwards TaskStarted/TaskEnded as EngineMessages. When disabled, the arm is inert via the Option+pending pattern. RequestTaskList is now answered with a registry snapshot mapped into TaskListEntry values and sent back as EngineMessage::TaskList. The emit_task_events flag is read from EngineOptions at spawn time in engine.rs and passed into frontend_server_event_loop, matching how other flags (e.g. emit_output_observations) flow through the same code path. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Document the new TaskScope/TaskRegistry ownership and introspection infrastructure (buttplug_core::util::task) as a Key Pattern, note the server's scope ownership, the ProtocolHandler subscribe signature change, and the intiface_engine emit_task_events plumbing. Add test_task_lifecycle.rs to the integration test crate context. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- DeviceHandle::new: add #[allow(clippy::too_many_arguments)] with explanatory comment; constructor bundles all device-state concerns and restructuring is not warranted for a pub(crate) constructor - CONTEXT.md Task Scope entry: fix example path from server/device-manager/device-3/keepalive to device-manager-1/devices/device-3/io to match the shipped two-root design (independent server-N and device-manager-N roots) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ServerDeviceManager::shutdown() cancelled the task scope synchronously before the returned future awaited stop_scanning/stop_devices/disconnect. Device io tasks select biased on their cancellation token, so queued stop commands were dropped and devices could keep running through disconnect. Capture the scope's CancellationToken and path synchronously, run cleanup first, then cancel and wait_empty_under. Align ButtplugServer::shutdown to the same cleanup-before-cancel ordering. Adds a regression test exercising the StopScanning-through-event-loop path with a connected device and active scanning. The stronger 'observe the actual stop write' variant is infeasible with the test harness: the test hardware sets a 1ms message_gap, so the disconnect teardown races the batched stop write independent of cancel ordering. The test instead asserts shutdown drives cleanup through the live event loop, drains every scope task, and returns Ok. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The bringup task spawned in the device-manager event loop ignored its cancellation token, so build_device_handle could stall indefinitely (e.g. a BLE connect that never completes). ServerDeviceManager::shutdown's wait_empty_under then waited forever for the bringup task to deregister. Wrap build_device_handle in a biased tokio::select! against token.cancelled(); on cancellation we log at info and fall through. Dropping the build_device_handle future drops device_scope, cancelling anything it already spawned. connecting_devices.remove(&address) still runs unconditionally as the last statement on every path (success, error, and cancellation) so a future scan can retry the address. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
finish_task deregistered a scoped task only after fut.await, so a panicking task unwound past deregistration and leaked its registry entry forever, hanging wait_empty_under on that subtree. Replace the post-await deregistration with a DeregisterGuard constructed BEFORE awaiting; its Drop impl deregisters with outcome Panicked when std::thread::panicking(), Cancelled when the token is cancelled, else Completed. The guard covers the spawn / spawn_and_hold paths (via finish_task) and spawn_detached (token None -> Panicked on panic, else Completed). Adds the Panicked variant to TaskOutcome, documents it on the intiface EngineMessage::TaskEnded outcome field, and adds a unit test asserting wait_empty_under resolves within a timeout after a scoped task panics (verified RED against the old post-await deregistration). Also softens the task module doc, which overstated that every task is scope-owned, to note buttplug_server is fully migrated while other crates follow up. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
test_spawn_and_hold_keeps_scope_alive now subscribes to the registry event stream before spawning and asserts the held task's Ended event carries TaskOutcome::Completed (not Cancelled) — proving spawn_and_hold does not wire drop-cancel to the normally-finishing held task. Lagged broadcast reads are tolerated so the assertion stays stable under parallel test load. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sion test Addresses code review of the task-scope-lifecycle PR-response fixes: - Fix 1: rename test_shutdown_runs_cleanup_through_event_loop_before_cancel to test_shutdown_under_load_drains_subtree and rewrite its docstring honestly. It is not RED-capable for the cleanup-before-cancel ordering (reverting that fix leaves it green), so it is documented as a shutdown-under-load smoke test that guards the coarser hang/leak failure mode. The existing NOTE on why the write-observation variant is infeasible (1ms message_gap teardown race) is kept; the inherently-flaky instrumented-ordering variant is explicitly not pursued. - Fix 2: add test_shutdown_resolves_with_stalled_bringup plus a StallingDeviceCommunicationManager util whose connect() never resolves. With a device bringup stalled in connect(), shutdown() must still resolve within 10s via the biased select on the bringup token. Verified RED: reverting the bringup select to the non-cancellable |_token| form makes shutdown hang and the test time out at 10s. - Parallel-stability: the leak-check tests derived their device-manager scope prefix by scanning the global registry for any new device-manager-N root, which could pick a concurrent test's root and flakily report leaked tasks. Add ServerDeviceManager::scope_path() and use server.device_manager(). scope_path() so each test inspects only its own subtree. Full file is now stable across 12 consecutive runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces fire-and-forget task spawning with an ownership tree. Every spawned task in
buttplug_servernow has an owner, a hierarchical name, a cooperative cancellation path, and a registry entry — so "what is the server running right now" is answerable in-process, via tracing, and from frontends.buttplug_core::util::task(new):TaskScopeownership tree —root/childpath derivation (device-manager-1/devices/device-3/io), cooperativeCancellationTokenhierarchy, cancel-on-drop,spawn_and_holdforFnOnce/handed-scope cases,shutdown()that drains the subtree. GlobalTaskRegistrywithsnapshot(),live_count_under(),event_stream(), andwait_empty_under()(segment-aware prefix matching, subscribe-before-count).spawn_detachedescape hatch for the rare unowned task. The publicAsyncManagertrait is unchanged; WASM parity throughout (noJoinHandle, no abort).buttplug_servermigrated:ServerDeviceManagerowns adevice-managerroot scope with per-device child scopes (io, event-forwarding, bringup);ButtplugServerowns aserverroot scope;PingTimeris scope-owned —PingMessage::Endand the spawn-in-Drop hack are deleted;ProtocolHandler::handle_input_subscribe_cmdgains aTaskScopeparameter (kgoal_boost subscription task migrated). Shutdown now waits for the subtree to drain instead of cancel-and-hope.intiface_engineintrospection (opt-in viaEngineOptions::emit_task_events, mirrors the Output Observation pattern):EngineMessage::TaskStarted/TaskEndedevent stream plusIntifaceMessage::RequestTaskList→EngineMessage::TaskListsnapshot, so Intiface Central can display the live task tree.Review-response fixes (second push)
External model review found two P1s and a P2, all confirmed and fixed:
6efcd78b): shutdown cancelled the scope before stop-scanning/stop-devices/disconnects ran, so queued stop commands lost thebiasedselect race to cancellation. Restored master's cleanup-then-cancel order in bothServerDeviceManager::shutdownandButtplugServer::shutdown(token cloned; cancel now happens inside the future after cleanup).a53fcb2e): the bringup task ignored its token, so a stalled BLE connect would hangwait_empty_under(and thus shutdown) forever.build_device_handleis now wrapped in a biased select against the token;connecting_devicescleanup runs on every path. RED-verified regression test with a newStallingDeviceCommunicationManagertest util (shutdown resolves in 0.25s with the fix; hangs past 10s without).49f8df4a): a panicking scoped task unwound past deregistration, leaking its registry entry forever and hanging anywait_empty_underon that subtree. Deregistration now happens via a drop guard; newTaskOutcome::Panicked(detected viastd::thread::panicking()). RED-verified unit test.Latent hazards removed (original push)
PingTimer::Dropspawning a task (panic path when no runtime is active)ServerDeviceManager::shutdownreturning before tasks actually stoppedreceiver_count()exit only ran after receiving an event)Known issue surfaced during testing (pre-existing on master, NOT addressed here)
With command batching enabled (
message_gap),stop_devices()resolves when stop commands are queued, not written. A disconnect immediately after can tear down the device io task inside the batch window and drop the pending stop write — i.e. a device may not receive its stop command during shutdown. This exists on master independent of this branch (it's why the shutdown-ordering test is framed as a smoke test rather than a write-observation test). Recommend a follow-up issue: flush pending batched commands before io-task teardown.Deferred to follow-up (documented in plan)
spawn!sites, hwmgr/client/transport crates, then deleting barespawn!(module docs scoped accordingly)kiiroo_v21::handle_input_subscribe_cmdis entirely commented out — input features for that protocol are silently dead (pre-existing; flagged during review)Docs
CLAUDE.md: new Task Lifecycle pattern section; formatting commands corrected tocargo +nightly fmt(stable fmt silently ignores this repo's nightly-only rustfmt options and rewrites the workspace)CONTEXT.md: Task Scope / Task Registry domain termsTest Plan
cargo test -p buttplug_core— 15 tests incl. scope/registry/panic-deregistrationcargo test -p buttplug_server— 26 testscargo test -p buttplug_tests— incl. 808 protocol conformance, 0 failurestest_task_lifecycle.rs: leak test (RED-verified), shutdown-under-load smoke test, stalled-bringup regression test (RED-verified) — parallel-stable across repeated runswasm-pack build --dev crates/buttplug_server --no-default-features --features wasmemit_task_eventson — verify TaskStarted/TaskEnded stream and TaskList snapshot🤖 Generated with Claude Code