Skip to content

feat: TaskScope ownership and Task Registry introspection for spawned tasks#905

Open
qdot wants to merge 20 commits into
masterfrom
feat/task-scope-lifecycle
Open

feat: TaskScope ownership and Task Registry introspection for spawned tasks#905
qdot wants to merge 20 commits into
masterfrom
feat/task-scope-lifecycle

Conversation

@qdot

@qdot qdot commented Jun 10, 2026

Copy link
Copy Markdown
Member

Summary

Replaces fire-and-forget task spawning with an ownership tree. Every spawned task in buttplug_server now has an owner, a hierarchical name, a cooperative cancellation path, and a registry entry — so "what is the server running right now" is answerable in-process, via tracing, and from frontends.

  • buttplug_core::util::task (new): TaskScope ownership tree — root/child path derivation (device-manager-1/devices/device-3/io), cooperative CancellationToken hierarchy, cancel-on-drop, spawn_and_hold for FnOnce/handed-scope cases, shutdown() that drains the subtree. Global TaskRegistry with snapshot(), live_count_under(), event_stream(), and wait_empty_under() (segment-aware prefix matching, subscribe-before-count). spawn_detached escape hatch for the rare unowned task. The public AsyncManager trait is unchanged; WASM parity throughout (no JoinHandle, no abort).
  • buttplug_server migrated: ServerDeviceManager owns a device-manager root scope with per-device child scopes (io, event-forwarding, bringup); ButtplugServer owns a server root scope; PingTimer is scope-owned — PingMessage::End and the spawn-in-Drop hack are deleted; ProtocolHandler::handle_input_subscribe_cmd gains a TaskScope parameter (kgoal_boost subscription task migrated). Shutdown now waits for the subtree to drain instead of cancel-and-hope.
  • intiface_engine introspection (opt-in via EngineOptions::emit_task_events, mirrors the Output Observation pattern): EngineMessage::TaskStarted/TaskEnded event stream plus IntifaceMessage::RequestTaskListEngineMessage::TaskList snapshot, so Intiface Central can display the live task tree.

Review-response fixes (second push)

External model review found two P1s and a P2, all confirmed and fixed:

  • Shutdown ordering regression (6efcd78b): shutdown cancelled the scope before stop-scanning/stop-devices/disconnects ran, so queued stop commands lost the biased select race to cancellation. Restored master's cleanup-then-cancel order in both ServerDeviceManager::shutdown and ButtplugServer::shutdown (token cloned; cancel now happens inside the future after cleanup).
  • Bringup hang (a53fcb2e): the bringup task ignored its token, so a stalled BLE connect would hang wait_empty_under (and thus shutdown) forever. build_device_handle is now wrapped in a biased select against the token; connecting_devices cleanup runs on every path. RED-verified regression test with a new StallingDeviceCommunicationManager test util (shutdown resolves in 0.25s with the fix; hangs past 10s without).
  • Panic-safe registry (49f8df4a): a panicking scoped task unwound past deregistration, leaking its registry entry forever and hanging any wait_empty_under on that subtree. Deregistration now happens via a drop guard; new TaskOutcome::Panicked (detected via std::thread::panicking()). RED-verified unit test.

Latent hazards removed (original push)

  • PingTimer::Drop spawning a task (panic path when no runtime is active)
  • ServerDeviceManager::shutdown returning before tasks actually stopped
  • kgoal subscription task strandable on a quiet device (receiver_count() exit only ran after receiving an event)

Known issue surfaced during testing (pre-existing on master, NOT addressed here)

With command batching enabled (message_gap), stop_devices() resolves when stop commands are queued, not written. A disconnect immediately after can tear down the device io task inside the batch window and drop the pending stop write — i.e. a device may not receive its stop command during shutdown. This exists on master independent of this branch (it's why the shutdown-ordering test is framed as a smoke test rather than a write-observation test). Recommend a follow-up issue: flush pending batched commands before io-task teardown.

Deferred to follow-up (documented in plan)

  • Remaining protocol output-loop spawn! sites, hwmgr/client/transport crates, then deleting bare spawn! (module docs scoped accordingly)
  • kiiroo_v21::handle_input_subscribe_cmd is entirely commented out — input features for that protocol are silently dead (pre-existing; flagged during review)

Docs

  • CLAUDE.md: new Task Lifecycle pattern section; formatting commands corrected to cargo +nightly fmt (stable fmt silently ignores this repo's nightly-only rustfmt options and rewrites the workspace)
  • CONTEXT.md: Task Scope / Task Registry domain terms

Test Plan

  • cargo test -p buttplug_core — 15 tests incl. scope/registry/panic-deregistration
  • cargo test -p buttplug_server — 26 tests
  • cargo test -p buttplug_tests — incl. 808 protocol conformance, 0 failures
  • test_task_lifecycle.rs: leak test (RED-verified), shutdown-under-load smoke test, stalled-bringup regression test (RED-verified) — parallel-stable across repeated runs
  • WASM: wasm-pack build --dev crates/buttplug_server --no-default-features --features wasm
  • Manual: Intiface Central against this engine with emit_task_events on — verify TaskStarted/TaskEnded stream and TaskList snapshot

🤖 Generated with Claude Code

qdot and others added 20 commits June 9, 2026 17:19
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Introduces the Task Registry in buttplug_core::util::task -- the global
record of every live task. Provides TaskId, TaskInfo, TaskEvent, and
TaskOutcome types, segment-aware prefix matching (live_count_under),
broadcast lifecycle events, and wait_empty_under for shutdown synchronization.

The scope module re-export is commented out until Task 2 lands TaskScope.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds spawn_detached() to buttplug_core::util::task for spawning tasks
with no owning scope. Registered under "detached/{name}" so they remain
visible in registry snapshots, but nothing can cancel them. Provides
Send and non-Send (wasm) cfg variants for parity.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…haviour

- TaskId: document that u64 counter wrap is not a practical concern at
  any realistic task spawn rate within a single process lifetime.
- TaskRegistry: document that DashMap shard capacity is not reclaimed
  after deregister, making peak concurrent-task count a memory
  high-water mark for the process lifetime.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Task 4: ServerDeviceManager now owns a device-manager root TaskScope
instead of a bare CancellationToken. The event loop is spawned through
the scope (receiving a child token), and a 'devices' child scope is
threaded into the event loop so every device gets its own per-device
scope at bringup. Shutdown cancels the scope and drains the subtree via
registry().wait_empty_under(path) since TaskScope is not Clone.

Task 5: DeviceHandle carries an Arc<TaskScope> (it is Clone, so the
subtree cancels when the last clone drops). build_device_handle takes
the per-device scope, wraps it in an Arc, and spawns both the device IO
task and the event-forwarding task through it. run_device_task and the
forwarding loop each gain a token.cancelled() select arm (the device
task's arm sits directly after 'biased;' so cancellation wins over new
work).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Rewrite PingTimer to be owned by a TaskScope: the timer task selects on its
cancellation token and exits cooperatively, so the PingMessage::End variant
and the entire impl Drop for PingTimer (the Drop-spawn hack) are deleted.
Dropping the timer now drops its scope, which cancels the task.

server_builder creates the per-instance "server" root scope. The ping-timeout
FnOnce callback moves a child scope in and uses spawn_and_hold for the
stop-devices task. PingTimer::new and ButtplugServer::new gain the scope
parameter; the server stores it.

ButtplugServer::shutdown cancels its own scope and awaits only its own subtree
via registry().wait_empty_under -- not the device manager's subtree, since a
shared device manager outlives the server and handles its own shutdown.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add an owned task_scope: TaskScope parameter to
ProtocolHandler::handle_input_subscribe_cmd. DeviceHandle's wrapper
hands each handler a self.task_scope.child("subscription") scope so
subscription event-handler tasks are owned by the per-device scope and
get cancelled on device removal.

kgoal_boost's internal subscription spawn becomes
task_scope.spawn_and_hold with a biased token.cancelled() select arm
wrapping the hardware event recv loop, so a blocked recv is interrupted
on cancellation. The existing receiver_count/subscribed_sensors
early-exit is retained as a nobody-listening optimization.

galaku overrides the trait but performs no internal spawn, so it only
gains the (unused) parameter. kiiroo_v21's override is entirely inside
a block comment and is not a compiled overrider, so it is untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Brings a real test-harness device online (which spawns per-device io and
event-forwarding tasks plus the device-manager event loop), then asserts
the task registry drains back to baseline once shutdown() returns -- before
dropping the server, so ordinary channel-drop teardown cannot mask a missing
cooperative-cancellation arm. A second post-drop check confirms nothing is
stranded.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Bringup-during-shutdown race (server_device_manager_event_loop.rs):
  Replace raw `async_manager::spawn` for device bringup with
  `devices_scope.spawn("bringup-{address}", ...)` so the bringup task
  is registered in the Task Registry for its full duration.
  Ordering guarantee: io and event-forwarding tasks register inside
  build_device_handle (synchronously, before it returns) and therefore
  before the bringup wrapper deregisters — the registry count never
  momentarily hits zero while work remains.

- Drop comment contradicts code (server_device_manager.rs:403-409):
  Fix misleading comment that said "we only need to log here" while
  the code still called self.task_scope.cancel() explicitly. Updated
  comment now correctly states the explicit cancel is intentional and
  fires eagerly before field drop order takes effect.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add emit_task_events flag to EngineOptions/EngineOptionsExternal/EngineOptionsBuilder
mirroring the emit_output_observations pattern. Add TaskStarted, TaskEnded, TaskList
EngineMessage variants, TaskListEntry struct, and RequestTaskList IntifaceMessage
variant. Add stub match arm in frontend/mod.rs (Task 10 will wire the real logic).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
When emit_task_events is enabled in EngineOptions, the frontend server
event loop subscribes to buttplug_core::util::task::registry().event_stream()
before entering the select loop and forwards TaskStarted/TaskEnded as
EngineMessages. When disabled, the arm is inert via the Option+pending pattern.

RequestTaskList is now answered with a registry snapshot mapped into
TaskListEntry values and sent back as EngineMessage::TaskList.

The emit_task_events flag is read from EngineOptions at spawn time in
engine.rs and passed into frontend_server_event_loop, matching how other
flags (e.g. emit_output_observations) flow through the same code path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Document the new TaskScope/TaskRegistry ownership and introspection
infrastructure (buttplug_core::util::task) as a Key Pattern, note the
server's scope ownership, the ProtocolHandler subscribe signature change,
and the intiface_engine emit_task_events plumbing. Add test_task_lifecycle.rs
to the integration test crate context.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- DeviceHandle::new: add #[allow(clippy::too_many_arguments)] with
  explanatory comment; constructor bundles all device-state concerns
  and restructuring is not warranted for a pub(crate) constructor
- CONTEXT.md Task Scope entry: fix example path from
  server/device-manager/device-3/keepalive to
  device-manager-1/devices/device-3/io to match the shipped two-root
  design (independent server-N and device-manager-N roots)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ServerDeviceManager::shutdown() cancelled the task scope synchronously
before the returned future awaited stop_scanning/stop_devices/disconnect.
Device io tasks select biased on their cancellation token, so queued stop
commands were dropped and devices could keep running through disconnect.
Capture the scope's CancellationToken and path synchronously, run cleanup
first, then cancel and wait_empty_under. Align ButtplugServer::shutdown to
the same cleanup-before-cancel ordering.

Adds a regression test exercising the StopScanning-through-event-loop path
with a connected device and active scanning. The stronger 'observe the
actual stop write' variant is infeasible with the test harness: the test
hardware sets a 1ms message_gap, so the disconnect teardown races the
batched stop write independent of cancel ordering. The test instead asserts
shutdown drives cleanup through the live event loop, drains every scope
task, and returns Ok.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The bringup task spawned in the device-manager event loop ignored its
cancellation token, so build_device_handle could stall indefinitely (e.g. a
BLE connect that never completes). ServerDeviceManager::shutdown's
wait_empty_under then waited forever for the bringup task to deregister.

Wrap build_device_handle in a biased tokio::select! against
token.cancelled(); on cancellation we log at info and fall through. Dropping
the build_device_handle future drops device_scope, cancelling anything it
already spawned. connecting_devices.remove(&address) still runs
unconditionally as the last statement on every path (success, error, and
cancellation) so a future scan can retry the address.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
finish_task deregistered a scoped task only after fut.await, so a panicking
task unwound past deregistration and leaked its registry entry forever,
hanging wait_empty_under on that subtree. Replace the post-await
deregistration with a DeregisterGuard constructed BEFORE awaiting; its Drop
impl deregisters with outcome Panicked when std::thread::panicking(),
Cancelled when the token is cancelled, else Completed. The guard covers the
spawn / spawn_and_hold paths (via finish_task) and spawn_detached (token
None -> Panicked on panic, else Completed).

Adds the Panicked variant to TaskOutcome, documents it on the intiface
EngineMessage::TaskEnded outcome field, and adds a unit test asserting
wait_empty_under resolves within a timeout after a scoped task panics
(verified RED against the old post-await deregistration). Also softens the
task module doc, which overstated that every task is scope-owned, to note
buttplug_server is fully migrated while other crates follow up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
test_spawn_and_hold_keeps_scope_alive now subscribes to the registry event
stream before spawning and asserts the held task's Ended event carries
TaskOutcome::Completed (not Cancelled) — proving spawn_and_hold does not wire
drop-cancel to the normally-finishing held task. Lagged broadcast reads are
tolerated so the assertion stays stable under parallel test load.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sion test

Addresses code review of the task-scope-lifecycle PR-response fixes:

- Fix 1: rename test_shutdown_runs_cleanup_through_event_loop_before_cancel to
  test_shutdown_under_load_drains_subtree and rewrite its docstring honestly. It
  is not RED-capable for the cleanup-before-cancel ordering (reverting that fix
  leaves it green), so it is documented as a shutdown-under-load smoke test that
  guards the coarser hang/leak failure mode. The existing NOTE on why the
  write-observation variant is infeasible (1ms message_gap teardown race) is
  kept; the inherently-flaky instrumented-ordering variant is explicitly not
  pursued.

- Fix 2: add test_shutdown_resolves_with_stalled_bringup plus a
  StallingDeviceCommunicationManager util whose connect() never resolves. With a
  device bringup stalled in connect(), shutdown() must still resolve within 10s
  via the biased select on the bringup token. Verified RED: reverting the
  bringup select to the non-cancellable |_token| form makes shutdown hang and
  the test time out at 10s.

- Parallel-stability: the leak-check tests derived their device-manager scope
  prefix by scanning the global registry for any new device-manager-N root,
  which could pick a concurrent test's root and flakily report leaked tasks.
  Add ServerDeviceManager::scope_path() and use server.device_manager().
  scope_path() so each test inspects only its own subtree. Full file is now
  stable across 12 consecutive runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant