Refactor: host-build trb runtime arena #846

Open
poursoul wants to merge 5 commits into hw-native-sys:main from poursoul:refactor-defer-slot-state-bind-to-prepare-task

Conversation

@poursoul (Collaborator) commented May 22, 2026

Summary

Move the trb runtime arena's layout + data initialization from AICPU boot
onto host, so each AICPU launch reduces to a cheap pointer-fixup pass plus
the SM reset that can't run off-device. The pooled prebuilt image lives in
a per-Worker DeviceRunner pool and is reused across runs via a single
rtMemcpy — multi-launch boot cost drops from O(task_window_size) per
worker to a constant.

Two related cleanups ride along:

  • RingSchedState::init's O(task_window_size) slot-bind loop is lifted
    into per-submit orch::prepare_task, making startup independent of
    window size. The two extra stores hit the same 64B cache line that
    prepare_task already dirties, so the per-submit cost is essentially
    free.
  • AICPU SM reset (per-slot bind_ring + reset_for_reuse +
    fanin_count/active_mask zero) is consolidated into
    PTO2SharedMemoryHandle::init_header_per_ring so the host-build path
    never dereferences SM.

Covers both a2a3 and a5 trb runtimes (platform layer onboard + sim).
hbg is unaffected by the runtime-arena split — its
setup_static_arena(...,0) keeps the third region unreserved.

Mechanism

  • runtime_create_from_sm is split into four phases, each of which runs
    on the host, the AICPU, or both:
    • runtime_reserve_layout — pure arithmetic; host computes sub-region
      offsets on a libc-backed DeviceArena.
    • runtime_init_data_from_layout — writes standalone fields, memset's
      arena regions, and stores SM device pointers (only stores, no
      dereferences).
    • runtime_wire_arena_pointers — walks every arena-internal pointer
      field and binds it to arena.base() + offset. Idempotent: host runs
      once with the host mirror, AICPU runs once after attach with device
      addresses.
    • runtime_finalize_after_wire — AICPU-only fixup for s_runtime_ops
      (device-side file-local global) and the SPMD core counts from the
      SchedulerContext.
  • DeviceArena::attach() wraps an externally-owned buffer with no
    per-attach allocation; re-attach is permitted so each AICPU boot can
    reuse the same pooled image. The pre-alignment / non-null /
    power-of-two checks call std::abort() instead of assert() so release
    builds still trap on contract violations.
  • pto2_sm_layout namespace computes SM device-side field addresses by
    pure offset arithmetic so host init never reads SM. Takes a per-ring
    task_window_sizes[] array (mirroring the SM API) and asserts
    ring_id in range — structurally prevents the host-built image from
    silently disagreeing with the SM layout.
  • New runtime/shared/pto_runtime2_init.cpp holds the host-pluggable
    cold-path lifted from pto_runtime2.cpp / pto_orchestrator.cpp /
    scheduler/pto_scheduler.cpp. AICPU-only ops table / submit_task /
    dispatch / business logic stay in their original files.
  • DeviceRunner now owns three independent pooled arenas
    (gm_heap_arena_, gm_sm_arena_, runtime_arena_pool_), one
    device_malloc each. They are split out from a single backing
    allocation because the combined size can exceed the device
    allocator's largest contiguous block.
    setup_static_arena(gm_heap_size, gm_sm_size, runtime_arena_size)
    commits each region independently;
    acquire_pooled_runtime_arena() returns nullptr when the region is
    unreserved (hbg's setup_static_arena(...,0) path) so misuse is loud,
    not undefined.
  • bind_prepared_to_runtime_impl (host runtime_maker) does the full
    reserve_layout → init_data → wire on a host arena, stashes the layout
    inside the PTO2Runtime image at prebuilt_layout, then rtMemcpys
    the whole arena into the pooled device region.
  • Dead fields and parameters dropped: PTO2TensorMap::orch back-pointer
    (never dereferenced), PTO2Runtime::prebuilt_arena_base mirror (host
    Runtime::prebuilt_arena_base_ is the real source of truth), unused
    task_window_size / dep_pool_capacity from
    PTO2SchedulerState::init_data_from_layout and
    RingSchedState::init_data_from_layout (scheduler only needs SM base
    + ring index, both window-size-independent).
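
The reserve / init-data / wire split described above can be pictured with a toy arena. This is a minimal sketch, not the PR's code: ToyLayout, ToyRuntime, load_runtime and the single `queue` pointer are hypothetical stand-ins, and the arena is modeled as a plain byte image whose consumer address is passed in as `base`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins; none of these names come from the PR.
struct ToyLayout {
    std::size_t off_runtime = 0;  // offset of the runtime header in the arena
    std::size_t off_queue = 0;    // offset of the queue storage
    std::size_t total = 0;        // bytes needed for the whole arena
};

struct ToyRuntime {
    std::uint32_t queue_capacity;  // standalone field, identical on host and device
    std::uint64_t queue;           // arena-internal pointer, kept as a raw address
};

constexpr std::size_t align_up(std::size_t v, std::size_t a) {
    return (v + a - 1) & ~(a - 1);
}

// Phase 1: pure arithmetic, touches no memory.
ToyLayout reserve_layout(std::uint32_t capacity) {
    ToyLayout l;
    l.off_runtime = 0;
    l.off_queue = align_up(sizeof(ToyRuntime), 64);
    l.total = l.off_queue + capacity * sizeof(std::uint32_t);
    return l;
}

// Phase 2: zero the regions and write standalone fields; pointers stay unset.
void init_data_from_layout(unsigned char* image, const ToyLayout& l,
                           std::uint32_t capacity) {
    std::memset(image, 0, l.total);
    ToyRuntime rt{};
    rt.queue_capacity = capacity;
    std::memcpy(image + l.off_runtime, &rt, sizeof rt);
}

// Phase 3: bind every arena-internal pointer to base + offset.
// Running it again with the same or a different base simply overwrites
// the pointer, which is what makes the pass safe to repeat.
void wire_arena_pointers(unsigned char* image, const ToyLayout& l,
                         std::uint64_t base) {
    ToyRuntime rt;
    std::memcpy(&rt, image + l.off_runtime, sizeof rt);
    rt.queue = base + l.off_queue;
    std::memcpy(image + l.off_runtime, &rt, sizeof rt);
}

// Helper for inspecting the header without assuming the image is aligned.
ToyRuntime load_runtime(const unsigned char* image, const ToyLayout& l) {
    ToyRuntime rt;
    std::memcpy(&rt, image + l.off_runtime, sizeof rt);
    return rt;
}
```

In this toy, the host would wire once against its own mirror address, copy the image, and the consumer would wire again against the address it actually sees; only the pointer fields change between the two passes.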

Test plan

  • cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state /
    task_state / wiring / tensormap UTs migrated to the data+wire API.
    task_allocator.init grew an optional initial_local_task_id
    (default 0) so the near-INT32_MAX corner case is still exercised
    without reading the SM.
  • a5sim: L2 trb 21/21 + L2 host_build_graph 6/6 pass.
  • a2a3sim: L2 trb 29/29 + L2 host_build_graph 9/9 pass.
  • a2a3 hardware: tests/st/.../paged_attention_unroll passes on
    device 9 (--build, pto-isa commit pinned to CI).

poursoul added 2 commits May 22, 2026 12:22
Move the per-slot payload/task pointer assignments out of the
RingSchedState::init() O(task_window_size) loop and into orch::prepare_task.
Their value is per-slot constant (&task_payloads[slot] /
&task_descriptors[slot]) but writing them at submit time, on the same 64B
slot_state cache line prepare_task is already dirtying, is essentially
free — while removing the only "scale-dependent" pointer assignments from
the init path. ring_id stays in init (its value is per-ring constant, so
rewriting it each submit would only add noise without removing a loop).

Split PTO2TaskSlotState::bind() into bind_ring() (init-time) and
bind_buffers() (per-submit) to make the two call-site shapes explicit.
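
A minimal sketch of the two call-site shapes, under stated assumptions: ToySlotState, ToyRing, ToyPayload, ToyDescriptor and this prepare_task signature are hypothetical, and only the bind_ring / bind_buffers method names are taken from the commit message. It illustrates that the per-slot pointer stores can move to submit time without changing what a slot ends up pointing at.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the payload and descriptor arrays.
struct ToyPayload { std::uint64_t words[4]; };
struct ToyDescriptor { std::uint32_t kernel_id; };

struct ToySlotState {
    std::uint8_t ring_id = 0;
    ToyPayload* payload = nullptr;
    ToyDescriptor* task = nullptr;
    std::uint32_t fanin_count = 0;

    // Init-time: the value is constant for the whole ring.
    void bind_ring(std::uint8_t ring) { ring_id = ring; }
    // Per-submit: two stores next to fields the submit path writes anyway.
    void bind_buffers(ToyPayload* p, ToyDescriptor* d) { payload = p; task = d; }
};
// The toy slot fits in one 64-byte line, so the extra stores share it.
static_assert(sizeof(ToySlotState) <= 64, "slot state should fit one cache line");

struct ToyRing {
    std::vector<ToySlotState> slots;
    std::vector<ToyPayload> payloads;
    std::vector<ToyDescriptor> descriptors;

    ToyRing(std::uint8_t ring_id, std::size_t window)
        : slots(window), payloads(window), descriptors(window) {
        // Only the ring id is bound here; no buffer pointers are assigned.
        for (ToySlotState& s : slots) s.bind_ring(ring_id);
    }
};

// Toy submit path: binds the slot's buffers right before using them.
ToySlotState& prepare_task(ToyRing& ring, std::uint32_t local_task_id,
                           std::uint32_t kernel_id) {
    const std::size_t slot = local_task_id % ring.slots.size();
    ToySlotState& s = ring.slots[slot];
    s.bind_buffers(&ring.payloads[slot], &ring.descriptors[slot]);
    s.fanin_count = 0;
    s.task->kernel_id = kernel_id;
    return s;
}
```

Slots that have never been submitted keep null buffer pointers in this toy, which is harmless as long as nothing reads them before the first prepare_task for that slot.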

Mirrored across both a2a3 and a5 trb runtimes.
Previously the AICPU rebuilt the entire trb runtime arena (PTO2Runtime,
orchestrator/scheduler/tensor_map sub-regions, sm_handle wrapper,
mailbox) on every device boot via runtime_create_from_sm. This commit
moves layout + data init onto the host so the AICPU only does a cheap
arena-internal pointer wire pass plus the SM reset that can't run
off-device. Multi-run boots reuse the pooled prebuilt image with a
single rtMemcpy.

Mechanism
- DeviceArena::attach() wraps an externally-owned buffer; re-attach is
  permitted so each AICPU boot can reuse the pooled image.
- runtime_create_from_sm split into reserve_layout / init_data_from_layout
  / wire_arena_pointers / finalize_after_wire. orchestrator / scheduler /
  tensor_map / ready_queue / spsc gain matching data+wire pairs;
  finalize_after_wire stays AICPU-only since it binds s_runtime_ops.
- pto2_sm_layout helper computes SM field device addresses by pure
  offset arithmetic so host init never dereferences SM.
- Per-slot SM-side reset (bind_ring + reset_for_reuse + active_mask)
  moved from RingSchedState::init into
  PTO2SharedMemoryHandle::init_header_per_ring so the AICPU still owns
  it after the split.
- runtime/shared/pto_runtime2_init.cpp — new file holding the host-able
  pieces lifted out of pto_runtime2.cpp / pto_orchestrator.cpp /
  pto_scheduler.cpp. AICPU-only ops table / submit_task / dispatch
  stay in place.
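
To illustrate the "pure offset arithmetic" idea behind the pto2_sm_layout helper, here is a hedged sketch. The real SM layout is not shown in this PR description, so the namespace toy_sm_layout, the header size, the descriptor size and the ring count below are all assumptions; only the function name ring_task_descriptors_addr echoes the review follow-up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace toy_sm_layout {

// Assumed constants, chosen only for the example.
constexpr std::size_t kNumRings = 2;
constexpr std::size_t kHeaderBytes = 256;
constexpr std::size_t kDescriptorBytes = 64;

// Returns the device address of ring `ring_id`'s descriptor array.
// Only arithmetic on sm_base is performed; nothing is dereferenced,
// so this can run on a host that cannot read the shared memory.
inline std::uint64_t ring_task_descriptors_addr(
    std::uint64_t sm_base,
    const std::uint32_t (&task_window_sizes)[kNumRings],
    std::size_t ring_id) {
    assert(ring_id < kNumRings);
    std::uint64_t addr = sm_base + kHeaderBytes;
    // Skip the descriptor arrays of all preceding rings.
    for (std::size_t r = 0; r < ring_id; ++r) {
        addr += std::uint64_t{task_window_sizes[r]} * kDescriptorBytes;
    }
    return addr;
}

}  // namespace toy_sm_layout
```

Taking the full per-ring size array rather than a single window size means a ring's address depends on every ring before it, which is the property that keeps a host-computed address in step with a per-ring layout.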

Host wiring (runtime_maker.cpp)
- DeviceRunner::setup_static_arena gains a third runtime_arena_size
  region (hbg passes 0). The prebuilt image lives in the same pooled
  backing allocation as gm_heap and SM, keeping worker lifetime to one
  rtMalloc.
- bind_prepared_to_runtime_impl reserves layout on a host arena, sizes
  the pooled regions, runs init_data + wire, stashes prebuilt metadata
  into the rt image, rtMemcpys to device, and records base/offset on
  Runtime so the AICPU boot can find it.

AICPU boot (aicpu_executor.cpp)
- attach the runtime arena to the pooled buffer, take rt from
  base+off_runtime, wire arena-internal pointers, sm_handle->init
  (SM reset including the per-slot fields above), mailbox reset,
  finalize_after_wire (ops table + cluster/aiv counts).

Tests
- cpput: 25/25 pass. ready_queue / spsc_queue / scheduler_state /
  task_state / wiring / tensormap UTs migrated to the data+wire API.
  task_allocator.init grew an optional initial_local_task_id (default
  0) so UTs can still exercise task_id near INT32_MAX without reading
  the SM.
- a2a3sim trb: standalone (dynamic_register variants, L3
  group/dependency) + L2 tensormap_and_ringbuffer 29 tests all pass.
- a2a3sim host_build_graph: 9/9 pass (verifies the shared HostApi
  changes don't break hbg).
- a2a3 hardware: tests/st/.../paged_attention_unroll PASS on device 9
  (--build with pto-isa commit pinned to CI).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a prebuilt-arena fast path for the PTO2 runtime, allowing the host to pre-compute the runtime arena image and upload it to the device. This optimization reduces AICPU boot time by replacing full initialization with a simple attachment and pointer "wiring" phase. Key changes include refactoring the initialization logic for the runtime, orchestrator, and scheduler into separate data-population and pointer-wiring stages, extending the DeviceRunner to manage a pooled runtime arena, and adding an attach method to DeviceArena for externally-owned buffers. Review feedback correctly identified potential undefined behavior in the new acquire_pooled_runtime_arena methods when the arena is not provisioned, suggesting defensive checks against SIZE_MAX offsets.

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp
Comment thread src/a2a3/platform/sim/host/device_runner.cpp
poursoul added 2 commits May 25, 2026 11:27
Address review feedback from PR hw-native-sys#846:

- pto2_sm_layout::ring_task_descriptors_addr: take per-ring task_window_sizes[]
  array (mirroring PTO2SharedMemoryHandle's SM API) and assert ring_id range,
  so a future per-ring SM layout cannot silently disagree with the addresses
  the host bakes into the prebuilt image.
- DeviceRunner::acquire_pooled_runtime_arena (onboard + sim): return nullptr
  when runtime_arena_region_off_ == SIZE_MAX so a stray hbg-path call cannot
  resolve to base + SIZE_MAX. Failure is now loud and contained at the
  acquire boundary.
- DeviceArena::attach(): rewrite doc to match real behavior (region table is
  not repopulated after attach, reserve() asserts !committed_ so cannot
  replay, region_size() returns 0); promote the pre-alignment / non-null /
  power-of-two checks from plain assert() to an unconditional abort() so
  release builds still trap on contract violations.
- PTO2TensorMap: drop the dead `orch` back-pointer field (a2a3 never
  dereferences it), strip parent_orch parameter from wire_arena_pointers,
  and remove the now-unused PTO2OrchestratorState forward declaration.
- PTO2RingFlowControl::init(): add a coupling comment so future
  fc-initial-value or boot-order changes flag
  PTO2TaskAllocator::init's initial_local_task_id default in the same edit.
- PTO2SchedulerState::init_data_from_layout / RingSchedState::
  init_data_from_layout: drop the task_window_size / dep_pool_capacity
  parameters that were never consumed (scheduler only needs SM base + ring
  index, both window-size-independent; orchestrator counterpart still takes
  task_window_size for ring_task_descriptors arithmetic). Updated all
  callsites (pto_runtime2_init.cpp + 4 cpput suites).
- PTO2Runtime::prebuilt_arena_base: removed the dead mirror field. The host
  Runtime's prebuilt_arena_base_ is the real source of truth (AICPU reads it
  to locate the pooled buffer *before* dereferencing the image); the
  PTO2Runtime image still carries prebuilt_layout, which the AICPU does
  consume.
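
The assert()-to-abort() promotion for attach() can be sketched as follows. ToyArena and attach_args_ok are hypothetical names, and the three-argument attach signature is an assumption; the point is only that the contract check does not compile away under NDEBUG.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical predicate covering the three checks named above:
// non-null buffer, power-of-two alignment, buffer already aligned.
inline bool attach_args_ok(const void* buf, std::size_t alignment) {
    const bool pow2 = alignment != 0 && (alignment & (alignment - 1)) == 0;
    if (buf == nullptr || !pow2) return false;
    return (reinterpret_cast<std::uintptr_t>(buf) & (alignment - 1)) == 0;
}

class ToyArena {
public:
    // Wraps an externally owned buffer; performs no allocation.
    // Calling it again simply rebinds the arena to the given buffer.
    void attach(void* buf, std::size_t size, std::size_t alignment) {
        if (!attach_args_ok(buf, alignment)) {
            // Unlike assert(), this branch survives release builds.
            std::fprintf(stderr, "ToyArena::attach: contract violation\n");
            std::abort();
        }
        base_ = static_cast<unsigned char*>(buf);
        size_ = size;
    }
    unsigned char* base() const { return base_; }
    std::size_t size() const { return size_; }

private:
    unsigned char* base_ = nullptr;  // not owned
    std::size_t size_ = 0;
};
```

Keeping the predicate separate from the aborting call is a choice made for this sketch so the checks can be exercised without terminating the process.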

cpput: 25/25 pass. a2a3sim trb: dummy_task / dynamic_register / L2 trb
suite pass with --build.
Sync of PR hw-native-sys#846 commit 2/3 to a5 — commit 1 (slot_state.bind split)
was already mirrored. Brings the a5 trb runtime up to the same
host-build arena fast path as a2a3.

- 4-phase API (reserve_layout / init_data_from_layout /
  wire_arena_pointers / finalize_after_wire) replaces
  runtime_create_from_sm.
- New runtime/shared/pto_runtime2_init.cpp (~355 lines) and
  shared/pto_tensormap.cpp (the old runtime/pto_tensormap.cpp
  moved + split) hold the host-pluggable cold-path lifted from
  pto_runtime2.cpp / pto_orchestrator.cpp / scheduler/pto_scheduler.cpp.
- AICPU boot becomes attach + wire + sm_handle->init + finalize.
- runtime_maker.cpp pre-builds the arena image on host and rtMemcpys
  it into a pooled runtime-arena region; onboard + sim DeviceRunner
  setup_static_arena grow a third runtime_arena_size argument with
  matching acquire_pooled_runtime_arena (hbg path passes 0).

a5-specific divergences kept: enable_l2_swimlane (bool) instead of
L2PerfLevel, no dep_gen subsystem, wait_init_complete naming,
alignas(64) PTO2SpscQueue queue, cache_invalidate_range + cond.retire
in async_wait, RUNTIME_MAX_WORKER 108.

Tests
- cpput: 25/25 pass.
- a5sim: trb 21/21 + host_build_graph 6/6 pass.
- a2a3sim regression: trb 29/29 + host_build_graph 9/9 pass.
@poursoul changed the title from "Refactor: host-build trb runtime arena (a2a3 only)" to "Refactor: host-build trb runtime arena" on May 27, 2026
…tions

DeviceRunner's GM heap / PTO2 SM / trb prebuilt runtime arena used to live
in a single backing device buffer (one rtMalloc per worker, three regions
sub-divided via DeviceArena::reserve). The combined size can exceed the
device allocator's largest contiguous block on real hardware, so split
into three independent DeviceArena instances — each commits exactly one
region (one device_malloc), and acquire_pooled_* returns its base().

Touches all four DeviceRunner implementations (a2a3/a5 × onboard/sim).
The setup_static_arena and acquire_pooled_* signatures are unchanged;
the host_api / runtime_maker callers are unaffected. hbg keeps passing
runtime_arena_size = 0, which leaves runtime_arena_pool_ uncommitted
and acquire_pooled_runtime_arena returning nullptr.
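
A rough sketch of the three-arena arrangement, assuming one allocation per pool. ToyPool, ToyDeviceRunner and acquire_pooled_gm_heap are hypothetical names, and std::malloc stands in for the device allocator; setup_static_arena and acquire_pooled_runtime_arena mirror the names in the text but not necessarily their real signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// One pool owns at most one allocation (std::malloc stands in for device_malloc).
class ToyPool {
public:
    ToyPool() = default;
    ToyPool(const ToyPool&) = delete;
    ToyPool& operator=(const ToyPool&) = delete;
    ~ToyPool() { std::free(base_); }

    // A size of 0 leaves the pool uncommitted and is not an error.
    bool commit(std::size_t size) {
        if (base_ != nullptr) return false;  // already committed
        if (size == 0) return true;
        base_ = std::malloc(size);
        return base_ != nullptr;
    }
    void* base() const { return base_; }  // nullptr while uncommitted

private:
    void* base_ = nullptr;
};

class ToyDeviceRunner {
public:
    // Each region gets its own allocation, so no single block has to
    // hold the combined size.
    bool setup_static_arena(std::size_t gm_heap_size, std::size_t gm_sm_size,
                            std::size_t runtime_arena_size) {
        return gm_heap_.commit(gm_heap_size) && gm_sm_.commit(gm_sm_size) &&
               runtime_arena_.commit(runtime_arena_size);
    }
    void* acquire_pooled_gm_heap() const { return gm_heap_.base(); }
    // Returns nullptr when the caller passed runtime_arena_size == 0.
    void* acquire_pooled_runtime_arena() const { return runtime_arena_.base(); }

private:
    ToyPool gm_heap_;
    ToyPool gm_sm_;
    ToyPool runtime_arena_;
};
```

A caller on the hbg-style path in this toy gets a null pointer it can test for, rather than an address computed from an unset offset.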

Tests
- cpput: 25/25 pass.
- a5sim: L2 trb + host_build_graph full suite pass.
- a2a3sim: L2 trb + host_build_graph full suite pass.