Skip to content

Add: L3/L2 host-device mapped region design#861

Open
ccyywwen wants to merge 2 commits into
hw-native-sys:mainfrom
ccyywwen:host-device_mapped-region
Open

Add: L3/L2 host-device mapped region design#861
ccyywwen wants to merge 2 commits into
hw-native-sys:mainfrom
ccyywwen:host-device_mapped-region

Conversation

@ccyywwen
Copy link
Copy Markdown
Contributor

Summary

  • Add a design document for the proposed HostDeviceMappedRegion primitive.
  • Define the first-layer host/device mapped-region contract extracted from the PR803 lessons: raw mapped data bytes plus cache-line-sized signal slots, explicit host datacopy, and explicit host-side notify/wait.
  • Specify runtime ownership for L3 PROCESS mode, where the chip child owns the real mapped allocation and the L3 parent carries only an opaque region handle through the existing parent-child mailbox RPC path.
  • Define the planned public C ABI and Python API, including config/info structs, datacopy helpers, notify/wait helpers, error mapping, and masked host pointers in public Python info.
  • Document the region layout, signal slot layout, memory-ordering contract, side-band mailbox transport, platform support matrix, thin NPU example, and expected tests.
  • Keep higher-level shared-buffer, send/recv, and queue protocols explicitly out of scope for this design.

Backend Notes

  • On onboard A2/A3, the design uses device allocation plus halHostRegister so the owner process can initialize and access the region through a host mapping while kernels receive device-visible pointers.
  • On sim backends, ordinary host memory may serve as both the host mapping and simulation-visible device pointer, but the same layout, validation, and signal semantics still apply.
  • On onboard A5, the design calls for explicit -ENOTSUP stubs rather than missing ABI symbols.
  • The parent-child mailbox remains a control/proxy path only. It is not treated as the CPU-NPU mapped-region primitive itself.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the design document for HostDeviceMappedRegion, a low-level parent-to-child/NPU communication primitive in Simpler. The review feedback suggests improving the public ABI structure HostDeviceMappedRegionInfo by making implicit padding fields explicit to prevent potential security vulnerabilities and FFI errors, as well as updating the corresponding static assertions to validate these offsets.

Comment on lines +90 to +99
typedef struct {
uint64_t host_data_ptr;
uint64_t device_data_ptr;
uint64_t data_bytes;
uint64_t host_signal_ptr;
uint64_t device_signal_ptr;
uint32_t signal_count;
uint64_t total_bytes;
uint32_t flags;
} HostDeviceMappedRegionInfo;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The HostDeviceMappedRegionInfo structure contains implicit padding between signal_count (offset 40) and total_bytes (offset 48), as well as trailing padding after flags (offset 56) to align the structure to an 8-byte boundary (64 bytes total). Implicit padding in public ABI structures can lead to security vulnerabilities (such as uninitialized memory/information leakage) when copied or serialized directly across process boundaries. It can also make cross-language bindings (e.g., Python ctypes or Rust FFI) more error-prone. Adding explicit reserved fields ensures that padding is visible, documented, and can be explicitly zero-initialized.

Suggested change
typedef struct {
uint64_t host_data_ptr;
uint64_t device_data_ptr;
uint64_t data_bytes;
uint64_t host_signal_ptr;
uint64_t device_signal_ptr;
uint32_t signal_count;
uint64_t total_bytes;
uint32_t flags;
} HostDeviceMappedRegionInfo;
typedef struct {
uint64_t host_data_ptr;
uint64_t device_data_ptr;
uint64_t data_bytes;
uint64_t host_signal_ptr;
uint64_t device_signal_ptr;
uint32_t signal_count;
uint32_t reserved0;
uint64_t total_bytes;
uint32_t flags;
uint32_t reserved1;
} HostDeviceMappedRegionInfo;

Comment on lines +296 to +304
static_assert(offsetof(HostDeviceMappedRegionInfo, host_data_ptr) == 0);
static_assert(offsetof(HostDeviceMappedRegionInfo, device_data_ptr) == 8);
static_assert(offsetof(HostDeviceMappedRegionInfo, data_bytes) == 16);
static_assert(offsetof(HostDeviceMappedRegionInfo, host_signal_ptr) == 24);
static_assert(offsetof(HostDeviceMappedRegionInfo, device_signal_ptr) == 32);
static_assert(offsetof(HostDeviceMappedRegionInfo, signal_count) == 40);
static_assert(offsetof(HostDeviceMappedRegionInfo, total_bytes) == 48);
static_assert(offsetof(HostDeviceMappedRegionInfo, flags) == 56);
static_assert(sizeof(HostDeviceMappedRegionInfo) == 64);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the static assertions for HostDeviceMappedRegionInfo to explicitly validate the offsets of the newly added reserved0 and reserved1 fields. This ensures the structure layout remains exactly 64 bytes with the expected alignment and no implicit padding.

Suggested change
static_assert(offsetof(HostDeviceMappedRegionInfo, host_data_ptr) == 0);
static_assert(offsetof(HostDeviceMappedRegionInfo, device_data_ptr) == 8);
static_assert(offsetof(HostDeviceMappedRegionInfo, data_bytes) == 16);
static_assert(offsetof(HostDeviceMappedRegionInfo, host_signal_ptr) == 24);
static_assert(offsetof(HostDeviceMappedRegionInfo, device_signal_ptr) == 32);
static_assert(offsetof(HostDeviceMappedRegionInfo, signal_count) == 40);
static_assert(offsetof(HostDeviceMappedRegionInfo, total_bytes) == 48);
static_assert(offsetof(HostDeviceMappedRegionInfo, flags) == 56);
static_assert(sizeof(HostDeviceMappedRegionInfo) == 64);
static_assert(offsetof(HostDeviceMappedRegionInfo, host_data_ptr) == 0);
static_assert(offsetof(HostDeviceMappedRegionInfo, device_data_ptr) == 8);
static_assert(offsetof(HostDeviceMappedRegionInfo, data_bytes) == 16);
static_assert(offsetof(HostDeviceMappedRegionInfo, host_signal_ptr) == 24);
static_assert(offsetof(HostDeviceMappedRegionInfo, device_signal_ptr) == 32);
static_assert(offsetof(HostDeviceMappedRegionInfo, signal_count) == 40);
static_assert(offsetof(HostDeviceMappedRegionInfo, reserved0) == 44);
static_assert(offsetof(HostDeviceMappedRegionInfo, total_bytes) == 48);
static_assert(offsetof(HostDeviceMappedRegionInfo, flags) == 56);
static_assert(offsetof(HostDeviceMappedRegionInfo, reserved1) == 60);
static_assert(sizeof(HostDeviceMappedRegionInfo) == 64);

- Add HostDeviceMappedRegion ABI, common implementation, and sim/onboard backends
- Expose mapped-region lifecycle, datacopy, notify, and wait through ChipWorker and Worker
- Add L3 control-plane proxying, round-trip examples, UT coverage, and user documentation
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

📝 Walkthrough

Walkthrough

This PR introduces HostDeviceMappedRegion (HDMR), a host/device communication primitive enabling direct mapped memory access with cache management and signal-based synchronization. The feature spans documentation, C/C++ core runtime, platform-specific integrations, Python bindings, worker control-plane support, and a round-trip example demonstrating device-side XOR transformation coordinated via mapped signals.

Changes

HostDeviceMappedRegion Communication

Layer / File(s) Summary
Documentation and Protocol Specification
docs/L3-L2-host-device-communication.md, docs/dynamic-linking.md, docs/worker-manager.md
HDMR purpose, ownership rules, C ABI contract, Python API surface, datacopy/signal semantics, and memory ordering guarantees; dynamic linking updates for HAL; new mailbox control-plane protocol with CTRL_* sub-commands and fixed field layout.
C++ Common HDMR Core and Platform Interface
src/common/host_device_comm/host_device_mapped_region.h, host_device_mapped_region.cpp, src/common/worker/pto_runtime_c_api.h
Mapped-region memory layout (64-byte aligned header/signal slots), lifecycle management (open/close/close_all with thread-safe registries and active-op reference counting), datacopy (host→region, region→host), signal semantics (monotonic notify/wait with timeout), platform callback interface for cache ops.
Simulation-Only HDMR Allocation
src/common/host_device_comm/host_device_mapped_region_sim.h, host_device_mapped_region_sim.cpp
Test platform allocator using posix_memalign with aligned buffers and release callbacks.
ChipWorker and DeviceRunner HDMR Integration
src/common/worker/chip_worker.h, chip_worker.cpp, src/a2a3/platform/onboard/host/device_runner.h, device_runner.cpp
ChipWorker dynamic symbol loading and public HDMR methods (open/close/info/datacopy/notify/wait); DeviceRunner HAL registration via halHostRegister/halHostUnregister and AArch64/x86_64 cache maintenance (flush/invalidate).
A2A3 Onboard Platform Integration
src/a2a3/platform/onboard/host/CMakeLists.txt, host_device_mapped_region_onboard.h, host_device_mapped_region_onboard.cpp, pto_runtime_c_api.cpp
A2A3 onboard allocation via device tensor and host registration; platform C API wrappers forwarding to common helpers; resource cleanup in device context lifecycle.
A2A3 Simulation Platform Integration
src/a2a3/platform/sim/host/CMakeLists.txt, pto_runtime_c_api.cpp
A2A3 simulation C API wrappers and resource cleanup; build configuration.
A5 Onboard and Simulation Platform Integration
src/a5/platform/onboard/host/CMakeLists.txt, pto_runtime_c_api.cpp, src/a5/platform/sim/host/CMakeLists.txt, pto_runtime_c_api.cpp
A5 C API wrappers (onboard unsupported, simulation delegating to common helpers) and resource cleanup; build configurations.
Worker Control Plane HDMR Support
src/common/hierarchical/worker_manager.h, worker_manager.cpp, src/common/hierarchical/worker.h
Mailbox control protocol constants (CTRL_OPEN_MAPPED_REGION through CTRL_MAPPED_REGION_WAIT) with extended argument layout; WorkerThread/WorkerManager methods to coordinate HDMR across L3 hierarchy; Worker high-level delegation.
Python Nanobind Bindings
python/bindings/task_interface.cpp, python/bindings/worker_bind.h
MappedRegionInfo class with read-only fields; _ChipWorker HDMR lifecycle/data-movement/signal methods with buffer protocol support and timeout exception mapping; _Worker control plane bindings and exported constants.
Python High-Level HDMR API
python/simpler/task_interface.py, python/simpler/worker.py
ChipWorker wrapper methods; MappedRegion frozen dataclass; Worker client API supporting L2 (direct delegation) and L3 (control-plane proxy with shared-memory payloads); error/status code translation.
Round-Trip Example: Kernel and Orchestration
examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/aiv/host_device_mapped_region_round_trip.cpp, kernels/orchestration/host_device_mapped_region_round_trip_orch.cpp
AICore kernel polling signal0 with cache invalidation, XOR-transforming data based on observed signal, flushing results, and writing completion to signal1; orchestration marshaling scalars and submitting AIV task.
Round-Trip Example: Python Runner and Tests
examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/main.py, test_mapped_region_round_trip.py
Python runner compiling kernel/orch, opening mapped region, iterating payload/notify/run/wait/verify cycles with deterministic XOR payloads; pytest test wrapper with subprocess harness and CI integration.
Comprehensive Testing
tests/ut/cpp/CMakeLists.txt, tests/ut/cpp/common/test_host_device_mapped_region.cpp, tests/ut/cpp/hierarchical/test_mailbox_control_layout.cpp, tests/ut/py/test_chip_worker.py, tests/ut/py/test_worker/test_mapped_region_hw.py, test_mapped_region_round_trip_hw.py, test_mapped_region_sim.py
C++ ABI/layout stability, config validation, allocation failures, open/close/datacopy/notify/wait round-trips, cache sequencing, signal semantics, stale/cross-context rejection, close_all; Python control offset contracts, L2/L3 round-tripping, wrapper validation, string rejection, timeout/error handling; hardware smoke tests and integration tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 Mapped memory hops through the cache,
Signals bounce 'cross device so fast,
Data flows both ways with grace,
L2 and L3 keep pace,
Host and NPU joined at last!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
src/a2a3/platform/onboard/host/host_device_mapped_region_onboard.h (1)

17-20: ⚡ Quick win

Add documentation for the public allocation API.

The function declaration lacks documentation describing its parameters, return values, and usage contract. Public APIs should document their behavior, especially error conditions.

📝 Suggested documentation
+/**
+ * Allocate a host/device mapped region on A2/A3 onboard platforms.
+ *
+ * `@param` ctx          Device context handle.
+ * `@param` total_bytes  Total size in bytes (data + signals).
+ * `@param` platform     Output platform-specific handle and release callback.
+ * `@param` host_base    Output host-visible base pointer.
+ * `@param` device_base  Output device-visible base pointer.
+ * `@return` 0 on success, negative error code on failure.
+ */
 int a2a3_onboard_host_device_mapped_region_allocate(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/onboard/host/host_device_mapped_region_onboard.h` around
lines 17 - 20, The declaration of
a2a3_onboard_host_device_mapped_region_allocate is missing a public doc comment;
add a clear header comment above the function that documents each parameter
(DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform
*platform, void **host_base, void **device_base), the return value meaning
(e.g., 0 on success, negative error codes on failure), the ownership and
lifetime contract for host_base/device_base and platform, expected
alignment/size constraints, and explicit error conditions (out-of-memory,
invalid ctx, unsupported device) and side effects so callers know how to use and
clean up the allocated mapped region.
src/common/host_device_comm/host_device_mapped_region_sim.h (1)

17-20: ⚡ Quick win

Add documentation for the simulation allocation API.

The function declaration lacks documentation describing its parameters, return values, and usage contract.

📝 Suggested documentation
+/**
+ * Allocate a host/device mapped region in simulation mode.
+ *
+ * `@param` ctx          Device context handle (unused in sim).
+ * `@param` total_bytes  Total size in bytes (data + signals).
+ * `@param` platform     Output platform-specific handle and release callback.
+ * `@param` host_base    Output host-visible base pointer.
+ * `@param` device_base  Output device-visible base pointer (same as host_base in sim).
+ * `@return` 0 on success, negative error code on failure.
+ */
 int host_device_mapped_region_allocate_sim(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/host_device_comm/host_device_mapped_region_sim.h` around lines 17
- 20, Add a clear doc comment above the function declaration
host_device_mapped_region_allocate_sim explaining the purpose, detailed
parameter semantics for ctx (DeviceContextHandle), total_bytes, platform
(HostDeviceMappedRegionPlatform*), and output pointers host_base and
device_base, describe the return values/errno or error codes and
ownership/lifetime responsibilities (who frees the region and alignment/size
constraints), and any thread-safety or simulation-specific behavior/requirements
so callers know how to use and release the allocated mapped region.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/aiv/host_device_mapped_region_round_trip.cpp`:
- Around line 81-99: The poll loop using signal0 (kMaxPollIters) may time out
but the code still writes transformed output and sets signal1 to seq, hiding
failures; modify the tail of the function to check the observed flag and if
observed is false then do not publish the success seq: either return immediately
(bail out) before transforming/writing output, or write a distinct error
sentinel to *signal1 (e.g., a reserved value different from seq) and call
flush_range(signal1, kCacheLineBytes) so the host can detect the failure; ensure
any writes to data/nbytes and flush_range(data + nbytes, nbytes) only occur when
observed is true and keep references to signal0, signal1, seq, data, nbytes,
kMaxPollIters and kCacheLineBytes to locate the changes.

In `@python/bindings/task_interface.cpp`:
- Around line 72-76: The function make_mapped_region_info currently copies raw
host pointers into the Python-visible MappedRegionInfo (specifically
info.host_data_ptr and info.host_signal_ptr); change it to redact/mask those
host pointer fields (e.g., set to 0 or a sentinel) before constructing and
returning the MappedRegionInfo so raw host VAs are not exposed to Python; apply
the same masking to any other places that construct MappedRegionInfo from
HostDeviceMappedRegionInfo (the other similar construction sites referenced in
the review) to ensure host_data_ptr and host_signal_ptr are always
zeroed/redacted in Python-facing structures.

In `@src/a2a3/platform/onboard/host/device_runner.cpp`:
- Around line 507-515: The lazy init of the HAL (shared g_hal_handle) in
device_runner.cpp is not thread-safe—wrap the initialization path so only one
thread performs dlopen/publish and others wait; protect the call sequence around
load_hal_if_needed() and subsequent get_halHostRegister() with a std::once_flag
(or a mutex) to ensure load_hal_if_needed() is executed exactly once and
g_hal_handle is safely published; update/introduce a static std::once_flag (or a
static std::mutex) in the translation unit and call std::call_once (or lock the
mutex) before calling load_hal_if_needed(), then check/get HalHostRegisterFn as
before and propagate errors unchanged.

In `@src/a2a3/platform/onboard/host/device_runner.h`:
- Around line 254-256: Add clear doc comments for the two cache maintenance
APIs: document flush_host_cache_range(void *host_ptr, size_t bytes) and
invalidate_host_cache_range(void *host_ptr, size_t bytes) by describing each
parameter (host_ptr = start address in host memory, bytes = length to operate
on), the return contract (e.g., 0 on success, negative error codes on failure),
behavioral semantics (what “flush” does vs “invalidate” and whether the
operation is inclusive/exclusive of bounds), preconditions/requirements
(alignment or page boundaries, valid pointer lifetime), side effects/visibility
(synchronization guarantees with device access and whether caller must
fence/serialize), and thread-safety/usage notes (concurrent callers, recommended
use cases). Ensure the comments sit immediately above the declarations for
flush_host_cache_range and invalidate_host_cache_range so callers see the
contract.

In `@src/common/host_device_comm/host_device_mapped_region_sim.cpp`:
- Around line 18-21: In host_device_mapped_region_allocate_sim, add input
validation to ensure output pointer arguments platform, host_base, and
device_base are non-null before any dereference; if any are null, return a
failure code (e.g., -EINVAL or -1) immediately and do not proceed with
allocation or writes, otherwise continue as before. Ensure you reference the
function host_device_mapped_region_allocate_sim and check the three pointer
parameters (platform, host_base, device_base) at the top of the function to
prevent undefined behavior.

In `@src/common/host_device_comm/host_device_mapped_region.cpp`:
- Around line 430-437: The code currently reinterpret_casts slot->value to
std::atomic<uint32_t> (via atomic_value) which is UB; also the load/check/store
can lose a higher concurrent value. Fix by making the underlying storage a
proper std::atomic<uint32_t> (change the slot struct so value is
std::atomic<uint32_t>), then replace the load/check/store with a CAS loop that
performs a monotonic max update (use compare_exchange_weak/strong on the atomic
member so you only store the larger of current and value). Update both
occurrences (the current block using atomic_value and the other similar block
around ~464-465). Keep the release_region_op(region) and
flush_host_range(region, slot, sizeof(*slot)) calls in place after a successful
update.

---

Nitpick comments:
In `@src/a2a3/platform/onboard/host/host_device_mapped_region_onboard.h`:
- Around line 17-20: The declaration of
a2a3_onboard_host_device_mapped_region_allocate is missing a public doc comment;
add a clear header comment above the function that documents each parameter
(DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform
*platform, void **host_base, void **device_base), the return value meaning
(e.g., 0 on success, negative error codes on failure), the ownership and
lifetime contract for host_base/device_base and platform, expected
alignment/size constraints, and explicit error conditions (out-of-memory,
invalid ctx, unsupported device) and side effects so callers know how to use and
clean up the allocated mapped region.

In `@src/common/host_device_comm/host_device_mapped_region_sim.h`:
- Around line 17-20: Add a clear doc comment above the function declaration
host_device_mapped_region_allocate_sim explaining the purpose, detailed
parameter semantics for ctx (DeviceContextHandle), total_bytes, platform
(HostDeviceMappedRegionPlatform*), and output pointers host_base and
device_base, describe the return values/errno or error codes and
ownership/lifetime responsibilities (who frees the region and alignment/size
constraints), and any thread-safety or simulation-specific behavior/requirements
so callers know how to use and release the allocated mapped region.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8a5636fd-d12b-4b0b-af8b-1a7771e8b3ab

📥 Commits

Reviewing files that changed from the base of the PR and between 896bd02 and 0277e74.

📒 Files selected for processing (40)
  • docs/L3-L2-host-device-communication.md
  • docs/dynamic-linking.md
  • docs/worker-manager.md
  • examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/aiv/host_device_mapped_region_round_trip.cpp
  • examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/orchestration/host_device_mapped_region_round_trip_orch.cpp
  • examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/main.py
  • examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/test_mapped_region_round_trip.py
  • python/bindings/task_interface.cpp
  • python/bindings/worker_bind.h
  • python/simpler/task_interface.py
  • python/simpler/worker.py
  • src/a2a3/platform/onboard/host/CMakeLists.txt
  • src/a2a3/platform/onboard/host/device_runner.cpp
  • src/a2a3/platform/onboard/host/device_runner.h
  • src/a2a3/platform/onboard/host/host_device_mapped_region_onboard.cpp
  • src/a2a3/platform/onboard/host/host_device_mapped_region_onboard.h
  • src/a2a3/platform/onboard/host/pto_runtime_c_api.cpp
  • src/a2a3/platform/sim/host/CMakeLists.txt
  • src/a2a3/platform/sim/host/pto_runtime_c_api.cpp
  • src/a5/platform/onboard/host/CMakeLists.txt
  • src/a5/platform/onboard/host/pto_runtime_c_api.cpp
  • src/a5/platform/sim/host/CMakeLists.txt
  • src/a5/platform/sim/host/pto_runtime_c_api.cpp
  • src/common/hierarchical/worker.h
  • src/common/hierarchical/worker_manager.cpp
  • src/common/hierarchical/worker_manager.h
  • src/common/host_device_comm/host_device_mapped_region.cpp
  • src/common/host_device_comm/host_device_mapped_region.h
  • src/common/host_device_comm/host_device_mapped_region_sim.cpp
  • src/common/host_device_comm/host_device_mapped_region_sim.h
  • src/common/worker/chip_worker.cpp
  • src/common/worker/chip_worker.h
  • src/common/worker/pto_runtime_c_api.h
  • tests/ut/cpp/CMakeLists.txt
  • tests/ut/cpp/common/test_host_device_mapped_region.cpp
  • tests/ut/cpp/hierarchical/test_mailbox_control_layout.cpp
  • tests/ut/py/test_chip_worker.py
  • tests/ut/py/test_worker/test_mapped_region_hw.py
  • tests/ut/py/test_worker/test_mapped_region_round_trip_hw.py
  • tests/ut/py/test_worker/test_mapped_region_sim.py

Comment on lines +81 to +99
bool observed = false;
for (uint32_t i = 0; i < kMaxPollIters; ++i) {
invalidate_range(signal0, kCacheLineBytes);
if (*signal0 >= seq) {
observed = true;
break;
}
}

invalidate_range(data, nbytes);
for (uint32_t i = 0; i < nbytes; ++i) {
uint8_t mask = observed ? static_cast<uint8_t>(seq + i * 3U) : static_cast<uint8_t>(0xA5U);
data[nbytes + i] = static_cast<uint8_t>(data[i] ^ mask);
}
flush_range(data + nbytes, nbytes);

*signal1 = seq;
flush_range(signal1, kCacheLineBytes);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't signal success after the poll loop times out.

If signal0 never reaches seq, this still writes transformed output and then publishes signal1 = seq. That hides real notify/cache failures and can make the host accept corrupted data as a successful round trip. Bail out, or publish a distinct error sentinel, instead of reusing the success signal.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/aiv/host_device_mapped_region_round_trip.cpp`
around lines 81 - 99, The poll loop using signal0 (kMaxPollIters) may time out
but the code still writes transformed output and sets signal1 to seq, hiding
failures; modify the tail of the function to check the observed flag and if
observed is false then do not publish the success seq: either return immediately
(bail out) before transforming/writing output, or write a distinct error
sentinel to *signal1 (e.g., a reserved value different from seq) and call
flush_range(signal1, kCacheLineBytes) so the host can detect the failure; ensure
any writes to data/nbytes and flush_range(data + nbytes, nbytes) only occur when
observed is true and keep references to signal0, signal1, seq, data, nbytes,
kMaxPollIters and kCacheLineBytes to locate the changes.

Comment on lines +72 to +76
MappedRegionInfo make_mapped_region_info(const HostDeviceMappedRegionInfo &info) {
return MappedRegionInfo{
info.host_data_ptr, info.device_data_ptr, info.data_bytes, info.host_signal_ptr,
info.device_signal_ptr, info.signal_count, info.total_bytes, info.flags,
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mask host pointers before returning MappedRegionInfo to Python.

This currently copies host_data_ptr and host_signal_ptr straight into a Python-visible object. The mapped-region design for this PR says public Python info should mask host pointers, and exposing raw host VAs leaks process layout information. Zero or redact those fields in make_mapped_region_info() before constructing the Python wrapper.

Proposed fix
 MappedRegionInfo make_mapped_region_info(const HostDeviceMappedRegionInfo &info) {
     return MappedRegionInfo{
-        info.host_data_ptr,     info.device_data_ptr, info.data_bytes,  info.host_signal_ptr,
+        0,                      info.device_data_ptr, info.data_bytes,  0,
         info.device_signal_ptr, info.signal_count,    info.total_bytes, info.flags,
     };
 }

Also applies to: 759-763, 882-885

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/bindings/task_interface.cpp` around lines 72 - 76, The function
make_mapped_region_info currently copies raw host pointers into the
Python-visible MappedRegionInfo (specifically info.host_data_ptr and
info.host_signal_ptr); change it to redact/mask those host pointer fields (e.g.,
set to 0 or a sentinel) before constructing and returning the MappedRegionInfo
so raw host VAs are not exposed to Python; apply the same masking to any other
places that construct MappedRegionInfo from HostDeviceMappedRegionInfo (the
other similar construction sites referenced in the review) to ensure
host_data_ptr and host_signal_ptr are always zeroed/redacted in Python-facing
structures.

Comment on lines +507 to +515
if (load_hal_if_needed() != 0) {
LOG_ERROR("Failed to load ascend_hal for mapped region: %s", dlerror());
return -1;
}
HalHostRegisterFn fn = get_halHostRegister();
if (fn == nullptr) {
LOG_ERROR("halHostRegister symbol not found: %s", dlerror());
return -1;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the HAL lazy init thread-safe.

These new mapped-region entrypoints funnel through the shared g_hal_handle without synchronization. Two threads hitting first-use here can race on the dlopen/publish path, which is undefined behavior in C++. Please guard the lazy init with std::once_flag or a mutex.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/onboard/host/device_runner.cpp` around lines 507 - 515, The
lazy init of the HAL (shared g_hal_handle) in device_runner.cpp is not
thread-safe—wrap the initialization path so only one thread performs
dlopen/publish and others wait; protect the call sequence around
load_hal_if_needed() and subsequent get_halHostRegister() with a std::once_flag
(or a mutex) to ensure load_hal_if_needed() is executed exactly once and
g_hal_handle is safely published; update/introduce a static std::once_flag (or a
static std::mutex) in the translation unit and call std::call_once (or lock the
mutex) before calling load_hal_if_needed(), then check/get HalHostRegisterFn as
before and propagate errors unchanged.

Comment on lines +254 to +256
int flush_host_cache_range(void *host_ptr, size_t bytes);

int invalidate_host_cache_range(void *host_ptr, size_t bytes);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add documentation for cache maintenance methods.

The flush_host_cache_range and invalidate_host_cache_range methods lack documentation, unlike the other methods added in this section. Public APIs should document their parameters, return values, and usage contract.

📝 Suggested documentation
+    /**
+     * Flush host cache range to ensure device visibility of host writes.
+     *
+     * `@param` host_ptr  Host mapping pointer.
+     * `@param` bytes     Range size in bytes.
+     * `@return` 0 on success, non-zero HAL/runtime error on failure.
+     */
     int flush_host_cache_range(void *host_ptr, size_t bytes);
 
+    /**
+     * Invalidate host cache range to ensure host visibility of device writes.
+     *
+     * `@param` host_ptr  Host mapping pointer.
+     * `@param` bytes     Range size in bytes.
+     * `@return` 0 on success, non-zero HAL/runtime error on failure.
+     */
     int invalidate_host_cache_range(void *host_ptr, size_t bytes);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
int flush_host_cache_range(void *host_ptr, size_t bytes);
int invalidate_host_cache_range(void *host_ptr, size_t bytes);
/**
* Flush host cache range to ensure device visibility of host writes.
*
* `@param` host_ptr Host mapping pointer.
* `@param` bytes Range size in bytes.
* `@return` 0 on success, non-zero HAL/runtime error on failure.
*/
int flush_host_cache_range(void *host_ptr, size_t bytes);
/**
* Invalidate host cache range to ensure host visibility of device writes.
*
* `@param` host_ptr Host mapping pointer.
* `@param` bytes Range size in bytes.
* `@return` 0 on success, non-zero HAL/runtime error on failure.
*/
int invalidate_host_cache_range(void *host_ptr, size_t bytes);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/onboard/host/device_runner.h` around lines 254 - 256, Add
clear doc comments for the two cache maintenance APIs: document
flush_host_cache_range(void *host_ptr, size_t bytes) and
invalidate_host_cache_range(void *host_ptr, size_t bytes) by describing each
parameter (host_ptr = start address in host memory, bytes = length to operate
on), the return contract (e.g., 0 on success, negative error codes on failure),
behavioral semantics (what “flush” does vs “invalidate” and whether the
operation is inclusive/exclusive of bounds), preconditions/requirements
(alignment or page boundaries, valid pointer lifetime), side effects/visibility
(synchronization guarantees with device access and whether caller must
fence/serialize), and thread-safety/usage notes (concurrent callers, recommended
use cases). Ensure the comments sit immediately above the declarations for
flush_host_cache_range and invalidate_host_cache_range so callers see the
contract.

Comment on lines +18 to +21
int host_device_mapped_region_allocate_sim(
DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base,
void **device_base
) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add input validation for output parameters.

The function does not validate that output pointers (platform, host_base, device_base) are non-null before dereferencing them. If a caller passes null, this will cause undefined behavior.

🛡️ Proposed fix to add null checks
 int host_device_mapped_region_allocate_sim(
     DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base,
     void **device_base
 ) {
     (void)ctx;
+    if (platform == nullptr || host_base == nullptr || device_base == nullptr) {
+        return -EINVAL;
+    }
+    if (total_bytes == 0) {
+        return -EINVAL;
+    }
     void *ptr = nullptr;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
int host_device_mapped_region_allocate_sim(
DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base,
void **device_base
) {
int host_device_mapped_region_allocate_sim(
DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base,
void **device_base
) {
(void)ctx;
if (platform == nullptr || host_base == nullptr || device_base == nullptr) {
return -EINVAL;
}
if (total_bytes == 0) {
return -EINVAL;
}
void *ptr = nullptr;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/host_device_comm/host_device_mapped_region_sim.cpp` around lines
18 - 21, In host_device_mapped_region_allocate_sim, add input validation to
ensure output pointer arguments platform, host_base, and device_base are
non-null before any dereference; if any are null, return a failure code (e.g.,
-EINVAL or -1) immediately and do not proceed with allocation or writes,
otherwise continue as before. Ensure you reference the function
host_device_mapped_region_allocate_sim and check the three pointer parameters
(platform, host_base, device_base) at the top of the function to prevent
undefined behavior.

Comment on lines +430 to +437
auto *atomic_value = reinterpret_cast<std::atomic<uint32_t> *>(const_cast<uint32_t *>(&slot->value));
uint32_t current = atomic_value->load(std::memory_order_acquire);
if (value < current) {
release_region_op(region);
return -EINVAL;
}
atomic_value->store(value, std::memory_order_release);
int rc = flush_host_range(region, slot, sizeof(*slot));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Atomic reinterpret casts =="
rg -n --type=cpp 'reinterpret_cast<\s*std::atomic<\s*uint32_t\s*>\s*\*>' 

echo
echo "== C++ standard declarations in build config =="
rg -n --type=cmake 'CMAKE_CXX_STANDARD|cxx_std_[0-9]+'

Repository: hw-native-sys/simpler

Length of output: 705


Fix signal-slot atomic semantics: UB from std::atomic reinterpret_cast + non-monotonic notify under contention
In src/common/host_device_comm/host_device_mapped_region.cpp (C++17), slot->value is accessed as uint32_t storage (const_cast<uint32_t*>(&slot->value)) but then reinterpreted as std::atomic<uint32_t>*, which is undefined behavior unless slot->value is actually an std::atomic<uint32_t> object. The load/check/store sequence also doesn’t enforce monotonic “max” updates under concurrent notifies; a stale load can overwrite a higher value. Same pattern appears again at ~464-465.

🐛 Proposed fix (defined atomics + CAS monotonic update)
-    auto *atomic_value = reinterpret_cast<std::atomic<uint32_t> *>(const_cast<uint32_t *>(&slot->value));
-    uint32_t current = atomic_value->load(std::memory_order_acquire);
-    if (value < current) {
-        release_region_op(region);
-        return -EINVAL;
-    }
-    atomic_value->store(value, std::memory_order_release);
+    auto *value_ptr = const_cast<uint32_t *>(&slot->value);
+    uint32_t current = __atomic_load_n(value_ptr, __ATOMIC_ACQUIRE);
+    if (value < current) {
+        release_region_op(region);
+        return -EINVAL;
+    }
+    while (value > current &&
+           !__atomic_compare_exchange_n(
+               value_ptr, &current, value, /*weak=*/true, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE
+           )) {
+    }
@@
-    auto *atomic_value = reinterpret_cast<std::atomic<uint32_t> *>(const_cast<uint32_t *>(&slot->value));
-    if (atomic_value->load(std::memory_order_acquire) >= target) {
+    auto *value_ptr = const_cast<uint32_t *>(&slot->value);
+    if (__atomic_load_n(value_ptr, __ATOMIC_ACQUIRE) >= target) {
@@
-        if (atomic_value->load(std::memory_order_acquire) >= target) {
+        if (__atomic_load_n(value_ptr, __ATOMIC_ACQUIRE) >= target) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/host_device_comm/host_device_mapped_region.cpp` around lines 430 -
437, The code currently reinterpret_casts slot->value to std::atomic<uint32_t>
(via atomic_value) which is UB; also the load/check/store can lose a higher
concurrent value. Fix by making the underlying storage a proper
std::atomic<uint32_t> (change the slot struct so value is
std::atomic<uint32_t>), then replace the load/check/store with a CAS loop that
performs a monotonic max update (use compare_exchange_weak/strong on the atomic
member so you only store the larger of current and value). Update both
occurrences (the current block using atomic_value and the other similar block
around ~464-465). Keep the release_region_op(region) and
flush_host_range(region, slot, sizeof(*slot)) calls in place after a successful
update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant