Add: L3/L2 host-device mapped region design#861
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the design document for HostDeviceMappedRegion, a low-level parent-to-child/NPU communication primitive in Simpler. The review feedback suggests improving the public ABI structure HostDeviceMappedRegionInfo by making implicit padding fields explicit to prevent potential security vulnerabilities and FFI errors, as well as updating the corresponding static assertions to validate these offsets.
| typedef struct { | ||
| uint64_t host_data_ptr; | ||
| uint64_t device_data_ptr; | ||
| uint64_t data_bytes; | ||
| uint64_t host_signal_ptr; | ||
| uint64_t device_signal_ptr; | ||
| uint32_t signal_count; | ||
| uint64_t total_bytes; | ||
| uint32_t flags; | ||
| } HostDeviceMappedRegionInfo; |
There was a problem hiding this comment.
The HostDeviceMappedRegionInfo structure contains implicit padding between signal_count (offset 40) and total_bytes (offset 48), as well as trailing padding after flags (offset 56) to align the structure to an 8-byte boundary (64 bytes total). Implicit padding in public ABI structures can lead to security vulnerabilities (such as uninitialized memory/information leakage) when copied or serialized directly across process boundaries. It can also make cross-language bindings (e.g., Python ctypes or Rust FFI) more error-prone. Adding explicit reserved fields ensures that padding is visible, documented, and can be explicitly zero-initialized.
| typedef struct { | |
| uint64_t host_data_ptr; | |
| uint64_t device_data_ptr; | |
| uint64_t data_bytes; | |
| uint64_t host_signal_ptr; | |
| uint64_t device_signal_ptr; | |
| uint32_t signal_count; | |
| uint64_t total_bytes; | |
| uint32_t flags; | |
| } HostDeviceMappedRegionInfo; | |
| typedef struct { | |
| uint64_t host_data_ptr; | |
| uint64_t device_data_ptr; | |
| uint64_t data_bytes; | |
| uint64_t host_signal_ptr; | |
| uint64_t device_signal_ptr; | |
| uint32_t signal_count; | |
| uint32_t reserved0; | |
| uint64_t total_bytes; | |
| uint32_t flags; | |
| uint32_t reserved1; | |
| } HostDeviceMappedRegionInfo; |
| static_assert(offsetof(HostDeviceMappedRegionInfo, host_data_ptr) == 0); | ||
| static_assert(offsetof(HostDeviceMappedRegionInfo, device_data_ptr) == 8); | ||
| static_assert(offsetof(HostDeviceMappedRegionInfo, data_bytes) == 16); | ||
| static_assert(offsetof(HostDeviceMappedRegionInfo, host_signal_ptr) == 24); | ||
| static_assert(offsetof(HostDeviceMappedRegionInfo, device_signal_ptr) == 32); | ||
| static_assert(offsetof(HostDeviceMappedRegionInfo, signal_count) == 40); | ||
| static_assert(offsetof(HostDeviceMappedRegionInfo, total_bytes) == 48); | ||
| static_assert(offsetof(HostDeviceMappedRegionInfo, flags) == 56); | ||
| static_assert(sizeof(HostDeviceMappedRegionInfo) == 64); |
There was a problem hiding this comment.
Update the static assertions for HostDeviceMappedRegionInfo to explicitly validate the offsets of the newly added reserved0 and reserved1 fields. This ensures the structure layout remains exactly 64 bytes with the expected alignment and no implicit padding.
| static_assert(offsetof(HostDeviceMappedRegionInfo, host_data_ptr) == 0); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, device_data_ptr) == 8); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, data_bytes) == 16); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, host_signal_ptr) == 24); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, device_signal_ptr) == 32); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, signal_count) == 40); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, total_bytes) == 48); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, flags) == 56); | |
| static_assert(sizeof(HostDeviceMappedRegionInfo) == 64); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, host_data_ptr) == 0); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, device_data_ptr) == 8); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, data_bytes) == 16); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, host_signal_ptr) == 24); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, device_signal_ptr) == 32); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, signal_count) == 40); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, reserved0) == 44); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, total_bytes) == 48); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, flags) == 56); | |
| static_assert(offsetof(HostDeviceMappedRegionInfo, reserved1) == 60); | |
| static_assert(sizeof(HostDeviceMappedRegionInfo) == 64); |
- Add HostDeviceMappedRegion ABI, common implementation, and sim/onboard backends - Expose mapped-region lifecycle, datacopy, notify, and wait through ChipWorker and Worker - Add L3 control-plane proxying, round-trip examples, UT coverage, and user documentation
📝 WalkthroughWalkthroughThis PR introduces ChangesHostDeviceMappedRegion Communication
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
|
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
src/a2a3/platform/onboard/host/host_device_mapped_region_onboard.h (1)
17-20: ⚡ Quick winAdd documentation for the public allocation API.
The function declaration lacks documentation describing its parameters, return values, and usage contract. Public APIs should document their behavior, especially error conditions.
📝 Suggested documentation
+/** + * Allocate a host/device mapped region on A2/A3 onboard platforms. + * + * `@param` ctx Device context handle. + * `@param` total_bytes Total size in bytes (data + signals). + * `@param` platform Output platform-specific handle and release callback. + * `@param` host_base Output host-visible base pointer. + * `@param` device_base Output device-visible base pointer. + * `@return` 0 on success, negative error code on failure. + */ int a2a3_onboard_host_device_mapped_region_allocate(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/onboard/host/host_device_mapped_region_onboard.h` around lines 17 - 20, The declaration of a2a3_onboard_host_device_mapped_region_allocate is missing a public doc comment; add a clear header comment above the function that documents each parameter (DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base, void **device_base), the return value meaning (e.g., 0 on success, negative error codes on failure), the ownership and lifetime contract for host_base/device_base and platform, expected alignment/size constraints, and explicit error conditions (out-of-memory, invalid ctx, unsupported device) and side effects so callers know how to use and clean up the allocated mapped region.src/common/host_device_comm/host_device_mapped_region_sim.h (1)
17-20: ⚡ Quick winAdd documentation for the simulation allocation API.
The function declaration lacks documentation describing its parameters, return values, and usage contract.
📝 Suggested documentation
+/** + * Allocate a host/device mapped region in simulation mode. + * + * `@param` ctx Device context handle (unused in sim). + * `@param` total_bytes Total size in bytes (data + signals). + * `@param` platform Output platform-specific handle and release callback. + * `@param` host_base Output host-visible base pointer. + * `@param` device_base Output device-visible base pointer (same as host_base in sim). + * `@return` 0 on success, negative error code on failure. + */ int host_device_mapped_region_allocate_sim(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/common/host_device_comm/host_device_mapped_region_sim.h` around lines 17 - 20, Add a clear doc comment above the function declaration host_device_mapped_region_allocate_sim explaining the purpose, detailed parameter semantics for ctx (DeviceContextHandle), total_bytes, platform (HostDeviceMappedRegionPlatform*), and output pointers host_base and device_base, describe the return values/errno or error codes and ownership/lifetime responsibilities (who frees the region and alignment/size constraints), and any thread-safety or simulation-specific behavior/requirements so callers know how to use and release the allocated mapped region.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/aiv/host_device_mapped_region_round_trip.cpp`:
- Around line 81-99: The poll loop using signal0 (kMaxPollIters) may time out
but the code still writes transformed output and sets signal1 to seq, hiding
failures; modify the tail of the function to check the observed flag and if
observed is false then do not publish the success seq: either return immediately
(bail out) before transforming/writing output, or write a distinct error
sentinel to *signal1 (e.g., a reserved value different from seq) and call
flush_range(signal1, kCacheLineBytes) so the host can detect the failure; ensure
any writes to data/nbytes and flush_range(data + nbytes, nbytes) only occur when
observed is true and keep references to signal0, signal1, seq, data, nbytes,
kMaxPollIters and kCacheLineBytes to locate the changes.
In `@python/bindings/task_interface.cpp`:
- Around line 72-76: The function make_mapped_region_info currently copies raw
host pointers into the Python-visible MappedRegionInfo (specifically
info.host_data_ptr and info.host_signal_ptr); change it to redact/mask those
host pointer fields (e.g., set to 0 or a sentinel) before constructing and
returning the MappedRegionInfo so raw host VAs are not exposed to Python; apply
the same masking to any other places that construct MappedRegionInfo from
HostDeviceMappedRegionInfo (the other similar construction sites referenced in
the review) to ensure host_data_ptr and host_signal_ptr are always
zeroed/redacted in Python-facing structures.
In `@src/a2a3/platform/onboard/host/device_runner.cpp`:
- Around line 507-515: The lazy init of the HAL (shared g_hal_handle) in
device_runner.cpp is not thread-safe—wrap the initialization path so only one
thread performs dlopen/publish and others wait; protect the call sequence around
load_hal_if_needed() and subsequent get_halHostRegister() with a std::once_flag
(or a mutex) to ensure load_hal_if_needed() is executed exactly once and
g_hal_handle is safely published; update/introduce a static std::once_flag (or a
static std::mutex) in the translation unit and call std::call_once (or lock the
mutex) before calling load_hal_if_needed(), then check/get HalHostRegisterFn as
before and propagate errors unchanged.
In `@src/a2a3/platform/onboard/host/device_runner.h`:
- Around line 254-256: Add clear doc comments for the two cache maintenance
APIs: document flush_host_cache_range(void *host_ptr, size_t bytes) and
invalidate_host_cache_range(void *host_ptr, size_t bytes) by describing each
parameter (host_ptr = start address in host memory, bytes = length to operate
on), the return contract (e.g., 0 on success, negative error codes on failure),
behavioral semantics (what “flush” does vs “invalidate” and whether the
operation is inclusive/exclusive of bounds), preconditions/requirements
(alignment or page boundaries, valid pointer lifetime), side effects/visibility
(synchronization guarantees with device access and whether caller must
fence/serialize), and thread-safety/usage notes (concurrent callers, recommended
use cases). Ensure the comments sit immediately above the declarations for
flush_host_cache_range and invalidate_host_cache_range so callers see the
contract.
In `@src/common/host_device_comm/host_device_mapped_region_sim.cpp`:
- Around line 18-21: In host_device_mapped_region_allocate_sim, add input
validation to ensure output pointer arguments platform, host_base, and
device_base are non-null before any dereference; if any are null, return a
failure code (e.g., -EINVAL or -1) immediately and do not proceed with
allocation or writes, otherwise continue as before. Ensure you reference the
function host_device_mapped_region_allocate_sim and check the three pointer
parameters (platform, host_base, device_base) at the top of the function to
prevent undefined behavior.
In `@src/common/host_device_comm/host_device_mapped_region.cpp`:
- Around line 430-437: The code currently reinterpret_casts slot->value to
std::atomic<uint32_t> (via atomic_value) which is UB; also the load/check/store
can lose a higher concurrent value. Fix by making the underlying storage a
proper std::atomic<uint32_t> (change the slot struct so value is
std::atomic<uint32_t>), then replace the load/check/store with a CAS loop that
performs a monotonic max update (use compare_exchange_weak/strong on the atomic
member so you only store the larger of current and value). Update both
occurrences (the current block using atomic_value and the other similar block
around ~464-465). Keep the release_region_op(region) and
flush_host_range(region, slot, sizeof(*slot)) calls in place after a successful
update.
---
Nitpick comments:
In `@src/a2a3/platform/onboard/host/host_device_mapped_region_onboard.h`:
- Around line 17-20: The declaration of
a2a3_onboard_host_device_mapped_region_allocate is missing a public doc comment;
add a clear header comment above the function that documents each parameter
(DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform
*platform, void **host_base, void **device_base), the return value meaning
(e.g., 0 on success, negative error codes on failure), the ownership and
lifetime contract for host_base/device_base and platform, expected
alignment/size constraints, and explicit error conditions (out-of-memory,
invalid ctx, unsupported device) and side effects so callers know how to use and
clean up the allocated mapped region.
In `@src/common/host_device_comm/host_device_mapped_region_sim.h`:
- Around line 17-20: Add a clear doc comment above the function declaration
host_device_mapped_region_allocate_sim explaining the purpose, detailed
parameter semantics for ctx (DeviceContextHandle), total_bytes, platform
(HostDeviceMappedRegionPlatform*), and output pointers host_base and
device_base, describe the return values/errno or error codes and
ownership/lifetime responsibilities (who frees the region and alignment/size
constraints), and any thread-safety or simulation-specific behavior/requirements
so callers know how to use and release the allocated mapped region.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8a5636fd-d12b-4b0b-af8b-1a7771e8b3ab
📒 Files selected for processing (40)
docs/L3-L2-host-device-communication.mddocs/dynamic-linking.mddocs/worker-manager.mdexamples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/aiv/host_device_mapped_region_round_trip.cppexamples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/orchestration/host_device_mapped_region_round_trip_orch.cppexamples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/main.pyexamples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/test_mapped_region_round_trip.pypython/bindings/task_interface.cpppython/bindings/worker_bind.hpython/simpler/task_interface.pypython/simpler/worker.pysrc/a2a3/platform/onboard/host/CMakeLists.txtsrc/a2a3/platform/onboard/host/device_runner.cppsrc/a2a3/platform/onboard/host/device_runner.hsrc/a2a3/platform/onboard/host/host_device_mapped_region_onboard.cppsrc/a2a3/platform/onboard/host/host_device_mapped_region_onboard.hsrc/a2a3/platform/onboard/host/pto_runtime_c_api.cppsrc/a2a3/platform/sim/host/CMakeLists.txtsrc/a2a3/platform/sim/host/pto_runtime_c_api.cppsrc/a5/platform/onboard/host/CMakeLists.txtsrc/a5/platform/onboard/host/pto_runtime_c_api.cppsrc/a5/platform/sim/host/CMakeLists.txtsrc/a5/platform/sim/host/pto_runtime_c_api.cppsrc/common/hierarchical/worker.hsrc/common/hierarchical/worker_manager.cppsrc/common/hierarchical/worker_manager.hsrc/common/host_device_comm/host_device_mapped_region.cppsrc/common/host_device_comm/host_device_mapped_region.hsrc/common/host_device_comm/host_device_mapped_region_sim.cppsrc/common/host_device_comm/host_device_mapped_region_sim.hsrc/common/worker/chip_worker.cppsrc/common/worker/chip_worker.hsrc/common/worker/pto_runtime_c_api.htests/ut/cpp/CMakeLists.txttests/ut/cpp/common/test_host_device_mapped_region.cpptests/ut/cpp/hierarchical/test_mailbox_control_layout.cpptests/ut/py/test_chip_worker.pytests/ut/py/test_worker/test_mapped_region_hw.pytests/ut/py/test_worker/test_mapped_region_round_trip_hw.pytests/ut/py/test_worker/test_mapped_region_sim.py
| bool observed = false; | ||
| for (uint32_t i = 0; i < kMaxPollIters; ++i) { | ||
| invalidate_range(signal0, kCacheLineBytes); | ||
| if (*signal0 >= seq) { | ||
| observed = true; | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| invalidate_range(data, nbytes); | ||
| for (uint32_t i = 0; i < nbytes; ++i) { | ||
| uint8_t mask = observed ? static_cast<uint8_t>(seq + i * 3U) : static_cast<uint8_t>(0xA5U); | ||
| data[nbytes + i] = static_cast<uint8_t>(data[i] ^ mask); | ||
| } | ||
| flush_range(data + nbytes, nbytes); | ||
|
|
||
| *signal1 = seq; | ||
| flush_range(signal1, kCacheLineBytes); | ||
| } |
There was a problem hiding this comment.
Don't signal success after the poll loop times out.
If signal0 never reaches seq, this still writes transformed output and then publishes signal1 = seq. That hides real notify/cache failures and can make the host accept corrupted data as a successful round trip. Bail out, or publish a distinct error sentinel, instead of reusing the success signal.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a2a3/tensormap_and_ringbuffer/host_device_mapped_region_round_trip/kernels/aiv/host_device_mapped_region_round_trip.cpp`
around lines 81 - 99, The poll loop using signal0 (kMaxPollIters) may time out
but the code still writes transformed output and sets signal1 to seq, hiding
failures; modify the tail of the function to check the observed flag and if
observed is false then do not publish the success seq: either return immediately
(bail out) before transforming/writing output, or write a distinct error
sentinel to *signal1 (e.g., a reserved value different from seq) and call
flush_range(signal1, kCacheLineBytes) so the host can detect the failure; ensure
any writes to data/nbytes and flush_range(data + nbytes, nbytes) only occur when
observed is true and keep references to signal0, signal1, seq, data, nbytes,
kMaxPollIters and kCacheLineBytes to locate the changes.
| MappedRegionInfo make_mapped_region_info(const HostDeviceMappedRegionInfo &info) { | ||
| return MappedRegionInfo{ | ||
| info.host_data_ptr, info.device_data_ptr, info.data_bytes, info.host_signal_ptr, | ||
| info.device_signal_ptr, info.signal_count, info.total_bytes, info.flags, | ||
| }; |
There was a problem hiding this comment.
Mask host pointers before returning MappedRegionInfo to Python.
This currently copies host_data_ptr and host_signal_ptr straight into a Python-visible object. The mapped-region design for this PR says public Python info should mask host pointers, and exposing raw host VAs leaks process layout information. Zero or redact those fields in make_mapped_region_info() before constructing the Python wrapper.
Proposed fix
MappedRegionInfo make_mapped_region_info(const HostDeviceMappedRegionInfo &info) {
return MappedRegionInfo{
- info.host_data_ptr, info.device_data_ptr, info.data_bytes, info.host_signal_ptr,
+ 0, info.device_data_ptr, info.data_bytes, 0,
info.device_signal_ptr, info.signal_count, info.total_bytes, info.flags,
};
}Also applies to: 759-763, 882-885
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/bindings/task_interface.cpp` around lines 72 - 76, The function
make_mapped_region_info currently copies raw host pointers into the
Python-visible MappedRegionInfo (specifically info.host_data_ptr and
info.host_signal_ptr); change it to redact/mask those host pointer fields (e.g.,
set to 0 or a sentinel) before constructing and returning the MappedRegionInfo
so raw host VAs are not exposed to Python; apply the same masking to any other
places that construct MappedRegionInfo from HostDeviceMappedRegionInfo (the
other similar construction sites referenced in the review) to ensure
host_data_ptr and host_signal_ptr are always zeroed/redacted in Python-facing
structures.
| if (load_hal_if_needed() != 0) { | ||
| LOG_ERROR("Failed to load ascend_hal for mapped region: %s", dlerror()); | ||
| return -1; | ||
| } | ||
| HalHostRegisterFn fn = get_halHostRegister(); | ||
| if (fn == nullptr) { | ||
| LOG_ERROR("halHostRegister symbol not found: %s", dlerror()); | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
Make the HAL lazy init thread-safe.
These new mapped-region entrypoints funnel through the shared g_hal_handle without synchronization. Two threads hitting first-use here can race on the dlopen/publish path, which is undefined behavior in C++. Please guard the lazy init with std::once_flag or a mutex.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a2a3/platform/onboard/host/device_runner.cpp` around lines 507 - 515, The
lazy init of the HAL (shared g_hal_handle) in device_runner.cpp is not
thread-safe—wrap the initialization path so only one thread performs
dlopen/publish and others wait; protect the call sequence around
load_hal_if_needed() and subsequent get_halHostRegister() with a std::once_flag
(or a mutex) to ensure load_hal_if_needed() is executed exactly once and
g_hal_handle is safely published; update/introduce a static std::once_flag (or a
static std::mutex) in the translation unit and call std::call_once (or lock the
mutex) before calling load_hal_if_needed(), then check/get HalHostRegisterFn as
before and propagate errors unchanged.
| int flush_host_cache_range(void *host_ptr, size_t bytes); | ||
|
|
||
| int invalidate_host_cache_range(void *host_ptr, size_t bytes); |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add documentation for cache maintenance methods.
The flush_host_cache_range and invalidate_host_cache_range methods lack documentation, unlike the other methods added in this section. Public APIs should document their parameters, return values, and usage contract.
📝 Suggested documentation
+ /**
+ * Flush host cache range to ensure device visibility of host writes.
+ *
+ * `@param` host_ptr Host mapping pointer.
+ * `@param` bytes Range size in bytes.
+ * `@return` 0 on success, non-zero HAL/runtime error on failure.
+ */
int flush_host_cache_range(void *host_ptr, size_t bytes);
+ /**
+ * Invalidate host cache range to ensure host visibility of device writes.
+ *
+ * `@param` host_ptr Host mapping pointer.
+ * `@param` bytes Range size in bytes.
+ * `@return` 0 on success, non-zero HAL/runtime error on failure.
+ */
int invalidate_host_cache_range(void *host_ptr, size_t bytes);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| int flush_host_cache_range(void *host_ptr, size_t bytes); | |
| int invalidate_host_cache_range(void *host_ptr, size_t bytes); | |
| /** | |
| * Flush host cache range to ensure device visibility of host writes. | |
| * | |
| * `@param` host_ptr Host mapping pointer. | |
| * `@param` bytes Range size in bytes. | |
| * `@return` 0 on success, non-zero HAL/runtime error on failure. | |
| */ | |
| int flush_host_cache_range(void *host_ptr, size_t bytes); | |
| /** | |
| * Invalidate host cache range to ensure host visibility of device writes. | |
| * | |
| * `@param` host_ptr Host mapping pointer. | |
| * `@param` bytes Range size in bytes. | |
| * `@return` 0 on success, non-zero HAL/runtime error on failure. | |
| */ | |
| int invalidate_host_cache_range(void *host_ptr, size_t bytes); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a2a3/platform/onboard/host/device_runner.h` around lines 254 - 256, Add
clear doc comments for the two cache maintenance APIs: document
flush_host_cache_range(void *host_ptr, size_t bytes) and
invalidate_host_cache_range(void *host_ptr, size_t bytes) by describing each
parameter (host_ptr = start address in host memory, bytes = length to operate
on), the return contract (e.g., 0 on success, negative error codes on failure),
behavioral semantics (what “flush” does vs “invalidate” and whether the
operation is inclusive/exclusive of bounds), preconditions/requirements
(alignment or page boundaries, valid pointer lifetime), side effects/visibility
(synchronization guarantees with device access and whether caller must
fence/serialize), and thread-safety/usage notes (concurrent callers, recommended
use cases). Ensure the comments sit immediately above the declarations for
flush_host_cache_range and invalidate_host_cache_range so callers see the
contract.
| int host_device_mapped_region_allocate_sim( | ||
| DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base, | ||
| void **device_base | ||
| ) { |
There was a problem hiding this comment.
Add input validation for output parameters.
The function does not validate that output pointers (platform, host_base, device_base) are non-null before dereferencing them. If a caller passes null, this will cause undefined behavior.
🛡️ Proposed fix to add null checks
int host_device_mapped_region_allocate_sim(
DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base,
void **device_base
) {
(void)ctx;
+ if (platform == nullptr || host_base == nullptr || device_base == nullptr) {
+ return -EINVAL;
+ }
+ if (total_bytes == 0) {
+ return -EINVAL;
+ }
void *ptr = nullptr;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| int host_device_mapped_region_allocate_sim( | |
| DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base, | |
| void **device_base | |
| ) { | |
| int host_device_mapped_region_allocate_sim( | |
| DeviceContextHandle ctx, uint64_t total_bytes, HostDeviceMappedRegionPlatform *platform, void **host_base, | |
| void **device_base | |
| ) { | |
| (void)ctx; | |
| if (platform == nullptr || host_base == nullptr || device_base == nullptr) { | |
| return -EINVAL; | |
| } | |
| if (total_bytes == 0) { | |
| return -EINVAL; | |
| } | |
| void *ptr = nullptr; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/common/host_device_comm/host_device_mapped_region_sim.cpp` around lines
18 - 21, In host_device_mapped_region_allocate_sim, add input validation to
ensure output pointer arguments platform, host_base, and device_base are
non-null before any dereference; if any are null, return a failure code (e.g.,
-EINVAL or -1) immediately and do not proceed with allocation or writes,
otherwise continue as before. Ensure you reference the function
host_device_mapped_region_allocate_sim and check the three pointer parameters
(platform, host_base, device_base) at the top of the function to prevent
undefined behavior.
| auto *atomic_value = reinterpret_cast<std::atomic<uint32_t> *>(const_cast<uint32_t *>(&slot->value)); | ||
| uint32_t current = atomic_value->load(std::memory_order_acquire); | ||
| if (value < current) { | ||
| release_region_op(region); | ||
| return -EINVAL; | ||
| } | ||
| atomic_value->store(value, std::memory_order_release); | ||
| int rc = flush_host_range(region, slot, sizeof(*slot)); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Atomic reinterpret casts =="
rg -n --type=cpp 'reinterpret_cast<\s*std::atomic<\s*uint32_t\s*>\s*\*>'
echo
echo "== C++ standard declarations in build config =="
rg -n --type=cmake 'CMAKE_CXX_STANDARD|cxx_std_[0-9]+'Repository: hw-native-sys/simpler
Length of output: 705
Fix signal-slot atomic semantics: UB from std::atomic reinterpret_cast + non-monotonic notify under contention
In src/common/host_device_comm/host_device_mapped_region.cpp (C++17), slot->value is accessed as uint32_t storage (const_cast<uint32_t*>(&slot->value)) but then reinterpreted as std::atomic<uint32_t>*, which is undefined behavior unless slot->value is actually an std::atomic<uint32_t> object. The load/check/store sequence also doesn’t enforce monotonic “max” updates under concurrent notifies; a stale load can overwrite a higher value. Same pattern appears again at ~464-465.
🐛 Proposed fix (defined atomics + CAS monotonic update)
- auto *atomic_value = reinterpret_cast<std::atomic<uint32_t> *>(const_cast<uint32_t *>(&slot->value));
- uint32_t current = atomic_value->load(std::memory_order_acquire);
- if (value < current) {
- release_region_op(region);
- return -EINVAL;
- }
- atomic_value->store(value, std::memory_order_release);
+ auto *value_ptr = const_cast<uint32_t *>(&slot->value);
+ uint32_t current = __atomic_load_n(value_ptr, __ATOMIC_ACQUIRE);
+ if (value < current) {
+ release_region_op(region);
+ return -EINVAL;
+ }
+ while (value > current &&
+ !__atomic_compare_exchange_n(
+ value_ptr, ¤t, value, /*weak=*/true, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE
+ )) {
+ }
@@
- auto *atomic_value = reinterpret_cast<std::atomic<uint32_t> *>(const_cast<uint32_t *>(&slot->value));
- if (atomic_value->load(std::memory_order_acquire) >= target) {
+ auto *value_ptr = const_cast<uint32_t *>(&slot->value);
+ if (__atomic_load_n(value_ptr, __ATOMIC_ACQUIRE) >= target) {
@@
- if (atomic_value->load(std::memory_order_acquire) >= target) {
+ if (__atomic_load_n(value_ptr, __ATOMIC_ACQUIRE) >= target) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/common/host_device_comm/host_device_mapped_region.cpp` around lines 430 -
437, The code currently reinterpret_casts slot->value to std::atomic<uint32_t>
(via atomic_value) which is UB; also the load/check/store can lose a higher
concurrent value. Fix by making the underlying storage a proper
std::atomic<uint32_t> (change the slot struct so value is
std::atomic<uint32_t>), then replace the load/check/store with a CAS loop that
performs a monotonic max update (use compare_exchange_weak/strong on the atomic
member so you only store the larger of current and value). Update both
occurrences (the current block using atomic_value and the other similar block
around ~464-465). Keep the release_region_op(region) and
flush_host_range(region, slot, sizeof(*slot)) calls in place after a successful
update.
Summary
HostDeviceMappedRegionprimitive.Backend Notes
halHostRegisterso the owner process can initialize and access the region through a host mapping while kernels receive device-visible pointers.-ENOTSUPstubs rather than missing ABI symbols.