Feat: AICPU launch via dispatcher bootstrap and per-task rtsLaunchCpuKernel #537
Open
puddingfjz wants to merge 2 commits into
Code Review
This pull request introduces an AicpuLoader abstraction to support both legacy and new CANN 7.0+ interfaces for launching AICPU kernels across the a2a3 and a5 platforms. The implementation includes build system updates, runtime JSON descriptor generation, and integration into the DeviceRunner. Feedback focuses on improving build portability by avoiding hardcoded architecture paths and enhancing the robustness of manual JSON construction. Additionally, the removal of a default parameter in the a2a3 platform's header is identified as a breaking change that violates cross-platform consistency. Suggestions were also made to reduce coupling in the kernel name mapping.
puddingfjz
added a commit
to puddingfjz/simpler
that referenced
this pull request
Apr 13, 2026
- Revert hardcoded aarch64-linux path in CMakeLists.txt, use portable paths
- Restore default parameter for launch_aicpu_num in device_runner.h
- Add documentation explaining JSON construction and name_mapping design

The JSON construction uses manual string concatenation without a library. This is safe because kernel names are controlled strings without special characters, matching pypto's approach for similar AICPU op descriptors.

The name_mapping from opType to functionName is specific to the Ascend tile framework kernels and is unlikely to change.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
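The "controlled strings" argument above can be made checkable rather than assumed. The following is a minimal sketch, not the code in this PR: `is_plain_identifier` and `make_descriptor` are hypothetical helpers, and the JSON layout is illustrative only (the real descriptor schema is not shown in this conversation). Only the field names `kernelSo`, `functionName`, and the `opType` key come from the commit messages.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper: true if `s` is non-empty and uses only [A-Za-z0-9_.-],
// i.e. nothing that would need escaping inside a JSON string literal.
bool is_plain_identifier(const std::string& s) {
    return !s.empty() && std::all_of(s.begin(), s.end(), [](unsigned char c) {
        return std::isalnum(c) || c == '_' || c == '.' || c == '-';
    });
}

// Builds one descriptor entry by plain concatenation.
// ASSUMPTION: the nesting {opType: {kernelSo, functionName}} is illustrative.
std::optional<std::string> make_descriptor(const std::string& op_type,
                                           const std::string& kernel_so,
                                           const std::string& function_name) {
    if (!is_plain_identifier(op_type) || !is_plain_identifier(kernel_so) ||
        !is_plain_identifier(function_name)) {
        return std::nullopt;  // refuse rather than emit malformed JSON
    }
    return "{\"" + op_type + "\":{\"kernelSo\":\"" + kernel_so +
           "\",\"functionName\":\"" + function_name + "\"}}";
}
```

Rejecting unexpected characters up front keeps the concatenation approach while turning a silent malformed-JSON failure into an explicit one.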
5c35216 to
f30e69c
Compare
d4e918c to
3567417
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 21, 2026
…cher

Migrates host-side AICPU launches from Mode A (rtAicpuKernelLaunchExWithArgs) to Mode B (rtsBinaryLoadFromFile + rtsFuncGetByName + rtsLaunchCpuKernel), and removes the tar.gz / sudo pre-deployment step for the AICPU SO.

Bootstrap (one Mode A call per DeviceRunner)
============================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single rtAicpuKernelLaunchExWithArgs targeting CANN's preinstalled libaicpu_extend_kernels.so. libaicpu_extend_kernels writes the dispatcher to its own private path, dlopens it, dlsym's the three CANN contract symbols (Static + DynInit + Dyn) and invokes our DynInit. Our dispatcher Init reads the runtime SO bytes from the extended DeviceArgs (new fields inner_so_bin/inner_so_len at offsets 120/128, which libaicpu_extend_kernels ignores) and writes them to /usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so using sched-thread (HwHiAiUser) write permission. The dispatcher SO itself is never persisted to disk by our code; only the transient copy that libaicpu_extend_kernels dlopens exists.

Per-task launches (direct Mode B, no dispatcher hop)
====================================================
Host computes the same FNV-1a fingerprint locally, generates a JSON descriptor with kernelSo=simpler_inner_<fp>.so and functionName=simpler_aicpu_init / simpler_aicpu_exec (the runtime SO's actual exports), and calls rtsBinaryLoadFromFile + rtsFuncGetByName. LaunchBuiltInOp invokes the runtime SO's symbols directly via rtsLaunchCpuKernel, so there is no per-task dispatcher hop and the dispatcher SO is never referenced again.

Multi-runtime in one host process: each DeviceRunner bootstraps with the same dispatcher bytes + its own runtime SO bytes. The dispatcher upload path hits libaicpu_extend_kernels' firstCreatSo_ one-shot latch only once (subsequent calls reuse the cached dlopen, since the content fingerprint is the same); each runtime gets its own JSON registration with a unique opType (symbol_name + fingerprint suffix) so CANN's global op registry doesn't collide.

Reference: PR hw-native-sys#537.
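Since host and dispatcher must derive the same `simpler_inner_<fp>.so` name independently, the fingerprint has to be a pure function of the SO bytes. A minimal sketch of that naming step follows. The 64-bit FNV-1a variant and the 16-digit lowercase hex rendering are assumptions (the commit message only says "FNV-1a"), and `fnv1a64` / `inner_so_basename` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Standard 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
// ASSUMPTION: the PR may use a different width than 64 bits.
std::uint64_t fnv1a64(const unsigned char* data, std::size_t len) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;  // FNV 64-bit prime
    }
    return h;
}

// Renders the fingerprint as 16 hex digits and embeds it in the basename.
// ASSUMPTION: the hex formatting of <fp> is illustrative.
std::string inner_so_basename(const unsigned char* data, std::size_t len) {
    char buf[17];
    std::snprintf(buf, sizeof buf, "%016llx",
                  static_cast<unsigned long long>(fnv1a64(data, len)));
    return "simpler_inner_" + std::string(buf) + ".so";
}
```

Because the name depends only on content, two runtimes with different bytes land in different files, while identical bytes map to the same file.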
3567417 to
90e71ed
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 21, 2026
90e71ed to
7b9e506
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 21, 2026
7b9e506 to
b4dd9b1
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 21, 2026
b4dd9b1 to
bb65c0c
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 21, 2026
bb65c0c to
f173a99
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
Two-phase architecture for loading AICPU kernels on CANN 9.0+ without
tar.gz / sudo pre-deployment, and without per-task indirection through
the dispatcher SO.
Bootstrap (per-DeviceRunner, idempotent across instances in a process)
======================================================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single
rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC)
targeting CANN's preinstalled libaicpu_extend_kernels.so.
libaicpu_extend_kernels dlopens our dispatcher and invokes its Init;
the dispatcher reads the runtime SO bytes from extended DeviceArgs
(inner_so_bin/inner_so_len at offsets 120/128, which
libaicpu_extend_kernels ignores) and writes them to
/usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so
using sched-thread (HwHiAiUser) write permission. The dispatcher SO
itself is never written to the preinstall directory.
The runtime SO basename embeds an FNV-1a content fingerprint, so two
host processes uploading the same runtime SO produce the same file
(idempotent writes via atomic tmp+rename, no truncation window
visible to concurrent aicpu_scheduler readers). A process-level
fingerprint cache in LoadAicpuOp skips redundant
libaicpu_extend_kernels invocations within a single host process —
each runtime is bootstrapped at most once per process.
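The tmp+rename write described above can be sketched as follows. This is an illustration under stated assumptions, not the dispatcher's code: `atomic_write` is a hypothetical name, the `.tmp.<pid>` suffix is invented for the example, and the sketch omits `fsync`, which a durable implementation would likely want.

```cpp
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include <unistd.h>

// Writes `bytes` to a temporary sibling file, then renames it over `final_path`.
// POSIX rename() replaces the target atomically, so a concurrent reader sees
// either the old file or the complete new one, never a truncated file.
bool atomic_write(const std::string& final_path,
                  const std::vector<unsigned char>& bytes) {
    // ASSUMPTION: a pid-suffixed temp name keeps concurrent writers apart.
    const std::string tmp = final_path + ".tmp." + std::to_string(::getpid());
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(reinterpret_cast<const char*>(bytes.data()),
                  static_cast<std::streamsize>(bytes.size()));
        out.flush();
        if (!out) {
            std::remove(tmp.c_str());
            return false;
        }
    }  // stream closed here, before the rename
    if (std::rename(tmp.c_str(), final_path.c_str()) != 0) {
        std::remove(tmp.c_str());
        return false;
    }
    return true;
}
```

The temporary file must live on the same filesystem as the target for the rename to be atomic, which is why it is created next to `final_path` rather than in `/tmp`.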
Per-task launches (direct Mode A type 2, no dispatcher hop)
===========================================================
Host calls rtAicpuKernelLaunchExWithArgs with kernel_type =
KERNEL_TYPE_AICPU, so_name = "simpler_inner_<fp>.so",
kernel_name = "simpler_aicpu_init" / "simpler_aicpu_exec". The main
aicpu_scheduler dlopens the preinstall file on first invocation and
caches the handle; subsequent launches reuse it. No JSON descriptors,
no rtsBinaryLoadFromFile / rtsFuncGetByName lifecycle, no global op
registry, no per-launch handle bookkeeping.
Cleanup
=======
- Removes BUILD_WITH_NEW_CANN CMake option and all ifdef branches.
- Deletes the legacy AicpuLoader stub (src/{a2a3,a5}/platform/onboard/
host/aicpu_loader.{cpp,h}) — its only role was the OFF-path
fallback and nothing tested that path.
- Skips so_info_ allocation on the new path (the runtime SO no longer
reads device_args.aicpu_so_bin / aicpu_so_len). Saves ~inner-SO-size
device memory per DeviceRunner; previously this accumulated across
many ChipWorker/DeviceRunner instances and triggered AICORE OOM in
long test sessions.
- Widens the aicpu_op_timeout regression test to accept the new error
codes surfaced by Mode A type 2 (the dispatcher / main aicpu_scheduler
path can race the STARS watchdog and return 507018/507000 before the
AICore stream sync emits 507046).
Reference: PR hw-native-sys#537.
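The widened timeout check in the last Cleanup bullet amounts to accepting a small set of codes instead of one. A minimal sketch, assuming the test compares a plain integer return code; `is_accepted_timeout_code` and `kAcceptedTimeoutCodes` are hypothetical names, and only the three numeric codes come from the commit message.

```cpp
#include <algorithm>
#include <array>

// Codes the commit message says can surface for an AICPU op timeout:
// 507046 (original, from the AICore stream sync) plus 507018 / 507000
// (returned when the scheduler path races the STARS watchdog).
constexpr std::array<int, 3> kAcceptedTimeoutCodes = {507046, 507018, 507000};

// Hypothetical predicate a regression test could assert on.
bool is_accepted_timeout_code(int rc) {
    return std::find(kAcceptedTimeoutCodes.begin(), kAcceptedTimeoutCodes.end(),
                     rc) != kAcceptedTimeoutCodes.end();
}
```

Keeping the list explicit, rather than accepting any non-zero code, preserves the test's ability to catch unrelated failures.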
f173a99 to
473d8f6
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
473d8f6 to
2c220d3
Compare
ChaoWao
previously approved these changes
May 22, 2026
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
2c220d3 to
d2e91bf
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
d2e91bf to
13abbd9
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
13abbd9 to
123ca62
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
Two-phase architecture for loading AICPU kernels on CANN 9.0+ without
tar.gz / sudo pre-deployment.
Bootstrap (per-DeviceRunner, idempotent across instances in a process)
======================================================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single
rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC)
targeting CANN's preinstalled libaicpu_extend_kernels.so.
libaicpu_extend_kernels dlopens our dispatcher and invokes its Init;
the dispatcher reads the runtime SO bytes from extended DeviceArgs
(inner_so_bin/inner_so_len at offsets 120/128, which
libaicpu_extend_kernels ignores) and writes them to
/usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so
using sched-thread (HwHiAiUser) write permission. The dispatcher SO
itself is never written to the preinstall directory; only the transient
copy that libaicpu_extend_kernels dlopens exists.
The runtime SO basename embeds an FNV-1a content fingerprint. Writes
go via atomic tmp+rename inside the dispatcher — no truncation window
visible to concurrent aicpu_scheduler readers. A process-level
fingerprint cache in LoadAicpuOp skips redundant
libaicpu_extend_kernels invocations within a single host process —
each runtime is bootstrapped at most once per process.
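The process-level fingerprint cache can be pictured as a mutex-guarded set. This is a sketch under assumptions, not LoadAicpuOp's code: `BootstrapCache` and `claim` are hypothetical names, and a 64-bit fingerprint type is assumed.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_set>

// Tracks which runtime-SO fingerprints were already bootstrapped in this process.
class BootstrapCache {
public:
    // Returns true only for the first caller presenting `fp`;
    // later callers get false and can skip the bootstrap launch.
    bool claim(std::uint64_t fp) {
        std::lock_guard<std::mutex> lock(mu_);
        return seen_.insert(fp).second;
    }

private:
    std::mutex mu_;
    std::unordered_set<std::uint64_t> seen_;
};
```

One design caveat of this shape: the fingerprint is recorded before the bootstrap runs, so a real implementation would need to remove the entry (or record it only on success) if the bootstrap launch fails.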
Per-task launches (Mode B, no dispatcher hop)
=============================================
LoadAicpuOp.Init() JSON-registers the runtime SO via
rtsBinaryLoadFromFile (cpuKernelMode=0, kernelSo points at the
preinstall basename), then resolves simpler_aicpu_init and
simpler_aicpu_exec to rtFuncHandles via rtsFuncGetByName. JSON is
per-process (/tmp/simpler_inner_<fp>_<pid>.json) so concurrent
multi-chip / multi-worker tests don't race on a shared file. opType
is suffixed with the runtime SO's fingerprint so multiple LoadAicpuOp
instances in the same process register non-colliding entries even
though the underlying symbol names are identical.
Per-task launches call rtsLaunchCpuKernel on the cached rtFuncHandles
— no per-call string marshalling, no global op registry lookups, no
dispatcher hop.
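The two naming rules above (per-process JSON path, fingerprint-suffixed opType) can be sketched as plain string builders. `make_json_path` and `make_op_type` are hypothetical names; the path pattern follows the commit message, while the `_` separator between symbol name and fingerprint is an assumption.

```cpp
#include <string>

// Per-process descriptor path: /tmp/simpler_inner_<fp>_<pid>.json.
// Including the pid keeps concurrent workers from racing on one shared file.
std::string make_json_path(const std::string& fp_hex, long pid) {
    return "/tmp/simpler_inner_" + fp_hex + "_" + std::to_string(pid) + ".json";
}

// opType = symbol name + fingerprint suffix, so two runtimes exporting the
// same symbol register distinct entries.
// ASSUMPTION: the underscore separator is illustrative.
std::string make_op_type(const std::string& symbol, const std::string& fp_hex) {
    return symbol + "_" + fp_hex;
}
```

For example, with a hypothetical fingerprint `ab12` and pid 42, the builders yield `/tmp/simpler_inner_ab12_42.json` and `simpler_aicpu_exec_ab12`.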
Cleanup
=======
- Removes BUILD_WITH_NEW_CANN CMake option and all ifdef branches.
Mode B requires CANN 7.0+, which all supported targets ship.
- Deletes the legacy AicpuLoader stub
(src/{a2a3,a5}/platform/onboard/host/aicpu_loader.{cpp,h}).
- Widens the aicpu_op_timeout regression test to accept the
Mode B-surfaced error codes in addition to the original 507046.
Reference: PR hw-native-sys#537.
123ca62 to
f6defdb
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
f6defdb to
7db123c
Compare
ChaoWao
added a commit
to puddingfjz/simpler
that referenced
this pull request
May 22, 2026
Two-phase architecture for loading AICPU kernels on CANN 9.0+ without
tar.gz / sudo pre-deployment.
Bootstrap (per-DeviceRunner, idempotent across instances in a process)
======================================================================
Host bundles dispatcher SO bytes + runtime SO bytes into a single
rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC)
targeting CANN's preinstalled libaicpu_extend_kernels.so.
libaicpu_extend_kernels dlopens our dispatcher and invokes its Init;
the dispatcher reads the runtime SO bytes from extended DeviceArgs
(inner_so_bin/inner_so_len at offsets 120/128, which
libaicpu_extend_kernels ignores) and writes them to
/usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_<fp>.so
using sched-thread (HwHiAiUser) write permission. The dispatcher SO
itself is never written to the preinstall directory; it is only loaded
transiently through the libaicpu_extend_kernels dlopen.
The runtime SO basename embeds an FNV-1a content fingerprint. Writes
go via atomic tmp+rename inside the dispatcher — no truncation window
visible to concurrent aicpu_scheduler readers. A process-level
fingerprint cache in LoadAicpuOp skips redundant
libaicpu_extend_kernels invocations within a single host process —
each runtime is bootstrapped at most once per process.
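The tmp+rename write mentioned above follows a common pattern, sketched here under stated assumptions: the function name, the ".tmp" suffix, and the use of iostreams are illustrative choices, not the dispatcher's actual code. On POSIX systems rename() replaces the destination atomically when both paths are on the same filesystem, which is what keeps readers from observing a partially written file.

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Write bytes to a temporary sibling file, then rename it over the final
// path. Readers see either the old file or the complete new one.
inline bool write_file_atomically(const std::string& final_path,
                                  const std::vector<uint8_t>& bytes) {
    const std::string tmp_path = final_path + ".tmp";  // hypothetical naming
    {
        std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(reinterpret_cast<const char*>(bytes.data()),
                  static_cast<std::streamsize>(bytes.size()));
        out.flush();
        if (!out) {
            out.close();
            std::remove(tmp_path.c_str());  // do not leave a partial tmp file
            return false;
        }
    }  // stream closed here, before the rename
    if (std::rename(tmp_path.c_str(), final_path.c_str()) != 0) {
        std::remove(tmp_path.c_str());
        return false;
    }
    return true;
}
```

A production version would likely use a unique temporary name per writer and an fsync before the rename; both are omitted to keep the sketch short.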
Per-task launches (Mode B, no dispatcher hop)
=============================================
LoadAicpuOp.Init() JSON-registers the runtime SO via
rtsBinaryLoadFromFile (cpuKernelMode=0, kernelSo points at the
preinstall basename), then resolves simpler_aicpu_init and
simpler_aicpu_exec to rtFuncHandles via rtsFuncGetByName. JSON is
per-process (/tmp/simpler_inner_<fp>_<pid>.json) so concurrent
multi-chip / multi-worker tests don't race on a shared file. opType
is suffixed with the runtime SO's fingerprint so multiple LoadAicpuOp
instances in the same process register non-colliding entries even
though the underlying symbol names are identical.
Per-task launches call rtsLaunchCpuKernel on the cached rtFuncHandles
— no per-call string marshalling, no global op registry lookups, no
dispatcher hop.
Cleanup
=======
- Removes BUILD_WITH_NEW_CANN CMake option and all ifdef branches.
Mode B requires CANN 7.0+, which all supported targets ship.
- Deletes the legacy AicpuLoader stub
(src/{a2a3,a5}/platform/onboard/host/aicpu_loader.{cpp,h}).
- Widens the aicpu_op_timeout regression test to accept the
Mode B-surfaced error codes in addition to the original 507046.
Reference: PR hw-native-sys#537.
7db123c to
c2f96dd
Compare
hw-native-sys-bot
pushed a commit
to puddingfjz/simpler
that referenced
this pull request
May 25, 2026
521b4e1 to
832de93
Compare
hw-native-sys-bot
pushed a commit
to puddingfjz/simpler
that referenced
this pull request
May 26, 2026
Snapshot of in-progress review fixes against
puddingfjz/feat/issue-356-aicpu-launch-new-interface. Compilation is untested;
captured here so it can be picked up on a machine with hardware. See
HANDOFF_PR537_FIXES.md for the full status, the half-done Fix 2 task
list, and the validation checklist for the optional aggressive
simplification.
Pre-commit was bypassed on this WIP commit only because the local
clang-tidy hook tripped on a stale editable install in ~/.venv pointing
at a removed worktree (layer-a-hoist-executors). Unrelated to the diff.
The final commit must run hooks normally.
Completed (verified by reading; not built):
- Fix 1 fingerprint switched to elf_build_id_64 (both sides)
- Fix 3 docs aligned with Mode B (rtsBinaryLoadFromFile / etc.)
- Fix 4 delete dead src/common/{aicpu_dispatcher,host}/CMakeLists.txt
- Fix 5 drop unused runtime_name plumbing
- Fix 6 BootstrappedFps() guarded by std::mutex
- Fix 7 aicore_op_timeout parameterised by arch
- Fix 8 dispatcher Static/DynServer stubs return 0
- Fix 9 drop dead &DlogRecord != nullptr guard
- Fix 10 Init() failure paths via RAII guards
- Fix 11 GenerateAicpuOpJson closed-input comment
In progress (~60%) Fix 2 dispatcher_path explicit:
- RuntimeBinaries + Python + nanobind + simpler_init ABI: done
- DeviceRunner setter + BootstrapDispatcher byte signature + sim
parity + dladdr/SIMPLER_AICPU_BASENAME cleanup: TODO
hw-native-sys-bot
pushed a commit
to puddingfjz/simpler
that referenced
this pull request
May 26, 2026
…cher + drop dead AicpuSoInfo)
Builds on the WIP review snapshot to close out the remaining items and
remove the dead AICPU SO H2D path now that hardware validation confirms
no consumer of DeviceArgs.aicpu_so_bin/len remains.
Fix 2 — explicit dispatcher path
- BootstrapDispatcher signature switched to (bytes, len, ...); ReadFileBytes
helper removed.
- DeviceRunner (a2a3 + a5 onboard) gains set_dispatcher_binary +
dispatcher_so_binary_ member; resolve_dispatcher_so_path() / dladdr
inference and the SIMPLER_AICPU_BASENAME compile def are deleted.
- Sim simpler_init ABI accepts (dispatcher_binary, dispatcher_size) for
ABI parity, ignored on sim.
- ChipWorker.init in chip_worker.{h,cpp} + the nanobind binding +
python/simpler/task_interface.py thread dispatcher_path explicitly.
Per-arch dispatcher SO staging
- runtime_compiler.py::compile gains dispatcher_dest kwarg; the dispatcher
SO is now staged once per arch under build/lib/<arch>/dispatcher/ instead
of being copied next to every host_runtime.so. RuntimeBinaries.dispatcher_path
resolves to that shared location.
Bootstrap-only ownership of executor bytes
- ensure_binaries_loaded() releases dispatcher_so_binary_ and aicpu_so_binary_
via clear()+shrink_to_fit() after bootstrap so the steady state holds
only the aicore binary (needed by per-run rtRegisterAllKernel).
Drop AicpuSoInfo / DeviceArgs.aicpu_so_bin/len
- No consumer reads these in Mode B: our runtime AICPU SO doesn't
reference them, the dispatcher SO reads its own transient DeviceArgs
layout inside BootstrapDispatcher, and CANN's libaicpu_extend_kernels
is bypassed by rtsBinaryLoadFromFile.
- Validated on Ascend910 hardware: aicore_op_timeout (a2a3, a5),
paged_attention_unroll (a2a3 — HANDOFF flagged as canary), vector_add,
hello_worker, paged_attention_manual_scope all pass.
- DeviceArgs kept as a zero-filled 96-byte placeholder for layout
stability; KernelArgs.device_args points at it but no device-side
reader dereferences the fields.
Fix 7 — aicore_op_timeout regex
- Widen a2a3 to also accept 507018 / 507000: on this CANN 9.0 / Ascend910
the AICPU stream sync hits the AIC failure before AICore stream sync
does, surfacing 507018. The arch is not deterministic about which sync
wins the race, so the per-arch split was overly strict.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hw-native-sys-bot
pushed a commit
to puddingfjz/simpler
that referenced
this pull request
May 26, 2026
Consolidates the post-base review fixes into one commit on top of the
dispatcher-bootstrap base.
Dispatcher path is explicit
- ChipWorker.init resolves dispatcher_path from RuntimeBinaries and
threads the bytes through the simpler_init ABI as
(const uint8_t *, size_t). Previous dladdr-based sibling resolution and
the SIMPLER_AICPU_BASENAME compile def are gone. Sim simpler_init
accepts the params for ABI parity, ignored.
Per-arch dispatcher SO staging
- libsimpler_aicpu_dispatcher.so is built per-arch (a2a3, a5) and staged
once at build/lib/<arch>/dispatcher/. All runtimes on the same arch
share that copy — the dispatcher carries no runtime-specific code.
- runtime_compiler::compile gains a dispatcher_dest kwarg; runtime_builder
passes it when target == "aicpu". RuntimeBinaries.dispatcher_path
surfaces the shared path.
Bootstrap-only ownership of host bytes
- ensure_binaries_loaded() releases dispatcher_so_binary_ and
aicpu_so_binary_ via clear() + shrink_to_fit() after bootstrap.
Steady state holds only the aicore binary (per-run rtRegisterAllKernel
reads it) and the cached rtFuncHandles on LoadAicpuOp.
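A minimal sketch of the release step named above, with a hypothetical helper name. Note that shrink_to_fit() is a non-binding request in the C++ standard, so an implementation may keep some capacity; mainstream standard libraries do free the buffer of an empty vector.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: drop the contents of a byte buffer once it is no
// longer needed and ask the vector to give its storage back.
inline void release_bytes(std::vector<uint8_t>& bytes) {
    bytes.clear();          // size becomes 0, capacity is unchanged
    bytes.shrink_to_fit();  // non-binding request to reduce capacity to size
}
```

Where a guaranteed release matters, swapping with a default-constructed vector is the usual alternative idiom.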
Drop AicpuSoInfo + DeviceArgs.aicpu_so_bin/aicpu_so_len
- No consumer reads these under the current load path: the dispatcher SO
reads its own transient DeviceArgs layout in BootstrapDispatcher; our
runtime AICPU SO (simpler_aicpu_init/_exec) doesn't reference them;
CANN's libaicpu_extend_kernels is bypassed by rtsBinaryLoadFromFile.
- Validated on Ascend910: aicore_op_timeout (a2a3, a5),
paged_attention_unroll (a2a3), vector_add, hello_worker,
paged_attention_manual_scope all pass with the fields removed.
- DeviceArgs is kept as a zero-filled 96-byte placeholder for layout
stability; KernelArgs.device_args points at it but no device-side
reader dereferences its fields.
Fingerprint: ELF Build-ID
- elf_build_id_64 reads the first 8 bytes of .note.gnu.build-id with an
FNV-1a-over-full-buffer fallback. Host and dispatcher use the same
helper so both sides agree on the preinstall basename without any
other channel.
- Replaces the previous FNV-1a-over-first-64-bytes scheme, which could
collide on same-toolchain runtime SOs whose ELF headers + sizes
matched.
BootstrappedFps() concurrency
- The per-process fingerprint cache is guarded by std::mutex (check +
insert each locked; bootstrap body unlocked). Keeps concurrent
ChipWorker init across DeviceRunner instances correct without
serializing the heavy upload itself.
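The locking pattern described above (check and insert each under the lock, the bootstrap body outside it) might look like the following sketch. The class and method names are hypothetical and the bootstrap is modelled as a callback; this is not the project's BootstrappedFps() code.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_set>

// Process-wide record of fingerprints that have already been bootstrapped.
class FingerprintCache {
public:
    // Returns true if the fingerprint is bootstrapped after this call,
    // either because it was cached or because bootstrap() succeeded now.
    bool ensure_bootstrapped(uint64_t fp, const std::function<bool()>& bootstrap) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            if (done_.count(fp) != 0) return true;  // already bootstrapped
        }
        // The heavy work runs without the lock, so unrelated fingerprints
        // are not serialized behind it.
        if (!bootstrap()) return false;
        std::lock_guard<std::mutex> lock(mu_);
        done_.insert(fp);
        return true;
    }

private:
    std::mutex mu_;
    std::unordered_set<uint64_t> done_;
};
```

With this shape, two threads racing on the same fingerprint may both run the bootstrap once; that is tolerable only if the bootstrap is idempotent, which the atomic write of a content-addressed file would provide.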
LoadAicpuOp::Init RAII
- Failure paths use scope guards so an rtsFuncGetByName failure also
unloads the partially-registered binary handle and removes the
per-process JSON descriptor.
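A scope guard of the kind referred to above can be sketched in a few lines. ScopeGuard and init_with_cleanup are hypothetical names, and the two cleanup actions are stand-ins (counters) for unloading the binary handle and removing the JSON descriptor; whether the real Init() keeps the descriptor on success is not stated in the commit.

```cpp
#include <functional>
#include <utility>

// Runs the stored action on scope exit unless dismiss() was called first.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~ScopeGuard() {
        if (fn_) fn_();
    }
    void dismiss() { fn_ = nullptr; }

    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> fn_;
};

// Hypothetical init flow: resources acquired earlier are rolled back
// automatically if a later step (modelled by resolve_ok) fails.
inline bool init_with_cleanup(bool resolve_ok, int& cleanups) {
    ScopeGuard unload_binary([&cleanups] { ++cleanups; });  // stand-in for unloading the handle
    ScopeGuard remove_json([&cleanups] { ++cleanups; });    // stand-in for deleting the descriptor
    if (!resolve_ok) return false;  // both guards fire on this early return
    unload_binary.dismiss();        // success: keep what was acquired
    remove_json.dismiss();
    return true;
}
```

The benefit over manual cleanup is that every early return takes the same rollback path, so adding a new failure branch cannot forget a step.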
Dispatcher stubs return 0
- Static / DynServer stubs return success instead of failure: the
symbols are dlsym-probed by libaicpu_extend_kernels at load time but
never invoked in practice. Returning failure was a regression risk if
a future CANN release ever called them as a warm-up probe.
aicore_op_timeout regex widened
- a2a3 expected error-code set widened to 507(046|018|000). Which stream
sync sees the AIC failure first is timing-dependent across host AICore
vs AICPU sync, not arch-specific. Confirmed on Ascend910 — 507018 is
observed.
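For illustration only, the widened code set can be expressed as a pattern check like the one below. The function name and the sample message formats are assumptions; the actual regression test and its message text are not shown in this commit.

```cpp
#include <regex>
#include <string>

// Accept any of the three error codes named above anywhere in the message.
inline bool is_expected_timeout_error(const std::string& message) {
    static const std::regex kCodes("507(046|018|000)");
    return std::regex_search(message, kCodes);
}
```

An unanchored search like this would also match a longer number that merely contains one of the codes, so a stricter test might add word boundaries around the pattern.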
Misc cleanup
- Delete dead CMakeLists (src/common/{aicpu_dispatcher,host}/CMakeLists.txt).
- Drop unused runtime_name plumbing in runtime_compiler.
- Remove vestigial &DlogRecord != nullptr guard.
- Doc / comment alignment with the current load path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
996ce9a to
e723582
Compare
hw-native-sys-bot
pushed a commit
to puddingfjz/simpler
that referenced
this pull request
May 26, 2026
e723582 to
7e7260f
Compare
Consolidates the post-base review fixes into one commit on top of the
dispatcher-bootstrap base.
Dispatcher path is explicit
- ChipWorker.init resolves dispatcher_path from RuntimeBinaries and
threads the bytes through the simpler_init ABI as
(const uint8_t *, size_t). Previous dladdr-based sibling resolution and
the SIMPLER_AICPU_BASENAME compile def are gone. Sim simpler_init
accepts the params for ABI parity, ignored.
Per-arch dispatcher SO staging
- libsimpler_aicpu_dispatcher.so is built per-arch (a2a3, a5) and staged
once at build/lib/<arch>/dispatcher/. All runtimes on the same arch
share that copy — the dispatcher carries no runtime-specific code.
- runtime_compiler::compile gains a dispatcher_dest kwarg; runtime_builder
passes it when target == "aicpu". RuntimeBinaries.dispatcher_path
surfaces the shared path.
Bootstrap-only ownership of host bytes
- ensure_binaries_loaded() releases dispatcher_so_binary_ and
aicpu_so_binary_ via clear() + shrink_to_fit() after bootstrap.
Steady state holds only the aicore binary (per-run rtRegisterAllKernel
reads it) and the cached rtFuncHandles on LoadAicpuOp.
Fingerprint: ELF Build-ID
- elf_build_id_64 reads the first 8 bytes of .note.gnu.build-id with an
FNV-1a-over-full-buffer fallback. Host and dispatcher use the same
helper so both sides agree on the preinstall basename without any
other channel.
- Replaces the previous FNV-1a-over-first-64-bytes scheme, which could
collide on same-toolchain runtime SOs whose ELF headers + sizes
matched.
BootstrappedFps() concurrency
- The per-process fingerprint cache is guarded by std::mutex (check +
insert each locked; bootstrap body unlocked). Keeps concurrent
ChipWorker init across DeviceRunner instances correct without
serializing the heavy upload itself.
LoadAicpuOp::Init RAII
- Failure paths use scope guards so an rtsFuncGetByName failure also
unloads the partially-registered binary handle and removes the
per-process JSON descriptor.
Dispatcher stubs return 0
- Static / DynServer stubs return success instead of failure: the
symbols are dlsym-probed by libaicpu_extend_kernels at load time but
never invoked in practice. Returning failure was a regression risk if
a future CANN release ever called them as a warm-up probe.
aicore_op_timeout regex widened
- a2a3 expected error-code set widened to 507(046|018|000). Which stream
sync sees the AIC failure first is timing-dependent across host AICore
vs AICPU sync, not arch-specific. Confirmed on Ascend910 — 507018 is
observed.
Test fix: _ChipWorker.init signature
- tests/ut/py/test_chip_worker.py updated to pass the new dispatcher_path
argument (empty string for the negative-path tests, matching how sim
callers thread it).
Misc cleanup
- Delete dead CMakeLists (src/common/{aicpu_dispatcher,host}/CMakeLists.txt).
- Drop unused runtime_name plumbing in runtime_compiler.
- Remove vestigial &DlogRecord != nullptr guard.
- Doc / comment alignment with the current load path.
AicpuSoInfo + DeviceArgs.aicpu_so_bin/len: retained
- Initially dropped as apparent dead code (no consumer reads them under
this load path). a5 onboard CI regressed with 207001 AICore launch
failures + 507899 stream-create failures, matching the HANDOFF warning
about CI instability when these fields disappear. Kept for layout /
device-state stability; investigation tracked separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7e7260f to
56ac8af
Compare
Summary
AICPU kernel loading for CANN 9.0+: no tar.gz, no sudo, no pre-deployment.
Bootstrap (one-time per (process, device, runtime fingerprint))
Host bundles dispatcher SO bytes + runtime AICPU kernel SO bytes into a single rtAicpuKernelLaunchExWithArgs (kernel_type = KERNEL_TYPE_AICPU_KFC) targeting CANN's preinstalled libaicpu_extend_kernels.so. The dispatcher runs once on the AICPU sched thread (HwHiAiUser) and writes the runtime SO bytes to:
…
using sched-thread write permission. The dispatcher SO itself is never persisted to disk.
The runtime SO basename embeds an ELF Build-ID-derived 64-bit fingerprint (elf_build_id_64, with FNV-1a-over-full-buffer fallback when the SO was linked without -Wl,--build-id). Host and dispatcher compute the same fingerprint from the same bytes, so the preinstall basename is agreed without any other channel of communication. Writes go via atomic tmp+rename inside the dispatcher, so no truncation window is visible to concurrent aicpu_scheduler readers. A process-level mutex-protected fingerprint cache in LoadAicpuOp short-circuits redundant libaicpu_extend_kernels invocations across DeviceRunner instances.
Per-task launch
LoadAicpuOp::Init() JSON-registers the runtime SO via rtsBinaryLoadFromFile (cpuKernelMode=0, kernelSo points at the preinstall basename), then resolves simpler_aicpu_init and simpler_aicpu_exec to rtFuncHandles via rtsFuncGetByName. JSON is per-process (/tmp/simpler_inner_<fp>_<pid>.json) so concurrent multi-chip / multi-worker tests don't race on a shared file. opType is suffixed with the fingerprint so multiple LoadAicpuOp instances in the same process register non-colliding entries even though the underlying symbol names are identical across runtimes.
Per-task launches go through rtsLaunchCpuKernel on the cached rtFuncHandles: no per-call string marshalling, no global op registry lookups, no dispatcher hop.
Steady-state ownership
After bootstrap completes, dispatcher_so_binary_ and aicpu_so_binary_ are released on DeviceRunner. The state held per DeviceRunner in steady state is the cached rtFuncHandles + the aicore kernel binary (still needed by per-run rtRegisterAllKernel). AicpuSoInfo and the DeviceArgs::aicpu_so_bin / aicpu_so_len fields are gone; nothing in the dispatcher SO, our runtime AICPU SO, or libaicpu_extend_kernels reads them under this load path. DeviceArgs is kept as a 96-byte zero placeholder for layout stability with any device-side code that walks offsets within it.
Build layout
libsimpler_aicpu_dispatcher.so is built per-arch (a2a3, a5) and staged once at build/lib/<arch>/dispatcher/. All runtimes on the same arch share that copy; the dispatcher carries no runtime-specific code. RuntimeBinaries.dispatcher_path surfaces the path to ChipWorker.init, which threads the bytes through the simpler_init ABI explicitly (no dladdr-based sibling discovery).
Cleanup
- BUILD_WITH_NEW_CANN CMake option and all ifdef branches.
- AicpuLoader stub.
Fixes #356.