feat(intrinsics): add KI.vload / KI.vstore! for wide vector memory operations by shreyas-omkar · Pull Request #719 · JuliaGPU/KernelAbstractions.jl

shreyas-omkar · 2026-06-30T12:41:40Z

Adds KernelIntrinsics.vload and KernelIntrinsics.vstore!

- CPU (Array{T}): casts pointer(arr, idx) to
  Ptr{NTuple{N,VecElement{T}}} and calls unsafe_load/unsafe_store!.
  Julia/LLVM lowers NTuple{N,VecElement{T}} to <N x T>, emitting one wide
  op (e.g. movups / vmovaps on x86).
- GPU: each backend registers a @device_override for its concrete device
  array type that reinterprets the LLVMPtr with N*sizeof(T) alignment.
  POCL/SPIR-V override is included here (CLDeviceArray in
  src/pocl/backend.jl); CUDA.jl and AMDGPU.jl can add analogous overrides
  in their own KA extensions.
- Fallback: N == 1, non-primitive T, or any array without a registered
  override falls back to N scalar accesses via a @generated function (avoids
  closures, which GPU compilers reject).
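
For readers unfamiliar with the CPU path, here is a minimal sketch of the pointer-cast technique described above. The names `wide_load` and `wide_store!` are hypothetical and are not the helpers in this PR; the sketch assumes a primitive element type and a power-of-two `N`, and it adds an explicit bounds check that the real implementation may not have.

```julia
# Hypothetical helper: load N consecutive elements starting at `idx`
# through a Ptr{NTuple{N,VecElement{T}}} cast.
function wide_load(arr::Array{T}, idx::Integer, ::Val{N}) where {T, N}
    checkbounds(arr, idx:(idx + N - 1))
    v = GC.@preserve arr begin
        p = reinterpret(Ptr{NTuple{N, VecElement{T}}}, pointer(arr, idx))
        unsafe_load(p)  # Base's unsafe_load does not assume vector alignment
    end
    return map(x -> x.value, v)  # unwrap VecElement{T} into plain T
end

# Hypothetical helper: store an N-tuple at `idx` through the same cast.
function wide_store!(arr::Array{T}, vals::NTuple{N, T}, idx::Integer) where {T, N}
    checkbounds(arr, idx:(idx + N - 1))
    GC.@preserve arr begin
        p = reinterpret(Ptr{NTuple{N, VecElement{T}}}, pointer(arr, idx))
        unsafe_store!(p, map(VecElement, vals))
    end
    return nothing
end

a = Float32.(1:8)
loaded = wide_load(a, 1, Val(4))
wide_store!(a, (9.0f0, 10.0f0, 11.0f0, 12.0f0), 5)
```

`GC.@preserve` keeps the array rooted while the raw pointer is in use. Whether the compiler actually emits a single wide instruction depends on the target and optimization level; inspecting `@code_llvm` or `@code_native` output is the way to confirm.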

Benchmark (POCLBackend, Float32, cyclops)

Size	N	Scalar	vload	Speedup
256K (L2)	4	10.59 GB/s	11.58 GB/s	1.09×
256K (L2)	8	11.22 GB/s	13.03 GB/s	1.16×
4M (L3)	2	30.12 GB/s	39.94 GB/s	1.33×
4M (L3)	8	37.46 GB/s	49.57 GB/s	1.32×
16M (DRAM)	4	29.89 GB/s	30.00 GB/s	1.00×

Speedup is visible in L2/L3-resident workloads where instruction throughput is
the bottleneck; DRAM-bound workloads are flat as expected.
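
The Speedup column is the ratio of vload to scalar bandwidth. A short check of that arithmetic, using only the figures from the table above (the variable names are illustrative):

```julia
# Rows copied from the benchmark table: (size, N, scalar GB/s, vload GB/s).
rows = [
    ("256K (L2)",  4, 10.59, 11.58),
    ("256K (L2)",  8, 11.22, 13.03),
    ("4M (L3)",    2, 30.12, 39.94),
    ("4M (L3)",    8, 37.46, 49.57),
    ("16M (DRAM)", 4, 29.89, 30.00),
]

# Speedup = vectorised bandwidth / scalar bandwidth, rounded to two decimals.
speedup(scalar, wide) = round(wide / scalar; digits = 2)

speedups = [speedup(s, w) for (_, _, s, w) in rows]

for ((label, n, _, _), sp) in zip(rows, speedups)
    println(rpad(label, 12), "N=", n, "  ", sp, "×")
end
```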


…ions

Implements vectorised load/store for AbstractArray across all KA backends:
- CPU (Ptr path): reinterpret to Ptr{NTuple{N,VecElement{T}}} + unsafe_load/store!
- GPU (LLVMPtr path): reinterpret to LLVMPtr{NTuple{N,VecElement{T}},AS} + unsafe_load/store! with N*sizeof(T) alignment; lowers to ld.global.v4 (CUDA), global_load_dwordx4 (AMDGPU), or OpLoad <N x T> (SPIR-V/POCL); avoids pointerref/pointerset which are unsupported in SPIR-V
- Scalar fallback for N=1 or non-primitive element types

Adds @generated helpers (_vload_ptr, _vload_lptr, _vstore_ptr!, _vstore_lptr!, _vload_arr, _vstore_arr!) and comprehensive CPU test suite (147 tests, all pass).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
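
As an illustration of the scalar fallback idea, the sketch below unrolls N indexed reads inside a @generated function so the emitted code contains no closure. `scalar_vload` is a hypothetical name, not one of the helpers listed in the commit message, and the bounds behaviour here (ordinary checked indexing) is an assumption.

```julia
# Hypothetical fallback: build the expression
#   (arr[idx + 0], arr[idx + 1], ..., arr[idx + N - 1])
# at compile time, so the generated body is a flat tuple of scalar loads.
@generated function scalar_vload(arr::AbstractArray, idx::Integer, ::Val{N}) where {N}
    loads = [:(arr[idx + $(i - 1)]) for i in 1:N]
    return Expr(:tuple, loads...)
end

v = view(collect(10:10:100), 2:9)     # a non-Array AbstractArray
from_view  = scalar_vload(v, 2, Val(3))
from_range = scalar_vload(1:10, 4, Val(4))
```

Because the unrolling happens in the generator, the same method works for any indexable array type, at the cost of N separate accesses.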

…ore!

Base vload/vstore! now fall back to scalar indexing for all non-Ptr arrays. GPU backends inject the vectorized path by registering @device_override for their concrete device array type, following the KernelIntrinsics.jl pattern:
- src/intrinsics.jl: remove _vload_lptr/_vstore_lptr! and the LLVMPtr branch; base dispatch is now: Ptr{T} → wide load, everything else → scalar fallback
- src/pocl/backend.jl: add @device_override for KI.vload/KI.vstore! on CLDeviceArray, using reinterpret(LLVMPtr{NTuple{N,VecElement{T}},AS}, ptr) + unsafe_load/unsafe_store! to emit OpLoad <N x T> (SPIR-V/POCL)

CUDA.jl, AMDGPU.jl etc. should add analogous overrides for CuDeviceArray / ROCDeviceArray to get ld.global.v4.f32 / global_load_dwordx4.

147/147 CPU tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…le error

The AbstractArray fallback called pointer(arr, idx), which routes through unsafe_convert(Ptr{T}, arr). That path contains a runtime string-formatting error branch, which GPU compilers (CUDA, SPIR-V) reject as an unsupported dynamic function invocation.

Fix: split into two methods, Array (CPU Ptr path) and AbstractArray (scalar fallback). GPU @device_overrides catch their device array types before the AbstractArray fallback, so the Ptr path is never reachable in GPU IR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
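
The effect of the two-method split can be seen with plain multiple dispatch. In this sketch `vload_path` is a hypothetical stand-in that only reports which method would be selected; it performs no loads and is not part of the PR.

```julia
# Hypothetical stand-ins for the two methods described in the commit message.
vload_path(::Array)         = :wide_ptr         # concrete CPU Array: pointer path
vload_path(::AbstractArray) = :scalar_fallback  # everything else: indexed loads

dense  = zeros(Float32, 8)
window = view(dense, 1:4)   # SubArray, not an Array
lazy   = 1:8                # UnitRange, has no backing pointer

paths = (vload_path(dense), vload_path(window), vload_path(lazy))
```

Only the concrete `Array` method ever calls `pointer`, so array types that reach the `AbstractArray` method never compile the pointer conversion code.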

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

oneAPI (Intel GPU) throws "Float64 is not supported on this device" when allocating a oneArray{Float64}, so guard Float64 test cases behind KernelAbstractions.supports_float64(backend()).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov · 2026-06-30T15:08:56Z

Codecov Report

❌ Patch coverage is 60.41667% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.46%. Comparing base (a8022b2) to head (2543a3e).
⚠️ Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
src/intrinsics.jl	24.00%	19 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #719      +/-   ##
==========================================
- Coverage   62.51%   62.46%   -0.06%     
==========================================
  Files          23       23              
  Lines        1926     1974      +48     
==========================================
+ Hits         1204     1233      +29     
- Misses        722      741      +19


shreyas-omkar and others added 5 commits June 30, 2026 13:42

docs(intrinsics): tighten vload/vstore! docstrings and inline comments

95f990a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

shreyas-omkar wants to merge 5 commits into JuliaGPU:main from shreyas-omkar:sh/vload
