On macOS arm64 (Apple Silicon), JIT-compiled OCCA kernels are called with garbage register state, producing silently wrong output (and segfaults in larger callers).
occa::sys::runFunction dispatches kernels through a function pointer typed as variadic:
https://github.com/libocca/occa/blob/main/src/occa/internal/utils/sys.hpp#L15
typedef void (*functionPtr_t)(...);
The generated runFunction.cpp_codegen calls f directly:
case 4:
f(args[0], args[1], args[2], args[3]);
break;
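Apple's arm64 ABI passes all variadic arguments on the stack, whereas the standard AArch64 PCS used on Linux passes the first 8 integer or pointer arguments in registers x0-x7 even for variadic calls. The JIT-compiled OCCA kernels are ordinary non-variadic functions and expect their arguments in x0-x7. So on Apple Silicon the caller writes the arguments to the stack, and the kernel reads them from registers that were never set and runs on uninitialized values.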
Apple's docs on the difference.
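One possible way around the mismatch, sketched here under assumptions rather than taken from OCCA itself, is to cast the variadic pointer back to a fixed-arity type that matches the callee before making the call, so the compiler uses the non-variadic convention. In the sketch below, `kernel4_t`, `fakeAddVectors`, `callKernel4`, and `countCorrect` are hypothetical names for illustration only; `functionPtr_t` mirrors the typedef quoted above.

```cpp
#include <cassert>

// Same shape as the variadic typedef quoted above.
typedef void (*functionPtr_t)(...);

// Hypothetical fixed-arity pointer type for a four-argument kernel.
typedef void (*kernel4_t)(void *, void *, void *, void *);

// Hypothetical stand-in for a JIT-compiled kernel: an ordinary,
// non-variadic function that expects its arguments in registers.
extern "C" void fakeAddVectors(void *entries, void *a, void *b, void *ab) {
  const int n = *static_cast<const int *>(entries);
  const float *pa = static_cast<const float *>(a);
  const float *pb = static_cast<const float *>(b);
  float *pab = static_cast<float *>(ab);
  for (int i = 0; i < n; ++i) {
    pab[i] = pa[i] + pb[i];
  }
}

// Cast back to the callee's real type before calling, so the call site
// is compiled as a non-variadic call on every ABI.
void callKernel4(functionPtr_t f, void **args) {
  assert(f != nullptr);
  kernel4_t typed = reinterpret_cast<kernel4_t>(f);
  typed(args[0], args[1], args[2], args[3]);
}

// Runs the stand-in kernel on 8 entries of 1.0f + 2.0f and returns how
// many output elements equal 3.0f.
int countCorrect() {
  int entries = 8;
  float a[8], b[8], ab[8];
  for (int i = 0; i < 8; ++i) {
    a[i] = 1.0f;
    b[i] = 2.0f;
    ab[i] = 0.0f;
  }
  void *args[4] = {&entries, a, b, ab};
  callKernel4(reinterpret_cast<functionPtr_t>(&fakeAddVectors), args);

  int correct = 0;
  for (int i = 0; i < 8; ++i) {
    if (ab[i] == 3.0f) {
      ++correct;
    }
  }
  return correct;
}
```

Calling a function through a pointer of its own type is well defined, so this pattern should behave the same on Linux and macOS. Whether it fits OCCA's generated dispatch code (one fixed-arity type per `case`) is left open here.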
Reproducer
repro.cpp:
#include <iostream>
#include <vector>
#include <occa.hpp>
static const char *kernel_source = R"OKL(
@kernel void addVectors(const int entries,
const float *a,
const float *b,
float *ab) {
for (int i = 0; i < entries; ++i; @tile(4, @outer, @inner)) {
ab[i] = a[i] + b[i];
}
}
)OKL";
int main() {
const int entries = 8;
std::vector<float> a(entries, 1.0f), b(entries, 2.0f), ab(entries, 0.0f);
occa::device device({{"mode", "Serial"}});
occa::memory o_a = device.malloc<float>(entries, a.data());
occa::memory o_b = device.malloc<float>(entries, b.data());
occa::memory o_ab = device.malloc<float>(entries);
occa::kernel addVectors = device.buildKernelFromString(kernel_source,
"addVectors");
addVectors(entries, o_a, o_b, o_ab);
o_ab.copyTo(ab.data());
for (int i = 0; i < entries; ++i) {
std::cout << ab[i] << ' ';
}
std::cout << '\n';
return 0;
}
Build:
c++ -std=c++17 repro.cpp -I${OCCA_HOME}/include -L${OCCA_HOME}/lib -locca \
-Wl,-rpath,${OCCA_HOME}/lib -o repro
Run on macOS arm64:
$ ./repro
0 0 0 0 0 0 0 0
The kernel ran, but entries was read from x0 (uninitialized, happened to be 0), so the loop body never executed. With a different register state at call time we have observed segfaults dereferencing garbage a/b/ab pointers.
Expected:
3 3 3 3 3 3 3 3
Tested on macOS 26 / Apple M-series, Apple clang 21, OCCA built with -O2 -g.