macOS arm64: JIT kernels run with garbage args because runFunction dispatches via variadic typedef #810

@slitvinov

On macOS arm64 (Apple Silicon), JIT-compiled OCCA kernels are called with garbage register state, which produces silently wrong output (and segfaults in larger callers).

occa::sys::runFunction dispatches kernels through a function pointer typed as variadic:

https://github.com/libocca/occa/blob/main/src/occa/internal/utils/sys.hpp#L15

typedef void (*functionPtr_t)(...);

The generated runFunction.cpp_codegen calls f directly:

case 4:
    f(args[0], args[1], args[2], args[3]);
    break;

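Apple's arm64 ABI passes all variadic arguments on the stack, whereas the standard AArch64 PCS used on Linux passes the first 8 in registers x0-x7 even for a variadic call. The JIT-compiled OCCA kernels are ordinary non-variadic functions, so they expect their arguments in x0-x7. On Apple Silicon the caller therefore writes the arguments to the stack, and the kernel reads them from registers the caller never set, so it runs on uninitialized values.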

Apple's documentation describes the difference (see "Writing ARM64 Code for Apple Platforms").

Reproducer

repro.cpp:

#include <iostream>
#include <vector>

#include <occa.hpp>

static const char *kernel_source = R"OKL(
@kernel void addVectors(const int entries,
                        const float *a,
                        const float *b,
                        float *ab) {
  for (int i = 0; i < entries; ++i; @tile(4, @outer, @inner)) {
    ab[i] = a[i] + b[i];
  }
}
)OKL";

int main() {
  const int entries = 8;

  std::vector<float> a(entries, 1.0f), b(entries, 2.0f), ab(entries, 0.0f);

  occa::device device({{"mode", "Serial"}});

  occa::memory o_a  = device.malloc<float>(entries, a.data());
  occa::memory o_b  = device.malloc<float>(entries, b.data());
  occa::memory o_ab = device.malloc<float>(entries);

  occa::kernel addVectors = device.buildKernelFromString(kernel_source,
                                                         "addVectors");

  addVectors(entries, o_a, o_b, o_ab);

  o_ab.copyTo(ab.data());
  for (int i = 0; i < entries; ++i) {
    std::cout << ab[i] << ' ';
  }
  std::cout << '\n';

  return 0;
}

Build:

c++ -std=c++17 repro.cpp -I${OCCA_HOME}/include -L${OCCA_HOME}/lib -locca \
    -Wl,-rpath,${OCCA_HOME}/lib -o repro

Run on macOS arm64:

$ ./repro
0 0 0 0 0 0 0 0

The kernel ran, but entries was read from x0 (uninitialized, happened to be 0), so the loop body never executed. With a different register state at call time we have observed segfaults dereferencing garbage a/b/ab pointers.

Expected:

3 3 3 3 3 3 3 3

Tested on macOS 26 / Apple M-series, Apple clang 21, OCCA built with -O2 -g.
