Merged
138 changes: 138 additions & 0 deletions docs/issues/native-macos-accelerate-simd.md
@@ -0,0 +1,138 @@
# Native macOS SIMD acceleration via Apple Accelerate framework

## Problem

The `skainet-backend-cpu` module on Kotlin/Native macOS (macosArm64) uses plain scalar loops
for all tensor operations (`DefaultCpuOps`). On JVM, the same module uses the JDK Vector API
for SIMD-accelerated matmul, elementwise ops, and reductions (`DefaultCpuOpsJvm`), which gives
a significant performance advantage.

When running LLM inference benchmarks via the `llm-performance` native binary, the CPU backend
is 5-10x slower than it needs to be because every matmul is a triple-nested scalar loop
(`DefaultCpuOps.kt:264-272`).
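
For orientation, a scalar matmul of this kind has roughly the following shape. This is a minimal sketch, not the code at `DefaultCpuOps.kt:264-272`: the function name `scalarMatmul` and the flat row-major `FloatArray` layout are assumptions made for illustration.

```kotlin
// Hypothetical stand-in for the scalar matmul path; not the actual DefaultCpuOps code.
// Computes C (m x n) = A (m x k) * B (k x n), all stored row-major in flat arrays.
fun scalarMatmul(a: FloatArray, b: FloatArray, m: Int, k: Int, n: Int): FloatArray {
    require(a.size == m * k) { "a must hold m*k elements" }
    require(b.size == k * n) { "b must hold k*n elements" }
    val c = FloatArray(m * n)
    for (i in 0 until m) {          // rows of A
        for (j in 0 until n) {      // columns of B
            var acc = 0f
            for (p in 0 until k) {  // shared dimension
                acc += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = acc
        }
    }
    return c
}
```

Note that the innermost loop walks `b` with a stride of `n`, so besides using no SIMD it is also unfriendly to the cache for large matrices.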

## Proposed solution

Add an Accelerate-backed `TensorOps` implementation for the macOS native target, mirroring
how the JVM target has `DefaultCpuOpsJvm`. Apple's Accelerate framework provides
hardware-optimized BLAS and vector DSP routines that leverage ARM NEON and AMX under the hood.

### Architecture

```
PlatformCpuOpsFactory
├── jvmMain → DefaultCpuOpsJvm (Vector API + optional BLAS) ← exists
├── nativeMain → DefaultCpuOps (scalar fallback) ← exists
├── macosMain → AccelerateCpuOps (Accelerate framework via cinterop) ← NEW
└── linuxMain → DefaultCpuOps (scalar, or OpenBLAS in future) ← unchanged
```

### Key changes

**1. Cinterop definition** — `src/nativeInterop/cinterop/accelerate.def`

```def
package = platform.accelerate
language = C
headers = Accelerate/Accelerate.h
compilerOpts = -framework Accelerate
linkerOpts = -framework Accelerate
```

**2. New class** — `src/macosMain/kotlin/.../AccelerateCpuOps.kt`

Extends `DefaultCpuOps` and overrides hot-path operations with Accelerate calls:

| Priority | Operation | Accelerate function | Impact |
|----------|-----------|---------------------|--------|
| P0 | `matmul` | `cblas_sgemm` | Dominant cost in LLM inference (~90% of forward pass) |
| P1 | `add` | `vDSP_vadd` | Elementwise add (residual connections) |
| P1 | `multiply` | `vDSP_vmul` | Elementwise multiply (gates, scaling) |
| P1 | `subtract` | `vDSP_vsub` | Elementwise subtract |
| P1 | `divide` | `vDSP_vdiv` | Elementwise divide |
| P2 | `sum` (global) | `vDSP_sve` | Reduction for normalization |
| P2 | `mean` (global) | `vDSP_meanv` | Reduction for normalization |
| P2 | `softmax` | `vvexpf` (vForce) + `vDSP_sve` / `vDSP_vsdiv` | Attention weights |
| P3 | `relu` | `vDSP_vthres` / `vDSP_vthr` | Activation function |
| P3 | `silu` | manual vectorized loop | Activation function (SiLU = x * sigmoid(x)) |
| P3 | `transpose` | `vDSP_mtrans` | Matrix transpose |
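
The P0 row could be sketched as follows. This is an illustration under stated assumptions rather than the final override: it imports from Kotlin/Native's bundled `platform.Accelerate` bindings (with the custom `.def` file proposed above the package would be `platform.accelerate`), it assumes `CblasRowMajor` and `CblasNoTrans` are exposed as top-level constants, and the helper name `sgemmRowMajor` is hypothetical.

```kotlin
@file:OptIn(ExperimentalForeignApi::class)

import kotlinx.cinterop.ExperimentalForeignApi
import kotlinx.cinterop.addressOf
import kotlinx.cinterop.usePinned
// Assumption: bundled Kotlin/Native bindings; adjust the package for a custom .def file.
import platform.Accelerate.CblasNoTrans
import platform.Accelerate.CblasRowMajor
import platform.Accelerate.cblas_sgemm

// Hypothetical helper: C (m x n) = A (m x k) * B (k x n), row-major flat arrays.
fun sgemmRowMajor(a: FloatArray, b: FloatArray, m: Int, k: Int, n: Int): FloatArray {
    require(a.size == m * k && b.size == k * n) { "shape mismatch" }
    val c = FloatArray(m * n)
    // addressOf(0) is invalid on empty arrays; an empty product is all zeros anyway.
    if (c.isEmpty() || k == 0) return c
    a.usePinned { pa ->
        b.usePinned { pb ->
            c.usePinned { pc ->
                cblas_sgemm(
                    CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0f,                 // alpha
                    pa.addressOf(0), k,   // A and its leading dimension (columns of A)
                    pb.addressOf(0), n,   // B and its leading dimension (columns of B)
                    0.0f,                 // beta: overwrite C instead of accumulating
                    pc.addressOf(0), n    // C and its leading dimension (columns of C)
                )
            }
        }
    }
    return c
}
```

Pinning keeps the garbage collector from moving the arrays while native code holds their addresses. In row-major order the leading dimension of each matrix is its column count, which is why `k`, `n`, `n` appear after the three pointers.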

**3. Platform factory** — update `PlatformCpuOpsFactory` for macOS

```kotlin
// src/macosMain/kotlin/.../PlatformCpuOpsFactory.macos.kt
internal actual fun platformDefaultCpuOpsFactory(): (TensorDataFactory) -> TensorOps {
    println("[SKaiNET] Using Accelerate-backed CPU operations (ARM NEON + AMX)")
    return { factory -> AccelerateCpuOps(factory) }
}
```

This requires splitting the current `nativeMain` expect/actual into separate
`macosMain` and `linuxMain` actuals (the `macosMain` source set already exists in
`build.gradle.kts`).

**4. Build changes** — `build.gradle.kts`

Add cinterop configuration for macosArm64 (and optionally iosArm64/iosSimulatorArm64):

```kotlin
macosArm64 {
    compilations["main"].cinterops {
        val accelerate by creating {
            defFile("src/nativeInterop/cinterop/accelerate.def")
        }
    }
}
```

Add linker opts for the Accelerate framework to all macOS/iOS binaries.

### Implementation notes

- `AccelerateCpuOps` should extend `DefaultCpuOps` and override only the operations above.
Non-accelerated operations fall through to the scalar implementation.
- The `matmul` override should handle 2D FP32 tensors with `cblas_sgemm` and delegate
batched/non-float cases to `super.matmul()`.
- `vDSP_*` functions take raw `float` pointers plus an element stride, so they work best on
contiguous buffers. Tensors backed by `FloatArrayTensorData` can pin their `FloatArray` and pass it without a copy; others need a `toFloatArray()` copy.
- Broadcasting logic (e.g., bias add, scalar multiply) should remain in the Kotlin layer
and only dispatch the contiguous inner loop to Accelerate.
- The same approach works for iOS targets (`iosArm64`, `iosSimulatorArm64`) since
Accelerate is available on all Apple platforms.
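
The broadcasting note above can be sketched in plain Kotlin. The names `AddKernel`, `scalarAdd`, and `addBias` are hypothetical and do not come from the codebase. The point is only the split of responsibilities: the Kotlin layer walks the broadcast dimension and hands each contiguous row to a kernel, which on macOS would wrap `vDSP_vadd`.

```kotlin
// Hypothetical kernel contract: out[outOff + i] = x[xOff + i] + y[yOff + i] for i in 0 until n.
// On macOS an Accelerate-backed kernel would go here; a scalar loop stands in for it.
typealias AddKernel = (
    x: FloatArray, xOff: Int,
    y: FloatArray, yOff: Int,
    out: FloatArray, outOff: Int,
    n: Int
) -> Unit

val scalarAdd: AddKernel = { x, xOff, y, yOff, out, outOff, n ->
    for (i in 0 until n) out[outOff + i] = x[xOff + i] + y[yOff + i]
}

// Broadcasts a bias of length `cols` across every row of a row-major (rows x cols) matrix.
fun addBias(
    matrix: FloatArray,
    rows: Int,
    cols: Int,
    bias: FloatArray,
    kernel: AddKernel = scalarAdd
): FloatArray {
    require(matrix.size == rows * cols) { "matrix must hold rows*cols elements" }
    require(bias.size == cols) { "bias must hold cols elements" }
    val out = FloatArray(matrix.size)
    for (r in 0 until rows) {
        // Broadcasting stays in Kotlin; only the contiguous row goes to the kernel.
        kernel(matrix, r * cols, bias, 0, out, r * cols, cols)
    }
    return out
}
```

Keeping the kernel pluggable also lets the same broadcasting code run against the scalar fallback in tests.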

### Testing

- Existing `DefaultCpuOps` tests in `commonTest` should pass unchanged (numerical equivalence).
- Add macOS-specific tests verifying Accelerate dispatch actually occurs (e.g., check log output
or add a query method).
- Benchmark comparison: run `llm-performance` native benchmark with the current scalar backend
vs Accelerate backend on the same model.
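
One caveat on numerical equivalence: an optimized BLAS may accumulate products in a different order than the scalar loop, so results can differ in the last few bits. Whether the existing tests already compare with a tolerance is not stated here. If they do not, a helper along these lines could be used. The name `allClose` and the default tolerances are assumptions, not project conventions.

```kotlin
import kotlin.math.abs
import kotlin.math.max

// Hypothetical test helper: true when every element pair agrees within a
// mixed absolute/relative tolerance. NaN matches only NaN.
fun allClose(
    expected: FloatArray,
    actual: FloatArray,
    relTol: Float = 1e-4f,
    absTol: Float = 1e-6f
): Boolean {
    if (expected.size != actual.size) return false
    for (i in expected.indices) {
        val e = expected[i]
        val a = actual[i]
        if (e.isNaN() || a.isNaN()) {
            if (e.isNaN() != a.isNaN()) return false
            continue
        }
        if (e == a) continue                                // also covers equal infinities
        if (e.isInfinite() || a.isInfinite()) return false  // unequal, and one side is infinite
        val tol = absTol + relTol * max(abs(e), abs(a))
        if (abs(e - a) > tol) return false
    }
    return true
}
```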

### Expected impact

Based on JVM BLAS vs scalar measurements and Apple's published Accelerate performance data:

- **matmul**: 10-50x speedup (NEON + AMX vs scalar loop)
- **elementwise**: 4-8x speedup (NEON vectorization)
- **reductions**: 4-8x speedup (NEON vectorization)
- **overall LLM inference**: 5-20x speedup on native macOS CPU backend
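
As a rough sanity check on how the per-operation figures combine, Amdahl's law can be applied to the ~90% matmul share quoted in the table above. The function below is an illustrative sketch: the name `overallSpeedup` is hypothetical, and the runtime split is the assumption, not a measurement.

```kotlin
// Amdahl's law for several runtime fractions, each with its own speedup.
// `parts` pairs (fraction of original runtime, speedup of that fraction);
// the fractions must sum to 1.
fun overallSpeedup(parts: List<Pair<Double, Double>>): Double {
    val total = parts.sumOf { it.first }
    require(kotlin.math.abs(total - 1.0) < 1e-9) { "fractions must sum to 1" }
    require(parts.all { it.second > 0.0 }) { "speedups must be positive" }
    // New runtime is the sum of each fraction divided by its speedup.
    return 1.0 / parts.sumOf { (fraction, speedup) -> fraction / speedup }
}
```

For example, `overallSpeedup(listOf(0.9 to 10.0, 0.1 to 1.0))` models a 10x matmul with the remaining 10% left scalar and evaluates to about 5.3x; with a 50x matmul it is about 8.5x. If the remaining 10% is also accelerated 4-8x, the same formula gives roughly 8.7x to 33x. The non-matmul share therefore caps the overall gain unless it is accelerated as well.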

### Files to create/modify

```
skainet-backends/skainet-backend-cpu/
├── build.gradle.kts # add cinterop
├── src/nativeInterop/cinterop/accelerate.def # NEW
├── src/macosMain/kotlin/.../AccelerateCpuOps.kt # NEW
├── src/macosMain/kotlin/.../PlatformCpuOpsFactory.macos.kt # NEW
├── src/linuxMain/kotlin/.../PlatformCpuOpsFactory.linux.kt # NEW (move from nativeMain)
└── src/nativeMain/kotlin/.../PlatformCpuOpsFactory.native.kt # REMOVE (split to platform-specific)
```

### References

- JVM SIMD implementation: `src/jvmMain/kotlin/.../DefaultCpuOpsJvm.kt`
- JVM BLAS integration: `src/jvmMain/kotlin/.../JvmBlas.kt`
- Apple Accelerate docs: https://developer.apple.com/documentation/accelerate
- CBLAS reference: https://developer.apple.com/documentation/accelerate/blas
- vDSP reference: https://developer.apple.com/documentation/accelerate/vdsp
8 changes: 6 additions & 2 deletions skainet-backends/skainet-backend-cpu/build.gradle.kts
@@ -70,16 +70,20 @@ kotlin {
                 dependsOn(commonMain)
             }

+            val appleMain by creating {
+                dependsOn(nativeMain)
+            }
+
             val linuxMain by creating {
                 dependsOn(nativeMain)
             }

             val iosMain by creating {
-                dependsOn(nativeMain)
+                dependsOn(appleMain)
             }

             val macosMain by creating {
-                dependsOn(nativeMain)
+                dependsOn(appleMain)
             }

             val iosArm64Main by getting {