@iksaif iksaif commented Jan 7, 2026

Motivation

Enable deployment of a single zstd binary across heterogeneous ARM fleets with varying CPU capabilities. This is particularly important for cloud deployments where applications run across multiple instance types:

  • AWS Graviton 2 (Neoverse N1): Baseline ARM64, no SVE/SVE2
  • AWS Graviton 3 (Neoverse V1): 256-bit SVE, no SVE2
  • AWS Graviton 4 (Neoverse V2): 128-bit SVE2
  • GCP C4A (Axion, Neoverse V2): 128-bit SVE2

Currently, to leverage SVE2 optimizations, you must compile with -mcpu=neoverse-v2 or similar flags, which produces binaries that won't run on older processors. This forces users to either:

  1. Build multiple binaries for different targets
  2. Build for the lowest common denominator and lose performance
  3. Use separate container images for different instance types

This PR implements runtime CPU feature detection, similar to the existing BMI2 support on x86-64, allowing a single binary compiled for Neoverse N1 baseline (-mcpu=neoverse-n1) to automatically use SVE2 optimizations when available.

Changes

This PR adds runtime ARM SVE2 detection infrastructure:

Core Infrastructure

  • CPU feature detection (lib/common/cpu.h): Platform-specific detection via getauxval() on Linux/Android
  • Build macros (lib/common/portability_macros.h): DYNAMIC_SVE2 macro to enable runtime dispatch
  • Target attributes (lib/common/compiler.h): SVE2_TARGET_ATTRIBUTE for selective function compilation
  • Context initialization (lib/compress/zstd_compress.c): Detect SVE2 once per compression context
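To illustrate the detection step, here is a minimal sketch of a getauxval()-based probe. It is not the code from this PR: the helper name example_cpu_has_sve2 and the EXAMPLE_HAVE_AUXV macro are hypothetical, and the fallback value for HWCAP2_SVE2 assumes the bit position used by the Linux arm64 UAPI headers. On any platform other than Linux/Android aarch64 the sketch simply reports "no SVE2".

```c
#if defined(__aarch64__) && (defined(__linux__) || defined(__ANDROID__))
#  include <sys/auxv.h>            /* getauxval(), AT_HWCAP2 */
#  ifndef HWCAP2_SVE2
#    define HWCAP2_SVE2 (1UL << 1) /* assumed fallback for older libc headers */
#  endif
#  define EXAMPLE_HAVE_AUXV 1
#else
#  define EXAMPLE_HAVE_AUXV 0
#endif

/* Hypothetical helper: returns 1 if the kernel reports SVE2 support,
 * 0 otherwise (including on platforms where no probe is available). */
static int example_cpu_has_sve2(void)
{
#if EXAMPLE_HAVE_AUXV
    return (getauxval(AT_HWCAP2) & HWCAP2_SVE2) != 0;
#else
    return 0;
#endif
}
```

Because the probe reads a value the kernel prepared at process start, it needs no privileged instructions and returns the same answer on every call, which is what makes caching the result per context safe.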

Platform Support

  • Linux/Android aarch64: Full runtime detection via getauxval()
  • Apple platforms: Explicitly disabled (Apple Silicon doesn't support SVE/SVE2)
  • Windows on ARM: Placeholder for future support

Recommended Flags

Build used for benchmarking on Graviton 4:

# Compile for Neoverse N1 baseline with runtime SVE2 detection
make CC=gcc-15 CFLAGS="-O3 -mcpu=neoverse-n1"

# Binary automatically uses SVE2 on Graviton 4, falls back to baseline on Graviton 2/3

Overhead

Negligible overhead on non-SVE2 systems:

  • CPU detection happens once per compression context initialization
  • No runtime checks in hot paths when SVE2 is unavailable
  • Binary size increase is minimal
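The "detect once, dispatch later" pattern described above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: ExampleCtx, example_ctx_init, and the example_hist_* functions are hypothetical names, and the "SVE2" variant is a scalar stand-in so the sketch compiles on any target. In a real build that variant would carry SVE2_TARGET_ATTRIBUTE and use SVE2 code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical context: only the field relevant to dispatch is shown. */
typedef struct {
    int sve2;   /* 1 if SVE2 was detected at init time, else 0 */
} ExampleCtx;

/* Called once per context; `detected` would come from the CPU probe. */
static void example_ctx_init(ExampleCtx* ctx, int detected)
{
    ctx->sve2 = (detected != 0);
}

/* Portable byte histogram; adds to the existing counts. */
static void example_hist_scalar(unsigned count[256],
                                const unsigned char* src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) count[src[i]]++;
}

/* Stand-in for the SVE2 variant: same result, scalar body.
 * A real variant would be compiled with an SVE2 target attribute. */
static void example_hist_sve2(unsigned count[256],
                              const unsigned char* src, size_t n)
{
    example_hist_scalar(count, src, n);
}

/* Dispatcher: one branch per call on the cached flag, none per byte. */
static void example_hist(const ExampleCtx* ctx, unsigned count[256],
                         const unsigned char* src, size_t n)
{
    memset(count, 0, 256 * sizeof(count[0]));
    if (ctx->sve2) example_hist_sve2(count, src, n);
    else           example_hist_scalar(count, src, n);
}
```

With this shape, the only cost on a non-SVE2 machine is the flag test at the top of each call; the inner loop is identical to a build without dynamic dispatch.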

Related

This follows the same pattern as the existing x86-64 BMI2 runtime detection, extending it to ARM architectures.

Implement runtime detection of ARM SVE and SVE2 CPU capabilities,
similar to the existing BMI2 runtime detection for x86-64.

Changes:
- Add ARM CPU feature detection in lib/common/cpu.h using platform-specific
  APIs (getauxval on Linux/Android, disabled on macOS/Windows)
- Add DYNAMIC_SVE and DYNAMIC_SVE2 macros in portability_macros.h
- Add SVE2_TARGET_ATTRIBUTE for selective function compilation
- Add sve2 field to compression context (ZSTD_CCtx)
- Update histogram functions to support dynamic SVE2 dispatch
- Explicitly disable SVE/SVE2 on Apple platforms (not supported)

Platform support:
- Linux/Android aarch64: Full runtime detection via getauxval()
- Apple platforms: Disabled (Apple Silicon doesn't support SVE/SVE2)
- Windows on ARM: Placeholder (API not yet available)

Benefits:
- Enables SVE2 optimizations on capable hardware without requiring
  build-time flags
- Zero overhead on non-SVE2 systems
- Expected 2-3x speedup in histogram counting on SVE2-capable CPUs
  (e.g., AWS Graviton4)

Note: Currently only SVE2 optimizations exist. CPUs with SVE but not
SVE2 (e.g., Fujitsu A64FX) could benefit from future SVE-only
implementations.
@meta-cla meta-cla bot added the CLA Signed label Jan 7, 2026