
Multi-target ports + unified bench infrastructure #3

Open
navado wants to merge 7 commits into facex-engine:main from navado:main

Conversation

@navado navado commented May 6, 2026

Multi-target ports + unified bench infrastructure

Adds first-class build paths for Apple Silicon, i.MX 8M Plus / 93 / 95
(NXP NPUs), and ESP32-P4 (Espressif RISC-V MCU), plus a cross-platform
benchmark tool and a unified coverage matrix. The default make is
unchanged for existing x86 users — every new path is opt-in.

Four commits, each self-contained and amending its own section in the
new docs/implementation.md:

83aeee7  Add ESP32-P4 ESP-IDF component + MIPI-CSI camera example
6bc5f99  Add i.MX NPU library (TFLite + Ethos-U65 / VxDelegate / XNNPACK)
f75fd64  Add Apple Silicon perf paths: Accelerate (AMX), SME (M4+), Core ML (ANE)
7afb4f7  Add unified benchmark tooling + AArch64 NEON foundation

What's added

1. Bench foundation (7afb4f7) — also makes make work on ARM hosts

  • tools/bench.c — cross-platform synthetic latency bench. md/csv/json
    output. Same source compiles on macOS arm64/x86, Linux aarch64, future
    i.MX targets.
  • tools/bench_camera_mac.swift + make bench-camera — live AVFoundation
    camera benchmark. --summary mode emits one CSV row in the same schema
    as facex-bench so the two can be merged into one table.
  • scripts/bench_all.sh — sweeps build-flag combinations, emits unified
    Markdown comparison.
  • scripts/test_all.sh — single-host test harness; topic commits amend
    with their own checks (51/51 PASS on M2).
  • AArch64 build foundation: Makefile arch detection via uname -m,
    src/threadpool_pthread.c (futex/WaitOnAddress aren't portable),
    a FACEX_NO_INT8 flag for the engine, hand-written NEON kernels for
    matmul_fp32_packed{,_bias,_bias_gelu} (output matches scalar within
    1 ULP), and a fix for the column-panel scalar fallbacks that were
    silently producing wrong output on every non-x86 host.

2. Apple Silicon perf paths (f75fd64) — opt-in, never default

| Build | Adds | Measured speedup |
| --- | --- | --- |
| make ACCELERATE=1 | cblas_sgemm via Accelerate.framework (AMX) | -22% embed / -13% e2e on M2 |
| make SME=1 | M4+ FMOPA outer-product kernel; auto-disabled on M1-M3 | n/a (compiles cleanly; not directly hardware-tested) |
| make COREML=1 | Obj-C bridge loading .mlpackage for Apple Neural Engine | n/a (compile + link verified) |
| make mac-universal | Cross-compiled fat arm64 + x86_64 archive | each slice contains real arch-specific SIMD |

Each opt-in dispatch is gated at compile time AND runtime via a
self-check (e.g. SME runs a tiny FMOPA-vs-scalar consistency test and
disables itself on output divergence). Critical: -march=armv9-a+sme
is applied per-file to transformer_ops_sme.c only — applying it
globally would let clang auto-vectorize plain C using SVE/SME and trap
on M1/M2/M3.

3. i.MX NPU library (6bc5f99) — separate libfacex_npu.{so,dylib}

| Make target | SoC | NPU | Delegate |
| --- | --- | --- | --- |
| make imx-npu | host (dev) | n/a | XNNPACK fallback (built into TFLite) |
| make imx93 SDK=… | i.MX 93 | Arm Ethos-U65 | libethosu_delegate.so |
| make imx95 SDK=… | i.MX 95 | NXP eIQ Neutron N3 | libneutron_delegate.so |
| make imx8mp SDK=… | i.MX 8M Plus | VeriSilicon VIP9000 | libvx_delegate.so |

src/backend_tflite.c is a TFLite C-API wrapper with a dlopen-based
delegate loader (vendor .so files aren't a hard dependency — it
auto-falls back to XNNPACK when they're missing). One source path,
three deployment targets. New public API in include/facex_npu.h
mirrors the facex.h shape.

Detector path is intentionally -ENOSYS. Anchor decode + NMS for
arbitrary YuNet/SCRFD topology is too fragile to ship blind. The
recommended deployment is the hybrid pipeline: CPU detect via
libfacex.a, NPU embed via libfacex_npu.so. ~80% of the perf
benefit, none of the post-processing risk. Documented in
docs/imx_npu.md.

Offline tooling: tools/onnx_to_tflite.py (PyTorch/ONNX → INT8 TFLite)
and tools/compile_vela.sh (Arm Vela for Ethos-U65).

4. ESP32-P4 ESP-IDF component (83aeee7)

components/facex/ (IDF wrapper) + examples/esp32p4_camera/ (full
runnable IDF project).

The MIPI-CSI capture path is complete and runnable — follows the
official ESP-IDF camera_driver recipe verbatim (LDO 2.5 V on the CSI
PHY, SCCB I2C, esp_cam_sensor_detect, esp_cam_new_csi_ctlr,
IRAM-safe callbacks, PSRAM frame buffer ring, capture task that
downscales RGB565 → RGB888 and calls the FaceX backend, requeues).
Logs FPS + per-detection latency once per second.

The face-detection backend is a Kconfig three-way choice:

  • stub (default) — synthetic deterministic face per frame for
    bring-up. Works on the EV-Board today.
  • native — links the existing C engine. Compiles, runs, but at
    1-3 s/frame the EdgeFace-XS model is too large for production on
    P4. Provided for evaluation only.
  • espnn — reserved Kconfig slot. Production target needs a
    distilled EdgeFace-Nano (~300 K params, 64×64 input, 256-d
    embedding, no XCA attention) plus an ESP-NN backend (PIE-SIMD INT8).
    Not in this PR — model + kernel work is its own milestone.

Honest scope: the camera + dispatch bridge ships now; production-fit
model is the follow-up. Once it exists, only the Kconfig backend
toggle changes; everything else in this commit stays.

Coverage at a glance (full table in docs/coverage_matrix.md)

| Configuration | Compiles | Runs end-to-end | Tested on |
| --- | --- | --- | --- |
| Default make (Apple Silicon arm64 / NEON) | ✅ | ✅ | M2, macOS |
| make ACCELERATE=1 (AMX) | ✅ | ✅ | M2 |
| make SME=1 (M4+ FMOPA) | ✅ | 🧪 inert on M2 (sysctl FEAT_SME=0); self-check guards M4 correctness | M2 (compile + isolation only) |
| make COREML=1 (ANE bridge) | ✅ | 🧪 missing-.mlpackage smoke passes; ANE dispatch needs ONNX export | M2 (compile + link + error path) |
| make mac-universal (fat archive) | ✅ | ✅ both slices contain arch-specific SIMD | M2 |
| make imx-npu / imx93 / imx95 / imx8mp | 🧪 (against TFLite header stubs) | 🚫 needs vendor SDK + EVK | mac-m2 (syntax only) |
| ESP-IDF component + camera example | 🧪 (against IDF header stubs) | 🚫 needs IDF + P4-Function-EV-Board | mac-m2 (syntax only) |
| Existing x86 / WASM paths | upstream | upstream | unchanged |

scripts/test_all.sh runs every check that's executable on the host:
51/51 PASS on M2. Topic commits each register their own checks so
the runner stays exhaustive as features land.

🧪 / 🚫 honesty markers: where I couldn't validate on hardware (M4,
i.MX EVK, P4 dev kit), the code follows the documented vendor APIs
(esp_cam_ctlr_*, TFLite C-API + delegate ABI, ACLE 2024 SME
intrinsics). Self-checks at runtime guard correctness for the SME
path. Bring-up checklists are documented in docs/{mac,imx_npu,esp32p4}.md.

Benchmark results (M2, n=100, scripts/bench_all.sh)

| Build | Active backends | Embed median | E2E median |
| --- | --- | --- | --- |
| default | NEON | 4.59 ms | 8.46 ms |
| ACCELERATE=1 | Accelerate(AMX) + NEON | 3.54 ms | 7.52 ms |
| SME=1 | NEON (SME inert on M2) | 4.57 ms | 8.63 ms |
| SME=1 ACCELERATE=1 | Accelerate(AMX) + NEON | 3.52 ms | 7.49 ms |

Same ||emb||² = 0.0756, same self-similarity 1.0000, same bbox
across all four — backend choice never changes the embedding beyond
ULP-level differences.

WASM (Node.js, node wasm/bench.js): 5.27 ms median, 190 fps.
Embedding output matches native byte-for-byte.

Reproduce locally

```sh
bash download_weights.sh                       # pulls data/edgeface_xs_fp32.bin
make                                           # default build (NEON on ARM, AVX2 on x86)
make test                                      # golden test
scripts/test_all.sh                            # full local sweep (51 checks)
scripts/bench_all.sh --iters 200 --warmup 50   # build-flavour comparison table

# Mac-only opt-ins:
make ACCELERATE=1 mac-test                     # AMX
make SME=1 mac-test                            # M4+ SME
make COREML=1                                  # ANE bridge
make mac-universal                             # fat arm64 + x86_64
```

Compatibility / risk

  • Default build is unchanged. x86 users see the same Makefile
    paths, same flags, same artifacts.
  • No new mandatory dependencies. libfacex.a still has zero
    external runtime deps (otool -L → libSystem only on macOS, libc +
    libpthread on Linux). NPU build is opt-in via
    FACEX_BACKEND_TFLITE; Mac perf paths are opt-in via per-flag
    build vars.
  • No API breaks. Existing public C API is untouched. New surfaces
    are additive: include/facex_npu.h, include/facex_coreml.h,
    include/facex_backend.h.
  • examples/example.c gets a 1-line update for the existing 3-arg
    facex_init signature — it was orphaned upstream relative to the
    current API.

What's NOT in this PR

  • EdgeFace-Nano distilled model (sized for ESP32-P4)
  • ESP-NN-based FaceX backend (production path for P4)
  • Hardware-validated runs on i.MX 8M Plus / 93 / 95 EVKs
  • Hardware-validated SME runs on M4 / M5
  • Core ML .mlpackage artefact (needs an EdgeFace ONNX export not in
    this repo)
  • WASM rebuild from source (uses pre-built wasm/*.wasm; no make wasm target wired)

Each of these is documented in docs/implementation.md with the
unblock path for whoever picks it up next.

Files at a glance

44 files changed, 6457 insertions(+), 30 deletions(-)
docs/
  benchmarking.md          how to bench (which-tool-when matrix)
  coverage_matrix.md       compile/static/e2e per (target × backend × flag)
  esp32p4.md               ESP32-P4 build, deploy, troubleshooting
  imx_npu.md               i.MX NPU deployment + bring-up checklist
  implementation.md        per-topic implementation snapshot (this doc)
  mac.md                   Apple Silicon build modes + perf reference

src/
  backend_accelerate.c     AMX cblas wrapper                       [Mac]
  backend_coreml.m         ARC Obj-C bridge for ANE / .mlpackage   [Mac]
  backend_tflite.c         TFLite + delegate loader                [i.MX]
  cpu_features.{h,c}       sysctl-based runtime probe              [Mac]
  threadpool_pthread.c     pthread+condvar pool                    [Bench]
  transformer_ops_sme.c    M4+ FMOPA matmul                        [Mac]

include/
  facex_backend.h          pluggable backend vtable                [i.MX]
  facex_coreml.h           Core ML public C API                    [Mac]
  facex_npu.h              NPU public C API                        [i.MX]

tools/
  bench.c                  cross-platform synthetic bench          [Bench]
  bench_camera_mac.swift   AVFoundation camera bench               [Bench]
  build_bench_camera_mac.sh swiftc invoker                         [Bench]
  compile_vela.sh          Vela compiler wrapper                   [i.MX]
  export_coreml.py         ONNX → .mlpackage                       [Mac]
  onnx_to_tflite.py        ONNX → INT8 .tflite                     [i.MX]

scripts/
  bench_all.sh             build-flavour sweep → comparison table  [Bench]
  test_all.sh              local test harness (51 checks)          [Bench]

components/facex/          ESP-IDF component                       [ESP32]
examples/esp32p4_camera/   Runnable IDF MIPI-CSI camera example    [ESP32]
tests/test_mac.c           Mac smoke + latency stats               [Mac]
tests/test_imx_npu_compile.c  NPU API smoke                        [i.MX]

Bisectability

Every commit builds clean on M2 (make succeeds, scripts/test_all.sh
passes its own commit's checks). The four are also intentionally
decoupled: Mac, i.MX, and ESP32 don't depend on each other; they all
sit on the Bench foundation.

navado and others added 7 commits May 6, 2026 16:19

7afb4f7  Add unified benchmark tooling + AArch64 NEON foundation

Introduces a single source of truth for FaceX latency numbers across
build flavours, OSes, and stages — replaces scattered ad-hoc benches
each with its own format. Also lays the AArch64 build foundation so
`make` succeeds on Apple Silicon and Linux ARM hosts.

Benchmark tooling
-----------------
* tools/bench.c — cross-platform synthetic latency bench. Same source
  compiles on macOS arm64/x86, Linux aarch64, future i.MX targets.
  Three output formats (md/csv/json) emitting the same data; reports
  compiled-in vs runtime-active backends. Stages: `embed`, `e2e`,
  `both`.
* tools/bench_camera_mac.swift — live AVFoundation camera bench (Mac-
  only role). `--summary` mode emits one CSV row at exit using the
  same schema as facex-bench, so live-camera and engine numbers can
  join the unified table.
* tools/build_bench_camera_mac.sh — swiftc invocation + bridging
  header. Auto-detects optional libfacex symbols (Accelerate /
  Core ML) and links matching frameworks so a stale lib doesn't
  silently break the build.
* scripts/bench_all.sh — sweeps build-flag combos, runs facex-bench
  against each, emits unified Markdown / CSV.
* scripts/test_all.sh — single-host test harness. Topic commits
  amend with their own checks.
* docs/benchmarking.md — which-tool-answers-which-question matrix,
  CSV schema reference, recipe for combining engine + camera output.

AArch64 / Mac build foundation
------------------------------
Without this, `make` on Apple Silicon or Linux aarch64 produced a
broken binary (silently wrong output via the column-panel scalar
fallback in transformer_ops.c).

* Makefile arch detection via `uname -m`. arm64 path links
  src/gemm_stub.c (the existing INT8 GEMM is x86-only) and
  src/threadpool_pthread.c (linux/futex / win/WaitOnAddress aren't
  portable). Defines FACEX_NO_INT8 so the engine takes the FP32-
  packed path.
* src/threadpool_pthread.c — pthread + condvar pool (~80 LOC).
* src/edgeface_engine.c — gates the INT8 weight-packing block on
  !FACEX_NO_INT8 so mm->packed stays NULL on ARM and the matmul
  dispatch falls cleanly through to FP32.
* src/transformer_ops.c — fixes column-panel scalar fallbacks for
  matmul_fp32_packed{,_bias,_bias_gelu} that previously fed packed B
  into matmul_fp32 (wrong layout, garbage output on every non-x86
  host). Adds hand-written AArch64 NEON kernels (NR=8 MR=4 FMA-
  based; same packed format as AVX2). Output byte-equivalent to
  scalar within ULP. NEON is portable AArch64, helps i.MX too — not
  Mac-specific.

Documentation
-------------
* docs/implementation.md — new file replacing the forward-looking
  docs/plan/embedded_port_plan.md. This is the implementation-
  details document; each topic commit (this one, Mac, i.MX, ESP32)
  amends a new section.
* docs/coverage_matrix.md — initial table (CPU library, bench
  infrastructure). Subsequent topic commits append rows.
* CLAUDE.md — repo conventions, build/test commands, architecture
  summary. Touches Bench / Mac / i.MX / ESP32 surfaces incidentally;
  topic commits amend their own sections.

Verification on mac-m2
----------------------
* `make` builds clean (`Built libfacex.a (arm64)`).
* `make test` golden test passes (`||emb||² = 0.076`, sim 1.000).
* `make bench && ./facex-bench` produces md/csv/json output;
  ~4.6 ms median embed, ~8.4 ms e2e (NEON FP32 packed).
* `make bench-camera && ./facex-camera-bench` 29 fps end-to-end.
* `scripts/test_all.sh` all checks PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three are opt-in build flags, not the default. Default `make`
still produces the same portable NEON-only artifact for distribution
to any Mac. Each flag composes cleanly with the others; the
dispatcher in `matmul_fp32_packed` chains them
Accelerate → SME → NEON.

Apple Accelerate / AMX  (`make ACCELERATE=1`)
---------------------------------------------
* src/backend_accelerate.c — wraps cblas_sgemm. The wrapper unpacks
  the column-panel B back to row-major, dispatches via Accelerate;
  AMX wins for M ≥ 4 and M*K*N ≥ 4096, otherwise returns -1 so the
  in-tree NEON kernel handles the warmup-dominated calls.
* Self-check on first matmul: 4×16×8 cblas vs scalar reference,
  1e-4 relative tolerance. Mismatch → facex_disable_accelerate()
  and the rest of the process stays on NEON.
* Measured on M2: 4.6 → 3.5 ms / embed (-22%), 9 → 7.5 ms e2e (-13%).
* Embedding bytes identical to NEON within ULP.

SME / SME2  (`make SME=1`)  — Apple M4 and newer
------------------------------------------------
* src/transformer_ops_sme.c — __arm_locally_streaming __arm_new("za")
  matmul_fp32_packed using FMOPA outer products. Pre-transposes the
  row tile of A into a [K, SVL] scratch (gather not allowed in
  streaming mode); zeroes ZA tile 0; accumulates K outer products;
  reads the rows back with svread_hor_za32_f32_m. NR=8 to match
  the existing FP32 packed format.
* src/cpu_features.{h,c} — sysctl-based runtime probe for FEAT_SME /
  FEAT_SME2; cached, atomic, lock-free, no external deps. Designed
  to host future runtime probes (FP16 / BF16 / dotprod) too.
* Build isolation: -march=armv9-a+sme is applied PER-FILE
  (transformer_ops_sme.c only). Without that, clang auto-vectorizes
  plain C in transformer_ops.c using SVE/SME instructions that trap
  on M1/M2/M3. Verified post-fix: transformer_ops.o has zero
  rdvl/smstart/fmopa; transformer_ops_sme.o has the expected
  fmopa za0.s.
* Self-check: tiny 4×8 SME-vs-scalar consistency test on first
  matmul. Mismatch → facex_disable_sme() and stay on NEON.
* Hardware status: COMPILES + emits real SME asm; NOT directly
  hardware-tested (no M4 here). Self-check guards correctness on
  M4 — owners get NEON speed if SME has a bug, never wrong output.

Core ML / Apple Neural Engine  (`make COREML=1`)
------------------------------------------------
* src/backend_coreml.m — ARC-managed Obj-C bridge that loads a
  precompiled `.mlpackage` (auto-compiles to .mlmodelc on first
  call) and dispatches MLModel prediction. Runtime compute_units
  hint: ALL / CPU+GPU / CPU-only / CPU+ANE.
* include/facex_coreml.h — public C API: facex_coreml_init,
  facex_coreml_embed, facex_coreml_last_dispatch, facex_coreml_free.
* Output L2-normalized so cosine similarity stays comparable to the
  CPU backend regardless of how the .mlpackage was produced.
* Graceful failure: missing .mlpackage returns NULL with a clear
  stderr message, no crash. (Validated by scripts/test_all.sh.)
* tools/export_coreml.py — ONNX → .mlpackage via
  coremltools.convert(convert_to="mlprogram") with optional INT8
  palettization (default 6 bits/weight via kmeans, drops package
  size to ~1.8 MB and unlocks ANE INT8 dispatch).
* Hardware status: COMPILE-TESTED. Runtime ANE dispatch is not
  end-to-end validated — that requires running export_coreml.py
  against an actual EdgeFace ONNX export.

Universal Mac binary  (`make mac-universal`)
--------------------------------------------
* Cross-compiles arm64 + x86_64 slices with target-specific flags,
  stashes each in /tmp (the in-Makefile clean target wipes its own
  artifacts), then `lipo`s them into libfacex-universal.a.
* Verified: arm64 slice has 293 NEON insts (fmla/fmul/fadd) in
  transformer_ops; x86_64 slice has 786 AVX2 insts (vfmadd/vmovups).
  Real arch-specific code in both halves.

Smoke test  (`make mac-test` → tests/test_mac.c)
------------------------------------------------
* Loads weights, embed sanity, determinism, self/cross similarity,
  latency stats (min/median/p99 over 50 iters), end-to-end
  detect+align+embed on tests/test_face_160.raw.
* Now reports both COMPILED-IN and RUNTIME-ACTIVE backends so the
  same binary tells you what will actually dispatch:
      Backends compiled in: Accelerate SME NEON
      Backends active at runtime: Accelerate(AMX) NEON
  (correctly shows SME compiled but inert on M2 — sysctl FEAT_SME=0).

Documentation
-------------
* docs/mac.md — full Mac story: build modes, runtime fallback chain,
  permissions, perf reference table, troubleshooting.
* docs/implementation.md — appended "Apple Silicon / Mac perf paths"
  section.
* docs/coverage_matrix.md — appended Mac rows (mac-test, ACCELERATE,
  SME, COREML, mac-universal, export_coreml.py).
* scripts/test_all.sh — appended the entire Mac variants block:
  builds with each flag combo, validates symbol presence, links the
  expected frameworks, runs mac-test, checks fmopa-in / rdvl-out
  isolation, lipo per-slice instruction-count probes.
* CLAUDE.md — make-target list extended with Mac options; new
  bullet documenting the opt-in flag policy.
* README.md — short Mac section + link to docs/mac.md.
* examples/example.c — orphaned 1-line API fix
  (facex_init went from 2 to 3 args repo-side; example didn't follow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

6bc5f99  Add i.MX NPU library (TFLite + Ethos-U65 / VxDelegate / XNNPACK)

A second build of FaceX (libfacex_npu.{so,dylib}) that dispatches
inference through the TensorFlow Lite C API to a runtime-selected
delegate. Same source / same artefact targets three NXP SoCs:

    i.MX 8M Plus → NXP VxDelegate         (libvx_delegate.so)
    i.MX 93      → Arm Ethos-U external   (libethosu_delegate.so)
    i.MX 95      → Arm Ethos-U external   (libethosu_delegate.so)
    any AArch64  → XNNPACK CPU fallback   (built into TFLite)

Code
----
* include/facex_backend.h — pluggable FacexBackend vtable. i.MX is
  the first concrete consumer beyond the in-tree CPU backend.
* include/facex_npu.h — facex_npu_init / _embed / _detect / _free,
  mirrors facex.h shape so callers can swap CPU/NPU backends.
* src/backend_tflite.c — dlopen-based delegate loader (vx →
  ethos-u → armnn fallback chain), TfLiteModel + Interpreter setup,
  INT8 quantize/dequantize for the embedder, L2 normalize.
  Detector path is intentionally -ENOSYS — the recommended
  deployment is the hybrid pipeline (CPU detect via libfacex.a,
  NPU embed via this backend). 80% of the perf benefit, none of
  the post-processing risk.

Tooling (offline model conversion)
----------------------------------
* tools/onnx_to_tflite.py — ONNX → SavedModel (via onnx2tf) →
  INT8 TFLite (via tf.lite). Accepts a calibration directory of
  representative face crops; falls back to noise calibration with
  a warning. Uses subprocess.run with arg lists, not os.system.
* tools/compile_vela.sh — wraps Arm's Vela compiler to produce an
  Ethos-U65 command stream from an INT8 .tflite. Defaults to
  ethos-u65-256 (i.MX 93 / 95). Prints op coverage from Vela's
  summary CSV so any CPU-fallback layers are visible.

Build
-----
Makefile gains four new targets:
    make imx-npu                — host build (e.g. for XNNPACK fallback test)
    make imx93   SDK=…          — cross-compile for i.MX 93   (A55 + Ethos-U65)
    make imx95   SDK=…          — cross-compile for i.MX 95   (A55 + Ethos-U65)
    make imx8mp  SDK=…          — cross-compile for i.MX 8M Plus (A53 + VIP9000)
All four produce libfacex_npu.{so,dylib}; the difference is the
-mcpu tuning and which delegate the runtime picks at first init.
TFLite lives behind FACEX_BACKEND_TFLITE so the existing CPU build
(`make`) is unchanged — no new mandatory dependency.

Test
----
* tests/test_imx_npu_compile.c — API surface compile + link smoke,
  runs without an actual NPU device. With one or two .tflite paths,
  also reports the active delegate. Useful for CI on hosts without
  NXP / Arm hardware.
* scripts/test_all.sh — appended NPU compile-check section using a
  minimal TFLite header stub so the syntax check works on any host.

Validation
----------
* Default `make` still builds and `make mac-test` still passes
  byte-identical (the NPU code is gated on FACEX_BACKEND_TFLITE).
* src/backend_tflite.c + tests/test_imx_npu_compile.c syntax-check
  cleanly against the minimal TFLite C-API header stub.
* Hardware bring-up on real i.MX EVK is the next milestone — see
  docs/imx_npu.md §5 "Hardware bring-up checklist".

Documentation
-------------
* docs/imx_npu.md — full deployment guide: model conversion
  pipeline, host vs cross-compile builds, hybrid pipeline wiring,
  per-SoC bring-up checklist, known limitations.
* docs/implementation.md — appended "i.MX NPU library" section.
* docs/coverage_matrix.md — appended NPU rows.
* CLAUDE.md — make-target list extended with imx-* options; new
  bullet documenting the libfacex_npu library, hybrid pipeline,
  and -ENOSYS detector behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

83aeee7  Add ESP32-P4 ESP-IDF component + MIPI-CSI camera example

A complete ESP-IDF project that brings up the MIPI-CSI camera on an
ESP32-P4 and feeds frames into the FaceX detection wrapper. Capture
path is real and runnable; the face-detection backend ships in three
selectable forms (stub / native / espnn) gated by Kconfig.

Reasonable assumptions baked in
-------------------------------
The camera bridge ships complete; the model story is staged because:

1. Bring-up first, model second. Customers integrating FaceX on P4
   need to first prove camera + downscale + UART work. The default
   `stub` backend emits a deterministic synthetic face per frame
   so every code path is exercised without committing to a model.
2. Native backend is for evaluation only. EdgeFace-XS compiles and
   runs on P4 but at 1-3 s/frame — demonstrably not a product.
   Provided so partners can verify "the engine technically works".
3. EdgeFace-Nano is future work. Distilled model (~300 K params,
   64×64 input, 256-d, no XCA attn) plus an ESP-NN backend (PIE-
   SIMD INT8 conv) is the production target. Kconfig slot
   `CONFIG_FACEX_BACKEND_ESPNN` is reserved (`depends on 0`) so
   adopters can see the eventual shape.

components/facex/
-----------------
* Kconfig — choice between Stub / Native / ESPNN(reserved) backends,
  detector input W/H, optional per-frame log.
* CMakeLists.txt — registers the component, conditionally pulls
  src/edgeface_engine.c et al. when FACEX_BACKEND_NATIVE is set.
* include/facex_esp.h — small init / detect / free API. Mirrors
  FaceXResult (minus full embedding) so applications don't need to
  know which backend is running.
* src/facex_esp.c — dispatches detect calls. Stub backend emits one
  deterministic synthetic face per frame with smooth bbox jitter.
  Native backend forwards into the existing C engine — works but
  ~1-3 s/frame on P4, evaluation only.

examples/esp32p4_camera/
------------------------
* main/app_main.c — full ESP-IDF camera_driver recipe per
  https://docs.espressif.com/.../camera_driver.html — LDO 2.5 V
  for the CSI PHY, SCCB I2C, esp_cam_sensor_detect (auto-picks
  SC2336 etc.), set_format, esp_cam_new_csi_ctlr, on_get_new_trans
  / on_trans_finished callbacks (IRAM_ATTR), PSRAM frame buffer
  ring, capture_task that downscales RGB565 → RGB888 and calls
  facex_esp_detect, requeues. Logs FPS + detection latency once
  per second.
* main/Kconfig.projbuild — sensor resolution, lane count, lane
  bit-rate, SCCB pins, sensor reset+pwdn GPIOs.
* main/idf_component.yml — pulls esp_cam_sensor + esp_video.
* sdkconfig.defaults — target esp32p4, PSRAM hex, CPU @ 360 MHz,
  main task stack 8 KB.
* README.md — build + flash, expected console output, backend
  selection table.

Documentation
-------------
* docs/esp32p4.md — full deployment guide: status table, prereqs
  (IDF v5.4+), backend selection, resource budget on the
  Function-EV-Board with the SC2336 sensor, troubleshooting.
* docs/implementation.md — appended "ESP32-P4 ESP-IDF component"
  section.
* docs/coverage_matrix.md — appended ESP32 rows.
* scripts/test_all.sh — appended ESP32-P4 syntax check using
  synthesized IDF header stubs (esp_err / esp_log / esp_timer /
  sdkconfig) so the wrapper compiles cleanly without a full IDF
  install.
* CLAUDE.md — note added so future sessions know this is an IDF
  component (idf.py), not a Makefile target.

Honest scope
------------
Camera + dispatch bridge is complete and runnable. Model story
(EdgeFace-Nano distillation, ESP-NN backend) is the follow-up.
Once it lands, only the Kconfig backend toggle changes; everything
else in this commit stays.

Default host build (libfacex.a, mac-test) untouched and still
passing byte-identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous code claimed i.MX 95 used Arm Ethos-U65, but it actually
ships NXP's eIQ Neutron N3 NPU (Ethos-U65 is i.MX 93 only). Register
libneutron_delegate.so in the TFLite delegate loader and fix the
documentation across CLAUDE.md, the Makefile, the public header, and
docs/imx_npu.md (per-SoC bring-up table, offline compiler note for
neutron-converter vs vela).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Neutron delegate (and Ethos-U, for that matter) silently produces
"0 nodes delegated" when handed a .tflite that wasn't pre-compiled by
the matching offline tool — same latency as XNNPACK, no NPU offload.
Surface this failure mode two ways:

- tools/compile_neutron.sh — thin wrapper around NXP's neutron-converter
  (eIQ Toolkit), mirroring tools/compile_vela.sh: same args shape, same
  output naming convention (<base>_neutron.tflite).
- backend_tflite.c — when verbose=1 and a Neutron/Ethos-U delegate is
  picked, print a one-shot hint at init time pointing at the right
  offline tool, so users can immediately interpret a subsequent
  "0 nodes delegated" line from TFLite.

Also expand docs/imx_npu.md §1 with full instructions for obtaining
neutron-converter (nxp.com download path, host-OS install matrix,
env-script activation, BSP/toolkit version pinning).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lets us compare CPU NEON, XNNPACK, eIQ Neutron, Ethos-U, and VxDelegate
side-by-side in a single CSV instead of running NXP's benchmark_model
separately and reconciling output formats.

Pieces:
- tools/bench_npu.c — synthetic-input bench that mirrors facex-bench's
  CSV/MD/JSON schema. Emit-only embed stage (facex_npu_detect is -ENOSYS).
- include/facex_npu.h — extend FaceXNpuOptions with external_delegate_path
  so callers (the bench, eventually production apps) can dlopen any
  TFLite-external-delegate-ABI .so by absolute path, matching how
  benchmark_model exposes --external_delegate_path.
- src/backend_tflite.c — derive_path_name() picks a tidy logging name
  from a delegate path (libneutron_delegate.so → "neutron", libarmnnDelegate.so
  → "armnn"), then select_delegate honours a non-NULL path before walking
  the registry. preferred_delegate continues to work unchanged.
- Makefile — facex-bench-npu target (depends on libfacex_npu.so), added
  to clean.

Docs:
- docs/benchmarking.md — new section for facex-bench-npu, three-way
  comparison recipe (NEON / XNNPACK / Neutron in one /tmp/cmp.csv).
- docs/imx_npu.md — testing section gains a bench subsection cross-linking
  to docs/benchmarking.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>