Port Phi-3 fused QKV/FFN from quant.h to libturboquant (enables Metal GPU) #91

@unamedkr

Summary

Complete the Phi-3 architecture port from quant.h to src/engine/*.c so that quant-server (libturboquant) can serve Phi-3.5 with Metal GPU acceleration.

Current State

  • quant.h: Phi-3.5 works perfectly (6.5 tok/s CPU NEON)
  • libturboquant: Phi-3.5 crashes or produces garbage
  • Workaround: quant-server-unified (compiles quant.h directly, no GPU)

What Needs Porting

| Feature | quant.h | src/engine/ | Status |
|---|---|---|---|
| Fused attn_qkv detection | | Partial | loader works, forward crashes |
| Fused ffn_up_gate detection | | Partial | loader works |
| LongRoPE NeoX rotation | | Written | untested |
| K/V projection skip for fused QKV | | Missing | root cause of crash |
| State allocation (xb2, hb sizing) | | Written | needs verification |
| BOS token handling | | Written | untested |
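
To make the "K/V projection skip" row concrete, here is a minimal sketch of what the forward pass would need to do when a layer carries a fused attn_qkv tensor. Everything in it is an assumption for illustration: the `layer_weights` struct, the `project_qkv` and `matvec` names, the Q-then-K-then-V row order, and the use of plain float weights (the real engine works on quantized tensors). It is not the code in quant.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-layer weights; names are illustrative, not from quant.h. */
typedef struct {
    const float *wqkv; /* fused rows: q_dim (Q), kv_dim (K), kv_dim (V); NULL if absent */
    const float *wq;   /* separate projections, used only when wqkv is NULL */
    const float *wk;
    const float *wv;
} layer_weights;

/* out[r] = dot(w[r, :], x) for a row-major rows x cols matrix. */
static void matvec(float *out, const float *w, const float *x,
                   size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++)
            acc += w[r * cols + c] * x[c];
        out[r] = acc;
    }
}

/* Fill q, k and v from either the fused tensor or the separate ones. */
static void project_qkv(const layer_weights *lw, const float *x,
                        float *q, float *k, float *v,
                        size_t n_embd, size_t q_dim, size_t kv_dim) {
    if (lw->wqkv) {
        /* Fused path: slice the single tensor and return early, so the
         * separate wk/wv pointers (NULL for such models) are never read. */
        matvec(q, lw->wqkv, x, q_dim, n_embd);
        matvec(k, lw->wqkv + q_dim * n_embd, x, kv_dim, n_embd);
        matvec(v, lw->wqkv + (q_dim + kv_dim) * n_embd, x, kv_dim, n_embd);
        return;
    }
    /* Unfused path (e.g. SmolLM2 in this sketch) stays exactly as before. */
    assert(lw->wq && lw->wk && lw->wv);
    matvec(q, lw->wq, x, q_dim, n_embd);
    matvec(k, lw->wk, x, kv_dim, n_embd);
    matvec(v, lw->wv, x, kv_dim, n_embd);
}
```

Keeping the fused branch as an early return leaves the unfused path untouched, which is the property the SmolLM2 regression test would need to confirm.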

Previous Attempt

PR #71 attempted this but caused a SmolLM2 regression: the K/V projection skip interacted badly with existing code paths. The port should be done one component at a time, with regression tests after each step.
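
One possible shape for such a regression check is to compare the logits from the quant.h reference path against the ported src/engine/ path for the same prompt. The helper below is a hypothetical sketch: the name `logits_match` and the idea of a fixed absolute tolerance are assumptions, and how the two logit vectors are obtained is left out.

```c
#include <stddef.h>

/* Returns 1 when every logit from the ported path is within tol of the
 * reference value, else 0. Hypothetical helper, not part of the repo. */
static int logits_match(const float *ref, const float *got,
                        size_t n, float tol) {
    for (size_t i = 0; i < n; i++) {
        float d = ref[i] - got[i];
        if (d < 0.0f)
            d = -d;
        /* Written as !(d <= tol) so that NaN, which a broken forward
         * pass may produce, also counts as a mismatch. */
        if (!(d <= tol))
            return 0;
    }
    return 1;
}
```

Running this for both SmolLM2 and Phi-3.5 after each ported component would flag a regression at the step that introduced it.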

Recommended Approach

  1. Port K/V skip only → test SmolLM2 + Phi-3.5
  2. Port FFN fused path → test
  3. Port LongRoPE → test
  4. Metal dispatch → benchmark
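
For step 3, the part that is easy to get wrong is the pairing: NeoX-style rotation pairs element i with element i + head_dim/2, rather than adjacent elements (2i, 2i+1). The sketch below shows only that pairing, taking precomputed cos/sin values for the current position as input. The function name and signature are hypothetical. With LongRoPE, the table would presumably be built from the angle pos * inv_freq[i] / factor[i], using the per-dimension factors from the model metadata, plus any magnitude scaling; that part is assumed and omitted here.

```c
#include <stddef.h>

/* NeoX-style RoPE applied in place to one head vector.
 * cos_t and sin_t each hold head_dim/2 precomputed values for the
 * current position. Hypothetical signature, not taken from quant.h. */
static void rope_neox_apply(float *v, const float *cos_t,
                            const float *sin_t, size_t head_dim) {
    size_t half = head_dim / 2;
    for (size_t i = 0; i < half; i++) {
        float x0 = v[i];        /* first half of the vector  */
        float x1 = v[i + half]; /* its partner in the second half */
        v[i]        = x0 * cos_t[i] - x1 * sin_t[i];
        v[i + half] = x0 * sin_t[i] + x1 * cos_t[i];
    }
}
```

Since this path is listed as written but untested, a small fixed-input check of the pairing is a cheap first test before comparing full model output.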

Impact

Metal GPU acceleration for Phi-3.5 could improve speed from 6.5 to ~15-20 tok/s on Apple Silicon (3-4B models benefit from Metal, unlike 1B).

Priority: P3

This is blocked by #85 (Single Source of Truth); once that is done, porting is automatic.

Labels: enhancement