Port Phi-3 fused QKV/FFN from quant.h to libturboquant (enables Metal GPU) #91

## Summary

Complete the Phi-3 architecture port from `quant.h` to `src/engine/*.c` so that `quant-server` (libturboquant) can serve Phi-3.5 with Metal GPU acceleration.

## Current State

- **quant.h**: Phi-3.5 works correctly (6.5 tok/s, CPU NEON)
- **libturboquant**: Phi-3.5 crashes or produces garbage
- **Workaround**: `quant-server-unified` (compiles quant.h directly, no GPU)

## What Needs Porting

| Feature | quant.h | src/engine/ | Status |
|---------|---------|-------------|--------|
| Fused `attn_qkv` detection | ✅ | Partial | loader works, forward crashes |
| Fused `ffn_up_gate` detection | ✅ | Partial | loader works |
| LongRoPE NeoX rotation | ✅ | Written | untested |
| K/V projection skip for fused QKV | ✅ | **Missing** | root cause of crash |
| State allocation (xb2, hb sizing) | ✅ | Written | needs verification |
| BOS token handling | ✅ | Written | untested |
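The missing K/V projection skip can be pictured roughly as below. This is a minimal sketch, not the code in `quant.h` or `src/engine/`: the `attn_weights` struct, `matvec`, and `project_qkv` are hypothetical names, and the fused row order (Q rows, then K rows, then V rows) is an assumption that should be checked against the loader.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-layer attention weights (illustrative names only).
 * A fused-QKV model sets wqkv and leaves wq/wk/wv NULL; other models
 * do the opposite. */
typedef struct {
    const float *wqkv; /* fused [q_dim + 2*kv_dim, dim], or NULL */
    const float *wq;   /* [q_dim, dim]  */
    const float *wk;   /* [kv_dim, dim] */
    const float *wv;   /* [kv_dim, dim] */
} attn_weights;

/* Naive row-major matrix-vector product: out[r] = dot(w[r, :], x). */
void matvec(float *out, const float *w, const float *x,
            size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++) {
            acc += w[r * cols + c] * x[c];
        }
        out[r] = acc;
    }
}

/* Project x into q, k, v. With a fused weight, one matvec fills scratch
 * (q_dim + 2*kv_dim floats), the result is split, and the separate K/V
 * projections are skipped. Without one, the unfused path runs unchanged. */
void project_qkv(const attn_weights *w, const float *x,
                 float *q, float *k, float *v, float *scratch,
                 size_t dim, size_t q_dim, size_t kv_dim) {
    if (w->wqkv != NULL) {
        matvec(scratch, w->wqkv, x, q_dim + 2 * kv_dim, dim);
        /* Assumed layout: [Q | K | V] along the output dimension. */
        memcpy(q, scratch, q_dim * sizeof(float));
        memcpy(k, scratch + q_dim, kv_dim * sizeof(float));
        memcpy(v, scratch + q_dim + kv_dim, kv_dim * sizeof(float));
        return; /* wk and wv are NULL here and must not be touched */
    }
    assert(w->wq != NULL && w->wk != NULL && w->wv != NULL);
    matvec(q, w->wq, x, q_dim, dim);
    matvec(k, w->wk, x, kv_dim, dim);
    matvec(v, w->wv, x, kv_dim, dim);
}
```

Branching on the presence of the fused weight leaves the unfused path byte-for-byte unchanged, which is the property the SmolLM2 regression tests need to confirm.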

## Previous Attempt

PR #71 attempted this but caused a SmolLM2 regression: the K/V projection skip interacted badly with existing code paths. The port needs to be done **one component at a time**, with regression tests after each step.

## Recommended Approach

1. Port K/V skip only → test SmolLM2 + Phi-3.5
2. Port FFN fused path → test
3. Port LongRoPE → test
4. Metal dispatch → benchmark
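For step 3, the part most likely to differ from an existing RoPE path is the pairing convention. The sketch below shows only the NeoX-style pairing (element `i` rotates with element `i + head_dim/2`, rather than with its immediate neighbor). `rope_neox_apply` is a hypothetical helper, and the cos/sin tables are assumed to be precomputed for the current position with any LongRoPE frequency and magnitude scaling already folded in. How `quant.h` computes those tables is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Rotate one head vector in place using the NeoX pairing.
 * cos_tab and sin_tab each hold head_dim/2 entries for the current
 * position (assumed precomputed by the caller). */
void rope_neox_apply(float *vec, size_t head_dim,
                     const float *cos_tab, const float *sin_tab) {
    assert(head_dim % 2 == 0);
    size_t half = head_dim / 2;
    for (size_t i = 0; i < half; i++) {
        float x0 = vec[i];        /* first half of the pair  */
        float x1 = vec[i + half]; /* second half of the pair */
        vec[i]        = x0 * cos_tab[i] - x1 * sin_tab[i];
        vec[i + half] = x0 * sin_tab[i] + x1 * cos_tab[i];
    }
}
```

A regression test for this step could check that identity tables (cos = 1, sin = 0) leave the vector unchanged, then compare the rotated Q and K against the `quant.h` output for a few positions.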

## Impact

Metal GPU acceleration for Phi-3.5 could raise throughput from 6.5 tok/s to an estimated ~15-20 tok/s on Apple Silicon (3-4B models benefit from Metal, unlike 1B models).

## Priority: P3

This is blocked by #85 (Single Source of Truth). Once that is done, porting is automatic.
