Problem
The current LLaMA 4 CUDA kernel implementation is monolithic:
native/ops/nn/
└── llama4/                     # ← All kernels bundled together
    └── llama4_kernels.cuh
This violates the modular architecture defined in CLAUDE.md, where NN operations should be separated by function:
native/ops/nn/
├── activation/ # GELU, SiLU, etc.
├── attention/ # SDPA
├── norm/ # RMSNorm, LayerNorm
└── rope/ # RoPE
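As an illustration of the intended granularity, here is a minimal sketch of what an extracted kernel header under norm/ could look like. The file path, namespace, and kernel signature are assumptions for illustration, not the actual contents of llama4_kernels.cuh:

```cuda
// native/ops/nn/norm/rms_norm.cuh (hypothetical path and signature)
#pragma once

#include <cuda_runtime.h>

namespace nn::norm {

// One block per row; blockDim.x threads cooperatively reduce the row.
// Assumes blockDim.x is a power of two (for the tree reduction) and that
// the kernel is launched with blockDim.x * sizeof(float) shared memory.
__global__ void rms_norm_kernel(const float* __restrict__ input,
                                const float* __restrict__ weight,
                                float* __restrict__ output,
                                int hidden_size,
                                float eps) {
    extern __shared__ float shared[];
    const float* row_in = input + blockIdx.x * hidden_size;
    float* row_out = output + blockIdx.x * hidden_size;

    // Each thread accumulates a partial sum of squares over its strided slice.
    float sum_sq = 0.0f;
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
        float v = row_in[i];
        sum_sq += v * v;
    }
    shared[threadIdx.x] = sum_sq;
    __syncthreads();

    // Tree reduction in shared memory to get the row's total sum of squares.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) shared[threadIdx.x] += shared[threadIdx.x + stride];
        __syncthreads();
    }

    // Normalize and apply the learned scale.
    float inv_rms = rsqrtf(shared[0] / hidden_size + eps);
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
        row_out[i] = row_in[i] * inv_rms * weight[i];
    }
}

}  // namespace nn::norm
```

A call site could then launch it as, e.g., `rms_norm_kernel<<<num_rows, 256, 256 * sizeof(float)>>>(in, w, out, hidden_size, 1e-5f);` with no dependency on any LLaMA 4 code.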
Impact
- LLaMA 4-specific kernels cannot be reused by other models
- Violates the principle of modular, composable operations
- Makes testing individual components harder
Proposed Solution
- Extract common operations from llama4/ into their respective directories
- Keep only LLaMA 4-specific logic (if any) in llama4/
- Update bindings to use modular kernel paths (see the sketch after this list)
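A rough sketch of what the bindings update could look like; the include paths follow the proposed layout above, and the individual header names are hypothetical:

```cuda
// Before: bindings pulled everything in through the monolithic header.
// #include "native/ops/nn/llama4/llama4_kernels.cuh"

// After: each binding includes only the modular kernels it needs.
// (Header names below are assumptions based on the proposed layout.)
#include "native/ops/nn/activation/silu.cuh"
#include "native/ops/nn/attention/sdpa.cuh"
#include "native/ops/nn/norm/rms_norm.cuh"
#include "native/ops/nn/rope/rope.cuh"
```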
Related
- Added in commit 5fcf3c3 with a note that this refactor is needed