Problem
The current LLaMA 4 CUDA kernel implementation is monolithic:
native/ops/nn/
└── llama4/                     # ← All kernels bundled together
    └── llama4_kernels.cuh
This violates the modular architecture defined in CLAUDE.md, where NN operations should be separated by function:
native/ops/nn/
├── activation/ # GELU, SiLU, etc.
├── attention/ # SDPA
├── norm/ # RMSNorm, LayerNorm
└── rope/ # RoPE
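As an illustration of the intended granularity, here is a minimal sketch of what an extracted kernel header under norm/ could look like. The file path, namespace, and kernel signature are assumptions for illustration, not the actual contents of llama4_kernels.cuh:

```cuda
// native/ops/nn/norm/rms_norm.cuh (hypothetical path and signature)
#pragma once

#include <cuda_runtime.h>

namespace nn::norm {

// One block per row; blockDim.x threads cooperatively reduce the row.
// Assumes blockDim.x is a power of two (for the tree reduction) and that
// the kernel is launched with blockDim.x * sizeof(float) shared memory.
__global__ void rms_norm_kernel(const float* __restrict__ input,
                                const float* __restrict__ weight,
                                float* __restrict__ output,
                                int hidden_size,
                                float eps) {
    extern __shared__ float shared[];
    const float* row_in = input + blockIdx.x * hidden_size;
    float* row_out = output + blockIdx.x * hidden_size;

    // Each thread accumulates a partial sum of squares over its strided slice.
    float sum_sq = 0.0f;
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
        float v = row_in[i];
        sum_sq += v * v;
    }
    shared[threadIdx.x] = sum_sq;
    __syncthreads();

    // Tree reduction in shared memory to get the row's total sum of squares.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) shared[threadIdx.x] += shared[threadIdx.x + stride];
        __syncthreads();
    }

    // Normalize and apply the learned scale.
    float inv_rms = rsqrtf(shared[0] / hidden_size + eps);
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
        row_out[i] = row_in[i] * inv_rms * weight[i];
    }
}

}  // namespace nn::norm
```

A call site could then launch it as, e.g., `rms_norm_kernel<<<num_rows, 256, 256 * sizeof(float)>>>(in, w, out, hidden_size, 1e-5f);` with no dependency on any LLaMA 4 code.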
Impact
- LLaMA 4-specific kernels cannot be reused by other models
- Violates the principle of modular, composable operations
- Makes testing individual components harder
Proposed Solution
- Extract common operations from llama4/ into their respective directories
- Keep only LLaMA 4-specific logic (if any) in llama4/
- Update bindings to use modular kernel paths (see the sketch after this list)
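A rough sketch of what the bindings update could look like; the include paths follow the proposed layout above, and the individual header names are hypothetical:

```cuda
// Before: bindings pulled everything in through the monolithic header.
// #include "native/ops/nn/llama4/llama4_kernels.cuh"

// After: each binding includes only the modular kernels it needs.
// (Header names below are assumptions based on the proposed layout.)
#include "native/ops/nn/activation/silu.cuh"
#include "native/ops/nn/attention/sdpa.cuh"
#include "native/ops/nn/norm/rms_norm.cuh"
#include "native/ops/nn/rope/rope.cuh"
```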
Related
- Added in commit 5fcf3c3 with a note that this refactor is needed