
perf(llm): Batch decode missing CUDA Graph and zero-alloc optimizations #181

@m96-chan

Description

The LLM batch decode path (forward_batch_zero_alloc) still carries several unimplemented performance-optimization TODOs.

Locations

1. CUDA Graph capture (model.py:916)

```python
# TODO: CUDA Graph capture can be added once this path is validated.
```

The M=1 decode has CUDA Graph support (DecodeM1Graph), but batch decode (M>1) does not.
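
As a reference point, below is a minimal sketch of what an M>1 analogue of DecodeM1Graph could look like. It is written against PyTorch's torch.cuda.CUDAGraph API purely for concreteness; the BatchDecodeGraph name, the single-argument forward_batch_zero_alloc call, and the buffer shapes are all assumptions, not pygpukit's actual API.

```python
import torch

class BatchDecodeGraph:
    """Hypothetical M>1 analogue of DecodeM1Graph: capture one batch
    decode step into a CUDA Graph and replay it with fixed buffers."""

    def __init__(self, model, batch_size, hidden_dim, device="cuda"):
        self.model = model
        # Graph replay reuses fixed pointers, so inputs must live in a
        # static buffer that is copied into, never reallocated.
        self.static_in = torch.empty(batch_size, hidden_dim, device=device)
        self.graph = torch.cuda.CUDAGraph()
        self.static_out = None

    def capture(self):
        # Warm-up on a side stream is required before graph capture.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            self.model.forward_batch_zero_alloc(self.static_in)
        torch.cuda.current_stream().wait_stream(s)

        # Record one decode step; subsequent replays rerun these kernels.
        with torch.cuda.graph(self.graph):
            self.static_out = self.model.forward_batch_zero_alloc(self.static_in)

    def replay(self, hidden_states):
        # Refresh the static input, then replay the recorded kernels.
        self.static_in.copy_(hidden_states)
        self.graph.replay()
        return self.static_out
```

Capture is only practical once the decode step is allocation-free, which is why items 2 and 3 below are effectively prerequisites.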

2. Zero-alloc Attention (model.py:965)

```python
# TODO: Add forward_fixed_cache_batch_zero_alloc to Attention class
attn_out = block.attn.forward_fixed_cache_batch(
    norm_out_buf, start_position, context_len
)
```

The batch attention path still allocates intermediate buffers.
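
A hypothetical sketch of the missing zero-alloc variant follows, with PyTorch-style out= matmuls and in-place ops standing in for pygpukit's kernels. It is single-head and treats the M tokens as consecutive positions in one cache to stay short (real batch decode indexes one cache position per sequence); bufs is a pre-allocated container like the one sketched under Required Work, and its field names are invented.

```python
import math

import torch

def forward_fixed_cache_batch_zero_alloc(self, x, start_position, context_len, bufs):
    # Every intermediate is written into a pre-allocated slice of `bufs`
    # instead of a freshly allocated tensor (single-head, simplified).
    M = x.shape[0]
    q = torch.matmul(x, self.wq.T, out=bufs.q[:M])
    k = torch.matmul(x, self.wk.T, out=bufs.k[:M])
    v = torch.matmul(x, self.wv.T, out=bufs.v[:M])

    # New K/V rows land in the fixed cache in place.
    self.k_cache[start_position:start_position + M].copy_(k)
    self.v_cache[start_position:start_position + M].copy_(v)

    # Scores over the valid cache prefix. out= keeps the caller
    # allocation-free; a real kernel would write the strided view directly.
    keys = self.k_cache[:context_len]
    scores = torch.matmul(q, keys.T, out=bufs.scores[:M, :context_len])
    scores.mul_(1.0 / math.sqrt(q.shape[-1]))

    # In-place softmax; amax/sum still allocate tiny [M, 1] tensors, so a
    # truly zero-alloc path would want a fused softmax kernel.
    scores.sub_(scores.amax(dim=-1, keepdim=True))
    scores.exp_()
    scores.div_(scores.sum(dim=-1, keepdim=True))

    torch.matmul(scores, self.v_cache[:context_len], out=bufs.attn_out[:M])
    return torch.matmul(bufs.attn_out[:M], self.wo.T, out=bufs.proj_out[:M])
```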

3. Zero-alloc MLP (model.py:985)

```python
# TODO: Add zero-alloc MLP path
mlp_out = block.mlp(norm_out_buf)
```

The MLP layer allocates fresh buffers for the gate_proj, up_proj, and down_proj outputs on every call.
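
Below is a sketch of one way the zero-alloc variant could look: the gate/up/down output buffers are allocated once at construction and reused via out= matmuls plus an in-place activation. PyTorch ops stand in for pygpukit kernels, and ZeroAllocMLP, forward_zero_alloc, and the SwiGLU-style structure are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class ZeroAllocMLP:
    """Hypothetical SwiGLU-style MLP whose intermediates are allocated
    once and reused on every forward call."""

    def __init__(self, hidden_dim, inter_dim, max_batch, device="cuda"):
        # Weights would be loaded from the checkpoint in practice.
        self.w_gate = torch.empty(inter_dim, hidden_dim, device=device)
        self.w_up = torch.empty(inter_dim, hidden_dim, device=device)
        self.w_down = torch.empty(hidden_dim, inter_dim, device=device)
        # Pre-allocated once for the largest batch, sliced per call.
        self.gate_buf = torch.empty(max_batch, inter_dim, device=device)
        self.up_buf = torch.empty(max_batch, inter_dim, device=device)
        self.out_buf = torch.empty(max_batch, hidden_dim, device=device)

    def forward_zero_alloc(self, x):
        M = x.shape[0]
        torch.matmul(x, self.w_gate.T, out=self.gate_buf[:M])
        torch.matmul(x, self.w_up.T, out=self.up_buf[:M])
        # SiLU and the gating product both run in place: no new tensors.
        F.silu(self.gate_buf[:M], inplace=True)
        self.gate_buf[:M].mul_(self.up_buf[:M])
        torch.matmul(self.gate_buf[:M], self.w_down.T, out=self.out_buf[:M])
        return self.out_buf[:M]
```

The return value is a view into out_buf, so callers must consume or copy it before the next call overwrites it.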

Impact

  • Batch decode performance: Currently allocates buffers per layer per token
  • Memory fragmentation: Many small allocations stress the memory pool
  • Kernel launch overhead: CUDA Graph capture would cut per-step kernel launch cost in the batch decode loop

Current Performance (v0.2.11)

| Batch Size | Per-Token Latency (µs) | Throughput |
|-----------:|-----------------------:|-----------:|
| 1          | 381,303                | 2.6 tok/s  |
| 8          | 55,845                 | 17.9 tok/s |

With zero-alloc + CUDA Graph, expect ~20-30% improvement.
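
Against the batch-8 numbers above, a 20-30% gain would mean roughly 43,000-46,500 µs per token, i.e. about 21.5-23.3 tok/s.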

Required Work

  1. Add forward_fixed_cache_batch_zero_alloc to Attention class
  2. Add zero-alloc variants to MLP (reuse gate/up/down buffers)
  3. Implement CUDA Graph capture for batch decode loop
  4. Update DecodeBuffers with batch-specific pre-allocated buffers (one possible layout is sketched after this list)
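
For item 4, one option is a container holding every per-step intermediate, sized once for the maximum batch. The field names and shapes below are guesses tied to the sketches above, not the existing DecodeBuffers definition.

```python
from dataclasses import dataclass

import torch

@dataclass
class BatchDecodeBuffers:
    """Hypothetical batch extension of DecodeBuffers: one pre-allocated
    tensor per intermediate used by the batch decode loop."""
    norm_out: torch.Tensor  # [max_batch, hidden_dim]
    q: torch.Tensor         # [max_batch, hidden_dim]
    k: torch.Tensor         # [max_batch, hidden_dim]
    v: torch.Tensor         # [max_batch, hidden_dim]
    scores: torch.Tensor    # [max_batch, max_context]
    attn_out: torch.Tensor  # [max_batch, hidden_dim]
    proj_out: torch.Tensor  # [max_batch, hidden_dim]
    gate: torch.Tensor      # [max_batch, inter_dim]
    up: torch.Tensor        # [max_batch, inter_dim]
    mlp_out: torch.Tensor   # [max_batch, hidden_dim]

    @classmethod
    def allocate(cls, max_batch, hidden_dim, inter_dim, max_context, device="cuda"):
        def e(*shape):
            return torch.empty(*shape, device=device)
        return cls(
            norm_out=e(max_batch, hidden_dim),
            q=e(max_batch, hidden_dim),
            k=e(max_batch, hidden_dim),
            v=e(max_batch, hidden_dim),
            scores=e(max_batch, max_context),
            attn_out=e(max_batch, hidden_dim),
            proj_out=e(max_batch, hidden_dim),
            gate=e(max_batch, inter_dim),
            up=e(max_batch, inter_dim),
            mlp_out=e(max_batch, hidden_dim),
        )
```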

Related

  • M=1 CUDA Graph: src/pygpukit/llm/decode/m1_graph.py
  • Batch decode: src/pygpukit/llm/decode/batch.py
