perf(tts): Kokoro layers use CPU fallback (numpy) instead of GPU kernels

## Description

All neural network layers in `src/pygpukit/tts/kokoro/layers.py` fall back to CPU implementations built on numpy instead of running native GPU kernels.

## Affected Layers

| Layer | Issue |
|-------|-------|
| `Conv1d` | im2col + numpy matmul |
| `ConvTranspose1d` | Scatter-add via Python loop |
| `BertSelfAttention` | numpy attention (no FlashAttention) |
| `ResBlock1d` | Uses Conv1d (CPU) |
| `ISTFTNet` | Overlap-add via Python loop |
| `leaky_relu` | numpy where() |
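
To make the "via Python loop" entries concrete, here is a minimal sketch of overlap-add written two ways: with a per-frame Python loop, and as a single scatter-add over a precomputed index map. The function names and the `[n_frames, frame_len]` layout are assumptions for illustration, not the code in `layers.py`. The index map is the part that would carry over to a GPU kernel, where each frame sample adds into one output sample.

```python
import numpy as np


def overlap_add_loop(frames: np.ndarray, hop: int) -> np.ndarray:
    """Overlap-add with a per-frame Python loop (the slow pattern)."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len, dtype=frames.dtype)
    for t in range(n_frames):
        out[t * hop : t * hop + frame_len] += frames[t]
    return out


def overlap_add_scatter(frames: np.ndarray, hop: int) -> np.ndarray:
    """Same result as a single scatter-add, with no Python-level loop."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len, dtype=frames.dtype)
    # idx[t, k] is the output sample that frames[t, k] contributes to.
    idx = np.arange(n_frames)[:, None] * hop + np.arange(frame_len)[None, :]
    # np.add.at is unbuffered, so repeated indices accumulate correctly.
    np.add.at(out, idx, frames)
    return out
```

The loop version can serve as a correctness oracle when a native kernel is written.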

## Example: Conv1d Implementation

```python
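# Abridged excerpt: setup of `col`, `out_length`, `j_strided`, `i_dilated`,
# and `w_reshaped` is omitted.
def __call__(self, x: GPUArray) -> GPUArray: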
    # Convert to numpy for im2col (can be optimized later)
    x_np = x.to_numpy()
    w_np = self.weight.to_numpy()
    
    # im2col: extract patches
    for i in range(self.kernel_size):
        for j in range(out_length):
            col[:, :, i, j] = x_np[:, :, j_strided + i_dilated]
    
    # Matmul
    out_np = np.einsum("bkl,ok->bol", col, w_reshaped)
    
    return from_numpy(out_np.astype(np.float32))
```
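
For comparison, a loop-free NumPy version of the same im2col idea is sketched below. It is not the code in `layers.py`: the function name, the `[batch, in_ch, length]` / `[out_ch, in_ch, kernel]` layouts, and the absence of bias and groups are all assumptions. It still runs on the CPU, so it does not fix this issue, but it could be useful as a reference when validating a native conv1d kernel.

```python
import numpy as np


def conv1d_reference(x, weight, stride=1, padding=0, dilation=1):
    """Cross-correlation conv1d (hypothetical helper, no bias, no groups).

    x:      [batch, in_ch, length]
    weight: [out_ch, in_ch, kernel]
    returns [batch, out_ch, out_len] as float32
    """
    batch, in_ch, length = x.shape
    out_ch, w_in_ch, kernel = weight.shape
    assert in_ch == w_in_ch, "channel mismatch"

    # Zero-pad the length axis on both sides.
    x_pad = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    out_len = (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

    # idx[k, l] = padded input position read by kernel tap k for output l.
    idx = (np.arange(kernel)[:, None] * dilation
           + np.arange(out_len)[None, :] * stride)

    # Fancy indexing gathers all patches at once: [batch, in_ch, kernel, out_len].
    col = x_pad[:, :, idx]

    # Contract over input channels and kernel taps.
    return np.einsum("bckl,ock->bol", col, weight).astype(np.float32)
```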

## Impact

- **Latency**: Every layer incurs GPU->CPU->GPU transfer overhead
- **Throughput**: Per-element Python loops (im2col, scatter-add, overlap-add) are orders of magnitude slower than equivalent CUDA kernels
- **Memory**: Unnecessary copies between GPU and CPU memory

## Note

This is a **performance issue**, not a functionality issue. The layers work correctly, but slowly.

However, see #179: the main TTS issue is that `model._forward_simple()` does not call these layers at all and generates a sine wave instead.

## Required Work

1. Implement native CUDA conv1d kernel
2. Implement native CUDA transpose conv1d kernel
3. Reuse the existing SDPA kernel for attention (a causal variant already exists; a bidirectional, non-causal variant is needed)
4. Implement LeakyReLU kernel
5. Implement ISTFT kernel (or use cuFFT)
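
For item 3, the difference between causal and bidirectional attention is only the mask applied to the score matrix. The sketch below is a plain NumPy reference, assuming a `[heads, seq, head_dim]` layout and a hypothetical function name; it does not reflect the signature of the project's SDPA kernel. It could serve as a CPU oracle when testing a bidirectional GPU path.

```python
import numpy as np


def sdpa_reference(q, k, v, causal=False):
    """Scaled dot-product attention; q, k, v are [heads, seq, head_dim]."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = np.einsum("hqd,hkd->hqk", q, k) * scale

    if causal:
        # Mask out keys that lie after each query position.
        seq_q, seq_k = scores.shape[-2:]
        future = np.triu(np.ones((seq_q, seq_k), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Bidirectional (BERT-style) attention applies no mask at all.

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    return np.einsum("hqk,hkd->hqd", weights, v)
```

With `causal=True` the first query can only attend to the first key, so its output equals `v[:, 0]`; with `causal=False` every query sees the full sequence.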

## Priority

**Low** - First fix #179 (make TTS functional), then optimize performance.

## Related

- #179 - TTS outputs sine wave (blocking bug)
- #177 - Image generation needs similar kernels
