perf(asr): Whisper conv1d uses CPU fallback instead of GPU kernel #180

@m96-chan

Description

The Whisper ASR encoder uses a CPU fallback implementation for 1D convolution instead of a native GPU kernel.

Location

src/pygpukit/asr/whisper/encoder.py:78

# CPU fallback implementation using im2col
# TODO: Implement native GPU conv1d kernel
x_np = x.to_numpy()
w_np = weight.to_numpy()

Impact

  • Performance: conv1d sits on the encoder's hot path and runs over every audio frame
  • Memory: unnecessary GPU -> CPU -> GPU round-trip transfers on every call
  • Latency: the transfers and the Python im2col loop add significant overhead to ASR inference

Current Implementation

The fallback uses an im2col + NumPy matmul pattern (a minimal sketch of this pattern follows the list):

  1. x.to_numpy() - GPU to CPU transfer
  2. np.pad() - CPU padding
  3. im2col loop (Python for-loop)
  4. from_numpy() - CPU to GPU transfer
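
For reference, a minimal NumPy sketch of this fallback pattern; the x_np / w_np names follow the snippet above, while the bias handling and the stride/padding defaults are assumptions rather than the exact code in encoder.py:

import numpy as np

def conv1d_im2col_cpu(x_np, w_np, bias_np, stride=1, padding=1):
    """CPU fallback: pad, unfold windows into columns, then one matmul."""
    c_in, t_in = x_np.shape           # input:  (channels, time)
    c_out, _, k = w_np.shape          # weight: (out_channels, in_channels, kernel_size)
    x_pad = np.pad(x_np, ((0, 0), (padding, padding)))
    t_out = (t_in + 2 * padding - k) // stride + 1

    # im2col: gather every kernel window into a (c_in * k, t_out) matrix.
    cols = np.empty((c_in * k, t_out), dtype=x_np.dtype)
    for i in range(t_out):            # the Python for-loop the issue refers to
        window = x_pad[:, i * stride:i * stride + k]
        cols[:, i] = window.reshape(-1)

    # The convolution collapses to a single matmul against flattened weights.
    y = w_np.reshape(c_out, -1) @ cols + bias_np[:, None]
    return y                          # (out_channels, t_out); goes back via from_numpy()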

Required Work

  1. Implement a native CUDA conv1d kernel in native/ops/conv1d.cu (see the sketch after this list)
  2. Support stride, padding, dilation parameters
  3. Optimize for typical ASR dimensions (80 mel bins, kernel_size=3)
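
The long-term fix is the raw CUDA kernel in native/ops/conv1d.cu, but since the v0.3 label also covers a Triton backend, here is a hedged Triton sketch of a direct (non-im2col) conv1d supporting stride, padding, and dilation. All names are illustrative, the input is assumed to be an unbatched, contiguous (C_in, T) float32 buffer with a (C_out, C_in, K) weight and a bias, and the launch wrapper uses PyTorch tensors purely for illustration rather than pygpukit's own GPU arrays:

import torch
import triton
import triton.language as tl

@triton.jit
def _conv1d_kernel(
    x_ptr, w_ptr, b_ptr, y_ptr,
    C_in, T_in, T_out,
    stride, padding, dilation,
    K: tl.constexpr, BLOCK_T: tl.constexpr,
):
    # One program per (output channel, block of output time steps).
    c_out = tl.program_id(0)
    t_out = tl.program_id(1) * BLOCK_T + tl.arange(0, BLOCK_T)
    out_mask = t_out < T_out

    acc = tl.zeros([BLOCK_T], dtype=tl.float32)
    for c_in in range(C_in):
        for k in range(K):            # K is constexpr (3 for Whisper), so this unrolls
            # Input position read by tap k at each output position.
            t_in = t_out * stride + k * dilation - padding
            in_mask = out_mask & (t_in >= 0) & (t_in < T_in)
            x = tl.load(x_ptr + c_in * T_in + t_in, mask=in_mask, other=0.0)
            w = tl.load(w_ptr + (c_out * C_in + c_in) * K + k)
            acc += x * w
    acc += tl.load(b_ptr + c_out)
    tl.store(y_ptr + c_out * T_out + t_out, acc, mask=out_mask)

def conv1d_triton(x, weight, bias, stride=1, padding=1, dilation=1):
    """Illustrative launcher; pygpukit would pass its own device buffers instead."""
    C_in, T_in = x.shape
    C_out, _, K = weight.shape
    T_out = (T_in + 2 * padding - dilation * (K - 1) - 1) // stride + 1
    y = torch.empty((C_out, T_out), device=x.device, dtype=torch.float32)
    BLOCK_T = 128
    grid = (C_out, triton.cdiv(T_out, BLOCK_T))
    _conv1d_kernel[grid](
        x, weight, bias, y,
        C_in, T_in, T_out,
        stride, padding, dilation,
        K=K, BLOCK_T=BLOCK_T,
    )
    return y

This is a correctness-first sketch; a production kernel would also tile over input channels and vectorize loads for the 80-mel, kernel_size=3 case called out above.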

Related

Workaround

None - the feature works but is slow.

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request), v0.3 (Advanced: Triton backend, advanced ops)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
