Labels
enhancement (New feature or request), v0.3 (Advanced: Triton backend, advanced ops)
Description
The Whisper ASR encoder uses a CPU fallback implementation for 1D convolution instead of a native GPU kernel.
Location
src/pygpukit/asr/whisper/encoder.py:78
```python
# CPU fallback implementation using im2col
# TODO: Implement native GPU conv1d kernel
x_np = x.to_numpy()
w_np = weight.to_numpy()
```
Impact
- Performance: Conv1d is called for every audio frame in the encoder
- Memory: Unnecessary GPU -> CPU -> GPU transfers
- Latency: Adds significant overhead to ASR inference
Current Implementation
Uses the im2col + numpy matmul pattern (sketched below):
- x.to_numpy() - GPU to CPU transfer
- np.pad() - CPU padding
- im2col loop (Python for-loop)
- from_numpy() - CPU to GPU transfer
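For reference, the fallback amounts to roughly the sketch below. This is an illustrative reconstruction, not a copy of encoder.py; from_numpy stands in for pygpukit's CPU-to-GPU upload mentioned above, and bias/dilation handling is omitted for brevity.

```python
import numpy as np

# Illustrative sketch of the CPU fallback path (not the actual encoder.py code).
# x: (batch, in_ch, length) device tensor, weight: (out_ch, in_ch, k) device tensor.
def conv1d_cpu_fallback(x, weight, stride=1, padding=0):
    x_np = x.to_numpy()                          # GPU -> CPU transfer
    w_np = weight.to_numpy()                     # GPU -> CPU transfer
    b, c_in, L = x_np.shape
    c_out, _, k = w_np.shape

    x_np = np.pad(x_np, ((0, 0), (0, 0), (padding, padding)))   # CPU padding
    out_len = (L + 2 * padding - k) // stride + 1

    # im2col: gather every receptive field into a (b, c_in * k, out_len) matrix
    cols = np.empty((b, c_in * k, out_len), dtype=x_np.dtype)
    for i in range(out_len):                     # Python for-loop, one slice per output position
        start = i * stride
        cols[:, :, i] = x_np[:, :, start:start + k].reshape(b, -1)

    out = w_np.reshape(c_out, -1) @ cols         # numpy matmul on the CPU
    return from_numpy(out)                       # CPU -> GPU transfer
```

Each conv1d call therefore pays for two device transfers plus a Python-level loop over output positions, which is the overhead described under Impact.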
Required Work
- Implement native CUDA conv1d kernel in native/ops/conv1d.cu
- Support stride, padding, dilation parameters (reference semantics sketched below)
- Optimize for typical ASR dimensions (80 mel bins, kernel_size=3)
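For validating the native kernel, a plain NumPy reference covering stride, padding, and dilation could look like the following. This is a sketch using the usual conv1d conventions; conv1d_reference and the shapes in the usage lines are illustrative, not existing code.

```python
import numpy as np

def conv1d_reference(x, w, bias=None, stride=1, padding=0, dilation=1):
    """Direct (non-im2col) conv1d reference.
    x: (batch, in_ch, length), w: (out_ch, in_ch, k), bias: (out_ch,) or None.
    Output: (batch, out_ch, out_len) with
    out_len = (length + 2*padding - dilation*(k - 1) - 1) // stride + 1.
    """
    b, c_in, L = x.shape
    c_out, _, k = w.shape
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    out_len = (L + 2 * padding - dilation * (k - 1) - 1) // stride + 1
    out = np.zeros((b, c_out, out_len), dtype=x.dtype)
    for i in range(out_len):
        start = i * stride
        # Receptive field with dilation: taps at start, start+dilation, ...
        window = x[:, :, start:start + dilation * (k - 1) + 1:dilation]  # (b, c_in, k)
        out[:, :, i] = np.einsum('bck,ock->bo', window, w)
    if bias is not None:
        out += bias[None, :, None]
    return out

# Illustrative Whisper-like encoder shapes: 80 mel bins, kernel_size=3, same padding
x = np.random.randn(1, 80, 3000).astype(np.float32)
w = np.random.randn(384, 80, 3).astype(np.float32)
y = conv1d_reference(x, w, stride=1, padding=1)   # -> (1, 384, 3000)
```

Comparing the native kernel's output against a reference like this for the 80-mel, kernel_size=3 shapes should catch indexing bugs in the stride/padding/dilation handling.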
Related
- RFC: Image Generation Support (Stable Diffusion, Flux, DiT) #177 - Image generation also needs Conv2d
- Conv1d can share design patterns with Conv2d
Workaround
None - the feature works but is slow.