perf(asr): Whisper conv1d uses CPU fallback instead of GPU kernel #180

@m96-chan

Description

The Whisper ASR encoder uses a CPU fallback implementation for 1D convolution instead of a native GPU kernel.

Location

src/pygpukit/asr/whisper/encoder.py:78

# CPU fallback implementation using im2col
# TODO: Implement native GPU conv1d kernel
x_np = x.to_numpy()
w_np = weight.to_numpy()

Impact

  • Performance: conv1d sits on the encoder's hot path and runs over every audio frame
  • Memory: unnecessary GPU -> CPU -> GPU round-trip transfers on every call
  • Latency: the transfers and the Python im2col loop add significant overhead to ASR inference

Current Implementation

The fallback uses an im2col + NumPy matmul pattern (a minimal sketch of this pattern follows the list):

  1. x.to_numpy() - GPU to CPU transfer
  2. np.pad() - CPU padding
  3. im2col loop (Python for-loop)
  4. from_numpy() - CPU to GPU transfer
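
For reference, a minimal NumPy sketch of this fallback pattern; the x_np / w_np names follow the snippet above, while the bias handling and the stride/padding defaults are assumptions rather than the exact code in encoder.py:

import numpy as np

def conv1d_im2col_cpu(x_np, w_np, bias_np, stride=1, padding=1):
    """CPU fallback: pad, unfold windows into columns, then one matmul."""
    c_in, t_in = x_np.shape           # input:  (channels, time)
    c_out, _, k = w_np.shape          # weight: (out_channels, in_channels, kernel_size)
    x_pad = np.pad(x_np, ((0, 0), (padding, padding)))
    t_out = (t_in + 2 * padding - k) // stride + 1

    # im2col: gather every kernel window into a (c_in * k, t_out) matrix.
    cols = np.empty((c_in * k, t_out), dtype=x_np.dtype)
    for i in range(t_out):            # the Python for-loop the issue refers to
        window = x_pad[:, i * stride:i * stride + k]
        cols[:, i] = window.reshape(-1)

    # The convolution collapses to a single matmul against flattened weights.
    y = w_np.reshape(c_out, -1) @ cols + bias_np[:, None]
    return y                          # (out_channels, t_out); goes back via from_numpy()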

Required Work

  1. Implement a native CUDA conv1d kernel in native/ops/conv1d.cu (see the sketch after this list)
  2. Support stride, padding, dilation parameters
  3. Optimize for typical ASR dimensions (80 mel bins, kernel_size=3)
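
The long-term fix is the raw CUDA kernel in native/ops/conv1d.cu, but since the v0.3 label also covers a Triton backend, here is a hedged Triton sketch of a direct (non-im2col) conv1d supporting stride, padding, and dilation. All names are illustrative, the input is assumed to be an unbatched, contiguous (C_in, T) float32 buffer with a (C_out, C_in, K) weight and a bias, and the launch wrapper uses PyTorch tensors purely for illustration rather than pygpukit's own GPU arrays:

import torch
import triton
import triton.language as tl

@triton.jit
def _conv1d_kernel(
    x_ptr, w_ptr, b_ptr, y_ptr,
    C_in, T_in, T_out,
    stride, padding, dilation,
    K: tl.constexpr, BLOCK_T: tl.constexpr,
):
    # One program per (output channel, block of output time steps).
    c_out = tl.program_id(0)
    t_out = tl.program_id(1) * BLOCK_T + tl.arange(0, BLOCK_T)
    out_mask = t_out < T_out

    acc = tl.zeros([BLOCK_T], dtype=tl.float32)
    for c_in in range(C_in):
        for k in range(K):            # K is constexpr (3 for Whisper), so this unrolls
            # Input position read by tap k at each output position.
            t_in = t_out * stride + k * dilation - padding
            in_mask = out_mask & (t_in >= 0) & (t_in < T_in)
            x = tl.load(x_ptr + c_in * T_in + t_in, mask=in_mask, other=0.0)
            w = tl.load(w_ptr + (c_out * C_in + c_in) * K + k)
            acc += x * w
    acc += tl.load(b_ptr + c_out)
    tl.store(y_ptr + c_out * T_out + t_out, acc, mask=out_mask)

def conv1d_triton(x, weight, bias, stride=1, padding=1, dilation=1):
    """Illustrative launcher; pygpukit would pass its own device buffers instead."""
    C_in, T_in = x.shape
    C_out, _, K = weight.shape
    T_out = (T_in + 2 * padding - dilation * (K - 1) - 1) // stride + 1
    y = torch.empty((C_out, T_out), device=x.device, dtype=torch.float32)
    BLOCK_T = 128
    grid = (C_out, triton.cdiv(T_out, BLOCK_T))
    _conv1d_kernel[grid](
        x, weight, bias, y,
        C_in, T_in, T_out,
        stride, padding, dilation,
        K=K, BLOCK_T=BLOCK_T,
    )
    return y

This is a correctness-first sketch; a production kernel would also tile over input channels and vectorize loads for the 80-mel, kernel_size=3 case called out above.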

Related

Workaround

None - the feature works but is slow.

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request), v0.3 (Advanced: Triton backend, advanced ops)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
