danielbodart/capsper

CAPSPER!

Does CapsLock annoy you? Ever wished it actually did something useful instead of SHOUTING AT PEOPLE BY ACCIDENT?

Ever wished you could just whisper to a friendly ghost and have your words appear on screen? Well now you can. Capsper is your friendly neighbourhood ghost writer — hold CapsLock, speak, and he types it out for you. No cloud, no subscription, no latency worth complaining about. Just a local GPU (or CPU), a haunted key, and a little whisper nemo magic.

Push-to-talk voice dictation for Linux and macOS. Uses NVIDIA's Nemotron Speech 600M model (FastConformer RNNT) for streaming speech-to-text. Single self-contained binary per platform.

How it works

  1. A single binary intercepts CapsLock as push-to-talk
  2. Audio is captured directly from the system audio while the trigger key is held
  3. Incremental transcription runs locally via the Nemotron RNNT model
  4. Transcribed text is injected as keystrokes into the focused window
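The loop above can be sketched as follows (illustrative Python with hypothetical function names — Capsper itself is a single Zig binary, and `decode_chunk`/`inject` are stand-ins, not its actual API):

```python
# Sketch of the push-to-talk loop. decode_chunk and inject are
# hypothetical stand-ins for incremental RNNT decoding and keystroke
# injection; they are not Capsper's real interfaces.
def run_session(audio_chunks, decode_chunk, inject):
    """Feed audio captured while the trigger key is held through the
    incremental decoder, injecting each new piece of text as it appears."""
    emitted = ""
    for chunk in audio_chunks:       # audio arrives while the key is held
        delta = decode_chunk(chunk)  # RNNT emits tokens incrementally
        if delta:
            inject(delta)            # typed into the focused window
            emitted += delta
    return emitted
```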
                 Linux                                  macOS
Keyboard         evdev grab + uinput virtual keyboard   CGEventTap + CGEventPost
Audio            PipeWire capture                       CoreAudio (AUHAL)
Inference        ONNX Runtime (CUDA or CPU)             CoreML (93% Apple Neural Engine)
Display server   X11 and Wayland native

Requirements

Linux

  • Debian/Ubuntu (or similar)
  • PipeWire (default audio server on modern Ubuntu/Fedora)
  • GPU (recommended): NVIDIA GPU with ~4 GB VRAM (Turing or newer: GTX 16xx, RTX 20xx/30xx/40xx/50xx), NVIDIA drivers, and cuDNN — near-zero CPU impact during inference
  • CPU-only: works without a GPU at similar speed, but uses significant CPU while speaking

macOS

  • Apple Silicon Mac (M1 or later)
  • macOS 13 (Ventura) or later
  • Accessibility and Microphone permissions (the installer walks you through this)

Install

Linux

mkdir capsper && cd capsper
curl -fSL https://github.com/danielbodart/capsper/releases/latest/download/capsper-linux-x86_64.tar.gz | tar -xz
./install.sh

macOS

mkdir capsper && cd capsper
curl -fSL https://github.com/danielbodart/capsper/releases/latest/download/capsper-macos-arm64.tar.gz | tar -xz
./install.sh

The installer walks you through everything interactively — downloading models, setting up permissions, detecting your microphone, and installing a background service.

On Linux, a launcher script automatically detects whether you have an NVIDIA GPU and runs the appropriate binary (capsper-cuda or capsper-cpu).
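A minimal version of that launcher decision, sketched in Python with the GPU probe injectable so it can be tested (the real launcher is a shell script and its detection may differ — probing for nvidia-smi here is an assumption):

```python
import shutil

def pick_binary(has_nvidia_gpu=None):
    """Choose which binary the launcher should run. By default, probe for
    nvidia-smi on PATH (an assumed detection method for illustration)."""
    if has_nvidia_gpu is None:
        has_nvidia_gpu = shutil.which("nvidia-smi") is not None
    return "capsper-cuda" if has_nvidia_gpu else "capsper-cpu"
```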

Usage

Linux

systemctl --user start capsper.service

macOS

launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/io.github.danielbodart.capsper.plist

Hold CapsLock and speak. Release to stop. Text appears in the focused window. CapsLock is the default trigger — you can change it with --trigger (see Trigger keys).

Auto-updates

On Linux, capsper can optionally check for updates daily (the installer offers to set this up). When a new version is found, it's downloaded and staged in the background. The update is applied automatically on the next service restart — capsper is never interrupted mid-session.
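The stage-then-promote model can be sketched as a simple state swap (illustrative only — the actual update script manages versioned files on disk):

```python
def promote_staged(state):
    """On service restart, promote a staged version to current, keeping
    the old one around so a rollback remains possible."""
    if "staged" in state:
        state["previous"] = state["current"]
        state["current"] = state.pop("staged")
    return state
```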

On macOS, auto-updates are not yet available — updating the binary invalidates Accessibility permission because macOS identifies ad-hoc signed binaries by hash. To update manually, re-download and run install.sh, then re-approve capsper in System Settings > Accessibility.

Check for updates manually (Linux):

~/.local/share/capsper/capsper-update.sh

Apply a staged update (Linux):

systemctl --user restart capsper.service

If a new version crashes repeatedly (3 times within 60 seconds), capsper automatically rolls back to the previous version (Linux). You can also roll back manually:

~/.local/share/capsper/capsper-rollback.sh --force
systemctl --user reset-failed capsper.service
systemctl --user start capsper.service
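The rollback trigger ("3 crashes within 60 seconds") amounts to a check like this (a sketch, not the shipped logic):

```python
def should_roll_back(crash_times, window=60.0, limit=3):
    """crash_times: increasing timestamps (seconds) of service crashes.
    Roll back when `limit` crashes fall within `window` of the latest."""
    if not crash_times:
        return False
    latest = crash_times[-1]
    recent = [t for t in crash_times if latest - t < window]
    return len(recent) >= limit
```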

Disable auto-updates (Linux):

systemctl --user disable --now capsper-update.timer

Drop terms

Unlike Whisper, the Nemotron RNNT model doesn't automatically suppress filler words like "um" and "uh". You can suppress these (and other unwanted phrases) by providing a drop terms file:

capsper --trigger capslock --drop-terms ~/my-drop-terms.txt

The file is one phrase per line. If the entire output of a single decode cycle exactly matches a drop term, it's silently suppressed. Write terms in lowercase (the model always outputs lowercase text).

Example my-drop-terms.txt:

uh
um
you know
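The matching rule is exact, whole-output equality per decode cycle; in sketch form:

```python
def load_drop_terms(lines):
    """One lowercase phrase per line; blank lines are ignored."""
    return {line.strip() for line in lines if line.strip()}

def filter_cycle_output(text, drop_terms):
    # Suppress only when the ENTIRE decode-cycle output matches a term:
    # "um" is dropped, but "um right" passes through untouched.
    return "" if text in drop_terms else text
```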

Audio setup (Linux)

Run the interactive setup wizard to detect your microphone channel and calibrate gain:

capsper --audio-detect

This lists available audio sources, lets you pick a device, records silence and speech to detect the best channel, then calibrates software gain. At the end it prints the recommended flags:

  --audio-channel FL --audio-gain 3.2

If you already know your device, skip the selection step:

capsper --audio-detect --audio-target alsa_input.usb-Focusrite_Vocaster...
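The gain-calibration step boils down to scaling the measured speech level toward a target, capped at the documented maximum of 10.0 (the 0.1 target RMS below is an assumed value for illustration, not the wizard's actual target):

```python
def calibrate_gain(speech_rms, target_rms=0.1, max_gain=10.0):
    """Recommend a software gain factor so speech reaches the target level.
    target_rms is an illustrative assumption; max_gain matches --audio-gain's
    documented maximum."""
    if speech_rms <= 0:
        return max_gain
    return min(max_gain, round(target_rms / speech_rms, 1))
```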

Debug recording

To diagnose transcription issues (e.g. dropped words), enable per-utterance recording:

mkdir /tmp/capsper-debug
capsper --trigger capslock --record-dir /tmp/capsper-debug

Each utterance produces a pair of files (000.wav/000.log, 001.wav/001.log, etc.) in a ring buffer — old files are overwritten after --record-keep pairs (default 10). The WAV contains the full utterance audio and the log contains emitted text plus a per-cycle diagnostic trace.
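The ring-buffer naming means slot numbers wrap around after --record-keep pairs; sketched:

```python
def recording_pair(utterance_index, keep=10):
    """Map the Nth utterance to its WAV/log pair. The slot wraps, so old
    pairs are overwritten once `keep` recordings exist."""
    slot = utterance_index % keep
    return (f"{slot:03d}.wav", f"{slot:03d}.log")
```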

Batch-transcribe a captured WAV to compare streaming vs non-streaming results:

capsper --transcribe /tmp/capsper-debug/005.wav

This loads the model, transcribes the entire file in one shot, prints the result, and exits.

Architecture

Capsper uses NVIDIA's Nemotron Speech 600M model — a FastConformer-based RNNT (Recurrent Neural Network Transducer) that's inherently incremental. Unlike the previous whisper.cpp approach, which needed separate voice activity detection and cross-attention tricks for streaming, the RNNT model naturally processes audio as it arrives and emits tokens incrementally. Push-to-talk is the sole gate — no VAD needed.

The model runs through different backends depending on platform:

  • Linux (NVIDIA GPU): ONNX Runtime with CUDA execution provider — int8-static quantization
  • Linux (CPU): ONNX Runtime CPU — int8-dynamic quantization
  • macOS (Apple Silicon): CoreML — FP16, runs 93% on the Apple Neural Engine

A single Zig binary handles everything: keyboard interception, audio capture, mel spectrogram computation, model inference, SentencePiece detokenization, and text injection. No Python, no runtime dependencies beyond the platform's audio system and GPU drivers.

Server options

Running capsper with no arguments prints usage and exits.

capsper [OPTIONS]

  --model, -m PATH          Model directory path (default: ../models/nemotron relative to binary)
  --port, -p PORT           TCP port (default: 43007, use 0 for OS-assigned)
  --input tcp|local         Input mode: tcp (socket) or local (audio capture)
  --trigger KEY             Trigger key for push-to-talk (see Trigger keys below)
  --trigger-passthrough     Forward trigger key to OS after interception
  --type-delay US           Delay between injected keystrokes in microseconds (default: 12000)
  --low-latency             Keep audio stream open (mic indicator always visible, ~300ms faster)
  --audio-target NODE       Audio capture target device name
  --audio-channel CHANNEL   Audio channel: MONO, FL, FR, AUX0-AUX63 (default: FL)
  --audio-gain FACTOR       Software gain multiplier (default: 1.0, max: 10.0)
  --audio-detect            Interactive audio setup wizard (device selection, channel detection, gain calibration)
  --detect-duration SECS    Duration per detection phase (default: 5)
  --drop-terms FILE         Text file of phrases to suppress (one per line, exact match)
  --record-dir DIR          Record each utterance to DIR (WAV + diagnostic log)
  --record-keep N           Number of recording pairs to keep (default: 10, ring buffer)
  --transcribe FILE         Batch-transcribe a WAV file (non-streaming) and exit
  --no-auto-gain            Disable automatic gain adjustment
  --warmup-file FILE        WAV file for inference warmup at startup
  --no-warmup               Skip warmup inference
  --verbose, -v             Enable verbose logging
  --dry-run                 Load models, run warmup, then exit (validates setup)
  --version                 Print version and exit

Trigger keys

The --trigger flag selects which key activates push-to-talk. CapsLock is the default.

Key          Linux   macOS   Notes
capslock     yes     yes     Default. On macOS, remapped to F19 via hidutil to suppress LED/modifier
scrolllock   yes     yes     On macOS, shares keycode with F14
numlock      yes     yes     On macOS, maps to Clear (kVK_ANSI_KeypadClear)
pause        yes
f13-f20      yes     yes
f21-f24      yes

Function keys F13–F20 are the safest choice for a non-default trigger — they exist on both platforms and are rarely used by applications.
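Mirroring the table above, valid --trigger values per platform can be expressed as a simple check (sketch; Capsper's own validation lives in the binary):

```python
def supported_triggers(platform):
    """Trigger keys available per platform, per the table above."""
    common = {"capslock", "scrolllock", "numlock"}
    common |= {f"f{i}" for i in range(13, 21)}  # f13-f20: both platforms
    if platform == "linux":
        return common | {"pause"} | {f"f{i}" for i in range(21, 25)}
    return common  # macOS
```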

Development

Want to hack on Capsper? You'll need the requirements for your platform.

git clone https://github.com/danielbodart/capsper.git
cd capsper
./run.ts

This auto-detects your platform and handles everything:

  • Installs toolchain (mise, Zig 0.15, Bun) on first run via bootstrap.sh
  • Installs system packages (libpipewire-0.3-dev, pv, ncat on Linux; shellcheck on macOS)
  • Downloads models if missing (~250 MB for ONNX, ~150 MB for CoreML)
  • Compiles the Zig binary (pre-built ONNX Runtime shared libs committed via Git LFS on Linux)
  • Runs unit tests, property tests, and short integration smoke tests

On Linux, ./run.ts build produces two binaries (capsper-cuda + capsper-cpu) plus a launcher script. On macOS, it produces a single capsper binary using CoreML.

Every step is incremental — re-running ./run.ts is fast if everything is already set up.

Building & testing

All commands go through the Bun-based task runner (run.ts):

# Build (default command)
./run.ts build

# Unit + property tests (no GPU required)
./run.ts test

# Regression test groups (requires built binary + model)
./run.ts short-test               # Short files (<15s) via fast-forward TCP
./run.ts medium-test              # Medium files (15-40s) via fast-forward TCP
./run.ts long-test                # Long files (>60s) via fast-forward TCP

# All integration tests (all groups + platform plumbing)
./run.ts slow-test

Troubleshooting

Linux

"No CUDA GPU detected" — the CUDA binary requires an NVIDIA GPU with cuDNN. Ensure NVIDIA drivers are installed (sudo ubuntu-drivers autoinstall), that nvidia-smi shows your GPU, and that cuDNN is installed. Alternatively, the CPU binary works without a GPU (the launcher script auto-detects this).

Cannot open /dev/input — ensure your user is in the input group: run groups to check, add yourself with sudo usermod -aG input $USER, then log out and back in.

Text not being typed — ensure /dev/uinput is accessible. The udev rule should be set up by ./install.sh, or manually: echo 'KERNEL=="uinput", GROUP="input", MODE="0660"' | sudo tee /etc/udev/rules.d/99-uinput.rules && sudo udevadm control --reload-rules && sudo udevadm trigger /dev/uinput.

Audio capture fails — ensure PipeWire is running (pw-cli info). Run capsper --audio-detect to list available sources, select your device, and detect the correct channel.

Quiet or degraded transcription — if using a multi-channel audio interface, make sure you're capturing the correct channel. Run capsper --audio-detect to detect the best channel and calibrate gain.

macOS

"Failed to init input handler" — Accessibility permission not granted. The capsper binary itself must be in the Accessibility list (not just Terminal). Open System Settings > Privacy & Security > Accessibility, click '+', press Cmd+Shift+G, and paste:

~/.local/share/capsper/current/bin/capsper

If capsper is already in the list, remove it and re-add — macOS caches the permission against the binary hash, so it may need refreshing after an update.

No audio captured — Microphone permission not granted. Open System Settings > Privacy & Security > Microphone and add capsper (or Terminal).

Service not starting — check logs with tail -f ~/.local/share/capsper/capsper.log. Ensure both Accessibility and Microphone permissions are granted for the capsper binary.

Both platforms

Keyboard locked up — press Enter+Backspace+Escape simultaneously to trigger the panic sequence and ungrab all keyboards.
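The panic sequence is a simple chord check over the currently held keys (sketch; key names here are illustrative):

```python
PANIC_CHORD = {"enter", "backspace", "escape"}

def on_key_event(held, key, pressed):
    """Track held keys; return True once the full panic chord is down,
    at which point all keyboard grabs should be released."""
    if pressed:
        held.add(key)
    else:
        held.discard(key)
    return PANIC_CHORD <= held
```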

"Failed to load model" — model files not found. Re-run ./install.sh to download models, or download manually from HuggingFace.

Acknowledgements

Capsper uses NVIDIA's Nemotron Speech 600M (FastConformer RNNT) model for speech recognition. The model is converted to ONNX and CoreML formats for cross-platform deployment.
