Generate subtitle files from video using OpenAI Whisper — runs locally, no API key needed.

Outputs `.ass` by default (with position and outline styling pre-applied) or `.srt` via `--format srt`. Designed for use with Sony Vegas Pro on 1080×1920 portrait video.
## Prerequisites

- **Python**: download from python.org.
- **ffmpeg**: must be on your PATH.
  - Windows: `winget install ffmpeg`, or download from ffmpeg.org and add the `bin/` folder to your PATH
  - macOS: `brew install ffmpeg`
  - Linux: `sudo apt install ffmpeg`

  Verify with `ffmpeg -version`.
## Installation

```
# 1. Clone or copy this folder, then navigate to it
cd captions

# 2. Create a virtual environment (recommended)
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt
```

### GPU acceleration

The default `requirements.txt` installs CPU-only PyTorch. For CUDA GPU acceleration:
1. Find your CUDA version: `nvidia-smi`
2. Get the right install command from pytorch.org/get-started/locally
3. Run that command before `pip install -r requirements.txt`, e.g.:

   ```
   pip install torch --index-url https://download.pytorch.org/whl/cu121
   ```
The script auto-detects CUDA and prints which device it's using.
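That check is the standard PyTorch one; a minimal sketch of the equivalent logic (the variable names here are illustrative, not taken from captions.py):

```python
import torch

# Prefer the GPU when PyTorch reports CUDA support; otherwise use the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```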
## Usage

```
python captions.py <video_path> [--mode sentence|phrase|word]
                                [--model tiny|base|small|medium|large]
                                [--language CODE]
                                [--format srt|ass]
                                [--diarize --hf-token TOKEN [--speakers N]]
```
The output file is saved in the same directory as the input video, with the same base name.
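Concretely, only the extension changes; a sketch of the rule using pathlib (the helper name is hypothetical):

```python
from pathlib import Path

def output_path(video_path: str, fmt: str = "ass") -> Path:
    # clips/my_video.mp4 -> clips/my_video.ass (same folder, same base name)
    return Path(video_path).with_suffix(f".{fmt}")

print(output_path("clips/my_video.mp4"))         # clips/my_video.ass
print(output_path("clips/my_video.mp4", "srt"))  # clips/my_video.srt
```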
### Examples

```
# Basic — phrase mode, small model, .ass output
python captions.py my_video.mp4

# Output plain .srt instead
python captions.py my_video.mp4 --format srt

# Word-by-word (karaoke / TikTok style)
python captions.py my_video.mp4 --mode word

# Use a more accurate model
python captions.py my_video.mp4 --model medium

# Non-English video
python captions.py my_video.mp4 --language ja

# Speaker diarization (labels each subtitle A:, B:, etc.)
python captions.py my_video.mp4 --diarize --hf-token hf_xxx

# Diarization with a known number of speakers
python captions.py my_video.mp4 --diarize --hf-token hf_xxx --speakers 2

# Combine options
python captions.py my_video.mp4 --mode phrase --model large --language fr
```

## Modes

| Mode | Description | Best for |
|---|---|---|
| `phrase` | Up to 3 words per line, breaks at natural pauses (default) | Fast speech, TikTok/Shorts style |
| `sentence` | One subtitle per Whisper segment | Dialogue, narration |
| `word` | Each word is its own subtitle event (≥100 ms) | Karaoke, word-by-word captions |
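To make the `phrase` rule concrete, here is a simplified sketch of grouping words into chunks of at most three that break at pauses; the 0.35 s gap threshold and the function itself are illustrative assumptions, not captions.py's exact logic:

```python
MAX_WORDS = 3      # cap per subtitle line (per the table above)
PAUSE_GAP = 0.35   # assumed pause threshold in seconds, purely illustrative

def group_phrases(words):
    """Group Whisper word timestamps [(text, start, end), ...] into phrases."""
    phrases, current = [], []
    for text, start, end in words:
        # Break at a natural pause, or once the word cap is reached.
        if current and (start - current[-1][2] > PAUSE_GAP or len(current) == MAX_WORDS):
            phrases.append(current)
            current = []
        current.append((text, start, end))
    if current:
        phrases.append(current)
    return phrases
```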
## Models

| Model | Speed | Accuracy | VRAM |
|---|---|---|---|
| `tiny` | Fastest | Lowest | ~1 GB |
| `base` | Fast | Low | ~1 GB |
| `small` | Balanced (default) | Good | ~2 GB |
| `medium` | Slow | Better | ~5 GB |
| `large` | Slowest | Best | ~10 GB |
Models are downloaded automatically on first use and cached in `~/.cache/whisper/`.
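The download-and-cache behaviour comes from openai-whisper itself; loading a model from Python shows the same step the script performs:

```python
import whisper

# The first call downloads the weights to ~/.cache/whisper/;
# subsequent calls load them straight from the cache.
model = whisper.load_model("small")
result = model.transcribe("my_video.mp4")
print(result["text"])
```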
## Supported input formats

`.mp4`, `.mov`, `.mkv`, `.avi`, `.webm`
## Output formats

### .ass (default)

Advanced SubStation Alpha format with styling pre-applied — Arial font, white text, black outline, centered near the bottom. Matches the Vegas Pro Subtitles preset for 1080×1920.
To adjust the style, edit the constants near the top of `captions.py`:

```python
ASS_FONT_SIZE = 72      # increase/decrease text size
ASS_OUTLINE_SIZE = 4    # outline thickness in pixels
ASS_MARGIN_V = 384      # pixels from the bottom of the frame
```
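For orientation, these constants feed the `[V4+ Styles]` section of the generated file. A rough sketch of the resulting style line, assuming standard ASS V4+ fields (the colour and margin values other than font size, outline, and MarginV are illustrative; the actual header written by captions.py may differ):

```
[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,72,&H00FFFFFF,&H00FFFFFF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,4,0,2,10,10,384,1
```

Alignment 2 is bottom-centre in the ASS spec, which together with MarginV = 384 places the text centred near the bottom of a 1080×1920 frame.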
### .srt

Plain subtitle format with no styling. Use `--format srt` to get this instead.

```
1
00:00:01,000 --> 00:00:03,500
First subtitle line

2
00:00:03,500 --> 00:00:06,000
Second subtitle line
```
## Importing into Vegas Pro

### .ass

- Open your project in Vegas Pro
- Drag the `.ass` file onto the timeline
- Vegas auto-generates the subtitle track with position and outline already applied — no manual styling needed per event

### .srt

- Open your project in Vegas Pro
- Drag the `.srt` file onto the timeline
- Vegas auto-generates the subtitle track — style each event manually as needed
## Speaker diarization

Diarization identifies who is speaking and prefixes each subtitle with a letter label (A:, B:, etc.).

Requirements:

- `pip install pyannote.audio`
- A free HuggingFace token
- Accept the model terms at pyannote/speaker-diarization-3.1

```
python captions.py my_video.mp4 --diarize --hf-token hf_xxx

# or set the token as an env var to avoid typing it each time:
# HF_TOKEN=hf_xxx python captions.py my_video.mp4 --diarize
```
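For context, this kind of diarization is typically a pyannote pipeline run over the audio. A minimal sketch assuming the pyannote.audio 3.x API (the letter-label mapping is illustrative, not captions.py's exact code):

```python
import os
from pyannote.audio import Pipeline

# Requires a HuggingFace token with the model terms accepted.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)
diarization = pipeline("my_video.wav")  # may need audio extracted from the video first

# Map pyannote's speaker IDs to letter labels A, B, C, ...
labels = {}
for turn, _, speaker in diarization.itertracks(yield_label=True):
    label = labels.setdefault(speaker, chr(ord("A") + len(labels)))
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {label}")
```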
## Troubleshooting

- **ffmpeg not found** — The ffmpeg binary is not on your PATH. See Prerequisites above.
- **openai-whisper is not installed** — Run `pip install openai-whisper` inside your venv.
- **Poor transcription accuracy** — Try a larger model (`--model medium` or `--model large`), or specify the language explicitly (`--language en`).
- **Word timestamps unavailable in word mode** — The script falls back to sentence mode automatically with a warning. This is rare but can occur with very short or silent audio.
- **pyannote.audio is not installed** — Run `pip install pyannote.audio` to enable diarization.