FFmpeg Whisper Filter Guide: Generate Video Subtitles Automatically with One Command

Want to add subtitles to your videos? In the past, you’d either pay someone to transcribe or struggle with various online tools. Now, FFmpeg 8.x comes with built-in Whisper speech recognition - one command converts speech to SRT subtitles, supporting both Chinese and English with impressive accuracy.

The whisper filter runs OpenAI’s Whisper model (via whisper.cpp) and delivers high recognition accuracy. The best part? No messy dependencies - just download FFmpeg and a model file, and you’re good to go.

What Can It Do?

Simply put: video/audio in, text out. Here’s what you get:

  • Speech to Text: Convert spoken words in videos to text with high accuracy
  • Multi-language Support: Chinese, English, Japanese, Korean… 99 languages to choose from
  • Flexible Output: SRT subtitles, JSON, plain text - whatever format you need
  • GPU Acceleration: Lightning fast with a graphics card, still works on CPU
  • Smart Segmentation: VAD voice detection automatically identifies pauses for natural subtitle breaks
  • Real-time Transcription: Can even transcribe live microphone input

Before You Start

Step 1: Download FFmpeg

Don’t worry, no compilation needed - just download and use:

  • Version Required: FFmpeg 8.x (this version has the whisper filter built-in)
  • Download: FFmpeg Official Download Page
  • That’s It: Download, extract, done. No Python, no compiling whisper.cpp
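Once it’s extracted, a quick check from a terminal confirms you actually have an 8.x build:

# Should report version 8.x
ffmpeg -version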

Step 2: Download the Speech Recognition Model

The model file is essential. Use the official script for easy downloading:

Windows Users:

# Download the model download script
curl -O https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/download-ggml-model.cmd

# Download a model (e.g., base model)
download-ggml-model.cmd base

Linux/macOS Users:

# Download the model download script
wget https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/download-ggml-model.sh
chmod +x download-ggml-model.sh

# Download a model (e.g., base model)
./download-ggml-model.sh base

Tip: Model files download to the current directory. Create a models folder first to keep things organized.
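For example, on Linux/macOS (the download script should accept a target directory as its second argument; if your copy doesn’t, just move the .bin file afterwards):

# Keep all models together in one folder
mkdir -p models

# second argument = destination directory (assumed to be supported by the script)
./download-ggml-model.sh base ./models

# Later, reference the file as model=./models/ggml-base.bin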

Parameter Reference

Lots of parameters here, but you’ll only use a few regularly - don’t be intimidated:

| Parameter | Type | Description | Default |
|---|---|---|---|
| model | string | (Required) Full path to the downloaded whisper.cpp model file (e.g., ./models/ggml-base.bin). Must be a .bin file in whisper.cpp-compatible format | None |
| language | string | Language code for transcription. Set to auto for automatic detection. Supported: en (English), zh (Chinese), ja (Japanese), es (Spanish), etc. Specifying the correct language improves accuracy | auto |
| queue | integer | Audio chunk queue size in seconds before processing. Small values (1-3): more frequent processing and lower latency, but may reduce quality and increase CPU usage. Large values (10-20): better quality but higher latency (not suitable for real-time). Larger values are recommended together with vad_model | 3 |
| use_gpu | boolean | Enable GPU acceleration. true: use GPU (requires CUDA); false: CPU only | true |
| gpu_device | integer | GPU device index (0: first GPU, 1: second GPU, etc.). Only effective when use_gpu=true | 0 |
| destination | string | Output destination for the transcription. File path: save to a local file (e.g., output.srt); URL: send to an HTTP service (e.g., http://localhost:3000); empty: output as an info message to the log. Results are also written to the frame metadata key lavfi.whisper.text. Existing files are overwritten | None |
| format | string | Output format. text: plain text only; srt: standard subtitle format with timestamps; json: JSON with detailed transcription info and timestamps | text |
| vad_model | string | Path to a Silero VAD (Voice Activity Detection) model. Recommended: ./models/ggml-silero-v5.1.2.bin. Enables intelligent audio queue splitting for better quality. Use with larger queue values (e.g., 20) | None |
| vad_threshold | float | VAD sensitivity threshold (0.0 - 1.0). Lower: more sensitive, may detect more speech; higher: stricter, only clear speech | 0.5 |
| vad_min_speech_duration | float | Minimum speech duration in seconds for VAD. Shorter segments are ignored, which helps filter brief noise or stutters | 0.1 |
| vad_min_silence_duration | float | Minimum silence duration in seconds for VAD, used to determine speech segment boundaries | 0.5 |

How to Choose a Model?

Bigger models are more accurate but slower and use more memory. Choose based on your needs:

| Model | Size | Characteristics | Recommended Use |
|---|---|---|---|
| tiny | ~75MB | Smallest and fastest | Real-time transcription, resource-limited environments |
| tiny.en | ~75MB | English-only version | English real-time transcription |
| base | ~142MB | Balanced performance | General purpose - recommended |
| base.en | ~142MB | English-only version | English transcription - recommended |
| small | ~466MB | Higher accuracy | High-quality transcription |
| small.en | ~466MB | English-only version | High-quality English transcription |
| medium | ~1.5GB | High accuracy | Professional transcription |
| medium.en | ~1.5GB | English-only version | Professional English transcription |
| large-v3 | ~3.1GB | Highest accuracy | Best quality transcription |
| large-v3-turbo | ~1.6GB | Optimized large model | Balance of quality and speed |

What are quantized versions? Compressed models that are smaller and faster, with slightly reduced accuracy:

  • q5_0, q5_1: 5-bit quantization, roughly a third of the original file size, noticeably faster
  • q8_0: 8-bit quantization, balanced size and quality

My recommendation: Start with the base model - it’s good enough for most cases. Use medium for quality, tiny for speed.

Download Models

# Download recommended base model
download-ggml-model.cmd base

# Download English-only model
download-ggml-model.cmd base.en

# Download to specific directory
download-ggml-model.cmd base ./models

# Download high-quality model
download-ggml-model.cmd large-v3
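Several of the examples below use a quantized medium model (ggml-medium-q5_0.bin). The same script should be able to fetch quantized variants directly; the model name below (medium-q5_0) is taken from whisper.cpp’s model list, so check the script if it gets rejected:

# Download a quantized medium model (used in several examples below)
# name "medium-q5_0" assumed from the script's built-in model list
download-ggml-model.cmd medium-q5_0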

Hands-On Examples

Enough theory - let’s try it out!

The Simplest Use: Generate SRT Subtitles

Convert speech in input.mp4 to a subtitle file with just one line:

ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto:queue=3:destination=output.srt:format=srt" -f null -

Confused? Here’s what each part does:

  • -vn: Tell FFmpeg “I don’t want video, just process audio”
  • -f null -: Don’t write an output file; the subtitles go straight to the destination path
  • model=ggml-base.bin: Which model to use - enter your downloaded model filename
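For reference, the resulting output.srt is a plain SubRip file: numbered cues, timestamps, text. The cues below are made up purely to show the shape of the output, not real transcription results:

1
00:00:00,000 --> 00:00:03,500
Welcome to today's video.

2
00:00:03,500 --> 00:00:07,200
Let's see how the whisper filter works.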

Advanced: Send to HTTP Service

Building a real-time subtitle system? Push transcription results directly to your server:

ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto:queue=3:destination=http\://localhost\:3000:format=json" -f null -
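If you just want to see what the filter sends before wiring up a real server, a throwaway listener is enough. A minimal sketch (netcat flags differ between BSD and GNU variants):

# Dump incoming HTTP requests from the filter to the terminal (one connection at a time)
nc -l 3000          # BSD netcat (macOS); use "nc -l -p 3000" with traditional/GNU netcat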

Want Better Accuracy? Add VAD

VAD (Voice Activity Detection) automatically identifies speech and silence for more natural subtitle segmentation:

# First download the VAD model
wget https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/for-tests-silero-v5.1.2-ggml.bin

# Use VAD for high-quality transcription
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=auto:queue=20:destination=output.srt:format=srt:vad_model=for-tests-silero-v5.1.2-ggml.bin:vad_threshold=0.6" -f null -

Cool Trick: Real-time Microphone Transcription (Linux)

Transcribe as you speak - perfect for meeting notes:

ffmpeg -loglevel warning -f pulse -i default -af 'highpass=f=200,lowpass=f=3000,whisper=model=ggml-medium-q5_0.bin:language=en:queue=10:destination=-:format=json:vad_model=for-tests-silero-v5.1.2-ggml.bin' -f null -

Multi-language Video? Auto-detect!

Don’t know what language the video is in? Set language=auto and let it figure it out:

ffmpeg -i multilang_video.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=auto:queue=15:destination=transcript.json:format=json:use_gpu=true" -f null -

No GPU? CPU Works Too

No NVIDIA graphics card? No problem - add use_gpu=false; it’s slower but works fine:

ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=zh:queue=5:destination=output.srt:format=srt:use_gpu=false" -f null -

Advanced: Burn-in Real-time Subtitles

Beyond generating subtitle files, you can “burn” recognized text directly onto the video. The whisper filter writes transcription text to frame metadata (lavfi.whisper.text), which the drawtext filter can read and render.

Recognize and Display Subtitles Simultaneously

ffmpeg -i input.mp4 -af "whisper=model=ggml-base.en.bin:language=en" -vf "drawtext=text='%{metadata\:lavfi.whisper.text}':fontsize=24:fontcolor=white:x=10:y=h-th-10" output_with_subtitles.mp4
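If white text is hard to read against the footage, drawtext can also draw a semi-transparent box behind it. A variation on the command above (the box styling values are just a starting point):

ffmpeg -i input.mp4 -af "whisper=model=ggml-base.en.bin:language=en" -vf "drawtext=text='%{metadata\:lavfi.whisper.text}':fontsize=24:fontcolor=white:box=1:boxcolor=black@0.5:boxborderw=8:x=10:y=h-th-10" output_with_subtitles.mp4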

Running Slow? Optimize Like This

Choosing the Right Model Matters

  • Need Speed: Use tiny or base, results in seconds
  • Need Quality: Use medium or large-v3, slower but accurate
  • Compromise: Use quantized versions like medium-q5_0, fast and accurate

Use Your GPU If You Have One

GPU acceleration makes a huge difference, especially with large models:

  • Make sure CUDA drivers are installed
  • Add use_gpu=true to your command (it’s actually on by default)
  • Multiple GPUs? Use gpu_device=1 to specify which one
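For example, to pin the transcription to a second GPU (everything else mirrors the earlier examples):

# Run on GPU index 1 instead of the default GPU 0
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=auto:queue=15:destination=output.srt:format=srt:use_gpu=true:gpu_device=1" -f null -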

How to Tune the queue Parameter?

  • Live streaming/real-time: Keep it small, queue=1 to queue=3, low latency
  • Processing local videos: Go bigger, queue=15 to queue=20, works better with VAD

Having Problems?

Common issues and solutions:

  • Model not found error: Check if the path after model= is correct, try using absolute paths
  • GPU error: Add use_gpu=false to run on CPU first, confirm if it’s a driver issue
  • Out of memory: Use a smaller model, or reduce queue value
  • Poor accuracy: Try a larger model, or specify the correct language parameter
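If none of these help, first confirm that your build actually includes the filter at all - if the command below complains about an unknown filter, your FFmpeg was not built with whisper support:

# Prints the whisper filter's options if the build supports it
ffmpeg -hide_banner -h filter=whisper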

Supported Languages

Whisper claims to support 99 languages. Common ones work great:

| Code | Language | Code | Language |
|---|---|---|---|
| en | English | zh | Chinese |
| ja | Japanese | ko | Korean |
| es | Spanish | fr | French |
| de | German | ru | Russian |
| it | Italian | ar | Arabic |
| pt | Portuguese | hi | Hindi |

Pro tip: While language=auto can detect the language automatically, specifying it directly (e.g., language=en) usually gives better results.

Wrapping Up

The FFmpeg + Whisper combo is seriously powerful. What used to cost money or take hours of work now takes just one command. Give it a try!

Tags

#FFmpeg #Whisper #Speech Recognition #Subtitle Generation #Video to Text #Audio Processing

Copyright Notice

This article was created by WebRTC.link and is licensed under CC BY-NC-SA 4.0. When this site reposts articles, it cites the source and author. If you need to repost this article, please cite the source and author as well.
