Want to add subtitles to your videos? In the past, you’d either pay someone to transcribe or struggle with various online tools. Now, FFmpeg 8.x comes with built-in Whisper speech recognition - one command converts speech to SRT subtitles, supporting both Chinese and English with impressive accuracy.
The whisper filter uses OpenAI’s Whisper model with high recognition accuracy. The best part? No messy dependencies - just download FFmpeg and the model file, and you’re good to go.
What Can It Do?
Simply put: video/audio in, text out. Here’s what you get:
- Speech to Text: Convert spoken words in videos to text with high accuracy
- Multi-language Support: Chinese, English, Japanese, Korean… 99 languages to choose from
- Flexible Output: SRT subtitles, JSON, plain text - whatever format you need
- GPU Acceleration: Lightning fast with a graphics card, still works on CPU
- Smart Segmentation: VAD voice detection automatically identifies pauses for natural subtitle breaks
- Real-time Transcription: Can even transcribe live microphone input
Before You Start
Step 1: Download FFmpeg
Don’t worry, no compilation needed - just download and use:
- Version Required: FFmpeg 8.x (this version has the whisper filter built-in)
- Download: FFmpeg Official Download Page
- That’s It: Download, extract, done. No Python, no compiling whisper.cpp
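One thing worth checking first: the whisper filter only exists in builds configured with --enable-whisper, so not every FFmpeg 8.x binary ships it. Two quick commands confirm your build has it and show its options:
# List filters and look for "whisper" (on Windows, pipe to findstr instead of grep)
ffmpeg -hide_banner -filters | grep whisper
# Show the whisper filter's options (model, language, queue, ...)
ffmpeg -hide_banner -h filter=whisper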
Step 2: Download the Speech Recognition Model
The model file is essential. Use the official script for easy downloading:
Windows Users:
# Download the model download script
curl -O https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/download-ggml-model.cmd
# Download a model (e.g., base model)
download-ggml-model.cmd base
Linux/macOS Users:
# Download the model download script
wget https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/download-ggml-model.sh
chmod +x download-ggml-model.sh
# Download a model (e.g., base model)
./download-ggml-model.sh base
Tip: Model files download to the current directory. Create a models folder first to keep things organized.
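For example, assuming the download script from Step 2 sits in your working directory (it accepts an optional target directory as a second argument):
# Create a models folder and download the base model into it
mkdir models
./download-ggml-model.sh base ./models
# Windows equivalent: download-ggml-model.cmd base models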
Parameter Reference
Lots of parameters here, but you’ll only use a few regularly - don’t be intimidated:
| Parameter | Type | Description | Default |
|---|---|---|---|
model | string | (Required) Full path to the downloaded whisper.cpp model file (e.g., ./models/ggml-base.bin). Must be a .bin file in whisper.cpp compatible format | None |
language | string | Language code for transcription. Set to auto for automatic detection. Supported: en (English), zh (Chinese), ja (Japanese), es (Spanish), etc. Specifying the correct language improves accuracy | auto |
queue | integer | Audio chunk queue size in seconds before processing. Small values (1-3): More frequent processing, lower latency, but may reduce quality and increase CPU usage; Large values (10-20): Better quality but higher latency (not suitable for real-time). Recommended to use larger values with vad_model | 3 |
use_gpu | boolean | Enable GPU acceleration. true: Use GPU (requires CUDA); false: CPU only | true |
gpu_device | integer | GPU device index. 0: First GPU; 1: Second GPU, etc. Only effective when use_gpu=true | 0 |
destination | string | Output destination for transcription. File path: Save to local file (e.g., output.srt); URL: Send to HTTP service (e.g., http://localhost:3000); Empty: Output as info message to log. Results also written to frame metadata lavfi.whisper.text. Existing files will be overwritten | None |
format | string | Output format. text: Plain text only; srt: Standard subtitle format with timestamps; json: JSON format with detailed transcription info and timestamps | text |
vad_model | string | Path to Silero VAD (Voice Activity Detection) model. Recommended: ./models/ggml-silero-v5.1.2.bin. Enables intelligent audio queue splitting for better quality. Use with larger queue values (e.g., 20) | None |
vad_threshold | float | VAD sensitivity threshold. Range: 0.0 - 1.0. Lower: More sensitive, may detect more speech; Higher: Stricter, only clear speech | 0.5 |
vad_min_speech_duration | float | Minimum speech duration in seconds for VAD. Shorter segments are ignored, helps filter brief noise or stutters | 0.1 |
vad_min_silence_duration | float | Minimum silence duration in seconds for VAD. Used to determine speech segment boundaries | 0.5 |
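As the destination row mentions, results are also attached to each audio frame as the metadata key lavfi.whisper.text. If you just want to eyeball the recognized text in the log without writing any file, one way (a quick sanity-check sketch, with input.mp4 as a placeholder) is to chain FFmpeg's ametadata filter after whisper to print that key:
# Print the recognized text from frame metadata to the console
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto,ametadata=mode=print:key=lavfi.whisper.text" -f null -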
How to Choose a Model?
Bigger models are more accurate but slower and use more memory. Choose based on your needs:
| Model | Size | Characteristics | Recommended Use |
|---|---|---|---|
tiny | ~75MB | Smallest and fastest | Real-time transcription, resource-limited environments |
tiny.en | ~75MB | English-only version | English real-time transcription |
base | ~142MB | Balanced performance | General purpose - recommended |
base.en | ~142MB | English-only version | English transcription - recommended |
small | ~466MB | Higher accuracy | High-quality transcription |
small.en | ~466MB | English-only version | High-quality English transcription |
medium | ~1.5GB | High accuracy | Professional transcription |
medium.en | ~1.5GB | English-only version | Professional English transcription |
large-v3 | ~3.1GB | Highest accuracy | Best quality transcription |
large-v3-turbo | ~1.6GB | Optimized large model | Balance of quality and speed |
What are quantized versions? Compressed models that are smaller and faster, with slightly reduced accuracy:
- q5_0, q5_1: 5-bit quantization, half the size, noticeably faster
- q8_0: 8-bit quantization, balanced size and quality
My recommendation: Start with the base model - it’s good enough for most cases. Use medium for quality, tiny for speed.
Download Models
# Download recommended base model
download-ggml-model.cmd base
# Download English-only model
download-ggml-model.cmd base.en
# Download to specific directory
download-ggml-model.cmd base ./models
# Download high-quality model
download-ggml-model.cmd large-v3
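The same script fetches the quantized variants mentioned above (the exact names available depend on the script version; run it with no arguments to print the full list). The examples later in this post use a quantized medium model:
# Download the quantized medium model used in later examples
download-ggml-model.cmd medium-q5_0
# This produces ggml-medium-q5_0.bin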
Hands-On Examples
Enough theory - let’s try it out!
The Simplest Use: Generate SRT Subtitles
Convert speech in input.mp4 to a subtitle file with just one line:
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto:queue=3:destination=output.srt:format=srt" -f null -
Confused? Here’s what each part does:
- -vn: Tell FFmpeg “I don’t want video, just process the audio”
- -f null -: Don’t output a file; subtitles go directly to the destination path
- model=ggml-base.bin: Which model to use - enter your downloaded model filename
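The filter isn’t limited to video, by the way - plain audio files work exactly the same way. A minimal variant (podcast.mp3 is just a placeholder name):
# Transcribe an audio file to SRT
ffmpeg -i podcast.mp3 -af "whisper=model=ggml-base.bin:language=auto:queue=3:destination=podcast.srt:format=srt" -f null -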
Advanced: Send to HTTP Service
Building a real-time subtitle system? Push transcription results directly to your server:
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto:queue=3:destination=http\://localhost\:3000:format=json" -f null -
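Before wiring up a real server, you can peek at what the filter actually sends with a throwaway listener (netcat syntax varies between variants, and it won’t send a proper HTTP response back, so treat this purely as a debugging peek):
# Listen on port 3000 and print incoming requests (-k keeps listening)
nc -lk 3000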
Want Better Accuracy? Add VAD
VAD (Voice Activity Detection) automatically identifies speech and silence for more natural subtitle segmentation:
# First download the VAD model
wget https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/for-tests-silero-v5.1.2-ggml.bin
# Use VAD for high-quality transcription
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=auto:queue=20:destination=output.srt:format=srt:vad_model=for-tests-silero-v5.1.2-ggml.bin:vad_threshold=0.6" -f null -
Cool Trick: Real-time Microphone Transcription (Linux)
Transcribe as you speak - perfect for meeting notes:
ffmpeg -loglevel warning -f pulse -i default -af 'highpass=f=200,lowpass=f=3000,whisper=model=ggml-medium-q5_0.bin:language=en:queue=10:destination=-:format=json:vad_model=for-tests-silero-v5.1.2-ggml.bin' -f null -
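The command above uses PulseAudio, so it’s Linux-specific; on other systems only the input device changes while the filter chain stays the same (device names below are placeholders - list yours first):
# macOS: list capture devices, then read from audio device 0 via avfoundation
ffmpeg -f avfoundation -list_devices true -i ""
ffmpeg -f avfoundation -i ":0" -af "whisper=model=ggml-medium-q5_0.bin:language=en:queue=10:destination=-:format=json" -f null -
# Windows: list DirectShow devices, then read from a named microphone
ffmpeg -list_devices true -f dshow -i dummy
ffmpeg -f dshow -i audio="Your Microphone Name" -af "whisper=model=ggml-medium-q5_0.bin:language=en:queue=10:destination=-:format=json" -f null -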
Multi-language Video? Auto-detect!
Don’t know what language the video is in? Set language=auto and let it figure it out:
ffmpeg -i multilang_video.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=auto:queue=15:destination=transcript.json:format=json:use_gpu=true" -f null -
No GPU? CPU Works Too
No NVIDIA graphics card? No problem - add use_gpu=false and it runs on the CPU, just more slowly:
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=zh:queue=5:destination=output.srt:format=srt:use_gpu=false" -f null -
Advanced: Burn-in Real-time Subtitles
Beyond generating subtitle files, you can “burn” recognized text directly onto the video. The whisper filter writes transcription text to frame metadata (lavfi.whisper.text), which the drawtext filter can read and render.
Recognize and Display Subtitles Simultaneously
ffmpeg -i input.mp4 -af "whisper=model=ggml-base.en.bin:language=en" -vf "drawtext=text='%{metadata\:lavfi.whisper.text}':fontsize=24:fontcolor=white:x=10:y=h-th-10" output_with_subtitles.mp4
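If the metadata route misbehaves on your build, a more conventional two-pass approach also works: generate the SRT file first (as in the earlier examples), then burn it into the video with the subtitles filter (requires an FFmpeg build with libass):
# Burn a previously generated SRT file into the video
ffmpeg -i input.mp4 -vf "subtitles=output.srt" -c:a copy output_with_subtitles.mp4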
Running Slow? Optimize Like This
Choosing the Right Model Matters
- Need Speed: Use tiny or base, results in seconds
- Need Quality: Use medium or large-v3, slower but accurate
- Compromise: Use quantized versions like medium-q5_0, fast and accurate
Use Your GPU If You Have One
GPU acceleration makes a huge difference, especially with large models:
- Make sure CUDA drivers are installed
- Add use_gpu=true to your command (it’s actually on by default)
- Multiple GPUs? Use gpu_device=1 to specify which one
How to Tune the queue Parameter?
- Live streaming/real-time: Keep it small, queue=1 to queue=3, for low latency
- Processing local videos: Go bigger, queue=15 to queue=20; works better with VAD
Having Problems?
Common issues and solutions:
- Model not found error: Check that the path after model= is correct; try an absolute path
- GPU error: Add use_gpu=false to run on CPU first and confirm whether it’s a driver issue
- Out of memory: Use a smaller model, or reduce the queue value
- Poor accuracy: Try a larger model, or specify the correct language parameter
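It also helps to iterate on a short clip rather than the whole file while you tune parameters (the -t input option limits how much is read; 30 seconds here is an arbitrary choice):
# Transcribe only the first 30 seconds while testing settings
ffmpeg -t 30 -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto:queue=3:destination=test.srt:format=srt" -f null -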
Supported Languages
Whisper claims to support 99 languages. Common ones work great:
| Code | Language | Code | Language |
|---|---|---|---|
en | English | zh | Chinese |
ja | Japanese | ko | Korean |
es | Spanish | fr | French |
de | German | ru | Russian |
it | Italian | ar | Arabic |
pt | Portuguese | hi | Hindi |
Pro tip: While language=auto can detect the language automatically, specifying it directly (e.g., language=en) usually gives better results.
Wrapping Up
The FFmpeg + Whisper combo is seriously powerful. What used to cost money or take hours of work now takes just one command. Give it a try!
Copyright Notice
This article was created by WebRTC.link and is licensed under CC BY-NC-SA 4.0. Articles reposted on this site cite their source and author. If you repost this article, please cite the source and author as well.