FFmpeg Whisper Filter Guide: Generate Video Subtitles Automatically with One Command

Want to add subtitles to your videos? In the past, you’d either pay someone to transcribe or struggle with various online tools. Now, FFmpeg 8.x comes with built-in Whisper speech recognition - one command converts speech to SRT subtitles, supporting both Chinese and English with impressive accuracy.

The whisper filter runs OpenAI’s Whisper model (via whisper.cpp) and delivers high recognition accuracy. The best part? No messy dependencies - just download FFmpeg and a model file, and you’re good to go.

What Can It Do?

Simply put: video/audio in, text out. Here’s what you get:

  • Speech to Text: Convert spoken words in videos to text with high accuracy
  • Multi-language Support: Chinese, English, Japanese, Korean… 99 languages to choose from
  • Flexible Output: SRT subtitles, JSON, plain text - whatever format you need
  • GPU Acceleration: Lightning fast with a graphics card, still works on CPU
  • Smart Segmentation: VAD voice detection automatically identifies pauses for natural subtitle breaks
  • Real-time Transcription: Can even transcribe live microphone input

Before You Start

Step 1: Download FFmpeg

Don’t worry, no compilation needed - just download and use:

  • Version Required: FFmpeg 8.x (this version has the whisper filter built-in)
  • Download: FFmpeg Official Download Page
  • That’s It: Download, extract, done. No Python, no compiling whisper.cpp
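Once it’s extracted, a quick check from a terminal confirms you actually have an 8.x build:

# Should report version 8.x
ffmpeg -version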

Step 2: Download the Speech Recognition Model

The model file is essential. Use the official script for easy downloading:

Windows Users:

# Download the model download script
curl -O https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/download-ggml-model.cmd

# Download a model (e.g., base model)
download-ggml-model.cmd base

Linux/macOS Users:

# Download the model download script
wget https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/download-ggml-model.sh
chmod +x download-ggml-model.sh

# Download a model (e.g., base model)
./download-ggml-model.sh base

Tip: Model files download to the current directory. Create a models folder first to keep things organized.
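For example, on Linux/macOS (the download script should accept a target directory as its second argument; if your copy doesn’t, just move the .bin file afterwards):

# Keep all models together in one folder
mkdir -p models

# second argument = destination directory (assumed to be supported by the script)
./download-ggml-model.sh base ./models

# Later, reference the file as model=./models/ggml-base.bin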

Parameter Reference

Lots of parameters here, but you’ll only use a few regularly - don’t be intimidated:

| Parameter | Type | Description | Default |
|---|---|---|---|
| model | string | (Required) Full path to the downloaded whisper.cpp model file (e.g., ./models/ggml-base.bin). Must be a .bin file in whisper.cpp-compatible format | None |
| language | string | Language code for transcription. Set to auto for automatic detection. Supported: en (English), zh (Chinese), ja (Japanese), es (Spanish), etc. Specifying the correct language improves accuracy | auto |
| queue | integer | Audio chunk queue size in seconds before processing. Small values (1-3): more frequent processing and lower latency, but may reduce quality and increase CPU usage. Large values (10-20): better quality but higher latency (not suitable for real-time). Larger values are recommended together with vad_model | 3 |
| use_gpu | boolean | Enable GPU acceleration. true: use GPU (requires CUDA); false: CPU only | true |
| gpu_device | integer | GPU device index (0: first GPU, 1: second GPU, etc.). Only effective when use_gpu=true | 0 |
| destination | string | Output destination for the transcription. File path: save to a local file (e.g., output.srt); URL: send to an HTTP service (e.g., http://localhost:3000); empty: output as an info message to the log. Results are also written to the frame metadata key lavfi.whisper.text. Existing files are overwritten | None |
| format | string | Output format. text: plain text only; srt: standard subtitle format with timestamps; json: JSON with detailed transcription info and timestamps | text |
| vad_model | string | Path to a Silero VAD (Voice Activity Detection) model. Recommended: ./models/ggml-silero-v5.1.2.bin. Enables intelligent audio queue splitting for better quality. Use with larger queue values (e.g., 20) | None |
| vad_threshold | float | VAD sensitivity threshold (0.0 - 1.0). Lower: more sensitive, may detect more speech; higher: stricter, only clear speech | 0.5 |
| vad_min_speech_duration | float | Minimum speech duration in seconds for VAD. Shorter segments are ignored, which helps filter brief noise or stutters | 0.1 |
| vad_min_silence_duration | float | Minimum silence duration in seconds for VAD, used to determine speech segment boundaries | 0.5 |

How to Choose a Model?

Bigger models are more accurate but slower and use more memory. Choose based on your needs:

| Model | Size | Characteristics | Recommended Use |
|---|---|---|---|
| tiny | ~75MB | Smallest and fastest | Real-time transcription, resource-limited environments |
| tiny.en | ~75MB | English-only version | English real-time transcription |
| base | ~142MB | Balanced performance | General purpose - recommended |
| base.en | ~142MB | English-only version | English transcription - recommended |
| small | ~466MB | Higher accuracy | High-quality transcription |
| small.en | ~466MB | English-only version | High-quality English transcription |
| medium | ~1.5GB | High accuracy | Professional transcription |
| medium.en | ~1.5GB | English-only version | Professional English transcription |
| large-v3 | ~3.1GB | Highest accuracy | Best quality transcription |
| large-v3-turbo | ~1.6GB | Optimized large model | Balance of quality and speed |

What are quantized versions? Compressed models that are smaller and faster, with slightly reduced accuracy:

  • q5_0, q5_1: 5-bit quantization, roughly a third of the original file size, noticeably faster
  • q8_0: 8-bit quantization, balanced size and quality

My recommendation: Start with the base model - it’s good enough for most cases. Use medium for quality, tiny for speed.

Download Models

# Download recommended base model
download-ggml-model.cmd base

# Download English-only model
download-ggml-model.cmd base.en

# Download to specific directory
download-ggml-model.cmd base ./models

# Download high-quality model
download-ggml-model.cmd large-v3
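Several of the examples below use a quantized medium model (ggml-medium-q5_0.bin). The same script should be able to fetch quantized variants directly; the model name below (medium-q5_0) is taken from whisper.cpp’s model list, so check the script if it gets rejected:

# Download a quantized medium model (used in several examples below)
# name "medium-q5_0" assumed from the script's built-in model list
download-ggml-model.cmd medium-q5_0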

Hands-On Examples

Enough theory - let’s try it out!

The Simplest Use: Generate SRT Subtitles

Convert speech in input.mp4 to a subtitle file with just one line:

ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto:queue=3:destination=output.srt:format=srt" -f null -

Confused? Here’s what each part does:

  • -vn: Tell FFmpeg “I don’t want video, just process audio”
  • -f null -: Don’t write an output file; the subtitles go straight to the destination path
  • model=ggml-base.bin: Which model to use - enter your downloaded model filename
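For reference, the resulting output.srt is a plain SubRip file: numbered cues, timestamps, text. The cues below are made up purely to show the shape of the output, not real transcription results:

1
00:00:00,000 --> 00:00:03,500
Welcome to today's video.

2
00:00:03,500 --> 00:00:07,200
Let's see how the whisper filter works.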

Advanced: Send to HTTP Service

Building a real-time subtitle system? Push transcription results directly to your server:

ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto:queue=3:destination=http\://localhost\:3000:format=json" -f null -
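If you just want to see what the filter sends before wiring up a real server, a throwaway listener is enough. A minimal sketch (netcat flags differ between BSD and GNU variants):

# Dump incoming HTTP requests from the filter to the terminal (one connection at a time)
nc -l 3000          # BSD netcat (macOS); use "nc -l -p 3000" with traditional/GNU netcat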

Want Better Accuracy? Add VAD

VAD (Voice Activity Detection) automatically identifies speech and silence for more natural subtitle segmentation:

# First download the VAD model
wget https://raw.githubusercontent.com/ggml-org/whisper.cpp/master/models/for-tests-silero-v5.1.2-ggml.bin

# Use VAD for high-quality transcription
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=auto:queue=20:destination=output.srt:format=srt:vad_model=for-tests-silero-v5.1.2-ggml.bin:vad_threshold=0.6" -f null -

Cool Trick: Real-time Microphone Transcription (Linux)

Transcribe as you speak - perfect for meeting notes:

ffmpeg -loglevel warning -f pulse -i default -af 'highpass=f=200,lowpass=f=3000,whisper=model=ggml-medium-q5_0.bin:language=en:queue=10:destination=-:format=json:vad_model=for-tests-silero-v5.1.2-ggml.bin' -f null -

Multi-language Video? Auto-detect!

Don’t know what language the video is in? Set language=auto and let it figure it out:

ffmpeg -i multilang_video.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=auto:queue=15:destination=transcript.json:format=json:use_gpu=true" -f null -

No GPU? CPU Works Too

No NVIDIA graphics card? No problem - add use_gpu=false; it’s slower but works fine:

ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=zh:queue=5:destination=output.srt:format=srt:use_gpu=false" -f null -

Advanced: Burn-in Real-time Subtitles

Beyond generating subtitle files, you can “burn” recognized text directly onto the video. The whisper filter writes transcription text to frame metadata (lavfi.whisper.text), which the drawtext filter can read and render.

Recognize and Display Subtitles Simultaneously

ffmpeg -i input.mp4 -af "whisper=model=ggml-base.en.bin:language=en" -vf "drawtext=text='%{metadata\:lavfi.whisper.text}':fontsize=24:fontcolor=white:x=10:y=h-th-10" output_with_subtitles.mp4
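If white text is hard to read against the footage, drawtext can also draw a semi-transparent box behind it. A variation on the command above (the box styling values are just a starting point):

ffmpeg -i input.mp4 -af "whisper=model=ggml-base.en.bin:language=en" -vf "drawtext=text='%{metadata\:lavfi.whisper.text}':fontsize=24:fontcolor=white:box=1:boxcolor=black@0.5:boxborderw=8:x=10:y=h-th-10" output_with_subtitles.mp4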

Running Slow? Optimize Like This

Choosing the Right Model Matters

  • Need Speed: Use tiny or base, results in seconds
  • Need Quality: Use medium or large-v3, slower but accurate
  • Compromise: Use quantized versions like medium-q5_0, fast and accurate

Use Your GPU If You Have One

GPU acceleration makes a huge difference, especially with large models:

  • Make sure CUDA drivers are installed
  • Add use_gpu=true to your command (it’s actually on by default)
  • Multiple GPUs? Use gpu_device=1 to specify which one
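For example, to pin the transcription to a second GPU (everything else mirrors the earlier examples):

# Run on GPU index 1 instead of the default GPU 0
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-medium-q5_0.bin:language=auto:queue=15:destination=output.srt:format=srt:use_gpu=true:gpu_device=1" -f null -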

How to Tune the queue Parameter?

  • Live streaming/real-time: Keep it small, queue=1 to queue=3, low latency
  • Processing local videos: Go bigger, queue=15 to queue=20, works better with VAD

Having Problems?

Common issues and solutions:

  • Model not found error: Check if the path after model= is correct, try using absolute paths
  • GPU error: Add use_gpu=false to run on CPU first, confirm if it’s a driver issue
  • Out of memory: Use a smaller model, or reduce queue value
  • Poor accuracy: Try a larger model, or specify the correct language parameter
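If none of these help, first confirm that your build actually includes the filter at all - if the command below complains about an unknown filter, your FFmpeg was not built with whisper support:

# Prints the whisper filter's options if the build supports it
ffmpeg -hide_banner -h filter=whisper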

Supported Languages

Whisper claims to support 99 languages. Common ones work great:

| Code | Language | Code | Language |
|---|---|---|---|
| en | English | zh | Chinese |
| ja | Japanese | ko | Korean |
| es | Spanish | fr | French |
| de | German | ru | Russian |
| it | Italian | ar | Arabic |
| pt | Portuguese | hi | Hindi |

Pro tip: While language=auto can detect the language automatically, specifying it directly (e.g., language=en) usually gives better results.

Wrapping Up

The FFmpeg + Whisper combo is seriously powerful. What used to cost money or take hours of work now takes just one command. Give it a try!

Tags

#FFmpeg #Whisper #Speech Recognition #Subtitle Generation #Video to Text #Audio Processing

Copyright Notice

This article was created by WebRTC.link and is licensed under CC BY-NC-SA 4.0. When this site reposts articles, it cites the source and author. If you need to repost this article, please cite the source and author as well.
