
Got a recorded interview, meeting, or podcast that needs transcription? Instead of paying for online transcription services, you can use OpenAI Whisper — a free, open-source speech recognition model that runs locally on your machine.
In this guide, we'll walk through the complete pipeline: from video file to ready-to-use text transcription.
FFmpeg is needed to extract audio from video files.

macOS (Homebrew):

brew install ffmpeg

Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

Windows: download a build from ffmpeg.org and add it to your PATH.
Verify installation:
ffmpeg -version
Install Whisper via pip:
pip install openai-whisper
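A quick sanity check that the package imports correctly is to list the bundled model names from Python:

import whisper

# Prints the available checkpoints: 'tiny', 'base', 'small', 'medium', 'large', ...
print(whisper.available_models())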
For faster GPU transcription (NVIDIA):
pip install openai-whisper torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
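Whisper picks the GPU up automatically through PyTorch. A minimal check that CUDA is actually visible — if this prints False, transcription silently falls back to the much slower CPU path:

import torch

print(torch.cuda.is_available())          # True means Whisper will run on the GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the name of your NVIDIA card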
Pull the audio track from your video file:
ffmpeg -i interview.mp4 -vn -acodec libmp3lame -q:a 2 interview.mp3
Flag breakdown:

- `-i interview.mp4`: input file
- `-vn`: no video (audio only)
- `-acodec libmp3lame`: MP3 codec
- `-q:a 2`: high audio quality

For a smaller file (sufficient for transcription, since Whisper resamples everything to 16 kHz mono internally anyway):

ffmpeg -i interview.mp4 -vn -ar 16000 -ac 1 -b:a 64k interview.mp3

Here `-ar 16000` sets the sample rate, `-ac 1` downmixes to mono, and `-b:a 64k` caps the bitrate.
Now run Whisper on the extracted audio:

whisper interview.mp3 --language English --model medium
Whisper ships in several model sizes (the numbers below are parameter counts, not file sizes):

| Model | Parameters | VRAM | Quality | Speed |
|---|---|---|---|---|
| tiny | 39 M | ~1 GB | Low | Very fast |
| base | 74 M | ~1 GB | Medium | Fast |
| small | 244 M | ~2 GB | Good | Medium |
| medium | 769 M | ~5 GB | Very good | Slow |
| large | 1550 M | ~10 GB | Best | Very slow |
For most use cases, small or medium offers the best quality-to-speed ratio.
Set the output format explicitly with --output_format:

whisper interview.mp3 --language English --model medium --output_format txt
Available output formats: txt, vtt, srt, tsv, json
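The same writers are exposed in Python via whisper.utils.get_writer, if you want files instead of the raw result dict. A sketch — the options dict is required by some openai-whisper releases, so it's passed explicitly here:

import whisper
from whisper.utils import get_writer

model = whisper.load_model("medium")
result = model.transcribe("interview.mp3", language="en")

# Write interview.srt to the current directory; "txt", "vtt", "tsv" and "json" work the same way
writer = get_writer("srt", ".")
writer(result, "interview.mp3",
       {"max_line_width": None, "max_line_count": None, "highlight_words": False})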
Let's say you recorded a 15-minute user feedback session about your product. Here's the complete workflow:
ffmpeg -i user_feedback_session.mov -vn -ar 16000 -ac 1 feedback.mp3
whisper feedback.mp3 --language English --model medium --output_format txt
Whisper creates feedback.txt with content like:
So the first thing about these signals before the interval.
Yes, so you set a 25-second interval and before the interval
you get 4 audio signals, let's say, the main signal with an
accent and then 4 weaker ones again. So basically if you don't
arrive exactly on the signal, you know that you need to speed
up by two seconds or slow down by two seconds because the sound
tells you that.
For processing multiple files or integrating into your workflow:
#!/usr/bin/env python3
"""
Video to Text Transcription Pipeline
Uses FFmpeg + OpenAI Whisper
"""
import subprocess
import whisper
from pathlib import Path
def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract audio from video using FFmpeg."""
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vn", "-ar", "16000", "-ac", "1",
        "-b:a", "64k", "-y",
        audio_path
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Audio extracted: {audio_path}")

def transcribe_audio(
    audio_path: str,
    output_path: str,
    model_name: str = "medium",
    language: str = "en"
) -> str:
    """Transcribe audio using Whisper."""
    print(f"Loading model {model_name}...")
    model = whisper.load_model(model_name)
    print("Transcribing...")
    result = model.transcribe(audio_path, language=language)
    text = result["text"]
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Transcription saved: {output_path}")
    return text

def main():
    # Configuration
    video_file = "meeting_recording.mp4"
    audio_file = "temp_audio.mp3"
    output_file = "meeting_transcript.txt"

    # Pipeline
    extract_audio(video_file, audio_file)
    text = transcribe_audio(audio_file, output_file)

    # Cleanup temp file
    Path(audio_file).unlink()

    print(f"\nTranscription preview:\n{text[:500]}...")

if __name__ == "__main__":
    main()
Run it:
python transcribe.py
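From there, batch processing is a small loop on top of the two functions above. A sketch, assuming your recordings sit in a videos/ folder (a hypothetical path):

from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".mkv"}

def batch_transcribe(video_dir: str = "videos") -> None:
    """Run extract_audio + transcribe_audio (from transcribe.py above) over a folder."""
    for video in sorted(Path(video_dir).iterdir()):
        if video.suffix.lower() not in VIDEO_EXTS:
            continue
        audio = video.with_suffix(".mp3")
        extract_audio(str(video), str(audio))
        transcribe_audio(str(audio), str(video.with_suffix(".txt")))
        audio.unlink()  # drop the temporary mp3

batch_transcribe()

One obvious refinement: transcribe_audio reloads the model on every call, which is fine for a handful of files; for large batches, load the model once and pass it in.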
Need timestamps for subtitles or reference? Use the segments feature:
import whisper
model = whisper.load_model("medium")
result = model.transcribe("podcast_episode.mp3", language="en")
# Print segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
Output:
[0.0s - 4.2s] Welcome to the show. Today we're talking about...
[4.2s - 8.7s] ...building hardware products for athletes.
[8.7s - 15.3s] Our guest has been working on a tempo trainer device.
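Segments also make a long recording searchable by time. Continuing from the snippet above, a few lines find every moment a keyword is mentioned (the keyword here is just an illustration):

keyword = "tempo"  # hypothetical search term
for segment in result["segments"]:
    if keyword.lower() in segment["text"].lower():
        # e.g. prints the segment at ~9s that mentions the tempo trainer
        print(f"{segment['start']:.0f}s: {segment['text'].strip()}")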
Don't know the language? Let Whisper detect it:
import whisper
model = whisper.load_model("medium")
# Auto-detect language
result = model.transcribe("unknown_language.mp3")
print(f"Detected language: {result['language']}")
print(f"Text: {result['text']}")
On an Apple Silicon Mac? The community MLX port of Whisper runs on the built-in GPU:

pip install mlx-whisper
import mlx_whisper
result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-medium-mlx"
)
For recordings over 1 hour, consider splitting:
ffmpeg -i long_recording.mp3 -f segment -segment_time 1800 -c copy chunk_%03d.mp3
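When you transcribe the chunks separately, remember that each chunk's timestamps restart at zero. A sketch that stitches the pieces back together, assuming the 1800-second segments produced above:

import glob
import whisper

model = whisper.load_model("medium")
CHUNK_SECONDS = 1800  # must match -segment_time

full_text = []
for i, chunk in enumerate(sorted(glob.glob("chunk_*.mp3"))):
    result = model.transcribe(chunk, language="en")
    offset = i * CHUNK_SECONDS  # shift timestamps back onto the original timeline
    for seg in result["segments"]:
        print(f"[{seg['start'] + offset:.1f}s] {seg['text'].strip()}")
    full_text.append(result["text"].strip())

with open("long_recording.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(full_text))

Since -c copy cuts on frame boundaries, chunk lengths (and therefore the offsets) are close to 1800 s but not sample-exact; re-encode the chunks if you need precise subtitle timings.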
If quality falls short, try a bigger model (large instead of medium), or prime Whisper with domain vocabulary via initial_prompt:

result = model.transcribe(
    "audio.mp3",
    language="en",
    initial_prompt="This is a conversation about swimming training and interval timers."
)
| Solution | Cost | Privacy | Quality |
|---|---|---|---|
| Whisper (local) | Free | Full privacy | Very good |
| OpenAI API | $0.006/min | Cloud-based | Excellent |
| Google Speech-to-Text | $0.016/min | Cloud-based | Very good |
| AssemblyAI | $0.015/min | Cloud-based | Very good |
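For contrast, the hosted route is also only a few lines — a sketch using the OpenAI Python SDK and its whisper-1 model, assuming an OPENAI_API_KEY is set:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("interview.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

print(transcript.text)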
Choose local Whisper when:

- the recordings are sensitive and shouldn't leave your machine
- you have hours of material and want zero per-minute cost
- you need transcription to work offline
The video → mp3 → text pipeline with Whisper is straightforward:

ffmpeg -i video.mp4 -vn audio.mp3
whisper audio.mp3 --language English --model medium

Everything runs locally, it's free, and it delivers production-quality results.
Building something similar or facing technical challenges? We've been there.
Let's talk — no sales pitch, just honest engineering advice.