How to Transcribe Video to Text Using OpenAI Whisper

A practical guide to the video → mp3 → text pipeline using OpenAI Whisper. Free, local transcription for interviews, podcasts, and meetings.

Introduction

Got a recorded interview, meeting, or podcast that needs transcription? Instead of paying for online transcription services, you can use OpenAI Whisper — a free, open-source speech recognition model that runs locally on your machine.

In this guide, we'll walk through the complete pipeline: from video file to ready-to-use text transcription.


Requirements

  • Python 3.8+
  • FFmpeg (for audio/video conversion)
  • 2-10 GB of free disk space (depending on model size)

Step 1: Install FFmpeg

FFmpeg is needed to extract audio from video files.

macOS (Homebrew)

brew install ffmpeg

Ubuntu/Debian

sudo apt update
sudo apt install ffmpeg

Windows

Download a build from ffmpeg.org and add its bin directory to your PATH.

Verify installation:

ffmpeg -version
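
If you plan to script the pipeline, you can make the same check from Python; a quick sketch using only the standard library:

import shutil

# Returns the path to the ffmpeg binary, or None if it is not on PATH
print(shutil.which("ffmpeg"))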

Step 2: Install OpenAI Whisper

Install Whisper via pip:

pip install openai-whisper

For faster transcription on an NVIDIA GPU, install a CUDA build of PyTorch first, then Whisper (keep the installs separate, since the PyTorch package index does not host openai-whisper):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper
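
To confirm Whisper will actually use the GPU, a quick check from Python:

import torch

# True means transcription can run on CUDA; False falls back to CPU
print(torch.cuda.is_available())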

Step 3: Extract Audio from Video

Pull the audio track from your video file:

ffmpeg -i interview.mp4 -vn -acodec libmp3lame -q:a 2 interview.mp3

Flag breakdown:

  • -i interview.mp4 — input file
  • -vn — no video (audio only)
  • -acodec libmp3lame — MP3 codec
  • -q:a 2 — high audio quality

For a smaller file (sufficient for transcription, since Whisper resamples everything to 16 kHz mono internally anyway):

ffmpeg -i interview.mp4 -vn -ar 16000 -ac 1 -b:a 64k interview.mp3

Step 4: Transcribe with Whisper

Basic CLI Usage

whisper interview.mp3 --language English --model medium

Available Models

Model     Parameters   VRAM     Quality     Speed
tiny      39M          ~1 GB    Low         Very fast
base      74M          ~1 GB    Medium      Fast
small     244M         ~2 GB    Good        Medium
medium    769M         ~5 GB    Very good   Slow
large     1550M        ~10 GB   Best        Very slow

For most use cases, small or medium offers the best quality-to-speed ratio. If you only work with English, the English-only variants (tiny.en through medium.en) tend to be slightly more accurate at the same size.

Save to Text File

whisper interview.mp3 --language English --model medium --output_format txt

Available output formats: txt, vtt, srt, tsv, json (the default, all, writes every format at once).
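
The same writers are available from Python via whisper.utils.get_writer. A minimal sketch for producing subtitles from a transcription result; note that the options dict below matches recent openai-whisper releases, whose writer signature has shifted between versions:

import whisper
from whisper.utils import get_writer

model = whisper.load_model("medium")
result = model.transcribe("interview.mp3", language="en")

# Write interview.srt into the current directory
writer = get_writer("srt", ".")
writer(result, "interview.mp3",
       {"max_line_width": None, "max_line_count": None, "highlight_words": False})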


Example: Transcribing a User Interview

Let's say you recorded a 15-minute user feedback session about your product. Here's the complete workflow:

1. Extract Audio

ffmpeg -i user_feedback_session.mov -vn -ar 16000 -ac 1 feedback.mp3

2. Run Transcription

whisper feedback.mp3 --language English --model medium --output_format txt

3. Output

Whisper creates feedback.txt with content like:

So the first thing about these signals before the interval.
Yes, so you set a 25-second interval and before the interval
you get 4 audio signals, let's say, the main signal with an
accent and then 4 weaker ones again. So basically if you don't
arrive exactly on the signal, you know that you need to speed
up by two seconds or slow down by two seconds because the sound
tells you that.

Example: Python Script for Batch Processing

For processing multiple files or integrating into your workflow:

#!/usr/bin/env python3
"""
Video to Text Transcription Pipeline
Uses FFmpeg + OpenAI Whisper
"""

import subprocess
import whisper
from pathlib import Path


def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract audio from video using FFmpeg."""
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vn", "-ar", "16000", "-ac", "1",
        "-b:a", "64k", "-y",
        audio_path
    ]
    # check=True raises CalledProcessError if ffmpeg exits non-zero;
    # capture_output=True keeps ffmpeg's verbose log out of stdout
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Audio extracted: {audio_path}")


def transcribe_audio(
    audio_path: str,
    output_path: str,
    model_name: str = "medium",
    language: str = "en"
) -> str:
    """Transcribe audio using Whisper."""
    print(f"Loading model {model_name}...")
    model = whisper.load_model(model_name)

    print("Transcribing...")
    result = model.transcribe(audio_path, language=language)

    text = result["text"]

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(text)

    print(f"Transcription saved: {output_path}")
    return text


def main():
    # Transcribe every .mp4 in the current directory.
    # Note: transcribe_audio reloads the model per file; hoist
    # whisper.load_model() out of the loop for large batches.
    for video_file in sorted(Path(".").glob("*.mp4")):
        audio_file = video_file.with_suffix(".mp3")
        output_file = video_file.with_suffix(".txt")

        # Pipeline: video -> mp3 -> text
        extract_audio(str(video_file), str(audio_file))
        text = transcribe_audio(str(audio_file), str(output_file))

        # Cleanup temp file
        audio_file.unlink()

        print(f"\n{video_file.name} preview:\n{text[:500]}...")


if __name__ == "__main__":
    main()

Run it:

python transcribe.py

Example: Transcription with Timestamps

Need timestamps for subtitles or reference? Use the segments feature:

import whisper

model = whisper.load_model("medium")
result = model.transcribe("podcast_episode.mp3", language="en")

# Print segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:.1f}s - {end:.1f}s] {text}")

Output:

[0.0s - 4.2s] Welcome to the show. Today we're talking about...
[4.2s - 8.7s] ...building hardware products for athletes.
[8.7s - 15.3s] Our guest has been working on a tempo trainer device.
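
If you want to build the SRT file yourself rather than via --output_format srt, a minimal sketch (the srt_timestamp helper below is ours, not part of Whisper):

def srt_timestamp(seconds: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("podcast_episode.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n")
        f.write(segment["text"].strip() + "\n\n")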

Example: Multi-Language Detection

Don't know the language? Let Whisper detect it:

import whisper

model = whisper.load_model("medium")

# Auto-detect language
result = model.transcribe("unknown_language.mp3")

print(f"Detected language: {result['language']}")
print(f"Text: {result['text']}")

Tips and Optimizations

1. Faster Transcription on Apple Silicon

pip install mlx-whisper

Then, in Python:

import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-medium-mlx"
)
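
mlx_whisper returns the same result dictionary shape (text, segments, language) as openai-whisper, so downstream code like the batch script above should work unchanged.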

2. Handling Long Recordings

Whisper works through audio in 30-second windows, so long files are supported, but for recordings over an hour it is still worth splitting into 30-minute chunks so you can parallelize or resume after a failure:

ffmpeg -i long_recording.mp3 -f segment -segment_time 1800 -c copy chunk_%03d.mp3
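
You can then transcribe the chunks in order and stitch the text back together; a minimal sketch that assumes the chunk naming from the command above:

import whisper
from pathlib import Path

model = whisper.load_model("medium")  # load once, reuse for every chunk

parts = []
for chunk in sorted(Path(".").glob("chunk_*.mp3")):
    print(f"Transcribing {chunk.name}...")
    result = model.transcribe(str(chunk), language="en")
    parts.append(result["text"].strip())

Path("full_transcript.txt").write_text("\n".join(parts), encoding="utf-8")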

3. Improving Quality for Difficult Audio

  • Use a larger model (large instead of medium)
  • Add context with initial_prompt; a short description of the domain biases the decoder toward the right vocabulary and spellings:

result = model.transcribe(
    "audio.mp3",
    language="en",
    initial_prompt="This is a conversation about swimming training and interval timers."
)

Comparison with Alternatives

Solution                Cost         Privacy         Quality
Whisper (local)         Free         Full privacy    Very good
OpenAI API              $0.006/min   Cloud-based     Excellent
Google Speech-to-Text   $0.016/min   Cloud-based     Very good
AssemblyAI              $0.015/min   Cloud-based     Very good

Choose local Whisper when:

  • Data privacy matters
  • You have lots of content to transcribe
  • You want to avoid recurring costs
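
As a back-of-the-envelope comparison: 100 hours of audio through the OpenAI API at $0.006/min comes to 100 × 60 × 0.006 = $36, billed again for every new batch, while local Whisper costs only your hardware time.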

Summary

The video → mp3 → text pipeline with Whisper is straightforward:

  1. Extract audio: ffmpeg -i video.mp4 -vn audio.mp3
  2. Transcribe: whisper audio.mp3 --language English --model medium

Everything runs locally, costs nothing, and delivers production-quality results.


Need Help with Audio/Video Processing?

Building something similar or facing technical challenges? We've been there.

Let's talk — no sales pitch, just honest engineering advice.
