How to Transcribe Video to Text Using OpenAI Whisper

A practical guide to the video → mp3 → text pipeline using OpenAI Whisper. Free, local transcription for interviews, podcasts, and meetings.

Introduction

Got a recorded interview, meeting, or podcast that needs transcription? Instead of paying for online transcription services, you can use OpenAI Whisper — a free, open-source speech recognition model that runs locally on your machine.

In this guide, we'll walk through the complete pipeline: from video file to ready-to-use text transcription.


Requirements

  • Python 3.8+
  • FFmpeg (for audio/video conversion)
  • 2-10 GB of free disk space (depending on model size)

Step 1: Install FFmpeg

FFmpeg is needed to extract audio from video files.

macOS (Homebrew)

brew install ffmpeg

Ubuntu/Debian

sudo apt update
sudo apt install ffmpeg

Windows

Download a build from ffmpeg.org and add its bin directory to your PATH.

Verify installation:

ffmpeg -version
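
If you plan to script the pipeline, you can make the same check from Python; a quick sketch using only the standard library:

import shutil

# Returns the path to the ffmpeg binary, or None if it is not on PATH
print(shutil.which("ffmpeg"))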

Step 2: Install OpenAI Whisper

Install Whisper via pip:

pip install openai-whisper

For faster transcription on an NVIDIA GPU, install a CUDA build of PyTorch first, then Whisper (keep the installs separate, since the PyTorch package index does not host openai-whisper):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper
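
To confirm Whisper will actually use the GPU, a quick check from Python:

import torch

# True means transcription can run on CUDA; False falls back to CPU
print(torch.cuda.is_available())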

Step 3: Extract Audio from Video

Pull the audio track from your video file:

ffmpeg -i interview.mp4 -vn -acodec libmp3lame -q:a 2 interview.mp3

Flag breakdown:

  • -i interview.mp4 — input file
  • -vn — no video (audio only)
  • -acodec libmp3lame — MP3 codec
  • -q:a 2 — high audio quality

For a smaller file (sufficient for transcription, since Whisper resamples everything to 16 kHz mono internally anyway):

ffmpeg -i interview.mp4 -vn -ar 16000 -ac 1 -b:a 64k interview.mp3

Step 4: Transcribe with Whisper

Basic CLI Usage

whisper interview.mp3 --language English --model medium

Available Models

Model     Parameters   VRAM     Quality     Speed
tiny      39M          ~1 GB    Low         Very fast
base      74M          ~1 GB    Medium      Fast
small     244M         ~2 GB    Good        Medium
medium    769M         ~5 GB    Very good   Slow
large     1550M        ~10 GB   Best        Very slow

For most use cases, small or medium offers the best quality-to-speed ratio. If you only work with English, the English-only variants (tiny.en through medium.en) tend to be slightly more accurate at the same size.

Save to Text File

whisper interview.mp3 --language English --model medium --output_format txt

Available output formats: txt, vtt, srt, tsv, json (the default, all, writes every format at once).
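
The same writers are available from Python via whisper.utils.get_writer. A minimal sketch for producing subtitles from a transcription result; note that the options dict below matches recent openai-whisper releases, whose writer signature has shifted between versions:

import whisper
from whisper.utils import get_writer

model = whisper.load_model("medium")
result = model.transcribe("interview.mp3", language="en")

# Write interview.srt into the current directory
writer = get_writer("srt", ".")
writer(result, "interview.mp3",
       {"max_line_width": None, "max_line_count": None, "highlight_words": False})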


Example: Transcribing a User Interview

Let's say you recorded a 15-minute user feedback session about your product. Here's the complete workflow:

1. Extract Audio

ffmpeg -i user_feedback_session.mov -vn -ar 16000 -ac 1 feedback.mp3

2. Run Transcription

whisper feedback.mp3 --language English --model medium --output_format txt

3. Output

Whisper creates feedback.txt with content like:

So the first thing about these signals before the interval.
Yes, so you set a 25-second interval and before the interval
you get 4 audio signals, let's say, the main signal with an
accent and then 4 weaker ones again. So basically if you don't
arrive exactly on the signal, you know that you need to speed
up by two seconds or slow down by two seconds because the sound
tells you that.

Example: Python Script for Batch Processing

For processing multiple files or integrating into your workflow:

#!/usr/bin/env python3
"""
Video to Text Transcription Pipeline
Uses FFmpeg + OpenAI Whisper
"""

import subprocess
import whisper
from pathlib import Path


def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract audio from video using FFmpeg."""
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vn", "-ar", "16000", "-ac", "1",
        "-b:a", "64k", "-y",
        audio_path
    ]
    # check=True raises CalledProcessError if ffmpeg exits non-zero;
    # capture_output=True keeps ffmpeg's verbose log out of stdout
    subprocess.run(cmd, check=True, capture_output=True)
    print(f"Audio extracted: {audio_path}")


def transcribe_audio(
    audio_path: str,
    output_path: str,
    model_name: str = "medium",
    language: str = "en"
) -> str:
    """Transcribe audio using Whisper."""
    print(f"Loading model {model_name}...")
    model = whisper.load_model(model_name)

    print("Transcribing...")
    result = model.transcribe(audio_path, language=language)

    text = result["text"]

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(text)

    print(f"Transcription saved: {output_path}")
    return text


def main():
    # Transcribe every .mp4 in the current directory.
    # Note: transcribe_audio reloads the model per file; hoist
    # whisper.load_model() out of the loop for large batches.
    for video_file in sorted(Path(".").glob("*.mp4")):
        audio_file = video_file.with_suffix(".mp3")
        output_file = video_file.with_suffix(".txt")

        # Pipeline: video -> mp3 -> text
        extract_audio(str(video_file), str(audio_file))
        text = transcribe_audio(str(audio_file), str(output_file))

        # Cleanup temp file
        audio_file.unlink()

        print(f"\n{video_file.name} preview:\n{text[:500]}...")


if __name__ == "__main__":
    main()

Run it:

python transcribe.py

Example: Transcription with Timestamps

Need timestamps for subtitles or reference? Use the segments feature:

import whisper

model = whisper.load_model("medium")
result = model.transcribe("podcast_episode.mp3", language="en")

# Print segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:.1f}s - {end:.1f}s] {text}")

Output:

[0.0s - 4.2s] Welcome to the show. Today we're talking about...
[4.2s - 8.7s] ...building hardware products for athletes.
[8.7s - 15.3s] Our guest has been working on a tempo trainer device.
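
If you want to build the SRT file yourself rather than via --output_format srt, a minimal sketch (the srt_timestamp helper below is ours, not part of Whisper):

def srt_timestamp(seconds: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("podcast_episode.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n")
        f.write(segment["text"].strip() + "\n\n")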

Example: Multi-Language Detection

Don't know the language? Let Whisper detect it:

import whisper

model = whisper.load_model("medium")

# Auto-detect language
result = model.transcribe("unknown_language.mp3")

print(f"Detected language: {result['language']}")
print(f"Text: {result['text']}")

Tips and Optimizations

1. Faster Transcription on Apple Silicon

pip install mlx-whisper

Then, in Python:

import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.mp3",
    path_or_hf_repo="mlx-community/whisper-medium-mlx"
)
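
mlx_whisper returns the same result dictionary shape (text, segments, language) as openai-whisper, so downstream code like the batch script above should work unchanged.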

2. Handling Long Recordings

Whisper works through audio in 30-second windows, so long files are supported, but for recordings over an hour it is still worth splitting into 30-minute chunks so you can parallelize or resume after a failure:

ffmpeg -i long_recording.mp3 -f segment -segment_time 1800 -c copy chunk_%03d.mp3
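
You can then transcribe the chunks in order and stitch the text back together; a minimal sketch that assumes the chunk naming from the command above:

import whisper
from pathlib import Path

model = whisper.load_model("medium")  # load once, reuse for every chunk

parts = []
for chunk in sorted(Path(".").glob("chunk_*.mp3")):
    print(f"Transcribing {chunk.name}...")
    result = model.transcribe(str(chunk), language="en")
    parts.append(result["text"].strip())

Path("full_transcript.txt").write_text("\n".join(parts), encoding="utf-8")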

3. Improving Quality for Difficult Audio

  • Use a larger model (large instead of medium)
  • Add context with initial_prompt; a short description of the domain biases the decoder toward the right vocabulary and spellings:

result = model.transcribe(
    "audio.mp3",
    language="en",
    initial_prompt="This is a conversation about swimming training and interval timers."
)

Comparison with Alternatives

Solution                Cost         Privacy         Quality
Whisper (local)         Free         Full privacy    Very good
OpenAI API              $0.006/min   Cloud-based     Excellent
Google Speech-to-Text   $0.016/min   Cloud-based     Very good
AssemblyAI              $0.015/min   Cloud-based     Very good

Choose local Whisper when:

  • Data privacy matters
  • You have lots of content to transcribe
  • You want to avoid recurring costs
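
As a back-of-the-envelope comparison: 100 hours of audio through the OpenAI API at $0.006/min comes to 100 × 60 × 0.006 = $36, billed again for every new batch, while local Whisper costs only your hardware time.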

Summary

The video → mp3 → text pipeline with Whisper is straightforward:

  1. Extract audio: ffmpeg -i video.mp4 -vn audio.mp3
  2. Transcribe: whisper audio.mp3 --language English --model medium

Everything runs locally, costs nothing, and delivers production-quality results.


Need Help with Audio/Video Processing?

Building something similar or facing technical challenges? We've been there.

Let's talk — no sales pitch, just honest engineering advice.
