
How to Build an AI Voice Transcription System

Transcribe calls, meetings, and audio files automatically using AI.

Jay Banlasan

The AI Systems Guy

An AI voice transcription system turns recorded calls, meetings, and audio files into searchable text automatically. I build these for teams that record everything but never go back and listen. Transcripts make meetings searchable, calls reviewable, and podcasts repurposable.

OpenAI Whisper runs locally for free or through OpenAI's hosted API. In my experience it handles accents, background noise, and multiple speakers as well as or better than most paid transcription services.

What You Need Before Starting

Python 3.9 or later with ffmpeg installed (Whisper decodes audio through ffmpeg), the audio files you want to transcribe, an OpenAI API key if you use the hosted API, and a Hugging Face access token if you add speaker diarization with pyannote.audio.

Step 1: Set Up Whisper

For local transcription (free, private):

pip install openai-whisper

import whisper

model = whisper.load_model("base")  # Options: tiny, base, small, medium, large

def transcribe_local(audio_path):
    result = model.transcribe(audio_path)
    return {
        "text": result["text"],
        "segments": [{
            "start": s["start"],
            "end": s["end"],
            "text": s["text"]
        } for s in result["segments"]],
        "language": result["language"]
    }
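The segments list makes it easy to render a timestamped transcript for humans to skim. A minimal sketch (format_timestamp, format_transcript, and the [MM:SS] layout are my own choices, not part of Whisper):

```python
def format_timestamp(seconds):
    """Convert a second offset into an MM:SS label."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def format_transcript(segments):
    """Render Whisper segments as one timestamped line each."""
    return "\n".join(
        f"[{format_timestamp(s['start'])}] {s['text'].strip()}"
        for s in segments
    )

# Example with hand-written segments in Whisper's output shape
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the meeting."},
    {"start": 4.2, "end": 9.8, "text": " First item: the Q3 roadmap."},
]
print(format_transcript(segments))
```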

For API-based transcription:

from openai import OpenAI

client = OpenAI()

def transcribe_api(audio_path):
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json"
        )
    return transcript

Step 2: Process and Store Transcripts

import sqlite3
import json

def save_transcript(audio_id, audio_path, transcript):
    conn = sqlite3.connect("transcripts.db")
    # Create the table on first run so inserts never fail on a fresh database
    conn.execute("""
        CREATE TABLE IF NOT EXISTS transcripts
        (audio_id TEXT, audio_path TEXT, full_text TEXT,
         segments TEXT, language TEXT, transcribed_at TEXT)
    """)
    conn.execute("""
        INSERT INTO transcripts (audio_id, audio_path, full_text, segments, language, transcribed_at)
        VALUES (?, ?, ?, ?, ?, datetime('now'))
    """, (audio_id, audio_path, transcript["text"], json.dumps(transcript["segments"]), transcript.get("language", "en")))
    conn.commit()
    conn.close()

Step 3: Add Speaker Diarization

Identify who said what in multi-speaker audio:

def add_speaker_labels(audio_path, segments):
    """Basic speaker diarization using pyannote.audio."""
    import os
    from pyannote.audio import Pipeline
    # The pretrained pipeline is gated: accept its terms on Hugging Face
    # and supply an access token
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token=os.environ.get("HF_TOKEN")
    )
    diarization = pipeline(audio_path)

    labeled_segments = []
    for segment in segments:
        # Match each Whisper segment to the speaker active at its midpoint
        mid_time = (segment["start"] + segment["end"]) / 2
        speaker = "Unknown"
        for turn, _, spk in diarization.itertracks(yield_label=True):
            if turn.start <= mid_time <= turn.end:
                speaker = spk
                break
        labeled_segments.append({**segment, "speaker": speaker})

    return labeled_segments
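Once segments carry speaker labels, consecutive segments from the same speaker can be merged into readable turns. A minimal sketch (merge_speaker_turns is my own helper, operating on the list the function above returns):

```python
def merge_speaker_turns(labeled_segments):
    """Collapse consecutive same-speaker segments into single turns."""
    turns = []
    for seg in labeled_segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker as the previous turn: extend it
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"].strip()
        else:
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
            })
    return turns
```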

Step 4: Build a Search Index

Make transcripts searchable:

from sentence_transformers import SentenceTransformer
import chromadb

search_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.PersistentClient(path="./transcript_search")
collection = chroma.get_or_create_collection("transcripts")

def index_transcript(audio_id, segments):
    for i, seg in enumerate(segments):
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments
        embedding = search_model.encode(text).tolist()
        collection.add(
            ids=[f"{audio_id}_seg_{i}"],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{"audio_id": audio_id, "start": seg["start"], "end": seg["end"]}]
        )

def search_transcripts(query, top_k=5):
    embedding = search_model.encode(query).tolist()
    results = collection.query(query_embeddings=[embedding], n_results=top_k)
    return results
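Chroma returns parallel lists nested one level deep (one inner list per query), which is awkward to display directly. A small helper can flatten a single-query result into hit dicts (flatten_hits is my own name; the input shape mirrors what chromadb's query returns for our metadata):

```python
def flatten_hits(results):
    """Flatten Chroma's {'documents': [[...]], 'metadatas': [[...]]} into hit dicts."""
    hits = []
    for i, doc in enumerate(results["documents"][0]):
        meta = results["metadatas"][0][i]
        hits.append({
            "text": doc,
            "audio_id": meta["audio_id"],
            "start": meta["start"],
            "end": meta["end"],
        })
    return hits
```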

Step 5: Batch Process Audio Files

import os

def batch_transcribe(audio_folder, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    results = []

    for filename in os.listdir(audio_folder):
        if not filename.lower().endswith((".mp3", ".wav", ".m4a", ".mp4")):
            continue

        audio_path = os.path.join(audio_folder, filename)
        transcript = transcribe_local(audio_path)
        save_transcript(filename, audio_path, transcript)
        index_transcript(filename, transcript["segments"])

        output_path = os.path.join(output_folder, f"{os.path.splitext(filename)[0]}.txt")
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(transcript["text"])

        results.append({"file": filename, "words": len(transcript["text"].split())})

    return results
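A quick summary of a batch run helps you spot files that transcribed suspiciously short, which usually means silence or a decoding problem. A sketch (summarize_batch and the 50-word threshold are my own choices, operating on the list batch_transcribe returns):

```python
def summarize_batch(results, min_words=50):
    """Report totals and flag files below a word-count threshold."""
    total = sum(r["words"] for r in results)
    short = [r["file"] for r in results if r["words"] < min_words]
    return {"files": len(results), "total_words": total, "suspiciously_short": short}
```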

What to Build Next

Add real-time transcription for live meetings. Whisper has no native streaming mode, but you can transcribe short overlapping audio chunks as people speak, displaying near-live captions and generating searchable notes the moment the meeting ends.
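The chunked approach starts with deciding where the windows fall. A sketch of that arithmetic (chunk_boundaries is my own helper; 30-second windows with 5 seconds of overlap are common defaults, not Whisper requirements; the overlap lets you stitch text across cut-off words):

```python
def chunk_boundaries(duration_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) windows covering a recording, with overlap."""
    step = chunk_s - overlap_s
    start = 0.0
    boundaries = []
    while start < duration_s:
        boundaries.append((start, min(start + chunk_s, duration_s)))
        start += step
    return boundaries
```

Feed each window's audio to transcribe_local as it becomes available, and de-duplicate the overlapping text between consecutive windows before displaying it.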


Want this system built for your business?

Get a free assessment. We will map every system your business needs and show you the ROI.
