Project Showcase: Whisperboard - Real-time Transcription Dashboard


Live Demo: whisperboard.example.com
Source Code: github.com/yourusername/whisperboard
Tech Stack: Python, FastAPI, WebSockets, Whisper, React, TailwindCSS

The Problem

I regularly attend virtual meetings and workshops, and I found myself frantically taking notes while trying to stay engaged. Existing transcription tools were either expensive, tied to cloud services that raised privacy concerns, or simply not accurate enough.

I wanted something that:

  • Ran locally (privacy-first)
  • Worked in real-time
  • Had high accuracy
  • Was simple to use

The Solution

Whisperboard is a real-time audio transcription dashboard that uses OpenAI’s Whisper model to transcribe speech as you speak. It captures system audio, processes it in chunks, and displays the transcription in a clean web interface.

Architecture Overview

┌─────────────────┐
│  Audio Capture  │ (PortAudio / PyAudio)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Audio Buffers  │ (Circular buffer, 30s chunks)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Whisper Engine  │ (whisper-large-v3, local)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  WebSocket API  │ (FastAPI + WS)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  React Frontend │ (Real-time display)
└─────────────────┘
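
For the capture stage at the top of the diagram, a minimal PyAudio sketch looks roughly like this; the buffer size is an assumption for illustration rather than the project's exact setting, and Whisper expects 16 kHz mono audio:

import pyaudio

SAMPLE_RATE = 16000        # Whisper expects 16 kHz mono
FRAMES_PER_BUFFER = 1024   # illustrative buffer size

pa = pyaudio.PyAudio()
stream = pa.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=SAMPLE_RATE,
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)

def read_audio(seconds=1.0):
    """Read roughly `seconds` of 16-bit PCM from the input device."""
    n_reads = int(SAMPLE_RATE * seconds / FRAMES_PER_BUFFER)
    return b"".join(stream.read(FRAMES_PER_BUFFER) for _ in range(n_reads))

Note that this reads from the default input device; capturing system audio (rather than the microphone) generally means routing it through a loopback or virtual audio device that PortAudio then sees as an input.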

Key Features

1. Real-Time Processing

Audio is captured in 30-second chunks with overlapping windows. This provides context for better accuracy while maintaining low latency.

class AudioProcessor:
    def __init__(self, chunk_duration=30, overlap=5):
        # Durations are in seconds; SAMPLE_RATE and CircularBuffer are
        # defined elsewhere in the project
        self.chunk_duration = chunk_duration
        self.overlap = overlap
        self.buffer = CircularBuffer(
            maxsize=chunk_duration * SAMPLE_RATE
        )

    def process_chunk(self, audio_data):
        """Process overlapping chunks for context"""
        self.buffer.extend(audio_data)

        if self.buffer.is_full():
            # Keep the last `overlap` seconds as context for the next chunk
            chunk = self.buffer.get_with_overlap(self.overlap)
            return self.transcribe(chunk)
        return None
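
The transcribe() call above isn't shown in the snippet; as a minimal sketch, assuming the upstream openai-whisper package and 16-bit PCM chunks, it might look like this:

import numpy as np
import whisper

model = whisper.load_model("large-v3")  # loaded once at startup

def transcribe(chunk):
    """Run Whisper on a chunk of 16-bit PCM samples (illustrative)."""
    # Whisper expects float32 audio in [-1.0, 1.0] at 16 kHz
    audio = np.frombuffer(bytes(chunk), dtype=np.int16).astype(np.float32) / 32768.0
    result = model.transcribe(audio, language="en")
    return result["text"]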

2. Smart Chunking

Uses voice activity detection (VAD) to split at natural pauses:

import webrtcvad

vad = webrtcvad.Vad(3)  # Aggressiveness 0-3 (3 = most aggressive)

def split_at_pauses(audio, sample_rate=16000):
    """Split 16-bit mono PCM audio (bytes) at detected pauses"""
    frame_duration = 30  # ms; webrtcvad accepts only 10, 20, or 30 ms frames
    frame_size = int(sample_rate * frame_duration / 1000) * 2  # 2 bytes per sample

    chunks = []
    current_chunk = bytearray()

    for i in range(0, len(audio), frame_size):
        frame = audio[i:i + frame_size]
        if len(frame) < frame_size:
            break  # VAD needs full frames; drop the trailing partial one

        if vad.is_speech(frame, sample_rate):
            current_chunk.extend(frame)
        elif current_chunk:
            chunks.append(bytes(current_chunk))
            current_chunk = bytearray()

    if current_chunk:  # keep speech that runs to the end of the buffer
        chunks.append(bytes(current_chunk))

    return chunks
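
Tying the two pieces together, a small streaming loop can feed each pause-delimited chunk straight into transcription; the function names follow the sketches above and are illustrative:

def transcribe_stream(audio, sample_rate=16000):
    """Yield transcription text for each pause-delimited chunk."""
    for chunk in split_at_pauses(audio, sample_rate=sample_rate):
        text = transcribe(chunk)  # Whisper call, as sketched earlier
        if text.strip():
            yield text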

3. WebSocket Updates

Real-time transcription delivery:

import time

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
active_connections: list[WebSocket] = []

@app.websocket("/ws/transcribe")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    active_connections.append(websocket)

    try:
        while True:
            # Receive audio data
            audio_data = await websocket.receive_bytes()

            # Process asynchronously
            transcription = await process_audio_async(audio_data)

            # Send result back
            await websocket.send_json({
                'timestamp': time.time(),
                'text': transcription['text'],
                'confidence': transcription.get('confidence', 1.0)
            })
    except WebSocketDisconnect:
        active_connections.remove(websocket)
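
The endpoint above only replies to the client that sent the audio. Since active_connections tracks every open socket, a small broadcast helper (an assumption, not shown in the repo snippet) can fan updates out to every dashboard:

async def broadcast(message: dict):
    """Push a transcription update to every connected client."""
    for connection in list(active_connections):
        try:
            await connection.send_json(message)
        except Exception:
            # Drop clients whose sockets have gone away
            active_connections.remove(connection)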

4. Frontend Interface

Clean React interface with auto-scroll and search:

import { useEffect, useRef, useState } from 'react';

function TranscriptionView() {
  const [transcripts, setTranscripts] = useState([]);
  const ws = useRef(null);
  
  useEffect(() => {
    ws.current = new WebSocket('ws://localhost:8000/ws/transcribe');
    
    ws.current.onmessage = (event) => {
      const data = JSON.parse(event.data);
      setTranscripts(prev => [...prev, {
        timestamp: new Date(data.timestamp * 1000),
        text: data.text,
        confidence: data.confidence
      }]);
    };
    
    return () => ws.current?.close();
  }, []);
  
  return (
    <div className="transcript-view">
      {transcripts.map((t, idx) => (
        <div key={idx} className="transcript-entry">
          <span className="timestamp">
            {t.timestamp.toLocaleTimeString()}
          </span>
          <p className="text">{t.text}</p>
          <span className="confidence">
            {(t.confidence * 100).toFixed(0)}%
          </span>
        </div>
      ))}
    </div>
  );
}

Challenges Overcome

1. Latency vs. Accuracy Trade-off

Shorter chunks = lower latency but worse accuracy (less context). I settled on 30-second chunks with 5-second overlap, which gave ~2 second latency with high accuracy.

2. Memory Management

Whisper models are large (1.5GB+ for large-v3). Solution: Use whisper.cpp for CPU inference with quantization, reducing memory usage by 75%.

# Download the large-v3 model in GGML format (helper script bundled with whisper.cpp)
bash ./models/download-ggml-model.sh large-v3

# Quantize it to q5_0 to cut memory usage
./quantize models/ggml-large-v3.bin models/ggml-large-v3-q5_0.bin q5_0

# Run inference with the quantized model
./main -m models/ggml-large-v3-q5_0.bin -l en -f audio.wav
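
To call the quantized model from the Python backend, one option is a thin wrapper that shells out to the whisper.cpp binary; this is a sketch of that integration under the assumption of the classic ./main CLI, not the project's actual code:

import subprocess

def transcribe_with_whispercpp(wav_path, model="models/ggml-large-v3-q5_0.bin"):
    """Run whisper.cpp on a WAV file and return its plain-text output."""
    result = subprocess.run(
        ["./main", "-m", model, "-l", "en", "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()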

3. Audio Capture Permissions

macOS requires explicit permissions for audio capture. Added clear UI prompts and fallback to manual file uploads.

4. Handling Multiple Speakers

Initial version didn’t distinguish speakers. Added pyannote.audio for speaker diarization:

from pyannote.audio import Pipeline

# Gated model: requires a Hugging Face access token
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)

def add_speaker_labels(transcription, audio_file):
    """Add speaker labels to each Whisper segment"""
    diarization = pipeline(audio_file)

    for segment in transcription['segments']:
        midpoint = (segment['start'] + segment['end']) / 2
        segment['speaker'] = None
        # Find the diarization turn that contains this segment's midpoint
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            if turn.start <= midpoint <= turn.end:
                segment['speaker'] = speaker
                break

    return transcription
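
Once segments carry speaker labels, rendering a readable transcript is a short step; a small illustrative helper:

def format_transcript(transcription):
    """Render labeled segments as 'SPEAKER: text' lines."""
    lines = []
    for segment in transcription['segments']:
        speaker = segment.get('speaker') or 'UNKNOWN'
        lines.append(f"{speaker}: {segment['text'].strip()}")
    return "\n".join(lines)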

Performance Metrics

  • Latency: ~2 seconds end-to-end
  • Accuracy: 95%+ for clear audio (English)
  • Resource Usage: ~2GB RAM, 40% CPU (M1 Mac)
  • Supported Languages: 99 languages via Whisper

What I Learned

Technical Skills

  • Audio processing fundamentals (sampling rates, buffers, formats)
  • WebSocket implementation and connection management
  • Whisper model optimization and deployment
  • Real-time data streaming patterns

Product Lessons

  • Privacy concerns are real - local-first was non-negotiable for users
  • UX matters: added visual feedback (waveforms, confidence indicators)
  • Edge cases are the product (background noise, accents, multiple speakers)

Process Insights

  • Started with MVP (basic file upload → transcription) before real-time
  • Dogfooding revealed issues I never would have anticipated
  • Community feedback drove 80% of feature additions

Future Improvements

  • Export transcripts (Markdown, SRT subtitles, JSON)
  • Custom vocabulary/domain terminology
  • Meeting summary generation (using GPT)
  • Mobile app (iOS/Android)
  • Multi-language real-time switching

Try It Yourself

The project is fully open source. Setup takes about 10 minutes:

# Clone the repository
git clone https://github.com/yourusername/whisperboard
cd whisperboard

# Install backend dependencies
pip install -r requirements.txt

# Download the Whisper model (fetched and cached on first load)
python -c "import whisper; whisper.load_model('large-v3')"

# Install frontend dependencies
cd frontend && npm install && cd ..

# Run backend
python main.py &

# Run frontend
cd frontend && npm start

Then visit http://localhost:3000 and allow microphone access.

Conclusion

Whisperboard started as a weekend project to solve my own note-taking problem and evolved into something I use daily. Building it taught me more about audio processing, real-time systems, and product development than months of reading documentation.

If you try it out or build something similar, I’d love to hear about your experience!


Have questions about the implementation? Found a bug? Want to contribute? All interactions welcome via GitHub or the contact page!