Project Showcase: Whisperboard - Real-time Transcription Dashboard
Live Demo: whisperboard.example.com
Source Code: github.com/yourusername/whisperboard
Tech Stack: Python, FastAPI, WebSockets, Whisper, React, TailwindCSS
The Problem
I regularly attend virtual meetings and workshops, and found myself frantically taking notes while trying to stay engaged. Existing transcription tools were expensive, relied on cloud services with privacy concerns, or had poor accuracy.
I wanted something that:
- Ran locally (privacy-first)
- Worked in real-time
- Had high accuracy
- Was simple to use
The Solution
Whisperboard is a real-time audio transcription dashboard that uses OpenAI’s Whisper model to transcribe speech as you speak. It captures system audio, processes it in chunks, and displays the transcription in a clean web interface.
Architecture Overview
```
┌─────────────────┐
│  Audio Capture  │  (PortAudio / PyAudio)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Audio Buffers  │  (Circular buffer, 30s chunks)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Whisper Engine  │  (whisper-large-v3, local)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  WebSocket API  │  (FastAPI + WS)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ React Frontend  │  (Real-time display)
└─────────────────┘
```
Key Features
1. Real-Time Processing
Audio is captured in 30-second chunks with overlapping windows. This provides context for better accuracy while maintaining low latency.
```python
SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio

class AudioProcessor:
    def __init__(self, chunk_duration=30, overlap=5):
        self.chunk_duration = chunk_duration  # seconds of audio per chunk
        self.overlap = overlap                # seconds shared between chunks
        self.buffer = CircularBuffer(
            maxsize=chunk_duration * SAMPLE_RATE
        )

    def process_chunk(self, audio_data):
        """Process overlapping chunks for context."""
        self.buffer.extend(audio_data)
        if self.buffer.is_full():
            # Drain the buffer but keep the trailing overlap so the
            # next chunk shares context with this one
            chunk = self.buffer.get_with_overlap(self.overlap)
            return self.transcribe(chunk)
        return None
```
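The `CircularBuffer` here is a small helper not shown in the snippet. A minimal sketch of what it might look like, built on `collections.deque` (the class shape and the `SAMPLE_RATE` constant are my assumptions, not the repo's exact code):

```python
from collections import deque

SAMPLE_RATE = 16000  # assumed sample rate (Whisper uses 16 kHz)

class CircularBuffer:
    """Fixed-size sample buffer that drops the oldest samples when full."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._data = deque(maxlen=maxsize)

    def extend(self, samples):
        # deque with maxlen silently evicts the oldest samples
        self._data.extend(samples)

    def is_full(self):
        return len(self._data) == self.maxsize

    def get_with_overlap(self, overlap_seconds):
        """Return the whole buffer, then keep only the last `overlap_seconds`
        of audio so the next chunk shares context with this one."""
        chunk = list(self._data)
        keep = overlap_seconds * SAMPLE_RATE
        tail = chunk[-keep:] if keep else []
        self._data = deque(tail, maxlen=self.maxsize)
        return chunk
```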
2. Smart Chunking
Uses voice activity detection (VAD) to split at natural pauses:
```python
import webrtcvad

vad = webrtcvad.Vad(3)  # Aggressiveness 0-3 (3 = most aggressive filtering)

def split_at_pauses(audio, sample_rate=16000):
    """Split 16-bit PCM audio bytes at detected pauses."""
    frame_duration = 30  # ms; webrtcvad only accepts 10, 20, or 30 ms frames
    frame_size = int(sample_rate * frame_duration / 1000) * 2  # bytes (2 per sample)
    chunks = []
    current_chunk = bytearray()
    # Stop before a partial trailing frame: is_speech() rejects short frames
    for i in range(0, len(audio) - frame_size + 1, frame_size):
        frame = audio[i:i + frame_size]
        if vad.is_speech(frame, sample_rate):
            current_chunk.extend(frame)
        elif current_chunk:
            chunks.append(bytes(current_chunk))
            current_chunk = bytearray()
    if current_chunk:  # don't drop trailing speech
        chunks.append(bytes(current_chunk))
    return chunks
```
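One gotcha: `webrtcvad` wants frames of 16-bit PCM bytes, so audio captured as floats in [-1.0, 1.0] has to be converted first. A stdlib-only sketch (the helper name is mine, not from the repo; `array('h')` uses native byte order, which is little-endian on most platforms):

```python
import array

def float_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit PCM bytes,
    the frame format webrtcvad expects."""
    # Clamp to valid range before scaling to avoid integer overflow
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return array.array('h', (int(s * 32767) for s in clipped)).tobytes()
```

For a 30 ms frame at 16 kHz, that is 480 samples in, 960 bytes out.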
3. WebSocket Updates
Real-time transcription delivery:
```python
import time

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
active_connections: list[WebSocket] = []

@app.websocket("/ws/transcribe")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    active_connections.append(websocket)
    try:
        while True:
            # Receive a chunk of raw audio from the browser
            audio_data = await websocket.receive_bytes()
            # Transcribe without blocking other connections
            transcription = await process_audio_async(audio_data)
            # Send the result back
            await websocket.send_json({
                'timestamp': time.time(),
                'text': transcription['text'],
                'confidence': transcription.get('confidence', 1.0)
            })
    except WebSocketDisconnect:
        active_connections.remove(websocket)
```
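The `active_connections` list exists so a finished transcription can be pushed to every open client, not just the sender. A sketch of that broadcast step (error handling kept minimal; inside an endpoint you would simply `await` it):

```python
import asyncio

async def broadcast(payload: dict, connections: list):
    """Send one JSON payload to every connected client,
    pruning sockets that fail mid-send."""
    dead = []
    for ws in connections:
        try:
            await ws.send_json(payload)
        except Exception:
            # Client went away between messages; drop it after the loop
            dead.append(ws)
    for ws in dead:
        connections.remove(ws)
```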
4. Frontend Interface
Clean React interface with auto-scroll and search:
```jsx
function TranscriptionView() {
  const [transcripts, setTranscripts] = useState([]);
  const ws = useRef(null);

  useEffect(() => {
    ws.current = new WebSocket('ws://localhost:8000/ws/transcribe');
    ws.current.onmessage = (event) => {
      const data = JSON.parse(event.data);
      setTranscripts(prev => [...prev, {
        timestamp: new Date(data.timestamp * 1000),
        text: data.text,
        confidence: data.confidence
      }]);
    };
    return () => ws.current?.close();
  }, []);

  return (
    <div className="transcript-view">
      {transcripts.map((t, idx) => (
        <div key={idx} className="transcript-entry">
          <span className="timestamp">
            {t.timestamp.toLocaleTimeString()}
          </span>
          <p className="text">{t.text}</p>
          <span className="confidence">
            {(t.confidence * 100).toFixed(0)}%
          </span>
        </div>
      ))}
    </div>
  );
}
```
Challenges Overcome
1. Latency vs. Accuracy Trade-off
Shorter chunks mean lower latency but worse accuracy (less context). I settled on 30-second chunks with a 5-second overlap, which gave roughly 2-second latency with high accuracy.
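The overlap means consecutive chunks transcribe some words twice, so the results have to be stitched together. One way to deduplicate (a sketch of the idea, not the repo's exact code) is to find the longest word-level overlap at the chunk boundary:

```python
def merge_overlap(prev_words, new_words, max_overlap=20):
    """Merge two word lists whose boundary may repeat words: find the longest
    suffix of prev_words that equals a prefix of new_words, and keep it once."""
    limit = min(max_overlap, len(prev_words), len(new_words))
    # Try the longest candidate overlap first
    for k in range(limit, 0, -1):
        if prev_words[-k:] == new_words[:k]:
            return prev_words + new_words[k:]
    # No overlap found; concatenate as-is
    return prev_words + new_words
```

For example, merging `"the quick brown"` with `"quick brown fox"` yields the words of `"the quick brown fox"`. Word-level matching is crude (Whisper may transcribe the overlap differently in each chunk), but it covers the common case.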
2. Memory Management
Whisper models are large (1.5GB+ for large-v3). Solution: Use whisper.cpp for CPU inference with quantization, reducing memory usage by 75%.
```shell
# Convert model to GGML format
python convert-whisper-to-ggml.py large-v3 --outdir ./models

# Run with quantization
./whisper.cpp -m models/ggml-large-v3-q5_0.bin -l en -f audio.wav
```
3. Audio Capture Permissions
macOS requires explicit permissions for audio capture. Added clear UI prompts and fallback to manual file uploads.
4. Handling Multiple Speakers
Initial version didn’t distinguish speakers. Added pyannote.audio for speaker diarization:
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1"
)

def add_speaker_labels(transcription, audio_file):
    """Add speaker labels to transcription segments."""
    diarization = pipeline(audio_file)
    for segment in transcription['segments']:
        start = segment['start']
        # Find the diarization turn that contains this segment's start time
        segment['speaker'] = next(
            (label for turn, _, label in diarization.itertracks(yield_label=True)
             if turn.start <= start <= turn.end),
            'UNKNOWN',
        )
    return transcription
```
Performance Metrics
- Latency: ~2 seconds end-to-end
- Accuracy: 95%+ for clear audio (English)
- Resource Usage: ~2GB RAM, 40% CPU (M1 Mac)
- Supported Languages: 99 languages via Whisper
What I Learned
Technical Skills
- Audio processing fundamentals (sampling rates, buffers, formats)
- WebSocket implementation and connection management
- Whisper model optimization and deployment
- Real-time data streaming patterns
Product Lessons
- Privacy concerns are real: local-first was non-negotiable for users
- UX matters: added visual feedback (waveforms, confidence indicators)
- Edge cases are the product (background noise, accents, multiple speakers)
Process Insights
- Started with MVP (basic file upload → transcription) before real-time
- Dogfooding revealed issues I never would have anticipated
- Community feedback drove 80% of feature additions
Future Improvements
- Export transcripts (Markdown, SRT subtitles, JSON)
- Custom vocabulary/domain terminology
- Meeting summary generation (using GPT)
- Mobile app (iOS/Android)
- Multi-language real-time switching
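Of these, SRT export is simple enough to sketch here (a hypothetical helper, not yet in the repo, assuming segments arrive as `(start, end, text)` tuples in seconds):

```python
def to_srt(segments):
    """Render (start_sec, end_sec, text) tuples as an SRT subtitle string."""
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        # Each SRT cue: index, time range, text, blank separator line
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}\n")
    return "\n".join(blocks)
```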
Try It Yourself
The project is fully open source. Setup takes about 10 minutes:
```shell
# Clone the repository
git clone https://github.com/yourusername/whisperboard
cd whisperboard

# Install backend dependencies
pip install -r requirements.txt

# Download Whisper model
python -m whisper.download large-v3

# Install frontend dependencies
cd frontend && npm install

# Run backend
python main.py &

# Run frontend
npm start
```
Then visit http://localhost:3000 and allow microphone access.
Conclusion
Whisperboard started as a weekend project to solve my own note-taking problem and evolved into something I use daily. Building it taught me more about audio processing, real-time systems, and product development than months of reading documentation.
If you try it out or build something similar, I’d love to hear about your experience!
Links
- Live Demo: whisperboard.example.com
- GitHub: github.com/yourusername/whisperboard
- Documentation: docs.whisperboard.example.com
- Discussion: GitHub Discussions
Have questions about the implementation? Found a bug? Want to contribute? All interactions welcome via GitHub or the contact page!