Real-time Whisper Transcriptions #2638
Unanswered
aadhith-aivar asked this question in Q&A
Replies: 1 comment
Hi, I've gone through something similar before. It might not be best practice, but I'm sharing some experience:
Issue Description
We're building a real-time transcription system using Whisper large-v3, but we are running into fundamental architectural limitations around concurrent processing and real-time latency requirements.
Current Implementation
Whisper Model: OpenAI Whisper large-v3 (local deployment)
Architecture: 3 independent Whisper instances on 3 dedicated GPUs
Infrastructure: Kubernetes deployment with GPU allocation
Use Case: conversations requiring immediate transcription feedback
Problem Statement
Despite running dedicated instances in parallel across GPUs, we still face a sequential processing bottleneck within each instance, because Whisper's decoder generates tokens autoregressively.
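As a sketch of the routing layer this architecture implies (one work queue per GPU-pinned Whisper instance), here is a minimal round-robin dispatcher. The worker processes that would drain each queue are assumed, not shown:

```python
import itertools
from queue import Queue

class GpuDispatcher:
    """Round-robin incoming transcription requests across per-GPU queues.

    Sketch only: in the real deployment, each queue would be drained by one
    Whisper instance pinned to its own GPU; here we model just the routing.
    """

    def __init__(self, num_gpus: int = 3):
        self.queues = [Queue() for _ in range(num_gpus)]
        self._rr = itertools.cycle(range(num_gpus))

    def submit(self, request) -> int:
        """Enqueue a request on the next GPU's queue; return the GPU index."""
        gpu = next(self._rr)
        self.queues[gpu].put(request)
        return gpu
```

Round-robin keeps the dispatcher stateless apart from a counter; a production version might instead route by queue depth so a long request on one GPU does not delay others behind it.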
What We've Tried
✅ ThreadPoolExecutor: abandoned due to GPU memory contention between threads sharing a device
✅ Multiple Whisper Instances: achieved true GPU parallelism (3 concurrent requests)
✅ Audio Chunking: 800k-sample chunks (~50 s at 16 kHz) for memory management
✅ GPU Dedication: one instance pinned to each of 3 separate GPUs
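For reference, the 800k-sample chunking step can be sketched as below. The 1-second overlap between chunks is an assumption to avoid cutting words at boundaries, not part of the original setup:

```python
import numpy as np

CHUNK_SAMPLES = 800_000    # ~50 s at 16 kHz, matching the chunk size above
OVERLAP_SAMPLES = 16_000   # 1 s overlap (assumed) so words aren't split at edges

def chunk_audio(waveform: np.ndarray,
                chunk_samples: int = CHUNK_SAMPLES,
                overlap: int = OVERLAP_SAMPLES) -> list[np.ndarray]:
    """Split a mono 16 kHz waveform into fixed-size, overlapping chunks."""
    step = chunk_samples - overlap
    chunks = []
    for start in range(0, len(waveform), step):
        chunks.append(waveform[start:start + chunk_samples])
        if start + chunk_samples >= len(waveform):
            break  # last chunk reached the end of the audio
    return chunks
```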
Current Performance
Single Request: ~13 seconds
Concurrent Requests: 3 can run simultaneously (true parallel)
Sequential Bottleneck: Still exists within each instance due to Whisper architecture
Questions for Community
Real-time Architecture: Has anyone successfully architected Whisper for real-time transcription?
WebSocket Streaming: Is WebSocket streaming with audio chunks a viable approach for real-time feedback?
Alternative Models: Are there Whisper variants or alternatives better suited for real-time use cases?
Batching Strategies: Can we batch multiple conversation streams while maintaining real-time response?
Architecture Patterns: What are the best practices for real-time ASR in production systems?
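On the WebSocket question: the server side largely reduces to buffering incoming PCM frames until a transcribable window is ready, then handing that window to a Whisper instance. A minimal sketch of that buffering, with an assumed 5-second window and int16 PCM; a real system would add VAD-based endpointing and overlap between windows:

```python
from typing import Optional

class StreamingBuffer:
    """Accumulate incoming int16 PCM frames; emit fixed windows for transcription.

    Sketch under assumptions: 16 kHz mono int16 audio, fixed-size windows.
    The transcription call that would consume each window is not shown.
    """

    def __init__(self, window_samples: int = 16_000 * 5):  # 5 s windows (assumed)
        self.window_samples = window_samples
        self._frames: list[bytes] = []
        self._buffered = 0  # samples currently buffered

    def feed(self, frame: bytes) -> Optional[bytes]:
        """Add one frame of int16 PCM; return a full window once enough arrives."""
        self._frames.append(frame)
        self._buffered += len(frame) // 2  # 2 bytes per int16 sample
        if self._buffered >= self.window_samples:
            window = b"".join(self._frames)
            self._frames.clear()
            self._buffered = 0
            return window
        return None
```

In a WebSocket handler, each received binary message would be passed to `feed()`, and any returned window queued for transcription, so the client gets feedback every few seconds instead of waiting for the full recording.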
Technical Requirements
Environment
GPUs: 3x NVIDIA GPUs (80GB VRAM total)
Framework: Python with Transformers library
Deployment: Kubernetes with GPU scheduling
Audio: 16kHz WAV, variable length (30 seconds to 10+ minutes)
Expected Outcome
Looking for architectural guidance, implementation examples, or alternative approaches to achieve real-time transcription for hospital conversations while maintaining Whisper's accuracy.