Real-time Whisper Transcriptions #2638
Unanswered
aadhith-aivar asked this question in Q&A
Replies: 1 comment
Hi, I've gone through something similar before. It might not be best practice, but I'm sharing some experience:
Issue Description
We're building a real-time transcription system using Whisper large-v3, but we are running into fundamental architectural limitations around concurrent processing and real-time latency requirements.
Current Implementation
Whisper Model: OpenAI Whisper large-v3 (local deployment)
Architecture: 3 independent Whisper instances on 3 dedicated GPUs
Infrastructure: Kubernetes deployment with GPU allocation
Use Case: conversations requiring immediate transcription feedback
Problem Statement
Despite running dedicated instances in parallel across GPUs, we still face a sequential processing bottleneck within each instance, because Whisper's decoder generates tokens autoregressively.
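As a sketch of the routing layer this architecture implies (one work queue per GPU-pinned Whisper instance), here is a minimal round-robin dispatcher. The worker processes that would drain each queue are assumed, not shown:

```python
import itertools
from queue import Queue

class GpuDispatcher:
    """Round-robin incoming transcription requests across per-GPU queues.

    Sketch only: in the real deployment, each queue would be drained by one
    Whisper instance pinned to its own GPU; here we model just the routing.
    """

    def __init__(self, num_gpus: int = 3):
        self.queues = [Queue() for _ in range(num_gpus)]
        self._rr = itertools.cycle(range(num_gpus))

    def submit(self, request) -> int:
        """Enqueue a request on the next GPU's queue; return the GPU index."""
        gpu = next(self._rr)
        self.queues[gpu].put(request)
        return gpu
```

Round-robin keeps the dispatcher stateless apart from a counter; a production version might instead route by queue depth so a long request on one GPU does not delay others behind it.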
What We've Tried
✅ ThreadPoolExecutor: abandoned due to GPU memory contention between threads sharing a device
✅ Multiple Whisper Instances: achieved true GPU parallelism (3 concurrent requests)
✅ Audio Chunking: 800k-sample chunks (~50 s at 16 kHz) for memory management
✅ GPU Dedication: one instance pinned to each of 3 separate GPUs
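For reference, the 800k-sample chunking step can be sketched as below. The 1-second overlap between chunks is an assumption to avoid cutting words at boundaries, not part of the original setup:

```python
import numpy as np

CHUNK_SAMPLES = 800_000    # ~50 s at 16 kHz, matching the chunk size above
OVERLAP_SAMPLES = 16_000   # 1 s overlap (assumed) so words aren't split at edges

def chunk_audio(waveform: np.ndarray,
                chunk_samples: int = CHUNK_SAMPLES,
                overlap: int = OVERLAP_SAMPLES) -> list[np.ndarray]:
    """Split a mono 16 kHz waveform into fixed-size, overlapping chunks."""
    step = chunk_samples - overlap
    chunks = []
    for start in range(0, len(waveform), step):
        chunks.append(waveform[start:start + chunk_samples])
        if start + chunk_samples >= len(waveform):
            break  # last chunk reached the end of the audio
    return chunks
```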
Current Performance
Single Request: ~13 seconds
Concurrent Requests: 3 can run simultaneously (true parallel)
Sequential Bottleneck: Still exists within each instance due to Whisper architecture
Questions for Community
Real-time Architecture: Has anyone successfully architected Whisper for real-time transcription?
WebSocket Streaming: Is WebSocket streaming with audio chunks a viable approach for real-time feedback?
Alternative Models: Are there Whisper variants or alternatives better suited for real-time use cases?
Batching Strategies: Can we batch multiple conversation streams while maintaining real-time response?
Architecture Patterns: What are the best practices for real-time ASR in production systems?
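On the WebSocket question: the server side largely reduces to buffering incoming PCM frames until a transcribable window is ready, then handing that window to a Whisper instance. A minimal sketch of that buffering, with an assumed 5-second window and int16 PCM; a real system would add VAD-based endpointing and overlap between windows:

```python
from typing import Optional

class StreamingBuffer:
    """Accumulate incoming int16 PCM frames; emit fixed windows for transcription.

    Sketch under assumptions: 16 kHz mono int16 audio, fixed-size windows.
    The transcription call that would consume each window is not shown.
    """

    def __init__(self, window_samples: int = 16_000 * 5):  # 5 s windows (assumed)
        self.window_samples = window_samples
        self._frames: list[bytes] = []
        self._buffered = 0  # samples currently buffered

    def feed(self, frame: bytes) -> Optional[bytes]:
        """Add one frame of int16 PCM; return a full window once enough arrives."""
        self._frames.append(frame)
        self._buffered += len(frame) // 2  # 2 bytes per int16 sample
        if self._buffered >= self.window_samples:
            window = b"".join(self._frames)
            self._frames.clear()
            self._buffered = 0
            return window
        return None
```

In a WebSocket handler, each received binary message would be passed to `feed()`, and any returned window queued for transcription, so the client gets feedback every few seconds instead of waiting for the full recording.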
Technical Requirements
Environment
GPUs: 3x NVIDIA GPUs (80GB VRAM total)
Framework: Python with Transformers library
Deployment: Kubernetes with GPU scheduling
Audio: 16kHz WAV, variable length (30 seconds to 10+ minutes)
Expected Outcome
Looking for architectural guidance, implementation examples, or alternative approaches to achieve real-time transcription for hospital conversations while maintaining Whisper's accuracy.