Training Whisper on a completely new language #2615
Replies: 1 comment
How to Train Whisper on a Completely New Language
- Audio format: WAV or FLAC, 16 kHz mono preferred.
- Transcript quality: clean, normalized text (lowercase, no special symbols unless language-specific).
- Dataset split: divide into training, validation, and test sets (e.g., 80%/10%/10%).
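The 80/10/10 split above can be sketched in a few lines of plain Python (the file names here are placeholders, not real data):

```python
import random

def split_dataset(pairs, train=0.8, val=0.1, seed=42):
    """Shuffle (audio_path, transcript) pairs and split train/val/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed so the split is reproducible
    n_train = int(len(pairs) * train)
    n_val = int(len(pairs) * val)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

# Hypothetical file names, for illustration only.
pairs = [(f"clip_{i:03d}.wav", f"transcript {i}") for i in range(100)]
train_set, val_set, test_set = split_dataset(pairs)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before splitting matters: if your recordings are ordered by speaker or session, an unshuffled split leaks nothing into validation and makes WER numbers misleading.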
- Fine-tune a pretrained Whisper model: recommended, because Whisper has already learned a lot about audio features.
- Train a new model from scratch: much harder, and requires huge data and compute.

For a new language, fine-tuning an existing Whisper model (like base or small) is usually the best approach.
- PyTorch (with GPU support)
- The whisper repository from OpenAI: https://github.com/openai/whisper
- Hugging Face transformers and datasets libraries (optional but recommended)
- Audio inputs (usually log-Mel spectrograms, generated on the fly or precomputed)
- Text targets tokenized by Whisper's tokenizer (you can reuse Whisper's tokenizer, but may need to add new tokens if your language has unique characters)
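To make "log-Mel spectrogram" concrete, here is a minimal NumPy sketch of the computation. In practice you would use `whisper.log_mel_spectrogram` from the OpenAI repo or the Hugging Face `WhisperFeatureExtractor`; the parameters below match Whisper's defaults (16 kHz audio, 25 ms windows, 10 ms hop, 80 Mel bins), but the filterbank construction is a rough approximation, not Whisper's exact implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=400, n_mels=80):
    """Triangular Mel filters mapping FFT bins to Mel bands (crude floor-based bin mapping)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):           # rising slope
            fb[i, k] = (k - left) / (center - left)
        for k in range(center, right):          # falling slope
            fb[i, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """audio: 1-D float array at 16 kHz -> (n_mels, n_frames) log-Mel features."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(audio[s:s + n_fft] * window)) ** 2
              for s in range(0, len(audio) - n_fft + 1, hop)]
    power = np.array(frames).T                       # (n_fft//2+1, n_frames)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power  # project onto Mel bands
    return np.log10(np.maximum(mel, 1e-10))          # clamp to avoid log(0)

audio = np.random.default_rng(0).standard_normal(16000)  # 1 s of noise stands in for speech
mel = log_mel_spectrogram(audio)
print(mel.shape)  # (80, 98)
```

The key point for training is the shape contract: Whisper's encoder consumes an (80, n_frames) feature matrix, and your data loader must produce exactly that (padded or trimmed to 30 s in the real pipeline).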
If your language has characters the tokenizer does not cover, extend the tokenizer vocabulary with your new tokens: reload the tokenizer, add the tokens, and then fine-tune the model with the updated tokenizer.
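With Hugging Face transformers this is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, so the embedding matrix grows to match. The toy class below only illustrates the bookkeeping those calls do (new tokens get IDs appended after the existing vocabulary); it is not the real tokenizer:

```python
class ToyTokenizer:
    """Minimal stand-in showing how add_tokens() assigns IDs to new tokens."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)  # token -> id

    def add_tokens(self, tokens):
        added = 0
        for tok in tokens:
            if tok not in self.vocab:          # skip tokens already in the vocab
                self.vocab[tok] = len(self.vocab)  # append at the next free ID
                added += 1
        return added  # same return convention as the real add_tokens()

    def __len__(self):
        return len(self.vocab)

tok = ToyTokenizer({"a": 0, "b": 1})
added = tok.add_tokens(["ç", "ž", "a"])  # "a" is already present
print(added, len(tok))  # 2 4
```

After the real `add_tokens` call, forgetting `resize_token_embeddings` is a classic failure mode: the model will index past the end of its embedding table on the new token IDs.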
1. Load a pretrained Whisper model (e.g., openai/whisper-small).
2. Prepare your dataset loader to return audio tensors and tokenized transcripts.
3. Define a training loop, or use the Hugging Face Trainer.
4. Use an appropriate loss function (cross-entropy on token outputs).
5. Train with a low learning rate (e.g., 1e-5 to 5e-5).
6. Validate periodically, and track loss and WER (word error rate).
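The loss in step 4 is ordinary token-level cross-entropy with padding positions ignored. A NumPy sketch of what it computes (in real training you would just use PyTorch's `nn.CrossEntropyLoss(ignore_index=-100)`, which Hugging Face models apply internally when you pass labels):

```python
import math
import numpy as np

def token_cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over non-ignored target tokens.

    logits:  (seq_len, vocab_size) unnormalized scores
    targets: (seq_len,) token IDs, with ignore_index marking padding
    """
    mask = targets != ignore_index
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    safe_targets = np.where(mask, targets, 0)  # any valid index for ignored slots
    nll = -log_probs[np.arange(len(targets)), safe_targets]
    return float((nll * mask).sum() / mask.sum())

# Uniform logits over a 5-token vocabulary -> loss of log(5) per real token.
logits = np.zeros((3, 5))
targets = np.array([1, 4, -100])  # last position is padding
loss = token_cross_entropy(logits, targets)
print(round(loss, 4))  # 1.6094
```

The log(vocab_size) value of a random model is a useful sanity check: at the start of fine-tuning your loss should already be well below it, since the pretrained decoder is far from random.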
Community repos such as openai/whisper, plus forks that add training scripts, are good starting points.
Training time depends on data size and model size.
How can I train Whisper on a completely new language? I have the transcribed data and the audio; how can I do it?