Training Whisper on a completely new language #2615
Replies: 1 comment
How to Train Whisper on a Completely New Language
- Audio format: WAV or FLAC, 16 kHz mono preferred.
- Transcript quality: clean, normalized text (lowercase, no special symbols unless language-specific).
- Dataset split: divide into training, validation, and test sets (e.g., 80%/10%/10%).
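The 80/10/10 split above can be sketched in a few lines of plain Python (the file names here are placeholders, not real data):

```python
import random

def split_dataset(pairs, train=0.8, val=0.1, seed=42):
    """Shuffle (audio_path, transcript) pairs and split train/val/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed so the split is reproducible
    n_train = int(len(pairs) * train)
    n_val = int(len(pairs) * val)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

# Hypothetical file names, for illustration only.
pairs = [(f"clip_{i:03d}.wav", f"transcript {i}") for i in range(100)]
train_set, val_set, test_set = split_dataset(pairs)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before splitting matters: if your recordings are ordered by speaker or session, an unshuffled split leaks nothing into validation and makes WER numbers misleading.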
- Fine-tune a pretrained Whisper model: recommended, because Whisper has already learned a lot about audio features.
- Train a new model from scratch: much harder, and requires huge data and compute.

For a new language, fine-tuning an existing Whisper model (like base or small) is usually the best approach.
- PyTorch (with GPU support)
- The whisper repository from OpenAI: https://github.com/openai/whisper
- Hugging Face transformers and datasets libraries (optional but recommended)
- Audio inputs (usually log-Mel spectrograms, generated on the fly or precomputed)
- Text targets tokenized by Whisper's tokenizer (you can reuse Whisper's tokenizer, but may need to add new tokens if your language has unique characters)
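To make "log-Mel spectrogram" concrete, here is a minimal NumPy sketch of the computation. In practice you would use `whisper.log_mel_spectrogram` from the OpenAI repo or the Hugging Face `WhisperFeatureExtractor`; the parameters below match Whisper's defaults (16 kHz audio, 25 ms windows, 10 ms hop, 80 Mel bins), but the filterbank construction is a rough approximation, not Whisper's exact implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=400, n_mels=80):
    """Triangular Mel filters mapping FFT bins to Mel bands (crude floor-based bin mapping)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):           # rising slope
            fb[i, k] = (k - left) / (center - left)
        for k in range(center, right):          # falling slope
            fb[i, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """audio: 1-D float array at 16 kHz -> (n_mels, n_frames) log-Mel features."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(audio[s:s + n_fft] * window)) ** 2
              for s in range(0, len(audio) - n_fft + 1, hop)]
    power = np.array(frames).T                       # (n_fft//2+1, n_frames)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power  # project onto Mel bands
    return np.log10(np.maximum(mel, 1e-10))          # clamp to avoid log(0)

audio = np.random.default_rng(0).standard_normal(16000)  # 1 s of noise stands in for speech
mel = log_mel_spectrogram(audio)
print(mel.shape)  # (80, 98)
```

The key point for training is the shape contract: Whisper's encoder consumes an (80, n_frames) feature matrix, and your data loader must produce exactly that (padded or trimmed to 30 s in the real pipeline).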
If your language has characters the tokenizer does not cover, extend the tokenizer vocabulary with your new tokens: reload the tokenizer, add the tokens, and then fine-tune the model with the updated tokenizer.
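With Hugging Face transformers this is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, so the embedding matrix grows to match. The toy class below only illustrates the bookkeeping those calls do (new tokens get IDs appended after the existing vocabulary); it is not the real tokenizer:

```python
class ToyTokenizer:
    """Minimal stand-in showing how add_tokens() assigns IDs to new tokens."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)  # token -> id

    def add_tokens(self, tokens):
        added = 0
        for tok in tokens:
            if tok not in self.vocab:          # skip tokens already in the vocab
                self.vocab[tok] = len(self.vocab)  # append at the next free ID
                added += 1
        return added  # same return convention as the real add_tokens()

    def __len__(self):
        return len(self.vocab)

tok = ToyTokenizer({"a": 0, "b": 1})
added = tok.add_tokens(["ç", "ž", "a"])  # "a" is already present
print(added, len(tok))  # 2 4
```

After the real `add_tokens` call, forgetting `resize_token_embeddings` is a classic failure mode: the model will index past the end of its embedding table on the new token IDs.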
1. Load a pretrained Whisper model (e.g., openai/whisper-small).
2. Prepare your dataset loader to return audio tensors and tokenized transcripts.
3. Define a training loop, or use the Hugging Face Trainer.
4. Use an appropriate loss function (cross-entropy on token outputs).
5. Train with a low learning rate (e.g., 1e-5 to 5e-5).
6. Validate periodically, and track loss and WER (word error rate).
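The loss in step 4 is ordinary token-level cross-entropy with padding positions ignored. A NumPy sketch of what it computes (in real training you would just use PyTorch's `nn.CrossEntropyLoss(ignore_index=-100)`, which Hugging Face models apply internally when you pass labels):

```python
import math
import numpy as np

def token_cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over non-ignored target tokens.

    logits:  (seq_len, vocab_size) unnormalized scores
    targets: (seq_len,) token IDs, with ignore_index marking padding
    """
    mask = targets != ignore_index
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    safe_targets = np.where(mask, targets, 0)  # any valid index for ignored slots
    nll = -log_probs[np.arange(len(targets)), safe_targets]
    return float((nll * mask).sum() / mask.sum())

# Uniform logits over a 5-token vocabulary -> loss of log(5) per real token.
logits = np.zeros((3, 5))
targets = np.array([1, 4, -100])  # last position is padding
loss = token_cross_entropy(logits, targets)
print(round(loss, 4))  # 1.6094
```

The log(vocab_size) value of a random model is a useful sanity check: at the start of fine-tuning your loss should already be well below it, since the pretrained decoder is far from random.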
Community repos such as openai/whisper, plus forks that add training scripts, are good starting points.
Training time depends on data size and model size.
How can I train Whisper on a completely new language? I have the transcribed data and the audio; how can I do it?