You can follow the installation instructions in the verl documentation.
The Push Dummy Dataset step defines a dummy dataset and pushes it to the Hugging Face Hub.
The Make Local HDFS Directory step creates a local HDFS directory and copies the dataset into it.
```bash
python3 examples/data_preprocess/tts.py \
    --data_source Seungyoun/dummy_llasa_tts_text \
    --local_dir ~/data/llasa-tts-rl-grpo
```

This tutorial uses 3x A6000 GPUs: 2 for training and 1 for Whisper NLL calculation.
We define the GRPO reward objective as follows. The reward is computed from two metrics: the Character Error Rate (CER) and the Negative Log-Likelihood (NLL) obtained from Whisper. The formula is given by:

$$ R = \lambda_c \cdot U_{CER} + \lambda_n \cdot U_{NLL} $$
where:

- CER Utility:
  $$ U_{CER} = 1 - \tanh(\beta_c \cdot CER) $$
- NLL Utility:
  $$ U_{NLL} = e^{-\frac{NLL}{\tau_n}} $$
- CER: Character Error Rate (difference between the ground truth and Whisper's transcript).
- NLL: Negative Log-Likelihood from Whisper (a measure of speech synthesis quality).
- $\beta_c$, $\tau_n$: parameters controlling the sensitivity of CER and NLL respectively.
- $\lambda_c$, $\lambda_n$: weights determining the relative importance of CER and NLL.
This results in a reward value ranging between 0 and 1, with higher values indicating better quality.
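The reward above can be sketched in Python. The $\beta_c$, $\tau_n$, $\lambda_c$, and $\lambda_n$ values below are illustrative placeholders, not the ones used in the training script:

```python
import math

def tts_reward(cer: float, nll: float,
               beta_c: float = 5.0, tau_n: float = 2.0,
               lambda_c: float = 0.5, lambda_n: float = 0.5) -> float:
    """Combine CER and Whisper-NLL utilities into a reward in [0, 1]."""
    u_cer = 1.0 - math.tanh(beta_c * cer)  # 1 when CER == 0, decays toward 0
    u_nll = math.exp(-nll / tau_n)         # 1 when NLL == 0, decays with NLL
    return lambda_c * u_cer + lambda_n * u_nll

# A perfect transcript with zero NLL gets the maximum reward.
print(tts_reward(cer=0.0, nll=0.0))  # 1.0
```

With $\lambda_c + \lambda_n = 1$ and both utilities bounded by 1, the reward stays in $(0, 1]$ as stated above.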
The Whisper server is a standalone service that computes the NLL of generated speech under the Whisper model. Start it on the GPU reserved for reward calculation:
```bash
CUDA_VISIBLE_DEVICES=2 \
python3 tts/whisper_server.py \
    --port 8001 \
    --model large-v3
```

then launch the GRPO training run:

```bash
WHISPER_SERVER=http://localhost:8001 \
nohup bash ./examples/grpo_trainer/run_llasa_tts_grpo.sh > verl_grpo_1b.log 2>&1 &
```

We performed continual training of a Korean TTS model starting from the LLASA-1B checkpoint and evaluated its performance on our internal dataset.
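The reward worker queries the server over HTTP during training. A minimal client sketch follows; the `/nll` route and JSON payload fields are assumptions for illustration, so check `tts/whisper_server.py` for the actual endpoint:

```python
import json
import urllib.request

WHISPER_SERVER = "http://localhost:8001"  # matches the server started above

def build_nll_request(text: str, audio_path: str,
                      server: str = WHISPER_SERVER) -> urllib.request.Request:
    """Build the HTTP request asking the server for an NLL score.

    The `/nll` route and the JSON fields are illustrative assumptions.
    """
    payload = json.dumps({"text": text, "audio_path": audio_path}).encode()
    return urllib.request.Request(
        f"{server}/nll", data=payload,
        headers={"Content-Type": "application/json"},
    )

def query_whisper_nll(text: str, audio_path: str) -> float:
    """Send the request and parse the returned NLL value."""
    with urllib.request.urlopen(build_nll_request(text, audio_path)) as resp:
        return float(json.load(resp)["nll"])
```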
The results clearly indicate an improvement when applying GRPO:
- LLasa1B + 15K Korean: CER = 0.0266
- LLasa1B + 15K Korean + GRPO: CER = 0.0204
These results show that GRPO noticeably reduces the Character Error Rate (CER), indicating improved synthesis quality.
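For reference, the CER reported above is edit distance normalized by reference length. A minimal sketch of the metric (the actual pipeline likely uses a library such as `jiwer`):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between r[:i-1] and h[:j]
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[len(h)] / max(len(r), 1)

print(cer("hello", "hallo"))  # 0.2 (1 substitution / 5 characters)
```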
