This repository contains a Jupyter notebook for qualitative researchers to transcribe audio or video files, diarize speakers, and convert the results into various text formats (CSV, TXT, JSON, and VTT). The notebook uses advanced transcription and diarization capabilities provided by Whisper and WhisperX, as well as the pyannote speaker-diarization-3.1 and segmentation-3.0 models from Hugging Face*.
*A free Hugging Face token is required specifically for the diarization aspects. The code will not work without it.
The code is derived from and builds on a Medium article, with the following goals:
- I wanted the ability to do batch transcriptions of audio files found in multiple subdirectories.
- I wanted to take advantage of WhisperX's word level time stamping.
- I wanted to utilize pyannote's speaker diarization capabilities.
- I wanted to generate CSV, TXT, JSON, and VTT files for each audio file transcribed.
- I wanted the ability to anonymize specific names and places during transcription.
- Device and Configuration Setup: Sets up the device (GPU or CPU) and other configuration variables like batch size, compute type, and model type (see the sketch after this list).
- Library Imports: Imports necessary libraries including PyTorch, WhisperX, and others for handling audio files, text processing, and file I/O.
- Path and File Type Setup: Defines paths to your audio files and output directories and specifies the types of audio files to process.
- Pseudonym Loading: Loads a CSV file containing pseudonyms for anonymizing transcripts.
- Audio Processing Functions: Includes functions to find audio files, get file modification dates, anonymize text, convert segments to different formats, and process each audio file.
- Main Function Execution: Finds all audio files in the specified directory, processes them, and saves the transcripts in multiple formats (CSV, TXT, JSON, VTT).
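As an illustration, the device and configuration setup typically looks something like the sketch below (the exact variable names and values are assumptions; check the notebook cell for the real ones):

import torch

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative WhisperX configuration values
batch_size = 16                                           # reduce if you run out of GPU memory
compute_type = "float16" if device == "cuda" else "int8"  # half precision on GPU, int8 on CPU
model_type = "large-v2"                                   # Whisper model size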
WhisperX documentation found here: https://github.com/m-bain/whisperX
================================================
- Install Git
- Install FFMPEG and add to PATH
- Install Anaconda
================================================
- Create and activate a Conda environment
conda create -n whisperxtranscription-env python=3.10
conda activate whisperxtranscription-env
- Install PyTorch https://pytorch.org/get-started/locally/
pip install numpy==1.26.3 torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
- Install the WhisperX package and additional packages
pip install whisperx==3.2.0
pip install speechbrain ipykernel ipywidgets charset-normalizer pandas nltk plotly matplotlib webvtt-py pypi-json srt python-dotenv tqdm
- Make sure to choose this environment as the kernel in the Jupyter notebook: https://code.visualstudio.com/docs/datascience/jupyter-kernel-management
- There is an .env file at the same level as this notebook file; paste your Hugging Face token between the quotation marks and save the file:
HF_TOKEN="REPLACEWITHHUGGINGFACETOKENHERE"
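The notebook can then read the token with python-dotenv (installed above); a minimal sketch, assuming the variable is named hf_token:

from dotenv import load_dotenv
import os

load_dotenv()                      # reads the .env file next to the notebook
hf_token = os.getenv("HF_TOKEN")   # passed to pyannote for diarization later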
=================================================
- Install Visual Studio Community https://visualstudio.microsoft.com/downloads/
- Install NVIDIA CUDA Toolkit 12.1 https://developer.nvidia.com/cuda-12-1-0-download-archive
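Verify the GPU-enabled install from within the notebook by running: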
import torch
print(torch.__version__)               # e.g. 2.3.0+cu121
print(torch.cuda.is_available())       # should print True if CUDA is set up correctly
print(torch.cuda.get_device_name(0))   # the name of your GPU
=================================================
Pseudonyms CSV: Ensure you have a CSV file named pseudonyms.csv in the data directory. This file should contain the columns name and pseudonym for anonymizing the transcripts. This isn't a requirement; a popup will ask whether you are using one.
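For example, a pseudonyms.csv might look like this (the entries are made up):

name,pseudonym
Jane Doe,Participant 1
Springfield,Town A

Internally, anonymization amounts to straightforward string substitution; a minimal sketch, assuming a helper of this shape (the notebook's actual function may differ):

import pandas as pd

def anonymize_text(text, pseudonyms_csv):
    # Replace every real name in the transcript with its pseudonym
    mapping = pd.read_csv(pseudonyms_csv)
    for name, pseudonym in zip(mapping["name"], mapping["pseudonym"]):
        text = text.replace(name, pseudonym)
    return text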
The main function finds all audio files of the specified types in the folder you choose when you run the next code cell, processes them, and saves the transcripts. To run the code, simply execute the cell.
Update the file type(s) of your audio files:
file_type1 = '.wav'
file_type2 = '.mp3'
file_type3 = '.ogg'
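Under the hood, finding the files is a recursive search for those extensions; a minimal sketch, assuming a helper along these lines:

from pathlib import Path

def find_audio_files(root_dir, extensions=(".wav", ".mp3", ".ogg")):
    # Recursively collect every file whose extension matches, including subfolders
    return [p for p in Path(root_dir).rglob("*") if p.suffix.lower() in extensions]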
Just run this cell. You shouldn't need to change anything unless you want to output fewer or more file types. These are mostly function definitions that are then called at the end of the cell (a condensed sketch of the pipeline they implement follows below).
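The functions follow the standard WhisperX pipeline of transcribe, align, diarize, and assign speakers; this sketch is based on the WhisperX documentation (device, batch_size, compute_type, model_type, and hf_token are the configuration values sketched earlier, and audio_file stands for one of the discovered files):

import whisperx

audio = whisperx.load_audio(audio_file)

# 1. Transcribe with the Whisper model
model = whisperx.load_model(model_type, device, compute_type=compute_type)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align the output for word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# 3. Diarize with pyannote (this is where the Hugging Face token is required)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio)

# 4. Attach speaker labels to the transcript segments and words
result = whisperx.assign_word_speakers(diarize_segments, result)

When you run the cell, the following happens: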
- You should get a popup asking to choose the folder where the files are found (It will also search subfolders).
- You should then get a popup asking where the transcription files should be placed (It will replicate the folder structure in which they were found).
- You will also see a popup asking if you want to anonymize with a pseudonyms.csv file, and if so where it is located.
- You should then see an output similar to the following (just ignore the warnings):
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.
- When complete, you will see where each file was written and the folders they were written to.
Output Files: The transcripts will be saved in the specified output directory in multiple formats: CSV, TXT, JSON, and VTT.
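As an illustration, the VTT conversion can be as simple as formatting each diarized segment with its timestamps; a minimal sketch (the notebook's own converter may differ):

def to_vtt(segments, out_path):
    def ts(seconds):
        # Format seconds as HH:MM:SS.mmm for WebVTT cue timings
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in segments:
            f.write(f"{ts(seg['start'])} --> {ts(seg['end'])}\n")
            f.write(f"{seg.get('speaker', 'SPEAKER')}: {seg['text'].strip()}\n\n")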
This code is designed to make it easy to process and transcribe large batches of audio or video files while ensuring anonymity through pseudonymization. Happy transcribing!