This repository contains a Jupyter notebook for qualitative researchers to transcribe audio or video files, diarize speakers, and convert the results into various text formats (CSV, TXT, JSON, and VTT). The notebook uses advanced transcription and diarization capabilities provided by Whisper and WhisperX, as well as the pyannote speaker-diarization-3.1 and segmentation-3.0 models from Hugging Face*.
*A free Hugging Face token is required specifically for the diarization aspects. The code will not work without it.
The code is derived from and builds on a Medium article, with the following goals:
- I wanted the ability to do batch transcriptions of audio files found in multiple subdirectories.
- I wanted to take advantage of WhisperX's word level time stamping.
- I wanted to utilize pyannote's speaker diarization capabilities.
- I wanted to generate CSV, TXT, JSON, and VTT files for each audio file transcribed.
- I wanted the ability to anonymize specific names and places during transcription.
- Device and Configuration Setup: Sets up the device (GPU or CPU) and other configuration variables like batch size, compute type, and model type (see the sketch after this list).
- Library Imports: Imports necessary libraries including PyTorch, WhisperX, and others for handling audio files, text processing, and file I/O.
- Path and File Type Setup: Defines paths to your audio files and output directories and specifies the types of audio files to process.
- Pseudonym Loading: Loads a CSV file containing pseudonyms for anonymizing transcripts.
- Audio Processing Functions: Includes functions to find audio files, get file modification dates, anonymize text, convert segments to different formats, and process each audio file.
- Main Function Execution: Finds all audio files in the specified directory, processes them, and saves the transcripts in multiple formats (CSV, TXT, JSON, VTT).
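As an illustration, the device and configuration setup typically looks something like the sketch below (the exact variable names and values are assumptions; check the notebook cell for the real ones):

import torch

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative WhisperX configuration values
batch_size = 16                                           # reduce if you run out of GPU memory
compute_type = "float16" if device == "cuda" else "int8"  # half precision on GPU, int8 on CPU
model_type = "large-v2"                                   # Whisper model size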
WhisperX documentation found here: https://github.com/m-bain/whisperX
================================================
- Install Git
- Install FFMPEG and add to PATH
- Install Anaconda
================================================
- Create and activate a Conda environment
conda create -n whisperxtranscription-env python=3.10
conda activate whisperxtranscription-env
- Install PyTorch https://pytorch.org/get-started/locally/
pip install numpy==1.26.3 torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
- Install the WhisperX package and additional packages
pip install whisperx==3.2.0
pip install speechbrain ipykernel ipywidgets charset-normalizer pandas nltk plotly matplotlib webvtt-py pypi-json srt python-dotenv tqdm
- Make sure to choose this environment as the kernel in the Jupyter notebook: https://code.visualstudio.com/docs/datascience/jupyter-kernel-management
- There is an .env file at the same level as this notebook file; paste your Hugging Face token between the quotation marks and save the file:
HF_TOKEN="REPLACEWITHHUGGINGFACETOKENHERE"
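The notebook can then read the token with python-dotenv (installed above); a minimal sketch, assuming the variable is named hf_token:

from dotenv import load_dotenv
import os

load_dotenv()                      # reads the .env file next to the notebook
hf_token = os.getenv("HF_TOKEN")   # passed to pyannote for diarization later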
=================================================
- Install Visual Studio Community https://visualstudio.microsoft.com/downloads/
- Install NVIDIA CUDA Toolkit 12.1 https://developer.nvidia.com/cuda-12-1-0-download-archive
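Verify the GPU-enabled install from within the notebook by running: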
import torch
print(torch.__version__)               # e.g. 2.3.0+cu121
print(torch.cuda.is_available())       # should print True if CUDA is set up correctly
print(torch.cuda.get_device_name(0))   # the name of your GPU
=================================================
Pseudonyms CSV: Ensure you have a CSV file named pseudonyms.csv in the data directory. This file should contain the columns name and pseudonym for anonymizing the transcripts. This isn't a requirement; a popup will ask whether you are using one.
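For example, a pseudonyms.csv might look like this (the entries are made up):

name,pseudonym
Jane Doe,Participant 1
Springfield,Town A

Internally, anonymization amounts to straightforward string substitution; a minimal sketch, assuming a helper of this shape (the notebook's actual function may differ):

import pandas as pd

def anonymize_text(text, pseudonyms_csv):
    # Replace every real name in the transcript with its pseudonym
    mapping = pd.read_csv(pseudonyms_csv)
    for name, pseudonym in zip(mapping["name"], mapping["pseudonym"]):
        text = text.replace(name, pseudonym)
    return text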
The main function finds all audio files of the specified types in the folder you choose when you run the next code cell, processes them, and saves the transcripts. To run the code, simply execute the cell.
Update the file type(s) of your audio files:
file_type1 = '.wav'
file_type2 = '.mp3'
file_type3 = '.ogg'
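Under the hood, finding the files is a recursive search for those extensions; a minimal sketch, assuming a helper along these lines:

from pathlib import Path

def find_audio_files(root_dir, extensions=(".wav", ".mp3", ".ogg")):
    # Recursively collect every file whose extension matches, including subfolders
    return [p for p in Path(root_dir).rglob("*") if p.suffix.lower() in extensions]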
Just run this cell. You shouldn't need to change anything unless you want to output fewer or more file types. These are mostly function definitions that are then called at the end of the cell (a condensed sketch of the pipeline they implement follows below).
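The functions follow the standard WhisperX pipeline of transcribe, align, diarize, and assign speakers; this sketch is based on the WhisperX documentation (device, batch_size, compute_type, model_type, and hf_token are the configuration values sketched earlier, and audio_file stands for one of the discovered files):

import whisperx

audio = whisperx.load_audio(audio_file)

# 1. Transcribe with the Whisper model
model = whisperx.load_model(model_type, device, compute_type=compute_type)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align the output for word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# 3. Diarize with pyannote (this is where the Hugging Face token is required)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio)

# 4. Attach speaker labels to the transcript segments and words
result = whisperx.assign_word_speakers(diarize_segments, result)

When you run the cell, the following happens: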
- You should get a popup asking to choose the folder where the files are found (It will also search subfolders).
- You should then get a popup asking where the transcription files should be placed (It will replicate the folder structure in which they were found).
- You will also see a popup asking if you want to anonymize with a pseudonyms.csv file, and if so where it is located.
- You should then see an output similar to the following (just ignore the warnings):
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.
- When complete, you will see where each file was written and the folders they were written to.
Output Files: The transcripts will be saved in the specified output directory in multiple formats: CSV, TXT, JSON, and VTT.
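As an illustration, the VTT conversion can be as simple as formatting each diarized segment with its timestamps; a minimal sketch (the notebook's own converter may differ):

def to_vtt(segments, out_path):
    def ts(seconds):
        # Format seconds as HH:MM:SS.mmm for WebVTT cue timings
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in segments:
            f.write(f"{ts(seg['start'])} --> {ts(seg['end'])}\n")
            f.write(f"{seg.get('speaker', 'SPEAKER')}: {seg['text'].strip()}\n\n")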
This code is designed to make it easy to process and transcribe large batches of audio or video files while ensuring anonymity through pseudonymization. Happy transcribing!