Transcription using tensor fails using model.transcribe(), can't use whisper.load_audio() as an alternative #1145
I am trying to use whisper in the backend of a website project. I need to be able to transcribe uploaded audio without saving it to or reading it from disk, so I am loading the bytes into a tensor, but model.transcribe() fails on it. Discussion #930 mentions a similar issue, but that answer does not work for my case, since whisper.load_audio() expects a path to a file on disk.
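Roughly what I am attempting (a sketch; the model size and function name are placeholders, and file_bytes comes from the upload handler):

```python
import numpy as np
import torch
import whisper

model = whisper.load_model("base")  # model size chosen arbitrarily

def transcribe_upload(file_bytes: bytes) -> str:
    # file_bytes is the raw uploaded file (e.g. MP3/WAV). It is still
    # encoded, not PCM, so wrapping it in a tensor does not decode it.
    audio = torch.from_numpy(np.frombuffer(file_bytes, dtype=np.uint8).copy())
    # Fails: transcribe() expects a decoded 16 kHz mono float32 waveform.
    return model.transcribe(audio)["text"]
```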
Replies: 1 comment 2 replies
I modified whisper.load_audio() to support bytes; now I can transcribe audio using a file's bytes:

```python
import ffmpeg
import numpy as np


def load_audio(file_bytes: bytes, sr: int = 16_000) -> np.ndarray:
    """
    Converts an audio file's bytes to a mono waveform, resampling as necessary.

    Parameters
    ----------
    file_bytes: bytes
        The bytes of the audio file

    sr: int
        The sample rate to resample the audio to, if necessary

    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """
    try:
        # Launch ffmpeg, feed the encoded bytes on stdin, and read raw 16-bit
        # PCM from stdout while down-mixing and resampling as necessary.
        # Requires the ffmpeg CLI and the `ffmpeg-python` package.
        out, _ = (
            ffmpeg.input("pipe:", threads=0)
            .output("pipe:", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
            .run(input=file_bytes, capture_stdout=True, capture_stderr=True)
        )
    except ffmpeg.Error as e:
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

    # Scale int16 PCM to float32 in [-1, 1], matching whisper.load_audio().
    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
```
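Usage then looks like this (a minimal sketch; the model size is arbitrary, and reading a local file is only a stand-in for the bytes arriving from the upload request):

```python
import whisper

model = whisper.load_model("base")

# In the web backend, file_bytes would come from the upload handler;
# reading a file here just makes the sketch runnable.
with open("sample.mp3", "rb") as f:
    file_bytes = f.read()

audio = load_audio(file_bytes)   # decoded 16 kHz mono float32 waveform
result = model.transcribe(audio)
print(result["text"])
```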