-
I've successfully used `prompt` to give the recognizer extra context, but so far my attempts to use `prefix` have resulted in errors. What's the difference between the two, and what kind of data is …
-
Below shows where …
-
Will the provided prompt affect the transcription all the way through, or just in a time window from the beginning to some given time? I'm trying to transcribe audio that is about six minutes long and contains quite a few non-English names. I've had some success providing the names in variously formatted prompts, but it only seems to help in the first 90 seconds or so of the six-minute video, and any names introduced in the video after that point tend to be wrong. As soon as I get past the 90-second mark, I can't find a prompt that lets Whisper get the names right. Am I misunderstanding, or is there any way to provide context that will persist all the way through the transcription?
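One plausible explanation for the ~90-second cutoff (a sketch, not Whisper's actual code): in long-form transcription the model conditions each ~30-second window on at most a fixed number of the most recent context tokens, so the initial prompt is gradually pushed out of the context by the decoded text itself. The constants below (`224` context tokens, `80` tokens decoded per window, a 40-token prompt) are illustrative assumptions, and `remaining_hint_tokens` is a hypothetical helper:

```python
def remaining_hint_tokens(n_windows, n_hint=40, per_window=80, max_ctx=224):
    """Count how many initial-prompt tokens are still inside the
    conditioning context when each ~30 s window is decoded."""
    all_tokens = ["<hint>"] * n_hint          # stand-in for the initial prompt
    counts = []
    for _ in range(n_windows):
        context = all_tokens[-max_ctx:]       # only the most recent tokens survive
        counts.append(context.count("<hint>"))
        all_tokens += ["<tok>"] * per_window  # decoded text appended each window
    return counts

print(remaining_hint_tokens(6))  # → [40, 40, 40, 0, 0, 0]
```

Under these assumed numbers the prompt tokens fall out of the window after three chunks, i.e. roughly 90 seconds in, which would match the behavior described above.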
-
It seems that … Even if I pass in … I would like to propose a change to the way …
-
@jongwook is this issue related to an issue I'm seeing related to prompting: I'm passing prompts that look like this in my whisper calls: … 80% of the time I use the prompt, I get fully hallucinated output. It ends up in a loop, repeating the same thing (e.g. a competitor's name, or a URL made up from one of the competitor's names). An example output using the prompt above on an audio file^: …
I'm calling whisper from a Node backend -- more details on how exactly here.
`prompt` conditions the model on the text that appeared in the previous ~30 seconds of audio. In long-form transcription it helps the model continue the text in a consistent style, e.g. starting a sentence with a capital letter if the previous context ended with a period. You can also use it for "prompt engineering", to make the model more likely to output certain jargon (`" So we were just talking about DALL·E"`) or to do a crude form of speaker turn tracking (e.g. `" - Hey how are you doing? - I'm doing good. How are you?"`; note that the token for `" -"` is suppressed by default and will need to be enabled manually).

`prefix` accepts a partial transcription for the current audio input, al…
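The distinction can be sketched at the token level. Based on how Whisper lays out its decoder input, prompt tokens are placed *before* the start-of-transcript token as previous-window context, while prefix tokens are placed *after* it, so decoding is forced to continue from that exact text. The special-token names below follow Whisper's tokenizer; the helper function itself is a hypothetical illustration, not Whisper's actual code:

```python
# Illustrative sketch of the decoder input layout for prompt vs. prefix.
# Special-token names follow Whisper's tokenizer; decoder_input is hypothetical.

def decoder_input(prompt_tokens=None, prefix_tokens=None):
    seq = []
    if prompt_tokens:
        # prompt: conditioning context from the *previous* window,
        # placed before <|startoftranscript|>
        seq += ["<|startofprev|>"] + list(prompt_tokens)
    seq += ["<|startoftranscript|>", "<|en|>", "<|transcribe|>"]
    if prefix_tokens:
        # prefix: part of the *current* transcription;
        # the model continues decoding from here
        seq += list(prefix_tokens)
    return seq

print(decoder_input(prompt_tokens=["DALL·E"], prefix_tokens=["Hello", "world"]))
```

In this layout a prompt can only bias the output (it sits in the context), whereas a prefix becomes the literal beginning of the transcription for that window, which is why malformed prefix text tends to cause errors rather than just being ignored.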