Hallucinations different model sizes #1452
Replies: 5 comments 2 replies
-
One of the main causes of hallucinations should be the training data, not model size. There are also various settings for Whisper, and you cannot know in advance how those settings plus model size will affect hallucinations; the default values do not always guarantee the best transcripts. If you have the time and resources, try changing those settings and the data (the amount of silence) and compare the performance. Also test pre-processing the audio; that would be a better addition to your thesis.
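To make "try changing those settings" concrete, here is a minimal sketch of how you might enumerate a grid of Whisper decoding settings to compare. The parameter names follow openai-whisper's `model.transcribe()` (verify them against the version you use); the candidate values are assumptions to tune, not recommendations from this thread.

```python
from itertools import product

# Decoding settings commonly reported to influence Whisper hallucinations.
# Names follow openai-whisper's model.transcribe(); values are placeholders.
grid = {
    "temperature": [0.0, 0.2, 0.4],
    "no_speech_threshold": [0.4, 0.6],
    "condition_on_previous_text": [True, False],
}

def setting_combinations(grid):
    """Yield every combination of the candidate settings as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(setting_combinations(grid))
print(len(combos))  # 3 * 2 * 2 = 12 combinations to compare

# Each combo could then be passed straight to Whisper, e.g.:
#   result = model.transcribe("clip.wav", **combo)
```

Running the same evaluation set through each combination lets you attribute hallucination differences to settings rather than to model size alone.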
-
In my testing I have found that most hallucinations occur after silence, thus using ASD to remove the silence does a great job of removing the hallucinations.
Jeffrey Duncan
-
Whoops, I meant VAD (Voice Activity Detector) - you can use it to find the areas of a recording that are silent and trim them out: https://github.com/snakers4/silero-vad
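For intuition, the idea behind VAD-based trimming can be sketched with a crude RMS-energy gate (a stand-in for a real VAD such as silero-vad, which is far more robust; the `frame_ms` and `threshold` values here are assumptions you would tune per recording):

```python
import numpy as np

def trim_silence(samples, sr, frame_ms=30, threshold=0.01):
    """Drop frames whose RMS energy falls below `threshold`.

    A crude energy gate standing in for a real VAD such as silero-vad;
    threshold and frame size are assumed values, tune them per recording.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    kept = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if np.sqrt(np.mean(frame ** 2)) >= threshold:
            kept.append(frame)
    return np.concatenate(kept) if kept else samples[:0]

# 1 s of silence followed by 1 s of a 440 Hz tone, at 16 kHz
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
trimmed = trim_silence(audio, sr)
print(len(audio), len(trimmed))  # the silent second is removed
```

In practice silero-vad's `get_speech_timestamps` does this job with a trained model instead of a fixed energy threshold, which matters for noisy recordings where silence is not actually zero-energy.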
…On Fri, Jun 16, 2023 at 1:49 PM Phan Tuấn Anh wrote:
what is ASD ?
-
I agree with pre-processing the data @phineas-pta. Would you suggest any additional pre-processing steps besides removing silence?
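Two cheap candidates that often come up alongside silence removal are DC-offset removal and peak normalization, so every clip reaches the model at a comparable level. These are illustrative suggestions, not steps anyone in this thread has validated:

```python
import numpy as np

def preprocess(samples):
    """Illustrative pre-processing besides silence removal:
    remove any DC offset, then peak-normalize to [-1, 1]."""
    samples = samples - np.mean(samples)   # DC offset removal
    peak = np.max(np.abs(samples))
    if peak > 0:
        samples = samples / peak           # peak normalization
    return samples

clip = np.array([0.1, 0.3, 0.2, 0.1])      # toy waveform with a DC bias
out = preprocess(clip)
print(out.max())  # 1.0 after normalization
```

Whether steps like these actually reduce hallucinations would itself be something to measure in the thesis rather than assume.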
-
You should have a look here: And here: ;)
-
I used Whisper to transcribe the Common Voice data set for one language and noticed that the 'tiny' model hallucinates a lot, whereas the bigger 'small' model almost does not hallucinate at all, and the intermediate 'base' model (larger than 'tiny' but smaller than 'small') hallucinates more than the 'small' model. Furthermore, the overall performance of the small model is better than that of both the tiny and base models. As a side note, the instances in this data set are single sentences worth about 5-10 seconds of audio.
I am mostly interested in your thoughts on why a larger model does not necessarily perform better and may hallucinate more. I did not change the temperature or any other settings when transcribing. I can imagine that a larger model might overfit, which could cause this phenomenon, but I would like to know what you think might cause the lower performance and increased hallucinations.
As context: I am doing research for my master thesis so any ideas are welcome!
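For a thesis it also helps to quantify hallucinations rather than eyeball them. One common symptom in Whisper output is a phrase looping over and over, which a simple repeated-n-gram counter can flag (a heuristic for triage, not a ground-truth hallucination measure; the threshold you act on is up to you):

```python
from collections import Counter

def max_ngram_repeat(text, n=3):
    """Return the highest repetition count of any word n-gram.

    Looping phrases are a common symptom of Whisper hallucinations,
    so a high value flags transcripts worth manual inspection.
    """
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return max(Counter(grams).values()) if grams else 0

clean = "the quick brown fox jumps over the lazy dog"
looped = "thank you thank you thank you thank you thank you"
print(max_ngram_repeat(clean), max_ngram_repeat(looped))  # 1 4
```

Scoring each model's transcripts this way would let you compare tiny/base/small on hallucination rate with the same metric.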