Unreliable Speech Recognition in Custom Discord Bot

Hey everyone, I’m having trouble with my Discord bot project. I’m using the interactions library to build a bot that can transcribe audio using OpenAI Whisper. Here’s the problem:

I’ve set up the bot to record audio and save it as a .pcm file. Then I use this function to transcribe it:

import numpy

def process_audio(self, audio_data):
    # Raw 16-bit PCM bytes -> float32 samples in [-1.0, 1.0],
    # the range Whisper's transcribe() expects for array input
    audio_array = numpy.frombuffer(audio_data, dtype=numpy.int16)
    normalized_audio = audio_array.astype(numpy.float32) / 32768.0
    transcript = self.whisper_model.transcribe(normalized_audio)
    self.audio_stream.close()  # done with the recording stream
    print('Transcription done')
    return transcript['text']

But the results are all over the place. Even when I use the same audio input, I get different transcriptions each time. It’s driving me crazy!

Does anyone know why this might be happening? Are there any tricks to make Whisper more consistent when used with a Discord bot? Any help would be awesome. Thanks!

I’ve dealt with speech recognition inconsistencies in my projects too. One thing that helped was implementing a confidence threshold. Whisper doesn’t report a single confidence score, but each segment in the result from transcribe() includes stats like avg_logprob and no_speech_prob, so you can set up your code to only accept segments that clear a certain bar. This way, you filter out potentially inaccurate results.
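Here’s a minimal sketch of what I mean. The threshold values are just starting points you’d tune for your own audio:

def filter_by_confidence(result, min_logprob=-1.0, max_no_speech=0.6):
    # Keep only segments Whisper decoded with reasonable certainty;
    # avg_logprob and no_speech_prob come from transcribe()'s result dict
    accepted = []
    for segment in result['segments']:
        if segment['avg_logprob'] >= min_logprob and segment['no_speech_prob'] <= max_no_speech:
            accepted.append(segment['text'])
    return ' '.join(accepted).strip()

Then call filter_by_confidence(self.whisper_model.transcribe(audio)) instead of reading transcript['text'] directly.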

Another approach that worked for me was preprocessing the audio. Try cleaning it up before feeding it into Whisper - even a simple high-pass filter plus loudness normalization can tame background noise. Libraries like pydub are useful for this.
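A rough sketch with pydub, assuming Discord’s usual 48 kHz, 16-bit, stereo PCM (adjust the parameters if your capture settings differ):

from pydub import AudioSegment
from pydub.effects import normalize

def preprocess_pcm(path):
    # Load the raw capture; these parameters assume Discord's standard PCM format
    seg = AudioSegment.from_raw(path, sample_width=2, frame_rate=48000, channels=2)
    seg = seg.set_channels(1)        # downmix to mono
    seg = seg.high_pass_filter(100)  # cut low-frequency rumble and hum
    seg = normalize(seg)             # even out the loudness
    seg.export('cleaned.wav', format='wav')
    return 'cleaned.wav'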

Lastly, consider batch processing instead of real-time transcription. I found that running Whisper on larger chunks of audio sometimes yielded more consistent results than processing small snippets on the fly.
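One way to do that, as a sketch: buffer the incoming PCM and only hand it to Whisper once enough has piled up. The chunk length is an illustrative knob, and the byte-rate math assumes Discord’s 48 kHz, 16-bit, stereo format:

CHUNK_SECONDS = 10
BYTES_PER_SECOND = 48000 * 2 * 2  # sample rate * channels * bytes per sample

class AudioBuffer:
    def __init__(self):
        self._buf = bytearray()

    def feed(self, pcm_bytes):
        self._buf.extend(pcm_bytes)

    def pop_chunk(self):
        # Return one full chunk when available, otherwise None
        needed = CHUNK_SECONDS * BYTES_PER_SECOND
        if len(self._buf) < needed:
            return None
        chunk = bytes(self._buf[:needed])
        del self._buf[:needed]
        return chunk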

Hope these suggestions help. Speech recognition can be tricky, but keep at it!

hey there, i’ve had similar issues with whisper. have you tried adjusting the sampling rate? whisper expects 16 kHz mono input, but discord records at 48 kHz stereo, so a mismatch there can really hurt consistency. also, make sure your audio quality is good - background noise can mess things up. maybe experiment with different whisper models too - the larger ones are slower but usually more accurate. good luck!
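for the resampling part, here’s roughly what i do (a sketch assuming discord’s usual 48 kHz, 16-bit, stereo pcm - needs scipy):

import numpy
from scipy.signal import resample_poly

def to_whisper_input(pcm_bytes):
    # 16-bit stereo pcm -> float32 mono at 16 kHz, the rate whisper expects
    samples = numpy.frombuffer(pcm_bytes, dtype=numpy.int16)
    stereo = samples.reshape(-1, 2).astype(numpy.float32) / 32768.0
    mono = stereo.mean(axis=1)                # average the two channels
    return resample_poly(mono, up=1, down=3)  # 48 kHz -> 16 kHz

then pass the returned array straight to whisper_model.transcribe().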