I’m working with OpenAI’s Whisper model and need raw transcriptions that include filler words like “um”, “ah”, “hmm”, etc. I found out that passing normalize=False when decoding helps keep them in the output.
My current setup works fine for short audio clips, but I’m stuck when trying to process longer audio files. The model only transcribes the first 30 seconds of my audio.
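Here’s a stripped-down version of what I’m doing now (the checkpoint and file path are just examples):

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # example checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio, sr = librosa.load("my_clip.wav", sr=16000)  # placeholder path
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

# normalize=False skips Whisper's text normalizer, which would strip fillers
text = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)[0]
print(text)
```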
I know the pipeline approach can handle longer audio through chunking, but I can’t figure out how to disable normalization when using it. I’m also not sure how to load my local audio file properly with the pipeline method.
Can someone help me combine these approaches? I need to process audio longer than 30 seconds while keeping the normalize=False setting to preserve those filler words.
The pipeline doesn’t expose the normalize parameter in batch_decode, but there’s a workaround: access the processor manually after chunking. I hit the same issue with interview recordings.

What worked for me: run the pipeline with return_timestamps=True and chunk_length_s=30, then rebuild the full transcription by looping over the chunks and calling processor.batch_decode() with normalize=False on each chunk’s token IDs. You’ll need to reach into the pipeline’s tokenizer and model components directly.

For local files, just pass the file path string to the pipeline - it handles most audio formats without any librosa preprocessing.
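A minimal sketch of that manual-decode loop, using the pipeline’s components directly (the checkpoint and filename are placeholders, and the 30-second slicing is done by hand here since the pipeline only hands back already-decoded text):

```python
import torch
import librosa
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# load locally at Whisper's expected 16 kHz so we can slice it ourselves
audio, sr = librosa.load("interview.wav", sr=16000)

chunk_len = 30 * sr
texts = []
for start in range(0, len(audio), chunk_len):
    chunk = audio[start:start + chunk_len]
    feats = pipe.feature_extractor(chunk, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        ids = pipe.model.generate(feats.input_features)
    # normalize=False keeps fillers the normalizer would otherwise strip
    texts.append(pipe.tokenizer.batch_decode(
        ids, skip_special_tokens=True, normalize=False)[0])

print(" ".join(texts))
```

If you only need the chunked text with timestamps, `pipe("interview.wav", chunk_length_s=30, return_timestamps=True)` works as-is - the catch is its text comes back already assembled, which is why the loop above decodes the token IDs itself.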
you can also use the max_new_tokens parameter in generate() to stop outputs getting truncated early. note that 448 is whisper’s maximum target length, so you can’t go higher than that. and heads up - this only raises the token cap, the encoder still only sees the first 30 seconds of audio, so it won’t get you past that limit by itself.
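rough sketch of what i mean (checkpoint and file path are just examples):

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio, sr = librosa.load("clip.wav", sr=16000)  # placeholder path
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# raise the output-token cap; whisper emits at most 448 tokens total,
# so stay a bit under that to leave room for the forced prefix tokens
ids = model.generate(inputs.input_features, max_new_tokens=440)
print(processor.batch_decode(ids, skip_special_tokens=True, normalize=False)[0])
```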
Had the same issue transcribing podcast episodes. You need to chunk the audio yourself before it reaches the model, then process each piece separately.

Load your audio with librosa like you’re already doing, then split it into 30-second chunks using audio_data[i*sample_rate*30:(i+1)*sample_rate*30]. Run each chunk through your existing setup and combine the results. Keep your current normalize=False approach - just apply it to the smaller chunks instead of fighting the pipeline.

I overlapped chunks by 2-3 seconds so words don’t get cut off at the edges (see the sketch below). You keep full control over normalization while handling audio files of any length.
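A sketch of that chunk-and-overlap loop, assuming the same librosa/processor setup described in the question (checkpoint and filename are placeholders):

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio_data, sample_rate = librosa.load("podcast.mp3", sr=16000)  # placeholder path

chunk_s, overlap_s = 30, 2                    # 30 s windows, 2 s of overlap
step = (chunk_s - overlap_s) * sample_rate    # hop between chunk starts

texts = []
for start in range(0, len(audio_data), step):
    chunk = audio_data[start:start + chunk_s * sample_rate]
    inputs = processor(chunk, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(inputs.input_features)
    texts.append(processor.batch_decode(
        ids, skip_special_tokens=True, normalize=False)[0])

# naive join - words inside the overlap can repeat at the seams,
# so you may want to dedupe there before using the transcript
print(" ".join(texts))
```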
try setting chunk_length_s in the pipeline and normalize=False in batch_decode. you’ll need to loop through the chunks and decode each one separately to keep that setting.