How to turn off text normalization in Whisper for audio files exceeding 30 seconds?

I’m working with Whisper for speech recognition and need raw transcripts that keep filler words like “um”, “uh”, “hmm”, etc. From what I’ve read, setting normalize to False when decoding should keep them in the output.

The issue is that my current setup only processes the first 30 seconds of audio. I need to handle longer audio files while keeping the normalization disabled.

Here’s what I have working for short clips:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the audio as a mono 16 kHz waveform (the sampling rate Whisper expects)
audio_data, _ = librosa.load("test_file.mp3", sr=16000, mono=True)

whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-large")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")

# Force English transcription (rather than translation or language detection)
whisper_model.config.forced_decoder_ids = whisper_processor.get_decoder_prompt_ids(language="en", task="transcribe")

# Convert the waveform to log-mel input features and generate token IDs
audio_features = whisper_processor(audio_data, return_tensors="pt", sampling_rate=16000).input_features
generated_ids = whisper_model.generate(audio_features)

# Decode with normalize=False so filler words and punctuation are left untouched
result = whisper_processor.batch_decode(generated_ids, skip_special_tokens=True, normalize=False)

print(result)
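
For context on why it stops at 30 seconds: the Whisper feature extractor pads or truncates every input to a fixed 30-second log-mel window by default, so generate() only ever sees the first 30 seconds. A quick way to see this, as a minimal sketch reusing the variables from the snippet above:

# Sketch: the feature extractor always produces a fixed-length window,
# so longer audio is silently truncated to the first 30 seconds.
audio_features = whisper_processor(audio_data, return_tensors="pt", sampling_rate=16000).input_features
print(audio_features.shape)      # e.g. torch.Size([1, 80, 3000]) -> 3000 frames ≈ 30 s
print(len(audio_data) / 16000)   # actual clip length in seconds, which can be much longer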

This works great for short audio, but anything beyond 30 seconds gets cut off. I know the pipeline API with chunking can handle longer files, but I can’t figure out how to disable normalization when using that approach.

I also tried the pipeline method like this:

import torch
from transformers import pipeline

# Run on GPU if one is available
device_type = "cuda:0" if torch.cuda.is_available() else "cpu"

# chunk_length_s makes the pipeline split long audio into 30-second chunks internally
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    chunk_length_s=30,
    device=device_type,
)

# Reuse the waveform loaded earlier with librosa
my_audio = {"array": audio_data, "sampling_rate": 16000}
output = asr_pipeline(my_audio)["text"]
But I don’t see where to set the normalize parameter to False in this setup. Any ideas on how to combine chunking for long audio with disabled normalization?

I think the pipeline is using different decoding parameters internally. Try setting return_timestamps=True and normalize=False directly in the pipeline call instead of through generate_kwargs, i.e. asr_pipeline(my_audio, return_timestamps=True, normalize=False). That fixed it for me when I had a similar problem keeping the “ums” in longer files.
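
A minimal sketch of that call, reusing the asr_pipeline and my_audio from the question. Whether the pipeline actually forwards normalize through to decoding depends on your transformers version, so treat this as the shape of the call rather than a guaranteed API:

# Sketch of the suggested call: pass return_timestamps and normalize directly
# to the pipeline call (assumes this transformers version accepts them here).
output = asr_pipeline(my_audio, return_timestamps=True, normalize=False)
print(output["text"])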

You can disable normalization by passing the normalize parameter through generate_kwargs. The pipeline’s batch_decode method will respect this setting when processing chunks.

Try this:

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    chunk_length_s=30,
    device=device_type,
    generate_kwargs={"normalize": False},  # intended to disable text normalization for every chunk
)

I’ve used this with longer audio files and it keeps the filler words across all chunks. The pipeline handles the chunking internally while applying your normalization setting to the entire transcription. Just make sure you’re on a recent version of transformers; older versions might not support this parameter.
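
For completeness, here is a sketch of the end-to-end flow with this setup, reusing the librosa loading from the question (test_file.mp3 stands in for your long recording):

import librosa

# Load the full-length recording as mono 16 kHz audio
audio_data, _ = librosa.load("test_file.mp3", sr=16000, mono=True)

# Feed the whole waveform to the chunked pipeline in one call
my_audio = {"array": audio_data, "sampling_rate": 16000}
output = asr_pipeline(my_audio)
print(output["text"])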