I’m working with Whisper for speech recognition and need raw transcripts that keep filler words like “um”, “uh”, “hmm”, etc. From what I’ve found, passing normalize=False when decoding should help with this.
The issue is that my current setup only processes the first 30 seconds of audio. I need to handle longer audio files while keeping normalization disabled.
Here’s what I have working for short clips:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the audio at the 16 kHz mono rate Whisper expects
audio_data, _ = librosa.load("test_file.mp3", sr=16000, mono=True)

whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-large")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
whisper_model.config.forced_decoder_ids = whisper_processor.get_decoder_prompt_ids(language="en", task="transcribe")

audio_features = whisper_processor(audio_data, return_tensors="pt", sampling_rate=16000).input_features
generated_ids = whisper_model.generate(audio_features)

# normalize=False keeps the raw decoded text, including fillers
result = whisper_processor.batch_decode(generated_ids, skip_special_tokens=True, normalize=False)
print(result)
This works great for short audio but cuts off at 30 seconds. I know that using the pipeline with chunking can handle longer files, but I can’t figure out how to disable normalization with that approach.
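The fallback I’m considering is slicing the audio into 30-second pieces myself and keeping the same normalize=False decode. Here is a rough sketch of that idea; it reuses whisper_model, whisper_processor, and audio_data from above and assumes naive, non-overlapping 30-second slices:

sample_rate = 16000
chunk_samples = 30 * sample_rate  # naive, non-overlapping 30-second slices

chunk_texts = []
for start in range(0, len(audio_data), chunk_samples):
    chunk = audio_data[start:start + chunk_samples]
    features = whisper_processor(chunk, return_tensors="pt", sampling_rate=sample_rate).input_features
    ids = whisper_model.generate(features)
    # normalize=False keeps fillers like "um"/"uh" in each chunk's text
    chunk_texts.append(whisper_processor.batch_decode(ids, skip_special_tokens=True, normalize=False)[0])

print(" ".join(chunk_texts))

This mostly works, but words that straddle a chunk boundary get mangled, which is exactly what the pipeline’s chunking with overlap is supposed to avoid, so I’d much rather use the pipeline if normalization can be turned off there.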
I also tried the pipeline method like this:
import torch
from transformers import pipeline

device_type = "cuda:0" if torch.cuda.is_available() else "cpu"

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    chunk_length_s=30,
    device=device_type,
)

# audio_data is the same array loaded with librosa above
my_audio = {"array": audio_data, "sampling_rate": 16000}
output = asr_pipeline(my_audio)["text"]
But I don’t see where to pass normalize=False in this setup. Any ideas on how to combine chunking for long audio with normalization disabled?