How to properly configure GPU usage for PyTorch with Whisper AI?

I’m having trouble getting my speech recognition model to use GPU instead of CPU. Can someone help me figure out what’s wrong with my setup?

I keep getting memory errors that suggest the model is trying to use CPU even though I’ve configured it for GPU. Here’s my setup:

import whisper
import soundfile as sf
import torch

# File paths
audio_file = "C:\\recordings\\sample.wav"
output_transcript = "C:\\output\\result.txt"

# GPU configuration
torch.cuda.init()
gpu_device = "cuda"

# Load the audio
waveform, sr = sf.read(audio_file, always_2d=True)

# Initialize the model
model_type = "tiny"
print(f"Loading {model_type} model...")
whisper_model = whisper.load_model(model_type).to(gpu_device)
print(f"{model_type} model ready")

# Settings
transcript_results = []
target_language = "fr"

# Process the audio
with torch.cuda.device(gpu_device):
    output = whisper_model.transcribe(waveform, language=target_language, fp16=False, word_timestamps=True)

But I get this error:

RuntimeError: [enforce fail at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 30623038517864 bytes.

My PyTorch version shows CUDA support: 2.0.0+cu117

I thought setting the device to “cuda” and moving the model there would be enough. Am I missing something obvious? Is there a conflict with my libraries or am I not configuring the GPU correctly?

I’m using Jupyter notebooks with Anaconda. The transcribe function seems to ignore my GPU settings and keeps trying to use CPU memory instead.

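Your GPU setup is fine - the problem is how you’re feeding audio to Whisper. When you pass a numpy array to transcribe(), the audio preprocessing (the log-mel spectrogram step) runs on the CPU regardless of where the model lives, and your array is in a shape Whisper doesn’t expect. That’s why the failed allocation comes from the CPU allocator.

transcribe() takes a file path, a numpy array, or a torch tensor, but an array or tensor has to be 1D mono float32 audio at 16 kHz. The simplest fix is to let Whisper load the file itself: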

# Just pass the file path directly
output = whisper_model.transcribe(audio_file, language=target_language, fp16=False, word_timestamps=True)

Or if you need to preprocess first:

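# Downmix (frames, channels) to 1D mono, convert to float32, move to GPU
# Assumes the file is already sampled at 16 kHz, which is what Whisper expects
waveform_tensor = torch.from_numpy(waveform.mean(axis=1)).float().to(gpu_device)
output = whisper_model.transcribe(waveform_tensor, language=target_language, fp16=False, word_timestamps=True)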

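Ditch the torch.cuda.device() context manager too - it’s not needed since your model’s already on GPU. The massive allocation is most likely a side effect of the (samples, 1) array shape: Whisper’s preprocessing assumes 1D audio, so a 2D array gets misread and the spectrogram step ends up requesting an absurd amount of memory.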

Your GPU setup isn’t the problem - it’s the audio format. sf.read() with always_2d=True returns a 2D numpy array of shape (frames, channels), but Whisper expects a 1D mono float32 array sampled at 16 kHz. The wrong shape blows up memory allocation during preprocessing no matter what device settings you use.

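I’ve hit this exact issue before. For a mono file, dropping always_2d=True gives you the 1D array Whisper wants (a stereo file still comes back 2D and needs downmixing). It also helps to ask soundfile for float32, since Whisper works in float32 and sf.read() returns float64 by default:

waveform, sr = sf.read(audio_file, dtype="float32")  # no always_2d=True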
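
If the recordings aren’t guaranteed to be mono, a small helper can normalize whatever soundfile returns before it reaches transcribe(). This is only a sketch: the name to_whisper_audio is made up for illustration, and it assumes the file is already at 16 kHz, raising an error rather than resampling if it isn’t.

```python
import numpy as np

WHISPER_SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz audio


def to_whisper_audio(waveform: np.ndarray, sr: int) -> np.ndarray:
    """Return 1D mono float32 audio (hypothetical helper, not part of Whisper)."""
    audio = np.asarray(waveform, dtype=np.float32)
    if audio.ndim == 2:
        # soundfile lays audio out as (frames, channels); average channels to mono
        audio = audio.mean(axis=1)
    elif audio.ndim != 1:
        raise ValueError(f"expected 1D or 2D audio, got shape {audio.shape}")
    if sr != WHISPER_SAMPLE_RATE:
        # No resampling here; do that with your audio library of choice first
        raise ValueError(f"resample from {sr} Hz to {WHISPER_SAMPLE_RATE} Hz first")
    return audio
```

With that in place, the call becomes whisper_model.transcribe(to_whisper_audio(waveform, sr), language=target_language), and a shape or sample-rate problem shows up as a readable ValueError instead of a giant allocation.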

A couple of other things: skip torch.cuda.init() - PyTorch initializes CUDA lazily on first use. And that fp16=False setting? It makes Whisper run in full precision, which is usually slower on a GPU than half precision. Either set fp16=True or just remove the parameter, since fp16=True is already the default for transcribe().

Your model loading looks fine, but that memory error points to a dimension mismatch that creates massive tensors during preprocessing.
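
To tie the device and precision advice together, here’s a rough sketch of picking both in one place so fp16 is only requested when a GPU is actually available. The function names transcribe_options and transcribe_file are hypothetical, and the "tiny" model and "fr" language are just carried over from the question.

```python
def transcribe_options(cuda_available: bool) -> dict:
    """Pick a device and a matching fp16 flag (hypothetical helper)."""
    return {
        "device": "cuda" if cuda_available else "cpu",
        # Half precision only pays off on a GPU; on CPU Whisper uses fp32 anyway
        "fp16": cuda_available,
    }


def transcribe_file(audio_path: str, language: str = "fr") -> dict:
    # Imported here so the helper above stays usable without torch installed
    import torch
    import whisper

    opts = transcribe_options(torch.cuda.is_available())
    # load_model accepts a device argument, so no separate .to() call is needed
    model = whisper.load_model("tiny", device=opts["device"])
    # Passing the path lets Whisper decode, downmix and resample the file itself
    return model.transcribe(
        audio_path, language=language, fp16=opts["fp16"], word_timestamps=True
    )
```

Calling transcribe_file(audio_file) then replaces the manual torch.cuda.init(), .to(gpu_device) and context manager steps from the question.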