How to synchronize text display with audio playback in OpenAI real-time voice API using React

I’m working with OpenAI’s real-time voice API (gpt-4o-realtime-preview-2024-12-17) in my React app, and I’m having trouble getting the text and audio to stay in sync. The transcribed text either shows up too early or the audio plays with a lag.

My setup:

  • Using WebSockets for streaming audio to OpenAI
  • RealtimeClient handles the API communication
  • AudioRecorder and AudioPlayer manage the streaming (16-bit PCM format)
  • Text updates happen as responses come in

Here’s how I connect to the API:

const startConnection = useCallback(async () => {
    const apiClient = clientRef.current;
    const audioRec = audioRecorderRef.current;
    const audioPlay = audioPlayerRef.current;
    
    await audioRec.start();
    await audioPlay.initialize();
    
    try {
        const result = await apiClient.connect();
        if (result) {
            setIsConnecting(false);
            apiClient.sendMessage([{ type: "input_text", text: "Hi there!" }]);
            
            if (apiClient.getDetectionMode() === "server_vad") {
                await audioRec.startRecording((audioData) => apiClient.addAudioInput(audioData.mono));
            }
        }
    } catch (err) {
        console.error("Connection failed:", err);
    }
}, []);

And here’s how I handle responses:

apiClient.on("conversation.updated", async ({ message, changes }) => {
    if (message.role === "assistant" && changes?.audio) {
        audioPlayer.addPCMData(changes.audio, message.id);
        displayTextRef.current = message.formatted.transcript;
    } else if (changes?.text) {
        displayTextRef.current = message.formatted.transcript;
    }
    
    if (message.status === "completed" && message.formatted.audio?.length) {
        const audioBlob = await AudioRecorder.convertToWav(message.formatted.audio, 24000, 24000);
        setPlaybackUrl(audioBlob.url);
    }
});

The main issue:
I can’t get the text to scroll smoothly with the audio. I’m trying to scroll based on 150 words per minute:

const animateTextScroll = () => {
    if (!textContainerRef.current) return;
    
    const textBox = textContainerRef.current;
    const now = Date.now();
    const timeElapsed = now - animationStartRef.current;
    const totalDuration = calculateScrollTime(displayText);
    
    if (timeElapsed >= totalDuration) {
        textBox.scrollTop = textBox.scrollHeight - textBox.clientHeight;
        return;
    }
    
    const scrollProgress = timeElapsed / totalDuration;
    const maxScroll = textBox.scrollHeight - textBox.clientHeight;
    
    const smoothEasing = (t) => t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;
    
    textBox.scrollTop = maxScroll * smoothEasing(scrollProgress);
    requestAnimationFrame(animateTextScroll);
};

What I’ve tried:

  1. Converting the PCM data to playable audio, but it takes too long and causes sync issues
  2. Using a fixed 150 WPM calculation, but it doesn’t work well for longer responses since speaking speed varies:

const calculateScrollTime = (text) => {
    const avgWordsPerMin = 150;
    const wordCount = text.split(" ").length;
    return (wordCount / avgWordsPerMin) * 60 * 1000;
};

How can I get the text scrolling to match the actual audio playback timing? Are there any libraries or techniques that work well for this kind of real-time sync?

Yeah, this timing issue is super common with real-time audio APIs. Don’t bother with word-per-minute calculations - they’re unreliable.

I’d go with buffer-based sync instead, which I’ve used before with good results: track the actual audio playback position instead of guessing. Modify your audio player to emit progress events as PCM data plays, then use those events to drive text scrolling. Way more accurate.

A few other things that help:

  • Add a small delay buffer (100-200 ms) between receiving audio chunks and starting text updates. This helps with WebSocket latency variations.
  • Try a sliding-window approach - only scroll the visible text that matches the currently playing audio segment.
  • Request timing metadata from OpenAI if they provide it, or implement your own real-time audio analysis to detect speech pacing.

You’ll get much better sync points than fixed WPM calculations, especially since the AI’s speaking rate changes a lot depending on content complexity.
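To make the buffer-based idea concrete, here’s a minimal sketch. The `PlaybackClock` class and its method names are my own invention, not part of the OpenAI SDK - the point is just that playback position is derived from samples actually played, and progress listeners drive the scroll:

```javascript
// Sketch of buffer-based sync (hypothetical names, not the OpenAI SDK).
// Position comes from samples the player has actually consumed, not guesses.
class PlaybackClock {
  constructor(sampleRate = 24000) {
    this.sampleRate = sampleRate; // 24 kHz mono 16-bit PCM
    this.playedSamples = 0;
    this.listeners = [];
  }

  // Call from your audio player each time a PCM chunk finishes playing.
  onChunkPlayed(sampleCount) {
    this.playedSamples += sampleCount;
    const seconds = this.playedSamples / this.sampleRate;
    this.listeners.forEach((fn) => fn(seconds));
  }

  onProgress(fn) {
    this.listeners.push(fn);
  }
}

// Map playback seconds to a 0..1 scroll fraction.
function scrollFraction(playedSeconds, totalBufferedSeconds) {
  if (totalBufferedSeconds <= 0) return 0;
  return Math.min(1, playedSeconds / totalBufferedSeconds);
}
```

In your component you’d wire it up roughly as `clock.onProgress((s) => { textBox.scrollTop = maxScroll * scrollFraction(s, totalSeconds); })`, so the scroll only advances when audio actually plays.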

Had this exact problem last month. You’re fighting two different timing domains - when WebSocket messages arrive vs when audio actually plays from your buffer.

Here’s what fixed it for me: build a time-offset correction system. Calculate the gap between when audio data arrives and when it starts playing, then apply that offset to your text scrolling.

Your conversation.updated handler should record both message.timestamp and current audio playback position. This creates a mapping between text chunks and their real playback timing.

Don’t try predicting scroll duration - instead, reactively adjust scroll position based on actual audio progress events. PCM streaming has variable latency depending on buffer states, so you need continuous calibration, not one-time sync.

That smooth easing function looks nice but it’s working against you. Linear interpolation between known audio position checkpoints maintains sync accuracy way better.
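Here’s a rough sketch of that checkpoint idea (all names are hypothetical). Each time a conversation.updated event arrives, you record a `{ time, offset }` pair - audio playback position and character offset into the transcript - and the scroll loop linearly interpolates between the two checkpoints bracketing the current position:

```javascript
// Linear interpolation between known (audio time, text offset) checkpoints.
// checkpoints must be sorted by ascending time.
function interpolateOffset(checkpoints, playbackSeconds) {
  if (checkpoints.length === 0) return 0;
  if (playbackSeconds <= checkpoints[0].time) return checkpoints[0].offset;
  for (let i = 1; i < checkpoints.length; i++) {
    const prev = checkpoints[i - 1];
    const next = checkpoints[i];
    if (playbackSeconds <= next.time) {
      // Fraction of the way between the two bracketing checkpoints.
      const t = (playbackSeconds - prev.time) / (next.time - prev.time);
      return prev.offset + t * (next.offset - prev.offset);
    }
  }
  // Past the last checkpoint: clamp to the end.
  return checkpoints[checkpoints.length - 1].offset;
}
```

New checkpoints continuously recalibrate the mapping, so variable PCM buffering latency gets corrected on the fly instead of accumulating drift the way a one-time WPM estimate does.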

The 150 WPM thing is broken - don’t bother with it. Use AudioContext.currentTime instead to track where you actually are in playback; way more reliable. I timestamp each text chunk as it comes in, then sync scrolling by comparing the context’s currentTime against those timestamps. Works great once you factor in browser buffering (usually a 50-100 ms delay).
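A minimal sketch of that approach, with the scheduling split into pure helpers (the function names are mine; only `AudioContext.currentTime` and `AudioBufferSourceNode.start(when)` are real Web Audio APIs). Each chunk gets an explicit start time in the context’s clock, so you always know which chunk is audible:

```javascript
// Assign each PCM/text chunk an absolute start time on the audio clock.
// latency pads for typical browser buffering (~50-100 ms).
function scheduleChunks(chunks, contextTime, latency = 0.075) {
  let cursor = contextTime + latency;
  return chunks.map((c) => {
    const entry = { startsAt: cursor, text: c.text };
    cursor += c.durationSec;
    return entry;
  });
}

// Given the current AudioContext.currentTime, find the chunk playing now.
function currentChunk(schedule, now) {
  let active = null;
  for (const entry of schedule) {
    if (entry.startsAt <= now) active = entry;
    else break; // schedule is sorted; later chunks haven't started yet
  }
  return active;
}
```

In the browser you’d call `source.start(entry.startsAt)` on each `AudioBufferSourceNode` and, inside your requestAnimationFrame loop, look up `currentChunk(schedule, audioContext.currentTime)` to decide how far to scroll.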