Issues with synchronizing audio and text in OpenAI's real-time API

I’m building a React application that uses OpenAI’s realtime API (gpt-4o-realtime-preview-2024-12-17) for live voice communication. I’m having trouble keeping the transcribed text and the audio playback in sync: sometimes the text appears too early, and sometimes the audio lags behind.

My configuration includes:

  • A WebSocket for streaming real-time audio to OpenAI
  • The RealtimeClient to manage live audio replies
  • WavRecorder and WavStreamPlayer for audio handling, working with 16-bit PCM
  • The text responses get updated automatically as they are received

Code for establishing the connection:

const initiateSession = useCallback(async () => {
  const apiClient = clientRef.current;
  const audioRecorder = audioRecorderRef.current;
  const streamPlayer = streamPlayerRef.current;

  await audioRecorder.start();
  await streamPlayer.connect();

  try {
    const connection = await apiClient.connect();
    if (connection) {
      setLoading(false);
      apiClient.sendUserMessageContent([{ type: "input_text", text: "Hello there!" }]);
      
      if (apiClient.getTurnDetectionType() === "server_vad") {
        await audioRecorder.record((data) => apiClient.appendInputAudio(data.mono));
      }
    }
  } catch (err) {
    console.error("Connection error:", err);
  }
}, []);

Code for handling responses:

apiClient.on("conversation.updated", async ({ item, delta }) => {
  if (item.role === "assistant" && delta?.audio) {
    streamPlayer.add16BitPCM(delta.audio, item.id);
    textRef.current = item.formatted.transcript;
  } else if (delta?.text) {
    textRef.current = item.formatted.transcript;
  }

  if (item.status === "completed" && item.formatted.audio?.length) {
    const audioFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
    setAudiosrc(audioFile.url);
  }
});

Key problem: The text scrolling does not align with the audio playback timing.

Current scrolling implementation (based on 150 WPM):

const synchronizeScrolling = () => {
  if (!scrollContainerRef.current) return;

  const container = scrollContainerRef.current;
  const currentTime = Date.now();
  const elapsed = currentTime - scrollStartTimeRef.current;
  const duration = estimateScrollDuration(text);

  if (elapsed >= duration) {
    container.scrollTop = container.scrollHeight - container.clientHeight;
    return;
  }

  const progress = elapsed / duration;
  const maxScrollTop = container.scrollHeight - container.clientHeight;

  const smoothEaseFunction = (t) =>
    t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;

  container.scrollTop = maxScrollTop * smoothEaseFunction(progress);
  animationFrameRef.current = requestAnimationFrame(synchronizeScrolling);
};

Experiments carried out:

  1. PCM-to-audio conversion:

     const audioFile = await WavRecorder.decode(item.formatted.audio, 24000, 24000);
     setAudiosrc(audioFile.url);

     However, the conversion delay itself creates synchronization problems.

  2. Timed scrolling based on word count:

     const wordsPerMinute = 150;
     const wordCount = text.split(" ").length;
     return (wordCount / wordsPerMinute) * 60 * 1000;

     This works for shorter texts but fails with longer ones because actual speech speed varies.

I need assistance with:

  1. How can I effectively synchronize text scrolling with the live audio playback?
  2. Are there any recommended libraries or best practices for achieving text-audio synchronization in real-time applications?

I appreciate any guidance or suggestions!

You’re trying to manually sync two streams when you should use one system to handle both. Been there - building custom sync logic is a nightmare.

You need a workflow that handles the whole audio-text pipeline automatically. When I hit similar issues with real-time voice apps, I ditched the manual buffer tracking and timing calculations.

Don’t manage WebSocket connections, audio buffers, and text sync separately. Set up automation that:

  • Grabs OpenAI real-time API responses
  • Processes audio and text streams together
  • Keeps perfect sync by treating them as one data flow
  • Handles timing without manual calculations

You’ll eliminate all that complex scrolling math and buffer position tracking. Let automation handle sync while your React app just shows the results.

I’ve watched teams debug sync issues for weeks when they should’ve just automated the entire flow.

Skip the scrolling calculation completely. The realtime API already sends audio chunks with timestamps - use those instead of guessing with WPM math. I had the same problem and solved it by listening to the stream player’s progress events instead of doing manual sync. Much simpler, and it actually works.
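A minimal sketch of that progress-driven approach: derive the scroll position from how much of the audio has actually played rather than from wall-clock time. The assumption here is that the player can report a sample offset for the current track (e.g. something like WavStreamPlayer’s getTrackSampleOffset — treat the method name as illustrative, not authoritative):

```javascript
// Pure helper: map audio playback progress onto a scroll position.
function scrollTopForProgress(samplesPlayed, totalSamples, maxScrollTop) {
  if (totalSamples <= 0) return 0;
  const progress = Math.min(samplesPlayed / totalSamples, 1);
  return Math.round(maxScrollTop * progress);
}

// In the app this would be polled once per animation frame, roughly:
// async function tick() {
//   const { offset } = await streamPlayer.getTrackSampleOffset(); // assumed API
//   container.scrollTop = scrollTopForProgress(offset, totalSamples, maxScroll);
//   requestAnimationFrame(tick);
// }
```

Because progress comes from the audio itself, pauses and buffering stalls automatically pause the scroll too.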

You’re treating audio and text as separate streams when they should be one synchronized timeline. Don’t use conversation.updated events for both - split them into separate handlers.

Keep using streamPlayer.add16BitPCM() for audio like you’re doing now. For text sync, create a separate handler that buffers text updates and releases them based on actual audio playback timing.

Here’s what worked for me: track the streamPlayer’s internal buffer state and use that to calculate when each text chunk should appear. WavStreamPlayer has methods to get current playback position - use streamPlayer.getTrackSampleOffset() to know exactly where you are in the audio timeline.

Map your text chunks to sample positions instead of time estimates. When delta.text arrives, store it with its corresponding audio sample position. Only display text when streamPlayer reaches that position.

This kills the guesswork of WPM calculations and directly ties text appearance to actual audio playback progress. Sync stays accurate regardless of speech speed changes or buffer delays.
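A sketch of that buffering idea: count the PCM samples you enqueue so each text delta gets tagged with the sample position where its audio begins, then release text only once the player’s reported offset passes that position. The queue itself is plain JS; only the source of the playback offset (e.g. getTrackSampleOffset) is assumed:

```javascript
// Buffer text deltas keyed by audio sample position; release them
// only when playback has actually reached that position.
class TextSampleQueue {
  constructor() {
    this.pending = [];        // { samplePos, text } in arrival order
    this.enqueuedSamples = 0; // total samples handed to the player so far
  }

  // Call alongside streamPlayer.add16BitPCM(): the text delta's audio
  // starts at the current cumulative sample count.
  push(text, audioChunkLength) {
    this.pending.push({ samplePos: this.enqueuedSamples, text });
    this.enqueuedSamples += audioChunkLength;
  }

  // Call from a rAF loop with the player's current sample offset; returns
  // every chunk whose audio has started playing.
  drain(currentOffset) {
    const due = [];
    while (this.pending.length && this.pending[0].samplePos <= currentOffset) {
      due.push(this.pending.shift().text);
    }
    return due;
  }
}
```

Usage: on each conversation.updated delta, call `queue.push(delta.transcript, delta.audio.length)`; in your animation loop, append whatever `queue.drain(offset)` returns to the visible text.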

The problem is you’re mixing streaming audio with batch text processing. conversation.updated fires for every delta chunk, but your audio has buffering delays that text doesn’t know about.

Ditch the Date.now() timestamps and use token-based sync instead. When you get delta.audio chunks, store the text tokens with their expected playback times based on your audio context’s sample rate and buffer size. Since WavStreamPlayer runs at 24kHz, calculate when each text segment should show up by mapping it to actual audio sample positions.

Build a text queue that releases tokens based on streamPlayer.getCurrentTime() - not wall clock time. This handles real audio processing delays and browser pipeline latency. Then make your smooth scrolling respond to this controlled text release instead of guessing at duration.
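The sample-to-time mapping described above could be sketched like this. The 24 kHz rate matches the stream in the question; the audio-clock source (e.g. AudioContext.currentTime relative to playback start) is an assumption:

```javascript
const SAMPLE_RATE = 24000; // the realtime stream's PCM rate per the question

// Seconds into the track at which a given cumulative sample count plays.
function expectedPlaybackTime(cumulativeSamples, sampleRate = SAMPLE_RATE) {
  return cumulativeSamples / sampleRate;
}

// Release queued tokens whose expected time has passed on the audio clock.
// queue: [{ at: seconds, token: string }], sorted by `at`.
function releaseDue(queue, audioClockSeconds) {
  const due = [];
  while (queue.length && queue[0].at <= audioClockSeconds) {
    due.push(queue.shift().token);
  }
  return due;
}
```

The smooth-scrolling code can then animate toward the end of the released text, instead of interpolating over a guessed total duration.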

Bottom line: sync to the audio timeline, don’t force audio to match text timing.

You’re trying to sync text with streaming audio, but you’re doing it backwards. Don’t calculate scroll duration from word count - tie your text sync directly to the WavStreamPlayer’s buffer position instead.

Here’s the key: stop updating text right when delta.audio arrives. Those audio chunks come in way faster than they actually play. I built a queue system that delays text updates to match the real audio playback delay, and it works much better.

To get the actual delay, track when you call streamPlayer.add16BitPCM() versus when that audio starts playing. That’s your real buffer delay - use it to offset your text updates. Also, ditch the Date.now() calculations and use the audio context’s currentTime property for precise playback positioning.
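One way to sketch that delay measurement. Clock values are passed in explicitly so the logic stays testable; in the app they would both come from the same clock (e.g. performance.now(), or the AudioContext clock converted to milliseconds):

```javascript
// Estimate the real buffer delay: timestamp each chunk when it is handed
// to the player and again when the audio clock says it starts playing,
// then offset text display by the rolling average of the difference.
class BufferDelayTracker {
  constructor(windowSize = 20) {
    this.windowSize = windowSize;
    this.delays = [];
  }

  // Both arguments are milliseconds on the same clock.
  observe(enqueuedAtMs, playStartedAtMs) {
    this.delays.push(playStartedAtMs - enqueuedAtMs);
    if (this.delays.length > this.windowSize) this.delays.shift();
  }

  // Average observed delay; delay text updates by this amount.
  currentDelayMs() {
    if (this.delays.length === 0) return 0;
    return this.delays.reduce((a, b) => a + b, 0) / this.delays.length;
  }
}
```

With this in place, a text delta that arrives at time t is shown at t + currentDelayMs() instead of immediately, which absorbs the gap between chunk arrival and actual playback.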

I’ve dealt with this sync problem too. Skip the WPM estimates and use the streamPlayer’s actual playback position instead. Track where the audio buffer is and sync your text rendering to that. Also check your delta.audio chunks for timing metadata - the realtime API often includes timing info that works way better than doing the math yourself.