How to synchronize text display with audio playback in OpenAI real-time voice API using React

I’m working on a React app that uses OpenAI’s real-time API for live voice transcription and response generation. The main problem I’m having is that the text and audio are not syncing properly. Sometimes the text shows up too early or the audio plays with a delay.

My current setup:

  • Using WebSockets for real-time audio streaming to OpenAI
  • RealtimeClient handles sending and receiving audio responses
  • AudioRecorder and AudioPlayer manage streaming and playback (16bitPCM format)
  • Text updates happen dynamically as responses come in

Here’s how I establish the connection:

const startSession = useCallback(async () => {
  const apiClient = clientRef.current;
  const audioRecorder = recorderRef.current;
  const audioPlayer = playerRef.current;
  
  await audioRecorder.initialize();
  await audioPlayer.connect();
  
  try {
    const result = await apiClient.connect();
    if (result) {
      setIsLoading(false);
      apiClient.sendMessage([{ type: "input_text", text: "Hi there!" }]);
      
      if (apiClient.getDetectionMode() === "server_vad") {
        await audioRecorder.start((audioData) => apiClient.addAudioInput(audioData.mono));
      }
    }
  } catch (err) {
    console.error("Connection failed:", err);
  }
}, []);

And here’s how I handle the responses:

apiClient.on("conversation.updated", async ({ item, delta }) => {
  if (item.role === "assistant" && delta?.audio) {
    audioPlayer.add16BitPCM(delta.audio, item.id);
    textRef.current = item.formatted.transcript;
  } else if (delta?.text) {
    textRef.current = item.formatted.transcript;
  }
  
  if (item.status === "completed" && item.formatted.audio?.length) {
    const audioFile = await AudioRecorder.decode(item.formatted.audio, 24000, 24000);
    setAudioSource(audioFile.url);
  }
});

Main issues:

  1. Can’t get text scrolling to sync with audio playback
  2. My scrolling is driven by a fixed 150 words-per-minute estimate rather than by the actual playback position:

const performScroll = () => {
  if (!textContainerRef.current) return;
  
  const container = textContainerRef.current;
  const now = Date.now();
  const timeElapsed = now - scrollStartRef.current;
  const totalDuration = calculateScrollTime(textContent);
  
  if (timeElapsed >= totalDuration) {
    container.scrollTop = container.scrollHeight - container.clientHeight;
    return;
  }
  
  const scrollProgress = timeElapsed / totalDuration;
  const maxScrollTop = container.scrollHeight - container.clientHeight;
  
  const smoothEasing = (t) => t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;
  container.scrollTop = maxScrollTop * smoothEasing(scrollProgress);
  
  frameRef.current = requestAnimationFrame(performScroll);
};

What I’ve tried:

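  • Converting the 16-bit PCM to an audio source, but the conversion takes time and causes sync issues of its own
  • Estimating the scroll duration from the word count with a 150 WPM rule:

const calculateScrollTime = (textContent) => {
  const wpm = 150;
  const wordCount = textContent.split(" ").length;
  return (wordCount / wpm) * 60 * 1000; // duration in milliseconds
};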

This approach works okay for short responses but breaks down with longer ones because speech speed varies.

Questions:

  1. What’s the best way to sync text scrolling with real-time audio?
  2. Are there any libraries or proven methods for handling this kind of synchronization?

Any help would be really appreciated!

I’ve been working on similar stuff and found that using the audio player’s currentTime property works way better than estimating from WPM. You can track the playback position and map it to a character position in your text, then scroll to that position. Also make sure you buffer the audio chunks before starting playback. That helped me a lot with the delay issues.
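
Here is a rough sketch of that mapping, not a drop-in solution. It assumes 24 kHz mono 16-bit PCM (the rate used in the question's decode call), and it interpolates linearly, so it will still drift a little inside a response. All helper names (pcmDurationSeconds, charIndexAt, scrollTopAt) are made up for illustration and are not part of any library.

```javascript
// Hypothetical helpers: the names and the 24 kHz mono PCM16 format are assumptions.
const SAMPLE_RATE = 24000;
const BYTES_PER_SAMPLE = 2; // 16-bit samples

// Duration of a raw PCM buffer, derived from its size in bytes.
function pcmDurationSeconds(byteLength, sampleRate = SAMPLE_RATE) {
  return byteLength / BYTES_PER_SAMPLE / sampleRate;
}

// Fraction of the audio played so far, clamped to [0, 1].
function playbackProgress(playedSeconds, totalSeconds) {
  if (totalSeconds <= 0) return 0;
  return Math.min(Math.max(playedSeconds / totalSeconds, 0), 1);
}

// Character index that corresponds to the current playback position.
function charIndexAt(playedSeconds, totalSeconds, textLength) {
  return Math.floor(playbackProgress(playedSeconds, totalSeconds) * textLength);
}

// Scroll offset that corresponds to the current playback position.
function scrollTopAt(playedSeconds, totalSeconds, maxScrollTop) {
  return playbackProgress(playedSeconds, totalSeconds) * maxScrollTop;
}
```

Inside performScroll you would then replace the timeElapsed / totalDuration ratio with a value from scrollTopAt, feeding it the player's real playback position instead of Date.now().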

Synchronizing audio playback with text display is tricky, but an approach that has worked reliably for me is a timing queue that logs the arrival time of each audio segment. Instead of relying on a words-per-minute estimate, use the timestamps of the WebSocket messages to align each portion of text with its audio segment. Buffer the audio before starting playback: I typically wait until around 200 ms of audio has accumulated, which noticeably reduces jitter. To keep the scrolling accurate, read AudioContext.currentTime from the Web Audio API. It tracks the actual playback clock, so it drifts far less than Date.now() on longer responses.
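
A minimal sketch of such a queue, under a few assumptions: the audio is 24 kHz mono 16-bit PCM, and each audio delta is recorded together with the transcript length known at that moment. The TimingQueue class and its method names are hypothetical. Note one deliberate difference from the description above: the arrival time is stored for reference, but the lookup is keyed on cumulative audio duration, because arrival times reflect network pacing rather than playback.

```javascript
// Hypothetical timing queue; class and method names are illustrative only.
class TimingQueue {
  constructor(sampleRate = 24000, bytesPerSample = 2) {
    this.sampleRate = sampleRate; // assumed: 24 kHz
    this.bytesPerSample = bytesPerSample; // assumed: 16-bit mono
    this.segments = []; // { startSec, endSec, textLength, arrivedAt }
    this.totalSeconds = 0; // total audio received so far
  }

  // Record one audio chunk and the transcript length at the time it arrived.
  push(audioByteLength, textLength, arrivedAt = Date.now()) {
    const duration = audioByteLength / this.bytesPerSample / this.sampleRate;
    const startSec = this.totalSeconds;
    this.totalSeconds += duration;
    this.segments.push({ startSec, endSec: this.totalSeconds, textLength, arrivedAt });
  }

  // True once enough audio has accumulated to start playback (default 200 ms).
  hasBuffered(minSeconds = 0.2) {
    return this.totalSeconds >= minSeconds;
  }

  // Number of transcript characters that should be visible at this playback time.
  textLengthAt(playedSeconds) {
    let visible = 0;
    for (const seg of this.segments) {
      if (seg.startSec > playedSeconds) break;
      visible = seg.textLength;
    }
    return visible;
  }
}
```

In the conversation.updated handler you could call something like queue.push(delta.audio.byteLength, item.formatted.transcript.length), assuming delta.audio is a typed array, and start playback once queue.hasBuffered() returns true. On each animation frame, pass the elapsed playback time (for example audioContext.currentTime minus the time playback started) to textLengthAt and reveal or scroll to that many characters.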