I’m working on a React app that uses OpenAI’s Realtime API for live voice transcription and response generation. The main problem is that the text and audio don’t stay in sync: sometimes the text shows up too early, and sometimes the audio plays with a delay.
My current setup:
- Using WebSockets for real-time audio streaming to OpenAI
- RealtimeClient handles sending and receiving audio responses
- AudioRecorder and AudioPlayer manage streaming and playback (16-bit PCM format)
- Text updates happen dynamically as responses come in
Here’s how I establish the connection:
    const startSession = useCallback(async () => {
      const apiClient = clientRef.current;
      const audioRecorder = recorderRef.current;
      const audioPlayer = playerRef.current;
      try {
        // Set up the microphone and the audio output inside the try block,
        // so that a failure here is caught and logged as well.
        await audioRecorder.initialize();
        await audioPlayer.connect();
        const result = await apiClient.connect();
        if (result) {
          setIsLoading(false);
          apiClient.sendMessage([{ type: "input_text", text: "Hi there!" }]);
          // With server-side voice activity detection, stream the mic continuously.
          if (apiClient.getDetectionMode() === "server_vad") {
            await audioRecorder.start((audioData) => apiClient.addAudioInput(audioData.mono));
          }
        }
      } catch (err) {
        console.error("Connection failed:", err);
      }
    }, []);
And here’s how I handle the responses:
    apiClient.on("conversation.updated", async ({ item, delta }) => {
      if (item.role === "assistant" && delta?.audio) {
        audioPlayer.add16BitPCM(delta.audio, item.id);
        textRef.current = item.formatted.transcript;
      } else if (delta?.text) {
        textRef.current = item.formatted.transcript;
      }
      if (item.status === "completed" && item.formatted.audio?.length) {
        const audioFile = await AudioRecorder.decode(item.formatted.audio, 24000, 24000);
        setAudioSource(audioFile.url);
      }
    });
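One quantity that can be measured in this handler is how much audio has arrived so far, because 16-bit mono PCM has a fixed size per second. The sketch below is illustrative only and not part of the app: `createAudioClock` is a hypothetical helper, and it assumes `delta.audio` is raw 16-bit mono PCM (an `Int16Array` or `ArrayBuffer`) at 24 kHz, the same rate passed to `decode` above.

```javascript
// Assumption: assistant audio is 16-bit mono PCM sampled at 24 kHz.
const SAMPLE_RATE_HZ = 24000;
const BYTES_PER_SAMPLE = 2; // 16 bits

// Hypothetical helper: tracks how many milliseconds of audio
// have been received for each conversation item.
function createAudioClock(sampleRate = SAMPLE_RATE_HZ) {
  const receivedMsByItem = new Map();
  return {
    // `chunk` may be an Int16Array or an ArrayBuffer; both expose byteLength.
    addChunk(itemId, chunk) {
      const samples = chunk.byteLength / BYTES_PER_SAMPLE;
      const chunkMs = (samples / sampleRate) * 1000;
      const totalMs = (receivedMsByItem.get(itemId) ?? 0) + chunkMs;
      receivedMsByItem.set(itemId, totalMs);
      return totalMs;
    },
    // Total milliseconds received so far for an item (0 if unknown).
    receivedMs(itemId) {
      return receivedMsByItem.get(itemId) ?? 0;
    },
  };
}
```

Under those assumptions, calling `addChunk(item.id, delta.audio)` next to `add16BitPCM` would give the real duration of the audio received so far, rather than an estimate based on word count.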
Main issues:
- Can’t get text scrolling to sync with audio playback
- My scrolling is based on an estimated speaking rate of 150 words per minute:
    const performScroll = () => {
      if (!textContainerRef.current) return;
      const container = textContainerRef.current;
      const maxScrollTop = container.scrollHeight - container.clientHeight;
      const timeElapsed = Date.now() - scrollStartRef.current;
      const totalDuration = calculateScrollTime(textContent);
      // Past the estimated duration: jump to the end and stop animating.
      if (timeElapsed >= totalDuration) {
        container.scrollTop = maxScrollTop;
        return;
      }
      const scrollProgress = timeElapsed / totalDuration;
      // Quadratic ease-in-out over the 0..1 progress range.
      const smoothEasing = (t) => t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;
      container.scrollTop = maxScrollTop * smoothEasing(scrollProgress);
      frameRef.current = requestAnimationFrame(performScroll);
    };
What I’ve tried:
- Converting the 16-bit PCM to an audio source, but the conversion takes time and causes sync issues
- Using word count with 150 WPM rule:
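    const calculateScrollTime = (textContent) => {
      const wpm = 150;
      // Split on any whitespace so repeated spaces and newlines are not counted as words.
      const wordCount = textContent.trim().split(/\s+/).filter(Boolean).length;
      return (wordCount / wpm) * 60 * 1000; // estimated duration in milliseconds
    };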
This approach works okay for short responses but breaks down with longer ones because speech speed varies.
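For comparison, here is a minimal sketch of the alternative being asked about: deriving the scroll offset from audio playback progress instead of elapsed wall-clock time. `scrollTopForPlayback` is a hypothetical helper, and how `playedMs` would be obtained from the audio player is an open assumption that is not shown. The mapping is linear, so it also assumes the transcript is spread roughly evenly over the audio.

```javascript
// Hypothetical helper: map audio playback progress to a scroll offset.
//   playedMs     - milliseconds of audio actually played so far (assumed available)
//   totalMs      - total audio duration known so far, in milliseconds
//   maxScrollTop - container.scrollHeight - container.clientHeight
function scrollTopForPlayback(playedMs, totalMs, maxScrollTop) {
  // Nothing to scroll, or no audio yet: stay at the top.
  if (!(totalMs > 0) || maxScrollTop <= 0) return 0;
  // Clamp progress to the 0..1 range so the offset never overshoots.
  const progress = Math.min(Math.max(playedMs / totalMs, 0), 1);
  return maxScrollTop * progress;
}
```

In `performScroll`, this would replace the `timeElapsed / totalDuration` ratio, so the scroll position follows the audio even when the speech rate varies.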
Questions:
- What’s the best way to sync text scrolling with real-time audio?
- Are there any libraries or proven methods for handling this kind of synchronization?
Any help would be really appreciated!