I’m using OpenAI’s Realtime API (gpt-4o-realtime-preview-2024-12-17) in my React app and can’t keep the displayed text in sync with the audio. The transcribed text either shows up before the matching audio is played, or the audio lags behind the text.
My setup:
- Using WebSockets for streaming audio to OpenAI
- RealtimeClient handles the API communication
- AudioRecorder and AudioPlayer manage the streaming (16bitPCM format)
- Text updates happen as responses come in
Here’s how I connect to the API:
const startConnection = useCallback(async () => {
  const apiClient = clientRef.current;
  const audioRec = audioRecorderRef.current;
  const audioPlay = audioPlayerRef.current;
  try {
    // Start the microphone and the player inside the try block so that
    // permission or initialization errors are caught as well.
    await audioRec.start();
    await audioPlay.initialize();
    const result = await apiClient.connect();
    if (result) {
      setIsConnecting(false);
      // Send an initial text message to start the conversation.
      apiClient.sendMessage([{ type: "input_text", text: "Hi there!" }]);
      // With server-side voice activity detection, stream microphone audio continuously.
      if (apiClient.getDetectionMode() === "server_vad") {
        await audioRec.startRecording((audioData) => apiClient.addAudioInput(audioData.mono));
      }
    }
  } catch (err) {
    console.error("Connection failed:", err);
  }
}, []);
And here’s how I handle responses:
apiClient.on("conversation.updated", async ({ message, changes }) => {
  if (message.role === "assistant" && changes?.audio) {
    // Queue the new audio chunk for playback.
    audioPlayer.addPCMData(changes.audio, message.id);
    // Note: the transcript is updated when the chunk arrives, not when it is played.
    displayTextRef.current = message.formatted.transcript;
  } else if (changes?.text) {
    displayTextRef.current = message.formatted.transcript;
  }
  // Once the message is complete, convert the full PCM audio to a WAV for replay.
  if (message.status === "completed" && message.formatted.audio?.length) {
    const audioBlob = await AudioRecorder.convertToWav(message.formatted.audio, 24000, 24000);
    setPlaybackUrl(audioBlob.url);
  }
});
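Since the audio arrives in chunks, the amount of audio received for each message can be counted as it streams in, which gives a real duration instead of an estimate. This is only a minimal sketch: it assumes each chunk is an Int16Array of mono samples at 24000 Hz (the rate passed to convertToWav above), and createDurationTracker is a hypothetical helper name, not part of any client library.

```javascript
// Hypothetical helper: tracks how much audio has been received per message.
// Assumes 16-bit mono PCM, one Int16Array element per sample.
const SAMPLE_RATE = 24000; // assumed output rate, matching the convertToWav call

function createDurationTracker(sampleRate = SAMPLE_RATE) {
  const samplesById = new Map();
  return {
    // Call with each incoming audio chunk; returns the running sample count.
    addChunk(messageId, chunk) {
      const total = (samplesById.get(messageId) ?? 0) + chunk.length;
      samplesById.set(messageId, total);
      return total;
    },
    // Total audio received so far for a message, in milliseconds.
    durationMs(messageId) {
      return ((samplesById.get(messageId) ?? 0) / sampleRate) * 1000;
    },
  };
}
```

In the handler, this would be called as tracker.addChunk(message.id, changes.audio) next to addPCMData, assuming changes.audio really has that shape.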
The main issue:
I can’t get the text to scroll smoothly with the audio. I’m trying to scroll based on 150 words per minute:
const animateTextScroll = () => {
  if (!textContainerRef.current) return;
  const textBox = textContainerRef.current;
  const now = Date.now();
  const timeElapsed = now - animationStartRef.current;
  const totalDuration = calculateScrollTime(displayText);
  if (timeElapsed >= totalDuration) {
    textBox.scrollTop = textBox.scrollHeight - textBox.clientHeight;
    return;
  }
  const scrollProgress = timeElapsed / totalDuration;
  const maxScroll = textBox.scrollHeight - textBox.clientHeight;
  const smoothEasing = (t) => t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2;
  textBox.scrollTop = maxScroll * smoothEasing(scrollProgress);
  requestAnimationFrame(animateTextScroll);
};
What I’ve tried:
- Converting the PCM data to playable audio, but the conversion takes too long and causes sync issues of its own
- Using a fixed 150 WPM calculation, but it doesn’t work well for longer responses since speaking speed varies:
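const calculateScrollTime = (text) => {
  const avgWordsPerMin = 150;
  // Split on any whitespace and drop empty strings so extra spaces are not counted as words.
  const wordCount = text.trim().split(/\s+/).filter(Boolean).length;
  return (wordCount / avgWordsPerMin) * 60 * 1000; // estimated duration in milliseconds
};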
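For comparison, here is a rough sketch of driving the scroll from the playback position instead of a word-count estimate. It assumes the player can report how many milliseconds of the current message have been played (for example, if it is built on Web Audio, the difference between AudioContext.currentTime now and at playback start). Whether AudioPlayer exposes that is an assumption, and scrollTopForPlayback is a hypothetical helper name.

```javascript
// Hypothetical helper: maps the playback position to a scroll offset.
// playedMs:  milliseconds of audio already played (assumed to come from the player)
// totalMs:   total audio duration known so far, in milliseconds
// maxScroll: scrollHeight - clientHeight of the text container
function scrollTopForPlayback(playedMs, totalMs, maxScroll) {
  if (totalMs <= 0 || maxScroll <= 0) return 0;
  // Clamp to [0, 1] so early or late clock readings never overshoot.
  const progress = Math.min(Math.max(playedMs / totalMs, 0), 1);
  return maxScroll * progress;
}
```

Inside the requestAnimationFrame loop, textBox.scrollTop would then be set from this value on each frame, so the scroll follows the audio clock rather than Date.now().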
How can I get the text scrolling to match the actual audio playback timing? Are there any libraries or techniques that work well for this kind of real-time sync?