I’m trying to build a system that transcribes phone calls in real time using Twilio audio streams, n8n automation, and the Deepgram API. Here’s my current setup:
What I have working:
Twilio call streaming configured with <Stream> element pointing to WebSocket proxy
Proxy server that captures media payloads and forwards base64-encoded audio to my n8n webhook endpoint (rough sketch after this list)
n8n workflow with Function node designed to process the audio data and send it to Deepgram
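For context, the proxy logic is roughly this - a simplified sketch, with the port, webhook URL, and the encodedAudio field name standing in for my real values:

// Simplified proxy sketch: accept Twilio's media stream over WebSocket
// and forward each base64 audio payload to the n8n webhook.
// Needs the 'ws' package and Node 18+ for global fetch; TLS termination
// is omitted here, but Twilio requires a wss:// endpoint in production.
const { WebSocketServer } = require('ws');

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', async (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.event !== 'media') return; // skip connected/start/stop frames
    await fetch('https://my-n8n-host/webhook/twilio-audio', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ encodedAudio: msg.media.payload })
    });
  });
});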
The issue I’m facing:
My audio processing node in n8n isn’t producing any output, which means Deepgram gets empty data and can’t perform transcription.
Specific questions:
Am I handling the base64 decoding correctly for Twilio’s media payload format?
What’s the proper way to convert and pass Twilio’s streamed audio data to Deepgram so it can process it for transcription?
Here’s my current n8n Function node implementation:
const encodedAudio = $json.encodedAudio;
if (!encodedAudio) {
  throw new Error('Audio data not found');
}
const audioBuffer = Buffer.from(encodedAudio, 'base64');
return [{
  binary: {
    audioFile: {
      data: audioBuffer,
      mimeType: 'audio/mulaw',
      fileName: 'recording.wav'
    }
  }
}];
Check your webhook config first - n8n probably isn’t getting the payload right. When I debugged mine, my proxy was mangling the JSON structure. Log $json directly to see what’s actually coming through - something like the snippet below. Also heads up - Twilio mixes control messages (‘connected’, ‘start’, ‘stop’) in with the audio, so filter those out before you process anything.
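Something like this at the top of your Function node shows what’s actually arriving and drops the junk (field names assume your proxy forwards Twilio’s event field along with the audio - adjust to match your payload):

// Log the raw webhook input, then skip anything that isn't a usable media frame.
console.log(JSON.stringify($json, null, 2));

if ($json.event && $json.event !== 'media') {
  return []; // drop control/keepalive messages
}
if (!$json.encodedAudio) {
  return []; // drop empty packets instead of throwing
}
return [{ json: { encodedAudio: $json.encodedAudio } }];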
This is a Twilio streaming audio issue - I’ve hit this exact problem building call transcription systems. Your function node processes single audio packets, but Twilio fires off a constant stream of tiny 20ms chunks - about 50 per second, only 160 bytes each. That’s way too small for Deepgram to handle.
You need to buffer these chunks into bigger segments before sending to Deepgram. Set up a buffer that collects 2-3 seconds of audio data, then send that combined chunk for transcription.
Also double-check your proxy is forwarding the media payloads correctly. Twilio sends different message types - you only want the ‘media’ messages, not the ‘start’ and ‘stop’ events. Your base64 decoding looks fine, but without proper buffering you’re just sending audio fragments that no transcription service can work with.
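A rough sketch of that buffering, using n8n’s workflow static data to persist chunks between webhook executions (note: static data only persists when the workflow is running in active/production mode, and the threshold assumes Twilio’s 8kHz mulaw, which is 8000 bytes per second):

// Accumulate base64 chunks across executions; emit only once roughly
// 2 seconds of audio (16000 bytes of 8kHz mulaw) has been collected.
const staticData = $getWorkflowStaticData('global');
if (!staticData.chunks) staticData.chunks = [];

staticData.chunks.push($json.encodedAudio);

const totalBytes = staticData.chunks.reduce(
  (sum, chunk) => sum + Buffer.from(chunk, 'base64').length,
  0
);
if (totalBytes < 16000) {
  return []; // not enough audio yet, keep buffering
}

const combined = Buffer.concat(
  staticData.chunks.map((chunk) => Buffer.from(chunk, 'base64'))
);
staticData.chunks = []; // reset for the next segment

// n8n expects binary data as a base64 string, not a raw Buffer
return [{
  json: {},
  binary: {
    audioFile: {
      data: combined.toString('base64'),
      mimeType: 'audio/mulaw',
      fileName: 'segment.ulaw'
    }
  }
}];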
Real-time audio processing gets messy, and n8n just wasn’t built to handle continuous WebSocket streams well.
Your code works fine for single chunks, but Twilio sends continuous packets that need buffering and sequential processing. That proxy server setup adds unnecessary complexity too.
I’ve done similar real-time transcription projects - Latenode handles streaming audio way better than n8n. Native WebSocket support means it can receive Twilio’s media streams directly without a proxy.
With Latenode you can:
Connect straight to Twilio’s WebSocket stream
Buffer audio chunks properly as they arrive
Send batched audio to Deepgram at optimal intervals
Handle continuous flow without dropping packets
Workflow’s cleaner too. No separate proxy server - just configure Twilio to stream directly to your Latenode webhook. It processes the base64 audio and manages Deepgram API calls automatically.
I switched from n8n to Latenode for exactly this kind of real-time processing. The reliability difference is huge - no more dropped chunks or empty transcriptions.
Your problem is audio format handling. Twilio streams mulaw-encoded data, but your function node labels it as standard WAV (raw mulaw bytes with a ‘recording.wav’ filename). I hit this exact issue last year building a similar transcription system.

Here’s what’s happening: raw mulaw has no headers, so Deepgram has nothing telling it the format - your current setup strips away the info it needs. Twilio’s media comes in 8kHz mulaw format, and you’ve got to deal with that before sending to Deepgram. Either convert mulaw to PCM in your function node or configure Deepgram to handle mulaw directly.

Timing’s another big issue. You’re probably processing individual packets instead of building up audio chunks. Real-time transcription works way better when you buffer a few seconds of audio before sending to Deepgram. Don’t process every single packet - modify your function to accumulate audio data and send bigger chunks. You’ll make fewer API calls and get much better transcription accuracy.
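For the ‘configure Deepgram to handle mulaw directly’ option, the call looks roughly like this - Deepgram’s /v1/listen endpoint accepts raw audio when you pass encoding and sample_rate in the query string (the API key env var and where the buffered segment comes from are placeholders):

// Send a buffered raw mulaw segment to Deepgram, declaring the encoding
// explicitly since headerless mulaw can't be auto-detected.
async function transcribeMulawSegment(audioBuffer) {
  const url = 'https://api.deepgram.com/v1/listen?encoding=mulaw&sample_rate=8000&channels=1';
  const response = await fetch(url, {
    method: 'POST',
    headers: {
      Authorization: `Token ${process.env.DEEPGRAM_API_KEY}`, // placeholder
      'Content-Type': 'audio/mulaw'
    },
    body: audioBuffer
  });
  const result = await response.json();
  return result.results.channels[0].alternatives[0].transcript;
}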