Python Real-time Voice Bot: Twilio WebSocket + OpenAI + ElevenLabs Audio Streaming Issue

I’m working on a live voice assistant that uses Twilio’s streaming API, ChatGPT, and ElevenLabs for text-to-speech. Everything seems to work fine except the caller can’t hear any audio responses.

The Setup

Building a phone bot where:

  • Person calls and speaks
  • OpenAI Whisper converts speech to text
  • ChatGPT creates a reply
  • ElevenLabs makes the reply into speech
  • Audio gets sent back through Twilio’s media stream

What’s Working

  • Phone calls connect properly
  • WebSocket receives all the right data
  • Speech recognition works great
  • AI generates good responses
  • ElevenLabs creates clear audio files
  • Audio conversion to mulaw format looks correct
  • My code sends audio chunks to Twilio (160 bytes each)
  • Logs show everything is processing

The Problem

The caller hears absolutely nothing. No greeting, no responses, just silence.

What I’ve Tried

  • Converting audio with: ffmpeg -y -i input.mp3 -f mulaw -acodec pcm_mulaw -ar 8000 -ac 1 output.raw
  • Using 160-byte pieces for proper timing
  • Adding silence padding before audio
  • Setting track to “inbound” in messages
  • Testing the converted audio files locally (they sound fine)

My Audio Streaming Code

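import asyncio
import base64
import json
import subprocess

async def send_audio_to_caller(websocket, session_id: str, mp3_file: str):
    # Convert the ElevenLabs MP3 to headerless 8 kHz mono mu-law
    mulaw_file = mp3_file.replace('.mp3', '.raw')
    subprocess.run([
        "ffmpeg", "-y", "-i", mp3_file,
        "-f", "mulaw", "-acodec", "pcm_mulaw",
        "-ar", "8000", "-ac", "1", mulaw_file
    ], check=True)  # raise if ffmpeg fails instead of streaming a stale file

    # Send 160-byte chunks (20 ms of audio each at 8 kHz)
    with open(mulaw_file, "rb") as audio_file:
        while data := audio_file.read(160):
            payload = {
                "event": "media",
                "streamSid": session_id,
                "media": {
                    "track": "inbound",
                    "payload": base64.b64encode(data).decode("utf-8")
                }
            }
            await websocket.send(json.dumps(payload))
            await asyncio.sleep(0.02)  # pace chunks at roughly real time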

My TwiML Config

<Start>
  <Stream url="wss://myserver.com/media" track="both_tracks">
    <Parameter name="Content-Type" value="audio/mulaw" />
  </Stream>
</Start>

Has anyone got Python + Twilio Media Streams working for two-way voice? What am I missing here?

Your payload’s probably wrong. You need "track": "outbound" when sending audio back to the caller, not "inbound". Inbound is audio coming from the caller to you; outbound is audio going from you to the caller. I ran into this exact issue building a voice bot last year: everything looked perfect, but callers heard nothing until I fixed that track parameter. Also make sure your WebSocket handler is using the streamSid from Twilio’s "start" message. With the wrong stream ID, your audio gets silently dropped.
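
A minimal sketch of the streamSid handling, in case it helps. The function names are mine, not Twilio’s, and "MZ123" in the usage note is a placeholder ID. Two assumptions to check against the current Media Streams docs: first, that the stream is bidirectional (opened with <Connect><Stream>; as far as I know, a <Start><Stream> only delivers audio to your server and ignores anything you send back); second, that outgoing media messages carry only a base64 payload under "media", which is how I read the documented format, so the sketch leaves "track" out.

```python
import base64
import json
from typing import Optional


def extract_stream_sid(raw_message: str) -> Optional[str]:
    """Return the streamSid from a Twilio "start" event, else None."""
    message = json.loads(raw_message)
    if message.get("event") != "start":
        return None
    # The "start" event carries the ID at the top level and under "start".
    return message.get("streamSid") or message.get("start", {}).get("streamSid")


def build_media_message(stream_sid: str, mulaw_chunk: bytes) -> str:
    """Build one outgoing "media" message for headerless 8 kHz mu-law audio."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        # Assumption: only "payload" is needed here (see the note above).
        "media": {"payload": base64.b64encode(mulaw_chunk).decode("ascii")},
    })
```

In the WebSocket handler you would call extract_stream_sid on every incoming message, keep the first non-None result (something like "MZ123"), and pass it to build_media_message for each chunk, rather than reusing an ID from anywhere else.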

Been debugging Twilio streaming issues for months. That 0.02 sleep isn’t enough - bump it to 0.025 or 0.03. Twilio drops packets when they come too fast or out of order. Also check your base64 encoding for padding issues with certain audio chunks. Here’s what got me: Twilio wants exact timing patterns. If the last chunk read from the file isn’t exactly 160 bytes, pad it with mu-law silence (0xFF bytes, not zeros) instead of leaving it short.
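
A rough sketch of fixed-size framing with padding, assuming the input is headerless 8 kHz mu-law as produced by the ffmpeg command in the question. The helper names are made up for illustration. The pad byte is 0xFF because that is the mu-law code for a zero-amplitude sample; a 0x00 byte decodes to a full-scale sample and would click. Whether Twilio actually needs exact 160-byte frames is this thread’s assumption, not something the sketch verifies.

```python
import base64
from typing import Iterator

FRAME_BYTES = 160      # 20 ms of 8 kHz, 8-bit mu-law audio
MULAW_SILENCE = 0xFF   # mu-law code for a zero-amplitude sample


def iter_frames(audio: bytes, frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Yield fixed-size frames, padding the final short one with silence."""
    for offset in range(0, len(audio), frame_bytes):
        frame = audio[offset:offset + frame_bytes]
        if len(frame) < frame_bytes:
            frame += bytes([MULAW_SILENCE]) * (frame_bytes - len(frame))
        yield frame


def encode_frame(frame: bytes) -> str:
    """Base64-encode one frame; b64encode adds any '=' padding itself."""
    return base64.b64encode(frame).decode("ascii")
```

Since 160 is not a multiple of 3, each encoded frame ends in "=" padding. That is normal base64 output from the standard library and should not need any manual fixing.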

Yeah, the track parameter issue is real, but you’ve got a bigger problem. Manually managing WebSocket connections, audio processing, and API calls is a debugging nightmare.

I fought similar Twilio streaming bugs for weeks until I went the automation route. The timing requirements are brutal - one hiccup and everything breaks.

Latenode handles this complexity for you. It connects Twilio, OpenAI, and ElevenLabs without writing WebSocket code or dealing with audio format conversions. The platform manages streaming, timing, and proper payload formatting automatically.

Built three voice bots this way and never had to debug mulaw conversion or track parameters again. The visual workflow makes it easy to see data flow and catch issues early.

Your approach works but it’s fragile. Every time Twilio changes something or you add features, you’re back to debugging WebSocket messages and audio chunks.

hey, your track settings seem off – “inbound” is for audio going TO Twilio. you should change it to “outbound” in your payload so that the caller can actually hear something.