I’m working on a live voice assistant that uses Twilio’s streaming API, ChatGPT, and ElevenLabs for text-to-speech. Everything seems to work fine except the caller can’t hear any audio responses.
The Setup
Building a phone bot where:
- Person calls and speaks
- OpenAI Whisper converts speech to text
- ChatGPT creates a reply
- ElevenLabs makes the reply into speech
- Audio gets sent back through Twilio’s media stream
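In code terms, the per-utterance flow is roughly the following (all the stage names here are hypothetical placeholders standing in for the real Whisper/ChatGPT/ElevenLabs calls, not actual API signatures):

```python
def handle_utterance(audio_in: bytes,
                     transcribe, reply, synthesize, send_back) -> None:
    """One pass through the pipeline; each stage is an injected callable."""
    text = transcribe(audio_in)   # speech -> text (e.g. Whisper)
    answer = reply(text)          # text -> reply text (e.g. ChatGPT)
    speech = synthesize(answer)   # reply -> audio bytes (e.g. ElevenLabs)
    send_back(speech)             # stream audio back over the Twilio WebSocket
```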
What’s Working
- Phone calls connect properly
- WebSocket receives all the right data
- Speech recognition works great
- AI generates good responses
- ElevenLabs creates clear audio files
- Audio conversion to mulaw format looks correct
- My code sends audio chunks to Twilio (160 bytes each)
- Logs show everything is processing
The Problem
The caller hears absolutely nothing. No greeting, no responses, just silence.
What I’ve Tried
- Converting audio with `ffmpeg -y -i input.mp3 -f mulaw -acodec pcm_mulaw -ar 8000 -ac 1 output.raw`
- Using 160-byte chunks for proper 20 ms timing
- Adding silence padding before audio
- Setting track to “inbound” in messages
- Testing the converted audio files locally (they sound fine)
My Audio Streaming Code
```python
import asyncio
import base64
import json
import subprocess

async def send_audio_to_caller(websocket, session_id: str, mp3_file: str):
    # Convert the ElevenLabs MP3 to headerless 8 kHz mono mu-law
    mulaw_file = mp3_file.replace('.mp3', '.raw')
    subprocess.run([
        "ffmpeg", "-y", "-i", mp3_file,
        "-f", "mulaw", "-acodec", "pcm_mulaw",
        "-ar", "8000", "-ac", "1", mulaw_file,
    ], check=True)

    with open(mulaw_file, "rb") as audio_file:
        # 160 bytes of 8 kHz mu-law = 20 ms per frame
        while data := audio_file.read(160):
            payload = {
                "event": "media",
                "streamSid": session_id,
                "media": {
                    "track": "inbound",
                    "payload": base64.b64encode(data).decode("utf-8"),
                },
            }
            await websocket.send(json.dumps(payload))
            await asyncio.sleep(0.02)  # pace frames at real time
```
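To rule out payload corruption, I also round-tripped the message building. Note this helper omits the `track` field, since I'm not sure outbound messages need it, and the `streamSid` value in the check is made up; this only proves the base64/JSON layer is lossless:

```python
import base64
import json

def build_media_message(stream_sid: str, chunk: bytes) -> str:
    """Build one outbound media message (shape assumed from Twilio's docs)."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(chunk).decode("utf-8")},
    })

# round-trip: whatever bytes go in must come back out of the payload
raw = bytes(range(160))
msg = json.loads(build_media_message("MZ0000", raw))
assert base64.b64decode(msg["media"]["payload"]) == raw
```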
My TwiML Config
```xml
<Start>
  <Stream url="wss://myserver.com/media" track="both_tracks">
    <Parameter name="Content-Type" value="audio/mulaw" />
  </Stream>
</Start>
```
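For context, Twilio's docs describe bidirectional streaming via `<Connect><Stream>`, whereas `<Start><Stream>` only forks audio out to the server. I haven't ruled out needing to switch; the variant I haven't tried yet would look like:

```xml
<Connect>
  <Stream url="wss://myserver.com/media" />
</Connect>
```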
Has anyone got Python + Twilio Media Streams working for two-way voice? What am I missing here?