Trouble using 'speaker_wav' in JavaScript API calls for coqui-ai tts-server

I’m stuck trying to use the ‘speaker_wav’ feature with the tts-server API from coqui-ai. I can make it work with voices that don’t need extra audio files, like ‘tts_models/en/jenny/jenny’. But now I need to use XTTS with a ‘speaker_wav’ file to clone a voice.

I’ve tried:

  1. Adding the speaker_wav path to the query string (roughly as sketched after this list)
  2. Putting everything in a JSON-stringified body
  3. Using FormData
  4. Mixing query string and body data
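
For reference, the query-string attempt looked roughly like this (I’m guessing at the parameter names the server expects, which may be part of the problem):

const params = new URLSearchParams({
  text: 'Hello',
  speaker_wav: 'path/to/speaker.wav',  // local path on the machine running tts-server
  language_id: 'en'                    // guessing at this parameter name too
});

const response = await fetch(`http://localhost:5002/api/tts?${params}`);
const audio = await response.arrayBuffer();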

Nothing seems to work. I can get it running with the terminal tool like this:

tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "Hello" --speaker_wav "path/to/speaker.wav" --language_idx en

But I want to avoid the startup time for each call. Is there a way to include the speaker_wav when starting the server? Here’s my current batch file:

set TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1
echo TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=%TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD%
"path/to/tts-server.exe" --model_path "path/to/model" --config_path "path/to/config.json"

I’ve looked at the server code and tried tweaking it, but no luck. Any ideas on how to make this work with JavaScript API calls?

hey there, I’ve been messing with this too. Have you tried using fetch with a blob? Like this:

let audioFile = await fetch('path/to/speaker.wav')
let audioBlob = await audioFile.blob()

then add that to your formData. Might work better than just the file path. Worth a shot anyway!
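
Something like this, end to end, assuming the server actually reads a speaker_wav form field (I haven’t confirmed that part):

const audioFile = await fetch('path/to/speaker.wav')
const audioBlob = await audioFile.blob()

const formData = new FormData()
formData.append('text', 'Hello')
formData.append('speaker_wav', audioBlob, 'speaker.wav')  // third argument sets the filename

const response = await fetch('http://localhost:5002/api/tts', {
  method: 'POST',
  body: formData
})
const audio = await response.arrayBuffer()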

I’ve been working with the coqui-ai tts-server API for a while now, and I can tell you that handling the ‘speaker_wav’ feature can be tricky. One approach that’s worked well for me is using a Base64 encoded string of the audio file in the request body.

Here’s a snippet that might help:

const fs = require('fs');
const axios = require('axios');

const audioFile = fs.readFileSync('path/to/speaker.wav', { encoding: 'base64' });

axios.post('http://localhost:5002/api/tts', {
  text: 'Your text here',
  speaker_wav: audioFile  // Base64 string; assumes the server accepts it in the JSON body
}, {
  headers: { 'Content-Type': 'application/json' },
  responseType: 'arraybuffer'  // the response is binary audio, not JSON
})
.then(response => {
  // Write the returned audio to disk
  fs.writeFileSync('output.wav', Buffer.from(response.data));
})
.catch(error => console.error('Error:', error));

This method avoids issues with file paths and FormData complexities. Just make sure your server can handle Base64 encoded audio. As for reducing startup time, you might want to look into using PM2 or similar tools to keep your server running persistently. It’s been a game-changer for my workflow.
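
To give you an idea, here’s a minimal PM2 config I’d start from (paths are placeholders, and I haven’t tested PM2 against the Windows .exe specifically):

// ecosystem.config.js -- minimal sketch, adjust the paths for your setup
module.exports = {
  apps: [{
    name: 'tts-server',
    script: 'path/to/tts-server.exe',
    args: '--model_path path/to/model --config_path path/to/config.json',
    interpreter: 'none',  // run the binary directly instead of through Node
    env: {
      TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD: '1'
    }
  }]
};

Running pm2 start ecosystem.config.js then keeps the server alive and restarts it if it crashes, so you only pay the model-loading cost once.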

I’ve encountered similar issues when working with the coqui-ai tts-server API, particularly with the ‘speaker_wav’ feature. From my experience, the most reliable method for including the speaker_wav file is by using a multipart/form-data request.

Here’s a JavaScript snippet that worked for me:

// audioBlob is the reference voice, e.g. await (await fetch('path/to/speaker.wav')).blob()
const formData = new FormData();
formData.append('text', 'Your text here');
formData.append('speaker_wav', audioBlob, 'speaker.wav');

fetch('http://localhost:5002/api/tts', {
  method: 'POST',
  body: formData
})
  .then(response => response.arrayBuffer())
  .then(audioData => {
    // Handle the audio data (see below)
  });
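
For the last step, one way to actually use the returned audio in the browser (plain Web APIs, assuming the response is WAV bytes):

function playTtsAudio(audioData) {
  // Wrap the raw bytes in a Blob and play it through an audio element
  const blob = new Blob([audioData], { type: 'audio/wav' });
  const url = URL.createObjectURL(blob);
  new Audio(url).play();
}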

Make sure your server is running with the XTTS model loaded. As for reducing startup time, you might consider keeping the server running as a background process or in a container that stays up, so you only pay the model-loading cost once. Hope this helps!