Speech-to-Speech

Endpoint

POST /v1/audio/speech-to-speech

Accepts an audio file, runs it through a three-stage pipeline (speech-to-text → LLM → text-to-speech), and returns the spoken reply as audio.

Pipeline

Audio in → Groq Whisper (STT) → Gateway LLM → Workers AI MeloTTS → Audio out
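
The chain above can be read as three awaited calls, each feeding the next. The following is a minimal sketch of that flow, not the gateway's actual implementation: `runPipeline` and the stage functions `transcribe`, `complete`, and `synthesize` are hypothetical names, injected as parameters so the real provider calls (Groq Whisper, the gateway LLM, MeloTTS) stay out of the example.

```javascript
// Hypothetical sketch of the STT -> LLM -> TTS chain. None of these names
// come from the gateway's source; each stage is an injected async function.
async function runPipeline(audio, systemPrompt, stages) {
  // Stage 1: speech-to-text (Groq Whisper in the real pipeline).
  const transcript = await stages.transcribe(audio);

  // Stage 2: the LLM answers the transcript, optionally guided by a system prompt.
  const messages = [];
  if (systemPrompt) messages.push({ role: 'system', content: systemPrompt });
  messages.push({ role: 'user', content: transcript });
  const reply = await stages.complete(messages);

  // Stage 3: text-to-speech (Workers AI MeloTTS in the real pipeline).
  const speech = await stages.synthesize(reply);

  return { transcript, reply, speech };
}
```

Because the stages run strictly in sequence, a failure in any one of them rejects the whole call, and the end-to-end latency is roughly the sum of the three.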

Request

Send the request body as multipart/form-data.

Field	Type	Default	Description
`file`	file	required	Audio file with the user’s voice input. Supported formats: mp3, mp4, wav, webm, m4a.
`system_prompt`	string	—	Optional system prompt to guide the LLM’s response personality or role.
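
A client can reject obviously unsupported files before uploading them. The sketch below assumes a check on the file extension is good enough on the client side; whether the gateway itself validates by extension or by content is not stated here. `hasSupportedExtension` and `buildRequestBody` are hypothetical helper names.

```javascript
// Extensions taken from the `file` field description above.
const SUPPORTED_EXTENSIONS = ['mp3', 'mp4', 'wav', 'webm', 'm4a'];

// Hypothetical helper: true when the file name ends in a supported extension.
function hasSupportedExtension(fileName) {
  const dot = fileName.lastIndexOf('.');
  if (dot === -1) return false; // no extension at all
  const ext = fileName.slice(dot + 1).toLowerCase();
  return SUPPORTED_EXTENSIONS.includes(ext);
}

// Hypothetical helper: builds the multipart body and leaves out
// system_prompt when it is empty or undefined.
function buildRequestBody(file, fileName, systemPrompt) {
  if (!hasSupportedExtension(fileName)) {
    throw new Error(`Unsupported audio format: ${fileName}`);
  }
  const body = new FormData();
  body.append('file', file, fileName); // file must be a Blob or File
  if (systemPrompt) body.append('system_prompt', systemPrompt);
  return body;
}
```

The returned `FormData` can be passed directly as the `body` of a `fetch` call; the runtime sets the multipart boundary in the `Content-Type` header itself.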

Response

Returns audio/mpeg (MP3) data.

Response Headers:

Header	Description
`x-transcribed-text`	URL-encoded transcription of the input audio.
`x-llm-response`	URL-encoded LLM text response, truncated to the first 500 characters.
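
Both header values have to be URL-decoded before use. Because `x-llm-response` is capped at 500 characters, it is worth guarding the decode: if the cut happens after encoding (an assumption, since the order is not specified here), the value could end in the middle of a percent-escape, which makes `decodeURIComponent` throw a `URIError`. The hypothetical helper `decodeHeader` below trims a broken tail and retries.

```javascript
// Hypothetical helper: decodes a URL-encoded header value and tolerates a
// value that was cut off in the middle of a percent-escape.
function decodeHeader(value) {
  if (!value) return '';
  let text = value;
  // A UTF-8 character spans at most 4 escapes (12 characters), so trimming
  // up to 12 trailing characters is enough to drop one broken sequence.
  for (let trimmed = 0; trimmed <= 12; trimmed++) {
    try {
      return decodeURIComponent(text);
    } catch (err) {
      if (!(err instanceof URIError)) throw err;
      text = text.slice(0, -1); // drop one character and try again
    }
  }
  return value; // still malformed: fall back to the raw value
}
```

With a `fetch` response this would be called as `decodeHeader(response.headers.get('x-llm-response'))`; a missing header yields an empty string.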

Free Limits

Free usage is bounded by three upstream quotas: Groq Whisper (8 hours of audio per day), the gateway LLM limits, and the Workers AI free-tier neuron allowance.
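
Of the three limits, only the Whisper one is stated as a number, so it is the only one a client can budget against directly. The sketch below converts it to seconds and checks whether a clip still fits. It assumes the quota is counted by audio duration and that the caller tracks its own usage for the day; the LLM and neuron limits are not modeled. All names are hypothetical.

```javascript
// 8 hours of audio per day, expressed in seconds (8 * 60 * 60 = 28800).
const WHISPER_DAILY_SECONDS = 8 * 60 * 60;

// Seconds of audio still available today, never negative.
function remainingSeconds(usedSeconds) {
  return Math.max(0, WHISPER_DAILY_SECONDS - usedSeconds);
}

// True when a clip of the given length still fits in today's budget.
function canSubmit(usedSeconds, clipSeconds) {
  return clipSeconds <= remainingSeconds(usedSeconds);
}
```

At 30 seconds per question, for example, the Whisper budget alone allows 28800 / 30 = 960 requests per day, before the other two limits are considered.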

Examples

curl:

curl https://your-gateway.workers.dev/v1/audio/speech-to-speech \
  -F "file=@question.mp3" \
  -F "system_prompt=You are a helpful assistant. Keep answers concise." \
  --output response.mp3

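JavaScript (browser):

// audioFile is a File or Blob, e.g. from an <input type="file"> or a MediaRecorder
const formData = new FormData();
formData.append('file', audioFile);
formData.append('system_prompt', 'You are a helpful assistant.');

const response = await fetch('https://your-gateway.workers.dev/v1/audio/speech-to-speech', {
  method: 'POST',
  body: formData,
});

// Stop early on an error status instead of trying to play an error body
if (!response.ok) {
  throw new Error(`Speech-to-speech request failed with status ${response.status}`);
}

// Play the returned MP3
const audioBlob = await response.blob();
const audioUrl = URL.createObjectURL(audioBlob);
const audio = new Audio(audioUrl);
audio.play();

// Read metadata from the response headers (values are URL-encoded)
const transcribed = decodeURIComponent(response.headers.get('x-transcribed-text') || '');
const llmResponse = decodeURIComponent(response.headers.get('x-llm-response') || '');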