
Speech-to-Speech

POST /v1/audio/speech-to-speech

Takes audio input, processes it through a three-stage pipeline (STT → LLM → TTS), and returns audio output.

Audio in → Groq Whisper (STT) → Gateway LLM → Workers AI MeloTTS → Audio out

Request body (multipart/form-data):

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | required | Audio file with the user’s voice input. Supported formats: mp3, mp4, wav, webm, m4a. |
| system_prompt | string | | Optional system prompt to guide the LLM’s response personality or role. |

Returns audio/mpeg (MP3) data.

Response Headers:

| Header | Description |
| --- | --- |
| x-transcribed-text | URL-encoded transcription of the input audio. |
| x-llm-response | URL-encoded LLM text response (first 500 chars). |
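
Because both header values are URL-encoded, a client has to decode them before showing them to a user. The following is a minimal Python sketch of that step, not part of the gateway itself. The helper name `decode_pipeline_headers` is hypothetical, and the sketch assumes percent-encoding (spaces sent as `%20`, as JavaScript's `encodeURIComponent` produces); if the gateway encodes spaces as `+`, `urllib.parse.unquote_plus` would be the better choice.

```python
from typing import Mapping
from urllib.parse import unquote


def decode_pipeline_headers(headers: Mapping[str, str]) -> dict[str, str]:
    """Decode the two text headers documented for this endpoint.

    Hypothetical helper: assumes the values are percent-encoded UTF-8.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        # Transcription of the input audio.
        "transcribed_text": unquote(lowered.get("x-transcribed-text", "")),
        # LLM reply; the gateway only sends the first 500 characters here.
        "llm_response": unquote(lowered.get("x-llm-response", "")),
    }
```

Missing headers decode to empty strings, so the caller can treat the text as optional metadata next to the MP3 body. Since `x-llm-response` is truncated, it is better suited to logging or captions than to reconstructing the full reply.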

Rate limits: this endpoint is bound by the Groq Whisper quota (8 hours of audio per day), the gateway's LLM limits, and the Workers AI free-tier neuron allowance.

Example request:
curl https://your-gateway.workers.dev/v1/audio/speech-to-speech \
-F file=@question.mp3 \
-F system_prompt="You are a helpful assistant. Keep answers concise." \
--output response.mp3
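
For callers who prefer a script over curl, the same request can be sketched in Python with only the standard library. This is an illustrative sketch, not an official client: the function names `build_multipart` and `speech_to_speech` are hypothetical, the URL is the placeholder from the curl example, and the per-file content type is guessed from the file extension, which the endpoint may or may not inspect.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Optional

# Placeholder host taken from the curl example; replace with your gateway.
GATEWAY_URL = "https://your-gateway.workers.dev/v1/audio/speech-to-speech"


def build_multipart(audio: bytes, filename: str,
                    system_prompt: Optional[str] = None) -> tuple[bytes, str]:
    """Encode the documented form fields as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    file_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    )
    parts = [file_head.encode("utf-8"), audio, b"\r\n"]
    if system_prompt is not None:
        # system_prompt is optional, so the part is only added when given.
        prompt_part = (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="system_prompt"\r\n\r\n'
            f"{system_prompt}\r\n"
        )
        parts.append(prompt_part.encode("utf-8"))
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def speech_to_speech(audio_path: str, out_path: str,
                     system_prompt: Optional[str] = None,
                     url: str = GATEWAY_URL) -> dict[str, str]:
    """POST an audio file, save the MP3 reply, and return the response headers."""
    source = Path(audio_path)
    body, content_type = build_multipart(source.read_bytes(), source.name,
                                         system_prompt)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())  # audio/mpeg payload
        return dict(response.headers.items())
```

Calling `speech_to_speech("question.mp3", "response.mp3", "You are a helpful assistant. Keep answers concise.")` mirrors the curl command. `urllib.request.urlopen` raises `urllib.error.HTTPError` on non-2xx responses, which is where an exhausted upstream quota would most likely surface.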