Speech-to-Speech
Endpoint
POST /v1/audio/speech-to-speech
Takes audio input, processes it through a three-stage pipeline (STT → LLM → TTS), and returns audio output.
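As a rough illustration of the three-stage flow described above, the following sketch chains three injected stage functions. It is not the gateway's actual implementation: `runPipeline`, `transcribe`, `complete`, and `synthesize` are hypothetical names, and the chat-style `messages` array is an assumption about how the system prompt and transcript might be passed to the LLM.

```javascript
// Hypothetical sketch of an STT -> LLM -> TTS chain.
// The stage functions are injected so each provider can be swapped or stubbed.
async function runPipeline(audio, systemPrompt, { transcribe, complete, synthesize }) {
  // Stage 1: speech-to-text
  const transcript = await transcribe(audio);

  // Stage 2: LLM completion (message shape is an assumption)
  const messages = [];
  if (systemPrompt) {
    messages.push({ role: 'system', content: systemPrompt });
  }
  messages.push({ role: 'user', content: transcript });
  const reply = await complete(messages);

  // Stage 3: text-to-speech
  const speech = await synthesize(reply);

  return { transcript, reply, speech };
}
```

Because each stage awaits the previous one, total latency is roughly the sum of the three stages.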
Pipeline
Audio in → Groq Whisper (STT) → Gateway LLM → Workers AI MeloTTS → Audio out
Request
multipart/form-data
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file with the user’s voice input. Supported formats: mp3, mp4, wav, webm, m4a. |
| system_prompt | string | none | Optional system prompt to guide the LLM’s response personality or role. |
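A minimal sketch of assembling the request body from the two fields above. The helper names `hasSupportedExtension` and `buildSpeechForm` are hypothetical, and checking the file extension on the client is an assumption made for illustration; the gateway may validate the upload differently.

```javascript
// Extensions taken from the "Supported formats" list in the table above.
const SUPPORTED_EXTENSIONS = ['mp3', 'mp4', 'wav', 'webm', 'm4a'];

// Client-side pre-check only; it does not inspect the file contents.
function hasSupportedExtension(filename) {
  if (!filename.includes('.')) return false;
  const ext = filename.split('.').pop().toLowerCase();
  return SUPPORTED_EXTENSIONS.includes(ext);
}

// Builds the multipart/form-data body. `audio` is a Blob or File.
function buildSpeechForm(audio, filename, systemPrompt) {
  if (!hasSupportedExtension(filename)) {
    throw new Error(`Unsupported audio format: ${filename}`);
  }
  const form = new FormData();
  form.append('file', audio, filename);
  // system_prompt is optional, so it is only sent when provided.
  if (systemPrompt) {
    form.append('system_prompt', systemPrompt);
  }
  return form;
}
```

The returned `FormData` can be passed directly as the `body` of a `fetch` call; the runtime sets the multipart boundary header itself.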
Response
Returns audio/mpeg (MP3) data.
Response Headers:
| Header | Description |
|---|---|
| x-transcribed-text | URL-encoded transcription of the input audio. |
| x-llm-response | URL-encoded LLM text response (first 500 chars). |
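Both headers are URL-encoded, and `x-llm-response` is cut to 500 characters. The table does not say whether truncation happens before or after encoding; if it happens after, the value could end in the middle of a percent-escape, which makes `decodeURIComponent` throw. The sketch below is a defensive decoder under that assumption. `safeDecode` and `readPipelineMetadata` are hypothetical helper names.

```javascript
// Decodes a URL-encoded header value, tolerating a truncated tail.
function safeDecode(value) {
  // Drop a dangling "%" or "%X" fragment left by truncation.
  let text = (value ?? '').replace(/%[0-9A-Fa-f]?$/, '');
  // A multi-byte UTF-8 character spans up to four "%XX" groups, so trim
  // at most a few trailing groups until the remainder decodes cleanly.
  for (let i = 0; i < 4; i++) {
    try {
      return decodeURIComponent(text);
    } catch {
      const trimmed = text.replace(/%[0-9A-Fa-f]{2}$/, '');
      if (trimmed === text) return text; // malformed elsewhere: give up
      text = trimmed;
    }
  }
  return text; // fall back to the undecoded text
}

// `headers` is anything with a get(name) method, such as response.headers.
function readPipelineMetadata(headers) {
  return {
    transcribedText: safeDecode(headers.get('x-transcribed-text')),
    llmResponse: safeDecode(headers.get('x-llm-response')),
  };
}
```

In a browser calling the gateway from another origin, custom headers are only readable if the server lists them in `Access-Control-Expose-Headers`; whether this gateway does so is not stated here.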
Free Limits
Bound by the Groq Whisper limit (8 hours of audio per day), the gateway LLM limits, and the Workers AI free-tier neuron allowance.
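To get a feel for the speech-to-text ceiling, the 8 hours per day can be converted into a request count for a given average clip length. This is back-of-the-envelope arithmetic only: it assumes audio is counted per second of input, and the LLM or Workers AI limits may be reached first. `maxRequestsPerDay` is a hypothetical helper.

```javascript
// 8 hours of audio per day, expressed in seconds.
const DAILY_STT_SECONDS = 8 * 60 * 60; // 28800

// Upper bound on daily requests from the STT limit alone.
function maxRequestsPerDay(avgClipSeconds) {
  if (!(avgClipSeconds > 0)) {
    throw new RangeError('avgClipSeconds must be a positive number');
  }
  return Math.floor(DAILY_STT_SECONDS / avgClipSeconds);
}

console.log(maxRequestsPerDay(15)); // 1920
console.log(maxRequestsPerDay(30)); // 960
```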
Examples
```bash
curl https://your-gateway.workers.dev/v1/audio/speech-to-speech \
  -F file=@question.mp3 \
  -F system_prompt="You are a helpful assistant. Keep answers concise." \
  --output response.mp3
```

```javascript
const formData = new FormData();
formData.append('file', audioFile);
formData.append('system_prompt', 'You are a helpful assistant.');

const response = await fetch('https://your-gateway.workers.dev/v1/audio/speech-to-speech', {
  method: 'POST',
  body: formData,
});

const audioBlob = await response.blob();
const audioUrl = URL.createObjectURL(audioBlob);
const audio = new Audio(audioUrl);
audio.play();

// Read metadata from headers
const transcribed = decodeURIComponent(response.headers.get('x-transcribed-text') || '');
const llmResponse = decodeURIComponent(response.headers.get('x-llm-response') || '');
```
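The JavaScript example above plays whatever the server returns. Since any of the three upstream stages can fail or hit a limit, it may be worth checking the response before treating the body as audio. The sketch below is one way to do that. `requestSpeech` is a hypothetical wrapper, the `fetchImpl` option exists only to make it testable, and the shape of the gateway's error body is not documented here, so it is read as plain text.

```javascript
// Sends the form and returns the audio Blob, or throws on a failed request.
async function requestSpeech(formData, options = {}) {
  const {
    baseUrl = 'https://your-gateway.workers.dev',
    fetchImpl = fetch, // injectable for tests
  } = options;

  const response = await fetchImpl(`${baseUrl}/v1/audio/speech-to-speech`, {
    method: 'POST',
    body: formData,
  });

  if (!response.ok) {
    // Error body format is an assumption; treat it as opaque text.
    const detail = await response.text().catch(() => '');
    const suffix = detail ? `: ${detail}` : '';
    throw new Error(`speech-to-speech request failed with HTTP ${response.status}${suffix}`);
  }

  // The endpoint is documented to return audio/mpeg.
  const contentType = response.headers.get('content-type') || '';
  if (!contentType.startsWith('audio/')) {
    throw new Error(`Expected audio response, got "${contentType}"`);
  }

  return response.blob();
}
```

A caller can wrap `requestSpeech` in `try`/`catch` and show the error message instead of attempting playback.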