POST /v1/tts
Text to Speech
curl --request POST \
  --url https://api.fish.audio/v1/tts \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'model: <model>' \
  --data '
{
  "text": "<string>",
  "temperature": 0.7,
  "top_p": 0.7,
  "references": [
    {
      "audio": "<string>",
      "text": "<string>"
    }
  ],
  "reference_id": "<string>",
  "prosody": {
    "speed": 1,
    "volume": 0
  },
  "chunk_length": 300,
  "normalize": true,
  "format": "mp3",
  "sample_rate": 44100,
  "mp3_bitrate": 128,
  "opus_bitrate": -1000,
  "latency": "normal",
  "max_new_tokens": 1024,
  "repetition_penalty": 1.2,
  "min_chunk_length": 50,
  "condition_on_previous_chunks": true,
  "early_stop_threshold": 1
}
'
{
  "status": 123,
  "message": "<string>"
}
This endpoint only accepts application/json and application/msgpack.
For best results, upload reference audio using the create model endpoint before using this one. This improves speech quality and reduces latency.
To upload audio clips directly, without pre-uploading, serialize the request body with MessagePack as per the instructions.
Audio formats supported:
  • WAV / PCM
    • Sample Rate: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz
    • Default Sample Rate: 44.1kHz
    • 16-bit, mono
  • MP3
    • Sample Rate: 32kHz, 44.1kHz
    • Default Sample Rate: 44.1kHz
    • mono
    • Bitrate: 64kbps, 128kbps (default), 192kbps
  • Opus
    • Sample Rate: 48kHz
    • Default Sample Rate: 48kHz
    • mono
    • Bitrate: -1000 (auto), 24kbps, 32kbps (default), 48kbps, 64kbps
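The list above can be encoded as a small lookup table for client-side checks before a request is sent. This is a minimal sketch: `OUTPUT_FORMATS` and `check_output_settings` are hypothetical names, not part of the API, and the server remains the authority on what it accepts.

```python
# Supported output settings, transcribed from the list above.
# The table and helper names are illustrative, not part of the Fish Audio API.
_PCM_RATES = {8000, 16000, 24000, 32000, 44100}

OUTPUT_FORMATS = {
    "wav": {"sample_rates": _PCM_RATES, "default_rate": 44100},
    "pcm": {"sample_rates": _PCM_RATES, "default_rate": 44100},
    "mp3": {"sample_rates": {32000, 44100}, "default_rate": 44100,
            "bitrates": {64, 128, 192}},
    "opus": {"sample_rates": {48000}, "default_rate": 48000,
             "bitrates": {-1000, 24, 32, 48, 64}},
}


def check_output_settings(fmt, sample_rate=None, bitrate=None):
    """Return the effective sample rate, or raise ValueError if the combination is not listed."""
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    spec = OUTPUT_FORMATS[fmt]

    # A missing sample rate falls back to the format's default.
    rate = spec["default_rate"] if sample_rate is None else sample_rate
    if rate not in spec["sample_rates"]:
        raise ValueError(f"{fmt} does not list a sample rate of {rate} Hz")

    if bitrate is not None:
        allowed = spec.get("bitrates")
        if allowed is None:
            raise ValueError(f"{fmt} has no bitrate setting")
        if bitrate not in allowed:
            raise ValueError(f"{fmt} does not list a bitrate of {bitrate}")
    return rate
```

For example, `check_output_settings("opus")` resolves to 48000, while `check_output_settings("mp3", sample_rate=48000)` raises `ValueError`.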

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Headers

model
enum<string>
default:s1
required

Specify which TTS model to use. We recommend s1.

Available options:
s1,
speech-1.6,
speech-1.5
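
The three request headers can be assembled in one place. A minimal sketch, assuming the header names from the curl sample at the top of this page; `build_headers` and `ALLOWED_MODELS` are hypothetical helper names.

```python
# Model names copied from the "Available options" list above.
ALLOWED_MODELS = ("s1", "speech-1.6", "speech-1.5")
# Content types this endpoint is documented to accept.
ALLOWED_CONTENT_TYPES = ("application/json", "application/msgpack")


def build_headers(token, model="s1", content_type="application/json"):
    """Return the Authorization, Content-Type, and model headers for a TTS request."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError(f"unsupported content type: {content_type!r}")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": content_type,
        "model": model,
    }
```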

Body

Request body for text-to-speech synthesis.

text
string
required

Text to convert to speech.

temperature
number
default:0.7

Controls expressiveness. Higher is more varied, lower is more consistent.

Required range: 0 <= x <= 1

top_p
number
default:0.7

Controls diversity via nucleus sampling.

Required range: 0 <= x <= 1

references
ReferenceAudio · object[] | null

Inline voice references for zero-shot cloning. Requires MessagePack (not JSON). Ignored if reference_id is provided.
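
A sketch of an inline-reference request body serialized with MessagePack. Two assumptions are made here that this page does not confirm: the `audio` field carries the clip's raw bytes, and the third-party `msgpack` package (`pip install msgpack`) is used for serialization. `build_cloning_body` and `pack_body` are hypothetical helper names.

```python
def build_cloning_body(text, clip_bytes, clip_transcript):
    """Build a request body with one inline reference clip.

    reference_id is deliberately left out, since references is
    ignored when reference_id is provided.
    """
    return {
        "text": text,
        "references": [
            # Assumption: "audio" holds the raw bytes of the reference clip.
            {"audio": clip_bytes, "text": clip_transcript},
        ],
    }


def pack_body(body):
    """Serialize the body with MessagePack for Content-Type: application/msgpack."""
    import msgpack  # third-party; imported lazily so the builder works without it

    # use_bin_type=True encodes Python bytes as MessagePack binary, not string.
    return msgpack.packb(body, use_bin_type=True)
```

The packed bytes would be sent as the request body with the `Content-Type: application/msgpack` header in place of `application/json`.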

reference_id
string | null

Voice model ID from the Fish Audio library or your custom models.

prosody
ProsodyControl · object

Speed and volume adjustments for the output.

chunk_length
integer
default:300

Text segment size for processing.

Required range: 100 <= x <= 300

normalize
boolean
default:true

Normalizes text for English and Chinese, improving stability for numbers.

format
enum<string>
default:mp3

Output audio format.

Available options:
wav,
pcm,
mp3,
opus

sample_rate
integer | null

Audio sample rate in Hz. When null, uses the format's default (44100 Hz for most formats, 48000 Hz for opus).

mp3_bitrate
enum<integer>
default:128

MP3 bitrate in kbps. Only applies when format is mp3.

Available options:
64,
128,
192

opus_bitrate
enum<integer>
default:-1000

Opus bitrate in kbps. -1000 for automatic. Only applies when format is opus.

Available options:
-1000,
24,
32,
48,
64

latency
enum<string>
default:normal

Latency-quality trade-off. normal: best quality, balanced: reduced latency, low: lowest latency.

Available options:
low,
normal,
balanced

max_new_tokens
integer
default:1024

Maximum audio tokens to generate per text chunk.

repetition_penalty
number
default:1.2

Penalty for repeating audio patterns. Values above 1.0 reduce repetition.

min_chunk_length
integer
default:50

Minimum characters before splitting into a new chunk.

Required range: 0 <= x <= 100

condition_on_previous_chunks
boolean
default:true

Use previous audio as context for voice consistency.

early_stop_threshold
number
default:1

Early stopping threshold for batch processing.

Required range: 0 <= x <= 1
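
The "Required range" constraints in this section can be checked locally before a request is sent. A minimal sketch; `RANGES` and `out_of_range` are hypothetical names, and the bounds are copied from the fields above.

```python
# Inclusive (min, max) bounds copied from the "Required range" lines in this section.
RANGES = {
    "temperature": (0, 1),
    "top_p": (0, 1),
    "chunk_length": (100, 300),
    "min_chunk_length": (0, 100),
    "early_stop_threshold": (0, 1),
}


def out_of_range(body):
    """Return {field: value} for every field in body that falls outside its documented range."""
    bad = {}
    for name, (low, high) in RANGES.items():
        # Fields missing from the body use the server-side default and are skipped.
        if name in body and not (low <= body[name] <= high):
            bad[name] = body[name]
    return bad
```

An empty result means every ranged field that was supplied is within bounds; fields without a documented range, such as `max_new_tokens`, are not checked.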

Response

Request fulfilled, document follows
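
A Python sketch of the curl request from the top of this page, using only the standard library. It assumes that a successful response body is the audio itself and that error bodies follow the `{"status", "message"}` JSON shape shown earlier; neither is spelled out in this section. `build_tts_request` and `synthesize` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request

TTS_URL = "https://api.fish.audio/v1/tts"


def build_tts_request(token, text, model="s1", **options):
    """Build (but do not send) a JSON POST request for the TTS endpoint."""
    body = {"text": text, **options}
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "model": model,
        },
        method="POST",
    )


def synthesize(token, text, out_path, **options):
    """Send the request and write the response body to out_path.

    Assumption: a successful response body contains the audio bytes.
    """
    request = build_tts_request(token, text, **options)
    try:
        with urllib.request.urlopen(request) as response:
            audio = response.read()
    except urllib.error.HTTPError as err:
        # Assumption: the error body is the {"status", "message"} JSON object.
        detail = err.read().decode("utf-8", "replace")
        raise RuntimeError(f"TTS request failed ({err.code}): {detail}") from err
    with open(out_path, "wb") as f:
        f.write(audio)
```

A call might look like `synthesize("<token>", "Hello", "hello.mp3", format="mp3")`. Keeping request construction separate from sending makes the headers and body easy to inspect without network access.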