Text to Speech

curl --request POST \
  --url https://api.fish.audio/v1/tts \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'model: <model>' \
  --data '
{
  "text": "<string>",
  "temperature": 0.7,
  "top_p": 0.7,
  "references": [
    {
      "audio": "<string>",
      "text": "<string>"
    }
  ],
  "reference_id": "<string>",
  "prosody": {
    "speed": 1,
    "volume": 0
  },
  "chunk_length": 300,
  "normalize": true,
  "format": "mp3",
  "sample_rate": 44100,
  "mp3_bitrate": 128,
  "opus_bitrate": -1000,
  "latency": "normal",
  "max_new_tokens": 1024,
  "repetition_penalty": 1.2,
  "min_chunk_length": 50,
  "condition_on_previous_chunks": true,
  "early_stop_threshold": 1
}
'

{
  "status": 123,
  "message": "<string>"
}

POST /v1/tts

This endpoint accepts only application/json and application/msgpack request bodies.

For best results, upload reference audio with the create model endpoint before calling this one. This improves speech quality and reduces latency.

To upload audio clips directly, without pre-uploading, serialize the request body with MessagePack as per the instructions.
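As a rough illustration of the JSON path described above, the following Python sketch builds (but does not send) a request that uses a pre-uploaded voice through reference_id. The function name build_tts_request and the example values are hypothetical; only the URL, header names, and body fields come from this page.

```python
import json
import urllib.request

API_URL = "https://api.fish.audio/v1/tts"


def build_tts_request(token: str, text: str, reference_id: str,
                      model: str = "s1") -> urllib.request.Request:
    """Build a JSON text-to-speech request without sending it.

    Hypothetical helper for illustration; it is not part of any SDK.
    """
    body = {
        "text": text,
        "reference_id": reference_id,  # voice model uploaded beforehand
        "format": "mp3",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "model": model,  # TTS model is selected via a header, not the body
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request; the sketch stops short of that so it can be inspected offline.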

Audio formats supported:

WAV / PCM
- Sample Rate: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz
- Default Sample Rate: 44.1kHz
- 16-bit, mono
MP3
- Sample Rate: 32kHz, 44.1kHz
- Default Sample Rate: 44.1kHz
- mono
- Bitrate: 64kbps, 128kbps (default), 192kbps
Opus
- Sample Rate: 48kHz
- Default Sample Rate: 48kHz
- mono
- Bitrate: -1000 (auto), 24kbps, 32kbps (default), 48kbps, 64kbps
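The format list above can be mirrored in a small client-side check. This is a minimal sketch, assuming the listed sample rates are the only ones the service accepts; the table name SAMPLE_RATES and the function resolve_sample_rate are hypothetical.

```python
from typing import Optional

# Sample rates in Hz per output format, copied from the list above.
SAMPLE_RATES = {
    "wav": {"allowed": {8000, 16000, 24000, 32000, 44100}, "default": 44100},
    "pcm": {"allowed": {8000, 16000, 24000, 32000, 44100}, "default": 44100},
    "mp3": {"allowed": {32000, 44100}, "default": 44100},
    "opus": {"allowed": {48000}, "default": 48000},
}


def resolve_sample_rate(fmt: str, sample_rate: Optional[int] = None) -> int:
    """Return the sample rate to expect for a format, or raise ValueError.

    A sample_rate of None stands for "use the format's default".
    """
    spec = SAMPLE_RATES.get(fmt)
    if spec is None:
        raise ValueError(f"unsupported format: {fmt!r}")
    if sample_rate is None:
        return spec["default"]
    if sample_rate not in spec["allowed"]:
        raise ValueError(f"{sample_rate} Hz is not listed for {fmt}")
    return sample_rate
```

Checking locally before sending avoids a round trip for a combination such as mp3 at 8000 Hz, which the list does not include.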

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Headers

model

enum<string>

default:s1

required

Specifies which TTS model to use. We recommend s1.

Available options: s1, speech-1.6, speech-1.5

Body

Request body for text-to-speech synthesis.

text

string

required

Text to convert to speech.

temperature

number

default:0.7

Controls expressiveness. Higher is more varied, lower is more consistent.

Required range: 0 <= x <= 1

top_p

number

default:0.7

Controls diversity via nucleus sampling.

Required range: 0 <= x <= 1

references

ReferenceAudio · object[] | null

Inline voice references for zero-shot cloning. Requires MessagePack (not JSON). Ignored if reference_id is provided.
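A minimal sketch of the inline-reference path follows. It assumes that the audio field carries the clip's raw bytes and that the reference's text field holds the clip's transcript; neither is stated on this page. The helper names are hypothetical, and msgpack is the third-party Python package (pip install msgpack), imported lazily so the payload builder works without it.

```python
def build_inline_reference_body(text: str, audio: bytes, transcript: str) -> dict:
    """Assemble a request body that clones a voice from one inline clip."""
    return {
        "text": text,
        # reference_id is deliberately omitted: when it is provided,
        # references is ignored.
        "references": [{"audio": audio, "text": transcript}],
    }


def encode_body(body: dict) -> bytes:
    """Serialize the body for a Content-Type: application/msgpack request."""
    import msgpack  # third-party dependency, assumed to be installed

    # use_bin_type=True keeps bytes as MessagePack bin rather than str.
    return msgpack.packb(body, use_bin_type=True)
```

The encoded bytes would be sent as the request body with the Content-Type header set to application/msgpack instead of application/json.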

reference_id

string | null

Voice model ID from the Fish Audio library or your custom models.

prosody

ProsodyControl · object

Speed and volume adjustments for the output.

chunk_length

integer

default:300

Text segment size for processing.

Required range: 100 <= x <= 300

normalize

boolean

default:true

Normalizes text for English and Chinese, improving stability for numbers.

format

enum<string>

default:mp3

Output audio format.

Available options: wav, pcm, mp3, opus

sample_rate

integer | null

Audio sample rate in Hz. When null, uses the format's default (44100 Hz for most formats, 48000 Hz for opus).

mp3_bitrate

enum<integer>

default:128

MP3 bitrate in kbps. Only applies when format is mp3.

Available options: 64, 128, 192

opus_bitrate

enum<integer>

default:-1000

Opus bitrate in kbps, or -1000 for automatic. Only applies when format is opus.

Available options: -1000, 24, 32, 48, 64

latency

enum<string>

default:normal

Latency-quality trade-off. normal: best quality, balanced: reduced latency, low: lowest latency.

Available options: low, normal, balanced

max_new_tokens

integer

default:1024

Maximum audio tokens to generate per text chunk.

repetition_penalty

number

default:1.2

Penalty for repeating audio patterns. Values above 1.0 reduce repetition.

min_chunk_length

integer

default:50

Minimum characters before splitting into a new chunk.

Required range: 0 <= x <= 100

condition_on_previous_chunks

boolean

default:true

Use previous audio as context for voice consistency.

early_stop_threshold

number

default:1

Early stopping threshold for batch processing.

Required range: 0 <= x <= 1
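Several body fields above carry a required range. The sketch below collects those ranges in one place so a request can be checked before it is sent; the names RANGES and out_of_range are hypothetical, and the server remains the authority on what is accepted.

```python
# Inclusive (low, high) bounds copied from the "Required range" notes above.
RANGES = {
    "temperature": (0, 1),
    "top_p": (0, 1),
    "chunk_length": (100, 300),
    "min_chunk_length": (0, 100),
    "early_stop_threshold": (0, 1),
}


def out_of_range(params: dict) -> list:
    """Return one message per supplied field that violates its range.

    Fields that are absent are skipped, since the service applies defaults.
    """
    problems = []
    for name, (low, high) in RANGES.items():
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name}={params[name]} is outside [{low}, {high}]")
    return problems
```

An empty list means every supplied field is within its documented range.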

Response

Request fulfilled, document follows
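One way to handle the response is sketched below. It assumes that a 200 response body is the raw audio in the requested format and that failures return a JSON object with status and message fields, as in the sample at the top of this page. TTSError and audio_or_raise are hypothetical names.

```python
import json


class TTSError(Exception):
    """Raised when the service answers with a non-200 status."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def audio_or_raise(http_status: int, body: bytes) -> bytes:
    """Return the audio bytes on success, otherwise raise TTSError."""
    if http_status == 200:
        return body  # assumed to be raw audio, ready to write to a file
    try:
        message = json.loads(body).get("message", "")
    except (ValueError, AttributeError):
        # Body was not a JSON object; fall back to the raw text.
        message = body.decode("utf-8", errors="replace")
    raise TTSError(http_status, message)
```

The returned bytes can be written directly to a file opened in binary mode, for example one ending in .mp3 when format is mp3.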
