Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization.
We present General Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications.
This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
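The unified autoregressive formulation over a shared discrete token space can be illustrated with a minimal sketch. Note the task tags, vocabulary sizes, ID ranges, and sequence layout below are illustrative assumptions for exposition, not the released model's actual tokenizer or prompt format.

```python
# Hypothetical sketch of a shared token sequence for a unified speech LLM.
# Task tags, special-token IDs, and vocabulary ranges are assumptions.

TEXT_VOCAB = 32_000           # text tokens occupy [0, 32000)
AUDIO_VOCAB = 4_096           # discrete audio codes occupy [32000, 36096)
AUDIO_OFFSET = TEXT_VOCAB

SPECIAL = {"<tts>": 36096, "<asr>": 36097, "<vc>": 36098,
           "<sep>": 36099, "<eos>": 36100}

def build_sequence(task, text_ids=None, ref_audio=None, src_audio=None):
    """Lay out one prompt: [task tag] conditioning <sep>; the model
    then autoregressively continues with audio or text tokens."""
    seq = [SPECIAL[f"<{task}>"]]
    if task == "tts":                       # text + reference timbre -> speech
        seq += text_ids
        seq += [AUDIO_OFFSET + c for c in ref_audio]
    elif task == "asr":                     # speech -> text
        seq += [AUDIO_OFFSET + c for c in src_audio]
    elif task == "vc":                      # source speech + timbre -> speech
        seq += [AUDIO_OFFSET + c for c in src_audio]
        seq += [AUDIO_OFFSET + c for c in ref_audio]
    seq.append(SPECIAL["<sep>"])
    return seq

prompt = build_sequence("tts", text_ids=[17, 42, 99], ref_audio=[5, 9, 5])
```

Because every task reduces to next-token prediction over one interleaved vocabulary, switching tasks requires only a different prompt, not a different architecture.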
Synthesizing speech from text while cloning the timbre of a reference audio sample.
| Reference Audio | Input Text | Generated Audio |
|---|---|---|
Converting the voice of a source audio to match the timbre of a reference audio while preserving content.
| Source Audio | Reference Audio | Converted Audio |
|---|---|---|
The following results are obtained by benchmarking services instantiated via the official deployment scripts, reflecting end-to-end performance in realistic serving scenarios rather than offline inference.
Among currently available open-source systems, our model is one of the few that natively supports both concurrent and streaming inference, while achieving performance comparable to the first tier of existing approaches.

| Concurrency | Avg TTFC (ms) | P50 TTFC (ms) | P99 TTFC (ms) | Avg RTF | P50 RTF | P99 RTF | Audio Dur (s) |
|---|---|---|---|---|---|---|---|
| 1 | 258.8 | 258.8 | 258.8 | 0.197 | 0.197 | 0.197 | 6.44 |
| 5 | 385.0 | 394.7 | 396.2 | 0.218 | 0.217 | 0.248 | 6.76 |
| 10 | 544.6 | 564.2 | 566.7 | 0.282 | 0.301 | 0.313 | 6.49 |
| 20 | 977.8 | 977.9 | 982.9 | 0.470 | 0.490 | 0.538 | 7.19 |
| 40 | 1797.0 | 1736.4 | 2564.5 | 0.421 | 0.400 | 0.587 | 6.33 |
| 80 | 3786.4 | 4054.4 | 5415.8 | 0.763 | 0.763 | 1.096 | 6.32 |
| 160 | 9847.9 | 10239.9 | 14350.3 | 1.718 | 1.740 | 2.577 | 6.44 |
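The latency metrics in the table above can be derived from raw per-request timings. The helper below is a minimal sketch (the input record layout is an assumption) of how TTFC percentiles and the real-time factor (RTF = synthesis time / audio duration) are typically computed:

```python
# Minimal sketch: derive Avg/P50/P99 TTFC and Avg RTF from per-request
# timings. The (ttfc_ms, synthesis_s, audio_duration_s) layout is assumed.

def percentile(values, p):
    """Nearest-rank percentile on a sorted copy of `values`."""
    s = sorted(values)
    idx = min(len(s) - 1, max(0, round(p / 100 * (len(s) - 1))))
    return s[idx]

def summarize(requests):
    """Each request: (ttfc_ms, total_synthesis_s, audio_duration_s)."""
    ttfc = [r[0] for r in requests]
    rtf = [r[1] / r[2] for r in requests]   # RTF < 1: faster than real time
    return {
        "avg_ttfc_ms": sum(ttfc) / len(ttfc),
        "p50_ttfc_ms": percentile(ttfc, 50),
        "p99_ttfc_ms": percentile(ttfc, 99),
        "avg_rtf": sum(rtf) / len(rtf),
    }

stats = summarize([(250.0, 1.3, 6.5), (270.0, 1.4, 7.0), (260.0, 1.2, 6.0)])
```

An RTF above 1.0 (as at concurrency 160) means generation is slower than playback, so streaming would stall; the table's P99 columns show how far tail latency degrades under load.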
| Concurrency | Avg TTFT (ms) | P50 TTFT (ms) | P99 TTFT (ms) | Avg Total (ms) |
|---|---|---|---|---|
| 1 | 157.5 | 157.5 | 157.5 | 190.9 |
| 5 | 394.1 | 393.7 | 395.9 | 400.0 |
| 10 | 589.6 | 721.3 | 723.3 | 598.1 |
| 20 | 1316.3 | 1495.6 | 1500.4 | 1317.8 |
| 40 | 2690.9 | 2678.3 | 2861.4 | 2693.7 |
| 80 | 3833.4 | 3961.3 | 4027.0 | 3845.1 |
| 160 | 5037.0 | 5689.3 | 6676.0 | 5044.0 |
| Model | Open-Source | Model Size | SEED-zh CER (%) ↓ | SEED-zh Speaker Similarity (%) ↑ | SEED-en WER (%) ↓ | SEED-en Speaker Similarity (%) ↑ |
|---|---|---|---|---|---|---|
| **Multi-Stage or NAR Methods** | | | | | | |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| **One-Stage AR Methods** | | | | | | |
| Spark TTS | ✅ | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-0.3B | ✅ | 0.3B | 0.95 | 65.9 | 1.51 | 56.5 |
| Model | Params | Librispeech test-clean (WER↓) | AISHELL-1 (CER↓) |
|---|---|---|---|
| **Models with < 0.5B parameters** | | | |
| Whisper-S | 0.24B | 3.13 | - |
| GPA-0.3B | 0.3B | 8.88 | 4.50 |
| **Models with ≥ 0.5B parameters** | | | |
| Fun-ASR-nano | 0.8B | 1.76 | 1.80 |
| FireRed-ASR | 1.1B | 1.84 | 0.54 |
| GLM-ASR-nano | 1.5B | 2.00 | 1.81 |
| GLM-ASR-nano* | 1.5B | 2.17 | 2.17 |
| Whisper-L | 1.55B | 1.82 | 4.72 |
| Kimi-Audio | - | 1.32 | 0.71 |
| Step-Audio2 | - | 1.17 | 0.63 |
| Seed-ASR | - | 1.58 | 0.68 |
| Seed-ASR* | - | 2.80 | 1.63 |
| Fun-ASR | 7.7B | 1.51 | 1.22 |
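The CER and WER figures reported above are standard edit-distance error rates. As a reference, a minimal implementation looks like the sketch below; note that actual benchmark scoring pipelines typically apply text normalization (casing, punctuation, numerals) first, which this sketch omits.

```python
# Word/character error rate via Levenshtein distance.
# Real scoring scripts normalize text first; this sketch skips that step.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref_text, hyp_text):
    """Word error rate: edits over reference word count."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(ref_text, hyp_text):
    """Character error rate: edits over reference character count."""
    ref = ref_text.replace(" ", "")
    hyp = hyp_text.replace(" ", "")
    return edit_distance(ref, hyp) / len(ref)
```

CER is the usual metric for Mandarin benchmarks such as AISHELL-1 and SEED-zh, since word segmentation is ambiguous in Chinese; WER is standard for English sets like LibriSpeech test-clean and SEED-en.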
The GPA model possesses advanced capabilities in voice cloning and speech synthesis. While these features have significant potential for positive applications in accessibility, entertainment, and education, we acknowledge the risk of misuse, such as deepfake generation or voice spoofing.
We are committed to responsible AI development. The released models are intended for academic research and personal educational use only. We have implemented watermarking techniques in the generated audio to aid in detection. Users are strictly prohibited from using this technology for illegal purposes, including but not limited to fraud, defamation, or impersonation without consent. By using this software, you agree to adhere to these ethical guidelines.