All in one, built for all. A single model delivering near-SOTA performance on TTS and ASR in a unified open framework.
General Purpose Audio unifies speech recognition and speech synthesis in one autoregressive audio-language model, with native PyTorch workflows and ONNX Runtime deployment for GPA-v1.5.
Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization.
We present General Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications.
This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline for practical deployment.
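To make the idea of instruction-driven task induction concrete, here is a minimal sketch of how one flat token sequence could encode TTS, ASR, or VC requests for a single autoregressive model. The tag strings, the `<|audio_N|>` naming, and the `build_prompt` helper are all hypothetical and chosen for illustration; the actual GPA prompt format is not described in this document.

```python
from typing import List, Optional

# Hypothetical task tags; not GPA's real vocabulary.
TASK_TAGS = {"tts": "<|tts|>", "asr": "<|asr|>", "vc": "<|vc|>"}


def audio_tokens(codes: List[int]) -> List[str]:
    """Render discrete audio codes as vocabulary entries, e.g. 17 -> '<|audio_17|>'."""
    return [f"<|audio_{c}|>" for c in codes]


def build_prompt(
    task: str,
    text: Optional[str] = None,
    audio_codes: Optional[List[int]] = None,
    ref_codes: Optional[List[int]] = None,
) -> List[str]:
    """Build one flat token sequence; the task tag alone selects the behavior."""
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task: {task}")
    seq = [TASK_TAGS[task]]
    if task == "tts":
        if text is None:
            raise ValueError("tts requires text")
        if ref_codes:  # optional reference audio for voice cloning
            seq += ["<|ref|>"] + audio_tokens(ref_codes)
        seq += ["<|text|>"] + text.split()
        seq.append("<|audio_out|>")  # the model would continue with audio tokens
    elif task == "asr":
        if audio_codes is None:
            raise ValueError("asr requires audio_codes")
        seq += ["<|audio_in|>"] + audio_tokens(audio_codes)
        seq.append("<|text_out|>")  # the model would continue with text tokens
    else:  # vc: source speech plus a reference voice
        if audio_codes is None or ref_codes is None:
            raise ValueError("vc requires audio_codes and ref_codes")
        seq += ["<|ref|>"] + audio_tokens(ref_codes)
        seq += ["<|audio_in|>"] + audio_tokens(audio_codes)
        seq.append("<|audio_out|>")
    return seq
```

In this sketch, switching tasks changes only the prompt and not the model, which is the property the unified design relies on.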
GPA-v1.5 extends this direction as the new mainline release, delivering stronger ASR and TTS performance while preserving the unified modeling objective.
Fine-tune or continue training GPA-v1.5 with Hugging Face Trainer and the v1.5 training package.
Run direct Hugging Face and PyTorch execution for GPA-v1.5 ASR and TTS behavior baselines.
Use local CLI inference, FastAPI service deployment, browser UI testing, voice registration, and runtime validation.
GPA-v1.5 is the new mainline release of GPA: a unified 0.6B audio model for ASR and TTS, with native PyTorch workflows and ONNX Runtime deployment now available.
| Checkpoint | Open-sourced on Hugging Face |
|---|---|
| Native Inference | Direct PyTorch / Hugging Face execution for ASR and TTS |
| Native Training | Fine-tuning and continued training with Hugging Face Trainer |
| ONNX Runtime | CLI inference, FastAPI service, browser UI, voice registration, and runtime validation |
| Planned | Voice Conversion support in the native v1.5 path |
TTS is one of the most popular features in the online demo, so GPA-TTS extracts the TTS component into a standalone, self-contained runtime.
| Quantization | Qwen INT4 plus Detokenizer INT8 / FP16 / FP32 with ONNX Runtime |
|---|---|
| Voice Cloning | Zero-shot voice cloning from a short reference audio clip |
| Decoder Precision | Selectable at runtime: INT8 for edge, FP16 for balanced use, FP32 for highest quality |
| Footprint | Among the smallest open-source TTS runtimes with cloning support |
| Optimized For | Local CPU inference on Mac, Linux, and edge devices |
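
As a rough illustration of the precision trade-off in the table above, the sketch below maps a deployment profile to a decoder precision and estimates raw weight storage from the bit width. The profile names and both helper functions are hypothetical and are not part of the GPA-TTS runtime. The estimate counts weights only and ignores quantization scales, activations, and file overhead.

```python
# Bits per stored weight for each precision (standard sizes, not GPA-specific).
BITS_PER_WEIGHT = {"int4": 4, "int8": 8, "fp16": 16, "fp32": 32}

# Hypothetical profile names mapped to the decoder precisions listed above.
PROFILE_TO_PRECISION = {"edge": "int8", "balanced": "fp16", "quality": "fp32"}


def pick_decoder_precision(profile: str) -> str:
    """Return the decoder precision for a deployment profile."""
    try:
        return PROFILE_TO_PRECISION[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile}") from None


def weight_size_mb(num_params: int, precision: str) -> float:
    """Approximate weight storage in megabytes (10**6 bytes), weights only."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e6
```

For example, `weight_size_mb(600_000_000, "int4")` returns `300.0`. Treating 0.6B as the parameter count of the INT4 language model is an assumption made here for the arithmetic.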
Synthesizing speech from text while cloning the timbre of a reference audio sample.
| Reference Audio | Input Text | Generated Audio |
|---|---|---|
Coming soon. GPA-v1.5 does not include the web VC demo yet. Native Voice Conversion support is planned and this section will be updated when it is available.
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
|---|---|---|---|---|---|---|
| **Multi-Stage or NAR Methods** | | | | | | |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | No | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | No | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | Yes | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | Yes | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | Yes | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | Yes | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | Yes | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | Yes | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | Yes | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | Yes | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | Yes | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | Yes | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | Yes | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | Yes | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| **One-Stage AR Methods** | | | | | | |
| Spark TTS | Yes | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-v1.5 | Yes | 0.6B | 1.03 | 70.2 | 1.43 | 63.5 |
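
This document does not state how the Sim columns in the table above are computed. Speaker similarity is commonly reported as the cosine similarity between speaker embeddings of the reference and generated audio, scaled to a percentage. The sketch below shows only that final step under this assumption; the embedding model is out of scope, and the function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def sim_percent(ref_embedding: Sequence[float], gen_embedding: Sequence[float]) -> float:
    """Speaker similarity as a percentage (assumed definition, see lead-in)."""
    return 100.0 * cosine_similarity(ref_embedding, gen_embedding)
```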
ASR results on LibriSpeech, AISHELL-1, test_Meeting, and test_Net. WER (%) is reported for LibriSpeech; CER (%) is reported for AISHELL-1. Lower is better.
| Model | Model Size | LibriSpeech test-clean | LibriSpeech test-other | AISHELL-1 | test_Meeting | test_Net |
|---|---|---|---|---|---|---|
| Whisper-S | 0.24B | 3.43 | 7.63 | - | - | - |
| GPA-v1.5 | 0.6B | 2.78 | 5.02 | 2.83 | 7.40 | 6.49 |
| Fun-ASR-nano | 0.8B | 1.76 | 4.33 | 1.80 | 6.60 | 6.01 |
| FireRed-ASR | 1.1B | 1.84 | 4.52 | 0.54 | 4.95 | 4.94 |
| GLM-ASR-nano | 1.5B | 2.00 | 4.19 | 1.81 | 6.73 | - |
| GLM-ASR-nano* | 1.5B | 2.17 | 4.43 | 2.17 | 8.21 | 6.33 |
| Whisper-L | 1.55B | 1.86 | 3.43 | 4.72 | 18.39 | 11.89 |
| Kimi-Audio | - | 1.32 | 2.63 | 0.71 | 6.24 | 6.45 |
| Step-Audio2 | - | 1.17 | 2.42 | 0.63 | 4.75 | 4.67 |
| Seed-ASR | - | 1.58 | 2.84 | 0.68 | 5.69 | 4.66 |
| Seed-ASR* | - | 2.80 | 5.69 | 1.63 | 7.07 | 4.84 |
| Fun-ASR | 7.7B | 1.51 | 3.03 | 1.22 | 6.17 | 5.46 |
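
For readers unfamiliar with the error rates in the tables above: WER and CER are both the edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the reference length, counted over words or characters respectively. The sketch below is a minimal reference implementation. It applies no text normalization (casing, punctuation, number formatting), which real evaluation scripts usually do, so it will not reproduce the reported numbers exactly.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref: List[str], hyp: List[str]) -> float:
    """Edit distance as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate (%), splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (%), ignoring spaces."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))
```

Because insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.
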
If you find GPA useful for your research or projects, please cite us:
@misc{cai2026unifyingspeechrecognitionsynthesis,
title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
year={2026},
eprint={2601.10770},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2601.10770},
}