GPA-v1.5: General Purpose Audio

Abstract

Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization.

We present General Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications.

This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline for practical deployment.

GPA-v1.5 extends this direction as the new mainline release, delivering stronger ASR and TTS performance while preserving the unified modeling objective.

Model Overview

Figure 1. GPA unifies speech understanding and generation in a single autoregressive audio-language model.
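The instruction-driven task induction described in the abstract can be pictured as one prompt builder that changes only the task tag and the order of text and audio tokens, while the model itself stays the same. The sketch below is illustrative only: the tag strings, the vocabulary size, and the id-offset scheme are hypothetical and are not taken from the GPA codebase.

```python
from typing import List, Optional, Sequence, Union

Token = Union[str, int]

# Hypothetical values, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000
TASK_TAGS = {"asr": "<|asr|>", "tts": "<|tts|>", "vc": "<|vc|>"}
GENERATE_TAG = "<|generate|>"


def audio_ids(codes: Optional[Sequence[int]]) -> List[int]:
    """Shift codec indices above the text vocabulary so both share one id space."""
    return [TEXT_VOCAB_SIZE + c for c in (codes or [])]


def build_prompt(
    task: str,
    text_ids: Optional[Sequence[int]] = None,
    source_codes: Optional[Sequence[int]] = None,
    reference_codes: Optional[Sequence[int]] = None,
) -> List[Token]:
    """Assemble the conditioning sequence for one task.

    The autoregressive model would continue this sequence with text ids
    (ASR) or audio ids (TTS, VC); no task-specific head is involved.
    """
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task: {task}")
    prompt: List[Token] = [TASK_TAGS[task]]
    if task == "asr":
        # Audio in, text out.
        prompt += audio_ids(source_codes)
    elif task == "tts":
        # Reference voice plus text in, audio out.
        prompt += audio_ids(reference_codes)
        prompt += list(text_ids or [])
    else:
        # Voice conversion: reference voice plus source speech in, audio out.
        prompt += audio_ids(reference_codes)
        prompt += audio_ids(source_codes)
    prompt.append(GENERATE_TAG)
    return prompt


if __name__ == "__main__":
    print(build_prompt("tts", text_ids=[5, 6], reference_codes=[0, 1]))
    print(build_prompt("asr", source_codes=[7, 8, 9]))
```

Because every task reduces to next-token prediction over the same id space, adding a task in this picture means adding a tag and a prompt layout rather than a new model.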

Native Training

Fine-tune or continue training GPA-v1.5 with Hugging Face Trainer and the v1.5 training package.

Native Inference

Run GPA-v1.5 directly with Hugging Face and PyTorch to establish ASR and TTS behavior baselines.

ONNX Runtime

Use local CLI inference, FastAPI service deployment, browser UI testing, voice registration, and runtime validation.

GPA-v1.5 Release

GPA-v1.5 is the new mainline release of GPA: a unified 0.6B audio model for ASR and TTS, with native PyTorch workflows and ONNX Runtime deployment now available.

Checkpoint	Open-sourced on Hugging Face
Native Inference	Direct PyTorch / Hugging Face execution for ASR and TTS
Native Training	Fine-tuning and continued training with Hugging Face Trainer
ONNX Runtime	CLI inference, FastAPI service, browser UI, voice registration, and runtime validation
Planned	Voice Conversion support in the native v1.5 path

Guides: GPA-v1.5 README, Training Guide, Inference Guide, ONNX Runtime Guide

GPA-TTS: Edge-Ready Voice-Cloning TTS

TTS is one of the most popular features in the online demo, so GPA-TTS extracts the TTS component into a standalone, self-contained runtime.

Quantization	Qwen INT4 plus Detokenizer INT8 / FP16 / FP32 with ONNX Runtime
Voice Cloning	Zero-shot voice cloning from a short reference audio
Decoder Precision	Selectable at runtime: INT8 for edge, FP16 for balanced use, FP32 for highest quality
Footprint	Among the smallest open-source TTS runtimes with cloning support
Optimized For	Local CPU inference on Mac, Linux, and edge devices
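To give a rough sense of why precision matters for footprint, the sketch below estimates weight storage at each precision named in the table. It is back-of-the-envelope arithmetic, not a measurement: it uses the 0.6B figure quoted for GPA-v1.5 as a stand-in parameter count (the split between the language model and the detokenizer in GPA-TTS is not given here), and it ignores quantization scales, activations, and runtime overhead.

```python
# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"int4": 4, "int8": 8, "fp16": 16, "fp32": 32}


def weight_megabytes(num_params: float, precision: str) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    if precision not in BITS_PER_WEIGHT:
        raise ValueError(f"unknown precision: {precision}")
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e6


if __name__ == "__main__":
    # Assumption: 0.6e9 parameters, all stored at a single precision.
    assumed_params = 0.6e9
    for precision in ("int4", "int8", "fp16", "fp32"):
        size = weight_megabytes(assumed_params, precision)
        print(f"{precision}: ~{size:.0f} MB")
```

Under these assumptions, each halving of the bit width halves the weight storage, which is the reason a lower-precision decoder suits edge devices and a higher-precision one is kept for quality.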

Links: GPA-TTS README, Download from Hugging Face

Demo

TTS: Zero-Shot Voice Cloning

Synthesizes speech from text while cloning the timbre of a reference audio sample. Each demo sample pairs a reference audio clip and an input text with the generated audio.

Voice Conversion (VC)

Coming soon. GPA-v1.5 does not yet include the web VC demo. Native Voice Conversion support is planned, and this section will be updated when it is available.

GPA-v1.5 Evaluation Results

TTS Evaluation Table

Lower CER/WER is better. Higher speaker similarity is better.
Model	Open-Source	Model Size	test-zh CER (%) ↓	test-zh Sim (%) ↑	test-en WER (%) ↓	test-en Sim (%) ↑
Multi-Stage or NAR Methods
Human	-	-	1.26	75.5	2.14	73.4
Seed-TTS	No	-	1.12	79.6	2.25	76.2
MiniMax-Speech	No	-	0.83	78.3	1.65	69.2
F5-TTS	Yes	0.3B	1.52	74.1	2.00	64.7
CosyVoice2	Yes	0.5B	1.45	75.7	2.57	65.9
FireRedTTS2	Yes	1.5B	1.14	73.2	1.95	66.5
Index-TTS2	Yes	1.5B	1.03	76.5	2.23	70.6
VibeVoice-1.5B	Yes	1.5B	1.16	74.4	3.04	68.9
VibeVoice-Realtime	Yes	0.5B	-	-	2.05	63.3
HiggsAudio-v2	Yes	3B	1.50	74.0	2.44	67.7
VoxCPM	Yes	0.5B	0.93	77.2	1.85	72.9
GLM-TTS	Yes	1.5B	1.03	76.1	-	-
GLM-TTS RL	Yes	1.5B	0.89	76.4	-	-
Fun-CosyVoice3-0.5B-2512	Yes	0.5B	1.21	78.0	2.24	71.8
Fun-CosyVoice3-0.5B-2512_RL	Yes	0.5B	0.81	77.4	1.68	69.5
One-Stage AR Methods
Spark TTS	Yes	0.5B	1.20	66.0	1.98	57.3
GPA-v1.5	Yes	0.6B	1.03	70.2	1.43	63.5
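The Sim columns report speaker similarity as a percentage. The scoring setup is not described here; a common convention in TTS evaluation, assumed in the sketch below, is the cosine similarity between speaker embeddings of the reference and generated audio, scaled by 100. The embedding vectors in the example are made-up placeholders, and the speaker-verification model that would produce real embeddings is outside the scope of the sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_percent(ref_emb: Sequence[float], gen_emb: Sequence[float]) -> float:
    """Speaker similarity expressed as a percentage."""
    return 100.0 * cosine_similarity(ref_emb, gen_emb)


if __name__ == "__main__":
    # Placeholder embeddings; real ones would come from a speaker encoder.
    reference = [3.0, 4.0]
    generated = [4.0, 3.0]
    print(f"Sim: {similarity_percent(reference, generated):.1f}%")
```

Cosine similarity depends only on the direction of the embeddings, so it is insensitive to loudness-like scale differences between the two clips.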

ASR Evaluation Table

ASR results on LibriSpeech, AISHELL-1, test_Meeting, and test_Net. WER (%) is reported for LibriSpeech; CER (%) is reported for AISHELL-1. Lower is better.

Model	Model Size	LibriSpeech test-clean	LibriSpeech test-other	AISHELL-1	test_Meeting	test_Net
Whisper-S	0.24B	3.43	7.63	-	-	-
GPA-v1.5	0.6B	2.78	5.02	2.83	7.40	6.49
Fun-ASR-nano	0.8B	1.76	4.33	1.80	6.60	6.01
FireRed-ASR	1.1B	1.84	4.52	0.54	4.95	4.94
GLM-ASR-nano	1.5B	2.00	4.19	1.81	6.73	-
GLM-ASR-nano*	1.5B	2.17	4.43	2.17	8.21	6.33
Whisper-L	1.55B	1.86	3.43	4.72	18.39	11.89
Kimi-Audio	-	1.32	2.63	0.71	6.24	6.45
Step-Audio2	-	1.17	2.42	0.63	4.75	4.67
Seed-ASR	-	1.58	2.84	0.68	5.69	4.66
Seed-ASR*	-	2.80	5.69	1.63	7.07	4.84
Fun-ASR	7.7B	1.51	3.03	1.22	6.17	5.46
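For readers unfamiliar with the error rates reported in these tables, WER and CER are both the edit distance between a reference transcript and a hypothesis, divided by the reference length, counted over words and characters respectively. The sketch below is a minimal version of that definition; it applies no text normalization (casing, punctuation, number formatting), which real benchmark scoring typically does, so it should not be expected to reproduce the figures above.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substitution and one deletion against six reference words.
    print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.2f}%")
    # One substituted character out of five.
    print(f"CER: {cer('今天天气好', '今天天汽好'):.2f}%")
```

Because insertions count as errors while the denominator is the reference length, both rates can exceed 100% for very poor hypotheses.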

Citation

If you find GPA useful for your research or projects, please cite us:

@misc{cai2026unifyingspeechrecognitionsynthesis,
      title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
      author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
      year={2026},
      eprint={2601.10770},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.10770},
}