Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization.
We present General Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications.
This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
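The unified autoregressive formulation over a shared discrete token space can be illustrated with a minimal sketch. Note the task tags, vocabulary sizes, ID ranges, and sequence layout below are illustrative assumptions for exposition, not the released model's actual tokenizer or prompt format.

```python
# Hypothetical sketch of a shared token sequence for a unified speech LLM.
# Task tags, special-token IDs, and vocabulary ranges are assumptions.

TEXT_VOCAB = 32_000           # text tokens occupy [0, 32000)
AUDIO_VOCAB = 4_096           # discrete audio codes occupy [32000, 36096)
AUDIO_OFFSET = TEXT_VOCAB

SPECIAL = {"<tts>": 36096, "<asr>": 36097, "<vc>": 36098,
           "<sep>": 36099, "<eos>": 36100}

def build_sequence(task, text_ids=None, ref_audio=None, src_audio=None):
    """Lay out one prompt: [task tag] conditioning <sep>; the model
    then autoregressively continues with audio or text tokens."""
    seq = [SPECIAL[f"<{task}>"]]
    if task == "tts":                       # text + reference timbre -> speech
        seq += text_ids
        seq += [AUDIO_OFFSET + c for c in ref_audio]
    elif task == "asr":                     # speech -> text
        seq += [AUDIO_OFFSET + c for c in src_audio]
    elif task == "vc":                      # source speech + timbre -> speech
        seq += [AUDIO_OFFSET + c for c in src_audio]
        seq += [AUDIO_OFFSET + c for c in ref_audio]
    seq.append(SPECIAL["<sep>"])
    return seq

prompt = build_sequence("tts", text_ids=[17, 42, 99], ref_audio=[5, 9, 5])
```

Because every task reduces to next-token prediction over one interleaved vocabulary, switching tasks requires only a different prompt, not a different architecture.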
Synthesizing speech from text while cloning the timbre of a reference audio sample.
| Reference Audio | Input Text | Generated Audio |
|---|---|---|
Converting the voice of a source audio to match the timbre of a reference audio while preserving content.
| Source Audio | Reference Audio | Converted Audio |
|---|---|---|
The following results are obtained by benchmarking services instantiated via the official deployment scripts, reflecting end-to-end performance in realistic serving scenarios rather than offline inference.
Among currently available open-source systems, our model is one of the few that natively supports both concurrent and streaming inference, while achieving performance comparable to the first tier of existing approaches.

| Concurrency | Avg TTFC (ms) | P50 TTFC (ms) | P99 TTFC (ms) | Avg RTF | P50 RTF | P99 RTF | Audio Dur (s) |
|---|---|---|---|---|---|---|---|
| 1 | 258.8 | 258.8 | 258.8 | 0.197 | 0.197 | 0.197 | 6.44 |
| 5 | 385.0 | 394.7 | 396.2 | 0.218 | 0.217 | 0.248 | 6.76 |
| 10 | 544.6 | 564.2 | 566.7 | 0.282 | 0.301 | 0.313 | 6.49 |
| 20 | 977.8 | 977.9 | 982.9 | 0.470 | 0.490 | 0.538 | 7.19 |
| 40 | 1797.0 | 1736.4 | 2564.5 | 0.421 | 0.400 | 0.587 | 6.33 |
| 80 | 3786.4 | 4054.4 | 5415.8 | 0.763 | 0.763 | 1.096 | 6.32 |
| 160 | 9847.9 | 10239.9 | 14350.3 | 1.718 | 1.740 | 2.577 | 6.44 |
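The latency metrics in the table above can be derived from raw per-request timings. The helper below is a minimal sketch (the input record layout is an assumption) of how TTFC percentiles and the real-time factor (RTF = synthesis time / audio duration) are typically computed:

```python
# Minimal sketch: derive Avg/P50/P99 TTFC and Avg RTF from per-request
# timings. The (ttfc_ms, synthesis_s, audio_duration_s) layout is assumed.

def percentile(values, p):
    """Nearest-rank percentile on a sorted copy of `values`."""
    s = sorted(values)
    idx = min(len(s) - 1, max(0, round(p / 100 * (len(s) - 1))))
    return s[idx]

def summarize(requests):
    """Each request: (ttfc_ms, total_synthesis_s, audio_duration_s)."""
    ttfc = [r[0] for r in requests]
    rtf = [r[1] / r[2] for r in requests]   # RTF < 1: faster than real time
    return {
        "avg_ttfc_ms": sum(ttfc) / len(ttfc),
        "p50_ttfc_ms": percentile(ttfc, 50),
        "p99_ttfc_ms": percentile(ttfc, 99),
        "avg_rtf": sum(rtf) / len(rtf),
    }

stats = summarize([(250.0, 1.3, 6.5), (270.0, 1.4, 7.0), (260.0, 1.2, 6.0)])
```

An RTF above 1.0 (as at concurrency 160) means generation is slower than playback, so streaming would stall; the table's P99 columns show how far tail latency degrades under load.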
| Concurrency | Avg TTFT (ms) | P50 TTFT (ms) | P99 TTFT (ms) | Avg Total (ms) |
|---|---|---|---|---|
| 1 | 157.5 | 157.5 | 157.5 | 190.9 |
| 5 | 394.1 | 393.7 | 395.9 | 400.0 |
| 10 | 589.6 | 721.3 | 723.3 | 598.1 |
| 20 | 1316.3 | 1495.6 | 1500.4 | 1317.8 |
| 40 | 2690.9 | 2678.3 | 2861.4 | 2693.7 |
| 80 | 3833.4 | 3961.3 | 4027.0 | 3845.1 |
| 160 | 5037.0 | 5689.3 | 6676.0 | 5044.0 |
| Model | Open-Source | Model Size | SEED-zh CER (%) ↓ | SEED-zh Speaker Similarity (%) ↑ | SEED-en WER (%) ↓ | SEED-en Speaker Similarity (%) ↑ |
|---|---|---|---|---|---|---|
| **Multi-Stage or NAR Methods** | | | | | | |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| **One-Stage AR Methods** | | | | | | |
| Spark TTS | ✅ | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-0.3B | ✅ | 0.3B | 0.95 | 65.9 | 1.51 | 56.5 |
| Model | Params | Librispeech test-clean (WER↓) | AISHELL-1 (CER↓) |
|---|---|---|---|
| **Models with < 0.5B parameters** | | | |
| Whisper-S | 0.24B | 3.13 | - |
| GPA-0.3B | 0.3B | 8.88 | 4.50 |
| **Models with ≥ 0.5B parameters** | | | |
| Fun-ASR-nano | 0.8B | 1.76 | 1.80 |
| FireRed-ASR | 1.1B | 1.84 | 0.54 |
| GLM-ASR-nano | 1.5B | 2.00 | 1.81 |
| GLM-ASR-nano* | 1.5B | 2.17 | 2.17 |
| Whisper-L | 1.55B | 1.82 | 4.72 |
| Kimi-Audio | - | 1.32 | 0.71 |
| Step-Audio2 | - | 1.17 | 0.63 |
| Seed-ASR | - | 1.58 | 0.68 |
| Seed-ASR* | - | 2.80 | 1.63 |
| Fun-ASR | 7.7B | 1.51 | 1.22 |
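The CER and WER figures reported above are standard edit-distance error rates. As a reference, a minimal implementation looks like the sketch below; note that actual benchmark scoring pipelines typically apply text normalization (casing, punctuation, numerals) first, which this sketch omits.

```python
# Word/character error rate via Levenshtein distance.
# Real scoring scripts normalize text first; this sketch skips that step.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref_text, hyp_text):
    """Word error rate: edits over reference word count."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(ref_text, hyp_text):
    """Character error rate: edits over reference character count."""
    ref = ref_text.replace(" ", "")
    hyp = hyp_text.replace(" ", "")
    return edit_distance(ref, hyp) / len(ref)
```

CER is the usual metric for Mandarin benchmarks such as AISHELL-1 and SEED-zh, since word segmentation is ambiguous in Chinese; WER is standard for English sets like LibriSpeech test-clean and SEED-en.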
The GPA model possesses advanced capabilities in voice cloning and speech synthesis. While these features have significant potential for positive applications in accessibility, entertainment, and education, we acknowledge the risk of misuse, such as deepfake generation or voice spoofing.
We are committed to responsible AI development. The released models are intended for academic research and personal educational use only. We have implemented watermarking techniques in the generated audio to aid in detection. Users are strictly prohibited from using this technology for illegal purposes, including but not limited to fraud, defamation, or impersonation without consent. By using this software, you agree to adhere to these ethical guidelines.