GPA: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

AutoArk-AI
Abstract

Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization.

We present General Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications.

This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
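As a concrete illustration of instruction-driven task induction, the sketch below shows how a single autoregressive model could switch between ASR, TTS, and VC purely by changing the prompt layout over a shared discrete token space. The special tokens and helper function are hypothetical, not the released GPA interface:

```python
# Hypothetical prompt layouts for a single autoregressive speech LLM.
# Token names and this helper are illustrative, not GPA's actual vocabulary.

def build_prompt(task: str, audio_tokens=None, text_tokens=None, ref_tokens=None):
    """Assemble one flat token sequence; the model architecture never changes."""
    if task == "asr":
        # Understanding: source audio in -> text tokens out.
        return ["<task:asr>", *audio_tokens, "<sep>"]
    if task == "tts":
        # Generation: reference voice + target text in -> audio tokens out.
        return ["<task:tts>", *ref_tokens, "<sep>", *text_tokens, "<sep>"]
    if task == "vc":
        # Editing: reference voice + source audio in -> converted audio tokens out.
        return ["<task:vc>", *ref_tokens, "<sep>", *audio_tokens, "<sep>"]
    raise ValueError(f"unknown task: {task}")

# The model then decodes autoregressively until an end-of-output token,
# emitting text tokens for ASR and discrete audio tokens for TTS/VC.
```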


Model Overview

GPA Model Architecture
Figure 1: Architecture of the proposed GPA framework. The model utilizes a shared Large Language Model (LLM) backbone to unify three core audio tasks: Understanding (ASR), Generation (TTS), and Editing (Voice Conversion). Depending on the task, the model processes different combinations of inputs (Source Audio, Target Text, or Reference Audio) via Semantic and Acoustic modules to generate the corresponding text or audio output.

Key Points

📚 Much like an academic GPA reflects capability across many subjects, our model aims for consistently solid results across all audio tasks.

Demo

TTS: Zero-Shot Voice Cloning

Synthesizing speech from text while cloning the timbre of a reference audio sample.


Voice Conversion (VC)

Converting the voice of a source audio to match the timbre of a reference audio while preserving content.


Model Performance

The following results are obtained by benchmarking services instantiated via the official deployment scripts, reflecting end-to-end performance in realistic serving scenarios rather than offline inference.

Among currently available open-source systems, ours is one of the few that natively supports both concurrent and streaming inference while delivering performance comparable to first-tier approaches.
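For context on how numbers like those below can be gathered, here is a minimal sketch of a concurrent streaming benchmark client. The endpoint URL, payload fields, and chunking are assumptions for illustration, not the official deployment scripts:

```python
# Minimal concurrent-streaming benchmark sketch (illustrative; the endpoint
# and payload are assumptions, not the official scripts).
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/tts/stream"  # hypothetical streaming endpoint

def one_request(text: str) -> float:
    """Return time-to-first-chunk (TTFC) in milliseconds for one request."""
    t0 = time.perf_counter()
    ttfc_ms = None
    with requests.post(URL, json={"text": text}, stream=True) as resp:
        resp.raise_for_status()
        for _chunk in resp.iter_content(chunk_size=4096):
            if ttfc_ms is None:
                ttfc_ms = (time.perf_counter() - t0) * 1000.0
    return ttfc_ms

def run(concurrency: int, n_requests: int = 50) -> dict:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        ttfcs = sorted(pool.map(one_request, ["test sentence"] * n_requests))
    return {
        "avg": statistics.mean(ttfcs),
        "p50": ttfcs[len(ttfcs) // 2],
        "p99": ttfcs[min(len(ttfcs) - 1, round(0.99 * (len(ttfcs) - 1)))],
    }

if __name__ == "__main__":
    for c in (1, 5, 10):
        print(c, run(c))
```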

TTS Streaming Benchmark (Latency & Throughput)

Table 1. TTS Streaming RTF and Audio Duration

| Concurrency | Avg TTFC (ms) | P50 TTFC (ms) | P99 TTFC (ms) | Avg RTF | P50 RTF | P99 RTF | Audio Dur (s) |
|---|---|---|---|---|---|---|---|
| 1 | 258.8 | 258.8 | 258.8 | 0.197 | 0.197 | 0.197 | 6.44 |
| 5 | 385.0 | 394.7 | 396.2 | 0.218 | 0.217 | 0.248 | 6.76 |
| 10 | 544.6 | 564.2 | 566.7 | 0.282 | 0.301 | 0.313 | 6.49 |
| 20 | 977.8 | 977.9 | 982.9 | 0.470 | 0.490 | 0.538 | 7.19 |
| 40 | 1797.0 | 1736.4 | 2564.5 | 0.421 | 0.400 | 0.587 | 6.33 |
| 80 | 3786.4 | 4054.4 | 5415.8 | 0.763 | 0.763 | 1.096 | 6.32 |
| 160 | 9847.9 | 10239.9 | 14350.3 | 1.718 | 1.740 | 2.577 | 6.44 |

ASR Streaming Benchmark

Table 2. ASR Streaming Latency vs Concurrency

| Concurrency | Avg TTFT (ms) | P50 TTFT (ms) | P99 TTFT (ms) | Avg Total (ms) |
|---|---|---|---|---|
| 1 | 157.5 | 157.5 | 157.5 | 190.9 |
| 5 | 394.1 | 393.7 | 395.9 | 400.0 |
| 10 | 589.6 | 721.3 | 723.3 | 598.1 |
| 20 | 1316.3 | 1495.6 | 1500.4 | 1317.8 |
| 40 | 2690.9 | 2678.3 | 2861.4 | 2693.7 |
| 80 | 3833.4 | 3961.3 | 4027.0 | 3845.1 |
| 160 | 5037.0 | 5689.3 | 6676.0 | 5044.0 |

Metrics Definition

- TTFC (Time To First Chunk): time from request submission to the first streamed audio chunk (TTS).
- TTFT (Time To First Token): time from request submission to the first recognized text token (ASR).
- RTF (Real-Time Factor): processing time divided by the duration of the corresponding audio; values below 1.0 mean faster-than-real-time processing.
- Avg / P50 / P99: mean, median, and 99th-percentile statistics over all requests at a given concurrency level.
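As a concrete reading of these definitions, the sketch below computes RTF and a nearest-rank percentile from raw per-request timings (illustrative only, not the official benchmark code):

```python
# Illustrative metric computation (not the official benchmark code).

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    # Real-time factor: values < 1.0 mean faster-than-real-time serving.
    return processing_seconds / audio_seconds

def percentile(samples: list[float], q: float) -> float:
    # Nearest-rank percentile, e.g. q = 0.99 for P99.
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, round(q * (len(ordered) - 1)))]

# Example: 1.27 s of compute for 6.44 s of audio -> RTF ≈ 0.197,
# consistent with the single-request row of Table 1.
print(round(rtf(1.27, 6.44), 3))  # 0.197
```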

Model Evaluation Results

TTS Evaluation Table

| Model | Model Size | SEED-zh CER (%) ↓ | SEED-zh SIM (%) ↑ | SEED-en WER (%) ↓ | SEED-en SIM (%) ↑ |
|---|---|---|---|---|---|
| **Multi-Stage or NAR Methods** | | | | | |
| Human | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| **One-Stage AR Methods** | | | | | |
| Spark TTS | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-0.3B | 0.3B | 0.95 | 65.9 | 1.51 | 56.5 |

↓ lower is better · ↑ higher is better. SIM = Speaker Similarity.

ASR Evaluation Table

| Model | Params | Librispeech test-clean (WER % ↓) | AISHELL-1 (CER % ↓) |
|---|---|---|---|
| **Models with < 0.5B parameters** | | | |
| Whisper-S | 0.24B | 3.13 | - |
| GPA-0.3B | 0.3B | 8.88 | 4.50 |
| **Models with ≥ 0.5B parameters** | | | |
| Fun-ASR-nano | 0.8B | 1.76 | 1.80 |
| FireRed-ASR | 1.1B | 1.84 | 0.54 |
| GLM-ASR-nano | 1.5B | 2.00 | 1.81 |
| GLM-ASR-nano* | 1.5B | 2.17 | 2.17 |
| Whisper-L | 1.55B | 1.82 | 4.72 |
| Kimi-Audio | - | 1.32 | 0.71 |
| Step-Audio2 | - | 1.17 | 0.63 |
| Seed-ASR | - | 1.58 | 0.68 |
| Seed-ASR* | - | 2.80 | 1.63 |
| Fun-ASR | 7.7B | 1.51 | 1.22 |

ASR results on Librispeech test-clean and AISHELL-1. WER (%) is reported for Librispeech; CER (%) is reported for AISHELL-1.
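For reference, WER and CER are the standard edit-distance error rates; the sketch below computes them from scratch (a generic definition, not tied to any particular benchmark harness):

```python
# Standard edit-distance error rate: WER when tokens are words,
# CER when tokens are characters. Generic definition, not benchmark-specific.

def error_rate(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein distance between ref and hyp, normalized by ref length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference tokens
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis tokens
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over three words -> WER = 1/3.
print(error_rate("the cat sat".split(), "the cat sit".split()))  # 0.333...
```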

Ethics Statement

The GPA model possesses advanced capabilities in voice cloning and speech synthesis. While these features have significant potential for positive applications in accessibility, entertainment, and education, we acknowledge the risk of misuse, such as deepfake generation or voice spoofing.

We are committed to responsible AI development. The released models are intended for academic research and personal educational use only. We have implemented watermarking techniques in the generated audio to aid in detection. Users are strictly prohibited from using this technology for illegal purposes, including but not limited to fraud, defamation, or impersonation without consent. By using this software, you agree to adhere to these ethical guidelines.