All in one, built for all. A single model delivering near-SOTA performance on TTS and ASR in a unified open framework.

GPA-v1.5 Release

GPA: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

General Purpose Audio (GPA) unifies speech recognition and speech synthesis in one autoregressive audio-language model. GPA-v1.5 ships with native PyTorch workflows and ONNX Runtime deployment.

AutoArk-AI

Abstract

Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization.

We present General Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications.

This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline for practical deployment.
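
To make instruction-driven task induction concrete, the sketch below shows one way a single token sequence could encode ASR, TTS, and VC requests over a shared vocabulary. It is a minimal illustration, not GPA's actual prompt format: the special-token names (`<|task_asr|>`, `<|audio_start|>`, `<|sep|>`, and so on), the `build_prompt` helper, and the character-level text split are all assumptions made for this example.

```python
from typing import List, Optional, Sequence

# Hypothetical special tokens for illustration; GPA's real vocabulary may differ.
TASK_TAGS = {"asr": "<|task_asr|>", "tts": "<|task_tts|>", "vc": "<|task_vc|>"}


def audio_tokens(codes: Sequence[int]) -> List[str]:
    """Render discrete codec indices as tokens of the shared vocabulary."""
    return [f"<|audio_{code}|>" for code in codes]


def build_prompt(
    task: str,
    text: Optional[str] = None,
    source_audio: Optional[Sequence[int]] = None,
    reference_audio: Optional[Sequence[int]] = None,
) -> List[str]:
    """Build the conditioning sequence that the model would continue token by token."""
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task: {task!r}")
    prompt = [TASK_TAGS[task]]
    if task == "asr":
        if source_audio is None:
            raise ValueError("ASR needs source_audio")
        # Audio in, text out: the continuation would be text tokens.
        prompt += audio_tokens(source_audio) + ["<|text_start|>"]
    elif task == "tts":
        if text is None:
            raise ValueError("TTS needs text")
        if reference_audio is not None:
            # Optional reference clip for zero-shot voice cloning.
            prompt += audio_tokens(reference_audio)
        # Character-level split stands in for a real text tokenizer.
        prompt += list(text) + ["<|audio_start|>"]
    else:  # "vc"
        if source_audio is None or reference_audio is None:
            raise ValueError("VC needs source_audio and reference_audio")
        # Target timbre first, then the content to convert; audio out.
        prompt += audio_tokens(reference_audio) + ["<|sep|>"]
        prompt += audio_tokens(source_audio) + ["<|audio_start|>"]
    return prompt


if __name__ == "__main__":
    print(build_prompt("asr", source_audio=[3, 7]))
    print(build_prompt("tts", text="hi", reference_audio=[12]))
```

The point of the sketch is that only the prompt changes between tasks; the same next-token predictor would serve all three, which is what "without architectural modifications" implies.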

GPA-v1.5 extends this direction as the new mainline release, delivering stronger ASR and TTS performance while preserving the unified modeling objective.

Model Overview

Figure 1. GPA unifies speech understanding and generation in a single autoregressive audio-language model.

Native Training

Fine-tune or continue training GPA-v1.5 with Hugging Face Trainer and the v1.5 training package.

Native Inference

Run GPA-v1.5 directly through Hugging Face and PyTorch to establish ASR and TTS behavior baselines.

ONNX Runtime

Use local CLI inference, FastAPI service deployment, browser UI testing, voice registration, and runtime validation.

GPA-v1.5 Release

GPA-v1.5 is the new mainline release of GPA: a unified 0.6B audio model for ASR and TTS, with native PyTorch workflows and ONNX Runtime deployment now available.

Checkpoint: Open-sourced on Hugging Face
Native Inference: Direct PyTorch / Hugging Face execution for ASR and TTS
Native Training: Fine-tuning and continued training with Hugging Face Trainer
ONNX Runtime: CLI inference, FastAPI service, browser UI, voice registration, and runtime validation
Planned: Voice Conversion support in the native v1.5 path

GPA-TTS: Edge-Ready Voice-Cloning TTS

TTS is one of the most popular features in the online demo, so GPA-TTS extracts the TTS component into a standalone, self-contained runtime.

Quantization: Qwen INT4 plus Detokenizer INT8 / FP16 / FP32 with ONNX Runtime
Voice Cloning: Zero-shot voice cloning from a short reference audio
Decoder Precision: Selectable at runtime: INT8 for edge, FP16 for balanced use, FP32 for highest quality
Footprint: Among the smallest open-source TTS runtimes with cloning support
Optimized For: Local CPU inference on Mac, Linux, and edge devices
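
As a rough guide to what the precision options mean for storage, the sketch below estimates raw weight size from parameter count and bit width. The 600M parameter count is borrowed from the 0.6B figure quoted for GPA-v1.5 purely as an illustrative input; the actual split between the language model and the detokenizer is not stated here, and real quantized files add overhead (scales, zero-points, metadata) that this arithmetic ignores.

```python
# Bits stored per weight for each precision named in the runtime options.
BITS_PER_WEIGHT = {"int4": 4, "int8": 8, "fp16": 16, "fp32": 32}


def weight_bytes(num_params: int, precision: str) -> int:
    """Raw weight storage in bytes, ignoring quantization metadata overhead."""
    if precision not in BITS_PER_WEIGHT:
        raise ValueError(f"unknown precision: {precision!r}")
    return num_params * BITS_PER_WEIGHT[precision] // 8


if __name__ == "__main__":
    illustrative_params = 600_000_000  # assumption: stand-in value, see lead-in
    for precision in ("int4", "int8", "fp16", "fp32"):
        size_mb = weight_bytes(illustrative_params, precision) / 1e6
        print(f"{precision}: {size_mb:.0f} MB")
```

Under these assumptions, each halving of bit width halves the weight storage, which is why INT4 and INT8 are the natural choices for edge devices.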

Demo

TTS: Zero-Shot Voice Cloning

Synthesizing speech from text while cloning the timbre of a reference audio sample.

Demo columns: Reference Audio, Input Text, Generated Audio.

Voice Conversion (VC)

Coming soon. GPA-v1.5 does not include the web VC demo yet. Native Voice Conversion support is planned and this section will be updated when it is available.

GPA-v1.5 Evaluation Results

TTS Evaluation Table

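| Model | Open-Source | Model Size | test-zh CER (%) | test-zh Sim (%) | test-en WER (%) | test-en Sim (%) |
|---|---|---|---|---|---|---|
| Multi-Stage or NAR Methods | | | | | | |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | No | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | No | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | Yes | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | Yes | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | Yes | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | Yes | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | Yes | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | Yes | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | Yes | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | Yes | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | Yes | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | Yes | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | Yes | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | Yes | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| One-Stage AR Methods | | | | | | |
| Spark TTS | Yes | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-v1.5 | Yes | 0.6B | 1.03 | 70.2 | 1.43 | 63.5 |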
Lower CER/WER is better. Higher speaker similarity is better.
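
For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly computed. WER and CER are the edit distance between reference and hypothesis, divided by the reference length, over words and characters respectively. The similarity function assumes Sim is the cosine similarity between two speaker-embedding vectors scaled to a percentage, which is a common convention but is not confirmed by this page. Published benchmarks also normalize text (case, punctuation) before scoring, which this sketch skips.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def speaker_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors, in percent (assumed convention)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return 100.0 * sum(x * y for x, y in zip(a, b)) / norm


if __name__ == "__main__":
    # One substitution and one deletion over six reference words.
    print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))
    # One deleted character over four reference characters.
    print(round(cer("你好世界", "你好世"), 2))
```

Because both error rates divide by the reference length, a hypothesis with many insertions can score above 100%.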

ASR Evaluation Table

ASR results on LibriSpeech, AISHELL-1, test_Meeting, and test_Net. WER (%) is reported for LibriSpeech; CER (%) is reported for AISHELL-1.

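| Model | Model Size | LibriSpeech test-clean | LibriSpeech test-other | AISHELL-1 | test_Meeting | test_Net |
|---|---|---|---|---|---|---|
| Whisper-S | 0.24B | 3.43 | 7.63 | - | - | - |
| GPA-v1.5 | 0.6B | 2.78 | 5.02 | 2.83 | 7.40 | 6.49 |
| Fun-ASR-nano | 0.8B | 1.76 | 4.33 | 1.80 | 6.60 | 6.01 |
| FireRed-ASR | 1.1B | 1.84 | 4.52 | 0.54 | 4.95 | 4.94 |
| GLM-ASR-nano | 1.5B | 2.00 | 4.19 | 1.81 | 6.73 | - |
| GLM-ASR-nano* | 1.5B | 2.17 | 4.43 | 2.17 | 8.21 | 6.33 |
| Whisper-L | 1.55B | 1.86 | 3.43 | 4.72 | 18.39 | 11.89 |
| Kimi-Audio | - | 1.32 | 2.63 | 0.71 | 6.24 | 6.45 |
| Step-Audio2 | - | 1.17 | 2.42 | 0.63 | 4.75 | 4.67 |
| Seed-ASR | - | 1.58 | 2.84 | 0.68 | 5.69 | 4.66 |
| Seed-ASR* | - | 2.80 | 5.69 | 1.63 | 7.07 | 4.84 |
| Fun-ASR | 7.7B | 1.51 | 3.03 | 1.22 | 6.17 | 5.46 |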

Citation

If you find GPA useful for your research or projects, please cite us:

@misc{cai2026unifyingspeechrecognitionsynthesis,
      title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
      author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
      year={2026},
      eprint={2601.10770},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.10770},
}