MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

MOSS-Speech introduces true end-to-end speech interaction. Unlike cascaded pipelines or text-guided models, it generates speech directly without first producing text. This design preserves intonation, emotion, and other paralinguistic cues while retaining the knowledge of the pretrained text LLM, enabling more natural and efficient speech-to-speech dialogue.

Features

🎯 True Speech-to-Speech Large Language Model

MOSS-Speech understands and generates speech directly — no text guidance required. It can capture and generate emotion, laughter, and other paralinguistic information, enabling more natural and efficient interaction.

🔧 New Architecture for Modality Alignment

Built upon a pretrained text LLM, MOSS-Speech introduces a modality-layered design with frozen pretraining. This allows the model to retain the abilities of the text LLM while natively adding speech understanding and generation capabilities, as illustrated in the sketch below.
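
This section does not spell out the exact layer split, but the general idea can be illustrated with a minimal PyTorch sketch: the pretrained text backbone is kept frozen, and newly added speech-specific embedding and output layers are trained on top of it. All names, sizes, and module choices below are illustrative assumptions, not the actual MOSS-Speech implementation.

```python
import torch
import torch.nn as nn

class SpeechLayeredLM(nn.Module):
    """Minimal sketch: frozen text-LLM backbone + trainable speech layers.

    Hypothetical names and sizes; not the actual MOSS-Speech code.
    """

    def __init__(self, backbone: nn.Module, hidden: int = 1024,
                 speech_vocab: int = 4096):
        super().__init__()
        self.backbone = backbone
        # "Frozen pretraining": keep the text LLM's weights fixed so its
        # knowledge is retained while the speech path is trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Newly added, modality-specific layers that are actually trained.
        self.speech_embed = nn.Embedding(speech_vocab, hidden)
        self.speech_head = nn.Linear(hidden, speech_vocab)

    def forward(self, speech_tokens: torch.LongTensor) -> torch.Tensor:
        h = self.speech_embed(speech_tokens)  # (batch, time, hidden)
        h = self.backbone(h)                  # shared, frozen layers
        return self.speech_head(h)            # logits over speech units

# Stand-in backbone; in practice this would be the pretrained text LLM.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=2,
)
model = SpeechLayeredLM(backbone)
logits = model(torch.randint(0, 4096, (1, 50)))  # -> shape (1, 50, 4096)
```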

📊 Native Multimodal Support

MOSS-Speech handles speech and text seamlessly, supporting four flexible interaction modes: Speech → Speech, Text → Speech, Speech → Text, and Text → Text.
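
One way a single model can cover all four directions is to share one backbone and attach per-modality input embeddings and output heads. The sketch below is purely illustrative; `MultimodalLM` and every name in it are hypothetical, not MOSS-Speech's actual API.

```python
import torch
import torch.nn as nn

class MultimodalLM(nn.Module):
    """Hypothetical sketch: one shared backbone, per-modality embed/head.

    Picking a (source, target) pair yields any of the four directions
    listed above. Names and sizes are assumptions, not MOSS-Speech code.
    """

    def __init__(self, backbone: nn.Module, hidden: int = 1024,
                 text_vocab: int = 32000, speech_vocab: int = 4096):
        super().__init__()
        self.backbone = backbone
        self.embed = nn.ModuleDict({
            "text": nn.Embedding(text_vocab, hidden),
            "speech": nn.Embedding(speech_vocab, hidden),
        })
        self.head = nn.ModuleDict({
            "text": nn.Linear(hidden, text_vocab),
            "speech": nn.Linear(hidden, speech_vocab),
        })

    def forward(self, tokens: torch.LongTensor,
                source: str, target: str) -> torch.Tensor:
        h = self.embed[source](tokens)  # embed in the source modality
        h = self.backbone(h)            # modality-agnostic shared layers
        return self.head[target](h)     # decode into the target modality

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=2,
)
model = MultimodalLM(backbone)
# Speech -> Text: embed speech units, decode text-token logits.
out = model(torch.randint(0, 4096, (1, 50)), source="speech", target="text")
```

In a real autoregressive model the target head would be applied step by step during generation; the single forward pass here only shows the modality routing.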

Demo

Model inference segments in the demo videos have been time-accelerated.