
MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance
MOSS-Speech introduces true end-to-end speech interaction. Unlike cascaded pipelines or text-guided models, it generates speech directly without first producing text. This design preserves intonation, emotion, and other paralinguistic cues while retaining the knowledge of the pretrained text LLM, enabling more natural and efficient speech-to-speech dialogue.
Features
True Speech-to-Speech Large Language Model
MOSS-Speech understands and generates speech directly, with no text guidance required. It can capture and produce emotion, laughter, and other paralinguistic cues, enabling more natural and efficient interaction.
New Architecture for Modality Alignment
Built upon a pretrained text LLM, MOSS-Speech introduces a modality-layered design combined with frozen pretraining. This lets the model retain the abilities of the text LLM while adding native speech understanding and generation capabilities.
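
As a rough illustration of the modality-layered idea, the PyTorch sketch below routes text and speech tokens through separate feed-forward paths while the pretrained text-path weights stay frozen. All module names, shapes, and the routing scheme are illustrative assumptions, not the actual MOSS-Speech implementation.

```python
# Illustrative sketch only; not the MOSS-Speech architecture or API.
import torch
import torch.nn as nn

class ModalitySplitLayer(nn.Module):
    """One transformer layer with a shared (frozen) text path plus a
    parallel speech-specific path, routed by each token's modality."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Pretrained text path: weights stay frozen during speech training.
        self.text_ffn = nn.Linear(d_model, d_model)
        for p in self.text_ffn.parameters():
            p.requires_grad = False
        # Newly added speech path: trained on speech tokens.
        self.speech_ffn = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, is_speech: torch.Tensor) -> torch.Tensor:
        # Shared self-attention over the mixed text/speech sequence.
        h, _ = self.attn(x, x, x)
        h = self.norm1(x + h)
        # Route each position through its modality-specific feed-forward.
        out = torch.where(is_speech.unsqueeze(-1),
                          self.speech_ffn(h),
                          self.text_ffn(h))
        return self.norm2(h + out)

# Toy usage: a batch where the last five positions are speech tokens.
layer = ModalitySplitLayer(d_model=64, n_heads=4)
x = torch.randn(2, 10, 64)
is_speech = torch.zeros(2, 10, dtype=torch.bool)
is_speech[:, 5:] = True
y = layer(x, is_speech)
```

Freezing the shared text path is what lets the speech path be trained without degrading the original LLM's text abilities.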
Native Multimodal Support
MOSS-Speech handles both speech and text seamlessly, supporting flexible multimodal interaction including: Speech → Speech, Text → Speech, Speech → Text, and Text → Text.
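
A minimal usage sketch of these interaction modes, assuming a Hugging Face-style interface. The model id, the processor calls, and the `output_modality` argument are placeholders rather than the released API; consult the repository's official instructions for actual usage.

```python
# Hypothetical usage sketch; names below are placeholders, not the released API.
import soundfile as sf
from transformers import AutoModel, AutoProcessor

MODEL_ID = "OpenMOSS/MOSS-Speech"  # placeholder model id

model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Speech -> Speech: audio in, audio out, with no intermediate text step.
audio, sr = sf.read("question.wav")
inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
reply = model.generate(**inputs, output_modality="speech")  # placeholder flag
sf.write("reply.wav", reply.squeeze().numpy(), sr)  # assumes a waveform output

# Text -> Speech: the same model also accepts text prompts.
inputs = processor(text="Give me a one-sentence weather report.",
                   return_tensors="pt")
reply = model.generate(**inputs, output_modality="speech")  # placeholder flag
```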
Demo
Model inference segments in the demo videos have been sped up.