Translatotron 3: Speech to Speech Translation with Monolingual Data

Alon Levkovitch
Yifan Ding
Chulayuth Asawaroengchai
2024
Google Scholar

Abstract

This paper presents a novel approach to train a direct speech-to-speech translation model from monolingual datasets only in a fully unsupervised manner. The proposed approach combines back-translation, denoising autoencoder, and unsupervised embedding mapping techniques to achieve this goal. We demonstrate the effectiveness of the proposed approach by comparing it against a cascaded baseline using two Spanish and English datasets. The proposed approach achieved a significant improvement over the cascaded baseline on synthesized unpaired conversational and synthesized Common Voice $11$ datasets.