This page presents the audio samples accompanying the paper "Preserving Background Sound in Noise-robust Voice Conversion via Multi-task Learning".

Abstract

Background sound helps provide a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research on VC, which mainly focuses on clean voices, has paid little attention to VC with background sound. The critical problems in preserving background sound in VC are the inevitable speech distortion introduced by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework trained via multi-task learning that sequentially cascades a source separation (SS) module, a bottleneck feature extraction module, and a VC module. Specifically, the source separation task explicitly considers critical phase information and limits the distortion caused by the imperfect separation process. The source separation task, the typical VC task, and the unified task share a uniform reconstruction loss constrained by joint training, which reduces the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving quality and speaker similarity comparable to VC models trained with clean data.


Proposed Multi-task Framework

Fig. 1. (a) System overview of the proposed multi-task framework; BN denotes bottleneck features. (b) SS module. (c) VC module. Solid lines represent forward propagation; dashed lines indicate which loss functions are applied to each module's output.
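As a reading aid, here is a minimal PyTorch sketch of the data flow in Fig. 1: the SS module separates the noisy mixture into speech and background, bottleneck (BN) features are extracted from the separated speech, and the VC module decodes them together with a speaker embedding; the unified output recombines the converted speech with the separated background. All module choices, dimensions, and the L1 reconstruction loss are illustrative assumptions, not the paper's exact architecture, and this mel-domain sketch omits the phase information that the actual SS task exploits.

```python
# Minimal sketch of the cascaded multi-task framework in Fig. 1.
# All module internals and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskVC(nn.Module):
    def __init__(self, n_mels=80, bn_dim=256, spk_dim=128):
        super().__init__()
        # (b) SS module: predicts speech and background from the mixture.
        self.separator = nn.GRU(n_mels, 2 * n_mels, batch_first=True)
        # Bottleneck (BN) feature extractor applied to the separated speech.
        self.bn_extractor = nn.Linear(n_mels, bn_dim)
        # (c) VC module: decodes BN features plus a speaker embedding.
        self.vc_decoder = nn.GRU(bn_dim + spk_dim, n_mels, batch_first=True)

    def forward(self, mixture, spk_emb):
        # mixture: (B, T, n_mels); spk_emb: (B, spk_dim)
        sep, _ = self.separator(mixture)
        speech, background = sep.chunk(2, dim=-1)
        bn = self.bn_extractor(speech)
        spk = spk_emb.unsqueeze(1).expand(-1, bn.size(1), -1)
        converted, _ = self.vc_decoder(torch.cat([bn, spk], dim=-1))
        # Unified output: converted speech recombined with the background.
        return speech, converted, converted + background

def multi_task_loss(outputs, targets):
    # The SS, VC, and unified tasks share one reconstruction loss form
    # (L1 assumed here); joint training optimizes all three together.
    speech, converted, unified = outputs
    clean_ref, vc_ref, unified_ref = targets
    l1 = nn.L1Loss()
    return l1(speech, clean_ref) + l1(converted, vc_ref) + l1(unified, unified_ref)
```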

1: Comparison between the Proposed Framework and the Baseline Systems (Ensemble Quality: converted speech together with background sound)

Target speaker: p294 / p334
Source | Baseline 1 (Sep) | Baseline 2 (Sep + Denoise) | Proposed | Upper Bound
F2F
M2F
F2M
M2M
(F2F, M2F, F2M, M2M denote the source-to-target speaker gender pairing; F = female, M = male.)

2: Comparison between the Proposed Framework and the Baseline Systems (Speech Quality)

Target speaker: p294 / p334
Source | Baseline 1 (Sep) | Baseline 2 (Sep + Denoise) | Proposed
F2F
M2F
F2M
M2M

3: Ablation Study

Proposed | Without SS Loss | Without VC Loss | Without Joint Training
Sample 1
Sample 2
Sample 3
Sample 4
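The three ablation conditions above correspond to removing terms from, or decoupling, the joint objective. As a hedged sketch, with the weights \( \lambda \) assumed rather than taken from the paper:

\[
L_{\mathrm{total}} = \lambda_{ss} L^{ss} + \lambda_{vc} L^{vc} + \lambda_{uni} L^{uni},
\]

where \( L^{ss} \), \( L^{vc} \), and \( L^{uni} \) are the shared-form reconstruction losses of the SS task, the VC task, and the unified task; "Without Joint Training" corresponds to optimizing the SS and VC modules separately instead of end-to-end.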

4: Flexible Control

A's Linguistic Content + B's Background Sound + C's Timbre = Final Output (see the code sketch after the samples)
Sample 1
Sample 2
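To illustrate what flexible control means here, the hedged sketch below reuses the hypothetical MultiTaskVC model from the framework sketch above: the SS module separates both mixtures, A's separated speech supplies the linguistic content, C's speaker embedding supplies the timbre, and B's separated background is added back onto the converted speech. The helper name and tensor shapes are assumptions.

```python
import torch

@torch.no_grad()
def flexible_convert(model, mix_a, mix_b, spk_emb_c):
    # mix_a, mix_b: (B, T, n_mels) noisy mixtures; spk_emb_c: (B, spk_dim).
    # Separate both mixtures with the SS module.
    sep_a, _ = model.separator(mix_a)
    speech_a, _ = sep_a.chunk(2, dim=-1)        # A's linguistic content
    sep_b, _ = model.separator(mix_b)
    _, background_b = sep_b.chunk(2, dim=-1)    # B's background sound
    # Convert A's speech toward speaker C's timbre via BN features.
    bn = model.bn_extractor(speech_a)
    spk = spk_emb_c.unsqueeze(1).expand(-1, bn.size(1), -1)
    converted, _ = model.vc_decoder(torch.cat([bn, spk], dim=-1))
    # Recombine: speech with A's content and C's timbre over B's background.
    t = min(converted.size(1), background_b.size(1))
    return converted[:, :t] + background_b[:, :t]
```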

5: Video Dubbing with the Proposed System

Original video

Target Speaker: p334

Target Speaker: p294

6: Mel-spectrogram Comparison of the Ablation Experiments

With \( L_{*}^{ss} \)

Without \( L_{*}^{ss} \)