Unsupervised Learning of Disentangled Speech Content and Style Representation

Andros Tjandra¹, Ruoming Pang², Yu Zhang², Shigeki Karita²
¹NAIST, ²Google LLC

Description:
Content X: the original input speech fed to the content encoder
Style Y: the original input speech fed to the style encoder
Generated content X, style Y: speech synthesized by the decoder from the content encoder output for X combined with the style encoder output for Y
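
The data flow behind the three labels can be sketched as below. This is only a toy illustration under stated assumptions: the function names `content_encoder`, `style_encoder`, `decoder`, and `convert` are hypothetical, and the arithmetic inside them is a placeholder, not the authors' model. It assumes the content encoder keeps one value per frame and the style encoder reduces an utterance to a single summary value.

```python
# Toy stand-ins: every function here is hypothetical, not the authors' model.

def content_encoder(speech):
    # Assumption: one content feature per input frame (time resolution kept).
    return list(speech)

def style_encoder(speech):
    # Assumption: a single utterance-level summary (here, simply the mean).
    return sum(speech) / len(speech)

def decoder(content, style):
    # Assumption: every content frame is conditioned on the same style value.
    return [c + style for c in content]

def convert(speech_x, speech_y):
    """Generate output with the content of X and the style of Y."""
    return decoder(content_encoder(speech_x), style_encoder(speech_y))

speech_x = [0.1, 0.2, 0.3]  # utterance supplying the content
speech_y = [1.0, 1.0]       # utterance supplying the style
generated = convert(speech_x, speech_y)
```

In this sketch the output length follows utterance X, while changing utterance Y shifts every generated frame, which mirrors how each sample below pairs one content input with one style input.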

Tested on Google Chrome

Audio

Content X
Style Y
Generated content X, style Y

Content X
Style Y
Generated content X, style Y

Content X
Style Y
Generated content X, style Y

Content X
Style Y
Generated content X, style Y

Content X
Style Y
Generated content X, style Y

Content X
Style Y
Generated content X, style Y

Content X
Style Y
Generated content X, style Y

Content X
Style Y
Generated content X, style Y
