Unsupervised Learning of Disentangled Speech Content and Style Representation

Andros Tjandra¹, Ruoming Pang², Yu Zhang², Shigeki Karita²
¹NAIST, ²Google LLC

Content X: original input speech fed to the content encoder
Style Y: original input speech fed to the style encoder
Generated content X, style Y: speech synthesized by the decoder from the content encoder's output for X combined with the style encoder's output for Y
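
The data flow behind the three labels above can be sketched as follows. This is only an illustration, not the authors' model: the random linear projections, the feature sizes, the mean-pooled utterance-level style vector, and the names content_encoder, style_encoder, and decoder are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 80 input features per frame, 16-dim content, 8-dim style.
N_FEATS, D_CONTENT, D_STYLE = 80, 16, 8

# Random projections standing in for trained networks (assumption).
W_content = rng.standard_normal((N_FEATS, D_CONTENT))
W_style = rng.standard_normal((N_FEATS, D_STYLE))
W_dec = rng.standard_normal((D_CONTENT + D_STYLE, N_FEATS))

def content_encoder(speech):
    """Map (frames, N_FEATS) features to a per-frame content sequence."""
    return speech @ W_content

def style_encoder(speech):
    """Map (frames, N_FEATS) features to one style vector (assumed mean pooling)."""
    return (speech @ W_style).mean(axis=0)

def decoder(content, style):
    """Repeat the style vector on every frame, concatenate, project to features."""
    tiled = np.broadcast_to(style, (content.shape[0], style.shape[0]))
    return np.concatenate([content, tiled], axis=1) @ W_dec

x = rng.standard_normal((120, N_FEATS))  # "Content X" utterance, 120 frames
y = rng.standard_normal((95, N_FEATS))   # "Style Y" utterance, 95 frames

# "Generated content X, style Y": content from X, style from Y.
generated = decoder(content_encoder(x), style_encoder(y))
```

In this sketch the generated output keeps the frame count of X, because only the content sequence is time-aligned; Y contributes a single vector regardless of its length.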

Tested on Google Chrome