FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis
Yifan Hu1, Rui Liu1,*, Guanglai Gao1, Haizhou Li2,3
1Inner Mongolia University, China
2Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
3National University of Singapore, Singapore
hyfwalker@163.com, liurui_imu@163.com, csggl@imu.edu.cn, haizhouli@cuhk.edu.cn
ABSTRACT
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context. Prior work has exploited the utterance-level correlation between the current utterance and the dialogue history to improve the expressiveness of synthesized speech. However, fine-grained, word-level information in the dialogue history also has an important impact on the prosodic expression of an utterance, and it has not been well studied in prior work. Therefore, we propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependency at the same time during speech generation. Specifically, FCTalker includes fine- and coarse-grained encoders to exploit word- and utterance-level context dependency. To model the word-level dependencies between an utterance and its dialogue history, the fine-grained dialogue encoder is built on top of a dialogue BERT model. Subjective and objective evaluation results show that the proposed model achieves remarkable results compared to the baseline models and generates contextually appropriate expressive speech. We release the source code at: https://github.com/AI-S2-Lab/FCTalker.
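As a minimal sketch of the word-level context modeling idea, the snippet below implements scaled dot-product cross-attention in plain NumPy: each word embedding of the current utterance attends over the word embeddings of the dialogue history to produce a per-word context vector. The function name, embedding dimensions, and random inputs are illustrative assumptions only; the actual FCTalker fine-grained encoder is built on top of a dialogue BERT model as described in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fine_grained_context(current, history):
    """Word-level cross-attention (hypothetical sketch).

    current: (m, d) word embeddings of the current utterance
    history: (n, d) word embeddings of the concatenated dialogue history
    Returns an (m, d) context vector for each word of the current utterance.
    """
    d = current.shape[-1]
    scores = current @ history.T / np.sqrt(d)   # (m, n) word-to-word affinities
    weights = softmax(scores, axis=-1)          # each row sums to 1 over history words
    return weights @ history                    # (m, d) per-word context vectors

# Toy usage: 4 current-utterance words attending over 10 history words, d = 8.
rng = np.random.default_rng(0)
cur = rng.standard_normal((4, 8))
hist = rng.standard_normal((10, 8))
ctx = fine_grained_context(cur, hist)
print(ctx.shape)  # (4, 8)
```

In the full model, these per-word context vectors would be fused with the utterance-level (coarse-grained) representation before conditioning the acoustic decoder.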
EXPERIMENTS
We compare four neural TTS systems, together with the ground-truth recordings:
1) FastSpeech2: a state-of-the-art neural TTS model that takes a single utterance as input, without any conversational context modeling.
2) DailyTalk: a recent neural conversational TTS model with a coarse-grained context encoder.
3) FCTalker w/o coarse: the proposed FCTalker model with the coarse-grained encoder module removed.
4) FCTalker: the proposed model with the fine- and coarse-grained context modeling strategy.
5) Ground Truth: recorded speech from the conversation scenario.
1. Prosodic Expressiveness Evaluation
1.1 Utterance-level
Each utterance below is synthesized by FastSpeech2, DailyTalk, FCTalker w/o coarse, and FCTalker, and compared against the Ground Truth recording (audio samples for each system):
1. No problem. What size would you like?
2. You'll never be in shape until you eat less and take more exercise.
3. Yes, let me see some of your hats, please.
1.2 Dialogue-level
For each sample, the dialogue history and the current utterance are listed; audio is provided for FastSpeech2, DailyTalk, FCTalker w/o coarse, FCTalker, and Ground Truth:
1. Dialogue History: I am looking for a pan. / Current Utterance: No problem. What size would you like?
2. Dialogue History: Mary, can you tell me how you keep in shape? / Current Utterance: You'll never be in shape until you eat less and take more exercise.
3. Dialogue History: Good morning! Can I help you? / Current Utterance: Yes, let me see some of your hats, please.
2. Dialogue History Turns Comparison
2.1 Utterance-level
FCTalker samples synthesized with the number of dialogue history turns Τ varied from 2 to 14, for three current utterances (audio samples for each Τ):
1. There are so many ancient relics in China.
2. Thank you for your compliments. You're welcome to our hotel again.
3. Here's your change.
2.2 Dialogue-level
Note: "xth" denotes the position of the current sentence within the dialogue.
Current Utterance: There are so many ancient relics in China. (audio samples for each dialogue turn Τ)
Dialogue History, turn by turn:
1st: Look, George, there's the Great Wall.
2nd: I see. It's on top of the hills.
3rd: Yeah, it stretches over for thousands of miles.
4th: I know. It's a major symbol of China.
5th: Where can we climb it?
6th: Do we have any choices?
7th: Well, we could take the cable car.
8th: Ah... let's just climb. Umm, it's more fun, I think.
9th: OK. Let's go.
10th: Well, that was tough.
11th: But we made it.
12th: This looks great. When was it built?
13th: It was first built about twenty-five hundred years ago.
14th: That's remarkable.