FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis
Yifan Hu1, Rui Liu1,*, Guanglai Gao1, Haizhou Li2,3
1Inner Mongolia University, China
2Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
3National University of Singapore, Singapore
hyfwalker@163.com, liurui_imu@163.com, csggl@imu.edu.cn, haizhouli@cuhk.edu.cn
ABSTRACT
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context. Prior work has exploited the utterance-level correlation between the current utterance and the dialogue history to improve the expressiveness of synthesized speech. However, fine-grained, word-level information in the dialogue history also has an important impact on the prosodic expression of an utterance, and it has not been well studied in prior work. Therefore, we propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependency at the same time during speech generation. Specifically, FCTalker includes fine- and coarse-grained encoders to exploit word- and utterance-level context dependency. To model the word-level dependencies between an utterance and its dialogue history, the fine-grained dialogue encoder is built on top of a dialogue BERT model. Subjective and objective evaluation results show that the proposed model achieves remarkable results compared to the baseline models and generates contextually appropriate expressive speech. We release the source code at: https://github.com/AI-S2-Lab/FCTalker.
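As a minimal sketch of the word-level context modeling idea, the snippet below implements scaled dot-product cross-attention in plain NumPy: each word embedding of the current utterance attends over the word embeddings of the dialogue history to produce a per-word context vector. The function name, embedding dimensions, and random inputs are illustrative assumptions only; the actual FCTalker fine-grained encoder is built on top of a dialogue BERT model as described in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fine_grained_context(current, history):
    """Word-level cross-attention (hypothetical sketch).

    current: (m, d) word embeddings of the current utterance
    history: (n, d) word embeddings of the concatenated dialogue history
    Returns an (m, d) context vector for each word of the current utterance.
    """
    d = current.shape[-1]
    scores = current @ history.T / np.sqrt(d)   # (m, n) word-to-word affinities
    weights = softmax(scores, axis=-1)          # each row sums to 1 over history words
    return weights @ history                    # (m, d) per-word context vectors

# Toy usage: 4 current-utterance words attending over 10 history words, d = 8.
rng = np.random.default_rng(0)
cur = rng.standard_normal((4, 8))
hist = rng.standard_normal((10, 8))
ctx = fine_grained_context(cur, hist)
print(ctx.shape)  # (4, 8)
```

In the full model, these per-word context vectors would be fused with the utterance-level (coarse-grained) representation before conditioning the acoustic decoder.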
EXPERIMENTS
We compare four neural TTS systems, together with the ground-truth recordings:
1) FastSpeech2: a state-of-the-art neural TTS model that takes a single utterance as input, without any conversational context modeling.
2) DailyTalk: a recent neural conversational TTS model with a coarse-grained context encoder.
3) FCTalker w/o coarse: the proposed FCTalker model with the coarse-grained encoder module removed.
4) FCTalker: the proposed model with the fine- and coarse-grained context modeling strategy.
5) Ground Truth: recorded speech from the conversation scenario.
1. Prosodic Expressiveness Evaluation
1.1 Utterance-level
Each utterance below is synthesized by FastSpeech2, DailyTalk, FCTalker w/o coarse, and FCTalker, and compared against the Ground Truth recording (audio samples for each system):
1. No problem. What size would you like?
2. You'll never be in shape until you eat less and take more exercise.
3. Yes, let me see some of your hats, please.
1.2 Dialogue-level
For each sample, the dialogue history and the current utterance are listed; audio is provided for FastSpeech2, DailyTalk, FCTalker w/o coarse, FCTalker, and Ground Truth:
1. Dialogue History: I am looking for a pan. / Current Utterance: No problem. What size would you like?
2. Dialogue History: Mary, can you tell me how you keep in shape? / Current Utterance: You'll never be in shape until you eat less and take more exercise.
3. Dialogue History: Good morning! Can I help you? / Current Utterance: Yes, let me see some of your hats, please.
2. Dialogue History Turns Comparison
2.1 Utterance-level
FCTalker samples synthesized with the number of dialogue history turns Τ varied from 2 to 14, for three current utterances (audio samples for each Τ):
1. There are so many ancient relics in China.
2. Thank you for your compliments. You're welcome to our hotel again.
3. Here's your change.
2.2 Dialogue-level
Note: "xth" denotes the position of the current sentence within the dialogue.
Current Utterance: There are so many ancient relics in China. (audio samples for each dialogue turn Τ)
Dialogue History, turn by turn:
1st: Look, George, there's the Great Wall.
2nd: I see. It's on top of the hills.
3rd: Yeah, it stretches over for thousands of miles.
4th: I know. It's a major symbol of China.
5th: Where can we climb it?
6th: Do we have any choices?
7th: Well, we could take the cable car.
8th: Ah... let's just climb. Umm, it's more fun, I think.
9th: OK. Let's go.
10th: Well, that was tough.
11th: But we made it.
12th: This looks great. When was it built?
13th: It was first built about twenty-five hundred years ago.
14th: That's remarkable.