Generative Expressive Conversational Speech Synthesis
Rui Liu1,*, Yifan Hu1, Yi Ren2, Xiang Yin2, Haizhou Li3
1Inner Mongolia University, China
2ByteDance
3Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
4National University of Singapore, Singapore
liurui imu@163.com, hyfwalker@163.com, ren.yi@bytedance.com, yinxiang.stephen@bytedance.com, haizhouli@cuhk.edu.cn
ABSTRACT
Conversational Speech Synthesis (CSS) aims to express a target utterance with the proper speaking style in a user-agent conversation setting. Existing CSS methods employ effective multi-modal context modeling techniques to achieve empathy understanding and expression. However, they often require complex network architectures whose modules must be meticulously optimized. In addition, because they rely on small-scale datasets recorded in scripted styles, they often fail to simulate real, natural conversational styles. To address these issues, we propose a novel generative expressive CSS system, termed GPT-Talker. We transform the multimodal information of the multi-turn dialogue history into discrete token sequences and seamlessly integrate them to form a comprehensive user-agent dialogue context. Leveraging the power of GPT, we predict the token sequence of the agent's response, which includes both semantic and style knowledge. The expressive conversational speech is then synthesized by the conversation-enriched VITS to deliver feedback to the user. Furthermore, we propose a large-scale Natural CSS Dataset, called NCSSD, which includes both naturally recorded conversational speech in improvised styles and dialogues extracted from TV shows. It covers both Chinese and English, with a total duration of 236 hours. We conducted comprehensive experiments on the reliability of NCSSD and the effectiveness of GPT-Talker. Both subjective and objective evaluations demonstrate that our model significantly outperforms other state-of-the-art CSS systems in terms of naturalness and expressiveness. Code, dataset, and pre-trained model are available at: https://github.com/walker-hyf/GPT-Talker.
DATASET: NCSSD
Some dialogue samples of the NCSSD we built are shown below:
1. NCSSD(EN)
(1) RC-EN
Samples: NCSSD (RC-EN)
Dialog #1
1. (Spk A) Do you think the music industry is going in the right direction? I'm worried about his future.
2. (Spk B) I understand your concern with the rise of digital streaming and privacy. It's challenging for artists to earn a fair income.
3. (Spk A) Exactly, it's becoming increasingly difficult for them to make a living salary for music.
4. (Spk B) However, technology has also loaded up new opportunities for independent artists to promote their work.
5. (Spk A) That's true. But the competition is fierce and it's hard for them to stand out in the overcrowded market.
6. (Spk B) You are right. But I believe that quality music will always find its audience.
7. (Spk A) I hope so. It would be a shame if talented musicians couldn't pursue their passion due to financial constraints.
8. (Spk B) I sure you confess, but let's also remember that music has always been resilient and adaptable throughout history.
9. (Spk A) Alright, I guess we can only hope for the best and support our favorite artists.
Dialog #2
1. (Spk A) What do you want to do today? I'm bored.
2. (Spk B) Let's go on an adventure.
3. (Spk A) Uh, I hate adventures. They are so tiring.
4. (Spk B) But think about the thrill and excitement we experience.
5. (Spk A) I don't care about through and excitement. I prefer a calm and relaxing day.
6. (Spk B) Come on. It will be a great opportunity to try something new and step out of our comfort zone.
7. (Spk A) I'm sorry, but I really don't enjoy taking risks. I'll pass this time.
8. (Spk B) Fine, I understand that we can do something else.
Dialog #3
1. (Spk A) What's the weather like today?
2. (Spk B) I haven't checked whether focused yet.
3. (Spk A) You should check it. It might read later.
4. (Spk B) Thanks for reminding me. I will check it right away.
5. (Spk A) You're welcome. Let me know if there's any update.
6. (Spk B) Sure, I'll keep you informed.
7. (Spk A) Great, thanks. Have a good day.
(2) CL-EN
Samples: NCSSD (CL-EN)
Dialog #1
1. (Spk A) You know, Rose said you were gonna be threatened by this.
2. (Spk B) Rose is a psychologist now. I thought she was the genealogist.
3. (Spk A) What is your problem with her?
4. (Spk B) Let's start with the fact that she thinks I'd be threatened by you.
5. (Spk A) This is just killing you, isn't it?
6. (Spk B) I don't even know what you're talking about right now.
7. (Spk A) oh, i clothes and you walk away with fifty mile.
8. (Spk B) I got what I did, Tommy, because I stayed here. I fought and I worked my ass off while you went down to Mexico to, I don't know, irrigate corn.
Dialog #2
1. (Spk A) take your watch off.
2. (Spk B) why?
3. (Spk A) you don't wear a Cascio watch with a 2,000 bound wedding dress?
4. (Spk B) I wanna know what time it is.
5. (Spk A) is your wedding day? people will tell you the time.
6. (Spk B) so my brothers don't get to the church on time. people literally taking them watches off them. don't touch me.
Dialog #3
1. (Spk A) joy. when I told you that my grandmother didn't have much time, I didn't realize that she's 15 years older than you.
2. (Spk B) how would you know how old I am? that is a very closely guarded secret.
3. (Spk A) I did some research.
4. (Spk B) very resourceful.
5. (Spk A) I did it to assure myself that you can handle a heavier workload.
6. (Spk B) I appreciate that, but I'm not worried about my health.
7. (Spk A) is it the money? because I can guarantee 200 million up front is just a tenth of what you stand to make on this deal.
8. (Spk B) Michael, my dolls eat better than most people. I've got plenty of money. this deal meant staying on for five more years in doubling my workload. I am going to take my grandchildren on a cruise to Tahiti.
2. NCSSD(ZH)
(1) RC-ZH
Samples: NCSSD (RC-ZH)
Dialog #1
1. (Spk A) 你知道吗?最近啊我参加了一个社会价值观培养的讲座。
2. (Spk B) 真的吗?听起来很有意思,你能告诉我更多关于这个讲座的内容吗?
3. (Spk A) 当然,讲座主要强调了培养良好的社会价值观重要性。
4. (Spk B) 那这个讲座具体讲了哪些社会价值观呢?
5. (Spk A) 他们提到了尊重他人、平等和公正、责任感以及关心环境等。
6. (Spk B) 这些价值观确实非常重要,但是你觉得我们如何培养这些价值观呢?
7. (Spk A) 像托尼啊提到一些方法,比如家庭教育加学校教育和社会环境的影响等。
8. (Spk B) 听起来很有启发性,我也想参加这样的讲座,提升自己的社会价值观。
9. (Spk A) 我可以把很多资料分享给你,你可以研读一下。
10. (Spk B) 太好了,非常感谢你的分享,我会好好研究的。
11. (Spk A) 不客气,希望这些资料对你有帮助。
Dialog #2
1. (Spk A) 你知道吗?最近我发现越来越多的企业开始关注社会责任了。
2. (Spk B) 是吗?具体有哪些系列呢?
3. (Spk A) 比如有一家知名的电子科技公司,他们啊在生产过程中采用了环保材料,减少了对环境的污染。
4. (Spk B) 这真是令人惊喜,不仅能够保护环境,还能提高产品的质量。
5. (Spk A) 没错,而且还有一家餐饮连锁企业,他们主动参与公益活动,关爱弱势群体。
6. (Spk B) 太好了,这样的企业真是让人感到欣慰,希望更多的企业能够履行社会责任。
7. (Spk A) 是的,我也希望能看到更多的企业关注社会责任,为社会责任做出更多的贡献。
8. (Spk B) 我们作为消费者,也应该支持这些积极履行社会责任的企业,共同打造一个更美好的社会。
Dialog #3
1. (Spk A) 你觉得健身训练怎么样啊?
2. (Spk B) 我真的很讨厌健身训练。
3. (Spk A) 为什么呀?健身可以让你保持健康,增强体质啊。
4. (Spk B) 哎,我不喜欢那种累的要死的感觉,而且啊每次锻炼完我都会全身酸痛,真的很痛苦。
5. (Spk A) 但是锻炼可以增加你的耐力和力量,让你更有自信,也能减轻压力啊。
6. (Spk B) 我厌恶健身训练的原因,还包括我对健身房环境不喜欢,人多又吵闹,真的很烦。
7. (Spk A) 嗯,那您可以考虑在家里或者室外进行锻炼,呃,这样可以避免噪音和人群呢。
8. (Spk B) 嗯,可能吧,但我觉得健身训练太枯燥了,我对这个真的没有感情,没有兴趣。
9. (Spk A) 如果你有其他运动爱好,也可以选择其他方式来保持身体健康,不一定非要健身训练呀。
10. (Spk B) 嗯,或许我可以试试其他的运动方式,看看有没有适合我的。谢谢你的建议。
(2) CL-ZH
Samples: NCSSD (CL-ZH)
Dialog #1
1. (Spk A) 没事了。
2. (Spk B) 怎么了?你这是躲谁呢?
3. (Spk A) 啊,就我一个同事特别爱说话,呃,非要拉着我一起下班呃,分摊车钱的那种啊,他话特别多,可我上下班路上就爱听歌睡觉什么的,所以不是特别想跟他。那什么你回去吧,我回家了。来来来。
4. (Spk B) 不行不行,都这么晚了,爸爸一定要送你回去啊。
5. (Spk A) 真的不用了。
6. (Spk B) 也跟爸爸下课去了,听话把门关上。过来。说说你这孩子,哎,不愿意麻烦我,要不是我长个心眼在这等着,你说都这么晚了啊。一个女孩子怎么让人放心?
Dialog #2
1. (Spk A) 哎,小月。
2. (Spk B) 你怎么知道我在这儿啊?
3. (Spk A) 饭点嘛,你不是在这儿,就是在汽配店,反正啊这两个地儿总能逮到你。
4. (Spk B) 真聪明,找我干嘛呀?
5. (Spk A) 哎呀,想你了,过来看看。
6. (Spk B) 得了吧,你不会又找我去什么周末招聘会吧,说好了啊不去。
Dialog #3
1. (Spk A) 什么意思啊,你干嘛还给我钱啊?
2. (Spk B) 咱们之间就别玩虚的了吧。小魏子都给我来电话了,说你又搬家了,还剩借住在朋友那儿。还有那儿子在国外读书,那也花不少钱吧。
3. (Spk A) 行了,我有钱用。
4. (Spk B) 你要多少钱,你跟我说说。
5. (Spk A) 一万,五千。十块,怎么办?分不了。
EXPERIMENTS
1. Reliability verification of NCSSD
We train VITS using our NCSSD in single-sentence scenarios:
NCSSD(EN)
It's just hard to ignore their words. Sometimes I feel like I've lost the joy I used to have.
You are right. It's devastating to witness the destruction of ecosystems and laws of biodiversity.
NCSSD(ZH)
我真是气愤,我们应该向相关部门投诉,不能让这些虚假广告继续欺骗人们了。
确实,市场上的保健品种类繁多,选择起来确实有一些困难。
2. Effectiveness verification of GPT-Talker
We compare GPT-Talker with three advanced CSS systems, each representing a different context learning method, and with ground-truth recordings:
1) GRU-based Context Learning: The conversational context-aware TTS model (which we call "CCATTS" here) employs a GRU-based network to model the sentence-level dependency among the dialogue context; [17](Guo et al. 2020)
2) Multi-scale Context Learning: FCTalker is a representative work that considers both sentence-level and word-level context within the dialogue; [19](Hu et al. 2022)
3) Heterogeneous Graph-based Context Learning: ECSS is an advanced expressive and emotional CSS model that adopts a heterogeneous graph to model the complex relations among the multi-modal context; [34](Liu et al. 2024)
4) GPT-Talker (Proposed): Our novel generative expressive CSS system.
5) Ground Truth: Real recorded speech.
We conducted experiments using two standard datasets for conversational speech synthesis:
1) DailyTalk: The dataset contains 2,541 English dialogues, totaling approximately 20 hours. [27](Lee et al. 2023)
2) NCSSD: The dataset we have constructed consists of 19,456 dialogues, totaling approximately 236 hours. It includes two forms of construction, Collected (CL) and Recorded (RC), and covers two languages, English (EN) and Chinese (ZH).
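As a rough sanity check on the figures above, the average dialogue length of each dataset can be derived from the reported totals. The sketch below uses only the dialogue counts and approximate durations quoted above; the helper name `mean_dialogue_seconds` is illustrative and not part of the released code.

```python
def mean_dialogue_seconds(total_hours: float, num_dialogues: int) -> float:
    """Average duration of one dialogue, in seconds."""
    return total_hours * 3600 / num_dialogues


# (approximate total hours, number of dialogues), as reported above.
datasets = {
    "DailyTalk": (20, 2541),
    "NCSSD": (236, 19456),
}

for name, (hours, dialogues) in datasets.items():
    print(f"{name}: {mean_dialogue_seconds(hours, dialogues):.1f} s per dialogue")
```

Under these totals, an average dialogue lasts roughly 28 seconds in DailyTalk and roughly 44 seconds in NCSSD. Since both durations are approximate, the averages are indicative only.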
(1) Synthesized speech from the DailyTalk and NCSSD datasets (CL-EN & RC-EN):
Sample #1 (Dataset: DailyTalk):
Dialogue History
1. (Spk A) I often eat fish and eggs.
2. (Spk B) Do you eat a lot of vegetables?
Current Utterance
Audio samples: GRU-based Context Learning; Multi-scale Context Learning; Heterogeneous Graph-based Context Learning; GPT-Talker (Proposed); Ground Truth
3. (Spk A) Oh yes, and fruits. I love fruits very much.
Sample #2 (Dataset: NCSSD(EN)):
Dialogue History
1. (Spk A) What do you think about the current industrial structure? I am afraid it's not sustainable.
2. (Spk B) I totally agree. The overbalance on a single industry is very risky.
Current Utterance
Audio samples: GRU-based Context Learning; Multi-scale Context Learning; Heterogeneous Graph-based Context Learning; GPT-Talker (Proposed); Ground Truth
3. (Spk A) Exactly, if that industry collapses, it will have a devastating impact on the economy.
(2) Experimental results of Three-Stage training:
We design the following five training strategies for validation:
1) One-Stage (w/CL&RC): We train GPT-Talker directly on both the collected (CL) and recorded (RC) subsets of NCSSD.
2) Two-Stage (w/CL): We first pre-train on single-sentence speech synthesis, then fine-tune on the collected (CL) subset.
3) Two-Stage (w/RC): We first pre-train on single-sentence speech synthesis, then fine-tune on the recorded (RC) subset.
4) Two-Stage (w/CL&RC): We first pre-train on single-sentence speech synthesis, then merge the two subsets and fine-tune on the combined data.
5) Three-Stage (Ours): In the first stage, we train ConGPT and ConVITS on single-sentence speech to build their basic modeling capabilities. In the second stage, we continue training ConGPT on the collected (CL) subset of NCSSD. In the third stage, we fine-tune both ConGPT and ConVITS on the recorded (RC) subset of NCSSD to further enhance the naturalness and expressiveness of the synthesized speech.
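To make the five strategies easier to compare, they can be written down as a small schedule table. This is only a descriptive sketch: the `Stage` class and the `STRATEGIES` mapping are hypothetical names introduced here, and `modules=None` marks stages for which the descriptions above do not say which modules are updated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    data: str                            # data the stage trains on
    modules: Optional[Tuple[str, ...]]   # None = not specified in the descriptions above


SINGLE = "single-sentence speech"

STRATEGIES = {
    "One-Stage (w/CL&RC)": [Stage("CL+RC", None)],
    "Two-Stage (w/CL)": [Stage(SINGLE, None), Stage("CL", None)],
    "Two-Stage (w/RC)": [Stage(SINGLE, None), Stage("RC", None)],
    "Two-Stage (w/CL&RC)": [Stage(SINGLE, None), Stage("CL+RC", None)],
    "Three-Stage (Ours)": [
        Stage(SINGLE, ("ConGPT", "ConVITS")),
        Stage("CL", ("ConGPT",)),
        Stage("RC", ("ConGPT", "ConVITS")),
    ],
}

# Print each strategy as an ordered list of training data.
for name, stages in STRATEGIES.items():
    print(name, "->", " | ".join(stage.data for stage in stages))
```

Laid out this way, the three-stage strategy differs from the two-stage ones mainly in keeping the collected and recorded subsets in separate, ordered stages instead of merging them or using only one.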
Sample #1 (Dataset: NCSSD(EN)):
Dialogue History
1. (Spk A) I used to love playing the piano, but I just can't find the motivation anymore.
2. (Spk B) Why? What happened, used to be so passionate about it?
Current Utterance
Audio samples: One-Stage (w/CL&RC); Two-Stage (w/CL); Two-Stage (w/RC); Two-Stage (w/CL&RC); Three-Stage (Ours)
3. (Spk A) Well, I faced a lot of criticism from my family. They never believed in my talent.
Sample #2 (Dataset: NCSSD(ZH)):
Dialogue History
1. (Spk A) 我们人类啊,终于可以走出地球,探索无垠的宇宙了。
2. (Spk B) 是啊,这是人类的伟大壮举,让我对未来充满希望。
Current Utterance
Audio samples: One-Stage (w/CL&RC); Two-Stage (w/CL); Two-Stage (w/RC); Two-Stage (w/CL&RC); Three-Stage (Ours)
3. (Spk A) 哎,我想到啊那些为太空付出生命的宇航员,他们啊真的是英雄啊。
(3) GPT-Talker-synthesized emotional speech:
Sample #1 (Dataset: NCSSD(EN)):
Dialogue History
1. (Spk A) I know, right? And features some one blowing.
2. (Spk B) This car is definitely a game tender. I'm impressed.
Current Utterance (Surprise)
Audio samples: GPT-Talker (ours); Ground Truth
3. (Spk A) Exactly, I never thought I would see such innovation in your car.
Sample #2 (Dataset: NCSSD(EN)):
Dialogue History
1. (Spk A) It's about embracing sustainability and ethical practices in fashion.
2. (Spk B) That sounds amazing. I'm glad the industry is moving towards a more responsible.
Current Utterance (Happy)
Audio samples: GPT-Talker (ours); Ground Truth
3. (Spk A) Yes, it's definitely your step in the right direction. It's important to create beautiful designs without harming the environment.
Sample #3 (Dataset: NCSSD(ZH)):
Dialogue History
1. (Spk A) 没错,通过比较地区研究,我们可以了解到不同地区的文化、经济、政治方面的差异,这对于全球化时代的理解非常重要。
2. (Spk B) 而且啊,比较地区研究还可以为国际合作提供参考,促进不同地区之间的交流和合作呢。
Current Utterance (Happy)
Audio samples: GPT-Talker (ours); Ground Truth
3. (Spk A) 是的,通过比较地区研究,我们可以找到各个地区的优势和特色,从而实现互利共赢的合作关系。
Sample #4 (Dataset: NCSSD(ZH)):
Dialogue History
1. (Spk A) 这些冰雹真的太大了,完全没有办法防御嘛。
2. (Spk B) 是啊,我刚刚修好的车窗也被砸了。
Current Utterance (Angry)
Audio samples: GPT-Talker (ours); Ground Truth
3. (Spk A) 这种天气真是太让人愤怒了。
(4) Experimental results of dialogue turn setting:
To further investigate the influence of the number of dialogue turns (the length of the dialogue context) on conversational speech synthesis, we vary the number of dialogue turns fed to the model during inference and compare the results:
Note: Setting the dialogue turn to N means that the context used during inference includes N-1 historical dialogue sentences.
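The relation between the dialogue turn setting and the amount of history can be illustrated with a short helper. This is a minimal sketch, not the released implementation: the function name `select_context` is hypothetical, and it assumes that the retained sentences are the most recent ones, which the note above does not state explicitly.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker, text)


def select_context(history: List[Utterance], dialogue_turn: int) -> List[Utterance]:
    """Keep (dialogue_turn - 1) history sentences.

    Assumption: the most recent sentences are the ones that are kept.
    """
    if dialogue_turn < 1:
        raise ValueError("dialogue_turn must be >= 1")
    keep = dialogue_turn - 1
    return history[-keep:] if keep > 0 else []


# Hypothetical three-sentence history, for illustration only.
history = [("A", "first sentence"), ("B", "second sentence"), ("A", "third sentence")]

for n in (1, 2, 3):
    print(n, [text for _, text in select_context(history, n)])
```

With a dialogue turn of 1 the model sees no history at all, so that setting behaves like single-sentence synthesis of the current utterance.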
Sample #1 (Dataset: DailyTalk):
Dialogue History
1. (Spk A) Exercise has no benefit unless you sweat like a pig.
2. (Spk B) Well, that's not for me.
3. (Spk A) Thanks for coming, it was a real blessed.
Current Utterance
Audio samples by dialogue turn: 1; 2 (ours); 3
4. (Spk B) Get out of it! it wasn't as good as you think.
Sample #2 (Dataset: NCSSD(EN)):
Dialogue History
1. (Spk A) Oh no. Why are you feeling so sad about?
2. (Spk B) Well, I had high expectations for the show, but it turned out to be quite disappointing.
3. (Spk A) That's too bad. Can you tell me what digit didn't you like about it?
Current Utterance
Audio samples by dialogue turn: 1; 2 (ours); 3
4. (Spk B) The story line was weak and characters lacked depth. It feel like a waste of time.
Sample #3 (Dataset: NCSSD(ZH)):
Dialogue History
1. (Spk A) 嗯,但是现在很少有人关注版画艺术了,市场需求不足啊。
2. (Spk B) 确实如此,但我认为版画艺术的独特性和历史价值使其不会轻易被淘汰。
3. (Spk A) 嗯,或许你说的有道理,但我还是担心版画艺术的发展。
Current Utterance
Audio samples by dialogue turn: 1; 2 (ours); 3
4. (Spk B) 毕竟艺术市场是变幻莫测的,我们只能尽力去创作和推广。
(5) Experimental results of context serialization:
Sample #1 (Dataset: NCSSD(EN)):
Dialogue History
1. (Spk A) Do you remember our beloved pet? I miss him so much.
2. (Spk B) Yes, I miss him too. He was such a loyal companion.
Current Utterance
Audio samples: AABB-Format; ABAB-Format (ours)
3. (Spk A) I can't help but feel guilty for not taking better care of him towards the end.
Sample #2 (Dataset: NCSSD(ZH)):
Dialogue History
1. (Spk A) 我觉得历史可以帮助我们了解过去,从而更好的理解现在和未来。
2. (Spk B) 确实,历史是一门重要学科,可以让我们从中吸取经验和教训。
Current Utterance
Audio samples: AABB-Format; ABAB-Format (ours)
3. (Spk A) 嗯,是的,而且学习历史还可以培养我们的思考能力和批判思维呢。
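This page does not define the AABB and ABAB formats. Purely as an illustration, the sketch below assumes that they differ in how each utterance's text tokens and speech tokens are ordered in the context sequence: interleaved utterance by utterance (taken here as ABAB) or grouped by modality (taken here as AABB). The function names and token strings are hypothetical, and the actual definitions may differ.

```python
from typing import List, Tuple

# (text tokens, speech tokens) for one utterance
Turn = Tuple[List[str], List[str]]


def serialize_interleaved(turns: List[Turn]) -> List[str]:
    """Assumed ABAB reading: text then speech tokens, utterance by utterance."""
    sequence: List[str] = []
    for text_tokens, speech_tokens in turns:
        sequence.extend(text_tokens)
        sequence.extend(speech_tokens)
    return sequence


def serialize_grouped(turns: List[Turn]) -> List[str]:
    """Assumed AABB reading: all text tokens first, then all speech tokens."""
    all_text = [tok for text_tokens, _ in turns for tok in text_tokens]
    all_speech = [tok for _, speech_tokens in turns for tok in speech_tokens]
    return all_text + all_speech


# Toy tokens: t = text, s = speech; the digit is the utterance index.
turns = [(["t1a", "t1b"], ["s1a"]), (["t2a"], ["s2a", "s2b"])]

print(serialize_interleaved(turns))  # ['t1a', 't1b', 's1a', 't2a', 's2a', 's2b']
print(serialize_grouped(turns))      # ['t1a', 't1b', 't2a', 's1a', 's2a', 's2b']
```

Under this assumed reading, the interleaved order keeps each utterance's text next to its own speech tokens, whereas the grouped order separates them by the full length of the dialogue.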
(6) Experimental results of Zero-Shot timbre rendering:
Sample #1 (Dataset: NCSSD(EN)):
Dialogue History
1. (Spk A) Well, I faced a lot of criticism from my family. They never believed in my talent.
2. (Spk B) Just, really. Disheartening. Don't let their negative opinions bring you down. Keep doing what you love.
Unseen Speaker (From IEMOCAP)
How am I supposed to get an ID without an ID? How does a person get an ID in the first place?
Synthesized speech from GPT-Talker
3. (Unseen Spk) It's just hard to ignore their words. Sometimes I feel like I've lost the joy I used to have.
Sample #2 (Dataset: NCSSD(ZH)):
Dialogue History
1. (Spk A) 是啊,冰雹真的很讨厌。
2. (Spk B) 冰雹刀砸的我家的玻璃都破了,真是气死人了。
Unseen Speaker (From M3ED)
想让你帮我找一个司机。
Synthesized speech from GPT-Talker
3. (Unseen Spk) 别生气了,这个天气真的是太讨厌了。