广东工业大学学报 ›› 2024, Vol. 41 ›› Issue (02): 84-92. doi: 10.12052/gdutxb.230025

• 计算机科学与技术 •

基于时序对齐的风格控制语音合成算法

郭傲1, 许柏炎1, 蔡瑞初1, 郝志峰1,2   

  1. 广东工业大学 计算机学院, 广东 广州 510006;
    2. 汕头大学 理学院, 广东 汕头 515063
  • 收稿日期:2023-02-21 发布日期:2024-04-23
  • Corresponding author: CAI Rui-chu (born 1983), male, professor, Ph.D.; his main research interests are machine learning and causality. E-mail: cairuichu@gmail.com
  • About the author: GUO Ao (born 1999), male, master's student; his main research interests are machine learning and speech synthesis. E-mail: guoao0209@gmail.com
  • Funding:
    National Natural Science Foundation of China (61876043, 61976052, 62206064); National Science Fund for Excellent Young Scholars (62122022); Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0111501)

A Temporal Alignment Based Style Control Algorithm for Text-to-Speech Synthesis

Guo Ao1, Xu Bo-yan1, Cai Rui-chu1, Hao Zhi-feng1,2   

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China;
    2. College of Science, Shantou University, Shantou 515063, China
  • Received: 2023-02-21 Published: 2024-04-23

摘要: 语音合成风格控制的目标是将自然语言转化为对应的富有表现力的音频输出。基于Transformer的风格控制语音合成算法能在保持合成质量的情况下提高合成速度,但仍存在不足:第一,在风格参考音频和文本长度差异大的情况下,存在合成音频部分风格缺失的问题;第二,基于普通注意力的解码过程容易出现复读、漏读以及跳读的问题。针对以上问题,提出了一种基于时序对齐的风格控制语音合成算法(Temporal Alignment Text-to-Speech, TATTS),分别在编码和解码过程中有效利用时序信息。在编码过程中,TATTS提出了时序对齐的交叉注意力模块联合训练风格音频与文本表示,解决了不等长音频文本的对齐问题;在解码过程中,TATTS考虑了音频时序单调性,在Transformer解码器中引入了逐步单调的多头注意力机制,解决了合成音频中出现的错读问题。与基准模型相比,TATTS在LJSpeech和VCTK数据集上的音频自然度分别提升了3.8%和4.8%,在VCTK数据集上风格相似度提升了10%,验证了该语音合成算法的有效性,并体现出其风格控制与迁移能力。
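
To make the encoder-side idea above concrete, the following is a minimal PyTorch sketch of the kind of temporally aligned cross-attention module the abstract describes: text hidden states serve as queries and style-audio frame features as keys and values, so every text position receives style information even when the reference audio and the text differ greatly in length. The class name, dimensions, and single-layer structure are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class TemporalAlignmentCrossAttention(nn.Module):
    """Aligns a variable-length style-audio sequence to a text sequence.

    Text hidden states act as queries; style-audio frame features act as
    keys/values, so each text position receives a style vector regardless
    of the length mismatch between reference audio and text.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden, style_frames, style_padding_mask=None):
        # text_hidden:        (B, T_text, d_model)  text encoder outputs
        # style_frames:       (B, T_audio, d_model) reference-audio frame features
        # style_padding_mask: (B, T_audio) bool, True at padded audio frames
        style_ctx, _ = self.attn(query=text_hidden, key=style_frames,
                                 value=style_frames,
                                 key_padding_mask=style_padding_mask)
        # Residual connection keeps the original text content intact.
        return self.norm(text_hidden + style_ctx)


# Example: a 50-token sentence conditioned on a 400-frame reference clip.
module = TemporalAlignmentCrossAttention()
out = module(torch.randn(2, 50, 256), torch.randn(2, 400, 256))  # (2, 50, 256)
```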

关键词: 语音合成, 时序对齐, 风格控制, Transformer, 风格迁移

Abstract: The goal of style-controlled speech synthesis is to convert natural language text into correspondingly expressive audio output. Transformer-based style-control speech synthesis algorithms improve synthesis speed while maintaining quality, but two shortcomings remain. First, when the style reference audio and the text differ greatly in length, part of the style is missing from the synthesized audio. Second, the decoding process based on vanilla attention is prone to repetition, omission and skipping. To address these problems, a temporal alignment style control speech synthesis algorithm, TATTS (Temporal Alignment Text-to-Speech), is proposed, which effectively exploits temporal information in the encoding and decoding processes. In the encoding process, TATTS introduces a temporally aligned cross-attention module that jointly trains style-audio and text representations, solving the alignment problem between audio and text of unequal length. In the decoding process, TATTS accounts for the temporal monotonicity of audio and introduces a stepwise monotonic multi-head attention mechanism into the Transformer decoder, solving the misreading problems in synthesized audio. Compared with the baseline model, TATTS improves the naturalness of the synthesized audio on the LJSpeech and VCTK datasets by 3.8% and 4.8%, respectively, and improves the style similarity on the VCTK dataset by 10%. The experimental results demonstrate the effectiveness of the proposed synthesis algorithm and its ability to control and transfer style.
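
As a decoder-side illustration, the sketch below shows the soft (training-time) recurrence commonly used for stepwise monotonic attention, the mechanism from He et al. [22] that the abstract describes being extended to multiple heads in the Transformer decoder: at every decoder step the alignment either stays at the current encoder position or advances by exactly one, which is what prevents repeated, omitted, or skipped units. Tensor shapes and names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def stepwise_monotonic_step(energy, prev_alpha):
    """One decoder step of (soft) stepwise monotonic attention.

    The alignment either stays at its current encoder position or advances
    by exactly one, so text units cannot be skipped or revisited.

    energy:     (B, T_enc) attention energies for the current decoder step
    prev_alpha: (B, T_enc) alignment distribution from the previous step
    returns:    (B, T_enc) alignment distribution for the current step
    """
    p_stay = torch.sigmoid(energy)            # probability of staying at position j
    move = prev_alpha * (1.0 - p_stay)        # probability mass that advances to j+1
    shifted = F.pad(move, (1, 0))[:, :-1]     # shift the moving mass right by one
    return prev_alpha * p_stay + shifted


# The initial alignment is concentrated on the first encoder position, e.g.
# alpha = F.one_hot(torch.zeros(batch, dtype=torch.long), t_enc).float()
```

The decoder context for each step is then the alignment-weighted sum of the encoder memory.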

Key words: speech synthesis, temporal alignment, style control, Transformer, style transfer

  • CLC number: TP391.2
[1] WANG Y, SKERRY-RYAN R J, STANTON D, et al. Tacotron: towards end-to-end speech synthesis[C]//Conference of the International Speech Communication Association. Stockholm: ISCA, 2017: 4006-4010.
[2] REN Y, RUAN Y, TAN X, et al. Fastspeech: fast, robust and controllable text to speech[C]//Advances in Neural Information Processing Systems. Vancouver: NeurIPS, 2019: 3171-3180.
[3] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on Mel Spectrogram predictions[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 4779-4783.
[4] LI N, LIU S, LIU Y, et al. Neural speech synthesis with transformer network[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Hawaii: AAAI, 2019: 6706-6713.
[5] WANG Y, STANTON D, ZHANG Y, et al. Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis[C]//International Conference on Machine Learning. Stockholm: PMLR, 2018: 5180-5189.
[6] SKERRY-RYAN R J, BATTENBERG E, XIAO Y, et al. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron[C]//International Conference on Machine Learning. Stockholm: PMLR, 2018: 4693-4702.
[7] LEE Y, KIM T. Robust and fine-grained prosody control of end-to-end speech synthesis[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver: IEEE, 2019: 5911-5915.
[8] KLIMKOV V, RONANKI S, ROHNKE J, et al. Fine-grained robust prosody transfer for single-speaker neural Text-To-Speech[C]//Conference of the International Speech Communication Association. Graz: ISCA, 2019: 4440-4444.
[9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Conference on Neural Information Processing Systems. California: NeurIPS, 2017: 6000-6010.
[10] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of NAACL-HLT. Minnesota: NAACL, 2019: 4171-4186.
[11] BAEVSKI A, ZHOU Y, MOHAMED A, et al. Wav2vec 2.0: a framework for self-supervised learning of speech representations[C]//Advances in Neural Information Processing Systems. Vancouver: NeurIPS, 2020: 12449-12460.
[12] LI L H, YATSKAR M, YIN D, et al. What does BERT with vision look at?[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: ACL, 2020: 5265-5275.
[13] 蔡瑞初, 张盛强, 许柏炎. 基于结构感知混合编码模型的代码注释生成方法[J]. 计算机工程, 2023, 2: 1-11.
CAI R C, ZHANG S Q, XU B Y. Code comment generation method based on structure-aware hybrid encoder [J]. Computer Engineering, 2023, 2: 1-11.
[14] CAI R C, YUAN J J, XU B Y, et al. SADGA: structure-aware dual graph aggregation network for Text-to-SQL[C]//Advances in Neural Information Processing Systems. Online: NeurIPS, 2021: 7664-7676.
[15] CHEN M, TAN X, REN Y, et al. MultiSpeech: multi-speaker text to speech with Transformer[C]//Conference of the International Speech Communication Association. Online: ISCA, 2020: 4024-4028.
[16] GLOROT X, BORDES A, BENGIO Y. Deep sparse rectifier neural networks[C]//Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. La Palma: JMLR, 2011: 315-323.
[17] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Imagenet classification with deep convolutional neural networks [J]. Communications of the ACM, 2017, 60(6): 84-90.
[18] CHOI H S, LEE J, KIM W, et al. Neural analysis and synthesis: reconstructing speech from self-supervised representations[C]//Advances in Neural Information Processing Systems. Online: NeurIPS, 2021: 16251-16265.
[19] CHOI H S, YANG J, LEE J, et al. NANSY++: unified voice synthesis with neural analysis and synthesis[EB/OL]. arxiv: 2211.09407 (2022-11-17) [2023-3-24].https://arxiv.org/abs/2211.09407.
[20] HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural computation, 1997, 9(8): 1735-1780.
[21] LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural architectures for named entity recognition[C]//Proceedings of NAACL-HLT. California: NAACL, 2016: 260-270.
[22] HE M, DENG Y, HE L. Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS[C]//Conference of the International Speech Communication Association. Graz: ISCA, 2019: 1293-1297.
[23] LIANG X, WU Z, LI R, et al. Enhancing monotonicity for robust autoregressive transformer TTS[C]//Conference of the International Speech Communication Association. Online: ISCA, 2020: 3181-3185.
[24] ITO K, JOHNSON L. The LJ speech dataset[EB/OL]. (2018-2-19) [2023-3-24]. https://keithito.com/LJ-Speech-Dataset.
[25] YAMAGISHI J, VEAUX C, MACDONALD K. CSTR VCTK corpus: english multi-speaker corpus for CSTR voice cloning toolkit (version 0.92) [EB/OL]. (2019-11-13) [2023-3-24].https://doi.org/10.7488/ds/2645.
[26] KONG J, KIM J, BAE J. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis[C]//Advances in Neural Information Processing Systems. Vancouver: NeurIPS, 2020: 17022-17033.
[27] KINGMA D P, BA J. Adam: a method for stochastic optimization[EB/OL]. arxiv: 1412.6980 (2017-1-30) [2023-3-24].https://arxiv.org/abs/1412.6980.
[28] STREIJL R C, WINKLER S, HANDS D S. Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives [J]. Multimedia Systems, 2016, 22(2): 213-227.