Journal of Guangdong University of Technology ›› 2024, Vol. 41 ›› Issue (02): 84-92. doi: 10.12052/gdutxb.230025

• Computer Science and Technology •

Temporal Alignment Style Control in Text-to-Speech Synthesis Algorithm

Guo Ao1, Xu Bo-yan1, Cai Rui-chu1, Hao Zhi-feng1,2   

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China;
    2. College of Science, Shantou University, Shantou 515063, China
  • Received: 2023-02-21 Published: 2024-04-23

Abstract: The goal of style-controlled speech synthesis is to convert natural language text into expressive audio with the desired speaking style. Transformer-based style control algorithms for speech synthesis improve synthesis speed while maintaining quality, but two shortcomings remain. First, when the lengths of the style reference audio and the input text differ greatly, the synthesized audio tends to lose the intended style. Second, decoding with vanilla attention is prone to repetition, omission, and skipping. To address these problems, a temporally aligned style control speech synthesis algorithm, TATTS, is proposed, which exploits temporal information in both the encoding and decoding processes. In encoding, TATTS introduces a temporal alignment cross-attention module that jointly trains style audio and text representations, solving the alignment problem between audio and text of unequal lengths. In decoding, TATTS accounts for the monotonicity of audio timing and introduces a stepwise monotonic multi-head attention mechanism in the Transformer decoder to eliminate misreading in the synthesized audio. Experimental results show that, compared with the baseline model, TATTS improves the naturalness of the synthesized audio on the LJSpeech and VCTK datasets by 3.8% and 4.8%, respectively, and improves the style similarity on the VCTK dataset by 10%, demonstrating the effectiveness of the algorithm and its ability to control and transfer style.
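To make the stepwise monotonic constraint mentioned in the abstract concrete, the short Python sketch below (an illustration under assumed tensor shapes and a greedy sigmoid comparison rule, not the authors' TATTS implementation) shows the hard alignment behaviour such an attention head is meant to obey at inference time: at every decoder step the alignment either stays at its current encoder position or advances by exactly one, which is why repetition, omission, and skipping cannot occur.

import numpy as np

def stepwise_monotonic_decode(energies):
    # energies: (decoder_steps, encoder_steps) attention energies for one head.
    # Returns the encoder index attended to at every decoder step.
    num_dec, num_enc = energies.shape
    pos = 0                                # alignment starts at the first encoder frame
    alignment = np.zeros(num_dec, dtype=int)
    for t in range(num_dec):
        if pos + 1 < num_enc:
            # "Move" probability: sigmoid of the energy gap between advancing and staying.
            move_prob = 1.0 / (1.0 + np.exp(-(energies[t, pos + 1] - energies[t, pos])))
            if move_prob >= 0.5:
                pos += 1                   # advance by exactly one position, never more
        alignment[t] = pos                 # never moves backwards, so no repeats or skips
    return alignment

# Usage: a random 10-step decode over a 6-frame text encoding.
rng = np.random.default_rng(0)
print(stepwise_monotonic_decode(rng.normal(size=(10, 6))))

During training such a constraint is usually applied in a soft, differentiable form; the greedy rule above only illustrates why stepwise monotonic stepping rules out misreading when the Transformer decoder generates spectrogram frames autoregressively.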

Key words: speech synthesis, temporal alignment, style control, Transformer, style transfer

CLC Number: 

  • TP391.2
[1] WANG Y, SKERRY-RYAN R J, STANTON D, et al. Tacotron: towards end-to-end speech synthesis[C]//Conference of the International Speech Communication Association. Stockholm: ISCA, 2017: 4006-4010.
[2] REN Y, RUAN Y, TAN X, et al. FastSpeech: fast, robust and controllable text to speech[C]//Advances in Neural Information Processing Systems. Vancouver: NeurIPS, 2019: 3171-3180.
[3] SHEN J, PANG R, WEISS R J, et al. Natural TTS synthesis by conditioning WaveNet on Mel Spectrogram predictions[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary: IEEE, 2018: 4779-4783.
[4] LI N, LIU S, LIU Y, et al. Neural speech synthesis with transformer network[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Hawaii: AAAI, 2019: 6706-6713.
[5] WANG Y, STANTON D, ZHANG Y, et al. Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis[C]//International Conference on Machine Learning. Stockholm: PMLR, 2018: 5180-5189.
[6] SKERRY-RYAN R J, BATTENBERG E, XIAO Y, et al. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron[C]//International Conference on Machine Learning. Stockholm: PMLR, 2018: 4693-4702.
[7] LEE Y, KIM T. Robust and fine-grained prosody control of end-to-end speech synthesis[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver: IEEE, 2019: 5911-5915.
[8] KLIMKOV V, RONANKI S, ROHNKE J, et al. Fine-grained robust prosody transfer for single-speaker neural Text-To-Speech[C]//Conference of the International Speech Communication Association. Graz: ISCA, 2019: 4440-4444.
[9] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Conference on Neural Information Processing Systems. California: NeurIPS, 2017: 6000-6010.
[10] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of NAACL-HLT. Minnesota: NAACL, 2019: 4171-4186.
[11] BAEVSKI A, ZHOU Y, MOHAMED A, et al. Wav2vec 2.0: a framework for self-supervised learning of speech representations[C]//Advances in Neural Information Processing Systems. Vancouver: NeurIPS, 2020: 12449-12460.
[12] LI L H, YATSKAR M, YIN D, et al. What does BERT with vision look at?[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: ACL, 2020: 5265-5275.
[13] CAI R C, ZHANG S Q, XU B Y. Code comment generation method based on structure-aware hybrid encoder [J]. Computer Engineering, 2023, 2: 1-11. (in Chinese)
[14] CAI R C, YUAN J J, XU B Y, et al. SADGA: structure-aware dual graph aggregation network for Text-to-SQL[C]//Advances in Neural Information Processing Systems. Online: NeurIPS, 2021: 7664-7676.
[15] CHEN M, TAN X, REN Y, et al. MultiSpeech: multi-speaker text to speech with Transformer[C]//Conference of the International Speech Communication Association. Online: ISCA, 2020: 4024-4028.
[16] GLOROT X, BORDES A, BENGIO Y. Deep sparse rectifier neural networks[C]//Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. La Palma: JMLR, 2011: 315-323.
[17] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks [J]. Communications of the ACM, 2017, 60(6): 84-90.
[18] CHOI H S, LEE J, KIM W, et al. Neural analysis and synthesis: reconstructing speech from self-supervised representations[C]//Advances in Neural Information Processing Systems. Online: NeurIPS, 2021: 16251-16265.
[19] CHOI H S, YANG J, LEE J, et al. NANSY++: unified voice synthesis with neural analysis and synthesis[EB/OL]. arXiv: 2211.09407 (2022-11-17) [2023-3-24]. https://arxiv.org/abs/2211.09407.
[20] HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural computation, 1997, 9(8): 1735-1780.
[21] LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural architectures for named entity recognition[C]//Proceedings of NAACL-HLT. California: NAACL, 2016: 260-270.
[22] HE M, DENG Y, HE L. Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS[C]//Conference of the International Speech Communication Association. Graz: ISCA, 2019: 1293-1297.
[23] LIANG X, WU Z, LI R, et al. Enhancing monotonicity for robust autoregressive transformer TTS[C]//Conference of the International Speech Communication Association. Online: ISCA, 2020: 3181-3185.
[24] ITO K, JOHNSON L. The LJ Speech dataset[EB/OL]. (2018-2-19) [2023-3-24]. https://keithito.com/LJ-Speech-Dataset.
[25] YAMAGISHI J, VEAUX C, MACDONALD K. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)[EB/OL]. (2019-11-13) [2023-3-24]. https://doi.org/10.7488/ds/2645.
[26] KONG J, KIM J, BAE J. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis[C]//Advances in Neural Information Processing Systems. Vancouver: NeurIPS, 2020: 17022-17033.
[27] KINGMA D P, BA J. Adam: a method for stochastic optimization[EB/OL]. arXiv: 1412.6980 (2017-1-30) [2023-3-24]. https://arxiv.org/abs/1412.6980.
[28] STREIJL R C, WINKLER S, HANDS D S. Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives [J]. Multimedia Systems, 2016, 22(2): 213-227.