广东工业大学学报 (Journal of Guangdong University of Technology), 2024, Vol. 41, Issue (02): 84-92. DOI: 10.12052/gdutxb.230025
• Computer Science and Technology •
郭傲1, 许柏炎1, 蔡瑞初1, 郝志峰1,2
Guo Ao1, Xu Bo-yan1, Cai Rui-chu1, Hao Zhi-feng1,2
Abstract: The goal of style control in speech synthesis is to convert natural language into correspondingly expressive audio output. Transformer-based style-controlled speech synthesis algorithms improve synthesis speed while preserving quality, but two shortcomings remain: first, when the style reference audio and the text differ greatly in length, part of the style is missing from the synthesized audio; second, decoding based on ordinary attention is prone to repeated, missed, and skipped words. To address these problems, a temporal-alignment-based style-controlled speech synthesis algorithm (Temporal Alignment Text-to-Speech, TATTS) is proposed, which exploits temporal information in both the encoding and decoding stages. In encoding, TATTS introduces a temporally aligned cross-attention module that jointly trains the style-audio and text representations, solving the alignment problem between audio and text of unequal lengths. In decoding, TATTS accounts for the temporal monotonicity of audio by introducing a stepwise monotonic multi-head attention mechanism into the Transformer decoder, eliminating the misreading problems in synthesized audio. Compared with the baseline models, TATTS improves the naturalness of synthesized audio by 3.8% and 4.8% on the LJSpeech and VCTK datasets respectively, and improves style similarity by 10% on VCTK, verifying the effectiveness of the proposed algorithm and demonstrating its style control and transfer capability.
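The stepwise monotonic multi-head attention mentioned above constrains each decoder frame to attend either to the same encoder position as the previous frame or to the next one, which is what suppresses repeated, missed, and skipped words. Below is a minimal PyTorch sketch of that recursion; it is not the authors' implementation, and the function name, the "stay" probability, and the sigmoid-energy usage note are illustrative assumptions.

import torch
import torch.nn.functional as F

def stepwise_monotonic_step(prev_alpha: torch.Tensor, p_stay: torch.Tensor) -> torch.Tensor:
    """One decoder step of stepwise monotonic attention (illustrative sketch).

    prev_alpha: (batch, enc_len) attention weights from the previous decoder frame.
    p_stay:     (batch, enc_len) probability of staying at each encoder position,
                e.g. a sigmoid over an attention energy (assumed form).
    Returns (batch, enc_len) attention weights in which probability mass either
    stays at the same encoder index or advances by exactly one, so the
    audio-text alignment is monotonic and moves at most one step per frame.
    """
    stay = prev_alpha * p_stay                 # mass that remains at position j
    advance = prev_alpha * (1.0 - p_stay)      # mass that leaves position j ...
    moved = F.pad(advance[:, :-1], (1, 0))     # ... and arrives at position j + 1
    return stay + moved

# Usage sketch: start with all attention on the first encoder position, then
# update once per decoder frame with p_stay = torch.sigmoid(energy).
# alpha = torch.zeros(1, enc_len); alpha[:, 0] = 1.0
# alpha = stepwise_monotonic_step(alpha, torch.sigmoid(energy))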