Journal of Guangdong University of Technology, 2024, Vol. 41, Issue (02): 84-92. doi: 10.12052/gdutxb.230025
• Computer Science and Technology •
Guo Ao1, Xu Bo-yan1, Cai Rui-chu1, Hao Zhi-feng1,2