Journal of Guangdong University of Technology, 2023, Vol. 40, Issue (02): 45-54. doi: 10.12052/gdutxb.210149

• Comprehensive Research •

Single-channel Speech Separation Based on Separated SI-SNR Regression Estimation and Adaptive Frequency Modulation Network

Zhang Rui, Lyu Jun

  1. School of Automation, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2021-10-13  Online: 2023-03-25  Published: 2023-04-07
  • Corresponding author: Lyu Jun (b. 1979), male, associate researcher, Ph.D.; main research interests: machine learning and biomedical signal processing. E-mail: lujun.rylj@gmail.com
  • About the author: Zhang Rui (b. 1997), male, master's student; main research interests: single-channel speech separation and deep learning
  • Funding:
    General Program of the National Natural Science Foundation of China (62073086)

Abstract: In practical applications, speech separation models are often disturbed by unknown noise, which severely degrades their generalization performance. To address this problem, this paper proposes a single-channel speech separation method based on SI-SNR estimation of the separation results and an adaptive frequency modulation network. First, a prediction network estimates the scale-invariant signal-to-noise ratio (SI-SNR) of the separation result for each test signal, and this estimate is used to compute the model's epistemic uncertainty. Then, an adaptive frequency modulation network adjusts the spectrum of signals with high uncertainty to reduce that uncertainty, thereby improving the model's generalization to unknown noise. Experimental results show that, compared with Conv-TasNet alone, the proposed method raises SI-SNR from 2.72 dB to 4.57 dB, a relative increase of 67.94%. Compared with Conv-TasNet with a soft-mask filtering mechanism, it raises SI-SNR from 3.32 dB to 4.57 dB, a relative increase of 37.65%, indicating that the method improves generalization more than the soft-mask mechanism does. The method thus effectively alleviates the severe degradation in generalization that speech separation networks suffer under unknown noise.

Key words: speech separation, uncertainty measurement, noise robustness, neural network
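
The abstract reports its results in SI-SNR, so a small sketch of how that metric is commonly computed may help. This is the widely used zero-mean, projection-based definition (as in the Conv-TasNet literature), written in plain Python. It is an assumption that the paper uses exactly this variant, and the function name `si_snr` and the `eps` parameter are illustrative only.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Common textbook definition; assumed here, not taken from the paper.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: s_target = (<e, t> / ||t||^2) * t
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]
    # The residual after projection is treated as noise and distortion.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is what "scale-invariant" refers to.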

CLC Number:

  • TP301