Single-channel Speech Separation Based on Separated SI-SNR Regression Estimation and Adaptive Frequency Modulation Network

doi:10.12052/gdutxb.210149

Abstract

In practical applications, speech separation models are often disturbed by unknown noise, which severely degrades their generalization performance. To address this problem, this paper proposes a single-channel speech separation method based on SI-SNR estimation of the separation results and an adaptive frequency modulation network. First, a prediction network estimates the scale-invariant signal-to-noise ratio (SI-SNR) of the separation results for the test signals, and this estimate is used to compute the epistemic uncertainty of the model. Then, an adaptive frequency modulation network adjusts the spectrum of signals with high uncertainty in order to reduce the model's epistemic uncertainty and thereby improve its generalization to unknown noise. Experimental results show that, compared with the time-domain convolutional separation network Conv-TasNet alone, the proposed method raises SI-SNR from 2.72 dB to 4.57 dB, an increase of 67.94%, which is a substantial improvement in generalization ability. Compared with Conv-TasNet combined with a soft-mask filtering mechanism, SI-SNR rises from 3.32 dB to 4.57 dB, an increase of 37.65%, indicating that the proposed method improves generalization more than the soft-mask mechanism does. The method effectively alleviates the severe degradation in generalization that speech separation networks suffer under unknown noise.

Keywords: speech separation, uncertainty measurement, noise robustness, neural network
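
The SI-SNR metric reported in the abstract can be illustrated with a minimal sketch. It follows the commonly used definition (remove the mean of both signals, project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). The paper's own implementation is not shown on this page, so the function name `si_snr`, the `eps` stabilizer, and the toy signals below are assumptions for illustration only.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Hypothetical helper: the name and the eps default are assumptions,
    not taken from the paper.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean so that a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target; this makes the score
    # independent of the overall gain of the estimate.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]

    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(r * r for r in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


# Toy usage: a sine wave as the reference, plus a small deterministic
# disturbance standing in for separation error.
target = [math.sin(0.1 * i) for i in range(200)]
estimate = [s + 0.1 * math.cos(0.37 * i) for i, s in enumerate(target)]
print(f"SI-SNR of the toy estimate: {si_snr(estimate, target):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property that makes the metric suitable for comparing separation outputs with arbitrary gain.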

CLC number: TP301

Zhang Rui, Lyu Jun. Single-channel Speech Separation Based on Separated SI-SNR Regression Estimation and Adaptive Frequency Modulation Network[J]. Journal of Guangdong University of Technology, 2023, 40(02): 45-54.

