Single-channel Speech Separation Based on Separated SI-SNR Regression Estimation and Adaptive Frequency Modulation Network

doi:10.12052/gdutxb.210149

Abstract

In practical applications, speech separation models are often disturbed by unknown noise, which seriously degrades their generalization performance. To address this problem, a single-channel speech separation method based on SI-SNR regression estimation of the separation results and an adaptive frequency modulation network is proposed. First, a prediction network estimates the scale-invariant SNR (SI-SNR) of the separation results for the test signal, from which the epistemic uncertainty of the model is calculated. Then, an adaptive frequency modulation network adjusts the spectrum of signals with high uncertainty to reduce the model's epistemic uncertainty, thereby improving its generalization to unknown noise. Experimental results show that, compared with Conv-TasNet, the proposed method improves the SI-SNR from 2.72 dB to 4.57 dB, an increase of 67.94% and a substantial improvement in generalization ability. Compared with Conv-TasNet with Soft-Mask, the SI-SNR increases from 3.32 dB to 4.57 dB, an increase of 37.65%, indicating that the method generalizes better than the soft-mask mechanism. The method effectively alleviates the serious degradation in generalization that speech separation networks suffer under unknown noise.
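For readers unfamiliar with the metric, the sketch below computes SI-SNR as it is commonly defined in the time-domain separation literature: both signals are mean-centered, the estimate is projected onto the target, and the energy ratio of the projection to the residual is expressed in dB. The function names `si_snr` and `relative_gain_percent`, the `eps` stabilizer, and the toy signals are assumptions of this illustration and are not taken from the paper, whose prediction network estimates this quantity rather than computing it from a reference.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean so the measure ignores any DC offset.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target: s_target = <est, tgt> tgt / ||tgt||^2.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    # Everything not explained by the target counts as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(e * e for e in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def relative_gain_percent(baseline_db, improved_db):
    """Relative improvement of one dB score over another, in percent."""
    return (improved_db - baseline_db) / baseline_db * 100.0


# Toy signals (illustrative only).
target = [1.0, -1.0, 2.0, -2.0]
estimate = [1.1, -0.9, 2.2, -1.8]
score = si_snr(estimate, target)
rescaled_score = si_snr([3.0 * x for x in estimate], target)
gain_vs_soft_mask = relative_gain_percent(3.32, 4.57)
```

Rescaling the estimate leaves the score unchanged, which is the property that gives the metric its name. Applied to the rounded values quoted in the abstract, `relative_gain_percent(3.32, 4.57)` gives about 37.65%, while `relative_gain_percent(2.72, 4.57)` gives about 68.0%; the small gap to the reported 67.94% may reflect rounding of the dB values.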

Key words: speech separation, uncertainty measurement, noise robustness, neural network

CLC Number: TP301

Zhang Rui, Lyu Jun. Single-channel Speech Separation Based on Separated SI-SNR Regression Estimation and Adaptive Frequency Modulation Network [J]. Journal of Guangdong University of Technology, 2023, 40(02): 45-54.
