Journal of Guangdong University of Technology ›› 2024, Vol. 41 ›› Issue (03): 91-101. doi: 10.12052/gdutxb.230037

• Computer Science and Technology •

Speaker-Aware Cross Attention Speaker Extraction Network

Li Zhuo-zhang1, Xu Bo-yan1, Cai Rui-chu1, Hao Zhi-feng1,2   

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China;
    2. College of Science, Shantou University, Shantou 515063, China
  • Received: 2023-02-27  Online: 2024-05-25  Published: 2024-06-14

Abstract: Target speaker extraction aims to extract the speech of a specific speaker from mixed audio, usually using an enrolled utterance of the target speaker as auxiliary information. Existing approaches mainly suffer from two limitations: the auxiliary speaker-recognition network cannot capture the critical information in the enrolled audio, and there is no interactive learning mechanism between the mixed-audio and enrolled-audio embeddings. These limitations lead to speaker confusion when the enrolled audio differs significantly from the target speech in the mixture. To address this, a speaker-aware cross-attention speaker extraction network (SACAN) is proposed. First, SACAN introduces an attention-based speaker aggregation module into the speaker-recognition auxiliary network, which effectively aggregates the critical information characterizing the target speaker, and then uses the mixed audio to enhance the target speaker embedding. Next, to promote the integration of the speaker embedding with the mixed-audio embedding, SACAN builds an interactive learning mechanism through cross-attention, strengthening the model's speaker awareness. Experimental results show that SACAN improves STOI by 0.0133 and SI-SDRi by 1.0695 over the benchmark model, and the speaker-confusion evaluation and ablation studies further validate the effectiveness of the proposed modules.
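The abstract describes two components: attention-based aggregation of frame-level features from the enrolled audio into a speaker embedding, and cross-attention fusion of that embedding with the mixed-audio embedding. Below is a minimal PyTorch sketch of these two ideas; the module names, dimensions, and wiring are illustrative assumptions, not the authors' implementation.

# Minimal sketch of (1) attention-based aggregation of frame-level enrollment
# features into a speaker embedding and (2) cross-attention fusion with the
# mixed-audio embedding. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class AttentiveSpeakerAggregation(nn.Module):
    """Aggregate frame-level enrollment features into one speaker embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, enroll_feats: torch.Tensor) -> torch.Tensor:
        # enroll_feats: (batch, time, dim) frame-level features of the enrolled audio
        weights = torch.softmax(self.score(enroll_feats), dim=1)  # (batch, time, 1)
        return (weights * enroll_feats).sum(dim=1)                # (batch, dim)


class CrossAttentionFusion(nn.Module):
    """Let the mixed-audio embedding attend to the speaker embedding."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mix_emb: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # mix_emb: (batch, time, dim) mixed-audio embedding (queries)
        # spk_emb: (batch, dim) aggregated speaker embedding (key/value)
        kv = spk_emb.unsqueeze(1)                 # (batch, 1, dim)
        attended, _ = self.attn(mix_emb, kv, kv)  # speaker-conditioned attention
        return self.norm(mix_emb + attended)      # residual fusion


if __name__ == "__main__":
    batch, t_mix, t_enroll, dim = 2, 200, 150, 256
    mix_emb = torch.randn(batch, t_mix, dim)
    enroll_feats = torch.randn(batch, t_enroll, dim)

    spk_emb = AttentiveSpeakerAggregation(dim)(enroll_feats)
    fused = CrossAttentionFusion(dim)(mix_emb, spk_emb)
    print(spk_emb.shape, fused.shape)  # torch.Size([2, 256]) torch.Size([2, 200, 256])

In the paper the fused representation would then drive mask estimation for the target speaker; that stage is omitted here.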

Key words: speech separation, target speaker extraction, speaker embedding, cross attention, multi-task learning

CLC Number: TP391.2