Single-channel speech enhancement based on multi-dimensional attention mechanism

Journal of Nanjing University (Natural Sciences), 2023, Vol. 59, Issue (4): 669–679. doi: 10.13232/j.cnki.jnju.2023.04.013

Yao Yao¹, Jibin Yang¹, Xiongwei Zhang¹, Lele Chen¹, Junyi Fan²

1. School of Command and Control Engineering, Army Engineering University, Nanjing, 210007, China
2. Shanghai Acoustics Laboratory, Chinese Academy of Sciences, Shanghai, 201815, China

Received: 2023-06-05 Online: 2023-07-31 Published: 2023-08-18
Contact: Jibin Yang, Xiongwei Zhang E-mail: yjbice@sina.com; xwzhang9898@163.com
Funding: National Natural Science Foundation of China (62071484); Basic Frontier Project of Army Engineering University (KYZYJKQTZQ23001)

Abstract:

In recent years, deep learning-based single-channel speech enhancement has effectively improved the quality of enhanced speech, but its performance in low signal-to-noise ratio (SNR) environments remains unsatisfactory. To improve single-channel speech enhancement at low SNR, a multi-dimensional attention mechanism (MDAM) is proposed. By cascading channel attention with global and local temporal attention, MDAM fully exploits the long-term and short-term correlations of speech features across the channels of a deep neural network. On this basis, MDAM-Net, a time-domain speech enhancement network based on the multi-dimensional attention mechanism, is designed. The network adopts an encoder-decoder structure with skip connections to obtain deep speech features, and uses MDAM to attend to how clean speech features vary across network channels and over global and local ranges in time, which better models the contextual relationships of speech features. Simulation results show that, while keeping a relatively small number of model parameters, MDAM-Net reaches a PESQ (Perceptual Evaluation of Speech Quality) score of 3.25 for enhanced speech on the public VoiceBank-DEMAND dataset. Under low-SNR conditions, the quality of the enhanced speech is significantly better than that of existing single-channel speech enhancement models.

Key words: single-channel speech enhancement, multi-dimensional attention, channel attention, Transformer
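The cascade described in the abstract (channel attention, then global temporal attention, then local temporal attention) can be illustrated with a toy, parameter-free sketch. This is not the authors' implementation: the real MDAM presumably uses learned projections that this page does not describe. The function names, the sigmoid gate on mean absolute activation, the cascade order, and the `window` parameter are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(x):
    """Scale each channel by a sigmoid gate of its mean absolute activation.

    Assumption: a squeeze-and-excitation style gate without learned weights.
    `x` is a list of channels, each a list of time samples.
    """
    gates = [1.0 / (1.0 + math.exp(-sum(abs(v) for v in ch) / len(ch))) for ch in x]
    return [[g * v for v in ch] for g, ch in zip(gates, x)]


def temporal_attention(x, window=None):
    """Dot-product self-attention over time steps.

    window=None attends over all time steps (global attention);
    window=k restricts each step to neighbours within k steps (local attention).
    Queries, keys and values are the raw frames (no learned projections).
    """
    n_ch, n_t = len(x), len(x[0])
    frames = [[x[c][t] for c in range(n_ch)] for t in range(n_t)]
    scale = math.sqrt(n_ch)
    out_frames = []
    for t, query in enumerate(frames):
        lo, hi = (0, n_t) if window is None else (max(0, t - window), min(n_t, t + window + 1))
        scores = [sum(a * b for a, b in zip(query, frames[s])) / scale for s in range(lo, hi)]
        weights = softmax(scores)
        out_frames.append(
            [sum(w * frames[lo + i][c] for i, w in enumerate(weights)) for c in range(n_ch)]
        )
    # Transpose back to channels x time.
    return [[out_frames[t][c] for t in range(n_t)] for c in range(n_ch)]


def mdam_block(x, window=2):
    """Cascade channel, global temporal, and local temporal attention (assumed order)."""
    y = channel_attention(x)
    y = temporal_attention(y)          # global range
    y = temporal_attention(y, window)  # local range
    return y


features = [[0.1 * (t + 1) * (c + 1) for t in range(6)] for c in range(3)]
enhanced = mdam_block(features, window=2)
print(len(enhanced), len(enhanced[0]))
```

The block keeps the feature shape (channels × samples) unchanged, which matches the MDAM×4 row of the architecture table, where input and output are both 384×999.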

CLC number: TN912

Yao Yao, Jibin Yang, Xiongwei Zhang, Lele Chen, Junyi Fan. Single-channel speech enhancement based on multi-dimensional attention mechanism[J]. Journal of Nanjing University (Natural Sciences), 2023, 59(4): 669–679.

References

1	Sun Z Y, Li Y D, Jiang H J, et al. A supervised speech enhancement method for smartphone-based binaural hearing aids. IEEE Transactions on Biomedical Circuits and Systems, 2020, 14(5): 951-960.
2	Xu Y. Research on deep neural network based speech enhancement. Ph.D. Dissertation. Hefei: University of Science and Technology of China, 2015. (in Chinese)
3	Wei Q S. Research on speech enhancement algorithm based on deep neural network. Master Dissertation. Nanjing: Nanjing University, 2016. (in Chinese)
4	Ye W Z. Extremely low signal-to-noise ratio speech enhancement method based on deep learning. Master Dissertation. Chengdu: University of Electronic Science and Technology of China, 2021. (in Chinese)
5	Hao X, Su X D, Wang Z Y, et al. UNetGAN: A robust speech enhancement approach in time domain for extremely low signal-to-noise ratio condition∥The 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019: 1786-1790.
6	Weninger F, Hershey J R, Le Roux J, et al. Discriminatively trained recurrent neural networks for single-channel speech separation∥2014 IEEE Global Conference on Signal and Information Processing. Atlanta, GA, USA: IEEE, 2014: 577-581.
7	Pandey A, Wang D L. TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain∥2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE, 2019: 6875-6879.
8	Macartney C, Weyde T. Improved speech enhancement with the Wave-U-Net. 2018, arXiv.
9	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need∥Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, CA, USA: Curran Associates Inc., 2017: 6000-6010.
10	Kim J, El-Khamy M, Lee J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement∥2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020: 6649-6653.
11	Giri R, Isik U, Krishnaswamy A. Attention Wave-U-Net for speech enhancement∥2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, NY, USA: IEEE, 2019: 249-253.
12	Pandey A, Wang D L. Dense CNN with self-attention for time-domain speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021(29): 1270-1279.
13	Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module∥Proceedings of the 15th European Conference on Computer Vision. Springer Berlin Heidelberg, 2018: 3-19.
14	Tolooshams B, Giri R, Song A H, et al. Channel-attention dense U-Net for multichannel speech enhancement∥2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020: 836-840.
15	Park H J, Kang B H, Shin W, et al. MANNER: Multi-view attention network for noise erasure∥2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 7842-7846.
16	Hu J, Shen L, Sun G. Squeeze-and-excitation networks∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7132-7141.
17	Sperber M, Niehues J, Neubig G, et al. Self-attentional acoustic models∥The 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018: 3723-3727.
18	Pandey A, Wang D L. On cross-corpus generalization of deep learning based speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020(28): 2489-2499.
19	Valentini-Botinhao C, Wang X, Takaki S, et al. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech∥The 9th ISCA Speech Synthesis Workshop. Sunnyvale, CA, USA: ISCA, 2016: 146-152.
20	Veaux C, Yamagishi J, King S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database∥2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation. Gurgaon, India: IEEE, 2013: 1-4.
21	Thiemann J, Ito N, Vincent E. The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America, 2013, 133(S5): 3591.
22	Rix A W, Beerends J G, Hollier M P, et al. Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs∥Proceedings of the 26th International Conference on Acoustics, Speech, and Signal Processing. Salt Lake City, Utah, USA: IEEE, 2001: 749-752.
23	Taal C H, Hendriks R C, Heusdens R, et al. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136.
24	Hu Y, Loizou P C. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(1): 229-238.
25	Pascual S, Bonafonte A, Serrà J. SEGAN: Speech enhancement generative adversarial network∥The 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017: 3642-3646.
26	Soni M H, Shah N, Patil H A. Time-frequency masking-based speech enhancement using generative adversarial network∥2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, AB, Canada: IEEE, 2018: 5039-5043.
27	Fu S W, Liao C F, Tsao Y, et al. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement∥The 36th International Conference on Machine Learning. Long Beach, CA, USA: PMLR, 2019: 2031-2041.
28	Yin D C, Luo C, Xiong Z W, et al. PHASEN: A phase-and-harmonics-aware speech enhancement network∥Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, NY, USA: AAAI, 2020: 9458-9465.
29	Zhang Q Q, Nicolson A, Wang M J, et al. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020(28): 1404-1415.
30	Defossez A, Synnaeve G, Adi Y. Real time speech enhancement in the waveform domain∥The 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA, 2020: 3291-3295.
31	Wang K, He B B, Zhu W P. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain∥IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto, Canada: IEEE, 2021: 7098-7102.
32	Kong Z F, Ping W, Dantrey A, et al. Speech denoising in the waveform domain with self-attention∥2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 7867-7871.
33	Fan J Y, Yang J B, Zhang X W, et al. Monaural speech enhancement using U-net fused with multi-head self-attention. Acta Acustica, 2022, 47(6): 703-716. (in Chinese)

Network layer	Input (channels × samples)	Parameters (kernel size, stride, output channels)	Output (channels × samples)
Upsampling layer	1×64000	—	1×256084
Encoder layer 1	1×256084	8, 4, 48	48×64020
Encoder layer 2	48×64020	8, 4, 96	96×16004
Encoder layer 3	96×16004	8, 4, 192	192×4000
Encoder layer 4	192×4000	8, 4, 384	384×999
MDAM×4	384×999	MDAM×4	384×999
Decoder layer 4	384×999	8, 4, 192	192×4000
Decoder layer 3	192×4000	8, 4, 96	96×16004
Decoder layer 2	96×16004	8, 4, 48	48×64020
Decoder layer 1	48×64020	8, 4, 1	1×256084
Downsampling layer	1×256084	—	1×64000
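The sample counts in the architecture table can be checked with the standard length formulas for strided 1-D convolution and transposed convolution. The sketch below assumes no padding and no dilation, which the table does not state explicitly; under that assumption, kernel size 8 and stride 4 reproduce every length listed. The function names are illustrative only.

```python
def conv1d_out_len(length: int, kernel: int = 8, stride: int = 4) -> int:
    """Output length of an unpadded, undilated 1-D convolution."""
    return (length - kernel) // stride + 1


def conv_transpose1d_out_len(length: int, kernel: int = 8, stride: int = 4) -> int:
    """Output length of an unpadded, undilated 1-D transposed convolution."""
    return (length - 1) * stride + kernel


def trace_lengths(input_len: int = 256084, depth: int = 4):
    """Return (encoder_lengths, decoder_lengths) for `depth` layers each way."""
    encoder = []
    length = input_len
    for _ in range(depth):
        length = conv1d_out_len(length)
        encoder.append(length)
    decoder = []
    for _ in range(depth):
        length = conv_transpose1d_out_len(length)
        decoder.append(length)
    return encoder, decoder


encoder_lengths, decoder_lengths = trace_lengths()
print(encoder_lengths)  # [64020, 16004, 4000, 999]
print(decoder_lengths)  # [4000, 16004, 64020, 256084]
```

Because each encoder input length minus 8 is divisible by 4, the four decoder layers restore exactly the lengths of the matching encoder layers, which is what skip connections between the two paths need.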

Model	PESQ	STOI	CSIG	CBAK	COVL
U-Net	2.50	0.93	3.62	3.25	3.10
U-Net + channel attention	2.56	0.93	3.75	3.28	3.15
U-Net + global attention	2.81	0.94	3.83	3.18	3.31
U-Net + local attention	2.63	0.94	3.79	3.21	3.41
U-Net + MDAM	3.13	0.95	4.33	3.51	3.63
MDAM-Net	3.25	0.95	4.53	3.66	3.93

Model	Processing domain	PESQ	STOI	CSIG	CBAK	COVL	Parameters (MB)
Noisy	—	1.97	0.91	3.34	2.44	2.63	—
SEGAN, 2017 [25]	T	2.16	0.93	3.48	2.94	2.80	43.2
Wave-U-Net, 2018 [8]	T	2.40	—	3.52	3.24	2.96	38.1
MMSE-GAN, 2018 [26]	F	2.53	0.93	3.80	3.12	3.14	—
MetricGAN, 2019 [27]	F	2.86	—	3.99	3.18	3.42	—
PHASEN, 2020 [28]	F	2.99	—	4.21	3.55	3.62	—
DeepMMSE, 2020 [29]	F	2.95	0.94	4.28	3.46	3.64	—
DEMUCS, 2020 [30]	T	3.07	0.95	4.31	3.40	3.63	130.5
TSTNN, 2021 [31]	T	2.96	0.95	4.33	3.53	3.67	3.5
CleanUNet, 2022 [32]	T	2.90	0.95	4.33	3.42	3.64	46.1
MANNER, 2022 [15]	T	3.21	0.95	4.53	3.65	3.91	24.1
MDAM-Net	T	3.25	0.95	4.53	3.66	3.93	16.9

SNR	Noisy		Wave-U-Net [8]		DEMUCS [30]		TU-Net [33]		MDAM-Net
	PESQ	STOI	PESQ	STOI	PESQ	STOI	PESQ	STOI	PESQ	STOI
7.5 dB	2.20	0.90	2.78	0.94	3.08	0.96	3.28	0.96	3.55	0.97
2.5 dB	1.73	0.83	2.43	0.93	2.73	0.92	3.01	0.95	3.03	0.95
-2.5 dB	1.41	0.75	1.99	0.88	2.06	0.89	2.21	0.91	2.56	0.92
-7.5 dB	1.24	0.62	1.61	0.76	1.71	0.82	1.90	0.86	2.18	0.88
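To read the low-SNR claim off this table, it can help to compute the PESQ gain of each model over the noisy input at every SNR. The short sketch below copies the PESQ columns from the table; the dictionary layout and function names are illustrative choices, not part of the paper.

```python
# PESQ scores copied from the SNR table (keys are SNR in dB).
PESQ = {
    7.5: {"Noisy": 2.20, "Wave-U-Net": 2.78, "DEMUCS": 3.08, "TU-Net": 3.28, "MDAM-Net": 3.55},
    2.5: {"Noisy": 1.73, "Wave-U-Net": 2.43, "DEMUCS": 2.73, "TU-Net": 3.01, "MDAM-Net": 3.03},
    -2.5: {"Noisy": 1.41, "Wave-U-Net": 1.99, "DEMUCS": 2.06, "TU-Net": 2.21, "MDAM-Net": 2.56},
    -7.5: {"Noisy": 1.24, "Wave-U-Net": 1.61, "DEMUCS": 1.71, "TU-Net": 1.90, "MDAM-Net": 2.18},
}


def pesq_gain(snr_db: float, model: str) -> float:
    """PESQ improvement of `model` over the unprocessed noisy speech."""
    return round(PESQ[snr_db][model] - PESQ[snr_db]["Noisy"], 2)


def margin_over_best_baseline(snr_db: float, model: str = "MDAM-Net") -> float:
    """PESQ margin of `model` over the best other enhanced system at this SNR."""
    others = [v for k, v in PESQ[snr_db].items() if k not in (model, "Noisy")]
    return round(PESQ[snr_db][model] - max(others), 2)


for snr in sorted(PESQ, reverse=True):
    print(f"{snr:+.1f} dB  gain={pesq_gain(snr, 'MDAM-Net'):.2f}  "
          f"margin={margin_over_best_baseline(snr):.2f}")
```

At -7.5 dB, for example, MDAM-Net improves PESQ by 0.94 over the noisy input, compared with 0.66 for TU-Net, the strongest baseline in the table.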
