Journal of Nanjing University (Natural Science), 2023, Vol. 59, Issue (4): 669–679. doi: 10.13232/j.cnki.jnju.2023.04.013


Single-channel speech enhancement based on multi-dimensional attention mechanism

Yao Yao1, Jibin Yang1, Xiongwei Zhang1, Lele Chen1, Junyi Fan2

  1. School of Command and Control Engineering, Army Engineering University, Nanjing, 210007, China
  2. Shanghai Acoustics Laboratory, Chinese Academy of Sciences, Shanghai, 201815, China
  • Received: 2023-06-05 Online: 2023-07-31 Published: 2023-08-18
  • Contact: Jibin Yang, Xiongwei Zhang E-mail: yjbice@sina.com; xwzhang9898@163.com
  • Supported by: National Natural Science Foundation of China (62071484); Basic Frontier Project of Army Engineering University (KYZYJKQTZQ23001)

Abstract:

Deep-learning-based single-channel speech enhancement has effectively improved the quality of enhanced speech, but its performance in low signal-to-noise ratio (SNR) environments remains unsatisfactory. To improve single-channel speech enhancement at low SNR, a multi-dimensional attention mechanism (MDAM) is proposed. By cascading channel attention with global and local temporal attention, MDAM exploits the long-term and short-term correlations of speech features across the channels of a deep neural network. On this basis, MDAM-Net, a time-domain speech enhancement network built on MDAM, is designed. The network uses an encoder-decoder structure with skip connections to extract deep speech features, and uses MDAM to attend to how clean-speech features vary across network channels and, along the time axis, over global and local ranges, which better models the contextual relationships among speech features. Simulation results show that, while keeping a relatively small number of model parameters, MDAM-Net achieves a PESQ (Perceptual Evaluation of Speech Quality) score of 3.25 on the public VoiceBank-DEMAND dataset. Under low-SNR conditions, its enhanced speech quality is significantly better than that of existing single-channel speech enhancement models.

Key words: single-channel speech enhancement, multi-dimensional attention, channel attention, Transformer
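The cascade described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the channel attention below is the generic squeeze-and-excitation form, the temporal attention is plain scaled dot-product self-attention with Q = K = V, and the local range is modelled as a sliding window. The cascade order (channel, then global, then local), the `window` parameter, and the weights `w1` and `w2` are all assumptions made for illustration.

```python
import math

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style gate (generic form, assumed here).
    x: C channels, each a list of T samples; w1: hidden x C; w2: C x hidden."""
    z = [sum(ch) / len(ch) for ch in x]                      # squeeze: mean over time
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]   # ReLU layer
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h))))
         for row in w2]                                      # sigmoid gate per channel
    return [[g * v for v in ch] for g, ch in zip(s, x)]      # rescale each channel

def temporal_attention(x, window=None):
    """Scaled dot-product self-attention along time, with Q = K = V = frames.
    window=None attends to every frame (global); otherwise frame t attends
    only to frames within `window` steps of t (a hypothetical local range)."""
    C, T = len(x), len(x[0])
    frames = [[x[c][t] for c in range(C)] for t in range(T)]  # T x C
    out = []
    for t in range(T):
        idx = [u for u in range(T) if window is None or abs(u - t) <= window]
        scores = [sum(a * b for a, b in zip(frames[t], frames[u])) / math.sqrt(C)
                  for u in idx]
        m = max(scores)                                       # stabilise the softmax
        e = [math.exp(sc - m) for sc in scores]
        total = sum(e)
        out.append([sum((w / total) * frames[u][c] for w, u in zip(e, idx))
                    for c in range(C)])
    return [[out[t][c] for t in range(T)] for c in range(C)]  # back to C x T

def mdam_sketch(x, w1, w2, window=1):
    """Assumed cascade order: channel -> global temporal -> local temporal."""
    y = channel_attention(x, w1, w2)
    y = temporal_attention(y, window=None)
    return temporal_attention(y, window=window)

# Toy input: 2 channels x 4 time steps, with untrained illustrative weights.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, -0.5, 1.0]]
w1 = [[0.5, -0.5]]
w2 = [[1.0], [-1.0]]
y = mdam_sketch(x, w1, w2)
print(len(y), len(y[0]))
```

Each stage keeps the channels × time shape, which is why the stages can be chained and repeated, as the "MDAM×4" row of Table 1 suggests.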

CLC number: TN912

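Figure 1. Schematic of the multi-dimensional attention mechanism (MDAM)

Figure 2. Channel attention

Figure 3. Structural comparison of TransformerIE and the conventional Transformer

Figure 4. Multi-head attention structure

Figure 5. Global attention

Figure 6. Local attention

Figure 7. Mask module

Figure 8. Network structure of MDAM-Net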

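Table 1. Parameters of the MDAM-Net network model

Layer                Input (channels × samples)   Parameters (kernel size, stride, output channels)   Output (channels × samples)
Upsampling layer     1×64000                      -                                                   1×256084
Encoder layer 1      1×256084                     8, 4, 48                                            48×64020
Encoder layer 2      48×64020                     8, 4, 96                                            96×16004
Encoder layer 3      96×16004                     8, 4, 192                                           192×4000
Encoder layer 4      192×4000                     8, 4, 384                                           384×999
MDAM×4               384×999                      MDAM×4                                              384×999
Decoder layer 4      384×999                      8, 4, 192                                           192×4000
Decoder layer 3      192×4000                     8, 4, 96                                            96×16004
Decoder layer 2      96×16004                     8, 4, 48                                            48×64020
Decoder layer 1      48×64020                     8, 4, 1                                             1×256084
Downsampling layer   1×256084                     -                                                   1×64000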
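The sample counts in Table 1 can be checked with the standard length formulas for 1-D convolution and transposed convolution. The sketch below assumes no padding and no dilation, which the page does not state but which reproduces the tabulated lengths. The helper names are hypothetical.

```python
def conv1d_out_len(n, kernel, stride):
    """Output length of an unpadded, undilated 1-D convolution."""
    return (n - kernel) // stride + 1

def conv_transpose1d_out_len(n, kernel, stride):
    """Output length of an unpadded, undilated 1-D transposed convolution."""
    return (n - 1) * stride + kernel

KERNEL, STRIDE = 8, 4      # kernel size and stride listed in Table 1
n = 256084                 # samples after the upsampling layer (Table 1)

encoder_lengths = []
for _ in range(4):         # four encoder layers
    n = conv1d_out_len(n, KERNEL, STRIDE)
    encoder_lengths.append(n)

decoder_lengths = []
for _ in range(4):         # four decoder layers mirror the encoder
    n = conv_transpose1d_out_len(n, KERNEL, STRIDE)
    decoder_lengths.append(n)

print(encoder_lengths)     # [64020, 16004, 4000, 999]
print(decoder_lengths)     # [4000, 16004, 64020, 256084]
```

Under these assumptions the decoder restores exactly the 256084 samples that entered the encoder. That is consistent with the upsampled length being 4 × 64000 plus 84 extra samples, although the page does not say how those extra samples arise.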

Table 2. Ablation study of the attention modules in MDAM-Net

Model                      PESQ   STOI   CSIG   CBAK   COVL
U-Net                      2.50   0.93   3.62   3.25   3.10
U-Net+Channel Attention    2.56   0.93   3.75   3.28   3.15
U-Net+global Attention     2.81   0.94   3.83   3.18   3.31
U-Net+local Attention      2.63   0.94   3.79   3.21   3.41
U-Net+MDAM                 3.13   0.95   4.33   3.51   3.63
MDAM-Net                   3.25   0.95   4.53   3.66   3.93

Table 3. Enhancement performance of MDAM-Net compared with models from recent years

Model                  Domain   PESQ   STOI   CSIG   CBAK   COVL   Parameters (MB)
Noisy                  -        1.97   0.91   3.34   2.44   2.63   -
SEGAN, 2017 [25]       T        2.16   0.93   3.48   2.94   2.80   43.2
Wave-U-Net, 2018 [8]   T        2.40   -      3.52   3.24   2.96   38.1
MMSE-GAN, 2018 [26]    F        2.53   0.93   3.80   3.12   3.14   -
MetricGAN, 2019 [27]   F        2.86   -      3.99   3.18   3.42   -
PHASEN, 2020 [28]      F        2.99   -      4.21   3.55   3.62   -
DeepMMSE, 2020 [29]    F        2.95   0.94   4.28   3.46   3.64   -
DEMUCS, 2020 [30]      T        3.07   0.95   4.31   3.40   3.63   130.5
TSTNN, 2021 [31]       T        2.96   0.95   4.33   3.53   3.67   3.5
CleanUNet, 2022 [32]   T        2.90   0.95   4.33   3.42   3.64   46.1
MANNER, 2022 [15]      T        3.21   0.95   4.53   3.65   3.91   24.1
MDAM-Net               T        3.25   0.95   4.53   3.66   3.93   16.9

T: time domain; F: frequency domain; -: no value given.

Figure 9. Enhancement results of MDAM-Net compared with different models

Table 4. Performance of different speech enhancement models at low SNR

           Noisy          Wave-U-Net [8]   DEMUCS [30]    TU-Net [33]    MDAM-Net
SNR        PESQ   STOI    PESQ   STOI      PESQ   STOI    PESQ   STOI    PESQ   STOI
7.5 dB     2.20   0.90    2.78   0.94      3.08   0.96    3.28   0.96    3.55   0.97
2.5 dB     1.73   0.83    2.43   0.93      2.73   0.92    3.01   0.95    3.03   0.95
-2.5 dB    1.41   0.75    1.99   0.88      2.06   0.89    2.21   0.91    2.56   0.92
-7.5 dB    1.24   0.62    1.61   0.76      1.71   0.82    1.90   0.86    2.18   0.88
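One way to read Table 4 is as PESQ improvement over the unprocessed noisy input at each SNR. The sketch below only rearranges the PESQ values already listed in the table. The function names are hypothetical, and the script does not compute PESQ itself.

```python
# PESQ scores copied from Table 4, keyed by input SNR in dB.
pesq = {
    7.5:  {"Noisy": 2.20, "Wave-U-Net": 2.78, "DEMUCS": 3.08, "TU-Net": 3.28, "MDAM-Net": 3.55},
    2.5:  {"Noisy": 1.73, "Wave-U-Net": 2.43, "DEMUCS": 2.73, "TU-Net": 3.01, "MDAM-Net": 3.03},
    -2.5: {"Noisy": 1.41, "Wave-U-Net": 1.99, "DEMUCS": 2.06, "TU-Net": 2.21, "MDAM-Net": 2.56},
    -7.5: {"Noisy": 1.24, "Wave-U-Net": 1.61, "DEMUCS": 1.71, "TU-Net": 1.90, "MDAM-Net": 2.18},
}

def pesq_gain(snr, model):
    """PESQ improvement of `model` over the noisy input at the given SNR."""
    return round(pesq[snr][model] - pesq[snr]["Noisy"], 2)

def best_model(snr):
    """Enhancement model with the highest PESQ at the given SNR."""
    enhanced = {m: v for m, v in pesq[snr].items() if m != "Noisy"}
    return max(enhanced, key=enhanced.get)

for snr in sorted(pesq, reverse=True):
    print(f"{snr:+.1f} dB: best = {best_model(snr)}, "
          f"MDAM-Net gain = {pesq_gain(snr, 'MDAM-Net'):.2f}, "
          f"margin over TU-Net = {pesq[snr]['MDAM-Net'] - pesq[snr]['TU-Net']:.2f}")
```

On these numbers MDAM-Net has the highest PESQ at every listed SNR. Its margin over TU-Net is smallest at 2.5 dB (0.02) and largest at -2.5 dB (0.35).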

Figure 10. Speech enhancement results at an SNR of -7.5 dB

1 Sun Z Y, Li Y D, Jiang H J, et al. A supervised speech enhancement method for smartphone-based binaural hearing aids. IEEE Transactions on Biomedical Circuits and Systems, 2020, 14(5): 951-960.
2 Xu Y. Research on deep neural network based speech enhancement. Ph.D. Dissertation. Hefei: University of Science and Technology of China, 2015. (in Chinese)
3 Wei Q S. Research on speech enhancement algorithm based on deep neural network. Master Dissertation. Nanjing: Nanjing University, 2016. (in Chinese)
4 Ye W Z. Extremely low signal-to-noise ratio speech enhancement method based on deep learning. Master Dissertation. Chengdu: University of Electronic Science and Technology of China, 2021. (in Chinese)
5 Hao X, Su X D, Wang Z Y, et al. UNetGAN: A robust speech enhancement approach in time domain for extremely low signal-to-noise ratio condition∥The 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA, 2019: 1786-1790.
6 Weninger F, Hershey J R, Le Roux J, et al. Discriminatively trained recurrent neural networks for single-channel speech separation∥2014 IEEE Global Conference on Signal and Information Processing. Atlanta, GA, USA: IEEE, 2014: 577-581.
7 Pandey A, Wang D L. TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain∥2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE, 2019: 6875-6879.
8 Macartney C, Weyde T. Improved speech enhancement with the Wave-U-Net. 2018, arXiv:.
9 Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need∥Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, CA, USA: Curran Associates Inc., 2017: 6000-6010.
10 Kim J, El-Khamy M, Lee J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement∥2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020: 6649-6653.
11 Giri R, Isik U, Krishnaswamy A. Attention Wave-U-Net for speech enhancement∥2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, NY, USA: IEEE, 2019: 249-253.
12 Pandey A, Wang D L. Dense CNN with self-attention for time-domain speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021(29): 1270-1279.
13 Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module∥Proceedings of the 15th European Conference on Computer Vision. Springer Berlin Heidelberg, 2018: 3-19.
14 Tolooshams B, Giri R, Song A H, et al. Channel-attention dense U-Net for multichannel speech enhancement∥2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE, 2020: 836-840.
15 Park H J, Kang B H, Shin W, et al. MANNER: Multi-view attention network for noise erasure∥2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 7842-7846.
16 Hu J, Shen L, Sun G. Squeeze-and-excitation networks∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 7132-7141.
17 Sperber M, Niehues J, Neubig G, et al. Self-attentional acoustic models∥The 19th Annual Conference of the International Speech Communication Association. Hyderabad, India: ISCA, 2018: 3723-3727.
18 Pandey A, Wang D L. On cross-corpus generalization of deep learning based speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020(28): 2489-2499.
19 Valentini-Botinhao C, Wang X, Takaki S, et al. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech∥The 9th ISCA Speech Synthesis Workshop. Sunnyvale, CA, USA: ISCA, 2016: 146-152.
20 Veaux C, Yamagishi J, King S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database∥2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation. Gurgaon, India: IEEE, 2013: 1-4.
21 Thiemann J, Ito N, Vincent E. The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America, 2013, 133(S5): 3591.
22 Rix A W, Beerends J G, Hollier M P, et al. Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs∥Proceedings of the 26th International Conference on Acoustics, Speech, and Signal Processing. Salt Lake City, Utah, USA: IEEE, 2001: 749-752.
23 Taal C H, Hendriks R C, Heusdens R, et al. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136.
24 Hu Y, Loizou P C. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(1): 229-238.
25 Pascual S, Bonafonte A, Serrà J. SEGAN: Speech enhancement generative adversarial network∥The 18th Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA, 2017: 3642-3646.
26 Soni M H, Shah N, Patil H A. Time-frequency masking-based speech enhancement using generative adversarial network∥2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Calgary, AB, Canada: IEEE, 2018: 5039-5043.
27 Fu S W, Liao C F, Tsao Y, et al. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement∥The 36th International Conference on Machine Learning. Long Beach, CA, USA: PMLR, 2019: 2031-2041.
28 Yin D C, Luo C, Xiong Z W, et al. PHASEN: A phase-and-harmonics-aware speech enhancement network∥Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, NY, USA: AAAI, 2020: 9458-9465.
29 Zhang Q Q, Nicolson A, Wang M J, et al. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020(28): 1404-1415.
30 Defossez A, Synnaeve G, Adi Y. Real time speech enhancement in the waveform domain∥The 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA, 2020: 3291-3295.
31 Wang K, He B B, Zhu W P. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain∥IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto, Canada: IEEE, 2021: 7098-7102.
32 Kong Z F, Ping W, Dantrey A, et al. Speech denoising in the waveform domain with self-attention∥2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 7867-7871.
33 Fan J Y, Yang J B, Zhang X W, et al. Monaural speech enhancement using U-net fused with multi-head self-attention. Acta Acustica, 2022, 47(6): 703-716. (in Chinese)