Journal of Nanjing University (Natural Science) ›› 2021, Vol. 57 ›› Issue (5): 793–800. doi: 10.13232/j.cnki.jnju.2021.05.009


Generating adversarial examples in audio object classification using a generative adversarial network

Qiang Zhang1, Jibin Yang2 (corresponding author), Xiongwei Zhang2, Tieyong Cao2, Pengcheng Mei1

  1. Graduate School, Army Engineering University, Nanjing 210007, China
    2. School of Command and Control Engineering, Army Engineering University, Nanjing 210007, China
  • Received: 2021-06-23; Online: 2021-09-29; Published: 2021-09-29
  • Corresponding author: Jibin Yang, E-mail: yjbice@sina.com
  • Funding:
    National Natural Science Foundation of China (62071484)

Generating adversarial examples in audio object classification using generative adversarial network

Qiang Zhang1, Jibin Yang2(), Xiongwei Zhang2, Tieyong Cao2, Pengcheng Mei1   

  1. Graduate School, Army Engineering University, Nanjing, 210007, China
    2. School of Command and Control Engineering, Army Engineering University, Nanjing, 210007, China
  • Received:2021-06-23 Online:2021-09-29 Published:2021-09-29
  • Contact: Jibin Yang E-mail:yjbice@sina.com

Abstract:

Audio adversarial examples can be used to improve the reliability of audio object classification systems. However, current audio adversarial examples suffer from low perceptual quality, and their generation quality is unsatisfactory. To improve the quality of audio adversarial examples, this work is the first to employ a Generative Adversarial Network (GAN) to generate adversarial examples for audio object classification. A general GAN framework for generating audio object classification adversarial examples is proposed, in which the classification model under attack is incorporated into the GAN. On this basis, a GAN-based Segmented-perturbation Overall-attack (SOGAN) method is proposed. Through adversarial training, SOGAN learns effective perturbations on short-time audio segments, assembles them into an overall perturbation according to their correspondence with the original audio, and thereby forms adversarial examples of variable duration. This method narrows the search space of audio adversarial examples and reduces the complexity of adversarial example generation. Experiments on audio object classification datasets such as UrbanSound8k and ESC50 show that, compared with existing audio adversarial example design methods, the adversarial examples generated by the proposed method are less perceptible and achieve higher attack success rates and higher attack efficiency against typical audio object classification systems.

Key words: audio signal processing, adversarial example, audio object classification, generative adversarial network

Abstract:

Audio adversarial examples can be applied to improve the robustness of audio object classification systems. However, the quality of the adversarial examples produced by current systems is still not satisfactory. To improve the performance of adversarial examples, a Generative Adversarial Network (GAN) is adopted, for the first time, to generate adversarial examples for audio object classification. A general GAN framework that integrates the attacked audio classification model is proposed to optimize the attack effect. Based on this framework, a GAN-based Segmented-perturbation Overall-attack (SOGAN) method is proposed to narrow the search space of audio adversarial example generation. SOGAN learns effective perturbations on short audio segments through adversarial training, and then synthesizes an overall perturbation to generate adversarial examples of variable length. This not only reduces the complexity of audio adversarial example generation, but also improves the generality and performance of GAN-based audio adversarial example generation. Experiments are carried out on typical audio object classification datasets such as UrbanSound8k and ESC50. The results show that, compared with existing audio adversarial example design methods, the proposed method generates adversarial examples with a higher attack success rate, lower perceptibility, and higher attack efficiency against state-of-the-art audio classification systems.
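The segmented-perturbation / overall-attack idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the real generator G is a trained convolutional encoder-decoder, whereas `generate_perturbation` below is a hypothetical stand-in, and the segment length is an assumed value.

```python
import numpy as np

SEG_LEN = 16000  # assumed fixed segment length in samples (illustrative)

def generate_perturbation(segment: np.ndarray) -> np.ndarray:
    """Stand-in for the trained generator G: maps a fixed-length
    audio segment to a small additive perturbation."""
    rng = np.random.default_rng(0)
    return 0.001 * rng.standard_normal(segment.shape)

def sogan_attack(x: np.ndarray) -> np.ndarray:
    """Build a variable-length adversarial example by perturbing the
    input segment by segment and reassembling the pieces in order."""
    out = np.empty_like(x)
    for start in range(0, len(x), SEG_LEN):
        seg = x[start:start + SEG_LEN]
        pad = SEG_LEN - len(seg)
        seg_p = np.pad(seg, (0, pad))         # pad a trailing short segment
        delta = generate_perturbation(seg_p)  # per-segment perturbation
        out[start:start + len(seg)] = seg + delta[:len(seg)]
    return out

x = np.zeros(40000)            # 2.5 segments of "audio"
x_adv = sogan_attack(x)
assert x_adv.shape == x.shape  # the adversarial example keeps the original length
```

Because G only ever sees fixed-length segments, the same trained generator can attack inputs of any duration, which is the source of the reduced search space described above.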

Key words: audio signal processing, adversarial example, audio object classification, Generative Adversarial Network (GAN)

CLC number:

  • TP391

Fig. 1  General GAN framework for generating adversarial examples for audio object classification

Fig. 2  SOGAN: the GAN-based segmented-perturbation / overall-attack method

Table 1  Details of the datasets

Dataset       Classes  Samples  Training samples  Test samples  Duration (s)  Channels
ESC50         50       2000     1800              200           5             1
UrbanSound8k  10       8732     7858              874           ≤4            2

Table 2  Network structures of G and D

G  encoder     conv2d (16,9,1,0), BN, LRelu
               conv2d (32,10,2,0), BN, LRelu
               conv2d (64,10,2,0), BN, LRelu
               conv2d (128,10,2,0), BN, LRelu
   bottleneck  Resnet block × 4
   decoder     deconv2d (128,10,2,0), BN, LRelu
               deconv2d (64,10,2,0), BN, LRelu
               deconv2d (32,10,2,0), BN, LRelu
               deconv2d (16,9,1,0), BN, LRelu
D  MFE         conv1d (40,8,1,4), BN, LRelu
               conv1d (40,8,1,4), BN, LRelu
               maxpool (160,160)
   EL          conv2d (50,(8,13),1,0), BN, LRelu
               maxpool (3,3)
               conv2d (50,(1,5),1,0), BN, LRelu
               maxpool ((1,3),(1,3))
               conv2d (1,(2,5),(3,3),0), LRelu
   classifier  linear (16,1)
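Reading each conv2d row above as (output channels, kernel, stride, padding), the spatial size at each encoder stage follows the standard convolution output formula. A small sketch of that shape arithmetic (the input length of 400 is a hypothetical value, not taken from the paper):

```python
def conv_out(n: int, k: int, s: int, p: int) -> int:
    """Standard convolution output length: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Encoder stack from Table 2, read as (channels, kernel, stride, padding).
encoder = [(16, 9, 1, 0), (32, 10, 2, 0), (64, 10, 2, 0), (128, 10, 2, 0)]

n = 400  # hypothetical input extent along one spatial axis
sizes = [n]
for _, k, s, p in encoder:
    n = conv_out(n, k, s, p)
    sizes.append(n)

print(sizes)  # [400, 392, 192, 92, 42]
```

The three stride-2 layers roughly halve the extent each time, and the mirrored deconv2d stack in the decoder reverses this so that the generated perturbation matches the input segment's size.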

Fig. 3  Non-targeted attack performance of SOGAN with different segment lengths on the ESC50 dataset

Table 3  Targeted attack performance of different methods on UrbanSound8k
(target classes, left to right: Air Conditioner, Car Horn, Children Playing, Dog Bark, Drilling, Engine Idling, Gun Shot, Jackhammer, Siren, Street Music)

Iterative[18]  ASR (train)   0.939   0.928   0.921   0.925   0.882   0.923   0.902   0.929   0.901   0.906
               ASR (test)    0.797   0.835   0.697   0.767   0.764   0.763   0.804   0.744   0.747   0.754
               MSNR (test)   22.852  22.932  22.371  23.183  22.702  23.387  20.868  23.355  23.049  22.637
Penalty[18]    ASR (train)   0.926   0.935   0.916   0.928   0.923   0.904   0.929   0.904   0.936   0.919
               ASR (test)    0.873   0.902   0.860   0.873   0.867   0.888   0.895   0.879   0.866   0.863
               MSNR (test)   22.143  21.208  22.241  21.601  22.251  21.917  20.798  22.367  21.971  21.818
SOGAN          ASR (train)   0.951   0.956   0.946   0.964   0.947   0.959   0.948   0.945   0.954   0.934
               ASR (test)    0.949   0.944   0.935   0.953   0.934   0.945   0.932   0.934   0.938   0.928
               MSNR (test)   28.387  28.156  28.503  28.617  28.737  28.637  28.756  28.556  27.387  27.365
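The MSNR values in Table 3 are signal-to-noise-type measures in dB, where higher means a less perceptible perturbation. Assuming the usual SNR definition between the clean signal and the additive perturbation (the exact MSNR formula is not reproduced on this page), a toy computation looks like:

```python
import math

def snr_db(signal, perturbation):
    """SNR in dB between a signal and an additive perturbation."""
    ps = sum(v * v for v in signal)        # signal power (unnormalized)
    pn = sum(v * v for v in perturbation)  # perturbation power
    return 10.0 * math.log10(ps / pn)

x = [1.0, -1.0, 1.0, -1.0]          # toy "audio"
delta = [0.01, -0.01, 0.01, -0.01]  # toy perturbation, 100x smaller in amplitude
print(round(snr_db(x, delta), 1))   # 40.0
```

On this reading, SOGAN's MSNR of roughly 28 dB versus roughly 21-23 dB for the baselines corresponds to a noticeably smaller perturbation relative to the signal.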

Table 4  Non-targeted attack performance of different methods on UrbanSound8k

               Training set                Test set
Method         ACC     ASR     MSNR       ACC     ASR     MSNR     Time (s)
Iterative[18]  N/A     91.0%   N/A        N/A     66.9%   24.960   N/A
Penalty[18]    N/A     90.0%   N/A        N/A     83.1%   18.727   N/A
SOGAN          99.5%   94.8%   28.531     95.7%   93.6%   28.310   0.01
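Here ACC is classification accuracy, ASR the attack success rate, and Time the per-example generation time. One common way to compute non-targeted ASR, shown here only as an illustration since the paper's exact counting convention is not given on this page, is the fraction of originally correct predictions that the attack flips:

```python
def attack_success_rate(clean_preds, adv_preds, labels):
    """Non-targeted ASR: fraction of examples the model classified
    correctly before the attack but incorrectly after it."""
    hits = total = 0
    for c, a, y in zip(clean_preds, adv_preds, labels):
        if c == y:        # only count examples the model got right
            total += 1
            if a != y:    # attack succeeded: the prediction changed
                hits += 1
    return hits / total

print(attack_success_rate([0, 1, 2, 3], [5, 1, 4, 9], [0, 1, 2, 3]))  # 0.75
```

A targeted ASR would instead count how often the adversarial prediction equals a chosen target class, as in Table 3.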

Fig. 4  Comparison of the original sample x and its corresponding adversarial example x_adv

1 Khamparia A, Gupta D, Nguyen N G, et al. Sound classification using convolutional neural network and tensor deep stacking network. IEEE Access, 2019, 7: 7717-7727.
2 Hannun A, Case C, Casper J, et al. Deep speech: Scaling up end-to-end speech recognition. 2014, arXiv preprint.
3 Hoshen Y, Weiss R J, Wilson K W. Speech acoustic modeling from raw multichannel waveforms∥2015 IEEE International Conference on Acoustics, Speech and Signal Processing. South Brisbane, Australia: IEEE, 2015: 4624-4628.
4 Van Den Oord A, Dieleman S, Zen H, et al. WaveNet: A generative model for raw audio. 2016, arXiv preprint.
5 Sainath T N, Weiss R J, Senior A, et al. Learning the speech front-end with raw waveform CLDNNs∥The 16th Annual Conference of the International Speech Communication Association. Dresden, Germany: IEEE, 2015: 1-5.
6 Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks. 2013, arXiv:1312.6199v4.
7 Goodfellow I J, Shlens J, Szegedy C. Explaining and harnessing adversarial examples. 2014, arXiv:1412.6572v2.
8 Kurakin A, Goodfellow I J, Bengio S. Adversarial examples in the physical world. 2016, arXiv:1607.02533v1.
9 Carlini N, Wagner D. Towards evaluating the robustness of neural networks∥2017 IEEE Symposium on Security and Privacy. San Jose, CA, USA: IEEE, 2017: 39-57.
10 Papernot N, McDaniel P, Jha S, et al. The limitations of deep learning in adversarial settings∥2016 IEEE European Symposium on Security and Privacy. Saarbruecken, Germany: IEEE, 2016: 372-387.
11 Moosavi-Dezfooli S M, Fawzi A, Frossard P. DeepFool: A simple and accurate method to fool deep neural networks∥2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 2574-2582.
12 Hu W W, Tan Y. Generating adversarial malware examples for black-box attacks based on GAN. 2017, arXiv preprint.
13 Moosavi-Dezfooli S M, Fawzi A, Fawzi O, et al. Universal adversarial perturbations∥2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 86-94.
14 Xiao C W, Li B, Zhu J Y, et al. Generating adversarial examples with adversarial networks∥Proceedings of the 27th International Joint Conference on Artificial Intelligence. Stockholm, Sweden: AAAI Press, 2018: 3905-3911.
15 Carlini N, Wagner D. Audio adversarial examples: Targeted attacks on speech-to-text∥2018 IEEE Security and Privacy Workshops. San Francisco, CA, USA: IEEE, 2018: 1-7.
16 Alzantot M, Balaji B, Srivastava M. Did you hear that? Adversarial examples against automatic speech recognition. 2018, arXiv preprint.
17 Du T Y, Ji S L, Li J F, et al. SirenAttack: Generating adversarial audio for end-to-end acoustic systems∥Proceedings of the 15th ACM Asia Conference on Computer and Communications Security. New York, NY, USA: ACM, 2020: 357-369.
18 Abdoli S, Hafemann L G, Rony J, et al. Universal adversarial audio perturbations. 2019, arXiv:1908.03173.
19 Wang D H, Dong L, Wang R D, et al. Targeted speech adversarial example generation with generative adversarial network. IEEE Access, 2020, 8: 124503-124513.
20 Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks. 2014, arXiv:1406.2661.
21 Salamon J, Jacoby C, Bello J P. A dataset and taxonomy for urban sound research∥Proceedings of the 22nd ACM International Conference on Multimedia. New York, NY, USA: ACM, 2014: 1041-1044.
22 Piczak K J. ESC: Dataset for environmental sound classification∥Proceedings of the 23rd ACM International Conference on Multimedia. New York, NY, USA: ACM, 2015: 1015-1018.
23 Johnson J, Alahi A, Li F F. Perceptual losses for real-time style transfer and super-resolution∥European Conference on Computer Vision. Berlin, Heidelberg: Springer, 2016: 694-711.
24 Tokozume Y, Harada T. Learning environmental sounds with end-to-end convolutional neural network∥2017 IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans, LA, USA: IEEE, 2017: 2721-2725.
25 Paszke A, Gross S, Massa F, et al. PyTorch: An imperative style, high-performance deep learning library. 2019, arXiv preprint.
26 Tokozume Y, Ushiku Y, Harada T. Learning from between-class examples for deep sound recognition. 2017, arXiv preprint.