南京大学学报(自然科学版) 2020, Vol. 56, Issue (4): 581–590. doi: 10.13232/j.cnki.jnju.2020.04.016


基于FPGA的卷积神经网络加速模块设计

梅志伟1,2, 王维东1,2

  1. 浙江大学信息与电子工程学院,杭州,310013
  2. 浙江大学-瑞芯微多媒体系统联合实验室,浙江大学信息与电子工程学院,杭州,310013
  • 收稿日期:2020-06-05 出版日期:2020-07-30 发布日期:2020-08-06
  • 通讯作者: 王维东 E-mail:21760190@zju.edu.cn

Design of Convolutional Neural Network Acceleration Module Based on FPGA

Zhiwei Mei1,2, Weidong Wang1,2

  1. College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou, 310013, China
  2. ZJU-Rockchip Joint Laboratory of Multimedia System, College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou, 310013, China
  • Received:2020-06-05 Online:2020-07-30 Published:2020-08-06
  • Contact: Weidong Wang E-mail:21760190@zju.edu.cn

摘要:

针对卷积神经网络前向推理硬件加速的研究,提出一种基于FPGA(Field Programmable Gate Array)的卷积神经网络加速模块,以期在资源受限的硬件平台中加速卷积运算.通过分析卷积神经网络基本结构与常见卷积神经网络的特性,设计了一种适用于常见卷积神经网络的硬件加速架构.在该架构中,采用分层次缓存数据与分类复用数据策略,优化卷积层片外访存总量,缓解带宽压力;在计算模块中,在输入输出通道上并行计算,设计了将乘加树与脉动阵列相结合的高效率计算阵列,兼顾了计算性能与资源消耗.实验结果表明,提出的加速模块运行VGG-16(Visual Geometry Group)卷积神经网络性能达到189.03 GOPS(Giga Operations per Second),在DSP(Digital Signal Processor)性能效率上优于大部分现有的解决方案,内存资源消耗比现有解决方案减少41%,适用于移动端卷积神经网络硬件加速.

关键词: 卷积神经网络, 硬件加速, FPGA, 并行计算, 高效率乘加阵列

Abstract:

To accelerate the convolution operations of convolutional neural networks on resource-constrained hardware platforms, a convolutional neural network acceleration module based on FPGA (Field Programmable Gate Array) is proposed. By analyzing the basic structure of convolutional neural networks and the characteristics of common networks, a hardware acceleration architecture applicable to common convolutional neural networks is designed. In this architecture, hierarchical data caching and classified data reuse strategies are adopted to minimize the total amount of external memory access and relieve bandwidth pressure. In the computing module, a high-efficiency computing array that combines a multiply-add tree with a systolic array performs parallel computation over the input and output channels, balancing computing performance against resource consumption. Experimental results show that the proposed acceleration module reaches 189.03 GOPS (Giga Operations per Second) when running the VGG-16 (Visual Geometry Group) convolutional neural network, outperforms most existing solutions in DSP performance efficiency, and consumes 41% less memory resources than existing solutions. The proposed module is therefore suitable for hardware acceleration of convolutional neural networks on mobile terminals.

Key words: Convolutional Neural Network, hardware acceleration, FPGA, parallel computation, DSP performance efficiency

CLC number: TN4

Figure 1  Structure of LeNet-5

Table 1  Parameter analysis of common convolutional neural networks

Network | AlexNet | VGG-16 | GoogleNet v1 | Yolo v3
Convolution kernel sizes | 3, 5, 11 | 3 | 1, 3, 5, 7 | 1, 3, 7
Channels (excluding the first layer) | 64–256 | 32–512 | 32–832 | 32–1024
Convolutional layer parameters | 2.3 M | 14.7 M | 6.0 M | 40.5 M
Convolutional layer multiply-adds | 666 M | 15.3 G | 1.43 G | 65.4 G
Fully connected layer parameters | 58.6 M | 124 M | 1 M | -
Fully connected layer multiply-adds | 58.6 M | 130 M | 1 M | -
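The counts in Table 1 come directly from the layer shapes: a convolution layer with a K×K kernel, C_in input channels and C_out output channels has K·K·C_in·C_out weights and performs that many multiply-adds for every output pixel. A minimal sketch of these standard formulas (illustrative code, not from the paper):

```python
def conv_layer_cost(k, c_in, c_out, h_out, w_out):
    """Weight count and multiply-add count of one convolution layer."""
    params = k * k * c_in * c_out        # weights (bias terms omitted)
    macs = params * h_out * w_out        # one multiply-add per weight per output pixel
    return params, macs

# Example: one of VGG-16's 3x3, 512->512 layers on a 14x14 output map.
p, m = conv_layer_cost(k=3, c_in=512, c_out=512, h_out=14, w_out=14)
print(f"params = {p / 1e6:.2f} M, multiply-adds = {m / 1e9:.4f} G")
# params = 2.36 M, multiply-adds = 0.4624 G -- the same figure Table 4 lists
# for conv11-conv13; summing over all layers gives the totals in Table 1.
```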

Figure 2  Architecture of the convolutional neural network accelerator

Figure 3  Effect of on-chip caching and data-reuse schemes on the EMA (external memory access) volume of different AlexNet convolutional layers

Figure 4  Effect of on-chip caching and data-reuse schemes on the EMA volume of a single AlexNet convolutional layer

Figure 5  Effect of on-chip caching and data-reuse schemes on the EMA volume of all AlexNet convolutional layers
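Figures 3–5 compare how the caching and reuse choice changes external memory traffic. A rough way to see the effect is to model the DRAM traffic of one tiled convolution layer with and without reusing a cached input feature map across output-channel tiles. The sketch below is an illustrative model under simplified assumptions (16-bit data, weights read once, outputs written once, a tile of 32 output channels, AlexNet-conv2-like dimensions with channel grouping ignored); it is not the paper's actual hierarchical buffering scheme:

```python
import math

def conv_ema_mb(h, w, c_in, c_out, k, reuse_input, t_cout=32, bytes_per=2):
    """Illustrative DRAM traffic (MB) of one stride-1 convolution layer.

    reuse_input=True : the input feature map is buffered on chip, fetched once
                       and shared by all output-channel tiles.
    reuse_input=False: every tile of t_cout output channels re-reads the whole
                       input feature map from external memory.
    Weights are read once and outputs are written once in both cases.
    """
    in_fm   = h * w * c_in
    weights = k * k * c_in * c_out
    out_fm  = h * w * c_out
    in_reads = in_fm if reuse_input else in_fm * math.ceil(c_out / t_cout)
    return (in_reads + weights + out_fm) * bytes_per / 1e6

# AlexNet conv2-like layer: 27x27 maps, 96 -> 256 channels, 5x5 kernel
print(conv_ema_mb(27, 27, 96, 256, 5, reuse_input=False))  # ~2.7 MB
print(conv_ema_mb(27, 27, 96, 256, 5, reuse_input=True))   # ~1.7 MB
```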

Figure 6  Parallel computation design

Figure 7  Structure of the multiply-add array
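Figures 6 and 7 correspond to the parallelism described in the abstract: the array works on several input channels and several output channels at once, with a multiply-add (adder) tree reducing the input-channel products for each output channel. A minimal NumPy behavioural model of that dataflow is sketched below; the tile sizes T_IN = T_OUT = 32 and the loop order are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

T_IN, T_OUT = 32, 32   # assumed input/output-channel parallelism (illustrative)

def mac_array_step(in_pixels, weights, partial_sums):
    """One step of the array: T_OUT lanes work in parallel, each multiplying
    T_IN activations by its own T_IN weights and reducing the products with
    an adder tree into that lane's accumulator (systolic-style accumulation).

    in_pixels:    (T_IN,)        activations for T_IN input channels
    weights:      (T_OUT, T_IN)  weights of T_OUT output channels
    partial_sums: (T_OUT,)       running accumulators, updated in place
    """
    products = weights * in_pixels          # T_OUT x T_IN parallel multipliers
    partial_sums += products.sum(axis=1)    # per-lane adder tree (depth log2 T_IN)
    return partial_sums

def conv_pixel(inputs, weights):
    """Software model: T_OUT output channels of one output pixel, obtained by
    stepping the array over kernel positions and input-channel tiles."""
    _, kh, kw, c_in = weights.shape
    acc = np.zeros(T_OUT)
    for ky in range(kh):
        for kx in range(kw):
            for ci in range(0, c_in, T_IN):
                mac_array_step(inputs[ky, kx, ci:ci + T_IN],
                               weights[:, ky, kx, ci:ci + T_IN], acc)
    return acc

# Toy check against a direct computation: 3x3 kernel, 64 input channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 3, 64))
w = rng.standard_normal((T_OUT, 3, 3, 64))
assert np.allclose(conv_pixel(x, w), np.einsum('yxc,oyxc->o', x, w))
```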

Table 2  AlexNet hardware acceleration performance

Layer | Kernel size/stride | Time (ms) | Multiply-add array efficiency
conv1 | 11×11/4 | 11.3 | 9.10%
conv2 | 5×5/1 | 4.4 | 98.42%
conv3 | 3×3/1 | 1.7 | 81.84%
conv4 | 3×3/1 | 2.6 | 83.66%
conv5 | 3×3/1 | 1.8 | 81.66%
fc1 | 6×6/1 | 12.1 | 3.07%
fc2 | 1×1/1 | 5.5 | 3.00%
fc3 | 1×1/1 | 1.4 | 2.74%
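One reading of the efficiency column, consistent with the channel-level parallelism described in the abstract: if the array is, say, 32 wide along the input-channel dimension (an assumed figure, not stated in the tables), a layer with fewer than 32 input channels cannot fill it. conv1 sees only the 3 RGB input channels, so at most 3/32 ≈ 9.4% of the multipliers can do useful work, close to the 9.10% reported here (and the 9.35% for VGG-16 conv1 in Table 4); the fully connected layers are more plausibly limited by weight bandwidth than by the array itself. A tiny check of that upper bound:

```python
T_IN = 32   # assumed input-channel parallelism of the array (an illustrative assumption)
for layer, c_in in [("conv1", 3), ("conv3", 256)]:
    print(layer, f"{min(c_in, T_IN) / T_IN:.1%}")
# conv1 -> 9.4% (close to the reported 9.10%); conv3 -> 100.0% upper bound,
# so its measured 81.84% must come from other effects.
```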

Table 3  Acceleration performance of AlexNet convolutional layers

Layer | conv1 | conv2 | conv3 | conv4 | conv5 | All conv layers
Performance (GOPS) | 18.66 | 203.59 | 175.91 | 172.52 | 166.13 | 113.55

Table 4  Acceleration performance of VGG-16 convolutional layers

Layer | Multiply-adds | Time (ms) | Performance (GOPS) | Multiply-add array efficiency
conv1 | 0.0867 G | 9.06 | 19.27 | 9.35%
conv2 | 1.8496 G | 18.16 | 203.71 | 99.51%
conv3 | 0.9248 G | 9.05 | 204.39 | 99.80%
conv4 | 1.8497 G | 18.18 | 203.49 | 99.40%
conv5 | 0.9248 G | 9.06 | 204.16 | 99.73%
conv6 | 1.8497 G | 18.38 | 201.27 | 98.32%
conv7 | 1.8496 G | 18.18 | 203.49 | 99.40%
conv8 | 0.9248 G | 9.29 | 199.11 | 97.28%
conv9 | 1.8496 G | 18.58 | 199.11 | 97.27%
conv10 | 1.8496 G | 18.58 | 199.11 | 97.27%
conv11 | 0.4624 G | 5.32 | 173.84 | 85.03%
conv12 | 0.4624 G | 5.32 | 173.84 | 85.03%
conv13 | 0.4624 G | 5.32 | 173.84 | 85.03%
All conv layers | 15.347 G | 162.48 | 189.03 | -
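The GOPS column follows the usual convention of counting each multiply-add as two operations, i.e. performance = 2 × multiply-adds ÷ time. A quick check against the reported numbers (a sketch assuming exactly that convention):

```python
layers = {              # multiply-adds (G) and time (ms) as reported in Table 4
    "conv2":    (1.8496, 18.16),
    "conv13":   (0.4624, 5.32),
    "all conv": (15.347, 162.48),
}
for name, (macs_g, t_ms) in layers.items():
    gops = 2 * macs_g / (t_ms / 1000)
    print(f"{name}: {gops:.2f} GOPS")
# conv2 ~ 203.7, conv13 ~ 173.8, all conv ~ 188.9 -- matching the 203.71,
# 173.84 and 189.03 GOPS entries (small differences are rounding).
```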

Table 5  Comparison with other accelerators on AlexNet convolutional layers

AlexNet acceleration | [12] | [13] | [8] | This work
Year | 2015 | 2016 | 2017 | 2020
FPGA | Virtex7 VX485T | Stratix-V GSD8 | Arria10 GT 1150 | Virtex7 VX485T
Clock | 100 MHz | 120 MHz | 270.8 MHz | 100 MHz
Data width | 32 bits | 16 bits | 32 bits | 16 bits
DSP | 2240 | 1963 | 1290 | 588 (21%)
BRAM | 1024 | 2567 | 2360 | 288 (28%)
Performance (GOPS) | 61.62 | 72.4 | 406.10 | 113.55
DSP performance efficiency (GOPS/DSP) | 0.027 | 0.037 | 0.315 | 0.193

Table 6  Comparison with other accelerators on VGG-16 convolutional layers

VGG-16 acceleration | [13] | [14] | [15] | This work
Year | 2016 | 2018 | 2018 | 2020
FPGA | Stratix-V GSD8 | Zynq XC7Z045 | Virtex7 VX690T | Virtex7 VX485T
Clock | 120 MHz | 150 MHz | 150 MHz | 100 MHz
Data width | 16 bits | 16 bits | 16 bits | 16 bits
DSP | 1963 | 780 | 2833 | 588 (21%)
BRAM | 2567 | 486 | 1248 | 288 (28%)
Performance (GOPS) | 136.50 | 187.80 | 488.00 | 189.03
DSP performance efficiency (GOPS/DSP) | 0.070 | 0.241 | 0.172 | 0.321
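The DSP performance efficiency rows in Tables 5 and 6 are simply achieved throughput divided by the number of DSP blocks used, which normalises the comparison across devices of very different sizes. Reproducing the last row of Table 6 from the values in the table itself:

```python
designs = {                     # (GOPS, DSPs used) as listed in Table 6
    "[13] Stratix-V GSD8": (136.50, 1963),
    "[14] Zynq XC7Z045":   (187.80, 780),
    "[15] Virtex7 VX690T": (488.00, 2833),
    "This work (VX485T)":  (189.03, 588),
}
for name, (gops, dsp) in designs.items():
    print(f"{name}: {gops / dsp:.3f} GOPS/DSP")
# 0.070, 0.241, 0.172, 0.321 -- this work gives the highest GOPS per DSP even
# though [15] delivers about 2.6x its absolute throughput.
```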
References

1 Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation//2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587.
2 He K M, Zhang X Y, Ren S Q, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification//2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1026-1034.
3 Coates A, Huval B, Wang T, et al. Deep learning with COTS HPC systems//International Conference on Machine Learning. Atlanta, USA: ACM, 2013: 1337-1345.
4 Navarro C A, Hitschfeld-Kahler N, Mateu L. A survey on parallel computing and its applications in data-parallel problems using GPU architectures. Communications in Computational Physics, 2014, 15(2): 285-329.
5 Chen T S, Du Z D, Sun N H, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Computer Architecture News, 2014, 42(1): 269-284.
6 Qiu J T, Wang J, Yao S, et al. Going deeper with embedded FPGA platform for convolutional neural network//The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2016: 26-35.
7 Chen Y H, Krishna T, Emer J S, et al. Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 2016, 52(1): 127-138.
8 Wei X C, Yu C H, Zhang P, et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs//The 54th Annual Design Automation Conference 2017. Austin, TX, USA: ACM, 2017: 1-6.
9 Jouppi N P, Young C, Patil N, et al. In-datacenter performance analysis of a tensor processing unit//2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture. Toronto, Canada: IEEE, 2017: 1-12.
10 Feng G, Hu Z Y, Chen S, et al. Energy-efficient and high-throughput FPGA-based accelerator for convolutional neural networks//2016 13th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT). Hangzhou, China: IEEE, 2016: 624-626.
11 NVIDIA Corporation. Hardware Manual. http://nvdla.org/hw/contents.html, 2018-05-12.
12 Zhang C, Li P, Sun G Y, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks//The 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2015: 161-170.
13 Suda N, Chandra V, Dasika G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks//The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2016: 16-25.
14 Guo K Y, Sui L Z, Qiu J T, et al. Angel-Eye: a complete design flow for mapping CNN onto embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017, 37(1): 35-47.
15 Zhang C, Sun G Y, Fang Z M, et al. Caffeine: toward uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 38(11): 2072-2085.
16 Shen J Z, Huang Y, Wang Z L, et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA//The 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2018: 97-106.