南京大学学报(自然科学版) 2020, Vol. 56, Issue (4): 581–590. doi: 10.13232/j.cnki.jnju.2020.04.016


基于FPGA的卷积神经网络加速模块设计

梅志伟1,2, 王维东1,2

  1. 浙江大学信息与电子工程学院,杭州,310013
  2. 浙江大学-瑞芯微多媒体系统联合实验室,浙江大学信息与电子工程学院,杭州,310013
  • 收稿日期:2020-06-05 出版日期:2020-07-30 发布日期:2020-08-06
  • 通讯作者: 王维东 E-mail:21760190@zju.edu.cn

Design of Convolutional Neural Network Acceleration Module Based on FPGA

Zhiwei Mei1,2, Weidong Wang1,2

  1. College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou, 310013, China
  2. ZJU-Rockchip Joint Laboratory of Multimedia System, College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou, 310013, China
  • Received:2020-06-05 Online:2020-07-30 Published:2020-08-06
  • Contact: Weidong Wang E-mail:21760190@zju.edu.cn

摘要:

针对卷积神经网络前向推理硬件加速的研究,提出一种基于FPGA(Field Programmable Gate Array)的卷积神经网络加速模块,以期在资源受限的硬件平台中加速卷积运算.通过分析卷积神经网络基本结构与常见卷积神经网络的特性,设计了一种适用于常见卷积神经网络的硬件加速架构.在该架构中,采用分层次缓存数据与分类复用数据策略,优化卷积层片外访存总量,缓解带宽压力;在计算模块中,在输入输出通道上并行计算,设计了将乘加树与脉动阵列相结合的高效率计算阵列,兼顾了计算性能与资源消耗.实验结果表明,提出的加速模块运行VGG-16(Visual Geometry Group)卷积神经网络性能达到189.03 GOPS(Giga Operations per Second),在DSP(Digital Signal Processor)性能效率上优于大部分现有的解决方案,内存资源消耗比现有解决方案减少41%,适用于移动端卷积神经网络硬件加速.

关键词: 卷积神经网络, 硬件加速, FPGA, 并行计算, 高效率乘加阵列

Abstract:

To accelerate the convolution operations of convolutional neural networks on resource-constrained hardware platforms, a convolutional neural network acceleration module based on FPGA (Field Programmable Gate Array) is proposed. By analyzing the basic structure of convolutional neural networks and the characteristics of common networks, a hardware acceleration architecture applicable to common convolutional neural networks is designed. In this architecture, hierarchical data caching and classified data reuse strategies are adopted to minimize the total amount of external memory access and relieve bandwidth pressure. In the computing module, a high-efficiency computing array that combines a multiply-add tree with a systolic array performs parallel computation over the input and output channels, balancing computing performance against resource consumption. Experimental results show that the proposed acceleration module reaches 189.03 GOPS (Giga Operations per Second) when running the VGG-16 (Visual Geometry Group) convolutional neural network, outperforms most existing solutions in DSP performance efficiency, and consumes 41% less memory resources than existing solutions. The proposed module is therefore suitable for hardware acceleration of convolutional neural networks on mobile terminals.

Key words: Convolutional Neural Network, hardware acceleration, FPGA, parallel computation, DSP performance efficiency

CLC number: TN4

Figure 1  Structure of LeNet-5

Table 1  Parameter analysis of common convolutional neural networks

Network | AlexNet | VGG-16 | GoogleNet v1 | Yolo v3
Convolution kernel sizes | 3, 5, 11 | 3 | 1, 3, 5, 7 | 1, 3, 7
Channels (excluding the first layer) | 64–256 | 32–512 | 32–832 | 32–1024
Convolutional layer parameters | 2.3 M | 14.7 M | 6.0 M | 40.5 M
Convolutional layer multiply-adds | 666 M | 15.3 G | 1.43 G | 65.4 G
Fully connected layer parameters | 58.6 M | 124 M | 1 M | -
Fully connected layer multiply-adds | 58.6 M | 130 M | 1 M | -
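The counts in Table 1 come directly from the layer shapes: a convolution layer with a K×K kernel, C_in input channels and C_out output channels has K·K·C_in·C_out weights and performs that many multiply-adds for every output pixel. A minimal sketch of these standard formulas (illustrative code, not from the paper):

```python
def conv_layer_cost(k, c_in, c_out, h_out, w_out):
    """Weight count and multiply-add count of one convolution layer."""
    params = k * k * c_in * c_out        # weights (bias terms omitted)
    macs = params * h_out * w_out        # one multiply-add per weight per output pixel
    return params, macs

# Example: one of VGG-16's 3x3, 512->512 layers on a 14x14 output map.
p, m = conv_layer_cost(k=3, c_in=512, c_out=512, h_out=14, w_out=14)
print(f"params = {p / 1e6:.2f} M, multiply-adds = {m / 1e9:.4f} G")
# params = 2.36 M, multiply-adds = 0.4624 G -- the same figure Table 4 lists
# for conv11-conv13; summing over all layers gives the totals in Table 1.
```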

Figure 2  Architecture of the convolutional neural network accelerator

Figure 3  Effect of on-chip caching and data-reuse schemes on the EMA (external memory access) volume of different AlexNet convolutional layers

Figure 4  Effect of on-chip caching and data-reuse schemes on the EMA volume of a single AlexNet convolutional layer

Figure 5  Effect of on-chip caching and data-reuse schemes on the EMA volume of all AlexNet convolutional layers
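Figures 3–5 compare how the caching and reuse choice changes external memory traffic. A rough way to see the effect is to model the DRAM traffic of one tiled convolution layer with and without reusing a cached input feature map across output-channel tiles. The sketch below is an illustrative model under simplified assumptions (16-bit data, weights read once, outputs written once, a tile of 32 output channels, AlexNet-conv2-like dimensions with channel grouping ignored); it is not the paper's actual hierarchical buffering scheme:

```python
import math

def conv_ema_mb(h, w, c_in, c_out, k, reuse_input, t_cout=32, bytes_per=2):
    """Illustrative DRAM traffic (MB) of one stride-1 convolution layer.

    reuse_input=True : the input feature map is buffered on chip, fetched once
                       and shared by all output-channel tiles.
    reuse_input=False: every tile of t_cout output channels re-reads the whole
                       input feature map from external memory.
    Weights are read once and outputs are written once in both cases.
    """
    in_fm   = h * w * c_in
    weights = k * k * c_in * c_out
    out_fm  = h * w * c_out
    in_reads = in_fm if reuse_input else in_fm * math.ceil(c_out / t_cout)
    return (in_reads + weights + out_fm) * bytes_per / 1e6

# AlexNet conv2-like layer: 27x27 maps, 96 -> 256 channels, 5x5 kernel
print(conv_ema_mb(27, 27, 96, 256, 5, reuse_input=False))  # ~2.7 MB
print(conv_ema_mb(27, 27, 96, 256, 5, reuse_input=True))   # ~1.7 MB
```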

Figure 6  Parallel computation design

Figure 7  Structure of the multiply-add array
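Figures 6 and 7 correspond to the parallelism described in the abstract: the array works on several input channels and several output channels at once, with a multiply-add (adder) tree reducing the input-channel products for each output channel. A minimal NumPy behavioural model of that dataflow is sketched below; the tile sizes T_IN = T_OUT = 32 and the loop order are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

T_IN, T_OUT = 32, 32   # assumed input/output-channel parallelism (illustrative)

def mac_array_step(in_pixels, weights, partial_sums):
    """One step of the array: T_OUT lanes work in parallel, each multiplying
    T_IN activations by its own T_IN weights and reducing the products with
    an adder tree into that lane's accumulator (systolic-style accumulation).

    in_pixels:    (T_IN,)        activations for T_IN input channels
    weights:      (T_OUT, T_IN)  weights of T_OUT output channels
    partial_sums: (T_OUT,)       running accumulators, updated in place
    """
    products = weights * in_pixels          # T_OUT x T_IN parallel multipliers
    partial_sums += products.sum(axis=1)    # per-lane adder tree (depth log2 T_IN)
    return partial_sums

def conv_pixel(inputs, weights):
    """Software model: T_OUT output channels of one output pixel, obtained by
    stepping the array over kernel positions and input-channel tiles."""
    _, kh, kw, c_in = weights.shape
    acc = np.zeros(T_OUT)
    for ky in range(kh):
        for kx in range(kw):
            for ci in range(0, c_in, T_IN):
                mac_array_step(inputs[ky, kx, ci:ci + T_IN],
                               weights[:, ky, kx, ci:ci + T_IN], acc)
    return acc

# Toy check against a direct computation: 3x3 kernel, 64 input channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 3, 64))
w = rng.standard_normal((T_OUT, 3, 3, 64))
assert np.allclose(conv_pixel(x, w), np.einsum('yxc,oyxc->o', x, w))
```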

Table 2  AlexNet hardware acceleration performance

Layer | Kernel size/stride | Time (ms) | Multiply-add array efficiency
conv1 | 11×11/4 | 11.3 | 9.10%
conv2 | 5×5/1 | 4.4 | 98.42%
conv3 | 3×3/1 | 1.7 | 81.84%
conv4 | 3×3/1 | 2.6 | 83.66%
conv5 | 3×3/1 | 1.8 | 81.66%
fc1 | 6×6/1 | 12.1 | 3.07%
fc2 | 1×1/1 | 5.5 | 3.00%
fc3 | 1×1/1 | 1.4 | 2.74%
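One reading of the efficiency column, consistent with the channel-level parallelism described in the abstract: if the array is, say, 32 wide along the input-channel dimension (an assumed figure, not stated in the tables), a layer with fewer than 32 input channels cannot fill it. conv1 sees only the 3 RGB input channels, so at most 3/32 ≈ 9.4% of the multipliers can do useful work, close to the 9.10% reported here (and the 9.35% for VGG-16 conv1 in Table 4); the fully connected layers are more plausibly limited by weight bandwidth than by the array itself. A tiny check of that upper bound:

```python
T_IN = 32   # assumed input-channel parallelism of the array (an illustrative assumption)
for layer, c_in in [("conv1", 3), ("conv3", 256)]:
    print(layer, f"{min(c_in, T_IN) / T_IN:.1%}")
# conv1 -> 9.4% (close to the reported 9.10%); conv3 -> 100.0% upper bound,
# so its measured 81.84% must come from other effects.
```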

Table 3  Acceleration performance of AlexNet convolutional layers

Layer | conv1 | conv2 | conv3 | conv4 | conv5 | All conv layers
Performance (GOPS) | 18.66 | 203.59 | 175.91 | 172.52 | 166.13 | 113.55

Table 4  Acceleration performance of VGG-16 convolutional layers

Layer | Multiply-adds | Time (ms) | Performance (GOPS) | Multiply-add array efficiency
conv1 | 0.0867 G | 9.06 | 19.27 | 9.35%
conv2 | 1.8496 G | 18.16 | 203.71 | 99.51%
conv3 | 0.9248 G | 9.05 | 204.39 | 99.80%
conv4 | 1.8497 G | 18.18 | 203.49 | 99.40%
conv5 | 0.9248 G | 9.06 | 204.16 | 99.73%
conv6 | 1.8497 G | 18.38 | 201.27 | 98.32%
conv7 | 1.8496 G | 18.18 | 203.49 | 99.40%
conv8 | 0.9248 G | 9.29 | 199.11 | 97.28%
conv9 | 1.8496 G | 18.58 | 199.11 | 97.27%
conv10 | 1.8496 G | 18.58 | 199.11 | 97.27%
conv11 | 0.4624 G | 5.32 | 173.84 | 85.03%
conv12 | 0.4624 G | 5.32 | 173.84 | 85.03%
conv13 | 0.4624 G | 5.32 | 173.84 | 85.03%
All conv layers | 15.347 G | 162.48 | 189.03 | -
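The GOPS column follows the usual convention of counting each multiply-add as two operations, i.e. performance = 2 × multiply-adds ÷ time. A quick check against the reported numbers (a sketch assuming exactly that convention):

```python
layers = {              # multiply-adds (G) and time (ms) as reported in Table 4
    "conv2":    (1.8496, 18.16),
    "conv13":   (0.4624, 5.32),
    "all conv": (15.347, 162.48),
}
for name, (macs_g, t_ms) in layers.items():
    gops = 2 * macs_g / (t_ms / 1000)
    print(f"{name}: {gops:.2f} GOPS")
# conv2 ~ 203.7, conv13 ~ 173.8, all conv ~ 188.9 -- matching the 203.71,
# 173.84 and 189.03 GOPS entries (small differences are rounding).
```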

Table 5  Comparison with other accelerators on AlexNet convolutional layers

AlexNet acceleration | [12] | [13] | [8] | This work
Year | 2015 | 2016 | 2017 | 2020
FPGA | Virtex7 VX485T | Stratix-V GSD8 | Arria10 GT 1150 | Virtex7 VX485T
Clock | 100 MHz | 120 MHz | 270.8 MHz | 100 MHz
Data width | 32 bits | 16 bits | 32 bits | 16 bits
DSP | 2240 | 1963 | 1290 | 588 (21%)
BRAM | 1024 | 2567 | 2360 | 288 (28%)
Performance (GOPS) | 61.62 | 72.4 | 406.10 | 113.55
DSP performance efficiency (GOPS/DSP) | 0.027 | 0.037 | 0.315 | 0.193

Table 6  Comparison with other accelerators on VGG-16 convolutional layers

VGG-16 acceleration | [13] | [14] | [15] | This work
Year | 2016 | 2018 | 2018 | 2020
FPGA | Stratix-V GSD8 | Zynq XC7Z045 | Virtex7 VX690T | Virtex7 VX485T
Clock | 120 MHz | 150 MHz | 150 MHz | 100 MHz
Data width | 16 bits | 16 bits | 16 bits | 16 bits
DSP | 1963 | 780 | 2833 | 588 (21%)
BRAM | 2567 | 486 | 1248 | 288 (28%)
Performance (GOPS) | 136.50 | 187.80 | 488.00 | 189.03
DSP performance efficiency (GOPS/DSP) | 0.070 | 0.241 | 0.172 | 0.321
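The DSP performance efficiency rows in Tables 5 and 6 are simply achieved throughput divided by the number of DSP blocks used, which normalises the comparison across devices of very different sizes. Reproducing the last row of Table 6 from the values in the table itself:

```python
designs = {                     # (GOPS, DSPs used) as listed in Table 6
    "[13] Stratix-V GSD8": (136.50, 1963),
    "[14] Zynq XC7Z045":   (187.80, 780),
    "[15] Virtex7 VX690T": (488.00, 2833),
    "This work (VX485T)":  (189.03, 588),
}
for name, (gops, dsp) in designs.items():
    print(f"{name}: {gops / dsp:.3f} GOPS/DSP")
# 0.070, 0.241, 0.172, 0.321 -- this work gives the highest GOPS per DSP even
# though [15] delivers about 2.6x its absolute throughput.
```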
References

1 Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation//2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587.
2 He K M, Zhang X Y, Ren S Q, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification//2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1026-1034.
3 Coates A, Huval B, Wang T, et al. Deep learning with COTS HPC systems//International Conference on Machine Learning. Atlanta, USA: ACM, 2013: 1337-1345.
4 Navarro C A, Hitschfeld-Kahler N, Mateu L. A survey on parallel computing and its applications in data-parallel problems using GPU architectures. Communications in Computational Physics, 2014, 15(2): 285-329.
5 Chen T S, Du Z D, Sun N H, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Computer Architecture News, 2014, 42(1): 269-284.
6 Qiu J T, Wang J, Yao S, et al. Going deeper with embedded FPGA platform for convolutional neural network//The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2016: 26-35.
7 Chen Y H, Krishna T, Emer J S, et al. Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 2016, 52(1): 127-138.
8 Wei X C, Yu C H, Zhang P, et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs//The 54th Annual Design Automation Conference 2017. Austin, TX, USA: ACM, 2017: 1-6.
9 Jouppi N P, Young C, Patil N, et al. In-datacenter performance analysis of a tensor processing unit//2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture. Toronto, Canada: IEEE, 2017: 1-12.
10 Feng G, Hu Z Y, Chen S, et al. Energy-efficient and high-throughput FPGA-based accelerator for convolutional neural networks//2016 13th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT). Hangzhou, China: IEEE, 2016: 624-626.
11 NVIDIA Corporation. Hardware Manual. http://nvdla.org/hw/contents.html, 2018-05-12.
12 Zhang C, Li P, Sun G Y, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks//The 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2015: 161-170.
13 Suda N, Chandra V, Dasika G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks//The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2016: 16-25.
14 Guo K Y, Sui L Z, Qiu J T, et al. Angel-Eye: a complete design flow for mapping CNN onto embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017, 37(1): 35-47.
15 Zhang C, Sun G Y, Fang Z M, et al. Caffeine: toward uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 38(11): 2072-2085.
16 Shen J Z, Huang Y, Wang Z L, et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA//The 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2018: 97-106.