南京大学学报(自然科学版) ›› 2021, Vol. 57 ›› Issue (6): 1075–1082. doi: 10.13232/j.cnki.jnju.2021.06.016


基于FPGA的卷积神经网络训练加速器设计

孟浩1,2, 刘强1,2

  1. 天津大学微电子学院,天津,300072
    2. 天津市成像与感知微电子技术重点实验室,天津,300072
  • 收稿日期:2021-05-31 出版日期:2021-12-03 发布日期:2021-12-03
  • 通讯作者: 刘强 E-mail:qiangliu@tju.edu.cn
  • 基金资助:
    国家自然科学基金(61974102)

An FPGA-based convolutional neural network training accelerator

Hao Meng1,2, Qiang Liu1,2

  1. School of Microelectronics, Tianjin University, Tianjin, 300072, China
    2. Tianjin Key Laboratory of Imaging and Sensing Microelectronics Technology, Tianjin, 300072, China
  • Received:2021-05-31 Online:2021-12-03 Published:2021-12-03
  • Contact: Qiang Liu E-mail:qiangliu@tju.edu.cn

摘要:

近年来卷积神经网络在图像分类、图像分割等任务中应用广泛.针对基于FPGA (Field Programmable Gate Array)的卷积神经网络训练加速器中存在的权重梯度计算效率低和加法器占用资源多的问题,设计一款高性能的卷积神经网络训练加速器.首先提出一种卷积单引擎架构,在推理卷积硬件架构的基础上增加额外的自累加单元,可兼容卷积层的正向传播与反向传播(误差反向传递和权重梯度计算),提高加速器的复用能力,同时提升权重梯度计算的效率;然后提出一种适配卷积核内加法树与自累加单元的新型加法树设计,进一步节约计算资源;最后在Xilinx Zynq xc7z045平台上实现了所提出的训练加速器,并基于CIFAR-10数据集训练VGG-like (Visual Geometry Group)网络模型.实验结果表明,在200 MHz的时钟频率下,支持8位定点的训练加速器可以达到64.6 GOPS (Giga Operations per Second)的平均性能,是Intel Xeon E5-2630 v4 CPU (Central Processing Unit)训练平台的9.36倍,能效是NVIDIA Tesla K40C GPU (Graphics Processing Unit)训练平台的17.96倍.与已有的FPGA加速器相比,提出的加速器在处理性能和存储资源使用效率上具有优势.

关键词: 卷积神经网络, 深度学习, 图像处理, 现场可编程门阵列

Abstract:

In recent years, convolutional neural networks have been widely used in image classification and image segmentation tasks. However, FPGA (Field Programmable Gate Array) based convolutional neural network training accelerators suffer from inefficient weight-gradient computation and high adder resource usage. This paper designs a high-performance convolutional neural network training accelerator. First, we propose a convolutional single-engine architecture that adds a self-accumulation unit to an inference convolution architecture, so that one engine supports both the forward propagation and the back propagation (error back-propagation and weight-gradient computation) of convolutional layers, improving the engine's reusability and the efficiency of weight-gradient computation. Then, we design a novel adder tree that combines the intra-kernel adder tree with the self-accumulation units, further saving computational resources. Finally, the proposed training accelerator is implemented on the Xilinx Zynq xc7z045 platform and trains a VGG-like (Visual Geometry Group) network model on the CIFAR-10 dataset. Experimental results show that at a clock frequency of 200 MHz, the 8-bit fixed-point training accelerator achieves an average performance of 64.6 GOPS (Giga Operations per Second), 9.36 times that of an Intel Xeon E5-2630 v4 CPU (Central Processing Unit) training platform, and its energy efficiency is 17.96 times that of an NVIDIA Tesla K40C GPU (Graphics Processing Unit) training platform. Compared with several existing FPGA accelerators, the proposed accelerator has advantages in processing performance and storage resource efficiency.
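The single-engine reuse described in the abstract rests on a standard identity: the weight gradient of a convolutional layer is itself a convolution of the layer's input with the output error. A minimal pure-Python sketch of that identity (illustrative only, not the paper's hardware; single channel, stride 1, no padding):

```python
# Sketch: the forward pass and the weight-gradient pass run through the
# SAME convolution routine -- which is why one conv engine plus a
# self-accumulation unit can serve both, as the paper proposes.

def conv2d(x, k):
    """Plain 2D cross-correlation, 'valid' mode."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = [[0.0] * (W - kw + 1) for _ in range(H - kh + 1)]
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            out[i][j] = sum(x[i + a][j + b] * k[a][b]
                            for a in range(kh) for b in range(kw))
    return out

# Forward pass: y = conv(x, w).
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
w = [[1.0, 0.0],
     [0.0, 1.0]]
y = conv2d(x, w)        # 2x2 output feature map

# Weight-gradient pass: dw = conv(x, dy), with the output error dy acting
# as the kernel. Same engine, different operands.
dy = [[1.0, 1.0],
      [1.0, 1.0]]
dw = conv2d(x, dy)      # same shape as w
```

Here `x`, `w`, and `dy` are toy values chosen for illustration; the paper's engine additionally tiles channels and filters in hardware.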

Key words: convolutional neural network, deep learning, image processing, field programmable gate array

CLC number: TP338.6

Figure 1. Convolution operations in forward and backward propagation
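The backward error pass shown in Figure 1 follows another standard identity: the input gradient is the zero-padded output error convolved with the 180°-rotated kernel. A pure-Python sketch under those assumptions (stride 1, 'valid' forward convolution; not the paper's implementation):

```python
# Sketch of error back-propagation as a convolution:
# dx = conv(pad(dy), rot180(w)), again reusable on one conv engine.

def rot180(k):
    """Rotate a 2D kernel by 180 degrees."""
    return [row[::-1] for row in k[::-1]]

def pad(x, p):
    """Zero-pad a 2D map by p on every side."""
    W = len(x[0])
    zeros = [0.0] * (W + 2 * p)
    return ([zeros[:] for _ in range(p)]
            + [[0.0] * p + row + [0.0] * p for row in x]
            + [zeros[:] for _ in range(p)])

def conv2d(x, k):
    """Plain 2D cross-correlation, 'valid' mode."""
    H, W, kh, kw = len(x), len(x[0]), len(k), len(k[0])
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(W - kw + 1)] for i in range(H - kh + 1)]

dy = [[1.0, 2.0],
      [3.0, 4.0]]           # output error from the next layer
w = [[1.0, 0.0],
     [0.0, 2.0]]            # this layer's kernel
dx = conv2d(pad(dy, 1), rot180(w))   # gradient w.r.t. the 3x3 input
```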

Figure 2. Overall architecture of the accelerator

Figure 3. Modules of the convolution computing unit

Figure 4. Weight gradient computation and its optimization

Figure 5. Accumulation scheme of the novel adder tree
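The accumulation pattern behind Figures 3-5 can be sketched as two stages: a balanced adder tree reduces one cycle's kernel products, and a self-accumulation register folds successive partial sums across input-channel tiles. This is only a behavioral sketch; the paper's contribution is a merged tree that shares hardware between these two stages, which the sketch does not attempt to reproduce.

```python
# Behavioral sketch of kernel adder tree + self-accumulation unit.

def adder_tree(values):
    """Pairwise reduction, mirroring a balanced hardware adder tree."""
    while len(values) > 1:
        values = [values[i] + values[i + 1] if i + 1 < len(values)
                  else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

def accumulate_tiles(tiles):
    """Fold one tree result per cycle into a self-accumulation register."""
    acc = 0.0                       # the self-accumulation register
    for products in tiles:          # one tile of kernel products per cycle
        acc += adder_tree(products)
    return acc
```

For a 3×3 kernel, `adder_tree` reduces 9 products per cycle, and `accumulate_tiles` sums the per-tile results over however many input-channel tiles the layer needs.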

Table 1. Structure of the VGG-like network model

Layer | In channels | Kernels | Output H×W | Kernel size
Conv1 | 3    | 128  | 32×32 | 3×3
Conv2 | 128  | 128  | 32×32 | 3×3
Pool  | 128  | 128  | 16×16 | 2×2
Conv3 | 128  | 256  | 16×16 | 3×3
Conv4 | 256  | 256  | 16×16 | 3×3
Pool  | 256  | 256  | 8×8   | 2×2
Conv5 | 256  | 512  | 8×8   | 3×3
Conv6 | 512  | 512  | 8×8   | 3×3
Pool  | 512  | 512  | 4×4   | 2×2
FC    | 8192 | 1024 | -     | -
FC    | 1024 | 10   | -     | -
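The shapes in Table 1 can be cross-checked by tracking the feature map through the layers. A sketch (assuming the 3×3 convolutions preserve height/width and each 2×2 pooling halves them, as the table's output sizes imply):

```python
# Cross-check of Table 1: the flattened feature map entering the first
# fully connected layer should be 512 channels * 4 * 4 = 8192.

layers = [
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("pool", None),
    ("conv", 512), ("conv", 512), ("pool", None),
]

c, hw = 3, 32                  # CIFAR-10 input: 3 channels, 32x32
for kind, out_c in layers:
    if kind == "conv":         # 3x3 conv, shape-preserving
        c = out_c
    else:                      # 2x2 pooling halves height and width
        hw //= 2

flat = c * hw * hw             # input width of the first FC layer
```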

Table 2. Resource utilization of the accelerator

Resource    | LUT    | FF     | BRAM   | DSP
Total       | 218600 | 437200 | 545    | 900
Used        | 54725  | 74458  | 134.5  | 615
Utilization | 25.03% | 17.03% | 24.68% | 68.33%

Table 3. Comparison of adder tree resource usage (values given as LUT, FF)

Config     | Before: kernel adder tree | Before: self-accumulation unit | Before: total | After optimization | Ratio
PC=1, PF=1 | 257, 388                  | 480, 680                       | 737, 1068     | 607, 484           | 82.3%, 45.3%
PC=8, PF=8 | 18264, 26648              | 30720, 43520                   | 48984, 70168  | 40064, 32792       | 81.8%, 46.7%

Table 4. Performance comparison of CPU, GPU and FPGA

Parameter                     | CPU                   | GPU               | FPGA
Platform                      | Intel Xeon E5-2630 v4 | NVIDIA Tesla K40C | Xilinx Zynq xc7z045
Process (nm)                  | 14                    | 28                | 28
Frequency (GHz)               | 2.2                   | 0.745             | 0.2
Precision                     | 32-bit floating point | 32-bit floating point | 8-bit fixed point
Training time per batch (ms)  | 539.97                | 13.62             | 57.60
Power (W)                     | 85                    | 245               | 3.21
Performance (GOPS)            | 6.9                   | 273.3             | 64.6
Energy efficiency (GOPS·W⁻¹)  | 0.08                  | 1.12              | 20.12
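The energy-efficiency row of Table 4 is simply performance divided by power, rounded to two decimals as in the table:

```python
# Arithmetic behind Table 4's energy-efficiency row (GOPS / W).
platforms = {
    "CPU":  (6.9, 85.0),     # (performance in GOPS, power in W)
    "GPU":  (273.3, 245.0),
    "FPGA": (64.6, 3.21),
}
efficiency = {name: round(gops / watts, 2)
              for name, (gops, watts) in platforms.items()}
```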

Table 5. Performance comparison of different FPGA training accelerators

Accelerator      | Platform            | Model    | Dataset  | Data type              | DSP/LUTs/FFs/BRAM       | Power (W) | Performance (GOPS) | DSP efficiency (GOPS/DSP) | BRAM efficiency (GOPS/BRAM) | Energy efficiency (GOPS·W⁻¹)
Zhao et al. [17] | Altera Stratix V    | AlexNet  | -        | single-precision float | ≈2214/-/-/-             | -         | 62.06              | 0.028                     | -                           | -
Liu et al. [18]  | Xilinx ZU19EG       | LeNet-5  | CIFAR-10 | single-precision float | 1500/329.3k/466.0k/174  | 14.24     | 17.96              | 0.012                     | 0.10                        | 1.26
Fox et al. [19]  | Zynq ZCU111         | VGG16    | CIFAR-10 | 8-bit fixed point      | 1037/73.1k/25.6k/1045   | -         | 205                | 0.19                      | 0.20                        | -
Luo et al. [20]  | UltraScale+ XCVU9P  | VGG-like | CIFAR-10 | 8-bit fixed point      | 4202/480k/-/≈4717       | 13.5      | 1417               | 0.33                      | 0.30                        | 104.9
This work        | ZC706 xc7z045       | VGG-like | CIFAR-10 | 8-bit fixed point      | 615/54.7k/74.5k/134.5   | 3.21      | 64.6               | 0.105                     | 0.48                        | 20.1
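Table 5's per-resource efficiency figures follow from the same kind of division; the check below uses this work's row:

```python
# Per-resource efficiency as used in Table 5, for this work's row.
gops, dsps, brams = 64.6, 615, 134.5

dsp_eff = round(gops / dsps, 3)     # GOPS per DSP slice
bram_eff = round(gops / brams, 2)   # GOPS per BRAM
```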
1 Redmon J,Farhadi A. YOLO9000:Better,faster,stronger∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu,HI,USA:IEEE,2017:6517-6525.
2 Russakovsky O,Deng J,Su H,et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision,2015,115(3):211-252.
3 Zhang Y,Chan W,Jaitly N. Very deep convolutional networks for end-to-end speech recognition∥Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing. New Orleans,LA,USA:IEEE,2017:4845-4849.
4 Qiu J T,Wang J,Yao S,et al. Going deeper with embedded FPGA platform for convolutional neural network∥Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. New York:ACM Press,2016:26-35.
5 Zhao R Z,Niu X Y,Wu Y J,et al. Optimizing CNN-based object detection algorithms on embedded FPGA platforms∥13th International Symposium on Applied Reconfigurable Computing. Springer Berlin Heidelberg,2017:255-267.
6 仇越,马文涛,柴志雷. 一种基于FPGA的卷积神经网络加速器设计与实现. 微电子学与计算机,2018,35(8):68-72,77.
Qiu Y,Ma W T,Chai Z L. Design and implementation of a convolutional neural network accelerator based on FPGA. Microelectronics & Computer,2018,35(8):68-72,77.
7 陈辰,柴志雷,夏珺. 基于Zynq7000 FPGA异构平台的YOLOv2加速器设计与实现. 计算机科学与探索,2019,13(10):1677-1693.
Chen C,Chai Z L,Xia J. Design and implementation of YOLOv2 accelerator based on Zynq7000 FPGA heterogeneous platform. Journal of Frontiers of Computer Science and Technology,2019,13(10):1677-1693.
8 曾成龙,刘强. 面向嵌入式FPGA的高性能卷积神经网络加速器设计. 计算机辅助设计与图形学学报,2019,31(9):1645-1652.
Zeng C L,Liu Q. Design of high performance convolutional neural network accelerator for embedded FPGA. Journal of Computer?Aided Design and Computer Graphics,2019,31(9):1645-1652.
9 徐欣,刘强,王少军. 一种高度并行的卷积神经网络加速器设计方法. 哈尔滨工业大学学报,2020,52(4):31-37.
Xu X,Liu Q,Wang S J. A highly parallel design method for convolutional neural networks accelerator. Journal of Harbin Institute of Technology,2020,52(4):31-37.
10 Li F F,Zhang B,Liu B. Ternary weight networks. 2016,arXiv:.
11 Zhou S C,Wu Y X,Ni Z K,et al. DoReFa-Net:Training low bitwidth convolutional neural networks with low bitwidth gradients. 2018,arXiv:.
12 Banner R,Hubara I,Hoffer E,et al. Scalable methods for 8-bit training of neural networks∥Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montreal,Canada:NIPS,2018:5145-5153.
13 Das D,Mellempudi N,Mudigere D,et al. Mixed precision training of convolutional neural networks using integer operations. 2018,arXiv:.
14 Deng L,Li G Q,Han S,et al. Model compression and hardware acceleration for neural networks:A comprehensive survey. Proceedings of the IEEE,2020,108(4):485-532.
15 Lin Y J,Han S,Mao H Z,et al. Deep gradient compression:Reducing the communication bandwidth for distributed training. 2020,arXiv:.
16 Han D,Lee J,Lee J,et al. A 141.4 mW low-power online deep neural network training processor for real-time object tracking in mobile devices∥2018 IEEE International Symposium on Circuits and Systems. Florence,Italy:IEEE,2018:1-5.
17 Zhao W L,Fu H H,Luk W,et al. F-CNN:An FPGA-based framework for training convolutional neural networks∥IEEE 27th International Conference on Application-specific Systems,Architectures and Processors. London,UK:IEEE,2016:107-114.
18 Liu Z Q,Dou Y,Jiang J F,et al. An FPGA?based processor for training convolutional neural networks∥2017 International Conference on Field Programmable Technology. Melbourne,Australia:IEEE,2017:207-210.
19 Fox S,Faraon J,Boland D,et al. Training deep neural networks in low-precision with high accuracy using FPGAs∥2019 International Conference on Field-Programmable Technology. Tianjin,China:IEEE,2019:1-9.
20 Luo C,Sit M K,Fan H X,et al. Towards efficient deep neural network training by FPGA-based batch-level parallelism. Journal of Semiconductors,2020,41(2):022403.