南京大学学报(自然科学版) ›› 2021, Vol. 57 ›› Issue (6): 1075–1082. doi: 10.13232/j.cnki.jnju.2021.06.016


基于FPGA的卷积神经网络训练加速器设计

孟浩1,2, 刘强1,2

  1. 天津大学微电子学院,天津,300072
    2. 天津市成像与感知微电子技术重点实验室,天津,300072
  • 收稿日期:2021-05-31 出版日期:2021-12-03 发布日期:2021-12-03
  • 通讯作者: 刘强 E-mail:qiangliu@tju.edu.cn
  • 基金资助:
    国家自然科学基金(61974102)

An FPGA-based convolutional neural network training accelerator

Hao Meng1,2, Qiang Liu1,2

  1. School of Microelectronics, Tianjin University, Tianjin, 300072, China
    2. Tianjin Key Laboratory of Imaging and Sensing Microelectronics Technology, Tianjin, 300072, China
  • Received:2021-05-31 Online:2021-12-03 Published:2021-12-03
  • Contact: Qiang Liu E-mail:qiangliu@tju.edu.cn

摘要:

近年来卷积神经网络在图像分类、图像分割等任务中应用广泛.针对基于FPGA (Field Programmable Gate Array)的卷积神经网络训练加速器中存在的权重梯度计算效率低和加法器占用资源多的问题,设计一款高性能的卷积神经网络训练加速器.首先提出一种卷积单引擎架构,在推理卷积硬件架构的基础上增加额外的自累加单元,可兼容卷积层的正向传播与反向传播(误差反向传递和权重梯度计算),提高加速器的复用能力,同时提升权重梯度计算的效率;然后提出一种适配卷积核内加法树与自累加单元的新型加法树设计,进一步节约计算资源;最后在Xilinx Zynq xc7z045平台上实现了所提出的训练加速器,并基于CIFAR-10数据集训练VGG-like (Visual Geometry Group)网络模型.实验结果表明,在200 MHz的时钟频率下,支持8位定点的训练加速器可以达到64.6 GOPS (Giga Operations per Second)的平均性能,是Intel Xeon E5-2630 v4 CPU (Central Processing Unit)训练平台的9.36倍,能效是NVIDIA Tesla K40C GPU (Graphics Processing Unit)训练平台的17.96倍.与已有的FPGA加速器相比,提出的加速器在处理性能和存储资源使用效率上具有优势.

关键词: 卷积神经网络, 深度学习, 图像处理, 现场可编程门阵列

Abstract:

In recent years, convolutional neural networks have been widely used in image classification and image segmentation tasks. However, FPGA (Field Programmable Gate Array) based convolutional neural network training accelerators suffer from inefficient weight-gradient computation and high adder resource usage. This paper designs a high-performance convolutional neural network training accelerator. First, we propose a convolutional single-engine architecture that adds a self-accumulation unit to an inference convolution architecture, so that one engine supports both the forward propagation and the back propagation (error back-propagation and weight-gradient computation) of convolutional layers, improving the engine's reusability and the efficiency of weight-gradient computation. Then, we design a novel adder tree that combines the intra-kernel adder tree with the self-accumulation units, further saving computational resources. Finally, the proposed training accelerator is implemented on the Xilinx Zynq xc7z045 platform and trains a VGG-like (Visual Geometry Group) network model on the CIFAR-10 dataset. Experimental results show that at a clock frequency of 200 MHz, the 8-bit fixed-point training accelerator achieves an average performance of 64.6 GOPS (Giga Operations per Second), 9.36 times that of an Intel Xeon E5-2630 v4 CPU (Central Processing Unit) training platform, and its energy efficiency is 17.96 times that of an NVIDIA Tesla K40C GPU (Graphics Processing Unit) training platform. Compared with several existing FPGA accelerators, the proposed accelerator has advantages in processing performance and storage resource efficiency.
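The single-engine reuse described in the abstract rests on a standard identity: the weight gradient of a convolutional layer is itself a convolution of the layer's input with the output error. A minimal pure-Python sketch of that identity (illustrative only, not the paper's hardware; single channel, stride 1, no padding):

```python
# Sketch: the forward pass and the weight-gradient pass run through the
# SAME convolution routine -- which is why one conv engine plus a
# self-accumulation unit can serve both, as the paper proposes.

def conv2d(x, k):
    """Plain 2D cross-correlation, 'valid' mode."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = [[0.0] * (W - kw + 1) for _ in range(H - kh + 1)]
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            out[i][j] = sum(x[i + a][j + b] * k[a][b]
                            for a in range(kh) for b in range(kw))
    return out

# Forward pass: y = conv(x, w).
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
w = [[1.0, 0.0],
     [0.0, 1.0]]
y = conv2d(x, w)        # 2x2 output feature map

# Weight-gradient pass: dw = conv(x, dy), with the output error dy acting
# as the kernel. Same engine, different operands.
dy = [[1.0, 1.0],
      [1.0, 1.0]]
dw = conv2d(x, dy)      # same shape as w
```

Here `x`, `w`, and `dy` are toy values chosen for illustration; the paper's engine additionally tiles channels and filters in hardware.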

Key words: convolutional neural network, deep learning, image processing, field programmable gate array

CLC number: TP338.6

Figure 1. Convolution operations in forward and backward propagation
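The backward error pass shown in Figure 1 follows another standard identity: the input gradient is the zero-padded output error convolved with the 180°-rotated kernel. A pure-Python sketch under those assumptions (stride 1, 'valid' forward convolution; not the paper's implementation):

```python
# Sketch of error back-propagation as a convolution:
# dx = conv(pad(dy), rot180(w)), again reusable on one conv engine.

def rot180(k):
    """Rotate a 2D kernel by 180 degrees."""
    return [row[::-1] for row in k[::-1]]

def pad(x, p):
    """Zero-pad a 2D map by p on every side."""
    W = len(x[0])
    zeros = [0.0] * (W + 2 * p)
    return ([zeros[:] for _ in range(p)]
            + [[0.0] * p + row + [0.0] * p for row in x]
            + [zeros[:] for _ in range(p)])

def conv2d(x, k):
    """Plain 2D cross-correlation, 'valid' mode."""
    H, W, kh, kw = len(x), len(x[0]), len(k), len(k[0])
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(W - kw + 1)] for i in range(H - kh + 1)]

dy = [[1.0, 2.0],
      [3.0, 4.0]]           # output error from the next layer
w = [[1.0, 0.0],
     [0.0, 2.0]]            # this layer's kernel
dx = conv2d(pad(dy, 1), rot180(w))   # gradient w.r.t. the 3x3 input
```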

Figure 2. Overall architecture of the accelerator

Figure 3. Modules of the convolution computing unit

Figure 4. Weight gradient computation and its optimization

Figure 5. Accumulation scheme of the novel adder tree
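The accumulation pattern behind Figures 3-5 can be sketched as two stages: a balanced adder tree reduces one cycle's kernel products, and a self-accumulation register folds successive partial sums across input-channel tiles. This is only a behavioral sketch; the paper's contribution is a merged tree that shares hardware between these two stages, which the sketch does not attempt to reproduce.

```python
# Behavioral sketch of kernel adder tree + self-accumulation unit.

def adder_tree(values):
    """Pairwise reduction, mirroring a balanced hardware adder tree."""
    while len(values) > 1:
        values = [values[i] + values[i + 1] if i + 1 < len(values)
                  else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

def accumulate_tiles(tiles):
    """Fold one tree result per cycle into a self-accumulation register."""
    acc = 0.0                       # the self-accumulation register
    for products in tiles:          # one tile of kernel products per cycle
        acc += adder_tree(products)
    return acc
```

For a 3×3 kernel, `adder_tree` reduces 9 products per cycle, and `accumulate_tiles` sums the per-tile results over however many input-channel tiles the layer needs.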

Table 1. Structure of the VGG-like network model

Layer | In channels | Kernels | Output H×W | Kernel size
Conv1 | 3    | 128  | 32×32 | 3×3
Conv2 | 128  | 128  | 32×32 | 3×3
Pool  | 128  | 128  | 16×16 | 2×2
Conv3 | 128  | 256  | 16×16 | 3×3
Conv4 | 256  | 256  | 16×16 | 3×3
Pool  | 256  | 256  | 8×8   | 2×2
Conv5 | 256  | 512  | 8×8   | 3×3
Conv6 | 512  | 512  | 8×8   | 3×3
Pool  | 512  | 512  | 4×4   | 2×2
FC    | 8192 | 1024 | -     | -
FC    | 1024 | 10   | -     | -
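The shapes in Table 1 can be cross-checked by tracking the feature map through the layers. A sketch (assuming the 3×3 convolutions preserve height/width and each 2×2 pooling halves them, as the table's output sizes imply):

```python
# Cross-check of Table 1: the flattened feature map entering the first
# fully connected layer should be 512 channels * 4 * 4 = 8192.

layers = [
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("pool", None),
    ("conv", 512), ("conv", 512), ("pool", None),
]

c, hw = 3, 32                  # CIFAR-10 input: 3 channels, 32x32
for kind, out_c in layers:
    if kind == "conv":         # 3x3 conv, shape-preserving
        c = out_c
    else:                      # 2x2 pooling halves height and width
        hw //= 2

flat = c * hw * hw             # input width of the first FC layer
```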

Table 2. Resource utilization of the accelerator

Resource    | LUT    | FF     | BRAM   | DSP
Total       | 218600 | 437200 | 545    | 900
Used        | 54725  | 74458  | 134.5  | 615
Utilization | 25.03% | 17.03% | 24.68% | 68.33%

Table 3. Comparison of adder tree resource usage (values given as LUT, FF)

Config     | Before: kernel adder tree | Before: self-accumulation unit | Before: total | After optimization | Ratio
PC=1, PF=1 | 257, 388                  | 480, 680                       | 737, 1068     | 607, 484           | 82.3%, 45.3%
PC=8, PF=8 | 18264, 26648              | 30720, 43520                   | 48984, 70168  | 40064, 32792       | 81.8%, 46.7%

Table 4. Performance comparison of CPU, GPU and FPGA

Parameter                     | CPU                   | GPU               | FPGA
Platform                      | Intel Xeon E5-2630 v4 | NVIDIA Tesla K40C | Xilinx Zynq xc7z045
Process (nm)                  | 14                    | 28                | 28
Frequency (GHz)               | 2.2                   | 0.745             | 0.2
Precision                     | 32-bit floating point | 32-bit floating point | 8-bit fixed point
Training time per batch (ms)  | 539.97                | 13.62             | 57.60
Power (W)                     | 85                    | 245               | 3.21
Performance (GOPS)            | 6.9                   | 273.3             | 64.6
Energy efficiency (GOPS·W⁻¹)  | 0.08                  | 1.12              | 20.12
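The energy-efficiency row of Table 4 is simply performance divided by power, rounded to two decimals as in the table:

```python
# Arithmetic behind Table 4's energy-efficiency row (GOPS / W).
platforms = {
    "CPU":  (6.9, 85.0),     # (performance in GOPS, power in W)
    "GPU":  (273.3, 245.0),
    "FPGA": (64.6, 3.21),
}
efficiency = {name: round(gops / watts, 2)
              for name, (gops, watts) in platforms.items()}
```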

Table 5. Performance comparison of different FPGA training accelerators

Accelerator      | Platform            | Model    | Dataset  | Data type              | DSP/LUTs/FFs/BRAM       | Power (W) | Performance (GOPS) | DSP efficiency (GOPS/DSP) | BRAM efficiency (GOPS/BRAM) | Energy efficiency (GOPS·W⁻¹)
Zhao et al. [17] | Altera Stratix V    | AlexNet  | -        | single-precision float | ≈2214/-/-/-             | -         | 62.06              | 0.028                     | -                           | -
Liu et al. [18]  | Xilinx ZU19EG       | LeNet-5  | CIFAR-10 | single-precision float | 1500/329.3k/466.0k/174  | 14.24     | 17.96              | 0.012                     | 0.10                        | 1.26
Fox et al. [19]  | Zynq ZCU111         | VGG16    | CIFAR-10 | 8-bit fixed point      | 1037/73.1k/25.6k/1045   | -         | 205                | 0.19                      | 0.20                        | -
Luo et al. [20]  | UltraScale+ XCVU9P  | VGG-like | CIFAR-10 | 8-bit fixed point      | 4202/480k/-/≈4717       | 13.5      | 1417               | 0.33                      | 0.30                        | 104.9
This work        | ZC706 xc7z045       | VGG-like | CIFAR-10 | 8-bit fixed point      | 615/54.7k/74.5k/134.5   | 3.21      | 64.6               | 0.105                     | 0.48                        | 20.1
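Table 5's per-resource efficiency figures follow from the same kind of division; the check below uses this work's row:

```python
# Per-resource efficiency as used in Table 5, for this work's row.
gops, dsps, brams = 64.6, 615, 134.5

dsp_eff = round(gops / dsps, 3)     # GOPS per DSP slice
bram_eff = round(gops / brams, 2)   # GOPS per BRAM
```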
1 Redmon J,Farhadi A. YOLO9000:Better,faster,stronger∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu,HI,USA:IEEE,2017:6517-6525.
2 Russakovsky O,Deng J,Su H,et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision,2015,115(3):211-252.
3 Zhang Y,Chan W,Jaitly N. Very deep convolutional networks for end-to-end speech recognition∥Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing. New Orleans,LA,USA:IEEE,2017:4845-4849.
4 Qiu J T,Wang J,Yao S,et al. Going deeper with embedded FPGA platform for convolutional neural network∥Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. New York:ACM Press,2016:26-35.
5 Zhao R Z,Niu X Y,Wu Y J,et al. Optimizing CNN-based object detection algorithms on embedded FPGA platforms∥13th International Symposium on Applied Reconfigurable Computing. Springer Berlin Heidelberg,2017:255-267.
6 仇越,马文涛,柴志雷. 一种基于FPGA的卷积神经网络加速器设计与实现. 微电子学与计算机,2018,35(8):68-72,77.
Qiu Y,Ma W T,Chai Z L. Design and implementation of a convolutional neural network accelerator based on FPGA. Microelectronics & Computer,2018,35(8):68-72,77.
7 陈辰,柴志雷,夏珺. 基于Zynq7000 FPGA异构平台的YOLOv2加速器设计与实现. 计算机科学与探索,2019,13(10):1677-1693.
Chen C,Chai Z L,Xia J. Design and implementation of YOLOv2 accelerator based on Zynq7000 FPGA heterogeneous platform. Journal of Frontiers of Computer Science and Technology,2019,13(10):1677-1693.
8 曾成龙,刘强. 面向嵌入式FPGA的高性能卷积神经网络加速器设计. 计算机辅助设计与图形学学报,2019,31(9):1645-1652.
Zeng C L,Liu Q. Design of high performance convolutional neural network accelerator for embedded FPGA. Journal of Computer?Aided Design and Computer Graphics,2019,31(9):1645-1652.
9 徐欣,刘强,王少军. 一种高度并行的卷积神经网络加速器设计方法. 哈尔滨工业大学学报,2020,52(4):31-37.
Xu X,Liu Q,Wang S J. A highly parallel design method for convolutional neural networks accelerator. Journal of Harbin Institute of Technology,2020,52(4):31-37.
10 Li F F,Zhang B,Liu B. Ternary weight networks. 2016,arXiv:.
11 Zhou S C,Wu Y X,Ni Z K,et al. DoReFa-Net:Training low bitwidth convolutional neural networks with low bitwidth gradients. 2018,arXiv:.
12 Banner R,Hubara I,Hoffer E,et al. Scalable methods for 8-bit training of neural networks∥Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montreal,Canada:NIPS,2018:5145-5153.
13 Das D,Mellempudi N,Mudigere D,et al. Mixed precision training of convolutional neural networks using integer operations. 2018,arXiv:.
14 Deng L,Li G Q,Han S,et al. Model compression and hardware acceleration for neural networks:A comprehensive survey. Proceedings of the IEEE,2020,108(4):485-532.
15 Lin Y J,Han S,Mao H Z,et al. Deep gradient compression:Reducing the communication bandwidth for distributed training. 2020,arXiv:.
16 Han D,Lee J,Lee J,et al. A 141.4 mW low-power online deep neural network training processor for real-time object tracking in mobile devices∥2018 IEEE International Symposium on Circuits and Systems. Florence,Italy:IEEE,2018:1-5.
17 Zhao W L,Fu H H,Luk W,et al. F-CNN:An FPGA-based framework for training convolutional neural networks∥IEEE 27th International Conference on Application-specific Systems,Architectures and Processors. London,UK:IEEE,2016:107-114.
18 Liu Z Q,Dou Y,Jiang J F,et al. An FPGA?based processor for training convolutional neural networks∥2017 International Conference on Field Programmable Technology. Melbourne,Australia:IEEE,2017:207-210.
19 Fox S,Faraon J,Boland D,et al. Training deep neural networks in low-precision with high accuracy using FPGAs∥2019 International Conference on Field-Programmable Technology. Tianjin,China:IEEE,2019:1-9.
20 Luo C,Sit M K,Fan H X,et al. Towards efficient deep neural network training by FPGA-based batch-level parallelism. Journal of Semiconductors,2020,41(2):022403.