Journal of Nanjing University (Natural Science) ›› 2019, Vol. 55, Issue (5): 733-739. doi: 10.13232/j.cnki.jnju.2019.05.004


A load balanced LSTM hardware accelerator design

Yi Zha, Hongbing Pan

  1. School of Electronic Science and Engineering, Nanjing University, Nanjing, 210093, China
  • Received: 2019-05-06 Online: 2019-09-30 Published: 2019-11-01
  • Contact: Hongbing Pan E-mail: liningrigs@vip.sina.com
  • Supported by:
    National Natural Science Foundation of China (61376075)

Abstract:

Neural networks are increasingly deployed on embedded devices. To meet the low-power, low-latency requirements of embedded platforms, a common solution is to compress the LSTM (Long Short-Term Memory) model, e.g. by pruning, and to design a dedicated hardware accelerator for it. After pruning, however, the network's weight matrices become sparse and irregular, which causes load imbalance across the PE (Processing Element) units. This paper redistributes the weight matrix to the PE units according to certain rules by a sorting-based method and, on that basis, customizes dedicated hardware units for the sparse model. Experiments on a Xilinx Zynq XCZU9EG-2FFVB1156E development board show that, at the cost of 0.314% additional hardware resources in the PE units, the design achieves a 2% improvement in computing speed.

Key words: neural network accelerator, model compression, load balance, embedded design
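The sorting-based redistribution described in the abstract can be sketched in software. The greedy least-loaded-PE strategy and all names below are illustrative assumptions, not the paper's exact scheme: rows of the pruned weight matrix are sorted by non-zero count and handed, heaviest first, to whichever PE currently holds the least work.

```python
# Sketch of sorting-based load balancing across PEs (assumed strategy:
# sort rows by non-zero count, then greedily assign to the lightest PE).
from heapq import heappush, heappop

def balance_rows(nnz_per_row, num_pes):
    """Assign matrix rows to PEs so per-PE non-zero counts stay balanced."""
    # (load, pe_id) min-heap: the next-heaviest row goes to the lightest PE
    heap = [(0, pe) for pe in range(num_pes)]
    assignment = {pe: [] for pe in range(num_pes)}
    rows = sorted(range(len(nnz_per_row)), key=lambda r: -nnz_per_row[r])
    for r in rows:
        load, pe = heappop(heap)
        assignment[pe].append(r)
        heappush(heap, (load + nnz_per_row[r], pe))
    return assignment

# Example: 8 rows with uneven sparsity, 2 PEs
nnz = [9, 1, 7, 3, 8, 2, 6, 4]
a = balance_rows(nnz, 2)
loads = [sum(nnz[r] for r in rows) for rows in a.values()]
print(loads)  # both PEs end up with 20 non-zeros each
```

Since per-PE runtime is roughly proportional to the non-zeros a PE processes, equalizing these counts is what removes the idle time that an unbalanced assignment would cause.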

CLC number:

  • TN492

Fig. 1 Schematic of the LSTM cell structure

Fig. 2 Reorganization and packing of the weight matrix

Fig. 3 Weight-matrix reordering and the PE assignment process (an overview of Steps 1-3 above)

Fig. 4 Schematic of the computation modules of the LSTM algorithm

Fig. 5 Blocked matrix-vector multiplication

Fig. 6 State diagram of the sparse matrix-vector multiplication operation

Fig. 7 Internal modules of the SpMV unit of a single PE
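The SpMV units of Figs. 5-7 operate on a compressed representation of the pruned weight matrix. A minimal software analogue, assuming a generic CSR (compressed sparse row) layout rather than the accelerator's exact encoding:

```python
# Minimal CSR sparse matrix-vector multiply, sketching the computation a
# per-PE SpMV unit performs. The CSR layout is a generic assumption.
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR form (values, col_idx, row_ptr)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # each PE walks the non-zeros of the rows assigned to it and accumulates
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A = [[2, 0, 1],
#      [0, 3, 0]]
vals, cols, ptr = [2.0, 1.0, 3.0], [0, 2, 1], [0, 2, 3]
print(spmv_csr(vals, cols, ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

Because the inner loop length equals a row's non-zero count, rows of very different density translate directly into very different PE runtimes, which is the imbalance the reordering in Figs. 2-3 targets.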

Table 1 Performance comparison of sparse matrix-vector multiplication on different platforms

                               CPU      GPU      ESE      This design
Computation time (μs)          167.6    14.5     5.36     6.68
Speedup (vs. CPU)              1×       11.56×   31.27×   25.09×
Power (W)                      38       202      41       7.10
Power ratio (vs. CPU)          1×       5.32×    1.07×    0.18×
Energy efficiency
(speedup / power ratio)        1×       2.17×    29.22×   139.3×
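As a sanity check, the derived rows of Table 1 can be recomputed from the measured time and power rows, assuming (as the reported values indicate) that the CPU column is the baseline:

```python
# Recomputing Table 1's derived rows from its measured rows.
# Assumption, consistent with the reported values: CPU is the baseline.
time_us = {"CPU": 167.6, "GPU": 14.5, "ESE": 5.36, "this design": 6.68}
power_w = {"CPU": 38.0, "GPU": 202.0, "ESE": 41.0, "this design": 7.10}

for p in ("GPU", "ESE", "this design"):
    speedup = time_us["CPU"] / time_us[p]        # speedup over CPU
    power_ratio = power_w[p] / power_w["CPU"]    # power relative to CPU
    efficiency = speedup / power_ratio           # energy efficiency
    print(f"{p}: {speedup:.2f}x speedup, "
          f"{power_ratio:.2f}x power, {efficiency:.1f}x efficiency")

# Note: the table's 139.3x for this design follows from dividing by the
# power ratio rounded to 0.18x; the unrounded ratio gives about 134x.
```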

Table 2 Comparison on matrices of different sparsity

Sparsity (fraction of zeros)   Scheme            Computation time (μs)   Difference (μs)
0.883                          Load-imbalanced   6.84                    0.16
                               This design       6.68
0.483                          Load-imbalanced   25.95                   0.665
                               This design       25.285
0.083                          Load-imbalanced   45.02                   0.06
                               This design       44.96
1 Graves A, Mohamed A R, Hinton G. Speech recognition with deep recurrent neural networks∥Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013: 6645-6649.
2 Rybalkin V, Wehn N, Yousefi M R, et al. Hardware architecture of bidirectional long short-term memory neural network for optical character recognition∥Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. Lausanne, Switzerland: IEEE, 2017.
3 Cho K, Van Merrienboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
4 Cho K, Van Merrienboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
5 Ye J M, Wang L N, Li G X, et al. Learning compact recurrent neural networks with block-term tensor decomposition. arXiv:1712.05134, 2018.
6 Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
7 Guan Y J, Yuan Z H, Sun G Y, et al. FPGA-based accelerator for long short-term memory recurrent neural networks∥2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC). Chiba, Japan: IEEE, 2017: 629-634.
8 Ouyang P, Yin S Y, Wei S J. A fast and power efficient architecture to parallelize LSTM based RNN for cognitive intelligence applications∥2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC). Austin, TX, USA: IEEE, 2017: 1-6.
9 Han S, Kang J L, Mao H Z, et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA∥Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2017.
10 Han S, Liu X Y, Mao H Z, et al. EIE: Efficient inference engine on compressed deep neural network∥2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). Seoul, South Korea: IEEE, 2016: 243-254.
11 Wang Z S, Lin J, Wang Z F. Accelerating recurrent neural networks: A memory-efficient approach. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2017, 25(10): 2763-2775.
12 Wang S, Li Z, Ding C W, et al. C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs∥Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Monterey, CA, USA: ACM, 2018.
13 Conti F, Cavigelli L, Paulin G, et al. CHIPMUNK: A systolically scalable 0.9 mm², 3.08 Gop/s/mW @ 1.2 mW accelerator for near-sensor recurrent neural network inference. arXiv:1711.05734, 2018.
14 Liu B, Dong W, Xu T T, et al. E-ERA: An energy-efficient reconfigurable architecture for RNNs using dynamically adaptive approximate computing. IEICE Electronics Express, 2017, 14(15): 20170637.
15 Shin D, Lee J, Lee J, et al. 14.2 DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks∥2017 IEEE International Solid-State Circuits Conference (ISSCC). San Francisco, CA, USA: IEEE, 2017.
16 Park J, Kung J, Yi W, et al. Maximizing system performance by balancing computation loads in LSTM accelerators∥2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). Dresden, Germany: IEEE, 2018.