Journal of Nanjing University (Natural Sciences) ›› 2016, Vol. 52 ›› Issue (2): 343–.


基于Spark的LIBSVM参数优选并行化算法

李 坤1,2,3,刘 鹏2,3*,吕雅洁1,2,张国鹏2,3,黄宜华4   

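  • Publication date: 2016-03-27 Release date: 2016-03-27
  • Author affiliations: 1. School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221116; 2. Internet of Things (Perception Mine) Research Centre, China University of Mining and Technology, Xuzhou, 221008; 3. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, Xuzhou, 221008; 4. PASA Big-data Laboratory, Department of Computer Science, Nanjing University, Nanjing, 210023
  • Funding:
    Supported by the National High Technology Research and Development Program of China (863 Program) (2013AA06A411), the National Natural Science Foundation of China (61471361), and the Fundamental Research Funds for the Central Universities (2011QNB26)
    Received: 2015-09-29
    *Corresponding author, E-mail: liupeng@cumt.edu.cn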

The parallel algorithms for LIBSVM parameter optimization based on Spark

Li Kun1,2,3,Liu Peng2,3*,Lv Yajie1,2,Zhang Guopeng2,3,Huang Yihua4   

  • Online: 2016-03-27 Published: 2016-03-27
  • Author affiliations: 1. School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221116, China; 2. Internet of Things Perception Mine Research Centre, China University of Mining and Technology, Xuzhou, 221008, China; 3. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, Xuzhou, 221008, China; 4. PASA Big-data Laboratory, Department of Computer Science, Nanjing University, Nanjing, 210023, China

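摘要 (Abstract): This work designs a parallel implementation of LIBSVM parameter selection on a Spark cluster. LIBSVM is a widely used SVM software package, applied to model building, sample training, and result prediction. When a data set is trained with LIBSVM, the choice of parameters has a significant effect on the training result, with the parameters C and g being the most important. The LIBSVM package uses a grid search algorithm to find the best combination of C and g; although this algorithm is parallelized on a single machine, it still takes a great deal of time once the data volume reaches a certain scale. Based on the Spark parallel computing architecture, we design and implement a parallel grid-search algorithm for selecting C and g in LIBSVM. Experimental results show that the proposed parallel coarse-grained grid search for C and g achieves a speedup of nearly 7 times over the traditional algorithm, and that this gain grows further as the cluster is scaled up. In addition, a fine-grained parallel grid search built on the coarse-grained search further improves the selected combination of C and g.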

Abstract: The purpose of this work is to design a parallel implementation of LIBSVM parameter optimization on a Spark cluster. LIBSVM is a widely used software package for model building, sample training, result prediction, and related tasks. When LIBSVM is used to train on a data set, the choice of parameters, especially the parameters C and g, has a significant impact on the training results. LIBSVM uses a grid search algorithm to optimize the combination of C and g; even though the search runs in parallel on a single computer, it takes a long time once the data volume reaches a certain scale. In recent years, with the development of big data and cluster parallel computing and the emergence of in-memory computing platforms such as Apache Spark, the efficiency of parameter optimization can be expected to increase dramatically when it is carried out in parallel on a computing cluster. In this paper, we design and implement parallelized parameter optimization algorithms for LIBSVM based on the Spark parallel computing architecture. Experimental results show that the proposed parallel coarse-grained grid search runs about 7 times as fast as the serial version, and that this speedup increases further as the cluster grows. In addition, applying a fine-grained parallel grid search on top of the coarse-grained search further improves the resulting combination of C and g.
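
The coarse-then-fine grid search described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Spark implementation. Several things here are assumptions made only for illustration: the scoring function `toy_cv_accuracy` is a hypothetical stand-in for LIBSVM cross-validation accuracy, a local thread pool stands in for the cluster, and the exponent ranges and step sizes are chosen to resemble the defaults commonly used with LIBSVM's grid search tool.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import math


def toy_cv_accuracy(log2c, log2g):
    """Hypothetical stand-in for the cross-validation accuracy of an SVM
    trained with C = 2**log2c and g = 2**log2g. It peaks at (3.5, -4.5)."""
    return math.exp(-((log2c - 3.5) ** 2 + (log2g + 4.5) ** 2) / 20.0)


def frange(start, stop, step):
    """Inclusive list of evenly spaced floats from start to stop."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]


def grid_search(score, c_exps, g_exps, workers=4):
    """Score every (log2 C, log2 g) pair in parallel; return the best pair
    and its score. Each grid point is independent, so on a Spark cluster the
    same map step could instead be distributed over an RDD of grid points."""
    grid = list(product(c_exps, g_exps))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda point: score(*point), grid))
    best = max(range(len(grid)), key=lambda i: scores[i])
    return grid[best], scores[best]


def coarse_then_fine(score, coarse_step=2.0, fine_step=0.25):
    """Run a coarse pass over the full range, then a fine pass around the
    coarse optimum. Ranges and steps are illustrative assumptions."""
    (c0, g0), s0 = grid_search(
        score, frange(-5, 15, coarse_step), frange(-15, 3, coarse_step)
    )
    # The fine grid spans one coarse cell on each side of the coarse optimum
    # and contains the coarse optimum itself, so it can never score worse.
    (c1, g1), s1 = grid_search(
        score,
        frange(c0 - coarse_step, c0 + coarse_step, fine_step),
        frange(g0 - coarse_step, g0 + coarse_step, fine_step),
    )
    return (c0, g0, s0), (c1, g1, s1)


if __name__ == "__main__":
    coarse, fine = coarse_then_fine(toy_cv_accuracy)
    print("coarse optimum (log2 C, log2 g, score):", coarse)
    print("fine optimum   (log2 C, log2 g, score):", fine)
```

Because every grid point is scored independently, the search parallelizes naturally, which is what makes a cluster-based version attractive. In a real setting the scoring function would train and cross-validate an SVM for each (C, g) pair, and that cost dominates the run time.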
