Journal of Nanjing University (Natural Science) ›› 2018, Vol. 54 ›› Issue (2): 452–.


 Classification algorithm of high-dimensional and imbalanced data based on support vector machine

 Zhao Xiaoqiang1,2,3*, Zhang Lu1

  • Online: 2018-03-31 Published: 2018-03-31
  • About author: 1. College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou, 730050, China;
    2. Key Laboratory of Gansu Advanced Control for Industrial Processes, Lanzhou, 730050, China;
    3. National Demonstration Center for Experimental Electrical and Control Engineering Education, Lanzhou University of Technology, Lanzhou, 730050, China
  • Supported by:
    Foundation items: National Natural Science Foundation of China (61763029, 61370037), Gansu Province Basic Research Innovation Group (1506RJIA031)
    Received: 2017-12-21
    *Corresponding author, E-mail: xqzhao@lut.cn

Abstract:  As data volumes keep growing, high-dimensional and imbalanced data have become common, and traditional data mining classifiers perform poorly on them because they are sensitive to the sample distribution and to the number of dimensions. This paper proposes an improved Support Vector Machine (SVM) classification algorithm for high-dimensional, imbalanced datasets. First, the algorithm maps the original imbalanced dataset into feature space with a kernel function and applies an improved kernel SMOTE (Kernel Synthetic Minority Over-sampling Technique) algorithm to synthesize positive-class samples: for each positive sample it searches feature space for homogeneous and heterogeneous K-nearest neighbors, sets an adaptive neighbor threshold according to the internal distribution of the samples, and obtains the K-nearest-neighbor set used to balance the numbers of samples in the two classes. Then, still in feature space, a sparsity score is computed for each feature of the training samples, the features are ranked by that score, and feature selection based on sparse representation reduces the dimensionality of the dataset. Finally, the pre-images of the synthetic samples are located in input space using the distance relation between feature space and input space, and an SVM is trained on the resulting balanced dataset. Simulation experiments show that the proposed algorithm improves classification performance on high-dimensional and imbalanced datasets.
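The oversampling step summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm. It ranks positive-class neighbors by feature-space distance using the kernel trick, d²(x, y) = k(x, x) − 2k(x, y) + k(y, y), and then interpolates between a sample and one of its neighbors. The RBF kernel, the fixed neighbor count `k`, and the function names (`rbf_kernel`, `kernel_distance_sq`, `kernel_smote`) are assumptions made for this example. The adaptive neighbor threshold, the sparse-representation feature selection and the SVM training are omitted, and the pre-image step is replaced by plain interpolation in input space.

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance_sq(x, y, kernel=rbf_kernel):
    """Squared distance between phi(x) and phi(y) in feature space,
    computed from kernel values only (no explicit feature map)."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def kernel_smote(minority, n_new, k=3, kernel=rbf_kernel, seed=0):
    """Generate n_new synthetic minority samples (needs >= 2 samples).

    Neighbors are ranked by feature-space distance. Interpolating in
    input space is a simplification standing in for the pre-image
    step; it is an assumption of this sketch, not the paper's method.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # All other minority samples, nearest in feature space first.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: kernel_distance_sq(x, m, kernel))
        neighbor = rng.choice(others[:k])
        delta = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + delta * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Toy positive class: four points on the unit square (made-up data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = kernel_smote(minority, n_new=6, k=2)
```

Each synthetic point lies on a segment between two existing positive samples, so the new samples stay inside the region the minority class already occupies. With an RBF kernel the feature-space ranking matches the Euclidean ranking, so a different kernel would be needed for the two neighbor sets to differ.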
