Journal of Nanjing University (Natural Sciences) ›› 2018, Vol. 54 ›› Issue (2): 452.
Zhao Xiaoqiang1,2,3*, Zhang Lu1
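Abstract: As data volumes continue to grow, large amounts of imbalanced, high-dimensional data have appeared. Traditional data-mining classification algorithms are easily affected by the sample distribution and the dimensionality when handling such data, and their classification performance suffers. This paper proposes an improved support vector machine (SVM) classification algorithm for imbalanced high-dimensional datasets. First, a kernel function maps the dataset into the feature space, and an improved kernel SMOTE (Kernel Synthetic Minority Over-sampling Technique) algorithm then synthesizes positive-class samples so that the two classes are balanced in number. Next, the high-dimensional dataset is projected into a low-dimensional space by sparse representation, which reduces its dimensionality. Finally, the pre-images of the synthetic samples in the input space are determined from spatial distance relations, and the resulting balanced sample set is classified with an SVM. Simulation experiments verify that the algorithm achieves better classification performance on high-dimensional imbalanced datasets.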
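The oversampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' improved kernel SMOTE: it assumes an RBF kernel, picks neighbours by kernel-induced distance, and then interpolates directly in the input space as a crude stand-in for the pre-image step. The sparse-representation projection and the SVM itself are omitted. The names `rbf_kernel`, `kernel_smote`, `gamma`, and `k` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_smote(X_min, n_new, k=3, gamma=0.5, seed=0):
    """Synthesize n_new minority samples (simplified, assumption-laden sketch)."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X_min, X_min, gamma)
    diag = np.diag(K)
    # Squared distance in feature space: k(x,x) + k(y,y) - 2 k(x,y)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))           # random minority sample
        b = neighbours[a, rng.integers(k)]     # one of its k kernel-space neighbours
        lam = rng.random()                     # interpolation weight in [0, 1)
        # Assumption: interpolate in input space instead of solving the pre-image problem
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic

# Toy data: 40 majority samples, 8 minority (positive) samples, 5 features
rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.0, size=(40, 5))
X_min = rng.normal(2.0, 1.0, size=(8, 5))
X_new = kernel_smote(X_min, n_new=len(X_maj) - len(X_min))
X_bal_min = np.vstack([X_min, X_new])
print(X_bal_min.shape)  # (40, 5)
```

With an RBF kernel the feature-space distance is a monotone function of the Euclidean distance, so the neighbours here coincide with ordinary input-space neighbours; with other kernels the two neighbourhoods can differ, which is where a genuine pre-image computation would matter.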