Journal of Nanjing University (Natural Science) ›› 2014, Vol. 50 ›› Issue (4): 457–.


Ensemble feature selection using min-max strategy

Zhou Guojing, Li Yun*

  • Online: 2014-08-23  Published: 2014-08-23
  • Affiliation: College of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Supported by: National Natural Science Foundation of China (61105082); Natural Science Foundation of Jiangsu Province (BK20131378)

Abstract: Feature selection is one of the key problems in machine learning and data mining. It aims to identify a subset of the most useful features that produces results comparable to those of the entire original feature set. It can reduce the dimensionality of the original data, speed up the learning process, and yield comprehensible learning models with good generalization performance. Recently, the ensemble idea has been applied to feature selection by integrating multiple base feature selectors into a single ensemble model; this has proved effective for high-dimensionality, small-sample-size problems, especially robust biomarker identification. In this paper, we aim to improve the efficiency of feature selection on large-scale problems, and propose an ensemble feature selection method based on the min-max strategy. The method consists of three main steps. First, the original data is decomposed into a group of relatively small, balanced subsets according to its structure and class labels. Second, a feature selection method is applied to each sub-problem to obtain a partial result, such as a feature weight vector. Last, the final result is obtained by combining the sub-problem results according to the min-max strategy. Experiments compare the min-max ensemble strategy with three other strategies, namely mean-weight, voting, and k-medoid, in terms of classification accuracy. The G-mean metric is chosen as the evaluation measure because it accounts for class imbalance, and different base feature selection algorithms and data decomposition methods are used to show the robustness of the proposed method. The results on five publicly available real-world datasets demonstrate that the min-max strategy is superior to the others in most cases, and that ensemble feature selection using the min-max strategy can efficiently handle large-scale data.
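
The three steps are only sketched in the abstract; below is a minimal Python sketch of one plausible reading, not the authors' implementation. It assumes binary labels, uses a toy Fisher-style scorer as a stand-in for whatever base feature selector is actually used, and applies the MIN-then-MAX combination rule familiar from min-max modular methods. The function names (split_by_class, fisher_weight, minmax_ensemble_weights) are illustrative, not from the paper.

    import numpy as np

    def split_by_class(X, y, n_parts):
        """Split the examples of each class into n_parts roughly equal chunks."""
        pos = np.flatnonzero(y == 1)
        neg = np.flatnonzero(y == 0)
        return np.array_split(pos, n_parts), np.array_split(neg, n_parts)

    def fisher_weight(X, y):
        """Toy base selector: per-feature Fisher-style score
        (between-class separation over within-class spread).
        Stands in for any feature weighting method."""
        m1, m0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
        s1, s0 = X[y == 1].var(axis=0), X[y == 0].var(axis=0)
        return (m1 - m0) ** 2 / (s1 + s0 + 1e-12)

    def minmax_ensemble_weights(X, y, n_parts=3):
        """Min-max ensemble of per-subset feature weights (illustrative only).

        1. Decompose the data into balanced sub-problems: one positive
           chunk paired with one negative chunk each.
        2. Run the base selector on every sub-problem.
        3. Combine: MIN over sub-problems sharing the same positive
           chunk, then MAX over positive chunks.
        """
        pos_parts, neg_parts = split_by_class(X, y, n_parts)
        n_features = X.shape[1]
        w = np.empty((n_parts, n_parts, n_features))
        for i, p in enumerate(pos_parts):
            for j, n in enumerate(neg_parts):
                idx = np.concatenate([p, n])
                w[i, j] = fisher_weight(X[idx], y[idx])
        return w.min(axis=1).max(axis=0)  # MIN over j, then MAX over i

    # Usage on synthetic data: 200 samples, 10 features, only the first
    # 3 features carry class signal; their weights should stand out.
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, 200)
    X = rng.normal(size=(200, 10))
    X[:, :3] += y[:, None] * 2.0
    print(minmax_ensemble_weights(X, y).round(2))

Taking the minimum across sub-problems that share a positive chunk keeps only features that score well against every negative chunk, and the outer maximum then retains a feature if any positive chunk supports it; whether this matches the paper's exact combination rule is an assumption.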
