Journal of Nanjing University (Natural Sciences) ›› 2016, Vol. 52 ›› Issue (1): 203–211.


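Extreme learning machine ensemble learning for imbalanced data based on multi-class resampling

Xing Sheng 1,2, Wang Xizhao 3*, Wang Xiaolan 4

  • Publication date: 2016-01-27 Release date: 2016-01-27
  • About the authors: (1. College of Management, Hebei University, Baoding, 071002; 2. Department of Computer, Cangzhou Normal University, Cangzhou, 061001; 3. College of Mathematics and Information Science, Hebei University, Baoding, 071002; 4. Department of Information Engineering, Cangzhou Technical College, Cangzhou, 061001)
  • Funding:
    Supported by: National Natural Science Foundation of China (61170040, 71371063)
    Received: 2015-06-16
    *Corresponding author, E-mail: xizhaowang01@126.com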

Extreme learning machine ensemble learning based on multi class resampling for imbalanced data

Xing Sheng 1,2, Wang Xizhao 3*, Wang Xiaolan 4   

  • Online:2016-01-27 Published:2016-01-27
  • About author: (1. College of Management, Hebei University, Baoding, 071002, China; 2. Department of Computer, Cangzhou Normal University, Cangzhou, 061001, China; 3. College of Mathematics and Information Science, Hebei University, Baoding, 071002, China; 4. Department of Information Engineering, Cangzhou Technical College, Cangzhou, 061001, China)

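Summary: Although the extreme learning machine (ELM) has been shown, in both theory and applications, to have good generalization performance and very fast training speed, when it handles imbalanced data it is biased toward the majority class and readily ignores the minority class. Ensemble learning based on data resampling can help ELM address its low classification accuracy on the minority class. This paper proposes a per-class resampling technique and, building on it, develops an ELM ensemble learning method. The method makes full use of the information in the minority-class samples, and experimental results show that it clearly outperforms a single ELM model. Because resampling is one of the core techniques of big data processing, the method offers general guidance for building learning models on imbalanced big data.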

Abstract: Although the extreme learning machine (ELM) has been shown, in theory and in applications, to offer good generalization performance and fast training, it does not handle imbalanced data well: it is biased towards the majority class and tends to ignore the minority class, which leads to low classification accuracy on the minority class. Imbalanced data are common in practice. Examples include identifying fraudulent credit card transactions, detecting network intrusions, learning word pronunciations, predicting preterm births, medical diagnosis, predicting telecommunication equipment failures, and detecting oil spills in satellite images. The main strategies for imbalanced classification problems are resampling, ensemble learning, and cost-sensitive learning. Resampling is the principal approach to processing imbalanced data, and its basic forms are under-sampling and over-sampling. This paper proposes a resampling technique based on multiple under-sampling and, building on it, develops an ELM ensemble learning method. Ensemble learning based on data resampling can help ELM overcome its low accuracy on the minority class. To assess classification on imbalanced data, the experiments use F-measure and G-mean as evaluation criteria; the higher these values, the better the performance on the minority class. The experimental results show that the proposed method achieves higher F-measure and G-mean values than a single ELM model, which indicates that ELM ensemble learning based on multiple under-sampling improves classification performance on the minority class.
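The abstract names F-measure and G-mean as evaluation criteria but does not state their formulas. The sketch below assumes the definitions commonly used for binary imbalanced classification: F-measure as the harmonic mean of precision and recall on the minority class (that is, with the weighting parameter set to 1), and G-mean as the geometric mean of minority-class recall and majority-class recall. The function name `f_measure_and_g_mean` and the `positive` argument are illustrative and do not come from the paper.

```python
import math


def f_measure_and_g_mean(y_true, y_pred, positive=1):
    """Return (F-measure, G-mean), treating `positive` as the minority class.

    Assumed definitions (not taken from the paper):
      F-measure = 2 * precision * recall / (precision + recall)
      G-mean    = sqrt(recall * specificity)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against empty denominators so degenerate predictions score 0.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0

    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_measure, g_mean
```

Under these definitions, a classifier that labels every sample as the majority class scores 0 on both measures even when its overall accuracy is high, which is presumably why such measures are preferred over plain accuracy for imbalanced data.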
Because each classifier is independent of the others from training through testing, up to the final vote, the resampling method lends itself to parallel computation: a large data set is decomposed into several small data sets, each of which is learned by a separate ELM, and this can improve computing speed. The method is not limited to ELM ensembles and can be extended to ensembles of other classifiers. Because resampling is one of the core techniques of data processing, the method offers general guidance for building learning models on large imbalanced data sets.
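The abstract does not give the procedure in detail, so the following is only a minimal sketch of one plausible reading, restricted to the binary case. It assumes that the majority class is split into disjoint chunks of roughly minority-class size, that each chunk is combined with all minority samples to train one ELM, and that the ELMs are combined by majority vote. The ELM itself is the basic form (random sigmoid hidden layer, output weights obtained with a pseudoinverse). The names `ELM`, `undersampled_ensemble`, and `vote`, the label convention (1 = minority, 0 = majority), and the default hidden-layer size are all assumptions made for illustration.

```python
import numpy as np


class ELM:
    """Basic single-hidden-layer ELM for binary labels in {0, 1}."""

    def __init__(self, n_hidden=20, rng=None):
        self.n_hidden = n_hidden
        self.rng = rng if rng is not None else np.random.default_rng()

    def _hidden(self, X):
        # Sigmoid activations of the randomly weighted hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        # Hidden weights are drawn at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Least-squares output weights; targets are mapped to {-1, +1}.
        self.beta = np.linalg.pinv(H) @ (2.0 * y - 1.0)
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta > 0).astype(int)


def undersampled_ensemble(X, y, n_hidden=20, seed=0):
    """Train one ELM per majority chunk, each paired with all minority samples."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)  # assumed: label 1 is the minority class
    majority = rng.permutation(np.flatnonzero(y == 0))
    n_models = max(1, len(majority) // len(minority))
    models = []
    for chunk in np.array_split(majority, n_models):
        idx = np.concatenate([minority, chunk])
        models.append(ELM(n_hidden, rng).fit(X[idx], y[idx]))
    return models


def vote(models, X):
    """Majority vote over the ensemble; ties go to the minority class."""
    share = np.mean([m.predict(X) for m in models], axis=0)
    return (share >= 0.5).astype(int)
```

Each call to `fit` in the loop touches only its own subset, so under this reading the members could be trained in separate processes, which matches the abstract's remark that the classifiers are independent until the vote.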




 
