Journal of Nanjing University (Natural Science) ›› 2015, Vol. 51 ›› Issue (2): 421–429.


A clustering and sampling method for imbalanced data

Zhu Yaqi1, Deng Weibin1,2*

  • Online: 2015-03-30  Published: 2015-03-30
  • About the authors: 1. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; 2. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China
  • Supported by: National Natural Science Foundation of China (61272060, 61309014), Natural Science Foundation of Chongqing (cstc2012jjA40032, cstc2013jcyjA40063), and the Open Fund of the Chongqing / Ministry of Information Industry Key Laboratory of Computer Network and Communication Technology (CY-CNCL-2010-05)


Abstract: Classification is an important research topic in machine learning. Existing classification methods are relatively mature and, on balanced data, generally achieve good results. In the real world, however, class proportions are often imbalanced. Traditional classifiers are designed on the premise of balanced data and pursue the best overall accuracy, so applying them to massive imbalanced data causes a sharp drop in performance and a heavily biased result: most commonly, the recognition rate of minority-class samples is far below that of majority-class samples, and instances that belong to the minority class are mistakenly assigned to the majority class. To address this problem, under-sampling can convert an imbalanced dataset into a balanced one, reducing the degree of imbalance so that traditional classifiers again work well; however, under-sampling may discard important information, and clustering the data first counteracts this loss. In addition, ensemble methods combine many classifiers by weighted voting, which significantly improves the accuracy and generalization ability of the model. This paper therefore proposes a decomposition-based learning algorithm. First, the samples are clustered with the k-means algorithm. On the basis of the clustering, the dataset is repeatedly under-sampled according to sample weights to produce balanced datasets. A decision tree is trained and tested on each balanced dataset, and the weights of misclassified samples are increased, so that highly weighted samples are more likely to be selected in later rounds. The error rate of each base classifier is then taken into account as that classifier's weight, and the base classifiers with the better classification results are selected for a weighted ensemble. Finally, three experiments are conducted on eight datasets from the UCI repository: the first selects the base classifier that works best with the proposed method, and the other two compare the method with data-level and algorithm-level approaches. Compared with other algorithms, the proposed method achieves higher minority-class precision and a higher minority-class F-measure, while greatly reducing the size of the training set.
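The pipeline the abstract describes (k-means clustering, weight-driven under-sampling into balanced sets, decision-tree base classifiers whose misclassified samples receive higher weights, and an error-weighted vote over the better base classifiers) can be illustrated with a short sketch. What follows is a minimal Python/scikit-learn sketch of that general scheme under stated assumptions, not the authors' exact algorithm: the weight-update factor, the number of rounds, and the classifier-selection rule are illustrative choices, since the abstract does not give the paper's formulas.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def cluster_undersample_ensemble(X, y, minority_label, n_clusters=5,
                                 n_rounds=10, seed=None):
    """Cluster the majority class, build weighted under-sampled balanced
    sets, train a decision tree per set, and keep the better trees."""
    rng = np.random.default_rng(seed)
    min_mask = (y == minority_label)
    X_min, y_min = X[min_mask], y[min_mask]
    X_maj, y_maj = X[~min_mask], y[~min_mask]

    # Cluster the majority class so each balanced set covers its structure.
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_maj)

    # Sampling weights over majority samples; mistakes get boosted below.
    w = np.full(len(X_maj), 1.0 / len(X_maj))

    models, errors = [], []
    for _ in range(n_rounds):
        # Draw a majority subset the size of the minority class, spread
        # across clusters in proportion to each cluster's total weight.
        idx = []
        for c in np.unique(clusters):
            members = np.flatnonzero(clusters == c)
            share = w[members].sum() / w.sum()
            k = min(len(members), max(1, int(round(share * len(X_min)))))
            p = w[members] / w[members].sum()
            idx.extend(rng.choice(members, size=k, replace=False, p=p))
        idx = np.asarray(idx)

        # Train a base classifier on the balanced set.
        tree = DecisionTreeClassifier().fit(
            np.vstack([X_min, X_maj[idx]]),
            np.concatenate([y_min, y_maj[idx]]))

        # Validate, record the weighted error, and raise the weights of
        # misclassified majority samples (the factor 2.0 is an assumption).
        wrong = tree.predict(X_maj) != y_maj
        errors.append(np.average(wrong, weights=w))
        w[wrong] *= 2.0
        w /= w.sum()
        models.append(tree)

    # Keep the better half of the base classifiers, weighted by (1 - error).
    keep = [i for i, e in enumerate(errors) if e <= np.median(errors)]
    wts = np.array([1.0 - errors[i] for i in keep])
    return [models[i] for i in keep], wts / wts.sum()

def predict(models, wts, X):
    """Error-weighted majority vote over the retained base classifiers."""
    votes = np.array([m.predict(X) for m in models])   # (n_models, n)
    preds = []
    for col in votes.T:                                # one column per sample
        labels = np.unique(col)
        preds.append(labels[np.argmax(
            [wts[col == lab].sum() for lab in labels])])
    return np.array(preds)

if __name__ == "__main__":
    # Tiny synthetic demo: 100 minority points vs. 1000 majority points.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1100, 2))
    X[:100] += 3.0                    # shift the minority class away
    y = np.array([1] * 100 + [0] * 1000)
    models, wts = cluster_undersample_ensemble(X, y, minority_label=1, seed=0)
    print(predict(models, wts, X[:5]))   # five minority points
```

Clustering before under-sampling is what limits the information loss mentioned above: each balanced set draws majority samples from every cluster in proportion to its weight, rather than from one random slice of the majority class.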

