Journal of Nanjing University (Natural Sciences) ›› 2015, Vol. 51 ›› Issue (2): 421-429.
Zhu Yaqi1, Deng Weibin1,2*
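Abstract: Many studies have shown that, when classifying massive imbalanced data, traditional classifiers are biased toward majority-class rules, so minority-class instances are misclassified as belonging to the majority class. To address this problem, a learning classification algorithm based on decomposition is proposed. The algorithm first clusters the sample data. On the basis of the clustering, it repeatedly under-samples the data set according to sample weights to produce balanced data sets, validates on each balanced data set, and raises the weights of the misclassified samples. The error rate of each base classifier is taken into account as that classifier's weight, and the base classifiers with better classification performance are selected for weighted ensembling. Experiments show that the algorithm achieves high minority-class accuracy and minority-class F-measure, while substantially reducing the size of the training set.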
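The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the k-means clustering, the decision-stump base learner, the weight-doubling rule, validation on the full training set, the `err < 0.5` selection threshold, and the AdaBoost-style classifier weight `log((1 - err) / err)` are all assumptions chosen for illustration, as are all function and parameter names.

```python
import math
import random

def kmeans(points, k, rng, iters=10):
    """Plain k-means (assumed clustering method); returns a cluster index per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def train_stump(X, y):
    """Assumed base learner: the best single-feature threshold rule for 0/1 labels."""
    best_err, best = None, None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for pos in (True, False):
                err = sum(int((x[f] > t) == pos) != yi for x, yi in zip(X, y))
                if best_err is None or err < best_err:
                    best_err, best = err, (f, t, pos)
    return best

def predict_stump(stump, x):
    f, t, pos = stump
    return int((x[f] > t) == pos)

def fit_ensemble(X, y, rounds=5, k=3, seed=0):
    """Cluster the majority class, under-sample it by weight, reweight, and ensemble."""
    rng = random.Random(seed)
    minority = [i for i, v in enumerate(y) if v == 1]
    majority = [i for i, v in enumerate(y) if v == 0]
    clusters = kmeans([X[i] for i in majority], k, rng)
    weights = {i: 1.0 for i in majority}
    ensemble = []
    for _ in range(rounds):
        # Weighted under-sampling (with replacement): each cluster contributes
        # in proportion to its size, so the sample roughly matches the minority size.
        sample = []
        for c in range(k):
            members = [i for i, l in zip(majority, clusters) if l == c]
            if members:
                n = max(1, round(len(minority) * len(members) / len(majority)))
                sample += rng.choices(members, weights=[weights[i] for i in members], k=n)
        idx = minority + sample
        stump = train_stump([X[i] for i in idx], [y[i] for i in idx])
        # Validate (here: on the full training set, an assumption) and raise
        # the weights of misclassified majority samples.
        wrong = [i for i in range(len(X)) if predict_stump(stump, X[i]) != y[i]]
        for i in wrong:
            if i in weights:
                weights[i] *= 2.0  # assumed update rule
        err = len(wrong) / len(X)
        if err < 0.5:  # keep only base classifiers that beat chance (assumed criterion)
            alpha = math.log((1 - err + 1e-9) / (err + 1e-9))
            ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the selected base classifiers."""
    score = sum(a * (1 if predict_stump(s, x) == 1 else -1) for a, s in ensemble)
    return int(score > 0)

# Toy imbalanced data: 40 majority points on a grid near the origin, 6 minority points far away.
X = [((i % 8) * 0.25, (i // 8) * 0.4) for i in range(40)]
X += [(4 + j * 0.3, 4 + j * 0.3) for j in range(6)]
y = [0] * 40 + [1] * 6
model = fit_ensemble(X, y)
```

On this toy data, `predict(model, (5.0, 5.0))` should return the minority label 1 and `predict(model, (0.0, 0.0))` the majority label 0. Each base classifier is trained on roughly twice the minority-class size rather than on the full data set, which is where the reduction in training-set size mentioned in the abstract would come from.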