Journal of Nanjing University (Natural Science) ›› 2019, Vol. 55 ›› Issue (1): 113. doi: 10.13232/j.cnki.jnju.2019.01.001
Han Mingming (韩明鸣)1, Guo Husheng (郭虎升)1, Wang Wenjian (王文剑)2*
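Abstract: In recent years, learning from imbalanced multi-class data has attracted wide attention in machine learning and data mining, and over-sampling has become the main approach to the class-imbalance problem. Existing over-sampling techniques still have many shortcomings, however. For example, newly synthesized minority-class samples may still fall within the original region of the corresponding minority class, so the imbalance of the data distribution is not effectively improved. In addition, if the distributions of different classes overlap in the original data, new synthetic samples drift more easily into the distributions of other classes, which causes over-generalization and lowers classification accuracy on the minority class. To address these problems, a two-stage synthesis over-sampling method, the Quadratic Synthetic Minority Over-sampling Technique (QSMOTE), is proposed. First, samples carrying important information are selected according to the support degree of the minority-class samples and used for the first synthesis. The range of the second synthesis is then adjusted by analyzing the sample distribution in the neighborhood of the centroid of the specified minority class, and the second synthesis is performed. Experimental results on UCI and MNIST datasets show that QSMOTE not only improves the imbalance of the data distribution but also reduces over-generalization as far as possible, and in particular it substantially improves classification accuracy on minority-class samples.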
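The two-stage idea in the abstract can be illustrated with a rough sketch. This is not the paper's QSMOTE: the abstract does not define the support degree or the rule for adjusting the second synthesis range, so both are stand-ins here. The sketch assumes that support is the fraction of a sample's k nearest neighbours sharing its label, and that the second-stage range is a ball around the minority centroid whose radius is the distance to the nearest sample of another class. All function names and parameters (`support`, `two_stage_oversample`, `k`, `n_new`) are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def interpolate(a, b, rng):
    """SMOTE-style synthesis: a random point on the segment from a to b."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


def support(i, X, y, k):
    """ASSUMED definition: share of the k nearest neighbours of sample i
    that carry the same label as sample i."""
    order = sorted((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))
    nearest = order[:k]
    if not nearest:
        return 1.0
    return sum(y[j] == y[i] for j in nearest) / len(nearest)


def two_stage_oversample(X, y, minority, n_new, k=3, seed=0):
    """Return n_new synthetic minority points (needs >= 1 minority sample)."""
    rng = random.Random(seed)
    idx = [i for i in range(len(X)) if y[i] == minority]

    # Stage 1 (assumed rule): seed only from minority samples that have
    # at least one same-class neighbour; fall back to all of them.
    seeds = [i for i in idx if support(i, X, y, k) > 0.0] or idx
    stage1 = [interpolate(X[rng.choice(seeds)], X[rng.choice(idx)], rng)
              for _ in range(n_new // 2)]

    # Stage 2 (assumed rule): synthesize between the minority centroid and
    # points closer to it than the nearest sample of any other class.
    pool = [X[i] for i in idx] + stage1
    centroid = [sum(c) / len(pool) for c in zip(*pool)]
    others = [X[i] for i in range(len(X)) if y[i] != minority]
    radius = min((dist(centroid, o) for o in others), default=math.inf)
    safe = [p for p in pool if dist(centroid, p) < radius] or [centroid]
    stage2 = [interpolate(centroid, rng.choice(safe), rng)
              for _ in range(n_new - len(stage1))]
    return stage1 + stage2


if __name__ == "__main__":
    X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
    y = [1, 1, 1, 1, 0, 0, 0, 0]
    for point in two_stage_oversample(X, y, minority=1, n_new=4):
        print(point)
```

Because every synthetic point is an interpolation between minority samples or between the minority centroid and a point inside the assumed safe radius, the new points cannot land closer to the centroid's nearest other-class sample than that sample itself, which is one simple way to limit the over-generalization the abstract describes.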