Journal of Nanjing University (Natural Science), 2019, Vol. 55, Issue (1): 1–13. doi: 10.13232/j.cnki.jnju.2019.01.001


Quadratic synthetic minority over-sampling technique for classification of multiclass imbalance problems

Han Mingming1, Guo Husheng1, Wang Wenjian2*

  1. School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, China; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing, Ministry of Education, Shanxi University, Taiyuan, 030006, China; 3. Suzhou Nanzhi Sensing Technology Co., Ltd. (苏州南智传感科技有限公司), Suzhou, 215123, China
  • Accepted: 2018-08-19 Online: 2019-02-01 Published: 2019-01-26
  • Contact: Wang Wenjian, E-mail: wjwang@sxu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61673249, 61503229), Research Project Supported by Shanxi Scholarship Council of China (2016-004), CERNET Next Generation Internet Technology Innovation Project (NGII20170601)

Abstract: In recent years, learning from multiclass imbalanced data has attracted increasing interest in machine learning and data mining, and over-sampling has become the main approach to the class imbalance problem. However, existing over-sampling techniques still have limitations. For example, newly synthesized minority samples may still fall within the original region of their own class, so they do not effectively improve the imbalance of the data distribution. In addition, if samples of different classes overlap in the original data, the synthetic samples tend to drift into the regions of other classes, which causes over-generalization and lowers the classification accuracy on the minority classes. To address these problems, the Quadratic Synthetic Minority Over-sampling Technique (QSMOTE) is proposed. In the first synthesis, samples that carry important information are selected according to the support of the minority samples. In the second synthesis, the synthesis range is adjusted by analyzing the distribution of samples in the neighborhood of the centroid of the given minority class. Experiments on UCI and MNIST data sets demonstrate that QSMOTE not only reduces the imbalance of the data distribution but also limits over-generalization as far as possible, and that it substantially improves classification accuracy on the minority classes in particular.

Key words: multiclass imbalance problems, over-generalization, overlapping, Synthetic Minority Over-sampling Technique (SMOTE)
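
The first limitation noted in the abstract can be illustrated with a minimal sketch of the standard SMOTE interpolation step of Chawla et al. This is plain SMOTE, not the QSMOTE procedure, whose support measure and centroid-neighborhood rule are not specified on this page. The function names (`smote`, `k_nearest`, `inside_bounding_box`), the parameter defaults, and the toy data are assumptions made for illustration only. Because each synthetic point is an interpolation between two minority samples, every new point stays inside the region the minority class already occupies.

```python
import math
import random


def k_nearest(points, idx, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(minority, n_new, k=3, seed=0):
    """Plain SMOTE interpolation (a baseline sketch, not QSMOTE).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest(minority, i, k))
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def inside_bounding_box(point, reference, tol=1e-9):
    """True if point lies in the axis-aligned bounding box of reference.

    tol absorbs floating-point rounding in the interpolation.
    """
    for d in range(len(point)):
        lo = min(r[d] for r in reference)
        hi = max(r[d] for r in reference)
        if not (lo - tol <= point[d] <= hi + tol):
            return False
    return True


# Toy 2-D minority class (made-up values, for illustration only).
minority = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (2.0, 1.5), (1.8, 2.1)]
new_points = smote(minority, n_new=20)

# Every synthetic point stays within the original minority region.
print(all(inside_bounding_box(p, minority) for p in new_points))  # prints True
```

The check at the end shows why interpolation alone adds samples without extending the area covered by the minority class. The abstract's second concern, synthetic points drifting into other classes, arises when that area overlaps with neighboring classes.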

CLC number:

  • TP18