基于密度峰值聚类和局部稀疏度的过采样算法
1.
2.
3.
Oversampling algorithm based on density peaks clustering and local sparsity
1.
2.
3.
通讯作者:
收稿日期: 2022-03-15
基金资助: |
|
Received: 2022-03-15
关键词:
Keywords:
本文引用格式
吕佳, 郭铭.
Jia Lü, Guo Ming.
机器学习中的不平衡分类是指在类别比例倾斜的数据集中进行分类,数据集中样本较多的类为多数类,样本较少的类为少数类.传统分类算法的目标是通过校准损失函数以使损失最小化来获得最佳精度,导致结果更加偏向于多数类样本.在对此类数据集进行分类时,如不进行预处理,则最终的分类结果往往有很大偏差,以致少数类样本无法被正确分类.例如,对一样本总数为100的数据集,多数类样本个数为99,少数类样本个数为1,即使采取将所有样本都归为多数类的分类策略,最终的准确率也可以达到99%,然而这样做就完全忽略了少数类样本并将其进行错误分类.在实际应用中,如果处理的样本为极端不平衡的医学病例样本,那么更重要的病例患者就无法被准确识别,带来的后果将极为严重.因此,对不平衡分类问题的研究重要且必要.在近年各类实际问题中,不平衡问题出现在各个领域,如医学诊断[1]、网络入侵检测[2]、破产预测[3]、软件缺陷预测[4]、客户行为分析[5]、文本分类[6]等.
为了解决不平衡数据集的分类问题,研究者提出许多方法,主要有三类:数据级方法、算法级方法和成本敏感方法[7-8].数据级方法使用基于采样的方法来平衡多数类和少数类之间的样本数,它可以分三种:过采样[9]、欠采样[10]和混合方法[11].在三种基于采样的方法中,过采样比其他数据级方法使用得更频繁,因为欠采样会消除多数类中的数据,导致重要数据的丢失.此外,Batista et al[12]通过ROC曲线下面积(Area Under Curve,AUC)的测量证明,过采样通常比欠采样表现更好.合成少数过采样技术(Synthetic Minority Over⁃Sampling Technique,SMOTE)被认为是最有影响力的采样算法[13],它通过在相邻的少数类样本之间进行插值来平衡原始训练数据.此算法有效提高了分类器的性能,但在采样过程中会产生大量的噪声且样本重叠现象变得严重.研究发现,数据的类间失衡程度并不是阻碍分类器最终效果的唯一因素,而数据的空间复杂性才是分类恶化的主要决定因素[14].数据重叠(sample overlapping)、小析取(small disjuncts)等问题[15-16]会对分类器的性能产生消极影响,样本的数据重叠发生在少数类样本位于多数类样本中时,在两类样本间可能存在大量的重叠样本.单独存在于多数类样本中的噪声可以很好地进行过滤,但被大量多数类样本包围的少数类样本簇或者噪声簇,即小析取问题,会产生更多难以处理的噪声.
为此,大量的SMOTE改进方法被应用于采样过程.Han et al[17]设计了Borderline⁃SMOTE来选择边界线附近的少数类样本进行过采样,但边界样本的投票选择策略使大量的高密度少数类样本参与过采样,在某些数据集中极易导致过拟合.He et al[18]针对不平衡学习设计了一种新的自适应学习算法ADASYN,它以权重分布为标准,根据每个少数类样本的学习难度自动决定需要为其生成的合成样本的数量.然而,此算法中边界区域中的样本难以被过采样,且该方法未对噪声进行处理,使学习任务更困难.Douzas et al[19]提出基于K⁃means的SMOTE算法,采取针对全部样本聚类的策略,在一定程度上解决了小析取的问题,但对簇内所有少数类样本进行过采样,容易导致过拟合,且在非球形数据集上无法取得较好的分类效果.Chen et al[20]提出的RSMOTE是采取聚类策略的另一种方法,它在应用SMOTE过采样之前,确定每一个少数类样本的相对密度,通过对密度进行2⁃means聚类,使少数类样本被自然地分为两类,选择在密度中心较大的簇中进行过采样.这种方法的目标是在样本集中的少数类较密集区域合成样本来提升类别可分性,但忽略了边界线周围样本的敏感程度,在样本分布复杂的区域中效果会大打折扣.
基于聚类的过采样方法以K⁃means聚类为主,但K⁃means仅适用于超球形数据,且K无法自动确定到最优值,而密度峰值聚类不仅可以处理形状不规则、大小不统一的数据,其参数量也较少.针对二分类数据集过采样的小析取及数据重叠问题,本文提出一种基于密度峰值聚类和局部稀疏度的过采样算法.首先对全部样本进行密度峰值聚类,通过过滤出符合条件的簇来减少小析取现象带来的消极影响,再根据簇内样本的不平衡比例确定各簇内的少数样本过采样个数,最终由各簇内少数类样本的局部稀疏度找到稀疏度较高的样本.通过对这些样本进行SMOTE,既避免了在少数类样本集中的区域合成样本导致过拟合,又可合理地补充少数类样本的稀疏性.
1 本文算法
1.1 自适应密度峰值聚类
2014年Rodriguez and Laio[21]的密度峰值聚类算法(Density Peaks Clustering,DPC)只需计算两个变量:局部密度
其中,
根据
受马春来等[22]的启发,为了能够自动得到聚类中心,首先将
令
簇中心即为
1.2 簇的过滤与过采样权重的设置
小析取是不平衡数据分布中的常见情形.如图1所示,多数类样本和少数类样本相比,在少数类矩形区域的中心区域中的代表性不足,而少数类样本仅覆盖整个数据集的一小部分,并且被放置在多数类样本内部.这种现象在不平衡分类问题中十分普遍,且对过采样有极大的影响.在SMOTE算法中,小析取现象被放大,合成样本大量分布于多数类样本中,不仅破坏了原样本集的数据分布,也产生了更多的噪声与重叠样本.
图1
现有的基于聚类的过采样策略大多仅针对少数类样本进行聚类操作,这样会使过采样仅发生在少数类样本集中的区域.但若存在小析取问题,那么很有可能后续的合成样本分布在错误的位置,聚集在多数类样本中,如图2所示.
图2
图2中虚线框内的小析取在多数类聚集的区域产生了更多噪声,对最终的分类任务产生了一定的消极影响.为了保证采样后的平衡样本集与原样本集保持高度相似的数据分布,同时避免由小析取带来的一系列危害,本文使用改进的DPC对数据集所有样本进行聚类.基于密度策略的聚类方法能有效识别小析取的存在,单个存在的噪声样本也可用更小的代价被筛出.在完成自适应的聚类操作后,过采样过程仅发生在不平衡比符合条件的簇中.首先确定每个簇的不平衡率
其中,
过采样过程发生在少数类样本中,对少数类样本进行扩充,使其与多数类样本达到相应比例.过滤不符合条件的簇后,需要确定适合过采样簇的采样数量.基于聚类的过采样方法中,传统策略直接使用每个聚类的不平衡比例来确定采样权重,但这样做仅仅补充了少数类样本数量,没有考虑类内样本的分散程度,无法强化少数类样本与多数类样本的边界,使最终的分类效果受到很大影响.高采样权重应当对应于少数类样本密度较低的簇来生成更多的合成样本,以此保证过采样结束后的数据分布与原数据集样本分布相近.边界处为少数类样本较稀疏的区域,分配给此类样本较大的采样权重,在过采样完成后会使少数类样本与多数类样本的边界更清晰,能取得更好的分类效果.采样权重的计算首先得到每个簇的平均距离,此处基于欧氏距离进行计算,划分出的
其中,
其中,
权重相加之和为1,这样就确定出每个簇中合成样本的数量,定义为:
其中,
1.3 局部稀疏度的确定
数据重叠往往发生在过采样合成新样本的过程中,尤其在原始数据集中,如果大量少数类样本本就存在于多数类样本中,采取随机线性插值策略生成的样本极易与多数类样本发生重叠,使最终分类器的效果不尽人意.数据重叠也会在同类数据中发生,相似样本产生的合成样本会与原样本重叠.所有数据的内在特征中,类别间的重叠是影响最大的问题,如图3所示.合成样本的位置分布是所有过采样算法中最重要的环节.合成样本的最优位置是既不在少数类样本的高密度区域以避免最终的分类过拟合,又能不发生或尽量少发生样本重叠的现象,从而使分类器不会因此发生分类错误.
图3
在决定每个簇的样本合成数量后,为了使合成样本的位置更合理,采取的策略不是在整个簇内对少数类样本进行随机过采样,而是选择特定的样本进行过采样.采样策略为在簇内选择少数类样本较稀疏的区域进行合成过采样.异常值在不平衡分类中产生的主要问题为:在异常值附近可能会合成新的样本从而形成噪声存在于数据集中,导致分类器的分类性能变差.本文算法在第一阶段采用密度峰值聚类的方法进行处理,聚类完成后进行簇的过滤,多数类样本个数大于少数类样本个数的簇会被筛除,而少数类中的异常值往往更靠近多数类样本或远离少数类样本,则异常值存在的簇一般不会被选择进入过采样阶段,可以避免异常值带来的消极影响.
在过滤后的簇
其中,
其中,
将计算得到的局部稀疏度降序排序,由前面得到的每个簇中少数类样本过采样个数
1.4 算法步骤
综上,本文提出的基于密度峰值聚类和局部稀疏度的过采样算法(Oversampling Algorithm Based on Density Peaks Clustering and Local Sparsity,DPCLSO)的具体步骤如算法1所示,相应算法流程图如图4所示.
图4
DPCLSO
输入:训练集
输出:新的样本集
Step 1. 用改进的DPC算法对全部样本
Step 2. 由
Step 3. 由
Step 4. 对于过滤出的子簇
Step 5. 选择
Step 6. 对簇内全体少数类样本进行合成过采样,将合成样本放入样本集
Step 7. 重复Step 3至Step 6,直至迭代完所有簇.
Step 8. 将样本集
Step 9. 返回新样本集
1.5 时间复杂度分析
给定不平衡数据集D,假设包含N个维度为D的样本,通过三个步骤计算本算法的时间复杂度.第一步,由DPC算法选出符合条件的簇,DPC需计算每两个样本间的欧氏距离来构建距离矩阵,时间复杂度为
2 实验结果与分析
2.1 实验数据集
为了验证算法的有效性,采用UCI的18个数据集进行验证,包括不同样本量、不同不平衡比例、不同数据分布的各类数据集.样本量从170到21424,不平衡比例从1.16到57.64,数据分布有球形、流形等.本文算法均针对二分类问题,在数据预处理阶段,根据数据集中样本标签的不同,将其划分为少数类和多数类,数据集的具体信息如表1所示.
表1 实验数据集
Table 1
序号 | 名称 | 样本数 | 特征数 | 不平衡比 | 简称 |
---|---|---|---|---|---|
1 | Wine | 178 | 13 | 2.71 | D1 |
2 | Wholesale_Customers | 440 | 7 | 2.10 | D2 |
3 | spambase | 4601 | 57 | 1.54 | D3 |
4 | liver | 345 | 6 | 1.38 | D4 |
5 | german | 1000 | 24 | 2.33 | D5 |
6 | SPECTF_Heart | 267 | 44 | 3.85 | D6 |
7 | Ionosphere | 351 | 34 | 1.79 | D7 |
8 | Indian_Liver_Patientt | 583 | 10 | 2.49 | D8 |
9 | Breast_cancer_wisconsin | 699 | 9 | 1.90 | D9 |
10 | WaveForm | 5000 | 21 | 2.04 | D10 |
11 | mammographic_masses | 961 | 5 | 1.16 | D11 |
12 | Pendigits | 3498 | 16 | 9.44 | D12 |
13 | heart | 170 | 13 | 1.25 | D13 |
14 | Letter | 15534 | 16 | 26.64 | D14 |
15 | Wilt | 4339 | 5 | 57.64 | D15 |
16 | USPS | 9289 | 256 | 12.13 | D16 |
17 | MNIST348 | 21424 | 784 | 1.94 | D17 |
18 | MNIST69 | 8190 | 784 | 5.00 | D18 |
2.2 实验设置
为了验证DPCLSO在分类器上的表现效果,与其他八个过采样算法进行对比,分别为SMOTE,Borderline⁃SMOTE,SafeLevel⁃SMOTE,ADASYN,SMOTE⁃IPF,K⁃means SMOTE,Adaptive⁃SMOTE,RSMOTE,分别简称为SMO,Bor,Safe,ADAS,SM⁃IPF,K⁃m S,ADAP,RSM.其中包括基于KNN、聚类、噪声过滤等技术,均采用默认参数.各算法在不同分类器上的表现不同,为保证客观全面,使用三个分类器与本文算法及对比算法相结合,分别为KNN、决策树(C4.5)和随机森林(Random Forest).
2.3 评价指标
针对不平衡分类问题,评价分类性能常用的评价指标为F1⁃Measure,G⁃mean,AUC.首先给出这些指标依赖的混淆矩阵,如表2所示,实验中,少数类样本定义为正类,多数类样本定义为负类.
(1)Precision:所有预测的少数类样本中正确预测的少数类样本的比率.
(2)Recall:少数类样本被正确预测的比率.
(3)F1⁃measure:是Precision和Recall的调和平均值,更接近较小值.
(4)G⁃mean:综合评价少数类和多数类样本预测准确性的指标.
(5)AUC:同时考虑真阳性率 和假阳性率 ,得到ROC曲线下的面积.
2.4 实验结果及分析
为了避免实验结果受到随机影响,保证客观性,采用十折交叉验证取平均值的方法进行实验.
表3 各算法在KNN分类器上的各指标评价结果
Table 3
Datasets | Evaluation | SMO | Bor | Safe | ADAS | SM⁃IPF | K⁃m S | ADAP | RSM | DPCLSO |
---|---|---|---|---|---|---|---|---|---|---|
D1 | F1⁃measure | 62.0333% | 62.1173% | 55.6688% | 61.5440% | 60.4950% | 54.6957% | 48.7924% | 58.6876% | 62.2829% |
G⁃mean | 79.6812% | 78.4615% | 71.2518% | 80.0162% | 77.9393% | 73.2348% | 66.9467% | 75.8509% | 80.0405% | |
D2 | F1⁃measure | 85.9964% | 86.7993% | 86.3374% | 85.3952% | 87.9143% | 86.8775% | 85.4217% | 87.0937% | 88.2993% |
G⁃mean | 90.3121% | 90.6767% | 90.2701% | 90.1816% | 92.0551% | 90.7103% | 89.7805% | 90.8855% | 91.7266% | |
D3 | F1⁃measure | 78.2442% | 76.8568% | 77.7085% | 77.4810% | 78.2107% | 78.4768% | 77.6307% | 77.4382% | 78.4280% |
G⁃mean | 81.6440% | 80.5218% | 81.3824% | 80.7309% | 81.5465% | 82.1207% | 81.3010% | 81.2429% | 82.0311% | |
D4 | F1⁃measure | 56.6667% | 55.2083% | 56.7912% | 59.7166% | 65.1515% | 54.7297% | 46.1047% | 42.0513% | 57.8748% |
G⁃mean | 59.6238% | 57.3185% | 58.2452% | 55.3644% | 49.1523% | 55.8029% | 53.8242% | 50.2559% | 61.3867% | |
D5 | F1⁃measure | 52.7560% | 49.7018% | 48.9448% | 53.0428% | 49.9616% | 52.7413% | 47.9911% | 54.4706% | 56.0870% |
G⁃mean | 63.3895% | 60.4853% | 61.1482% | 63.3134% | 58.5940% | 64.5689% | 60.4762% | 66.0340% | 67.4871% | |
D6 | F1⁃measure | 56.3467% | 59.7222% | 61.1111% | 51.6340% | 52.3810% | 56.8627% | 60.7692% | 50.0000% | 60.0000% |
G⁃mean | 76.4860% | 79.4070% | 81.3536% | 73.2013% | 71.7161% | 78.1713% | 75.9926% | 71.8818% | 81.2447% | |
D7 | F1⁃measure | 84.0336% | 81.1765% | 80.3571% | 84.0336% | 80.3571% | 81.1765% | 77.5000% | 74.3421% | 84.0336% |
G⁃mean | 87.5140% | 84.9819% | 84.3065% | 87.5140% | 84.3065% | 84.9819% | 81.7744% | 79.0784% | 87.5140% | |
D8 | F1⁃measure | 51.3191% | 46.2766% | 46.9176% | 47.8095% | 52.1569% | 41.8762% | 41.2782% | 49.5541% | 49.7459% |
G⁃mean | 65.7490% | 61.4930% | 62.1053% | 62.1248% | 65.7728% | 56.9598% | 57.9467% | 64.6729% | 64.4944% | |
D9 | F1⁃measure | 95.7412% | 94.7253% | 95.7412% | 92.5052% | 93.5185% | 94.392% | 93.5185% | 95.7412% | 95.7412% |
G⁃mean | 97.8306% | 96.8547% | 97.8306% | 95.8093% | 96.7736% | 97.3521% | 96.7736% | 97.8306% | 97.8306% | |
D10 | F1⁃measure | 83.3056% | 84.1304% | 84.4078% | 81.6459% | 81.8952% | 82.5409% | 84.6341% | 85.6083% | 85.1662% |
G⁃mean | 88.7784% | 89.1231% | 89.5641% | 87.4620% | 87.7329% | 88.2285% | 89.8910% | 90.3583% | 90.3027% | |
D11 | F1⁃measure | 79.6456% | 78.5138% | 79.6987% | 78.2845% | 79.4329% | 77.8749% | 80.5446% | 77.1168% | 81.0368% |
G⁃mean | 77.2319% | 76.3439% | 78.7317% | 75.6164% | 75.5961% | 74.9554% | 78.3277% | 77.3277% | 79.6327% | |
D12 | F1⁃measure | 99.0321% | 99.1194% | 99.0758% | 98.9376% | 98.7178% | 99.0757% | 98.8969% | 98.8541% | 99.9314% |
G⁃mean | 99.5399% | 99.5500% | 99.5449% | 99.5299% | 99.5046% | 99.5449% | 99.5248% | 99.5197% | 99.5599% | |
D13 | F1⁃measure | 69.0476% | 67.8161% | 68.0672% | 68.6275% | 73.8095% | 70.0000% | 68.2870% | 69.2951% | 73.2252% |
G⁃mean | 58.1695% | 61.5895% | 60.5979% | 60.7131% | 65.6850% | 63.2075% | 64.1788% | 66.7669% | 66.5811% | |
D14 | F1⁃measure | 99.0909% | 99.0210% | 98.6777% | 98.2524% | 96.9538% | 99.1010% | 98.2712% | 96.1969% | 99.1211% |
G⁃mean | 99.7417% | 99.6118% | 99.5918% | 99.7903% | 99.0551% | 99.6118% | 99.5517% | 99.5119% | 99.5852% | |
D15 | F1⁃measure | 69.6581% | 72.2222% | 61.8421% | 71.0084% | 70.8333% | 64.7727% | 70.8333% | 75.7143% | 74.7727% |
G⁃mean | 77.0973% | 77.1389% | 72.3167% | 78.1661% | 74.0852% | 69.4628% | 74.0852% | 84.0976% | 79.3485% | |
D16 | F1⁃measure | 95.5630% | 95.5493% | 95.2310% | 96.0512% | 94.5915% | 95.4878% | 95.9256% | 94.4447% | 96.5470% |
G⁃mean | 98.4711% | 97.9041% | 98.0402% | 98.4924% | 97.3255% | 98.0329% | 98.4806% | 98.4967% | 98.4418% | |
D17 | F1⁃measure | 98.5657% | 98.6980% | 98.2979% | 98.7261% | 98.2305% | 98.8469% | 98.4576% | 98.6974% | 98.9396% |
G⁃mean | 99.2300% | 99.4004% | 99.0884% | 99.3136% | 99.0321% | 99.3256% | 99.1593% | 99.2671% | 99.3602% | |
D18 | F1⁃measure | 98.7282% | 99.3010% | 98.1242% | 98.6898% | 98.2638% | 99.1763% | 98.7282% | 99.3025% | 99.3043% |
G⁃mean | 98.7595% | 99.3234% | 98.1586% | 98.7161% | 98.3042% | 99.2945% | 98.7595% | 99.3003% | 99.3043% |
表4 各算法在Random Forest分类器上的各个指标评价结果
Table 4
Datasets | Evaluation | SMO | Bor | Safe | ADAS | SM⁃IPF | K⁃m S | ADAP | RSM | DPCLSO |
---|---|---|---|---|---|---|---|---|---|---|
D1 | F1⁃measure | 90.6709% | 97.6623% | 92.3217% | 96.7532% | 90.8537% | 94.6580% | 95.5152% | 94.1239% | 98.5641% |
G⁃mean | 95.2461% | 97.7890% | 95.7880% | 98.3726% | 96.3308% | 96.5675% | 97.9535% | 98.0415% | 99.1089% | |
D2 | F1⁃measure | 81.8589% | 85.6321% | 84.9480% | 83.0939% | 83.2347% | 84.6951% | 84.5597% | 85.0680% | 86.1102% |
G⁃mean | 86.4057% | 89.7323% | 89.0247% | 87.7998% | 88.0274% | 88.6122% | 88.3136% | 89.0348% | 90.1449% | |
D3 | F1⁃measure | 92.0729% | 92.0027% | 91.9530% | 91.7707% | 89.4000% | 91.8976% | 92.2054% | 91.7536% | 92.1298% |
G⁃mean | 93.6905% | 93.5232% | 93.3898% | 93.3818% | 91.3498% | 93.3722% | 93.5795% | 93.2385% | 93.6200% | |
D4 | F1⁃measure | 64.5992% | 62.3709% | 59.6078% | 62.7903% | 59.9194% | 61.0649% | 60.8345% | 61.5568% | 64.6409% |
G⁃mean | 67.1170% | 64.6089% | 62.2487% | 63.8087% | 41.3795% | 62.8179% | 64.8092% | 64.5395% | 67.5858% | |
D5 | F1⁃measure | 51.7652% | 51.8004% | 54.4513% | 48.3501% | 52.3519% | 54.1108% | 53.5563% | 51.4006% | 53.6928% |
G⁃mean | 62.8859% | 62.8798% | 65.1517% | 60.0904% | 64.0961% | 64.5052% | 64.0665% | 62.8836% | 64.5460% | |
D6 | F1⁃measure | 54.0671% | 46.8065% | 47.9250% | 42.8420% | 48.7296% | 43.0843% | 44.2958% | 47.8398% | 55.4072% |
G⁃mean | 73.3234% | 66.3916% | 67.9502% | 65.4356% | 70.4429% | 63.9262% | 63.5712% | 67.8754% | 75.6030% | |
D7 | F1⁃measure | 85.3300% | 87.3711% | 87.7565% | 85.3962% | 80.8829% | 85.4548% | 86.1940% | 86.3410% | 89.4372% |
G⁃mean | 88.0239% | 89.5964% | 89.5145% | 88.4350% | 84.1845% | 88.0619% | 88.4730% | 88.2307% | 90.9879% | |
D8 | F1⁃measure | 51.1653% | 48.5114% | 52.7296% | 47.0887% | 53.3318% | 49.3093% | 47.9999% | 51.1118% | 52.9969% |
G⁃mean | 63.9683% | 61.9052% | 65.3732% | 60.5885% | 64.3949% | 62.6383% | 61.2848% | 64.2067% | 65.7240% | |
D9 | F1⁃measure | 94.1638% | 93.6029% | 93.8141% | 92.6962% | 93.3060% | 93.8845% | 93.1508% | 93.8321% | 94.5209% |
G⁃mean | 95.9574% | 95.2493% | 95.7804% | 94.7729% | 95.3362% | 95.8004% | 95.1061% | 95.8240% | 96.3447% | |
D10 | F1⁃measure | 83.2339% | 82.3063% | 82.0069% | 82.3937% | 81.7358% | 82.0862% | 82.9586% | 83.1568% | 83.3381% |
G⁃mean | 88.1863% | 86.9737% | 86.7420% | 87.6875% | 87.6375% | 87.4105% | 87.8475% | 88.1125% | 88.2262% | |
D11 | F1⁃measure | 77.0901% | 76.7172% | 76.9860% | 76.6018% | 79.9519% | 77.2552% | 77.2467% | 76.1814% | 78.8932% |
G⁃mean | 78.1902% | 77.8985% | 77.8975% | 77.5758% | 79.5166% | 78.3964% | 78.4043% | 77.4141% | 80.2633% | |
D12 | F1⁃measure | 97.4557% | 97.3688% | 97.8214% | 97.3651% | 94.8871% | 97.4840% | 97.0970% | 97.8501% | 97.5070% |
G⁃mean | 98.4767% | 98.0245% | 98.5630% | 98.3665% | 96.4680% | 98.1916% | 98.2208% | 98.4285% | 98.7891% | |
D13 | F1⁃measure | 72.5802% | 72.9477% | 72.1871% | 73.9721% | 70.0999% | 70.6548% | 75.7488% | 70.7328% | 75.0237% |
G⁃mean | 74.0201% | 74.5555% | 74.0323% | 73.7829% | 69.1405% | 70.6940% | 76.3049% | 71.3652% | 77.3360% | |
D14 | F1⁃measure | 92.5700% | 93.3006% | 93.7434% | 94.1998% | 88.1873% | 92.2087% | 92.7789% | 92.7552% | 93.0408% |
G⁃mean | 94.2282% | 94.5115% | 94.7542% | 95.1248% | 91.2529% | 93.0817% | 94.6443% | 94.2959% | 94.2480% | |
D15 | F1⁃measure | 68.4615% | 70.0000% | 63.6364% | 78.4615% | 53.3333% | 67.2727% | 65.7143% | 68.6415% | 70.6061% |
G⁃mean | 74.8425% | 78.5644% | 71.3957% | 80.3533% | 60.3553% | 71.4435% | 74.7961% | 74.8655% | 75.9740% | |
D16 | F1⁃measure | 87.7604% | 86.2575% | 88.0239% | 88.8134% | 80.9454% | 85.1676% | 88.5639% | 87.0832% | 86.1974% |
G⁃mean | 91.2486% | 89.1500% | 90.9489% | 92.2731% | 85.0772% | 88.5405% | 91.9600% | 91.1035% | 92.9946% | |
D17 | F1⁃measure | 97.4588% | 97.2423% | 97.1397% | 96.7293% | 97.1309% | 97.6680% | 97.0848% | 97.6840% | 97.7708% |
G⁃mean | 98.0740% | 97.778% | 97.8242% | 97.6311% | 97.8802% | 98.0869% | 97.8347% | 98.2998% | 97.7063% | |
D18 | F1⁃measure | 97.1561% | 96.3506% | 96.6975% | 96.8042% | 96.3382% | 96.9671% | 96.6692% | 97.2820% | 97.2836% |
G⁃mean | 97.2626% | 96.4719% | 96.8070% | 96.9031% | 96.4385% | 97.0726% | 96.7803% | 97.3824% | 97.3885% |
表5 各算法在C4.5分类器上的各个指标评价结果
Table 5
Datasets | Evaluation | SMO | Bor | Safe | ADAS | SM⁃IPF | K⁃m S | ADAP | RSM | DPCLSO |
---|---|---|---|---|---|---|---|---|---|---|
D1 | F1⁃measure | 95.3504% | 95.9219% | 95.0244% | 95.9219% | 89.1739% | 94.6911% | 97.0330% | 95.9219% | 95.5800% |
G⁃mean | 96.7189% | 97.0969% | 96.2903% | 97.0969% | 93.3878% | 97.0969% | 98.1526% | 97.0969% | 97.4750% | |
D2 | F1⁃measure | 85.2866% | 80.7499% | 85.6621% | 82.5853% | 88.1044% | 82.6621% | 83.5422% | 84.0379% | 86.7211% |
G⁃mean | 90.6153% | 87.4349% | 89.4995% | 89.1953% | 92.7331% | 88.6294% | 88.9781% | 89.0365% | 91.4930% | |
D3 | F1⁃measure | 91.0761% | 90.7114% | 89.5812% | 89.7583% | 86.9262% | 89.2127% | 91.3965% | 90.4775% | 92.4308% |
G⁃mean | 92.4931% | 92.2069% | 91.2686% | 91.5019% | 89.0857% | 90.9029% | 92.6818% | 92.0207% | 93.5368% | |
D4 | F1⁃measure | 58.0147% | 56.5508% | 55.8179% | 52.8216% | 58.5686% | 51.9370% | 58.2561% | 58.6345% | 57.5804% |
G⁃mean | 62.2464% | 59.6573% | 61.0640% | 55.4117% | 43.2818% | 55.9819% | 63.7136% | 62.6257% | 63.3409% | |
D5 | F1⁃measure | 50.2477% | 46.2616% | 45.2641% | 47.9719% | 50.9139% | 49.9714% | 46.9857% | 45.5351% | 49.9561% |
G⁃mean | 63.0937% | 59.8204% | 58.9365% | 61.2255% | 63.4165% | 62.8464% | 60.1788% | 59.0383% | 62.7733% | |
D6 | F1⁃measure | NaN | 38.3952% | NaN | NaN | 45.3598% | 35.6085% | 43.3759% | NaN | 46.5640% |
G⁃mean | 56.1748% | 55.9361% | 44.1640% | 49.1841% | 63.9365% | 53.2240% | 59.0591% | 57.1104% | 63.5003% | |
D7 | F1⁃measure | 77.7990% | 76.7148% | 81.4483% | 75.0648% | 74.3187% | 77.9407% | 77.3053% | 79.2532% | 82.6409% |
G⁃mean | 83.3495% | 82.6656% | 85.6686% | 81.4911% | 80.7767% | 83.1034% | 83.0608% | 84.4409% | 87.2673% | |
D8 | F1⁃measure | 40.8288% | 44.8296% | 42.0709% | 44.4995% | 52.4160% | 40.5528% | 33.3644% | 38.9963% | 47.1016% |
G⁃mean | 55.3644% | 59.9155% | 56.7622% | 59.2967% | 65.0128% | 55.4519% | 49.6612% | 54.8542% | 60.7536% | |
D9 | F1⁃measure | 91.1449% | 90.9919% | 90.9771% | 91.0146% | 89.7471% | 91.1402% | 89.7448% | 90.8830% | 91.2878% |
G⁃mean | 93.1435% | 92.9086% | 92.9363% | 92.9736% | 92.2880% | 93.1267% | 91.8047% | 92.9036% | 93.3944% | |
D10 | F1⁃measure | 76.2830% | 75.4806% | 76.8681% | 76.6538% | 77.1066% | 76.3623% | 76.6250% | 77.4209% | 77.7131% |
G⁃mean | 82.7515% | 81.8017% | 82.8221% | 83.1245% | 84.2765% | 82.9522% | 82.9646% | 83.6477% | 83.8657% | |
D11 | F1⁃measure | 77.2319% | 77.2165% | 76.2536% | 77.0151% | 81.0328% | 78.3353% | 78.8932% | 79.1148% | 78.8529% |
G⁃mean | 76.7661% | 77.3651% | 76.3400% | 77.0226% | 78.2896% | 77.9084% | 79.0216% | 79.1874% | 79.1526% | |
D12 | F1⁃measure | 96.4695% | 95.9850% | 96.8202% | 96.6032% | 94.0345% | 97.1781% | 96.0692% | 97.2658% | 97.5309% |
G⁃mean | 98.1752% | 97.3734% | 98.0413% | 98.2613% | 96.4397% | 98.0484% | 97.7318% | 98.4311% | 98.6106% | |
D13 | F1⁃measure | 64.9351% | 69.5000% | 72.8778% | 71.1304% | 74.7826% | 69.3333% | 69.5652% | 65.3333% | 72.2826% |
G⁃mean | 66.1703% | 72.3171% | 76.1999% | 73.9165% | 75.4439% | 72.2677% | 73.2411% | 69.1026% | 76.4439% | |
D14 | F1⁃measure | 92.0246% | 92.6147% | 91.7111% | 92.1847% | 82.0128% | 90.5300% | 90.1884% | 92.0944% | 93.2130% |
G⁃mean | 96.0898% | 96.1157% | 95.7903% | 96.1208% | 90.4755% | 95.3021% | 96.1252% | 96.1801% | 96.2612% | |
D15 | F1⁃measure | 90.4892% | 60.4396% | 74.7253% | 79.1667% | 77.5000% | 90.4213% | 96.1538% | 89.0110% | 90.5983% |
G⁃mean | 96.1528% | 71.2328% | 81.6052% | 85.5086% | 89.4334% | 96.2842% | 96.2910% | 89.5923% | 96.2328% | |
D16 | F1⁃measure | 76.4519% | 77.4593% | 75.9186% | 76.1207% | 74.6716% | 77.0081% | 77.1894% | 74.8543% | 76.6652% |
G⁃mean | 88.7605% | 88.4716% | 86.6235% | 88.5958% | 82.3485% | 87.5854% | 90.1391% | 87.9138% | 88.2057% | |
D17 | F1⁃measure | 95.6210% | 95.8070% | 95.9727% | 94.0132% | 94.9890% | 96.0063% | 95.5166% | 95.4641% | 96.0063% |
G⁃mean | 96.9293% | 96.8651% | 97.0851% | 95.6145% | 96.5988% | 96.9495% | 96.8329% | 96.6793% | 96.9495% | |
D18 | F1⁃measure | 95.5141% | 95.3489% | 95.2893% | 95.3475% | 94.5738% | 95.4256% | 94.9416% | 95.5958% | 95.8715% |
G⁃mean | 95.6370% | 95.5051% | 95.4224% | 95.5040% | 94.6074% | 95.9596% | 95.1048% | 95.7345% | 96.0039% |
表3是各算法在KNN分类器上的评价指标,由表可见,本算法的各评价指标在大多数数据集上取胜,而未能取胜的数据集,其大部分指标都可排在前三名.还可以看出,数据量越大,数据集特征越多,各算法的所有性能指标都会出现不同程度的下降,这说明无论不平衡比是多少,只要数据高度复杂,分类器的性能就会变差,无法在一定限度内得到提高.综合来说,在KNN分类器上,本文算法在九个数据集上的三项评价指标全面优于其他算法.另外,F1⁃measure是衡量类之间重叠的指标,可以更好地表征最终获得的精度,F1⁃measure越大,表明新数据集的数据重叠现象越少.本文算法在18个数据集中的11个数据集上的F1⁃measure最大,说明本文算法能减小数据重叠对分类器带来的消极影响,弱化噪声数据带来的分类偏差.
表4是各算法在Random Forest分类器上的指标评价.由表可见,和KNN分类器的跑分结果相比,与Random Forest分类器的结合放大了本文算法的优势,同时,采样点的选取策略极好地减少了噪声样本的消极影响,所以本文算法在更多的数据集上的表现更好.在许多数据集上,本文算法和K⁃means SMOTE之间的指标差异显著,这是由于本文算法在稀疏区域的不同生成样本策略对标签噪声表现出更好的鲁棒性.G⁃mean在不平衡的数据集上会变得更敏感,换言之,和没有进行过采样而直接分类的数据集相比,该指标有明显提升.本文算法的G⁃mean在15个数据集上最优,证明本文算法过采样策略的合理性.
表5是各算法在C4.5分类器上的评价指标.由表可见,在数据量与不平衡率较大的D14数据集上,本文算法的F1⁃measure与G⁃mean都是最大的,证明本文算法对不同类型数据的普适性.
图5
图6
图6
各算法在Random Forest分类器上的AUC
Fig.6
AUC of each algorithm on Random Forest classifier
图7
综合来看,本文算法的AUC在大部分数据集上都取得了最优的结果,这代表任意取一个正类样本和负类样本,正类样本得分大于负类样本的概率更大,证明本文算法具有较高的整体稳定性.本文算法的三个评价指标全面优于K⁃means SMOTE和RSMOTE,说明不同的聚类策略虽然对某些数据集产生了一定的分类影响,但最终决定分类效果的依然是由过采样器合成新样本后与原样本集数据分布的一致程度,所以本文算法表现了较优的分类性能.
分析以上实验结果,证明本文算法采取的聚类策略与过采样选点策略能有效地控制合成样本的数据分布,同时大大减少数据重叠现象的发生,最终提高对少数类样本的分类性能.
综上,在大多数数据集上,本文算法结合KNN,C4.5,Random Forest分类器都取得了较好的实验结果,证明本文算法在少数类及整体上的分类性能都比较优秀.
2.5 时间成本分析
基于KNN分类器,通过十次执行的平均时间来评估本文算法的计算效率.实验由Matlab实现,在CPU为AMD Ryzen 7 5800H 3.20 GHz,内存16 GB的机器上运行,系统为Windows10.选择在样本数与不平衡比依次增加的三个数据集上进行比较,各算法的平均运行时间如表6所示.由表可见,本文算法DPCLSO的时间开销处于中高值,而运行时间最长的是K⁃means SMOTE.本文算法在样本量较少的数据集上速度较快,但随着样本量与不平衡比的增加,运行速度逐渐变慢.一个可能的原因是,随着数据量增加,两次距离矩阵的建立需要花费更多时间.然而,通过一定的计算时间代价换取较高的分类性能是可以被接受的.
表6 各算法的运行时间 (s)
Table 6
Method | D4 | D10 | D14 |
---|---|---|---|
SMO | 0.0312 | 0.0468 | 0.5312 |
Bor | 0.2500 | 0.3437 | 1.2343 |
Safe | 0.0156 | 0.4062 | 0.5468 |
ADAS | 0.0156 | 0.3437 | 1.0000 |
SM⁃IPF | 0.5625 | 3.5000 | 2.0312 |
K⁃m S | 0.0781 | 3.7187 | 26.6870 |
ADAP | 0.0625 | 0.6406 | 0.7031 |
RSM | 0.2968 | 2.0937 | 1.8906 |
DPCLSO | 0.0937 | 1.6406 | 15.2180 |
3 结论
针对二分类数据集的类间与类内不平衡问题,本文提出一种基于密度峰值聚类和局部稀疏度的过采样算法.对全部样本进行自适应DPC,很好地过滤了噪声簇,有效减少了小析取问题对分类的消极影响.考虑过采样过程中的数据重叠现象,选取局部稀疏度较高的少数类样本参与最终的过采样,有效弥补了少数类样本在数据分布中的稀疏性,同时减少了数据重叠现象的发生.通过在三种分类器上的分类实验验证了本文算法的有效性.本文对二分类问题进行了研究,而多类不平衡问题也是机器学习中一个研究热点,下一步将研究不平衡数据的多分类任务.
参考文献
Multistage fusion approaches based on a generative model and multivariate exponentially weighted moving average for diagnosis of cardiovascular autonomic nerve dysfunction
GAN augmentation to deal with imbalance in imaging⁃based intrusion detection
Imbalanced enterprise credit evaluation with DTE⁃SBD:Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates
A comprehensive investigation of the role of imbalanced learning for software defect prediction
Customer purchase prediction from the perspective of imbalanced data:A machine learning framework based on factorization machine
Imbalanced text sentiment classification using universal and domain⁃specific knowledge
Learning from class⁃imbalanced data:Review of methods and applications
Efficient utilization of missing data in cost⁃sensitive learning
Improving interpolation⁃based oversampling for imbalanced data learning
Radial⁃Based Undersampling for imbalanced data classification
An ensemble imbalanced classification method based on model dynamic selection driven by data partition hybrid sampling
A study of the behavior of several methods for balancing machine learning training data
Smote:Synthetic minority over⁃sampling technique
Learning from imbalanced data
SMOTE for learning from imbalanced data:Progress and challenges,marking the 15⁃year anniversary
面向非平衡多分类问题的二次合成QSMOTE方法
.
Quadratic synthetic minority over⁃sampling technique for classification of multiclass imbalance problems
Borderline⁃SMOTE:A new over⁃sampling method in imbalanced data sets learning
∥
ADASYN:Adaptive synthetic sampling approach for imbalanced learning
∥
Improving imbalanced learning through a heuristic oversampling method based on k⁃means and SMOTE
RSMOTE:A self⁃adaptive robust SMOTE for imbalanced problems with label noise
Clustering by fast search and find of density peaks
/
〈 |
|
〉 |
