Journal of Nanjing University (Natural Science) ›› 2024, Vol. 60 ›› Issue (1): 118–129. doi: 10.13232/j.cnki.jnju.2024.01.012



Granular ensemble learning based on multi-sampling approximate granulation

Xianyu Hou, Yuming Chen, Keshou Wu

  1. College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
  • Received: 2023-09-25  Online: 2024-01-30  Published: 2024-01-29
  • Contact: Xianyu Hou, E-mail: 416410794@qq.com
  • Supported by: National Natural Science Foundation of China (61976183)


Abstract:

Granulation is a method for constructing granular data and granular models. In recent years, several granulation methods have been proposed, such as similarity granulation based on a sample-similarity scale, neighborhood granulation derived from neighborhood relations, and rotation granulation based on feature-scale transformation; all have demonstrated outstanding performance in supervised and unsupervised tasks. Nevertheless, these techniques are built on the metric relations of the samples themselves, which causes the information carried by a sample to expand by varying orders of magnitude during granulation. This property makes the resulting granules difficult to handle in some cases. This paper proposes a granulation approach that constructs approximate granules by multi-sampling, which keeps the granulation process bounded to a finite scale. Furthermore, the fixed metric relation is discarded during granulation, so the resulting granules vary with the chosen approximation set and approximation base model, giving samples greater flexibility when they are granulated. We present a comprehensive comparison of multi-sampling approximate granulation with multiple granulation methods; the results show that it achieves higher classification performance. We also compare it in detail with several state-of-the-art ensemble algorithms; the results indicate that the multi-sampling approximate granular ensemble model offers better robustness and generalization in classification tasks.

Key words: granular computing, granular representation, multi-sampling approximate granulation, importance sampling, granular ensemble learning

CLC number:

  • TP311.13

Figure 1: Multi-sampling approximate granulation
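As a companion to Figure 1, the sketch below shows one plausible reading of multi-sampling approximate granulation as the abstract describes it: each sample is mapped to a granule whose components are the outputs of approximation base models fitted on sampled approximation sets, so the granule stays at a bounded scale. The function name approximate_granulate, the parameters n_views and sample_ratio, and the choice of shallow decision trees as base models are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def approximate_granulate(X, y, n_views=10, sample_ratio=0.3, seed=0):
    """Map each row of X to an n_views-dimensional approximate granule."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = max(2, int(sample_ratio * n))
    parts = []
    for _ in range(n_views):
        # Draw one approximation set; its size is fixed up front, so the
        # information carried by each granule stays at a bounded scale.
        idx = rng.choice(n, size=m, replace=False)
        base = DecisionTreeClassifier(max_depth=3, random_state=0)
        base.fit(X[idx], y[idx])
        # One granule component per sample: the base model's prediction.
        parts.append(base.predict(X))
    return np.column_stack(parts)  # shape: (n_samples, n_views)
```

Swapping the approximation sets or the base model changes the resulting granules, which is the flexibility the abstract refers to; the (n_samples, n_views) output can feed any downstream classifier.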

Figure 2: The multi-sampling approximate granular ensemble model
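Similarly, here is a minimal sketch of the ensemble stage outlined in Figure 2: one classifier is fitted per sampled approximation set and the members vote on the final label. The name msagel_predict, the k-NN base learners, and plain majority voting are assumptions made for illustration; the paper's actual aggregation rule is not given on this page.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def msagel_predict(X_train, y_train, X_test,
                   n_members=15, sample_ratio=0.3, seed=0):
    """Majority vote over members trained on sampled approximation sets."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    m = max(5, int(sample_ratio * n))  # k-NN needs at least k points
    votes = []
    for _ in range(n_members):
        idx = rng.choice(n, size=m, replace=False)
        member = KNeighborsClassifier(n_neighbors=5)
        member.fit(X_train[idx], y_train[idx])
        votes.append(member.predict(X_test))
    votes = np.stack(votes).astype(int)  # assumes integer-encoded labels
    # Majority vote across ensemble members for each test sample.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```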

Table 1: Properties of the datasets

Dataset          Dimensions   Classes   Samples
breast cancer    30           2         569
mobile           20           4         2000
diabetes         8            2         768
blood            4            2         747
raisin           7            2         900
Shill Bidding    10           2         6321
Wine Quality     21           10        5000
yeast            8            10        1484
waveform         21           3         5000
Debrecen         19           2         1150

Table 2: Comparison of the three sampling methods

Random sampling. Advantages: the algorithm is simple, effective, and computationally efficient. Disadvantages: high randomness; multiple draws are needed to obtain good results.
Cluster sampling. Advantages: the constructed approximate distribution has relatively low variance. Disadvantages: clusters must be computed in advance, which reduces computational efficiency.
Importance sampling. Advantages: the constructed approximate distribution has the lowest variance and best matches the original distribution. Disadvantages: samples of low importance are rarely selected.
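To make the comparison in Table 2 concrete, here is a hedged sketch of the three samplers over a data matrix X. Picking the point nearest each k-means centroid for cluster sampling, and weighting draws by an externally supplied importance score, are illustrative choices; the helper names draw_random, draw_cluster, and draw_importance are likewise assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def draw_random(X, m, rng):
    """Uniform sampling: cheap, but high-variance across repeated draws."""
    return X[rng.choice(len(X), size=m, replace=False)]

def draw_cluster(X, m):
    """Cluster sampling: clusters must be computed first (Table 2's cost);
    the point nearest each of the m centroids is kept as a representative."""
    km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
    idx = [np.argmin(((X - c) ** 2).sum(axis=1)) for c in km.cluster_centers_]
    return X[idx]

def draw_importance(X, m, rng, scores):
    """Importance sampling: draw probability proportional to a score, so
    low-importance samples are rarely selected (Table 2's drawback)."""
    p = scores / scores.sum()
    return X[rng.choice(len(X), size=m, replace=False, p=p)]

# Usage: rng = np.random.default_rng(0); subset = draw_random(X, 100, rng)
```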

Figure 3: Comparison of sampling distributions

Table 3: Variance comparison of the sampling methods

Dataset          Original distribution   Random sampling   Neighborhood sampling   Importance sampling
breast cancer    0.0208                  0.0238            0.0217                  0.0125
mobile           0.1305                  0.1308            0.1301                  0.1263
diabetes         0.0258                  0.0249            0.0257                  0.0200
blood            0.0267                  0.0247            0.0237                  0.0154
raisin           0.0224                  0.0216            0.0222                  0.0111
Shill Bidding    0.1119                  0.1118            0.1106                  0.1036
Wine Quality     0.0180                  0.0180            0.0171                  0.0131
yeast            0.0138                  0.0133            0.0128                  0.0096
waveform         0.0241                  0.0246            0.0241                  0.0220
Debrecen         0.0320                  0.0314            0.0292                  0.0257
Mean             0.0426                  0.0424            0.0417                  0.0358
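This page does not state how the variance figures in Table 3 are computed. One plausible statistic, assumed here purely for illustration, is the mean per-feature variance after min-max scaling, evaluated on the full data ("original distribution") and on each sampled approximation; the helper name mean_feature_variance is an assumption.

```python
import numpy as np

def mean_feature_variance(X):
    """Mean per-feature variance of min-max scaled data (assumed statistic)."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                  # guard against constant features
    Z = (X - X.min(axis=0)) / span         # scale each feature to [0, 1]
    return Z.var(axis=0).mean()
```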

Figure 4: Comparison of different sampling ratios

Figure 5: Comparison of ten classification runs on three datasets

Table 4: Classification results of the three sampling methods

                 Random sampling             Cluster sampling            Importance sampling
Dataset          Max     Min     Mean        Max     Min     Mean        Max     Min     Mean
breast cancer    0.9772  0.9631  0.9712      0.9806  0.9719  0.9745      0.9806  0.9754  0.9777
mobile           0.9110  0.8865  0.8975      0.9100  0.8860  0.8983      0.9120  0.8875  0.9035
diabetes         0.7787  0.7566  0.7683      0.7774  0.7435  0.7674      0.7826  0.7631  0.7710
blood            0.7818  0.7604  0.7656      0.7912  0.7590  0.7668      0.7724  0.7631  0.7646
raisin           0.8644  0.8478  0.8572      0.8667  0.8522  0.8600      0.8700  0.8600  0.8653
Shill Bidding    0.9854  0.9813  0.9833      0.9869  0.9801  0.9839      0.9848  0.9821  0.9831
Wine Quality     0.5935  0.5735  0.5869      0.5929  0.5785  0.5879      0.5966  0.5860  0.5906
yeast            0.5725  0.5421  0.5605      0.5758  0.5557  0.5646      0.5725  0.5637  0.5673
waveform         0.8698  0.8650  0.8680      0.8718  0.8662  0.8693      0.8718  0.8676  0.8693
Debrecen         0.6765  0.6557  0.6623      0.6809  0.6557  0.6590      0.6870  0.6600  0.6738

Table 5: Comparison of multiple granulation methods on the datasets

Dataset          RF               RF_Fuzzy         RF_Condition     RF_Neighbor      RF_SAG
breast cancer    0.9614±0.0007    0.9596±0.0011    0.9632±0.0014    0.9631±0.0007    0.9667±0.0007
mobile           0.8755±0.0002    0.8915±0.0001    0.8915±0.0002    0.9090±0.0003    0.9405±0.0002
diabetes         0.7474±0.0025    0.7721±0.0025    0.7565±0.0025    0.7656±0.0019    0.7500±0.0020
blood            0.7363±0.0008    0.7470±0.0008    0.7483±0.0004    0.7377±0.0011    0.7536±0.0008
raisin           0.8556±0.0004    0.8544±0.0007    0.8611±0.0005    0.8578±0.0011    0.8656±0.0005
Shill Bidding    0.9959±0.0000    0.9975±0.0000    0.9975±0.0000    0.9984±0.0000    0.9847±0.0000
Wine Quality     0.6979±0.0010    0.7004±0.0009    0.6967±0.0005    0.7005±0.0016    0.7035±0.0008
yeast            0.6103±0.0006    0.6042±0.0011    0.6150±0.0008    0.5866±0.0011    0.6135±0.0009
waveform         0.8490±0.0002    0.8392±0.0001    0.8400±0.0002    0.8384±0.0002    0.8586±0.0002
Debrecen         0.6878±0.0003    0.6774±0.0011    0.6635±0.0012    0.6843±0.0025    0.6930±0.0002
Mean             0.8017±0.0007    0.8043±0.0008    0.8033±0.0008    0.8041±0.0010    0.8130±0.0006

Table 6: Comparison of multiple ensemble methods on the datasets

Dataset          RF               AdaBoost         HGB              XGBoost          MSAGEL
breast cancer    0.9614±0.0006    0.9667±0.0005    0.9684±0.0003    0.9789±0.0006    0.9842±0.0002
mobile           0.8825±0.0006    0.7210±0.0018    0.9120±0.0002    0.9205±0.0002    0.9610±0.0002
diabetes         0.7474±0.0025    0.7527±0.0033    0.7344±0.0014    0.7357±0.0019    0.8724±0.0022
blood            0.6643±0.0196    0.7873±0.0112    0.6883±0.0214    0.7444±0.0019    0.8488±0.0016
raisin           0.8600±0.0013    0.8544±0.0021    0.8467±0.0024    0.8511±0.0015    0.9444±0.0008
Shill Bidding    0.9911±0.0000    0.9913±0.0000    0.9962±0.0000    0.9972±0.0000    0.9994±0.0000
Wine Quality     0.5647±0.0016    0.5253±0.0050    0.5428±0.0008    0.6898±0.0022    0.8405±0.0005
yeast            0.6197±0.0018    0.4323±0.0004    0.5846±0.0011    0.5947±0.0008    0.7876±0.0009
waveform         0.8264±0.0002    0.8094±0.0004    0.8518±0.0004    0.8450±0.0003    0.9026±0.0002
Debrecen         0.6600±0.0023    0.6522±0.0011    0.7043±0.0024    0.7122±0.0005    0.8478±0.0004
Mean             0.7778±0.0030    0.7492±0.0026    0.7829±0.0030    0.8069±0.0010    0.8989±0.0007
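All four baselines in Table 6 are available off the shelf, so the comparison loop can be reproduced in outline. The sketch below uses scikit-learn and the xgboost package with default settings and 10-fold cross-validation on one Table 1 dataset; the paper's exact protocol and the MSAGEL implementation are not given on this page, so these choices are assumptions and the numbers will not match the table exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              HistGradientBoostingClassifier)
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = load_breast_cancer(return_X_y=True)  # "breast cancer" in Table 1
baselines = {
    "RF": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "HGB": HistGradientBoostingClassifier(random_state=0),
    "XGBoost": XGBClassifier(random_state=0),
}
for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    # Report mean accuracy and variance across folds.
    print(f"{name}: {scores.mean():.4f} ± {scores.var():.4f}")
```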

Table 7: Multi-metric comparison of MSAGEL and XGBoost on the datasets

Dataset          Model      F1               Acc              Recall
breast cancer    MSAGEL     0.9810±0.0004    0.9847±0.0002    0.9785±0.0005
                 XGBoost    0.9737±0.0008    0.9742±0.0008    0.9738±0.0008
mobile           MSAGEL     0.9599±0.0001    0.9604±0.0001    0.9600±0.0001
                 XGBoost    0.9208±0.0002    0.9205±0.0002    0.9205±0.0002
diabetes         MSAGEL     0.8569±0.0009    0.8660±0.0009    0.8524±0.0011
                 XGBoost    0.7103±0.0014    0.7159±0.0014    0.7116±0.0017
blood            MSAGEL     0.7736±0.0012    0.8031±0.0005    0.7587±0.0019
                 XGBoost    0.5995±0.0050    0.6343±0.0095    0.5964±0.0038
raisin           MSAGEL     0.9444±0.0008    0.9448±0.0006    0.9444±0.0008
                 XGBoost    0.8508±0.0015    0.8536±0.0014    0.8511±0.0015
Shill Bidding    MSAGEL     0.9967±0.0000    0.9974±0.0000    0.9960±0.0000
                 XGBoost    0.9921±0.0000    0.9919±0.0000    0.9925±0.0001
Wine Quality     MSAGEL     0.7112±0.0017    0.7188±0.0034    0.7190±0.0019
                 XGBoost    0.3622±0.0029    0.3806±0.0065    0.3619±0.0024
yeast            MSAGEL     0.7369±0.0018    0.7680±0.0016    0.7393±0.0028
                 XGBoost    0.5267±0.0058    0.5560±0.0073    0.5255±0.0062
waveform         MSAGEL     0.9021±0.0003    0.9027±0.0003    0.9020±0.0003
                 XGBoost    0.8469±0.0003    0.8476±0.0003    0.8472±0.0003
Debrecen         MSAGEL     0.8472±0.0012    0.8488±0.0012    0.8480±0.0013
                 XGBoost    0.7113±0.0005    0.7131±0.0005    0.7126±0.0005
Mean             MSAGEL     0.8711±0.0008    0.8795±0.0009    0.8698±0.0011
                 XGBoost    0.7494±0.0018    0.7588±0.0028    0.7493±0.0018
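The three metrics in Table 7 map directly onto scikit-learn's scorers. Macro averaging is assumed below for the multi-class datasets, since the page does not state the averaging mode; the helper name report is illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def report(y_true, y_pred):
    """Return the Table 7 metrics for one run (averaging mode is an assumption)."""
    return {
        "F1": f1_score(y_true, y_pred, average="macro"),
        "Acc": accuracy_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred, average="macro"),
    }
```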