Journal of Nanjing University (Natural Science) ›› 2024, Vol. 60 ›› Issue (1): 118–129. doi: 10.13232/j.cnki.jnju.2024.01.012



Granular ensemble learning based on multi-sampling approximate granulation

Xianyu Hou, Yuming Chen, Keshou Wu

  1. College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
  • Received: 2023-09-25  Online: 2024-01-30  Published: 2024-01-29
  • Contact: Xianyu Hou, E-mail: 416410794@qq.com
  • Supported by: National Natural Science Foundation of China (61976183)


Abstract:

Granulation is a method for constructing granular data and granular models. In recent years, several granulation methods have been proposed, such as similarity granulation based on a sample-similarity scale, neighborhood granulation derived from neighborhood relations, and rotation granulation based on feature-scale transformation; all have demonstrated outstanding performance in supervised and unsupervised tasks. Nevertheless, these techniques are built on the metric relations of the samples themselves, which causes the information carried by a sample to expand by varying orders of magnitude during granulation. This property makes the resulting granules difficult to handle in some cases. This paper proposes a granulation approach that constructs approximate granules by multi-sampling, which keeps the granulation process bounded to a finite scale. Furthermore, the fixed metric relation is discarded during granulation, so the resulting granules vary with the chosen approximation set and approximation base model, giving samples greater flexibility when they are granulated. We present a comprehensive comparison of multi-sampling approximate granulation with multiple granulation methods; the results show that it achieves higher classification performance. We also compare it in detail with several state-of-the-art ensemble algorithms; the results indicate that the multi-sampling approximate granular ensemble model offers better robustness and generalization in classification tasks.

Key words: granular computing, granular representation, multi-sampling approximate granulation, importance sampling, granular ensemble learning

CLC number:

  • TP311.13

Figure 1: Multi-sampling approximate granulation
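As a companion to Figure 1, the sketch below shows one plausible reading of multi-sampling approximate granulation as the abstract describes it: each sample is mapped to a granule whose components are the outputs of approximation base models fitted on sampled approximation sets, so the granule stays at a bounded scale. The function name approximate_granulate, the parameters n_views and sample_ratio, and the choice of shallow decision trees as base models are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def approximate_granulate(X, y, n_views=10, sample_ratio=0.3, seed=0):
    """Map each row of X to an n_views-dimensional approximate granule."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = max(2, int(sample_ratio * n))
    parts = []
    for _ in range(n_views):
        # Draw one approximation set; its size is fixed up front, so the
        # information carried by each granule stays at a bounded scale.
        idx = rng.choice(n, size=m, replace=False)
        base = DecisionTreeClassifier(max_depth=3, random_state=0)
        base.fit(X[idx], y[idx])
        # One granule component per sample: the base model's prediction.
        parts.append(base.predict(X))
    return np.column_stack(parts)  # shape: (n_samples, n_views)
```

Swapping the approximation sets or the base model changes the resulting granules, which is the flexibility the abstract refers to; the (n_samples, n_views) output can feed any downstream classifier.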

Figure 2: The multi-sampling approximate granular ensemble model
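Similarly, here is a minimal sketch of the ensemble stage outlined in Figure 2: one classifier is fitted per sampled approximation set and the members vote on the final label. The name msagel_predict, the k-NN base learners, and plain majority voting are assumptions made for illustration; the paper's actual aggregation rule is not given on this page.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def msagel_predict(X_train, y_train, X_test,
                   n_members=15, sample_ratio=0.3, seed=0):
    """Majority vote over members trained on sampled approximation sets."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    m = max(5, int(sample_ratio * n))  # k-NN needs at least k points
    votes = []
    for _ in range(n_members):
        idx = rng.choice(n, size=m, replace=False)
        member = KNeighborsClassifier(n_neighbors=5)
        member.fit(X_train[idx], y_train[idx])
        votes.append(member.predict(X_test))
    votes = np.stack(votes).astype(int)  # assumes integer-encoded labels
    # Majority vote across ensemble members for each test sample.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```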

Table 1: Properties of the datasets

Dataset          Dimensions   Classes   Samples
breast cancer    30           2         569
mobile           20           4         2000
diabetes         8            2         768
blood            4            2         747
raisin           7            2         900
Shill Bidding    10           2         6321
Wine Quality     21           10        5000
yeast            8            10        1484
waveform         21           3         5000
Debrecen         19           2         1150

Table 2: Comparison of the three sampling methods

Random sampling. Advantages: the algorithm is simple, effective, and computationally efficient. Disadvantages: high randomness; multiple draws are needed to obtain good results.
Cluster sampling. Advantages: the constructed approximate distribution has relatively low variance. Disadvantages: clusters must be computed in advance, which reduces computational efficiency.
Importance sampling. Advantages: the constructed approximate distribution has the lowest variance and best matches the original distribution. Disadvantages: samples of low importance are rarely selected.
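To make the comparison in Table 2 concrete, here is a hedged sketch of the three samplers over a data matrix X. Picking the point nearest each k-means centroid for cluster sampling, and weighting draws by an externally supplied importance score, are illustrative choices; the helper names draw_random, draw_cluster, and draw_importance are likewise assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def draw_random(X, m, rng):
    """Uniform sampling: cheap, but high-variance across repeated draws."""
    return X[rng.choice(len(X), size=m, replace=False)]

def draw_cluster(X, m):
    """Cluster sampling: clusters must be computed first (Table 2's cost);
    the point nearest each of the m centroids is kept as a representative."""
    km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
    idx = [np.argmin(((X - c) ** 2).sum(axis=1)) for c in km.cluster_centers_]
    return X[idx]

def draw_importance(X, m, rng, scores):
    """Importance sampling: draw probability proportional to a score, so
    low-importance samples are rarely selected (Table 2's drawback)."""
    p = scores / scores.sum()
    return X[rng.choice(len(X), size=m, replace=False, p=p)]

# Usage: rng = np.random.default_rng(0); subset = draw_random(X, 100, rng)
```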

Figure 3: Comparison of sampling distributions

Table 3: Variance comparison of the sampling methods

Dataset          Original distribution   Random sampling   Neighborhood sampling   Importance sampling
breast cancer    0.0208                  0.0238            0.0217                  0.0125
mobile           0.1305                  0.1308            0.1301                  0.1263
diabetes         0.0258                  0.0249            0.0257                  0.0200
blood            0.0267                  0.0247            0.0237                  0.0154
raisin           0.0224                  0.0216            0.0222                  0.0111
Shill Bidding    0.1119                  0.1118            0.1106                  0.1036
Wine Quality     0.0180                  0.0180            0.0171                  0.0131
yeast            0.0138                  0.0133            0.0128                  0.0096
waveform         0.0241                  0.0246            0.0241                  0.0220
Debrecen         0.0320                  0.0314            0.0292                  0.0257
Mean             0.0426                  0.0424            0.0417                  0.0358
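This page does not state how the variance figures in Table 3 are computed. One plausible statistic, assumed here purely for illustration, is the mean per-feature variance after min-max scaling, evaluated on the full data ("original distribution") and on each sampled approximation; the helper name mean_feature_variance is an assumption.

```python
import numpy as np

def mean_feature_variance(X):
    """Mean per-feature variance of min-max scaled data (assumed statistic)."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                  # guard against constant features
    Z = (X - X.min(axis=0)) / span         # scale each feature to [0, 1]
    return Z.var(axis=0).mean()
```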

Figure 4: Comparison of different sampling ratios

Figure 5: Comparison of ten classification runs on three datasets

Table 4: Classification results of the three sampling methods

                 Random sampling             Cluster sampling            Importance sampling
Dataset          Max     Min     Mean        Max     Min     Mean        Max     Min     Mean
breast cancer    0.9772  0.9631  0.9712      0.9806  0.9719  0.9745      0.9806  0.9754  0.9777
mobile           0.9110  0.8865  0.8975      0.9100  0.8860  0.8983      0.9120  0.8875  0.9035
diabetes         0.7787  0.7566  0.7683      0.7774  0.7435  0.7674      0.7826  0.7631  0.7710
blood            0.7818  0.7604  0.7656      0.7912  0.7590  0.7668      0.7724  0.7631  0.7646
raisin           0.8644  0.8478  0.8572      0.8667  0.8522  0.8600      0.8700  0.8600  0.8653
Shill Bidding    0.9854  0.9813  0.9833      0.9869  0.9801  0.9839      0.9848  0.9821  0.9831
Wine Quality     0.5935  0.5735  0.5869      0.5929  0.5785  0.5879      0.5966  0.5860  0.5906
yeast            0.5725  0.5421  0.5605      0.5758  0.5557  0.5646      0.5725  0.5637  0.5673
waveform         0.8698  0.8650  0.8680      0.8718  0.8662  0.8693      0.8718  0.8676  0.8693
Debrecen         0.6765  0.6557  0.6623      0.6809  0.6557  0.6590      0.6870  0.6600  0.6738

Table 5: Comparison of multiple granulation methods on the datasets

Dataset          RF               RF_Fuzzy         RF_Condition     RF_Neighbor      RF_SAG
breast cancer    0.9614±0.0007    0.9596±0.0011    0.9632±0.0014    0.9631±0.0007    0.9667±0.0007
mobile           0.8755±0.0002    0.8915±0.0001    0.8915±0.0002    0.9090±0.0003    0.9405±0.0002
diabetes         0.7474±0.0025    0.7721±0.0025    0.7565±0.0025    0.7656±0.0019    0.7500±0.0020
blood            0.7363±0.0008    0.7470±0.0008    0.7483±0.0004    0.7377±0.0011    0.7536±0.0008
raisin           0.8556±0.0004    0.8544±0.0007    0.8611±0.0005    0.8578±0.0011    0.8656±0.0005
Shill Bidding    0.9959±0.0000    0.9975±0.0000    0.9975±0.0000    0.9984±0.0000    0.9847±0.0000
Wine Quality     0.6979±0.0010    0.7004±0.0009    0.6967±0.0005    0.7005±0.0016    0.7035±0.0008
yeast            0.6103±0.0006    0.6042±0.0011    0.6150±0.0008    0.5866±0.0011    0.6135±0.0009
waveform         0.8490±0.0002    0.8392±0.0001    0.8400±0.0002    0.8384±0.0002    0.8586±0.0002
Debrecen         0.6878±0.0003    0.6774±0.0011    0.6635±0.0012    0.6843±0.0025    0.6930±0.0002
Mean             0.8017±0.0007    0.8043±0.0008    0.8033±0.0008    0.8041±0.0010    0.8130±0.0006

Table 6: Comparison of multiple ensemble methods on the datasets

Dataset          RF               AdaBoost         HGB              XGBoost          MSAGEL
breast cancer    0.9614±0.0006    0.9667±0.0005    0.9684±0.0003    0.9789±0.0006    0.9842±0.0002
mobile           0.8825±0.0006    0.7210±0.0018    0.9120±0.0002    0.9205±0.0002    0.9610±0.0002
diabetes         0.7474±0.0025    0.7527±0.0033    0.7344±0.0014    0.7357±0.0019    0.8724±0.0022
blood            0.6643±0.0196    0.7873±0.0112    0.6883±0.0214    0.7444±0.0019    0.8488±0.0016
raisin           0.8600±0.0013    0.8544±0.0021    0.8467±0.0024    0.8511±0.0015    0.9444±0.0008
Shill Bidding    0.9911±0.0000    0.9913±0.0000    0.9962±0.0000    0.9972±0.0000    0.9994±0.0000
Wine Quality     0.5647±0.0016    0.5253±0.0050    0.5428±0.0008    0.6898±0.0022    0.8405±0.0005
yeast            0.6197±0.0018    0.4323±0.0004    0.5846±0.0011    0.5947±0.0008    0.7876±0.0009
waveform         0.8264±0.0002    0.8094±0.0004    0.8518±0.0004    0.8450±0.0003    0.9026±0.0002
Debrecen         0.6600±0.0023    0.6522±0.0011    0.7043±0.0024    0.7122±0.0005    0.8478±0.0004
Mean             0.7778±0.0030    0.7492±0.0026    0.7829±0.0030    0.8069±0.0010    0.8989±0.0007
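All four baselines in Table 6 are available off the shelf, so the comparison loop can be reproduced in outline. The sketch below uses scikit-learn and the xgboost package with default settings and 10-fold cross-validation on one Table 1 dataset; the paper's exact protocol and the MSAGEL implementation are not given on this page, so these choices are assumptions and the numbers will not match the table exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              HistGradientBoostingClassifier)
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier  # assumes the xgboost package is installed

X, y = load_breast_cancer(return_X_y=True)  # "breast cancer" in Table 1
baselines = {
    "RF": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "HGB": HistGradientBoostingClassifier(random_state=0),
    "XGBoost": XGBClassifier(random_state=0),
}
for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    # Report mean accuracy and variance across folds.
    print(f"{name}: {scores.mean():.4f} ± {scores.var():.4f}")
```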

Table 7: Multi-metric comparison of MSAGEL and XGBoost on the datasets

Dataset          Model      F1               Acc              Recall
breast cancer    MSAGEL     0.9810±0.0004    0.9847±0.0002    0.9785±0.0005
                 XGBoost    0.9737±0.0008    0.9742±0.0008    0.9738±0.0008
mobile           MSAGEL     0.9599±0.0001    0.9604±0.0001    0.9600±0.0001
                 XGBoost    0.9208±0.0002    0.9205±0.0002    0.9205±0.0002
diabetes         MSAGEL     0.8569±0.0009    0.8660±0.0009    0.8524±0.0011
                 XGBoost    0.7103±0.0014    0.7159±0.0014    0.7116±0.0017
blood            MSAGEL     0.7736±0.0012    0.8031±0.0005    0.7587±0.0019
                 XGBoost    0.5995±0.0050    0.6343±0.0095    0.5964±0.0038
raisin           MSAGEL     0.9444±0.0008    0.9448±0.0006    0.9444±0.0008
                 XGBoost    0.8508±0.0015    0.8536±0.0014    0.8511±0.0015
Shill Bidding    MSAGEL     0.9967±0.0000    0.9974±0.0000    0.9960±0.0000
                 XGBoost    0.9921±0.0000    0.9919±0.0000    0.9925±0.0001
Wine Quality     MSAGEL     0.7112±0.0017    0.7188±0.0034    0.7190±0.0019
                 XGBoost    0.3622±0.0029    0.3806±0.0065    0.3619±0.0024
yeast            MSAGEL     0.7369±0.0018    0.7680±0.0016    0.7393±0.0028
                 XGBoost    0.5267±0.0058    0.5560±0.0073    0.5255±0.0062
waveform         MSAGEL     0.9021±0.0003    0.9027±0.0003    0.9020±0.0003
                 XGBoost    0.8469±0.0003    0.8476±0.0003    0.8472±0.0003
Debrecen         MSAGEL     0.8472±0.0012    0.8488±0.0012    0.8480±0.0013
                 XGBoost    0.7113±0.0005    0.7131±0.0005    0.7126±0.0005
Mean             MSAGEL     0.8711±0.0008    0.8795±0.0009    0.8698±0.0011
                 XGBoost    0.7494±0.0018    0.7588±0.0028    0.7493±0.0018
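The three metrics in Table 7 map directly onto scikit-learn's scorers. Macro averaging is assumed below for the multi-class datasets, since the page does not state the averaging mode; the helper name report is illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def report(y_true, y_pred):
    """Return the Table 7 metrics for one run (averaging mode is an assumption)."""
    return {
        "F1": f1_score(y_true, y_pred, average="macro"),
        "Acc": accuracy_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred, average="macro"),
    }
```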