Journal of Nanjing University (Natural Science) ›› 2024, Vol. 60 ›› Issue (1): 106–117. doi: 10.13232/j.cnki.jnju.2024.01.011

Neighborhood-tolerance mutual information selection ensemble classification algorithm for incomplete data sets

Lihong Li1,2,3, Hongyao Dong1,2,3,6, Wenjie Liu4, Baolin Li1,2,3, Qi Dai5

  1. College of Science, North China University of Science and Technology, Tangshan, 063210, China
  2. Hebei Province Key Laboratory of Data Science and Application, North China University of Science and Technology, Tangshan, 063210, China
  3. Tangshan Key Laboratory of Engineering Computing, North China University of Science and Technology, Tangshan, 063210, China
  4. College of Artificial Intelligence, North China University of Science and Technology, Tangshan, 063210, China
  5. Department of Automation, China University of Petroleum (Beijing), Beijing, 102249, China
  6. Shougang Mining Workers' Children School, Tangshan, 064404, China
  • Received: 2023-09-27 Online: 2024-01-30 Published: 2024-01-29
  • Contact: Lihong Li E-mail: 22687426@qq.com
  • Supported by:
    Hebei Province Key Laboratory of Data Science and Application Project (10120201); Tangshan Key Laboratory of Data Science Project (10120301)

Abstract:

To address the classification problem in incomplete mixed information systems, this paper combines the neighborhood-tolerance relation from granular computing with mutual information theory to define neighborhood-tolerance mutual information and, drawing on ensemble learning, proposes a neighborhood-tolerance mutual information selection ensemble classification algorithm for incomplete data sets. The algorithm first derives information granules from the missing attributes and partitions them into granular layers to build a granular space; on each layer, an ensemble algorithm with BP neural networks as base classifiers is used to construct a new base classifier. Next, for each information granule, the neighborhood-tolerance mutual information with respect to the class attribute is computed from the granule's missing attributes to measure the granule's importance, and the base-classifier weights are redefined from both the prediction accuracy of the base classifiers and the neighborhood-tolerance mutual information. Finally, the base classifiers are combined by weighted ensemble according to the sample to be predicted, and the resulting classification is compared with traditional ensemble classification algorithms. On some incomplete mixed data sets, the proposed algorithm effectively improves classification accuracy.

Key words: incomplete hybrid information system, neighborhood-tolerance mutual information, ensemble learning, classification

CLC Number: TP181

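Figure 1

A general ensemble learning framework

Figure 2

Training flowchart of the selection ensemble classification based on neighborhood-tolerance mutual information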

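Table 1

An incomplete mixed data set

| Sample | Attribute a1 | Attribute a2 | Attribute a3 | Attribute a4 | Attribute Y |
|---|---|---|---|---|---|
| x1 | 0.15 | 1 | 1 | 0.2 | 1 |
| x2 | 0.7 | 0 | 0 | * | 1 |
| x3 | 0.2 | * | * | 0.5 | 1 |
| x4 | 0.3 | 0 | 0 | 0.7 | 2 |
| x5 | 0.8 | 0 | 0 | 0.8 | 0 |
| x6 | 0.85 | 0 | * | * | 0 |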
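
The abstract says that information granules are derived from the missing attributes. As a minimal sketch of that first step, the Python below groups the samples of Table 1 by their exact missing-attribute pattern. It assumes that `*` marks a missing value and that one granule corresponds to one pattern; the paper's actual granule construction may differ, and the helper names `missing_pattern` and `group_by_missing_pattern` are hypothetical.

```python
from collections import defaultdict

MISSING = "*"  # assumed marker for a missing value, as it appears in Table 1

# Table 1: sample -> (a1, a2, a3, a4); the class attribute Y is not needed here.
samples = {
    "x1": (0.15, 1, 1, 0.2),
    "x2": (0.7, 0, 0, MISSING),
    "x3": (0.2, MISSING, MISSING, 0.5),
    "x4": (0.3, 0, 0, 0.7),
    "x5": (0.8, 0, 0, 0.8),
    "x6": (0.85, 0, MISSING, MISSING),
}
attributes = ("a1", "a2", "a3", "a4")


def missing_pattern(values):
    """Return the tuple of attribute names whose value is missing."""
    return tuple(a for a, v in zip(attributes, values) if v == MISSING)


def group_by_missing_pattern(data):
    """Group sample names by their exact missing-attribute pattern (assumed granule rule)."""
    granules = defaultdict(list)
    for name, values in data.items():
        granules[missing_pattern(values)].append(name)
    return dict(granules)


granules = group_by_missing_pattern(samples)
for pattern, members in granules.items():
    # An empty pattern means the sample has no missing values.
    print(pattern or "(complete)", members)
```

Under this grouping rule, the complete samples form one group and each distinct set of missing attributes forms another, which is the unit on which a base classifier would then be trained.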

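Table 2

Details of the data sets

| Data set | Samples | Classes | Attributes | Continuous attributes | Discrete attributes | Missing-value rate |
|---|---|---|---|---|---|---|
| Housing loan | 614 | 2 | 13 | 5 | 8 | 1.81% |
| Adult | 32561 | 2 | 14 | 6 | 8 | 0.87% |
| Credit | 690 | 2 | 15 | 6 | 9 | 0.61% |
| Mushroom | 8124 | 2 | 22 | 0 | 22 | 1.33% |
| Water quality | 3276 | 2 | 9 | 9 | 0 | 4.38% |
| E-commerce transportation | 10999 | 2 | 10 | 6 | 4 | 0 |
| Shill Bidding | 6321 | 2 | 12 | 11 | 1 | 0 |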

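Table 3

Neighborhood-tolerance mutual information of the missing attributes in the Housing loan data set

| Missing attributes of the information granule | Neighborhood-tolerance mutual information | Base classifier prediction weight |
|---|---|---|
| a1 | 0.1253 | 0.1451 |
| a3 | 0.2583 | 0.1315 |
| a5 | 0.1143 | 0.1527 |
| a8 | 0.1088 | 0.1460 |
| a9 | 0.1233 | 0.1480 |
| a10 | 0.1699 | 0.1518 |
| a5, a10 | 0.2316 | 0.1249 |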

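Table 4

Neighborhood-tolerance mutual information of the missing attributes in the Adult data set

| Missing attributes of the information granule | Neighborhood-tolerance mutual information | Base classifier prediction weight |
|---|---|---|
| a6 | 0.3327 | 0.1945 |
| a13 | 0.0895 | 0.2712 |
| a1, a6 | 0.2157 | 0.3451 |
| a1, a6, a13 | 0.3661 | 0.1891 |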
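
The abstract describes a final step in which the base classifiers are combined by a weighted ensemble. The sketch below illustrates plain weighted voting with the Adult weights listed above. The base-classifier outputs are hypothetical values invented for the illustration, and the rule for selecting which classifiers vote for a given sample is not stated on this page, so all four are assumed to vote; `weighted_vote` is a hypothetical helper, not the paper's procedure.

```python
from collections import defaultdict

# Base-classifier weights for the Adult data set, keyed by the granule's missing attributes.
weights = {
    ("a6",): 0.1945,
    ("a13",): 0.2712,
    ("a1", "a6"): 0.3451,
    ("a1", "a6", "a13"): 0.1891,
}


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight, plus the per-label totals."""
    totals = defaultdict(float)
    for granule, label in predictions.items():
        totals[label] += weights[granule]
    return max(totals, key=totals.get), dict(totals)


# Hypothetical base-classifier outputs for one test sample (not taken from the paper).
predictions = {
    ("a6",): 1,
    ("a13",): 0,
    ("a1", "a6"): 1,
    ("a1", "a6", "a13"): 0,
}

label, totals = weighted_vote(predictions, weights)
print(label, totals)
```

With these made-up votes the two classifiers predicting class 1 carry more total weight than the two predicting class 0, so the ensemble outputs class 1 even though the vote count is tied.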

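Table 5

Neighborhood-tolerance mutual information of the missing attributes in the Credit data set

| Missing attributes of the information granule | Neighborhood-tolerance mutual information | Base classifier prediction weight |
|---|---|---|
| a1 | 0.2145 | 0.1639 |
| a2 | 0.2859 | 0.1506 |
| a14 | 0.2756 | 0.1502 |
| a1, a14 | 0.3940 | 0.1414 |
| a6, a7 | 0.4671 | 0.1327 |
| a1, a6, a7 | 0.4913 | 0.1296 |
| a4, a5, a6, a7, a14 | 0.4822 | 0.1317 |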

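Table 6

Neighborhood-tolerance mutual information of the missing attributes in the Water quality data set

| Missing attributes of the information granule | Neighborhood-tolerance mutual information | Base classifier prediction weight |
|---|---|---|
| a1 | 0.1330 | 0.1643 |
| a5 | 0.0996 | 0.0778 |
| a8 | 0.0072 | 0.0262 |
| a1, a5 | 0.1465 | 0.1534 |
| a1, a8 | 0.1678 | 0.2756 |
| a5, a8 | 0.1556 | 0.1762 |
| a1, a5, a8 | 0.2365 | 0.1256 |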

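Table 7

Prediction accuracy of different classifiers on different data sets

| Data set | Missing-value rate | NTMISECA | XGBoost | RF | GBDT | AdaBoost | Stacking |
|---|---|---|---|---|---|---|---|
| Housing loan | 1.81% | 81.6265% | 75.4599% | 76.5934% | 71.7033% | 77.4725% | 75.8791% |
| Adult | 0.87% | 83.8565% | 83.3838% | 85.7027% | 86.2422% | 85.8737% | 86.4469% |
| Credit | 0.61% | 87.5300% | 87.6712% | 85.0481% | 82.7885% | 82.2115% | 84.2788% |
| Mushroom | 1.33% | 100% | 99.7538% | 100% | 100% | 100% | 100% |
| Water quality | 4.38% | 68.8459% | 65.3103% | 65.9207% | 66.4293% | 63.0722% | 68.4639% |
| E-commerce transportation | 5% | 65.6758% | 63.3758% | 64.8694% | 65.1122% | 64.5684% | 62.7886% |
| | 10% | 65.9581% | 63.1212% | 64.3565% | 64.6657% | 63.6746% | 59.9897% |
| Shill Bidding | 5% | 94.8156% | 94.3115% | 93.2627% | 93.2212% | 92.1678% | 94.1223% |
| | 10% | 93.2296% | 90.1354% | 90.2319% | 92.2376% | 91.1324% | 93.0034% |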

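Table 8

Friedman ranks and post hoc test results for all classifiers

| Classifier | Friedman rank | Unadjusted p | Adjusted p |
|---|---|---|---|
| NTMISECA | 2.2222 | - | - |
| Stacking | 3.3333 | 0.207712 | 0.261440 |
| RF | 3.5556 | 0.130570 | 0.261440 |
| GBDT | 3.7778 | 0.077760 | 0.233280 |
| XGBoost | 4.0000 | 0.043820 | 0.175279 |
| AdaBoost | 4.1111 | 0.032210 | 0.161048 |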
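
This page does not name the post hoc procedure behind the p-values. As a sketch of one common choice, the Python below applies the z-test on differences of Friedman average ranks against NTMISECA as the control, followed by Holm's step-down adjustment. Both the procedure and `N = 9` (one block per accuracy row in the comparison table) are assumptions, and `unadjusted_p` and `holm` are hypothetical helper names. Under these assumptions the output agrees with the tabulated p-values to about three decimal places.

```python
import math

K = 6  # number of classifiers compared
N = 9  # assumption: one Friedman block per row of the accuracy comparison table

# Friedman average ranks as printed in the table above.
ranks = {
    "NTMISECA": 2.2222,
    "Stacking": 3.3333,
    "RF": 3.5556,
    "GBDT": 3.7778,
    "XGBoost": 4.0000,
    "AdaBoost": 4.1111,
}
CONTROL = "NTMISECA"


def unadjusted_p(rank, control_rank, k=K, n=N):
    """Two-sided p-value of the z-test on a difference of Friedman average ranks."""
    se = math.sqrt(k * (k + 1) / (6.0 * n))
    z = abs(rank - control_rank) / se
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


def holm(p_values):
    """Holm step-down adjustment; returns adjusted p-values keyed like the input."""
    ordered = sorted(p_values, key=p_values.get)
    m = len(ordered)
    adjusted, running_max = {}, 0.0
    for i, name in enumerate(ordered):
        # Multiply the i-th smallest p by (m - i) and enforce monotonicity.
        running_max = max(running_max, (m - i) * p_values[name])
        adjusted[name] = min(1.0, running_max)
    return adjusted


raw = {
    name: unadjusted_p(rank, ranks[CONTROL])
    for name, rank in ranks.items()
    if name != CONTROL
}
adj = holm(raw)
for name in raw:
    print(f"{name:9s} p={raw[name]:.4f} adjusted={adj[name]:.4f}")
```

Read this way, only the unadjusted comparisons with XGBoost and AdaBoost fall below 0.05, and none of the adjusted p-values do.
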