南京大学学报(自然科学版) ›› 2021, Vol. 57 ›› Issue (1): 150–159. doi: 10.13232/j.cnki.jnju.2021.01.016


共现邻域关系下的属性约简研究

毛振宇1, 窦慧莉1(), 宋晶晶1,3, 姜泽华1, 王平心2   

  1.江苏科技大学计算机学院,镇江,212003
    2.江苏科技大学理学院,镇江,212003
    3.数据科学与智能应用福建省高校重点实验室,漳州,363000
  • 收稿日期:2020-09-16 出版日期:2021-01-21 发布日期:2021-01-21
  • 通讯作者: 窦慧莉 E-mail:douhuili@163.com
  • 基金资助:
    国家自然科学基金(62076111);数据科学与智能应用福建省高校重点实验室开放课题(D1901)

Research on attribute reduction via co-occurrence neighborhood relation

Zhenyu Mao1, Huili Dou1(), Jingjing Song1,3, Zehua Jiang1, Pingxin Wang2   

  1.School of Computer,Jiangsu University of Science and Technology,Zhenjiang,212003,China
    2.School of Science,Jiangsu University of Science and Technology,Zhenjiang,212003,China
    3.Fujian Province University Key Laboratory of Data Science and Intelligent Application,Zhangzhou,363000,China
  • Received:2020-09-16 Online:2021-01-21 Published:2021-01-21
  • Contact: Huili Dou E-mail:douhuili@163.com

摘要:

在邻域粗糙集的研究中,往往借助给定的半径来约束样本之间的相似性进而实现邻域信息粒化,需要注意的是,若给定的半径较大,则不同类别的样本将落入同一邻域中,易引起邻域中信息的不精确或不一致.为改善这一问题,已有学者给出了伪标记邻域的策略,然而无论是传统邻域还是伪标记邻域,都仅仅使用样本间的距离来度量样本之间的相似性,忽略了邻域信息粒内部不同样本所对应的邻域之间的结构关系.鉴于此,通过引入邻域距离度量,提出一种共现邻域的信息粒化机制,并构造了新型的共现邻域以及伪标记共现邻域粗糙集模型,在此基础上使用前向贪心搜索策略实现了所构造的两种模型下的约简求解.实验结果表明,与传统邻域关系以及伪标记邻域关系所求得的约简相比,利用共现邻域方法求得的约简能够在不降低分类器准确率的前提下产生更高的约简率.

关键词: 粗糙集, 信息粒化, 邻域关系, 伪标记, 共现邻域

Abstract:

In the research of neighborhood rough sets, a radius is generally specified to constrain the similarities between samples so that neighborhood information granulation can be realized. It should be noticed that if the radius is too large, samples from different classes may fall into the same neighborhood, which may result in imprecise or inconsistent information. To alleviate this problem, the strategy of the pseudo-label neighborhood has been proposed. Nevertheless, in both the traditional neighborhood and the pseudo-label neighborhood, the similarities of samples are measured only by the distances between them, while the structural relationship among the neighborhoods of the different samples contained in one neighborhood information granule is ignored. In view of this, by introducing a measure of neighborhood distance, a mechanism of co-occurrence neighborhood information granulation is proposed. Based on this mechanism, a co-occurrence neighborhood rough set model and a pseudo-label co-occurrence neighborhood rough set model are constructed. Then, a forward greedy searching strategy is employed to obtain the corresponding reducts over the two models. The experimental results demonstrate that, compared with the reducts derived from the traditional neighborhood relation and the pseudo-label neighborhood relation, the reducts derived from the co-occurrence neighborhood provide a higher reduction ratio without decreasing the classification accuracy.
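The radius-constrained granulation and its pseudo-label refinement described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean metric, and the toy data are assumptions.

```python
import numpy as np

def neighborhood(X, i, delta):
    """Indices of all samples whose Euclidean distance to sample i is at
    most delta, i.e. the usual delta-neighborhood information granule."""
    dist = np.linalg.norm(X - X[i], axis=1)
    return set(map(int, np.flatnonzero(dist <= delta)))

def pseudo_label_neighborhood(X, pseudo, i, delta):
    """Pseudo-label variant: keep only neighbors sharing sample i's
    pseudo label (e.g., an assignment produced by clustering)."""
    return {j for j in neighborhood(X, i, delta) if pseudo[j] == pseudo[i]}

# toy data: four one-dimensional samples with two pseudo labels
X = np.array([[0.0], [0.1], [0.5], [0.9]])
pseudo = np.array([0, 0, 1, 1])
print(neighborhood(X, 0, 0.2))                       # -> {0, 1}
print(pseudo_label_neighborhood(X, pseudo, 2, 0.5))  # -> {2, 3}
```

The pseudo-label filter shows why a large radius is less harmful in that setting: neighbors with a different (pseudo) class are discarded even when they fall inside the ball.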

Key words: rough set, information granulation, neighborhood relation, pseudo-label, co-occurrence neighborhood
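The forward greedy searching strategy mentioned in the abstract can be sketched generically. This is a hypothetical sketch: `toy_significance` is a stand-in for the co-occurrence neighborhood-based significance measure, which the abstract does not specify.

```python
def forward_greedy_reduct(attributes, significance):
    """Forward greedy search: repeatedly add the attribute with the
    largest significance gain; stop when no attribute improves the score.
    `significance` maps a frozenset of attributes to a value in [0, 1]."""
    reduct, best = [], significance(frozenset())
    while True:
        gains = {a: significance(frozenset(reduct) | {a})
                 for a in attributes if a not in reduct}
        if not gains:
            break
        a_star = max(gains, key=gains.get)
        if gains[a_star] <= best:  # no further improvement
            break
        reduct.append(a_star)
        best = gains[a_star]
    return reduct

# toy significance measure: only attributes 'a' and 'c' matter
def toy_significance(subset):
    return len(subset & {'a', 'c'}) / 2

print(forward_greedy_reduct(['a', 'b', 'c'], toy_significance))  # -> ['a', 'c']
```

Because the search stops as soon as no attribute raises the score, the irrelevant attribute 'b' is never admitted, which is exactly what drives the reduction ratios reported in the experiments.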

中图分类号: TP18

Figure 1 Neighborhood structure

Table 1 Description of the datasets

ID   Dataset                                                   Samples   Attributes   Decision classes
1    Breast Cancer Wisconsin (Diagnostic)                      569       30           2
2    Cardiotocography                                          2126      21           10
3    Connectionist Bench (Vowel Recognition-Deterding Data)    990       13           11
4    Dermatology                                               366       34           6
5    Fertility                                                 100       9            2
6    Lymphography                                              98        18           3
7    Sonar                                                     208       60           2
8    Statlog (Heart)                                           270       13           2
9    Synthetic Control Chart Time Series                       600       60           6
10   Urban Land Cover                                          675       147          9
11   Vertebral Column                                          310       6            2
12   Wine                                                      178       13           3

Table 2 Reduction ratios of the four methods under different thresholds

ID     NR(ε=1)   CNR(ε=1)   NR(ε=0.95)   CNR(ε=0.95)   PLNR(ε=1)   PLCNR(ε=1)   PLNR(ε=0.95)   PLCNR(ε=0.95)
Mean   0.4173    0.4701     0.4940       0.5712        0.6299      0.6369       0.6571         0.6852
1      0.3153    0.4268     0.4943       0.5907        0.8880      0.8983       0.9007         0.9157
2      0.1043    0.1776     0.2752       0.3595        0.6305      0.5952       0.6143         0.6229
3      0.1300    0.1869     0.1985       0.2854        0.6923      0.6508       0.6954         0.6785
4      0.6474    0.7071     0.7168       0.7694        0.6782      0.7088       0.7591         0.7747
5      0.1789    0.2422     0.2711       0.3878        0.1411      0.1822       0.2078         0.3122
6      0.5472    0.5972     0.5733       0.6356        0.4900      0.5283       0.5220         0.6000
7      0.8128    0.8513     0.8400       0.8810        0.7918      0.8313       0.8082         0.8518
8      0.2108    0.2577     0.3208       0.4123        0.2123      0.2115       0.2708         0.3377
9      0.7935    0.8107     0.8228       0.8557        0.8700      0.8787       0.8797         0.8720
10     0.8790    0.8814     0.9062       0.9276        0.9344      0.9385       0.9429         0.9475
11     0.0133    0.0800     0.0500       0.1417        0.7017      0.6633       0.7133         0.6783
12     0.3754    0.5069     0.4592       0.6077        0.5285      0.5562       0.5715         0.6308
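For reference, the reduction ratios in Table 2 follow the standard definition in attribute reduction: the fraction of condition attributes discarded by a reduct. This is a reconstruction of the usual formula, not taken from the paper, and the numbers below are illustrative only.

```python
def reduction_ratio(n_raw, n_kept):
    """Fraction of condition attributes removed by a reduct."""
    return 1 - n_kept / n_raw

# e.g., a hypothetical reduct keeping 4 of the 30 attributes of dataset 1
print(round(reduction_ratio(30, 4), 4))  # -> 0.8667
```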

Table 3 Mean KNN classification accuracies of the four methods under different thresholds

ID     NR(ε=1)   CNR(ε=1)   NR(ε=0.95)   CNR(ε=0.95)   PLNR(ε=1)   PLCNR(ε=1)   PLNR(ε=0.95)   PLCNR(ε=0.95)
Mean   0.8361    0.8411     0.8362       0.8363        0.7662      0.7935       0.7635         0.7883
1      0.9621    0.9605     0.9591       0.9554        0.9128      0.9110       0.9157         0.9071
2      0.7628    0.7662     0.7602       0.7665        0.6137      0.6762       0.6078         0.6743
3      0.9545    0.9513     0.9441       0.9354        0.5503      0.6684       0.5676         0.6522
4      0.9212    0.9241     0.9236       0.9191        0.8939      0.9137       0.8704         0.8984
5      0.8615    0.8630     0.8605       0.8630        0.8630      0.8625       0.8580         0.8535
6      0.7596    0.7768     0.7736       0.7786        0.7342      0.7603       0.7384         0.7423
7      0.7797    0.7733     0.7772       0.7704        0.7806      0.7662       0.7655         0.7627
8      0.7709    0.7731     0.7754       0.7681        0.7756      0.7813       0.7644         0.7694
9      0.8294    0.8463     0.8209       0.8268        0.7242      0.7973       0.7319         0.8163
10     0.7191    0.7363     0.7215       0.7301        0.6734      0.6956       0.6749         0.7045
11     0.7650    0.7769     0.7734       0.7853        0.7461      0.7513       0.7429         0.7495
12     0.9479    0.9458     0.9451       0.9368        0.9262      0.9382       0.9241         0.9294

Table 4 Mean SVM classification accuracies of the four methods under different thresholds

ID     NR(ε=1)   CNR(ε=1)   NR(ε=0.95)   CNR(ε=0.95)   PLNR(ε=1)   PLCNR(ε=1)   PLNR(ε=0.95)   PLCNR(ε=0.95)
Mean   0.7643    0.7652     0.7613       0.7603        0.7244      0.7420       0.7203         0.7399
1      0.9504    0.9506     0.9509       0.9517        0.8990      0.9084       0.9031         0.9020
2      0.6951    0.6985     0.6832       0.6907        0.5652      0.5959       0.5600         0.5985
3      0.3133    0.3317     0.3156       0.3390        0.2537      0.3027       0.2627         0.2994
4      0.9224    0.9120     0.9213       0.9046        0.8910      0.8990       0.8595         0.8842
5      0.8800    0.8800     0.8800       0.8800        0.8800      0.8800       0.8800         0.8800
6      0.7542    0.7650     0.7445       0.7533        0.7363      0.7718       0.7313         0.7724
7      0.7145    0.7039     0.7163       0.7043        0.7190      0.7049       0.7152         0.7105
8      0.8000    0.7937     0.7961       0.7781        0.8028      0.8059       0.7972         0.7919
9      0.8448    0.8507     0.8413       0.8336        0.7193      0.7887       0.7286         0.8062
10     0.6782    0.6877     0.6781       0.6942        0.6278      0.6442       0.6201         0.6474
11     0.6715    0.6732     0.6724       0.6747        0.6781      0.6781       0.6785         0.6784
12     0.9477    0.9351     0.9360       0.9200        0.9200      0.9240       0.9079         0.9080

Table 5 Maximum KNN classification accuracies of the four methods under different thresholds

ID     NR(ε=1)   CNR(ε=1)   NR(ε=0.95)   CNR(ε=0.95)   PLNR(ε=1)   PLCNR(ε=1)   PLNR(ε=0.95)   PLCNR(ε=0.95)
Mean   0.8678    0.8827     0.8739       0.8808        0.8757      0.8744       0.8710         0.8666
1      0.9701    0.9719     0.9719       0.9701        0.9543      0.9508       0.9595         0.9561
2      0.7799    0.7827     0.7864       0.7973        0.7963      0.7977       0.7930         0.7907
3      0.9646    0.9657     0.9626       0.9636        0.9414      0.9404       0.9293         0.9354
4      0.9564    0.9645     0.9535       0.9509        0.9508      0.9454       0.9537         0.9346
5      0.8800    0.8800     0.8900       0.8900        0.8800      0.8900       0.8900         0.8900
6      0.8053    0.8258     0.8163       0.8253        0.8158      0.8158       0.8153         0.7847
7      0.8367    0.8606     0.8369       0.8609        0.8368      0.8272       0.8225         0.8269
8      0.8111    0.8185     0.8222       0.8407        0.8222      0.8185       0.8148         0.8185
9      0.9333    0.9317     0.9200       0.9150        0.9333      0.9250       0.8983         0.9017
10     0.7363    0.7911     0.7541       0.7556        0.7956      0.8000       0.8059         0.7911
11     0.7742    0.8226     0.8065       0.8226        0.8097      0.8097       0.8097         0.8097
12     0.9660    0.9775     0.9660       0.9773        0.9716      0.9717       0.9602         0.9600

Table 6 Maximum SVM classification accuracies of the four methods under different thresholds

ID     NR(ε=1)   CNR(ε=1)   NR(ε=0.95)   CNR(ε=0.95)   PLNR(ε=1)   PLCNR(ε=1)   PLNR(ε=0.95)   PLCNR(ε=0.95)
Mean   0.7967    0.7992     0.7998       0.8018        0.7986      0.8016       0.8042         0.7973
1      0.9543    0.9561     0.9578       0.9578        0.9526      0.9508       0.9508         0.9490
2      0.7093    0.7084     0.7112       0.7140        0.7121      0.7074       0.7182         0.7084
3      0.3848    0.3879     0.3899       0.4273        0.3778      0.4020       0.4253         0.3828
4      0.9482    0.9591     0.9563       0.9427        0.9591      0.9373       0.9346         0.9372
5      0.8800    0.8800     0.8800       0.8800        0.8800      0.8800       0.8800         0.8800
6      0.8263    0.8163     0.8058       0.8053        0.8274      0.8368       0.8363         0.8374
7      0.7455    0.7595     0.7552       0.7646        0.7746      0.7691       0.7743         0.7646
8      0.8259    0.8333     0.8333       0.8333        0.8259      0.8333       0.8296         0.8370
9      0.9100    0.9100     0.9050       0.9083        0.8833      0.9050       0.8950         0.8933
10     0.7274    0.7274     0.7452       0.7363        0.7407      0.7511       0.7526         0.7363
11     0.6774    0.6806     0.6806       0.6806        0.6839      0.6806       0.6871         0.6806
12     0.9716    0.9719     0.9773       0.9717        0.9659      0.9660       0.9662         0.9605
References

1 Hu Q H, Yu D R, Xie Z X. Numerical attribute reduction based on neighborhood granulation and rough approximation. Journal of Software, 2008, 19(3): 640-649.
2 Zhao H, Wang P, Hu Q H. Cost-sensitive feature selection based on adaptive neighborhood granularity with multi-level confidence. Information Sciences, 2016, 366: 134-149.
3 Ferone A. Feature selection based on composition of rough sets induced by feature granulation. International Journal of Approximate Reasoning, 2018, 101: 276-292.
4 Liu K Y, Yang X B, Fujita H, et al. An efficient selector for multi-granularity attribute reduction. Information Sciences, 2019, 505: 457-472.
5 Yang X B, Liang S C, Yu H L, et al. Pseudo-label neighborhood rough set: measures and attribute reductions. International Journal of Approximate Reasoning, 2019, 105: 112-129.
6 Jiang H B, Chen Y M. Neighborhood granule classifiers. Applied Sciences, 2018, 8(12): 2646.
7 Yao Y Y, Zhang X Y. Class-specific attribute reducts in rough set theory. Information Sciences, 2017, 418-419: 601-618.
8 Yang X B, Yao Y Y. Ensemble selector for attribute reduction. Applied Soft Computing, 2018, 70: 1-11.
9 Gao Y, Chen X J, Wang P X, et al. Attribute reduction over consistent samples. CAAI Transactions on Intelligent Systems, 2019, 14(6): 1170-1178.
10 Li J Z, Yang X B, Song X N, et al. Neighborhood attribute reduction: a multi-criterion approach. International Journal of Machine Learning and Cybernetics, 2017, 10(4): 731-742.
11 Jiang Z H, Wang Y B, Xu G, et al. Multi-scale based accelerator for attribute reduction. Computer Science, 2019, 46(12): 250-256.
12 Hu Q H, Yu D R, Liu J F, et al. Neighborhood rough set based heterogeneous feature subset selection. Information Sciences, 2008, 178(18): 3577-3594.
13 Liu K Y, Yang X B, Yu H L, et al. Rough set based semi-supervised feature selection via ensemble selector. Knowledge-Based Systems, 2019, 165: 282-296.
14 Xu S P, Yang X B, Yu H L, et al. Multi-label learning with label-specific feature reduction. Knowledge-Based Systems, 2016, 104: 52-61.
15 Zhang W D, Qi H, Liu K Y, et al. Over-fitting and its countermeasure in feature selection based on rough set. Journal of Nanjing University of Aeronautics & Astronautics, 2019, 51(5): 687-692.
16 Song J J, Tsang E C C, Chen D G, et al. Minimal decision cost reduct in fuzzy decision-theoretic rough set model. Knowledge-Based Systems, 2017, 126: 104-112.
17 Li J Z, Yang X B, Wang P X, et al. Stable attribute reduction approach for fuzzy rough set. Journal of Nanjing University of Science and Technology, 2018, 42(1): 68-75.
18 Hu Q H, Pedrycz W, Yu D R, et al. Selecting discrete and continuous features based on neighborhood decision error minimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2010, 40(1): 137-150.
19 Chen Y, Liu K Y, Song J J, et al. Attribute group for attribute reduction. Information Sciences, 2020, 535: 64-80.
20 Ślęzak D. Approximate entropy reducts. Fundamenta Informaticae, 2002, 53: 365-390.
21 Zhang X, Mei C L, Chen D G, et al. Feature selection in mixed data: a method using a novel fuzzy rough set-based information entropy. Pattern Recognition, 2016, 56: 1-15.