Journal of Nanjing University (Natural Science) ›› 2022, Vol. 58 ›› Issue (3): 506–518. doi: 10.13232/j.cnki.jnju.2022.03.014


Online streaming feature selection method based on hierarchical class neighborhood rough set

Yixiang Zeng1,2, Yaojin Lin1,2, Kaijun Fan1,2, Boru Zeng1,2

  1. School of Computer Science and Engineering, Minnan Normal University, Zhangzhou, 363000, China
  2. Key Laboratory of Data Science and Intelligence Application, Minnan Normal University, Zhangzhou, 363000, China
  • Received: 2022-03-02 Online: 2022-05-30 Published: 2022-06-07
  • Contact: Yaojin Lin E-mail: zzlinyaojin@163.com
  • Supported by:
    National Natural Science Foundation of China (62076116); Natural Science Foundation of Fujian Province (2021J02049)

Abstract:

Online streaming feature selection is an effective way to reduce the dimensionality of the feature space in open and dynamic environments. Existing online streaming feature selection methods can select a good feature subset, but they generally ignore the hierarchical structure that may exist among the classes. To address this problem, we propose an online streaming feature selection method based on the hierarchical class neighborhood rough set. First, we introduce into the neighborhood rough set a neighborhood relation based on the hierarchical nearest miss (the nearest sample of a different class), which avoids having to choose a neighborhood granularity. The hierarchical structure is then used to compute the hierarchical dependence of the labels on the features, which generalizes the neighborhood rough set model to hierarchical class data. Second, we define three online feature evaluation functions based on hierarchical dependence and design a hierarchical feature selection framework consisting of online relevance selection, online importance calculation, and online redundancy update. Finally, experiments on six hierarchical class datasets and eight flat single-label datasets show that the proposed algorithm outperforms state-of-the-art online streaming feature selection methods.

Key words: online streaming feature selection, neighborhood rough set, hierarchical classification, hierarchical dependence

CLC number:

  • TP181

Fig. 1  Hierarchical class structure of the VOC dataset

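Table 1  Notation for hierarchically structured data

| Symbol | Meaning |
|---|---|
| H | A tree-shaped hierarchical structure |
| D | The set of classes |
| U | The set of samples |
| H_i | All classes at level i |
| U_i | The set of samples at level i |
| U_ij | The set of samples of the j-th class obtained by partitioning U_i at level i using the hierarchy information |
| \|H\| | The number of levels in the hierarchical structure |
| C | The number of all features |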

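Table 2  Example of sample data with a hierarchical class structure

| U | C | D |
|---|---|---|
| x1 | 0.1 | Class 1 |
| x2 | 0.2 | Class 1 |
| x3 | 0.4 | Class 2 |
| x4 | 0.6 | Class 3 |
| x5 | 0.8 | Class 4 |
| x6 | 0.75 | Class 3 |
| x7 | 0.81 | Class 4 |
| x8 | 0.6 | Class 2 |
| x9 | 0.82 | Class 4 |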
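As a point of reference for the neighborhood rough set mentioned in the abstract, the following minimal sketch computes the classical fixed-radius neighborhood dependency on the sample data of Table 2. It treats the classes as flat and uses a hypothetical radius `delta = 0.03`; it is not the hierarchical nearest-miss relation proposed in the paper, whose definition is not given on this page. The function name `neighborhood_dependency` is an illustrative choice.

```python
from typing import List, Tuple

# Sample data from Table 2: (sample, feature value C, class D).
SAMPLES: List[Tuple[str, float, str]] = [
    ("x1", 0.10, "Class 1"), ("x2", 0.20, "Class 1"), ("x3", 0.40, "Class 2"),
    ("x4", 0.60, "Class 3"), ("x5", 0.80, "Class 4"), ("x6", 0.75, "Class 3"),
    ("x7", 0.81, "Class 4"), ("x8", 0.60, "Class 2"), ("x9", 0.82, "Class 4"),
]


def neighborhood_dependency(
    samples: List[Tuple[str, float, str]], delta: float
) -> Tuple[List[str], float]:
    """Return the positive region and the dependency for a fixed radius.

    Assumption: classical flat-class neighborhood rough set, where a sample
    belongs to the positive region when every sample within distance delta
    of it carries the same class.
    """
    positive = []
    for name, value, label in samples:
        # Classes found in the delta-neighborhood of this sample.
        neighbor_labels = {lab for _, v, lab in samples if abs(v - value) <= delta}
        if neighbor_labels == {label}:
            positive.append(name)
    return positive, len(positive) / len(samples)


positive, gamma = neighborhood_dependency(SAMPLES, delta=0.03)
print(positive)          # ['x1', 'x2', 'x3', 'x5', 'x6', 'x7', 'x9']
print(round(gamma, 4))   # 0.7778
```

With this radius, x4 and x8 share the same feature value but belong to different classes, so both fall outside the positive region. The result depends on the chosen `delta`, which is the granularity choice that the abstract says the nearest-miss relation is meant to avoid.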

Fig. 2  Schematic diagram of the class hierarchy

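Table 3  Description of the hierarchical datasets

| Name | Samples | Features | Internal nodes | Leaf nodes | Levels |
|---|---|---|---|---|---|
| AWA | 6405 | 252 | 17 | 10 | 3 |
| Bridges | 108 | 12 | 8 | 6 | 2 |
| Cifar | 50000 | 512 | 121 | 100 | 2 |
| DD | 3020 | 473 | 32 | 27 | 2 |
| F194 | 7015 | 473 | 202 | 194 | 2 |
| VOC | 7178 | 1000 | 30 | 88 | 4 |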

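Table 4  Description of the flat single-label datasets

| Name | Samples | Features | Classes |
|---|---|---|---|
| AUTOVAL_B | 414 | 72 | 5 |
| COLON | 62 | 2000 | 2 |
| CAR | 174 | 9182 | 11 |
| GENE2 | 174 | 12533 | 11 |
| GLIOMA | 50 | 4434 | 4 |
| LEUKEMIA | 72 | 7129 | 2 |
| SRBCT | 83 | 2308 | 4 |
| WARPAR10P | 103 | 2400 | 10 |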

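Fig. 3  Effect of different values of λ on the average prediction accuracy of OFS-HCRS

Fig. 4  Effect of different values of λ on the TIE of OFS-HCRS

Fig. 5  Effect of different values of λ on the LCA-F1 of OFS-HCRS

Fig. 6  Effect of different values of λ on the H-F1 of OFS-HCRS

Fig. 7  Effect of the streaming feature order on the average prediction accuracy of OFS-HCRS

Fig. 8  Effect of the streaming feature order on the TIE of OFS-HCRS

Fig. 9  Effect of the streaming feature order on the LCA-F1 of OFS-HCRS

Fig. 10  Effect of the streaming feature order on the H-F1 of OFS-HCRS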

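Table 5  Average prediction accuracy (↑) of OFS-HCRS and six comparison algorithms on the hierarchical datasets

| Dataset | OSFS | FOSFS | SAOLA | K-OFSD | OFSD | OFS-A3M | OFS-HCRS |
|---|---|---|---|---|---|---|---|
| Average | 0.2924 | 0.3026 | 0.2672 | 0.2912 | 0.2631 | 0.3763 | 0.4873 |
| AWA | 0.208 | 0.2195 | 0.1781 | 0.2234 | 0.1867 | 0.2512 | 0.3038 |
| Bridges | 0.63 | 0.63 | 0.63 | 0.5533 | 0.5644 | 0.6256 | 0.63 |
| Cifar | 0.0674 | 0.0747 | 0.0201 | 0.0925 | 0.1289 | 0.2710 | 0.2735 |
| DD | 0.3704 | 0.3707 | 0.2929 | 0.4226 | 0.3079 | 0.7054 | 0.7925 |
| F194 | 0.2197 | 0.2537 | 0.2252 | 0.1748 | 0.101 | 0.0879 | 0.572 |
| VOC | 0.2594 | 0.2671 | 0.2568 | 0.2807 | 0.2898 | 0.3165 | 0.3521 |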

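Table 6  TIE (↓) of OFS-HCRS and six comparison algorithms on the hierarchical datasets

| Dataset | OSFS | FOSFS | SAOLA | K-OFSD | OFSD | OFS-A3M | OFS-HCRS |
|---|---|---|---|---|---|---|---|
| Average | 2.5305 | 2.4829 | 2.6289 | 2.4754 | 2.7065 | 2.2374 | 1.8454 |
| AWA | 3.6868 | 3.6434 | 3.7627 | 3.6593 | 3.8151 | 3.4626 | 3.2272 |
| Bridges | 1.0093 | 1.0093 | 1.0093 | 1.1667 | 1.1389 | 1 | 1.0093 |
| Cifar | 3.5803 | 3.539 | 3.8094 | 3.4557 | 3.2936 | 2.6907 | 2.6806 |
| DD | 1.8759 | 1.8759 | 2.2284 | 1.553 | 2.3399 | 0.7779 | 0.5682 |
| F194 | 2.2133 | 2.0418 | 2.137 | 2.3074 | 2.9713 | 2.9300 | 1.16 |
| VOC | 2.8175 | 2.7877 | 2.8268 | 2.7101 | 2.68 | 2.5632 | 2.4269 |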

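Table 7  LCA-F1 (↑) of OFS-HCRS and six comparison algorithms on the hierarchical datasets

| Dataset | OSFS | FOSFS | SAOLA | K-OFSD | OFSD | OFS-A3M | OFS-HCRS |
|---|---|---|---|---|---|---|---|
| Average | 0.558 | 0.5652 | 0.5414 | 0.5619 | 0.534 | 0.6104 | 0.6795 |
| AWA | 0.4643 | 0.4712 | 0.4484 | 0.472 | 0.4482 | 0.4948 | 0.5297 |
| Bridges | 0.7852 | 0.7852 | 0.7852 | 0.7444 | 0.7528 | 0.7861 | 0.7852 |
| Cifar | 0.3908 | 0.3967 | 0.3559 | 0.4095 | 0.4352 | 0.5328 | 0.5345 |
| DD | 0.6338 | 0.6339 | 0.5786 | 0.6781 | 0.5743 | 0.8370 | 0.8834 |
| F194 | 0.5553 | 0.5809 | 0.5635 | 0.5325 | 0.4526 | 0.4517 | 0.7605 |
| VOC | 0.5184 | 0.5236 | 0.5165 | 0.535 | 0.5411 | 0.5605 | 0.5841 |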

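Table 8  H-F1 (↑) of OFS-HCRS and six comparison algorithms on the hierarchical datasets

| Dataset | OSFS | FOSFS | SAOLA | K-OFSD | OFSD | OFS-A3M | OFS-HCRS |
|---|---|---|---|---|---|---|---|
| Average | 0.5985 | 0.6065 | 0.5824 | 0.6087 | 0.5718 | 0.6485 | 0.7135 |
| AWA | 0.5391 | 0.5446 | 0.5297 | 0.5426 | 0.5231 | 0.5672 | 0.5966 |
| Bridges | 0.8068 | 0.8068 | 0.8068 | 0.7815 | 0.7852 | 0.8031 | 0.8068 |
| Cifar | 0.4033 | 0.4102 | 0.3651 | 0.4241 | 0.4511 | 0.5515 | 0.5532 |
| DD | 0.6874 | 0.6874 | 0.6286 | 0.7412 | 0.61 | 0.8703 | 0.9053 |
| F194 | 0.6311 | 0.6597 | 0.6438 | 0.6154 | 0.5048 | 0.5117 | 0.8067 |
| VOC | 0.523 | 0.53 | 0.5204 | 0.5471 | 0.5571 | 0.5870 | 0.6125 |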
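The hierarchical tables above report TIE and H-F1. As a rough illustration of how such metrics are commonly computed in the hierarchical classification literature, the sketch below assumes that TIE is the number of tree edges between the true and predicted class and that H-F1 compares the ancestor sets (each class plus its ancestors, excluding the root). The toy hierarchy `PARENT` and the function names are hypothetical, and the exact definitions used in the paper may differ. LCA-F1 is not covered here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy hierarchy (child -> parent); not one of the paper's datasets.
PARENT: Dict[str, str] = {
    "animal": "root", "vehicle": "root",
    "cat": "animal", "dog": "animal",
    "car": "vehicle", "bus": "vehicle",
}


def ancestors_with_self(node: str) -> List[str]:
    """Path from the node up to, but excluding, the root."""
    path = []
    while node != "root":
        path.append(node)
        node = PARENT[node]
    return path


def tree_induced_error(true: str, pred: str) -> int:
    """Number of edges on the tree path between the two classes.

    Each non-root node stands for the edge to its parent, so the path is made
    of the nodes that appear in exactly one of the two root paths.
    """
    return len(set(ancestors_with_self(true)) ^ set(ancestors_with_self(pred)))


def hierarchical_f1(pairs: List[Tuple[str, str]]) -> float:
    """Micro-averaged hierarchical F1 over (true, predicted) pairs."""
    overlap = sum(
        len(set(ancestors_with_self(t)) & set(ancestors_with_self(p)))
        for t, p in pairs
    )
    if overlap == 0:
        return 0.0
    precision = overlap / sum(len(ancestors_with_self(p)) for _, p in pairs)
    recall = overlap / sum(len(ancestors_with_self(t)) for t, _ in pairs)
    return 2 * precision * recall / (precision + recall)


pairs = [("cat", "cat"), ("cat", "dog"), ("cat", "car")]
mean_tie = sum(tree_induced_error(t, p) for t, p in pairs) / len(pairs)
print(mean_tie)                # 2.0
print(hierarchical_f1(pairs))  # 0.5
```

Under these assumptions, predicting a sibling class ("dog" for "cat") costs 2 edges and still earns partial credit in H-F1, while predicting a class in another subtree ("car") costs 4 edges and earns none.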

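Table 9  Average prediction accuracy (↑) of OFS-HCRS and six comparison algorithms on the flat single-label datasets

| Dataset | OSFS | FOSFS | SAOLA | K-OFSD | OFSD | OFS-A3M | OFS-HCRS |
|---|---|---|---|---|---|---|---|
| Average | 0.6134 | 0.6844 | 0.7289 | 0.7594 | 0.7400 | 0.8601 | 0.9041 |
| AUTOVAL_B | 0.7673 | 0.8075 | 0.7742 | 0.4839 | 0.8401 | 0.8764 | 0.8723 |
| COLON | 0.7417 | 0.8708 | 0.8875 | 0.8083 | 0.8542 | 0.8583 | 0.9083 |
| CAR | 0.5794 | 0.6686 | 0.8614 | 0.9142 | 0.6878 | 0.9502 | 0.9374 |
| GENE2 | 0.2225 | 0.1973 | 0.4466 | 0.6788 | 0.5373 | 0.7137 | 0.7125 |
| GLIOMA | 0.4879 | 0.7894 | 0.6462 | 0.4779 | 0.7621 | 0.7003 | 0.8768 |
| LEUKEMIA | 0.9675 | 0.9833 | 0.9708 | 0.9833 | 0.9667 | 0.9298 | 0.9833 |
| SRBCT | 0.8534 | 0.8732 | 0.9542 | 0.8466 | 0.875 | 0.9625 | 1 |
| WARPAR10P | 0.2875 | 0.285 | 0.29 | 0.8825 | 0.3975 | 0.89 | 0.9425 |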

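Table 10  F_F and the corresponding critical value under different hierarchical evaluation metrics

| Hierarchical evaluation metric | F_F | Critical value |
|---|---|---|
| AP | 3.8889 | 1.98 |
| LCA-F1 | 5.1103 | 1.98 |
| H-F1 | 4.4590 | 1.98 |
| TIE | 4.2562 | 1.98 |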
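The F_F value for AP can be checked against the per-dataset accuracies in Table 5. The sketch below assumes the usual Friedman procedure with the Iman-Davenport correction: rank the seven algorithms on each dataset (1 = best, ties share the mean rank), average the ranks, compute the Friedman chi-square statistic, and convert it to F_F. The function names are illustrative and the page itself does not spell out the procedure.

```python
from typing import List

# Average prediction accuracy from Table 5. Columns: OSFS, FOSFS, SAOLA,
# K-OFSD, OFSD, OFS-A3M, OFS-HCRS. The "Average" row is left out.
ACCURACY: List[List[float]] = [
    [0.2080, 0.2195, 0.1781, 0.2234, 0.1867, 0.2512, 0.3038],  # AWA
    [0.6300, 0.6300, 0.6300, 0.5533, 0.5644, 0.6256, 0.6300],  # Bridges
    [0.0674, 0.0747, 0.0201, 0.0925, 0.1289, 0.2710, 0.2735],  # Cifar
    [0.3704, 0.3707, 0.2929, 0.4226, 0.3079, 0.7054, 0.7925],  # DD
    [0.2197, 0.2537, 0.2252, 0.1748, 0.1010, 0.0879, 0.5720],  # F194
    [0.2594, 0.2671, 0.2568, 0.2807, 0.2898, 0.3165, 0.3521],  # VOC
]


def rank_row(row: List[float], higher_is_better: bool = True) -> List[float]:
    """Rank one dataset's scores (1 = best); tied scores share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=higher_is_better)
    ranks = [0.0] * len(row)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and row[order[end + 1]] == row[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_ff(scores: List[List[float]]) -> float:
    """Friedman statistic in its F-distributed (Iman-Davenport) form."""
    n, k = len(scores), len(scores[0])
    rank_rows = [rank_row(row) for row in scores]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return (n - 1) * chi2 / (n * (k - 1) - chi2)


print(round(friedman_ff(ACCURACY), 4))  # 3.8889
```

With k = 7 algorithms and N = 6 datasets, the result matches the AP entry of Table 10. For metrics where lower is better, such as TIE, `rank_row` would be called with `higher_is_better=False`.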

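Table 11  F_F and the corresponding critical value under the flat single-label evaluation metric

| Flat single-label evaluation metric | F_F | Critical value |
|---|---|---|
| AP | 6.26 | 1.87 |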

Fig. 11  CD diagrams of the Bonferroni-Dunn test under different evaluation metrics
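A CD diagram of this kind is usually built from the critical difference CD = q_α · sqrt(k(k+1)/(6N)), where k is the number of algorithms and N the number of datasets. The sketch below assumes a significance level of α = 0.10 and the commonly tabulated Bonferroni-Dunn value q_α = 2.394 for k = 7. Both are assumptions, since the page does not state the significance level used.

```python
import math


def critical_difference(q_alpha: float, k: int, n: int) -> float:
    """Bonferroni-Dunn critical difference between average ranks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Assumed q_alpha for k = 7 at alpha = 0.10 (taken from standard tables).
Q_ALPHA = 2.394

cd_hierarchical = critical_difference(Q_ALPHA, k=7, n=6)  # six hierarchical datasets
cd_flat = critical_difference(Q_ALPHA, k=7, n=8)          # eight flat datasets
print(round(cd_hierarchical, 4))  # 2.9858
print(round(cd_flat, 4))          # 2.5858
```

Two algorithms are then considered significantly different when their average ranks differ by more than the critical difference.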
