Journal of Nanjing University (Natural Science) ›› 2020, Vol. 56 ›› Issue (1): 30–40. doi: 10.13232/j.cnki.jnju.2020.01.004


Streaming multi-label feature selection based on neighborhood interaction gain information

Chaoyi Chen1, Yaojin Lin1,2, Li Tang1,2, Chenxi Wang1,2

  1. School of Computer Science, Minnan Normal University, Zhangzhou, 363000, China
  2. Key Laboratory of Data Science and Intelligence Application, Department of Education of Fujian Province, Zhangzhou, 363000, China
  • Received: 2019-08-14 Online: 2020-01-30 Published: 2020-01-10
  • Contact: Yaojin Lin E-mail: zzlinyaojin@163.com
  • Funding:
    National Natural Science Foundation of China (61672272); Natural Science Foundation of Fujian Province (2018J01548); Science and Technology Project of the Education Department of Fujian Province (JT180318)

Abstract:

Existing multi-label feature selection methods generally assume that the feature space is fixed and known in advance. In practice, however, many features must be screened in real time as they are extracted. To address this, a streaming multi-label feature selection algorithm based on neighborhood interaction gain information is proposed. First, two strategies for evaluating features, online correlation analysis and online redundancy analysis, are proposed based on multi-label neighborhood mutual information and neighborhood interaction gain information. Second, an objective function for streaming multi-label feature selection is constructed based on neighborhood interaction gain information. Finally, experimental results on six multi-label datasets under four evaluation metrics demonstrate the effectiveness and stability of the algorithm.

Key words: online streaming features, multi-label learning, neighborhood entropy, neighborhood interaction gain information

CLC number:

  • TP391

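Table 1. Description of the multi-label datasets

Dataset        Samples  Features  Labels  Training samples  Test samples
Arts (A)       5000     462       26      2000              3000
Birds (B)      645      260       19      322               323
Business (C)   5000     438       30      2000              3000
Education (D)  5000     550       33      2000              3000
Emotions (E)   593      72        6       391               202
Yeast (F)      2417     103       14      1499              918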

Table 2. Comparison of SMFS and other algorithms in terms of AP (↑)

Dataset  MLNB    MDDM-spc  MDDM-proj  PMU     RF-ML   SMFS
Average  0.6520  0.6316    0.6311     0.6397  0.6499  0.6707
A        0.4991  0.4735    0.4669     0.4917  0.4834  0.5319
B        0.5052  0.4818    0.4564     0.5082  0.5263  0.5199
C        0.8713  0.8639    0.8633     0.8698  0.8742  0.8762
D        0.5478  0.4441    0.4824     0.4798  0.5114  0.5557
E        0.7529  0.7772    0.7683     0.7399  0.7566  0.7871
F        0.7355  0.7488    0.7490     0.7488  0.7476  0.7533

Table 3. Comparison of SMFS and other algorithms in terms of RL (↓)

Dataset  MLNB    MDDM-spc  MDDM-proj  PMU     RF-ML   SMFS
Average  0.1508  0.1548    0.1583     0.1540  0.1489  0.1405
A        0.1542  0.1631    0.1662     0.1584  0.1576  0.1418
B        0.2237  0.2441    0.2608     0.2042  0.2093  0.2160
C        0.0419  0.0456    0.0465     0.0439  0.0427  0.0410
D        0.0922  0.1138    0.1089     0.1099  0.1016  0.0923
E        0.2055  0.1825    0.1904     0.2301  0.2040  0.1749
F        0.1871  0.1797    0.1768     0.1774  0.1781  0.1768

Table 4. Comparison of SMFS and other algorithms in terms of HL (↓)

Dataset  MLNB    MDDM-spc  MDDM-proj  PMU     RF-ML   SMFS
Average  0.1054  0.1006    0.1055     0.1053  0.1028  0.0983
A        0.0612  0.0621    0.0622     0.0607  0.0614  0.0593
B        0.0494  0.0521    0.0554     0.0486  0.0486  0.0500
C        0.0283  0.0286    0.0286     0.0280  0.0278  0.0274
D        0.0405  0.0446    0.0443     0.0442  0.0427  0.0409
E        0.2450  0.2153    0.2417     0.2475  0.2318  0.2137
F        0.2080  0.2010    0.2010     0.2025  0.2047  0.1983

Table 5. Comparison of SMFS and other algorithms in terms of Mi-F1 (↑)

Dataset  MLNB    MDDM-spc  MDDM-proj  PMU     RF-ML   SMFS
Average  0.3911  0.3550    0.3363     0.3720  0.3640  0.4364
A        0.1093  0.0565    0.0471     0.1279  0.0925  0.2345
B        0.1653  0.1489    0.0761     0.2359  0.1235  0.1769
C        0.6792  0.6679    0.6704     0.6798  0.6813  0.6927
D        0.2070  0.0041    0.0041     0.0023  0.0856  0.2314
E        0.5811  0.6319    0.5867     0.5690  0.5849  0.6597
F        0.6045  0.6208    0.6334     0.6171  0.6163  0.6232

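Figure 1. Example of a multi-label image

Figure 2. Variation of AP (↑) for SMFS and the four compared algorithms on the six datasets

Figure 3. Variation of RL (↓) for SMFS and the four compared algorithms on the six datasets

Figure 4. Variation of HL (↓) for SMFS and the four compared algorithms on the six datasets

Figure 5. Variation of Mi-F1 (↑) for SMFS and the four compared algorithms on the six datasets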

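Table 6. Friedman statistic F_F under different evaluation metrics (k=6, N=6)

Evaluation metric  F_F     Critical value (α=0.1, shared by all metrics)
AP                 3.9821  2.0922
RL                 2.2914
HL                 2.5911
Mi-F1              2.4687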
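
Table 6 lists only the resulting F_F values. As a minimal sketch of how such a value is commonly obtained, the Python below ranks the algorithms on each dataset, averages the ranks, and evaluates the standard Friedman statistic in its F-distributed form, which has k-1 and (k-1)(N-1) degrees of freedom. The function names and the toy score matrix are illustrative assumptions rather than material from the paper, and tied scores are given averaged ranks, which may differ from the tie handling the authors used.

```python
def average_ranks(scores, higher_is_better=True):
    """Mean rank of each algorithm over all datasets (rank 1 = best).

    scores[i][j] is the score of algorithm j on dataset i.
    Tied scores on a dataset share the average of their rank positions.
    """
    n_datasets = len(scores)
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=higher_is_better)
        i = 0
        while i < k:
            j = i
            # Extend j over the run of algorithms tied with position i.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += shared
            i = j + 1
    return [t / n_datasets for t in totals]


def friedman_ff(mean_ranks, n_datasets):
    """Friedman statistic in F form, computed from mean ranks."""
    k = len(mean_ranks)
    chi2 = 12 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


# Toy example (made-up scores): 4 datasets, 3 algorithms, higher is better.
toy_scores = [
    [0.9, 0.8, 0.7],
    [0.9, 0.7, 0.8],
    [0.8, 0.9, 0.7],
    [0.9, 0.8, 0.7],
]
ranks = average_ranks(toy_scores)
print(ranks, friedman_ff(ranks, n_datasets=len(toy_scores)))
```

The null hypothesis that all algorithms perform equally is rejected when F_F exceeds the critical value of the F distribution at the chosen significance level. For metrics where lower is better (RL, HL), pass `higher_is_better=False`.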

Figure 6. Performance differences between SMFS and the other algorithms under the Bonferroni-Dunn test
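
A Bonferroni-Dunn comparison of this kind rests on a critical difference (CD): an algorithm is regarded as significantly different from the control when their average ranks differ by at least CD. The sketch below assumes the usual formulation CD = q_α · sqrt(k(k+1)/(6N)), with q_α taken as the standard normal quantile at 1 - α/(2(k-1)). It also assumes α = 0.1, the level used in Table 6, since the level used for the figure is not stated here. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_dunn_cd(k, n_datasets, alpha=0.1):
    """Critical difference in average rank for k algorithms on n_datasets."""
    # Two-sided normal quantile with Bonferroni correction for the
    # k - 1 comparisons against the control algorithm.
    q_alpha = NormalDist().inv_cdf(1 - alpha / (2 * (k - 1)))
    return q_alpha * sqrt(k * (k + 1) / (6 * n_datasets))


# k = 6 algorithms and N = 6 datasets, as in Table 6.
cd = bonferroni_dunn_cd(k=6, n_datasets=6, alpha=0.1)
print(f"CD = {cd:.4f}")
```

Under these assumptions CD is about 2.51, so an algorithm would need an average rank at least that far from the control's to count as significantly different.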

Figure 7. Spider-web diagrams showing the stability of SMFS under different evaluation metrics on the six multi-label datasets

Table 7. Comparison of SMFS and OM-NRS in terms of AP, RL and HL

         AP              RL              HL
Dataset  SMFS    OM-NRS  SMFS    OM-NRS  SMFS    OM-NRS
Average  0.6707  0.6562  0.1405  0.1408  0.0983  0.0989
A        0.5319  0.5217  0.1419  0.1440  0.0593  0.0606
B        0.5199  0.4842  0.2160  0.2190  0.0500  0.0518
C        0.8762  0.8760  0.0410  0.0411  0.0274  0.0274
D        0.5557  0.5398  0.0923  0.0912  0.0409  0.0408
E        0.7871  0.7608  0.1749  0.1765  0.2137  0.2104
F        0.7533  0.7545  0.1768  0.1732  0.1983  0.2021
