Journal of Nanjing University (Natural Science), 2020, Vol. 56, Issue (1): 19-29. doi: 10.13232/j.cnki.jnju.2020.01.003


k-Kernel feature selection of tendentious labels based on rough mutual information

Yusheng Cheng1,2, Fei Chen1, Shufang Pang1

  1. School of Computer and Information, Anqing Normal University, Anqing, 246011, China
  2. The University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing, 246011, China
  • Received: 2019-07-08  Online: 2020-01-30  Published: 2020-01-10
  • Contact: Yusheng Cheng  E-mail: chengyshaq@163.com
  • Supported by: Key Research Project of Anhui Provincial Universities (KJ2017A352), the University Key Laboratory Fund of Anhui Province, and the Research and Innovation Team Project of Anqing Normal University



Abstract:

In multi-label learning, the granularity of the feature description gives rise to tendentious labels. Most existing approaches measure the relevance between each feature and the label set as a whole, solve for a number of important features, and construct the feature subspace from them. A feature can, however, be strongly relevant to one particular label while being only weakly correlated with the whole label space, and discarding such features inevitably degrades classifier accuracy. If the whole label space is replaced by a partial label space, or even a single label, when computing feature relevance, and strongly correlated labels are separated out for their own feature selection, the time overhead of the algorithm is reduced and its efficiency improves. Meanwhile, most mutual-information-based multi-label feature selection methods rely on the traditional Shannon entropy, which lacks the complement property and is therefore comparatively expensive to compute. This paper instead introduces a rough entropy measure and proposes a k-Kernel Feature Selection algorithm based on Rough Mutual Information (kKFSRMI) for tendentious labels. Experimental results and statistical hypothesis tests show that kKFSRMI is effective.
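
For orientation, the complement entropy of Liang et al. [16], which (unlike Shannon entropy) satisfies the complement property, can be written as follows, together with a mutual-information analogue built from it. This is a plausible reading of the measures the abstract refers to, not the paper's verbatim definitions. Let U be the universe and X_1, ..., X_m the equivalence classes induced by an attribute set A:

    E(A) = \sum_{i=1}^{m} \frac{|X_i|}{|U|}\left(1 - \frac{|X_i|}{|U|}\right),
    \qquad
    RMI(A;B) = E(A) + E(B) - E(A \cup B),

where E(A \cup B) is evaluated on the partition induced by A and B jointly.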

Key words: multi-label learning, correlation matrix, feature selection, rough mutual information

CLC number: TP18

Fig.1 Flowchart of the kKFSRMI algorithm
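
The flowchart image is not reproduced here. As a rough illustration of the per-label selection loop it describes, the minimal Python sketch below scores every feature against each label separately and takes the union of the per-label top-k sets as the selected subspace. Plain Shannon mutual information on discretized data stands in for the paper's rough mutual information, and all names (mutual_information, k_kernel_select) are illustrative, not the authors' code.

    # Minimal sketch (not the authors' kKFSRMI implementation) of per-label
    # k-kernel feature selection: rank features for EACH label on its own,
    # then pool the per-label top-k sets into one feature subspace.
    import numpy as np

    def mutual_information(x, y):
        """Discrete mutual information I(x; y) in bits."""
        mi = 0.0
        for vx in np.unique(x):
            for vy in np.unique(y):
                pxy = np.mean((x == vx) & (y == vy))
                if pxy > 0:
                    px = np.mean(x == vx)
                    py = np.mean(y == vy)
                    mi += pxy * np.log2(pxy / (px * py))
        return mi

    def k_kernel_select(X, Y, k):
        """Union of the top-k scoring features for each individual label."""
        kernel = set()
        for j in range(Y.shape[1]):  # one label at a time, not the whole label space
            scores = np.array([mutual_information(X[:, f], Y[:, j])
                               for f in range(X.shape[1])])
            kernel.update(np.argsort(scores)[::-1][:k].tolist())
        return sorted(kernel)

    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(200, 30))  # discretized features
    Y = rng.integers(0, 2, size=(200, 4))   # four binary labels
    print(k_kernel_select(X, Y, k=10))

Because the union is taken over per-label kernels, a feature strongly tied to a single label survives even if its relevance to the label space as a whole is low, which is exactly the failure mode the abstract attributes to whole-label-space selection.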

Table 1 Description of the data sets

Dataset     Training samples  Test samples  Features  Labels
Health      2000              3000          612       32
Computers   2000              3000          681       33
Education   2000              3000          550       33
Enron       1123              579           1001      53
Reference   2000              3000          793       33
Society     2000              3000          636       27
Science     2000              3000          743       40
Social      2000              3000          1047      39

Table 2 Number of features selected by different feature selection algorithms

Dataset     kKFSRMI(k=10)  kKFSRMI(k=15)  kKFSRMI(k=20)  MFSLS-532  MFSLS-631
Health      68             99             127            205        206
Computers   56             81             106            229        229
Education   93             134            164            185        186
Enron       144            188            226            335        336
Reference   84             122            173            266        266
Society     41             58             75             213        214
Science     69             109            134            249        249
Social      91             131            170            350        350

Table 3 Experimental results on Average Precision

Dataset     kKFSRMI(k=10)  kKFSRMI(k=15)  kKFSRMI(k=20)  MFSLS-532  MFSLS-631  PMU     MDDMproj  MDDMspc
Health      0.6775         0.6809         0.6813         0.7188     0.7237     0.6439  0.6735    0.6744
Computers   0.6252         0.6334         0.6344         0.6366     0.6405     0.6171  0.6198    0.6236
Education   0.5539         0.5502         0.5478         0.5235     0.5405     0.4903  0.5430    0.5430
Enron       0.6509         0.6448         0.6499         0.6104     0.6084     0.6338  0.6376    0.6247
Reference   0.5941         0.6156         0.6354         0.6191     0.6304     0.6050  0.5933    0.6132
Society     0.5864         0.5878         0.5872         0.5716     0.5692     0.5575  0.5547    0.5523
Science     0.4437         0.4642         0.4668         0.4542     0.4602     0.4380  0.4198    0.4299
Social      0.7046         0.6992         0.7024         0.6828     0.6926     0.6565  0.6605    0.6866

Table 4 Experimental results on Hamming Loss

Dataset     kKFSRMI(k=10)  kKFSRMI(k=15)  kKFSRMI(k=20)  MFSLS-532  MFSLS-631  PMU     MDDMproj  MDDMspc
Health      0.0442         0.0432         0.0426         0.0410     0.0410     0.0485  0.0443    0.0447
Computers   0.0408         0.0402         0.0394         0.0402     0.0401     0.0411  0.0411    0.0408
Education   0.0419         0.0418         0.0420         0.0438     0.0421     0.0442  0.0429    0.0429
Enron       0.0499         0.0514         0.0507         0.0547     0.0559     0.0511  0.0525    0.0527
Reference   0.0332         0.0309         0.0301         0.0318     0.0315     0.0317  0.0338    0.0324
Society     0.0578         0.0570         0.0572         0.0573     0.0577     0.0575  0.0598    0.0592
Science     0.0353         0.0347         0.0345         0.0342     0.0343     0.0348  0.0355    0.0355
Social      0.0264         0.0267         0.0263         0.0267     0.0257     0.0275  0.0286    0.0265

Table 5 Experimental results on One Error

Dataset     kKFSRMI(k=10)  kKFSRMI(k=15)  kKFSRMI(k=20)  MFSLS-532  MFSLS-631  PMU     MDDMproj  MDDMspc
Health      0.4210         0.4213         0.4190         0.3523     0.3477     0.4607  0.4140    0.4200
Computers   0.4557         0.4443         0.4470         0.4353     0.4323     0.4580  0.4613    0.4517
Education   0.6057         0.5847         0.5927         0.6210     0.5997     0.6707  0.6003    0.6003
Enron       0.2642         0.2608         0.2660         0.3230     0.3212     0.2884  0.2798    0.3126
Reference   0.5137         0.4877         0.4640         0.4740     0.4617     0.4907  0.5093    0.4873
Society     0.4620         0.4637         0.4657         0.4727     0.4807     0.4863  0.4930    0.4973
Science     0.6983         0.6660         0.6633         0.6843     0.6690     0.7017  0.7200    0.7090
Social      0.3930         0.3993         0.3963         0.4277     0.4083     0.4590  0.4510    0.4140

Table 6 Experimental results on Ranking Loss

Dataset     kKFSRMI(k=10)  kKFSRMI(k=15)  kKFSRMI(k=20)  MFSLS-532  MFSLS-631  PMU     MDDMproj  MDDMspc
Health      0.0625         0.0611         0.0610         0.0568     0.0563     0.0737  0.0654    0.0642
Computers   0.0912         0.0898         0.0898         0.0919     0.0896     0.0992  0.0936    0.0935
Education   0.0951         0.0920         0.0914         0.0975     0.0938     0.1035  0.0932    0.0932
Enron       0.0907         0.0935         0.0924         0.1048     0.1046     0.0951  0.0964    0.0975
Reference   0.0910         0.0851         0.0812         0.0908     0.0883     0.0937  0.0945    0.0883
Society     0.1445         0.1428         0.1424         0.1551     0.1537     0.1592  0.1583    0.1608
Science     0.1414         0.1363         0.1347         0.1366     0.1341     0.1460  0.1489    0.1503
Social      0.0645         0.0661         0.0646         0.0687     0.0665     0.0764  0.0753    0.0707

Table 7 Running time (unit: s)

Dataset     kKFSRMI(k=10)  kKFSRMI(k=15)  kKFSRMI(k=20)  MFSLS-532  MFSLS-631
Health      8.199          8.202          8.210          59.909     62.197
Computers   9.527          10.060         9.445          72.517     75.374
Education   7.626          7.741          7.605          50.705     50.339
Enron       14.595         15.741         14.548         114.534    112.715
Reference   10.891         10.923         11.108         101.481    100.309
Society     7.312          7.716          7.276          68.505     65.982
Science     12.061         12.295         11.778         89.783     86.636
Social      14.699         16.253         16.333         169.584    167.433

Table 8 Average ranks of different algorithms

Evaluation metric   kKFSRMI  MFSLS  PMU    MDDMproj  MDDMspc
Average Precision   1.25     2.375  4.25   3.625     3.375
Hamming Loss        1.25     2.625  3.5    3.875     3.375
One Error           1.375    2.625  4      3.625     3.375
Ranking Loss        1.125    2.625  4.125  3.625     3.375

Fig.2 Experimental results of different algorithms on the Enron data set

Fig.3 Nemenyi test of different algorithms
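
The critical difference in Fig.3 presumably follows Demšar [21]. A worked instance under this section's setup (k = 5 algorithms, N = 8 data sets, and q_{0.05} = 2.728 for five classifiers from Demšar's table):

    CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}
       = 2.728\sqrt{\frac{5 \times 6}{6 \times 8}} \approx 2.16

Any two algorithms whose average ranks in Table 8 differ by more than about 2.16 differ significantly at the 0.05 level; the gap between kKFSRMI's average rank and PMU's exceeds this bound on every metric.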

1 Zhang M L, Zhou Z H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge & Data Engineering, 2014, 26(8): 1819-1837.
2 Wang Y B, Pei G S, Cheng Y S. Multi-label learning algorithm of an elastic net kernel extreme learning machine. CAAI Transactions on Intelligent Systems, 2019, 14(4): 831-842.
3 Li F, Miao D Q, Zhang Z F, et al. Label-granulated ensemble method for multi-label learning. Journal of Chinese Computer Systems, 2018, 39(6): 1121-1125.
4 Yu Y, Wang L W, Wu X M, et al. A multi-label classification algorithm based on an improved convolutional neural network. CAAI Transactions on Intelligent Systems, 2019, 14(3): 566-574.
5 Geng X. Label distribution learning. IEEE Transactions on Knowledge & Data Engineering, 2016, 28(7): 1734-1748.
6 Liu H F, Yao Z Q, Su Z. Optimization mutual information text feature selection method based on word frequency. Computer Engineering, 2014, 40(7): 179-182.
7 Zhang H Y, Xie Y M, Yuan Z X, et al. A method of CHI-square feature selection based on probability. Computer Engineering, 2016, 42(8): 194-198, 205.
8 Sun G L, Song Z C, Liu J L, et al. Feature selection method based on maximum information coefficient and approximate Markov blanket. Acta Automatica Sinica, 2017, 43(5): 795-805.
9 Lai X F, He X S. Method based on minimum redundancy and maximum separability for feature selection. Computer Engineering and Applications, 2017, 53(12): 70-75.
10 Zhang Z H, Li S N, Li Z G, et al. Multi-label feature selection algorithm based on information entropy. Journal of Computer Research and Development, 2013, 50(6): 1177-1184.
11 Liu J H, Lin M L, Wang C X, et al. Multi-label feature selection algorithm based on local subspace. Pattern Recognition and Artificial Intelligence, 2016, 29(3): 240-251.
12 Lee J, Kim D W. Feature selection for multi-label classification using multivariate mutual information. Pattern Recognition Letters, 2013, 34(3): 349-357.
13 Wang C X, Lin Y J, Tang L, et al. Multi-label feature selection based on information granulation. Pattern Recognition and Artificial Intelligence, 2018, 31(2): 123-131.
14 Qian W B, Huang Q, Wang Y L, et al. Feature selection algorithm in multi-label incomplete data. Journal of Frontiers of Computer Science and Technology, 2019. doi: 10.3778/j.issn.1673-9418.1807067.
15 Li Z X, Zhuo Y Q, Zhang C L, et al. Survey on multi-label learning. Application Research of Computers, 2014, 31(6): 1601-1605.
16 Liang J Y, Chin K S, Dang C Y, et al. A new method for measuring uncertainty and fuzziness in rough set theory. International Journal of General Systems, 2002, 31(4): 331-342.
17 Cheng Y S, Zhang Y S, Hu X G. Entropy of knowledge and rough set based on boundary region. Journal of System Simulation, 2007, 19(9): 2008-2011.
18 Mi J S, Leung Y, Wu W Z. An uncertainty measure in partition-based fuzzy rough sets. International Journal of General Systems, 2005, 34(1): 77-90.
19 Zhang M L, Zhou Z H. ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognition, 2007, 40(7): 2038-2048.
20 Zhang Y, Zhou Z H. Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data, 2010, 4(3): Article No. 14.
21 Demšar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 2006, 7: 1-30.