k-Kernel feature selection of tendentious labels based on rough mutual information

Journal of Nanjing University (Natural Sciences), 2020, Vol. 56, Issue (1): 19–29. doi: 10.13232/j.cnki.jnju.2020.01.003

Yusheng Cheng¹,², Fei Chen¹, Shufang Pang¹

1. School of Computer and Information, Anqing Normal University, Anqing, 246011, China
2. The University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing, 246011, China

Received: 2019-07-08 Online: 2020-01-30 Published: 2020-01-10
Contact: Yusheng Cheng E-mail: chengyshaq@163.com
Supported by:
Key Scientific Research Project of Anhui Provincial Universities (KJ2017A352); Fund of the University Key Laboratory of Anhui Province; Scientific Research Innovation Team Project of Anqing Normal University

Abstract

Abstract:

Multi-label learning algorithms suffer from a label-tendency problem caused by the granularity at which features are described. Most existing methods measure the relevance between each feature and the entire label set, keep the features that score highest, and build the corresponding feature subspace from them. As a result, a feature that is strongly related to one particular label but only weakly related to the label space as a whole is discarded, and losing such features tends to lower classifier accuracy. If relevance is instead computed against a part of the label space, or even a single label, and strongly correlated labels are separated out for feature selection, the time overhead of the algorithm falls and its efficiency improves. In addition, most mutual-information-based multi-label learning algorithms select features using traditional entropy, which lacks the complement property and is relatively complicated to compute. This paper introduces a rough entropy measure and proposes the k-Kernel Feature Selection based on Rough Mutual Information (kKFSRMI) algorithm for tendentious labels. Experiments and statistical hypothesis tests indicate that the algorithm is effective.

Key words: multi-label learning, correlation matrix, feature selection, rough mutual information
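
The per-label selection idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: features and labels are treated as discrete columns, the entropy is a complement-style measure E = Σ pᵢ(1 − pᵢ) over the blocks of a partition, the relevance score is E(X) + E(Y) − E(X, Y), and "k feature kernel" is read as "the top k features for each label, merged across labels". The function names are hypothetical.

```python
from collections import Counter


def complement_entropy(*columns):
    """Complement-style entropy of the partition induced by the given columns.

    Each distinct tuple of values forms one block; E = sum_i p_i * (1 - p_i).
    (Assumed form of the entropy, chosen for illustration.)
    """
    n = len(columns[0])
    blocks = Counter(zip(*columns))
    return sum((c / n) * (1 - c / n) for c in blocks.values())


def relevance(x, y):
    """Mutual-information-style score E(X) + E(Y) - E(X, Y) (assumed form)."""
    return complement_entropy(x) + complement_entropy(y) - complement_entropy(x, y)


def per_label_top_k(features, labels, k):
    """Score every feature against each label separately, keep the k best
    features per label, and return the sorted union of their indices."""
    selected = set()
    for y in labels:
        scores = [relevance(x, y) for x in features]
        ranked = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)


# Toy data: four samples, three discrete features, two labels.
toy_features = [
    [0, 0, 1, 1],  # feature 0 mirrors label 0
    [0, 1, 0, 1],  # feature 1 mirrors label 1
    [0, 0, 0, 0],  # feature 2 is constant
]
toy_labels = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]
toy_selection = per_label_top_k(toy_features, toy_labels, k=1)
print(toy_selection)
```

On the toy data each label picks the feature that mirrors it, so both informative features survive even though neither is strongly related to the other label. A single ranking against the whole label space could miss exactly that kind of feature.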

CLC number: TP18

Yusheng Cheng, Fei Chen, Shufang Pang. k-Kernel feature selection of tendentious labels based on rough mutual information[J]. Journal of Nanjing University (Natural Sciences), 2020, 56(1): 19–29.

Dataset	Training samples	Test samples	Features	Labels
Health	2000	3000	612	32
Computers	2000	3000	681	33
Education	2000	3000	550	33
Enron	1123	579	1001	53
Reference	2000	3000	793	33
Society	2000	3000	636	27
Science	2000	3000	743	40
Social	2000	3000	1047	39

Dataset	kKFSRMI (k=10)	kKFSRMI (k=15)	kKFSRMI (k=20)	MFSLS-532	MFSLS-631
Health	68	99	127	205	206
Computers	56	81	106	229	229
Education	93	134	164	185	186
Enron	144	188	226	335	336
Reference	84	122	173	266	266
Society	41	58	75	213	214
Science	69	109	134	249	249
Social	91	131	170	350	350

Dataset	kKFSRMI (k=10)	kKFSRMI (k=15)	kKFSRMI (k=20)	MFSLS-532	MFSLS-631	PMU	MDDMproj	MDDMspc
Health	0.6775	0.6809	0.6813	0.7188	0.7237	0.6439	0.6735	0.6744
Computers	0.6252	0.6334	0.6344	0.6366	0.6405	0.6171	0.6198	0.6236
Education	0.5539	0.5502	0.5478	0.5235	0.5405	0.4903	0.5430	0.5430
Enron	0.6509	0.6448	0.6499	0.6104	0.6084	0.6338	0.6376	0.6247
Reference	0.5941	0.6156	0.6354	0.6191	0.6304	0.6050	0.5933	0.6132
Society	0.5864	0.5878	0.5872	0.5716	0.5692	0.5575	0.5547	0.5523
Science	0.4437	0.4642	0.4668	0.4542	0.4602	0.4380	0.4198	0.4299
Social	0.7046	0.6992	0.7024	0.6828	0.6926	0.6565	0.6605	0.6866

Dataset	kKFSRMI (k=10)	kKFSRMI (k=15)	kKFSRMI (k=20)	MFSLS-532	MFSLS-631	PMU	MDDMproj	MDDMspc
Health	0.0442	0.0432	0.0426	0.0410	0.0410	0.0485	0.0443	0.0447
Computers	0.0408	0.0402	0.0394	0.0402	0.0401	0.0411	0.0411	0.0408
Education	0.0419	0.0418	0.0420	0.0438	0.0421	0.0442	0.0429	0.0429
Enron	0.0499	0.0514	0.0507	0.0547	0.0559	0.0511	0.0525	0.0527
Reference	0.0332	0.0309	0.0301	0.0318	0.0315	0.0317	0.0338	0.0324
Society	0.0578	0.057	0.0572	0.0573	0.0577	0.0575	0.0598	0.0592
Science	0.0353	0.0347	0.0345	0.0342	0.0343	0.0348	0.0355	0.0355
Social	0.0264	0.0267	0.0263	0.0267	0.0257	0.0275	0.0286	0.0265

Dataset	kKFSRMI (k=10)	kKFSRMI (k=15)	kKFSRMI (k=20)	MFSLS-532	MFSLS-631	PMU	MDDMproj	MDDMspc
Health	0.4210	0.4213	0.4190	0.3523	0.3477	0.4607	0.4140	0.4200
Computers	0.4557	0.4443	0.4470	0.4353	0.4323	0.4580	0.4613	0.4517
Education	0.6057	0.5847	0.5927	0.6210	0.5997	0.6707	0.6003	0.6003
Enron	0.2642	0.2608	0.2660	0.3230	0.3212	0.2884	0.2798	0.3126
Reference	0.5137	0.4877	0.4640	0.4740	0.4617	0.4907	0.5093	0.4873
Society	0.4620	0.4637	0.4657	0.4727	0.4807	0.4863	0.4930	0.4973
Science	0.6983	0.6660	0.6633	0.6843	0.6690	0.7017	0.7200	0.7090
Social	0.3930	0.3993	0.3963	0.4277	0.4083	0.4590	0.4510	0.4140

Evaluation metric	kKFSRMI	MFSLS	PMU	MDDMproj	MDDMspc
Average Precision	1.25	2.375	4.25	3.625	3.375
Hamming Loss	1.25	2.625	3.5	3.875	3.375
One Error	1.375	2.625	4	3.625	3.375
Ranking Loss	1.125	2.625	4.125	3.625	3.375
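
The average-rank figures in this table can be partly checked against the per-dataset scores. The sketch below rests on several assumptions that this page does not confirm: the eight-column table whose Health row begins 0.6775 reports Average Precision (higher is better), kKFSRMI is represented by its best value over k = 10, 15, 20, MFSLS by the better of its two configurations, and tied scores share the mean of their ranks. The function name `average_ranks` is hypothetical.

```python
def average_ranks(rows, higher_is_better=True):
    """Mean rank of each method over all datasets; tied scores share the mean rank."""
    n_methods = len(rows[0])
    totals = [0.0] * n_methods
    for row in rows:
        order = sorted(range(n_methods), key=lambda j: row[j], reverse=higher_is_better)
        i = 0
        while i < n_methods:
            # Extend j over the run of methods tied with position i.
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # ranks are 1-based
            for t in range(i, j + 1):
                totals[order[t]] += mean_rank
            i = j + 1
    return [total / len(rows) for total in totals]


# Scores copied from the eight-column table assumed to be Average Precision.
# Column order: k=10, k=15, k=20, MFSLS-532, MFSLS-631, PMU, MDDMproj, MDDMspc.
raw_scores = [
    [0.6775, 0.6809, 0.6813, 0.7188, 0.7237, 0.6439, 0.6735, 0.6744],  # Health
    [0.6252, 0.6334, 0.6344, 0.6366, 0.6405, 0.6171, 0.6198, 0.6236],  # Computers
    [0.5539, 0.5502, 0.5478, 0.5235, 0.5405, 0.4903, 0.5430, 0.5430],  # Education
    [0.6509, 0.6448, 0.6499, 0.6104, 0.6084, 0.6338, 0.6376, 0.6247],  # Enron
    [0.5941, 0.6156, 0.6354, 0.6191, 0.6304, 0.6050, 0.5933, 0.6132],  # Reference
    [0.5864, 0.5878, 0.5872, 0.5716, 0.5692, 0.5575, 0.5547, 0.5523],  # Society
    [0.4437, 0.4642, 0.4668, 0.4542, 0.4602, 0.4380, 0.4198, 0.4299],  # Science
    [0.7046, 0.6992, 0.7024, 0.6828, 0.6926, 0.6565, 0.6605, 0.6866],  # Social
]

# Assumption: collapse to five methods using the best kKFSRMI and best MFSLS value.
five_methods = [[max(r[0:3]), max(r[3:5]), r[5], r[6], r[7]] for r in raw_scores]

avg_rank = average_ranks(five_methods)
for name, rank in zip(["kKFSRMI", "MFSLS", "PMU", "MDDMproj", "MDDMspc"], avg_rank):
    print(f"{name}: {rank}")
```

Under these assumptions the kKFSRMI and MFSLS averages come out as 1.25 and 2.375, matching the Average Precision row. The remaining three columns come out close to, but not equal to, the listed values, so the authors' ranking protocol may differ from the one assumed here.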

Dataset	kKFSRMI (k=10)	kKFSRMI (k=15)	kKFSRMI (k=20)	MFSLS-532	MFSLS-631	PMU	MDDMproj	MDDMspc
Health	0.0625	0.0611	0.0610	0.0568	0.0563	0.0737	0.0654	0.0642
Computers	0.0912	0.0898	0.0898	0.0919	0.0896	0.0992	0.0936	0.0935
Education	0.0951	0.0920	0.0914	0.0975	0.0938	0.1035	0.0932	0.0932
Enron	0.0907	0.0935	0.0924	0.1048	0.1046	0.0951	0.0964	0.0975
Reference	0.0910	0.0851	0.0812	0.0908	0.0883	0.0937	0.0945	0.0883
Society	0.1445	0.1428	0.1424	0.1551	0.1537	0.1592	0.1583	0.1608
Science	0.1414	0.1363	0.1347	0.1366	0.1341	0.1460	0.1489	0.1503
Social	0.0645	0.0661	0.0646	0.0687	0.0665	0.0764	0.0753	0.0707

Dataset	kKFSRMI (k=10)	kKFSRMI (k=15)	kKFSRMI (k=20)	MFSLS-532	MFSLS-631
Health	8.199	8.202	8.21	59.909	62.197
Computers	9.527	10.06	9.445	72.517	75.374
Education	7.626	7.741	7.605	50.705	50.339
Enron	14.595	15.741	14.548	114.534	112.715
Reference	10.891	10.923	11.108	101.481	100.309
Society	7.312	7.716	7.276	68.505	65.982
Science	12.061	12.295	11.778	89.783	86.636
Social	14.699	16.253	16.333	169.584	167.433