Feature selection of high dimensional data utilizing improved binary cuckoo search algorithm and rough set

Chengxu Zhang, Shaoqiang Ye, Kaiqing Zhou, Yun Ou

College of Information Science and Engineering, Jishou University, Jishou, 416000, China

Journal of Nanjing University (Natural Sciences), 2022, Vol. 58, Issue 4: 584–593. doi: 10.13232/j.cnki.jnju.2022.04.003

Received: 2022-04-24 Online: 2022-07-30 Published: 2022-08-01
Corresponding author: Kaiqing Zhou, E-mail: kqzhou@jsu.edu.cn
Funding: National Natural Science Foundation of China (62066016); Natural Science Foundation of Hunan Province (2020JJ5458); Scientific Research Project of the Hunan Provincial Department of Education (21C0383)

Abstract:

In the era of big data, data are often large in scale, span many categories, and combine high dimensionality with small sample sizes, so the feature space contains a large amount of redundant and irrelevant information. This information degrades model performance and increases the computational burden, which makes feature subset selection an indispensable step in data processing. To address the large data volume and low classification accuracy encountered in feature selection, this paper proposes a feature selection model for high-dimensional data based on rough sets and an improved Binary Cuckoo Search algorithm. First, to strengthen the optimization ability of the cuckoo algorithm, the mutation, crossover, and selection ideas of differential evolution are integrated. Second, a new nest update mechanism is used to find high-quality features and improve the feature selection result. Finally, a suitable fitness function is constructed with rough sets to evaluate candidate subsets. To verify the performance of the algorithm, experiments are run with three different classifiers on UCI datasets, and the experimental data are evaluated with the Friedman test and the Nemenyi post-hoc test. The results show that the proposed algorithm reaches an average classification accuracy of 88.7% and has an advantage over the other algorithms in feature selection.

Key words: feature selection, rough set, Binary Cuckoo Search, differential evolution, UCI dataset
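The abstract does not give the fitness function itself, so the following is only a minimal sketch of how a rough-set-based fitness for a binary feature mask is commonly built. It uses the standard dependency degree, the fraction of objects whose equivalence class under the selected features carries a single decision label. The weighting `alpha`, the subset-size term, the function names, and the toy decision table are all assumptions for illustration, not the authors' formulation.

```python
from collections import defaultdict


def dependency_degree(rows, labels, subset):
    """Rough-set dependency: |positive region| / |universe| for the feature indices in `subset`."""
    if not rows:
        return 0.0
    # Objects with identical values on the selected features form one equivalence class.
    classes = defaultdict(list)
    for row, label in zip(rows, labels):
        classes[tuple(row[i] for i in subset)].append(label)
    # A class belongs to the positive region when all of its decision labels agree.
    positive = sum(len(ls) for ls in classes.values() if len(set(ls)) == 1)
    return positive / len(rows)


def fitness(rows, labels, mask, alpha=0.9):
    """Hypothetical fitness: reward high dependency and, with weight 1 - alpha, small subsets."""
    subset = [i for i, bit in enumerate(mask) if bit]
    gamma = dependency_degree(rows, labels, subset)
    return alpha * gamma + (1 - alpha) * (len(mask) - len(subset)) / len(mask)


# Toy decision table (assumed data): the label is the OR of features 0 and 1, feature 2 is noise.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 1)]
labels = [0, 1, 1, 1, 0]

print(fitness(rows, labels, [1, 1, 0]))  # both informative features, noise dropped
print(fitness(rows, labels, [1, 1, 1]))  # all features kept
print(fitness(rows, labels, [0, 0, 1]))  # noise feature only
```

In a binary cuckoo search each nest would be such a 0/1 mask, and a function of this kind would score it. The mask that drops the noise feature scores higher than the full mask because both preserve the dependency degree while the smaller subset earns the size bonus.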

CLC number: TP18

Chengxu Zhang, Shaoqiang Ye, Kaiqing Zhou, Yun Ou. Feature selection of high dimensional data utilizing improved binary cuckoo search algorithm and rough set[J]. Journal of Nanjing University (Natural Sciences), 2022, 58(4): 584–593.


Table 2. Classification accuracy of feature subsets

		50-50 training/validation					10-fold cross-validation				
Dataset	Classifier	IBCSRS	BCS	BWOA	BBA	BPSOGSA	IBCSRS	BCS	BWOA	BBA	BPSOGSA
Breast-c	LR	0.97	0.97	0.97	0.97	0.97	0.97	0.96	0.96	0.96	0.96
	C4.5	0.97	0.95	0.98	0.98	0.98	0.94	0.94	0.98	0.97	0.97
	NB	0.98	0.97	0.97	0.97	0.97	0.97	0.96	0.90	0.90	0.88
Breast	LR	0.97	0.97	0.97	0.97	0.94	0.95	0.95	0.94	0.94	0.94
	C4.5	0.96	0.95	0.96	0.96	0.94	0.94	0.93	0.97	0.97	0.96
	NB	0.97	0.97	0.97	0.97	0.93	0.97	0.96	0.97	0.97	0.94
Congress	LR	0.97	0.97	0.95	0.95	0.95	0.98	0.96	0.97	0.97	0.97
	C4.5	0.97	0.96	0.97	0.97	0.96	0.98	0.97	0.98	0.98	0.96
	NB	0.96	0.95	0.95	0.95	0.93	0.96	0.96	0.93	0.93	0.91
Exactly	LR	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69
	C4.5	0.98	0.87	0.86	0.86	0.70	0.95	0.94	0.78	0.74	0.70
	NB	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69	0.69
Exactly2	LR	0.76	0.76	0.76	0.76	0.76	0.76	0.76	0.76	0.76	0.76
	C4.5	0.76	0.76	0.76	0.76	0.76	0.79	0.76	0.79	0.79	0.78
	NB	0.76	0.76	0.76	0.76	0.76	0.76	0.76	0.76	0.76	0.76
Heart	LR	0.87	0.86	0.85	0.85	0.81	0.91	0.86	0.90	0.90	0.86
	C4.5	0.82	0.81	0.84	0.84	0.79	0.84	0.82	0.84	0.83	0.73
	NB	0.87	0.87	0.84	0.84	0.78	0.85	0.85	0.83	0.82	0.72
Ionosphere	LR	0.92	0.90	0.92	0.91	0.88	0.89	0.88	0.91	0.91	0.90
	C4.5	0.93	0.90	0.94	0.93	0.93	0.94	0.93	0.93	0.93	0.90
	NB	0.93	0.93	0.95	0.93	0.92	0.91	0.90	0.96	0.96	0.90
Krvskp	LR	0.96	0.95	0.96	0.95	0.63	0.97	0.96	0.97	0.97	0.66
	C4.5	0.99	0.98	0.98	0.98	0.64	1	0.99	0.97	0.97	0.27
	NB	0.91	0.90	0.82	0.82	0.63	0.93	0.90	0.95	0.95	0.63
Lymph.	LR	0.89	0.88	0.87	0.86	0.80	0.93	0.87	0.90	0.90	0.87
	C4.5	0.84	0.81	0.86	0.85	0.79	0.83	0.78	0.87	0.85	0.57
	NB	0.88	0.86	0.87	0.86	0.77	0.82	0.78	0.82	0.80	0.59
M-of-n	LR	1	1	1	0.94	0.82	1	1	1	0.97	0.88
	C4.5	1	0.99	1	0.99	0.82	1	0.98	0.89	0.89	0.84
	NB	0.95	0.94	0.85	0.84	0.78	0.96	0.95	0.76	0.76	0.72
Penglung	LR	0.95	0.91	0.95	0.94	0.58	0.90	0.89	0.99	0.98	0.66
	C4.5	0.66	0.65	0.70	0.69	0.55	0.70	0.69	0.64	0.61	0.55
	NB	0.71	0.67	0.92	0.92	0.59	0.78	0.74	0.97	0.96	0.49
Sonar	LR	0.83	0.82	0.83	0.82	0.68	0.90	0.87	0.87	0.87	0.74
	C4.5	0.85	0.77	0.84	0.83	0.66	0.72	0.72	0.86	0.85	0.72
	NB	0.78	0.78	0.85	0.80	0.67	0.74	0.74	0.93	0.93	0.71
Spect	LR	0.87	0.87	0.87	0.87	0.81	0.92	0.89	0.91	0.91	0.88
	C4.5	0.83	0.83	0.86	0.86	0.82	0.83	0.81	0.81	0.81	0.66
	NB	0.85	0.84	0.85	0.80	0.79	0.82	0.81	0.79	0.79	0.79
Tic-tac-toe	LR	0.68	0.68	0.73	0.72	0.71	0.70	0.70	0.76	0.76	0.75
	C4.5	0.87	0.86	0.89	0.89	0.82	0.88	0.83	0.83	0.82	0.79
	NB	0.73	0.71	0.72	0.72	0.71	0.72	0.71	0.68	0.67	0.66
Vote	LR	0.96	0.95	0.70	0.68	0.65	0.95	0.94	0.90	0.75	0.89
	C4.5	0.96	0.95	0.96	0.96	0.95	0.96	0.94	0.96	0.96	0.94
	NB	0.96	0.95	0.95	0.95	0.93	0.95	0.95	0.92	0.92	0.92
Waveform	LR	0.88	0.87	0.87	0.87	0.70	0.89	0.86	0.88	0.88	0.73
	C4.5	0.77	0.75	0.77	0.77	0.61	0.79	0.75	0.78	0.77	0.67
	NB	0.83	0.82	0.83	0.82	0.66	0.84	0.82	0.83	0.83	0.66
Wine	LR	0.98	0.97	0.97	0.97	0.95	0.97	0.96	0.95	0.95	0.95
	C4.5	0.95	0.93	0.92	0.91	0.95	0.93	0.93	0.93	0.93	0.93
	NB	1	0.99	1	1	0.97	0.99	0.99	0.98	0.98	0.98
Zoo	LR	0.98	0.96	0.98	0.98	0.86	0.97	0.96	0.96	0.96	0.96
	C4.5	0.98	0.96	0.97	0.96	0.91	0.97	0.97	0.97	0.97	0.85
	NB	0.99	0.99	0.95	0.94	0.79	0.99	0.96	0.96	0.93	0.83
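The accuracy table above feeds the rank-based statistical tests named in the abstract. As a hedged sketch, the snippet below assumes that each algorithm's accuracy is averaged over the three classifiers and that algorithms are then ranked per dataset, with ties sharing the mean rank. The page does not spell out this procedure, but for the Heart rows under 50-50 validation it reproduces the ranks 1, 2, 3.5, 3.5, 5 listed for Heart further down. The function name `average_ranks` is illustrative.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = best); tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


algorithms = ["IBCSRS", "BCS", "BWOA", "BBA", "BPSOGSA"]

# Heart rows of the table, 50-50 columns, one row per classifier (LR, C4.5, NB).
heart_5050 = [
    [0.87, 0.86, 0.85, 0.85, 0.81],
    [0.82, 0.81, 0.84, 0.84, 0.79],
    [0.87, 0.87, 0.84, 0.84, 0.78],
]

# Mean accuracy per algorithm; rounding guards against floating-point noise in tie detection.
mean_acc = [round(sum(col) / len(col), 4) for col in zip(*heart_5050)]
heart_ranks = average_ranks(mean_acc)
print(dict(zip(algorithms, heart_ranks)))
```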


Table 3. Common values of $q_\alpha$ (columns give the number of algorithms $k$)

$q_\alpha$	2	3	4	5	6
$q_{0.05}$	1.960	2.343	2.569	2.728	2.850
$q_{0.10}$	1.645	2.052	2.291	2.459	2.589
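These critical values enter the Nemenyi post-hoc test through the critical difference $CD = q_\alpha \sqrt{k(k+1)/(6N)}$, where $k$ is the number of algorithms and $N$ the number of datasets. The sketch below evaluates it for $k = 5$ and $N = 18$, the counts of algorithms and datasets that appear in the tables on this page. The dictionary and function names are illustrative.

```python
import math

# Critical values for k = 2..6 algorithms, copied from the table above.
Q_ALPHA = {
    0.05: {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850},
    0.10: {2: 1.645, 3: 2.052, 4: 2.291, 5: 2.459, 6: 2.589},
}


def critical_difference(alpha, k, n_datasets):
    """Nemenyi critical difference CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return Q_ALPHA[alpha][k] * math.sqrt(k * (k + 1) / (6 * n_datasets))


cd_05 = critical_difference(0.05, k=5, n_datasets=18)
cd_10 = critical_difference(0.10, k=5, n_datasets=18)
print(f"CD(0.05) = {cd_05:.3f}, CD(0.10) = {cd_10:.3f}")
```

Two algorithms are judged to perform differently at level $\alpha$ when their average ranks differ by more than the corresponding CD.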




Table 1. Datasets

Dataset	Features (dimensions)	Instances	Classes
Breast-cancer	9	699	2
Breast	30	569	2
Congress	16	435	2
Exactly	13	1000	2
Exactly2	13	1000	2
Heart	13	270	2
Ionosphere	34	351	2
Krvskp	36	3196	2
Lymphography	18	148	4
M-of-n	13	1000	2
Penglung	325	73	7
Sonar	60	208	2
Spect	22	267	2
Tic-tac-toe	9	958	2
Vote	16	300	2
Waveform	40	5000	3
Wine	13	178	3
Zoo	16	101	7

Table 4. Ranks of the algorithms on each dataset (50-50 training/validation)

Dataset	IBCSRS	BCS	BWOA	BBA	BPSOGSA
Average rank	1.64	3.22	2.25	3.17	4.72
Breast-c	2.5	5	2.5	2.5	2.5
Breast	2	4	2	2	5
Congress	1	2	3.5	3.5	5
Exactly	1	2	3.5	3.5	5
Exactly2	3	3	3	3	3
Heart	1	2	3.5	3.5	5
Ionosphere	2	4.5	1	3	4.5
Krvskp	1	4	2	3	5
Lymph.	1	4	2	3	5
M-of-n	1	2	3	4	5
Penglung	3	4	1	2	5
Sonar	2	4	1	3	5
Spect	2	3	1	4	5
Tic-tac-toe	3	4	1	2	5
Vote	1	2	3	4	5
Waveform	1	4	2	3	5
Wine	1	2.5	2.5	4	5
Zoo	1	2	3	4	5
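The average ranks in the table above are the input to the Friedman test. As a sketch, the snippet below computes the Friedman chi-square statistic $\chi^2_F = \frac{12N}{k(k+1)}\left(\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right)$ and the Iman-Davenport correction $F_F = \frac{(N-1)\chi^2_F}{N(k-1) - \chi^2_F}$, which follows an F distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom. Whether the paper reports the corrected or uncorrected statistic is not stated on this page, and because the average ranks are printed to two decimals the results are approximate. The function name is illustrative.

```python
def friedman_statistics(avg_ranks, n_datasets):
    """Return (chi-square statistic, Iman-Davenport F statistic) from average ranks."""
    k = len(avg_ranks)
    chi2 = 12 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    f_stat = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f_stat


# Average ranks of IBCSRS, BCS, BWOA, BBA, BPSOGSA from the table above (18 datasets).
avg_ranks = [1.64, 3.22, 2.25, 3.17, 4.72]
chi2, f_stat = friedman_statistics(avg_ranks, n_datasets=18)
print(f"chi2_F = {chi2:.3f}, F_F = {f_stat:.3f}, dof = (4, 68)")
```

The F statistic would then be compared against the critical value of the F distribution with 4 and 68 degrees of freedom at the chosen significance level.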

Table 5. Ranks of the algorithms on each dataset (10-fold cross-validation)

Dataset	IBCSRS	BCS	BWOA	BBA	BPSOGSA
Average rank	1.55	3.28	2.31	3.00	4.86
Breast-c	1	2	3	4	5
Breast	3	4.5	1.5	1.5	4.5
Congress	1	2	3.5	3.5	5
Exactly	1	2	3	4	5
Exactly2	2	5	2	2	4
Heart	1	4	2	3	5
Ionosphere	3	4	1.5	1.5	5
Krvskp	1	4	2.5	2.5	5
Lymph.	2	4	1	3	5
M-of-n	1	2	3	4	5
Penglung	3	4	1	2	5
Sonar	3	4	1	2	5
Spect	1	3	3	3	5
Tic-tac-toe	1	4	2	3	5
Vote	1	2	3	4	5
Waveform	1	4	2	3	5
Wine	1	2	4	4	4
Zoo	1	2.5	2.5	4	5
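A Nemenyi-style pairwise comparison can be run directly on the average ranks in the table above. The sketch below compares IBCSRS with each other algorithm at $\alpha = 0.05$, using $q_{0.05} = 2.728$ for $k = 5$ from the $q_\alpha$ table and $N = 18$ datasets. It works from the two-decimal ranks printed here, so borderline pairs (differences close to the critical difference) should be read with caution. Variable and function names are illustrative.

```python
import math

# Average ranks from the table above.
avg_ranks = {"IBCSRS": 1.55, "BCS": 3.28, "BWOA": 2.31, "BBA": 3.00, "BPSOGSA": 4.86}

k, n_datasets, q_05 = 5, 18, 2.728
cd = q_05 * math.sqrt(k * (k + 1) / (6 * n_datasets))  # Nemenyi critical difference


def differs_significantly(a, b):
    """True when the average-rank gap between algorithms a and b exceeds the critical difference."""
    return abs(avg_ranks[a] - avg_ranks[b]) > cd


significant = {
    name: differs_significantly("IBCSRS", name)
    for name in avg_ranks
    if name != "IBCSRS"
}
print(f"CD = {cd:.3f}")
for name, flag in significant.items():
    gap = avg_ranks[name] - avg_ranks["IBCSRS"]
    print(f"IBCSRS vs {name}: rank gap {gap:.2f}, exceeds CD: {flag}")
```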