南京大学学报(自然科学版) ›› 2022, Vol. 58 ›› Issue (4): 584–593.doi: 10.13232/j.cnki.jnju.2022.04.003

• • 上一篇    

基于粗糙集和改进二进制布谷鸟搜索算法的高维数据特征选择

章成旭, 叶绍强, 周恺卿(), 欧云   

  1. 吉首大学信息科学与工程学院,吉首,416000
  • 收稿日期:2022-04-24 出版日期:2022-07-30 发布日期:2022-08-01
  • 通讯作者: 周恺卿 E-mail:kqzhou@jsu.edu.cn
  • 基金资助:
    国家自然科学基金(62066016);湖南省自然科学基金(2020JJ5458);湖南省教育厅科学研究项目(21C0383)

Feature selection of high dimensional data utilizing improved binary cuckoo search algorithm and rough set

Chengxu Zhang, Shaoqiang Ye, Kaiqing Zhou(), Yun Ou   

  1. College of Information Science and Engineering,Jishou University,Jishou,416000,China
  • Received:2022-04-24 Online:2022-07-30 Published:2022-08-01
  • Contact: Kaiqing Zhou E-mail:kqzhou@jsu.edu.cn

摘要:

在大数据时代,数据多具有规模大、类别多、维度高和样本小等特点,使其特征空间中存在大量冗余和不相关的信息.这些冗余及不相关信息会影响模型的性能,增加计算负担,故特征子集的筛选是数据处理中不可或缺的一环.针对特征选择的数据量大、分类准确率低的问题,提出一种基于粗糙集和改进二进制布谷鸟搜索算法的高维数据特征选择模型.首先,为了加强布谷鸟算法的寻优能力,融合差分进化中变异交叉选择的思想;其次,利用新的鸟巢更新机制寻找优质特征,提升特征选择效果;最后,结合粗糙集构建合适的适应度函数进行评判.为了验证算法的性能,在UCI数据集上选取三种不同分类器进行实验,并利用Friedman检验与Nemenyi后续检验对实验数据进行评估.实验结果表明,提出算法的平均分类准确率达到88.7%,和其他算法相比,在特征选择方面更有优势.

关键词: 特征选择, 粗糙集, 二进制布谷鸟搜索算法, 差分进化, UCI数据集

Abstract:

In the era of big data,data has the characteristics of large scale,multiple categories,high dimensionality and small samples,which makes a large amount of redundant and irrelevant information in the feature space. These redundant and irrelevant information affects the performance of the model and increase the computational burden,so the selection of feature subsets is an indispensable part of data processing. Aiming at the problem of large amount of data for feature selection and low classification accuracy,this paper proposes a rough set and improved Binary Cuckoo Search algorithm. First of all,to strengthen the optimization ability of the Cuckoo algorithm,the idea of mutation cross selection in differential evolution is integrated. Secondly,use the new bird's nest update mechanism to find high?quality features and improve the effect of feature selection. Finally,a suitable fitness function is constructed in combination with rough set for judgment. In order to verify the performance of the algorithm,three different classifiers are selected on the UCI dataset to experiment with the proposed algorithm,and experimental data are evaluated by Friedman test and Nemenyi follow?up test. The results show that the average classification accuracy of the proposed algorithm reaches 88.7%,which is more advantageous than other algorithms in feature selection.

Key words: feature selection, rough set, Binary Cuckoo Search, differential evolution, UCI dataset

中图分类号: 

  • TP18

图1

特征选择流程图"

图2

特征更新过程"

图3

IBCSRS算法流程图"

表1

实验使用的数据集"

数据集名称特征数 (维度)实例数(个)类别个数(个)
Breast⁃cancer96992
Breast305692
Congress164352
Exactly1310002
Exactly21310002
Heart132702
Ionosphere343512
Krvskp3631962
Lymphography181484
M⁃of⁃n1310002
Penglung325737
Sonar602082
Spect222672
Tic⁃tac⁃toe99582
Vote163002
Waveform4050003
Wine131783
Zoo161017

表2

特征子集的分类精度"

数据集分类器50⁃50训练验证10折交叉验证
IBCSRSBCSBWOABBABPSOGSAIBCSRSBCSBWOABBABPSOGSA
Breast⁃cLR0.970.970.970.970.970.970.960.960.960.96
C4.50.970.950.980.980.980.940.940.980.970.97
NB0.980.970.970.970.970.970.960.900.900.88
BreastLR0.970.970.970.970.940.950.950.940.940.94
C4.50.960.950.960.960.940.940.930.970.970.96
NB0.970.970.970.970.930.970.960.970.970.94
CongressLR0.970.970.950.950.950.980.960.970.970.97
C4.50.970.960.970.970.960.980.970.980.980.96
NB0.960.950.950.950.930.960.960.930.930.91
ExactlyLR0.690.690.690.690.690.690.690.690.690.69
C4.50.980.870.860.860.700.950.940.780.740.70
NB0.690.690.690.690.690.690.690.690.690.69
Exactly2LR0.760.760.760.760.760.760.760.760.760.76
C4.50.760.760.760.760.760.790.760.790.790.78
NB0.760.760.760.760.760.760.760.760.760.76
HeartLR0.870.860.850.850.810.910.860.900.900.86
C4.50.820.810.840.840.790.840.820.840.830.73
NB0.870.870.840.840.780.850.850.830.820.72
IonosphereLR0.920.900.920.910.880.890.880.910.910.90
C4.50.930.900.940.930.930.940.930.930.930.90
NB0.930.930.950.930.920.910.900.960.960.90
KrvskpLR0.960.950.960.950.630.970.960.970.970.66
C4.50.990.980.980.980.6410.990.970.970.27
NB0.910.900.820.820.630.930.900.950.950.63
Lymph.LR0.890.880.870.860.800.930.870.900.900.87
C4.50.840.810.860.850.790.830.780.870.850.57
NB0.880.860.870.860.770.820.780.820.800.59
M⁃of⁃nLR1110.940.821110.970.88
C4.510.9910.990.8210.980.890.890.84
NB0.950.940.850.840.780.960.950.760.760.72

Penglung

LR0.950.910.950.940.580.900.890.990.980.66
C4.50.660.650.700.690.550.700.690.640.610.55
NB0.710.670.920.920.590.780.740.970.960.49
SonarLR0.830.820.830.820.680.900.870.870.870.74
C4.50.850.770.840.830.660.720.720.860.850.72
NB0.780.780.850.800.670.740.740.930.930.71
SpectLR0.870.870.870.870.810.920.890.910.910.88
C4.50.830.830.860.860.820.830.810.810.810.66
NB0.850.840.850.800.790.820.810.790.790.79
Tic⁃tac⁃toeLR0.680.680.730.720.710.700.700.760.760.75
C4.50.870.860.890.890.820.880.830.830.820.79
NB0.730.710.720.720.710.720.710.680.670.66
VoteLR0.960.950.700.680.650.950.940.900.750.89
C4.50.960.950.960.960.950.960.940.960.960.94
NB0.960.950.950.950.930.950.950.920.920.92
WaveformLR0.880.870.870.870.700.890.860.880.880.73
C4.50.770.750.770.770.610.790.750.780.770.67
NB0.830.820.830.820.660.840.820.830.830.66
WineLR0.980.970.970.970.950.970.960.950.950.95
C4.50.950.930.920.910.950.930.930.930.930.93
NB10.99110.970.990.990.980.980.98
ZooLR0.980.960.980.980.860.970.960.960.960.96
C4.50.980.960.970.960.910.970.970.970.970.85
NB0.990.990.950.940.790.990.960.960.930.83

图4

两种验证法中IBCSRS算法和四种对比算法各分类器的总平均准确率"

表3

qa的常用值"

qa23456
q0.051.9602.3432.5692.7282.850
q0.101.6452.0522.2912.4592.589

表4

50?50训练验证中四种算法的Friedman检验结果"

数据集IBCSRSBCSBWOABBABPSOGSA
平均序值1.643.222.253.174.72
Breast⁃c2.552.52.52.5
Breast24225
Congress123.53.55
Exactly123.53.55
Exactly233333
Heart123.53.55
Ionosphere24.5134.5
Krvskp14235
Lymph.14235
M⁃of⁃n12345
Penglung34125
Sonar24135
Spect23145
Tic⁃tac⁃toe34125
Vote12345
Waveform14235
Wine12.52.545
Zoo12345

表5

10折交叉验证中四种算法的Friedman检验结果"

数据集IBCSRSBCSBWOABBABPSOGSA
平均序值1.553.282.313.004.86
Breast⁃c12345
Breast34.51.51.54.5
Congress123.53.55
Exactly12345
Exactly225224
Heart14235
Ionosphere341.51.55
Krvskp142.52.55
Lymph.24135
M⁃of⁃n12345
Penglung34125
Sonar34125
Spect13335
Tic⁃tac⁃toe14235
Vote12345
Waveform14235
Wine12444
Zoo12.52.545
1 Zeng Z L, Zhang H J, Zhang R,et al. A novel feature selection method considering feature interaction. Pattern Recognition201548(8):2656-2666.
2 Li J D, Liu H. Challenges of feature selection for big data analytics. IEEE Intelligent Systems201732(2):9-15.
3 Pawlak Z. Rough sets. International Journal of Computer and Information Sciences198211(5):341-356.
4 Wang C Z, Shao M W, He Q,et al. Feature subset selection based on fuzzy neighborhood rough sets. Knowledge?Based Systems2016(111):173-179.
5 Yu Y, Pedrycz W, Miao D Q. Neighborhood rough sets based multi?label classification for automatic image annotation. International Journal of Approximate Reasoning201354(9):1373-1387.
6 Banerjee A, Maji P. Rough sets and stomped normal distribution for simultaneous segmentation and bias field correction in brain MR images. IEEE Transactions on Image Processing201524(12):5764-5776.
7 Zhou J, Pedrycz W, Miao D Q. Shadowed sets in the characterization of rough?fuzzy clustering. Pattern Recognition201144(8):1738-1749.
8 周涛,陆惠玲,任海玲,等. 基于粗糙集的属性约简算法综述. 电子学报202149(7):1439-1449.
Zhou T, Lu H L, Ren H L,et al. Survey on attribute reduction algorithm of rough set. Acta Electronica Sinica202149(7):1439-1449.
9 Bae C, Yeh W C, Chung Y Y,et al. Feature selection with intelligent dynamic swarm and rough set. Expert Systems with Applications201037(10):7026-7032.
10 Gupta A, Purohit A. RGAP:A rough set,genetic algorithm and particle swarm optimization based feature selection approach. International Journal of Computer Applications2017161(6):1-5.
11 王生武,陈红梅. 基于粗糙集和改进鲸鱼优化算法的特征选择方法. 计算机科学202047(2):44-50.
Wang S W, Chen H M. Feature selection method based on rough sets and improved whale optimization algorithm. Computer Science202047(2):44-50.
12 李红梅,周桂红,王克俭. 基于粗糙集和遗传算法的知识发现方法. 现代电子技术200730(8):76-78.
Li H M, Zhou G H, Wang K J. A knowledge disco?very method based on rough set theory and genetic algorithm. Modern Electronics Technique200730(8):76-78.
13 方波,陈红梅,王生武. 基于粗糙集和果蝇优化算法的特征选择方法. 计算机科学201946(7):157-164.
Fang B, Chen H M, Wang S W. Feature selection algorithm based on rough sets and fruit fly optimization. Computer Science201946(7):157-164.
14 Rodrigues D, Pereira L A M, Almeida T N S,et al. BCS:A binary cuckoo search algorithm for feature selection∥2013 IEEE International Symposium on Circuits and Systems. Beijing,China:IEEE,2013:465-468.
15 Yang X S, Deb S. Cuckoo search via Lévy flights∥2009 World Congress on Nature & Biologically Inspired Computing. Coimbatore,India:IEEE,2009:210-214.
16 Kennedy J, Eberhart R C. A discrete binary version of the particle swarm algorithm∥1997 IEEE International Conference on Systems,Man,and Cybernetics. Computational Cybernetics and Simulation. Orlando,FL,USA:IEEE,1997:4104-4108.
17 Mirjalili S, Wang G G, dos S. Coelho L. Binary optimization using hybrid particle swarm optimization and gravitational search algorithm. Neural Computing and Applications201425(6):1423-1435.
[1] 王文珏, 黄兵. 多尺度单值中智系统中基于优势粗糙集模型的最优尺度选择与约简[J]. 南京大学学报(自然科学版), 2022, 58(3): 495-505.
[2] 曾艺祥, 林耀进, 范凯钧, 曾伯儒. 基于层次类别邻域粗糙集的在线流特征选择算法[J]. 南京大学学报(自然科学版), 2022, 58(3): 506-518.
[3] 周悦丽, 林国平, 谢淋淋. 基于矩阵的动态局部相容粗糙集的增量方法[J]. 南京大学学报(自然科学版), 2022, 58(3): 519-531.
[4] 胡玉文, 徐久成, 张倩倩. 决策演化集的卷积预测[J]. 南京大学学报(自然科学版), 2022, 58(1): 1-8.
[5] 卢舜, 林耀进, 吴镒潾, 包丰浩, 王晨曦. 基于多粒度一致性邻域的多标记特征选择[J]. 南京大学学报(自然科学版), 2022, 58(1): 60-70.
[6] 于子淳, 吴伟志. 用证据理论刻画协调的具有多尺度决策的信息系统的最优尺度选择[J]. 南京大学学报(自然科学版), 2022, 58(1): 71-81.
[7] 王敬前, 张小红. 基于极大相容块的不完备信息处理新方法及其应用[J]. 南京大学学报(自然科学版), 2022, 58(1): 82-93.
[8] 刘小伟, 景运革. 一种有效更新多源数据约简的增量算法[J]. 南京大学学报(自然科学版), 2021, 57(6): 1083-1091.
[9] 李苓玉, 刘治平. 基于机器学习的自发性早产生物标记物发现[J]. 南京大学学报(自然科学版), 2021, 57(5): 767-774.
[10] 孙颖, 蔡天使, 张毅, 鞠恒荣, 丁卫平. 基于合理粒度的局部邻域决策粗糙计算方法[J]. 南京大学学报(自然科学版), 2021, 57(2): 262-271.
[11] 刘琼, 代建华, 陈姣龙. 区间值数据的代价敏感特征选择[J]. 南京大学学报(自然科学版), 2021, 57(1): 121-129.
[12] 郑嘉文, 吴伟志, 包菡, 谭安辉. 基于熵的多尺度决策系统的最优尺度选择[J]. 南京大学学报(自然科学版), 2021, 57(1): 130-140.
[13] 郑文彬, 李进金, 张燕兰, 廖淑娇. 基于矩阵的多粒度粗糙集粒度约简方法[J]. 南京大学学报(自然科学版), 2021, 57(1): 141-149.
[14] 毛振宇, 窦慧莉, 宋晶晶, 姜泽华, 王平心. 共现邻域关系下的属性约简研究[J]. 南京大学学报(自然科学版), 2021, 57(1): 150-159.
[15] 李佳佳, 丁伟, 王伯伟, 聂秀山, 崔超然. 基于随机森林的民俗体育对身体指标影响评估方法[J]. 南京大学学报(自然科学版), 2021, 57(1): 59-67.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!