Journal of Nanjing University (Natural Science), 2019, Vol. 55, Issue 4: 633–643. doi: 10.13232/j.cnki.jnju.2019.04.013

Colon characteristic gene selection based on fuzzy discernibility matrix

Teng Li1, Tian Yang2,4, Jianhua Dai2, Ling Chen3

  1. College of Logistics and Transportation, Central South University of Forestry and Technology, Changsha, 410004, China
  2. Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, 410081, China
  3. Xiangya Hospital of Central South University, Changsha, 410008, China
  4. College of Systems Engineering, National University of Defense Technology, Changsha, 410073, China
  • Received: 2019-05-28 Online: 2019-07-30 Published: 2019-07-23
  • Contact: Ling Chen E-mail: 50766131@qq.com
  • Funding:
    China Postdoctoral Science Foundation (2017T100795); Natural Science Foundation of Hunan Province (2017JJ2408); Key Research and Development Program of Hunan Province (2018SK2129)

Abstract:

Poorly differentiated tumors are difficult to diagnose by conventional histopathology, whereas gene-based testing can accurately screen out the disease-causing genes of a specific tumor. Gene selection has therefore become a key issue in tumor classification and clinical treatment. Tumor gene expression data typically contain thousands of genes but only a small number of samples, and existing gene selection algorithms leave room for improvement in both classification accuracy and computational efficiency. On the basis of fuzzy rough set theory, this paper fuzzifies the discernibility matrix and designs an attribute reduction algorithm based on the resulting fuzzy discernibility matrix. Compared with the classical discernibility matrix, the fuzzy discernibility matrix reflects how strongly each attribute distinguishes a given pair of objects, so that attributes with a higher degree of distinction can be selected for better classification performance. Numerical experiments show that the method improves classification accuracy on tumor gene data and reduces computation time. In a gene selection experiment on the colorectal cancer dataset (Colon Microarray) with a kNN classifier, five key genes related to colorectal cancer were selected from 2000 candidate genes, with a classification accuracy of 88.06%.

Key words: fuzzy rough sets, rough sets, fuzzy discernibility matrix, gene selection

CLC number: TP311

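Table 1. Decision table

| U | a1 | a2 | a3 | a4 | a5 | a6 | a7 | D |
|---|---|---|---|---|---|---|---|---|
| x1 | 0.32 | 0.77 | 0.5 | 0.48 | 0.1 | 0.9 | 0.36 | 1 |
| x2 | 0.56 | 0.98 | 0.72 | 0.9 | 0.4 | 0.38 | 0.82 | 2 |
| x3 | 0.88 | 0.42 | 0.77 | 0.62 | 0.54 | 0.79 | 0.36 | 1 |
| x4 | 0.6 | 0.67 | 0.51 | 0.7 | 0.3 | 0.22 | 0.86 | 3 |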

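Table 2. Fuzzy discernibility matrix

| U | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| x1 | (0,0,0,0,0,0,0) | (0,0,0,0.0938,0,0.25,0.1562) | (0,0,0,0,0,0,0) | (0,0,0,0,0,0.5,0.2188) |
| x2 | | (0,0,0,0,0,0,0) | (0,0.3125,0,0,0,0.0781,0.1562) | (0,0,0,0,0,0,0) |
| x3 | | | (0,0,0,0,0,0,0) | (0,0,0,0,0,0.3281,0.2188) |
| x4 | | | | (0,0,0,0,0,0,0) |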
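This page does not state the formula behind Table 2. The Python sketch below is an inference, not the authors' definition. For two objects with different decisions, each entry is taken to be max(0, (|a(x) - a(y)| - 0.36) / 0.64); for objects with the same decision the entry is the zero vector. The constant 0.36 (and hence 0.64 = 1 - 0.36) and all function names are assumptions, chosen only because this rule agrees with every printed entry of Table 2 to four decimal places when applied to the data of Table 1.

```python
from itertools import combinations

# Data of Table 1: object -> (values of a1..a7, decision D).
TABLE1 = {
    "x1": ([0.32, 0.77, 0.50, 0.48, 0.10, 0.90, 0.36], 1),
    "x2": ([0.56, 0.98, 0.72, 0.90, 0.40, 0.38, 0.82], 2),
    "x3": ([0.88, 0.42, 0.77, 0.62, 0.54, 0.79, 0.36], 1),
    "x4": ([0.60, 0.67, 0.51, 0.70, 0.30, 0.22, 0.86], 3),
}

# ASSUMPTION: inferred from the printed numbers, not taken from the paper.
THRESHOLD = 0.36


def fuzzy_entry(x, y, threshold=THRESHOLD):
    """Per-attribute degree to which objects x and y are distinguished."""
    values_x, decision_x = TABLE1[x]
    values_y, decision_y = TABLE1[y]
    if decision_x == decision_y:
        # Objects in the same decision class need not be distinguished.
        return [0.0] * len(values_x)
    return [
        max(0.0, (abs(p - q) - threshold) / (1.0 - threshold))
        for p, q in zip(values_x, values_y)
    ]


def fuzzy_matrix():
    """Upper-triangular entries, keyed by the object pair."""
    return {(x, y): fuzzy_entry(x, y) for x, y in combinations(TABLE1, 2)}


if __name__ == "__main__":
    for pair, entry in fuzzy_matrix().items():
        print(pair, [round(v, 4) for v in entry])
```

Under this assumed rule, an attribute contributes to an entry only when the two values differ by more than 0.36, which is why most components in Table 2 are zero.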

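Table 3. Classical discernibility matrix

| U | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| x1 | ∅ | {a4, a6, a7} | ∅ | {a6, a7} |
| x2 | | ∅ | {a2, a6, a7} | ∅ |
| x3 | | | ∅ | {a6, a7} |
| x4 | | | | ∅ |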
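The relationship between the fuzzy and the classical matrix can be illustrated with a short Python sketch. It hard-codes the nonzero entries of the fuzzy discernibility matrix (all-zero entries are omitted because they impose no constraint), recovers the classical entries as the sets of attributes with a positive degree, and then runs a simple greedy covering heuristic. The heuristic and the names `classical_entries` and `greedy_select` are illustrative assumptions, not the reduction algorithm of the paper, which this page does not describe.

```python
ATTRS = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]

# Nonzero entries of the fuzzy discernibility matrix, as printed.
FUZZY = {
    ("x1", "x2"): (0, 0, 0, 0.0938, 0, 0.25, 0.1562),
    ("x1", "x4"): (0, 0, 0, 0, 0, 0.5, 0.2188),
    ("x2", "x3"): (0, 0.3125, 0, 0, 0, 0.0781, 0.1562),
    ("x3", "x4"): (0, 0, 0, 0, 0, 0.3281, 0.2188),
}


def classical_entries(fuzzy):
    """Classical entry = set of attributes whose fuzzy degree is positive."""
    return {
        pair: {a for a, v in zip(ATTRS, vec) if v > 0}
        for pair, vec in fuzzy.items()
    }


def greedy_select(fuzzy):
    """ASSUMED heuristic: repeatedly pick the attribute with the largest
    summed degree over the object pairs that are not yet distinguished."""
    uncovered = set(fuzzy)
    selected = []
    while uncovered:
        scores = {
            a: sum(fuzzy[pair][i] for pair in uncovered)
            for i, a in enumerate(ATTRS)
        }
        best = max(ATTRS, key=lambda a: scores[a])
        if scores[best] == 0:
            break  # no attribute distinguishes the remaining pairs
        selected.append(best)
        index = ATTRS.index(best)
        uncovered = {pair for pair in uncovered if fuzzy[pair][index] == 0}
    return selected


if __name__ == "__main__":
    print(classical_entries(FUZZY))
    print(greedy_select(FUZZY))
```

On this toy data both a6 and a7 appear in every non-empty classical entry, so the classical matrix alone cannot rank them. The summed fuzzy degrees (about 1.16 for a6 versus 0.75 for a7) favor a6, and the sketch returns `['a6']`.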

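Table 4. Description of the datasets

| Datasets | Samples | Features | Classes |
|---|---|---|---|
| Heart | 270 | 13 | 2 |
| Primary-tumor | 339 | 17 | 2 |
| Mammographic | 961 | 5 | 2 |
| Lung cancer | 32 | 56 | 2 |
| Gastrointestinal | 76 | 698 | 3 |
| Leukemia | 72 | 7070 | 2 |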

Table 5. Classification accuracy of the different algorithms (%)

| Datasets | Raw | CDG | NFRS | HANDI | FDM |
|---|---|---|---|---|---|
| Heart | 80.15±7.5 | 80.86±4.32 | 81.05±5.74 | 81.11±9.01 | 81.30±10.93 |
| Primary-tumor | 66.45±4.84 | 67.77±8.37 | 69.21±6.83 | 68.56±4.21 | 68.12±4.75 |
| Mammographic | 75.50±4.42 | 76.79±3.00 | 76.69±4.56 | 77.26±3.05 | 77.09±2.87 |
| Lung cancer | 83.65±21.15 | 83.75±33.75 | 80.63±19.37 | 83.75±33.75 | 85.63±35.62 |
| Gastrointestinal | 54.55±0.00 | 71.82±17.27 | 71.36±15.00 | 71.76±13.24 | 74.55±15.45 |
| Leukemia | 86.63±13.37 | 97.86±7.38 | 97.62±7.14 | 97.86±2.62 | 98.10±2.86 |
| Average | 74.49 | 79.81 | 79.43 | 80.05 | 80.80 |

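Table 6. Time taken by the different algorithms to compute a reduct (unit: s)

| Datasets | CDG | NFRS | HANDI | FDM |
|---|---|---|---|---|
| Heart | 0.811 | 14.851 | 6.568 | 1.108 |
| Primary-tumor | 1.810 | 28.533 | 14.883 | 2.012 |
| Mammographic | 4.820 | 49.250 | 31.247 | 3.838 |
| Lung cancer | 0.156 | 1.747 | 0.686 | 0.094 |
| Gastrointestinal | 1.794 | 281.083 | 76.485 | 3.151 |
| Leukemia | 27.487 | 1520.600 | 168.138 | 22.776 |
| Average | 6.146 | 316.011 | 49.668 | 5.497 |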

Table 7. Description of the Colon dataset

| Tumor dataset | Genes | Samples | Positive | Normal |
|---|---|---|---|---|
| Colon | 2000 | 62 | 40 | 22 |

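Table 8. Comparison of the two gene selection methods

| Method | Genes | δ | Selected features | kNN accuracy (%) |
|---|---|---|---|---|
| EGGS | 2000 | 0.25 | H55933, T58861, H61410, J05032, H06524 | 86.25 |
| FDM | 2000 | 0.24 | T63591, D13315, X83412, J02854, R62438 | 88.06 |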