Divide and conquer algorithm for minimal cost feature selection with measurement error
Huang Weiting 1*, Zhao Hong 2
Author information
1. School of Computing, Minnan Normal University, Zhangzhou 363000, China
2. Laboratory of Granular Computing, Minnan Normal University, Zhangzhou 363000, China
History
Published
2016-09-25
Issue Date
2016-09-25
Abstract
Minimal cost feature selection is one of the most important problems at the current stage of data mining research. Its optimization goal is to obtain an attribute subset with the minimal total cost. In the process of acquiring real data, measurement error is inevitable, so the collected data always contain errors. Several minimal cost feature selection algorithms that account for measurement error already exist, but their efficiency is unsatisfactory, especially on large-scale datasets. To address this critical problem, this paper proposes a divide and conquer algorithm for minimal cost feature selection with measurement error. Following the divide and conquer strategy, the dataset is partitioned into disjoint sub-datasets according to the number of columns. The proposed algorithm is applied to each sub-dataset to obtain its optimal attribute subset, and these subsets are then merged to yield an attribute subset with the minimal total cost. For datasets of different scales, the size and number of the sub-datasets adapt to the scale of the dataset rather than being fixed. The computational efficiency of the algorithm is enhanced by reducing the scale of the problem. Experimental results on six UCI datasets of different scales show that the proposed method is effective. Compared with the classical backtracking algorithm, its efficiency is improved by at least thirty percent while the quality of the selected subset is preserved. Therefore, the proposed method can meet the needs of practical problems.
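To make the overall structure of the approach concrete, the Python sketch below illustrates the divide and conquer scheme described in the abstract: attribute columns are partitioned into disjoint blocks, a cost-minimizing selection is run independently on each block, and the per-block results are merged. This is only a minimal sketch under stated assumptions; the function names, the block-count parameter, and the greedy cost rule used as the per-block selector are illustrative stand-ins, not the authors' actual algorithm for handling measurement error.

```python
# Hypothetical sketch of the divide-and-conquer feature selection scheme.
# The per-block selection rule below is a simplified stand-in; the paper's
# actual per-block algorithm minimizes total cost under measurement error.
import random


def partition_attributes(num_attributes, num_blocks):
    """Split attribute (column) indices into disjoint blocks."""
    indices = list(range(num_attributes))
    block_size = -(-num_attributes // num_blocks)  # ceiling division
    return [indices[i:i + block_size]
            for i in range(0, num_attributes, block_size)]


def select_minimal_cost_subset(attributes, test_costs, misclassification_cost):
    """Placeholder selection on one sub-dataset: keep an attribute only if
    its test cost is lower than the misclassification cost it could avoid."""
    return [a for a in attributes if test_costs[a] < misclassification_cost]


def divide_and_conquer_selection(num_attributes, num_blocks,
                                 test_costs, misclassification_cost):
    """Run the selection on each disjoint block and merge the results."""
    merged = []
    for block in partition_attributes(num_attributes, num_blocks):
        merged.extend(select_minimal_cost_subset(
            block, test_costs, misclassification_cost))
    return merged


if __name__ == "__main__":
    random.seed(0)
    n_attrs = 12
    costs = {a: random.uniform(1, 10) for a in range(n_attrs)}  # illustrative test costs
    selected = divide_and_conquer_selection(
        num_attributes=n_attrs, num_blocks=3,
        test_costs=costs, misclassification_cost=6.0)
    print("Selected attributes:", sorted(selected))
```

In this sketch the number of blocks is passed in directly; in the paper it is adapted to the scale of the dataset, which is the property that keeps each sub-problem small enough to solve efficiently.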
Huang Weiting, Zhao Hong. Divide and conquer algorithm for minimal cost feature selection with measurement error[J]. Journal of Nanjing University (Natural Sciences), 2016, 52(5): 890.