南京大学学报(自然科学版) ›› 2016, Vol. 52 ›› Issue (5): 890–.

• • 上一篇    下一篇

基于误差数据的最小代价属性选择分治算法

黄伟婷1*,赵 红2   

  • 出版日期:2016-09-25 发布日期:2016-09-25
  • 作者简介:1.闽南师范大学计算机学院,漳州,363000;2.闽南师范大学粒计算及其应用重点实验室,漳州,363000
  • 基金资助:
    基金项目:国家自然科学基金(61379049,61379089),福建省教育厅项目(JAT160287),漳州市自然科学基金(ZZ2016J35) 收稿日期:2016-07-22 *通讯联系人,E­mail:weitinghuang92@163.com

Divide and conquer algorithm for minimal cost feature selection with measurement error

Huang Weiting1*,Zhao Hong2   

  • Online:2016-09-25 Published:2016-09-25
  • About author:1.School of Computing,Minnan Normal University,Zhangzhou,363000,China;2.Laboratory of Granular Computing,Minnan Normal University,Zhangzhou,363000,China

摘要: 最小代价属性选择是数据挖掘的重要问题之一,问题的优化目标是得到总代价最小的属性子集.在实际数据的获取过程中,测量误差是不可避免的.基于测量误差,目前已有一些相关的最小代价属性选择方法.但这些方法存在效率上的问题,特别是对大规模数据集.为解决这一关键问题,提出一种基于误差数据的最小代价属性选择分治算法.该算法将数据集按列拆分为若干个互不相交的子数据集,实现对各子数据集的求解,分而治之.对于不同规模的数据集,其子数据集的大小及总个数并非固定不变,而是根据各数据集的规模自适应设定的.该算法通过拆分数据集来降低问题规模,有效地提高了计算效率.对6个不同规模UCI数据集的实验分析表明该算法的有效性,与经典回溯算法相比,该算法的效果相当但效率至少提高了30%,更能适应实际问题的需要.

Abstract: Minimal cost feature selection is one of the most important problems in the current stage of data mining research.The optimization goal of this problem is to obtain an attribute subset that has the minimal total cost.In the process of access to the actual data,measurement error is inevitable,and the actual data will exist error.At present,there are some minimal cost feature selection algorithms with measurement error.But the efficiency of these algorithms is not very satisfactory,especially on large scale datasets.In order to solve this critical problem,this paper proposes a divide and conquer algorithm for minimal cost feature selection with measurement error.Combining divide and conquer thought,the dataset is broken down into some disjoint sub­datasets according to the number of the column.The proposed algorithm is used for each sub­dataset,and it can obtain the optimal attribute subset of each sub­dataset.These optimal attribute subsets are merged,thus an attribute subset with the minimal total cost is obtained.For the different scale of the datasets,the size of the sub­datasets and the number of the sub­datasets are adaptive to the scale of the datasets rather than fixed.The computational efficiency of this algorithm is enhanced by reducing the scale of the problem.The experimental results on six UCI datasets with different scales show that the proposed method is effective.Compared with the classical backtracking algorithm,the efficiency of this algorithm is increased at least by thirty percent.At the same time,the effect of this algorithm is guaranteed.Therefore the proposed method can well meet the need of the practical problems.

[1] Turney P D.Types of cost in inductive concept learning.In:Proceedings of the Workshop on Cost­Sensitive Learning at the 17th International Conference on Machine Learning(ICML).Canada:National Research Council,2000:15-21.
[2]  Liu D,Li T R,Ruan D.Probabilistic model criteria with decision­theoretic rough sets.Information Sciences,2011,181(17):3709-3722.
[3]  Min F,Zhu W.Minimal cost attribute reduction through backtracking.Database Theory and Application,Bio­Science and Bio­Technology.Heidelberg,Berlin:Springer,2011:100-107.
[4]  Qian W B,Shu W H,Yang J,et al.Cost­sensitive feature selection on heterogeneous data.Advances in Knowledge Discovery and Data Mining.Springer International Publishing,2015:397-408.
[5]  Yang X B,Qi Y S,Song X N,et al.Test cost sensitive multigranulation rough set:Model and minimal cost selection.Information Sciences,2013,250:184-199.
[6]  Bolón­Canedo V,Porto­Díaz I,Sánchez­Maroño N,et al.A framework for cost­based feature selection.Pattern Recognition,2014,47(7):2481-2489.
[7]  Li X J,Zhao H,Zhu W.A cost sensitive decision tree algorithm with two adaptive mechanisms.Knowledge­Based Systems,2015,88:24-33.
[8]  Shu W H,Shen H.Multi­criteria feature selection on cost­sensitive data with missing values.Pattern Recognition,2016,51:268-280.
[9]  Yuan D R,Huang X M,Duan Q L,et al.A strategy of constructing heterogeneous cost­sensitive decision tree.Advanced Materials Research,2013,756:3414-3418.
[10]  张燕平,邹慧锦,赵 姝.基于CCA的代价敏感三支决策模型.南京大学学报(自然科学),2015,51(2):447-452.(Zhang Y P,Zou H J,Zhao S.Cost­sensitive three­way decisions model based on CCA.Journal of Nanjing University(Natural Sciences),2015,51(2):447-452.)
[11]  Zhou Z H,Chawla N V,Jin Y C,et al.Big data opportunities and challenges:Discussions from data analytics perspectives.Computational Intelligence Magazine,IEEE,2014,9(4):62-74.
[12]  蒋盛益,谢照青,余 雯.基于代价敏感的朴素贝叶斯不平衡数据分类研究.计算机研究与发展,2011,48:387-390.(Jiang S Y,Xie Z Q,Yu W.Naive bayes classification algorithm based on cost sensitive for imbalanced data distribution.Journal of Computer Research and Development,2011,48:387-390.)
[13]  Wu W Z,Zhang W X.Neighborhood operator systems and approximations.Information Sciences,2002,144(1):201-217.
[14]  Zhao H,Min F,Zhu W.Test­cost­sensitive attribute reduction based on neighborhood rough set.In:2011 IEEE International Conference on Granular Computing(GrC).Kaohsiung:IEEE,2011:802-806.
[15]  Min F,Zhu W.Attribute reduction of data with error ranges and test costs.Information Sciences,2012,211:48-67.
[16]  Zhao H,Zhu W.Optimal cost­sensitive granularization based on rough sets for variable costs.Knowledge­Based Systems,2014,65:72-82.
[17]  Yao Y Y.A partition model of granular computing.Transactions on Rough Sets I.Heidelberg,Berlin:Springer,2004:232-253.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!