Journal of Nanjing University (Natural Science) ›› 2017, Vol. 53 ›› Issue (4): 815–.


 Categorical data classification approach based on correlation analysis

 Fu Kang’an1, Guo Husheng1, Wang Wenjian1,2*

  • Online: 2017-08-03 Published: 2017-08-03
  • Author affiliations: 1. School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, China;
    2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, 030006, China
  • Supported by:
    National Natural Science Foundation of China (61673249, 61503229, 61273291), Research Project for Returned Overseas Scholars of Shanxi Province (2016-004), Natural Science Foundation for Young Scientists of Shanxi Province (2015021096), Scientific and Technological Innovation Project of Higher Education Institutions in Shanxi Province (2015110)
    Received: 2017-06-08
    *Corresponding author, E-mail: wjwang@sxu.edu.cn

Abstract:  Categorical data lack inherent geometric properties, so existing classification algorithms for numerical data cannot be applied to them directly. To improve classification performance on categorical data, we propose a support vector machine classification approach based on correlation analysis (CA_SVM). The correlation between attribute values and labels is analyzed to obtain the influence factor of each attribute value on the label. This factor is then combined with the frequency with which the attribute value occurs within a class, so that every attribute value in the original categorical data is converted into numerical data without loss of information. The converted data both reflect the correlation between attribute values and labels and effectively express the distance between values of the same attribute. Finally, a support vector machine (SVM) performs the classification. Experimental results on standard UCI data sets show that the CA_SVM model improves classification accuracy.
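The abstract gives the pipeline but not the formulas, so the following is only a minimal sketch of the conversion step under stated assumptions. The score used here, P(class | value) × P(value | class), is a hypothetical stand-in for the paper's influence factor and within-class frequency, not the authors' definition. The function names `fit_value_codes` and `transform` and the toy data are likewise illustrative.

```python
from collections import Counter

def fit_value_codes(rows, labels):
    """Learn a numeric code for every (attribute, value) pair.

    ASSUMPTION: the per-class score below is a stand-in, not the paper's formula:
        score = P(class | value) * P(value | class)
    """
    classes = sorted(set(labels))
    class_count = Counter(labels)
    n_attrs = len(rows[0])
    joint = [Counter() for _ in range(n_attrs)]        # counts of (value, class)
    value_count = [Counter() for _ in range(n_attrs)]  # counts of value
    for row, cls in zip(rows, labels):
        for j, value in enumerate(row):
            joint[j][(value, cls)] += 1
            value_count[j][value] += 1
    codes = [dict() for _ in range(n_attrs)]
    for j in range(n_attrs):
        for value, n_value in value_count[j].items():
            codes[j][value] = [
                (joint[j][(value, cls)] / n_value)             # P(class | value)
                * (joint[j][(value, cls)] / class_count[cls])  # P(value | class)
                for cls in classes
            ]
    return classes, codes

def transform(rows, classes, codes):
    """Replace each categorical value by its learned per-class scores."""
    encoded = []
    for row in rows:
        vec = []
        for j, value in enumerate(row):
            # Values never seen during fitting get an all-zero code.
            vec.extend(codes[j].get(value, [0.0] * len(classes)))
        encoded.append(vec)
    return encoded

# Toy data (made up for illustration): two categorical attributes, two classes.
rows = [("red", "round"), ("red", "long"), ("green", "round"), ("green", "long")]
labels = ["apple", "apple", "apple", "banana"]
classes, codes = fit_value_codes(rows, labels)
X = transform(rows, classes, codes)
```

Each sample becomes a numeric vector with one score per class for every attribute, so values of the same attribute can be compared by ordinary distance. In a full pipeline these vectors would be passed to an SVM implementation, with the codes fitted on training data only.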
