Journal of Nanjing University (Natural Science) ›› 2018, Vol. 54 ›› Issue (4): 794–.


Chinese text classification algorithm based on three-way decisions

Jin Yilin1,2*, Hu Feng1,2

  • Published: 2018-04-30
  • Author affiliations: 1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China; 2. Chongqing Key Laboratory of Computational Intelligence, Chongqing, 400065, China
  • Funding: National Natural Science Foundation of China (61309014, 61379114, 61472056); Humanities and Social Sciences Planning Fund of the Ministry of Education (15XJA630003)
  • Received: 2018-05-07
  • *Corresponding author, E-mail: 1440554297@qq.com

Abstract: As information technology develops, ever more information becomes available, and obtaining the most valuable information quickly and effectively from this mass has become a focus of attention. Chinese text classification, a branch of natural language processing, helps users find the information they need by assigning documents to known topic categories. With the rise of machine learning and artificial intelligence, this tedious process is increasingly delegated to machines in order to improve classification accuracy and retrieval speed. Chinese text classification still has significant shortcomings, however, largely because traditional feature selection algorithms are limited: the extracted features are high-dimensional and sparse and do not represent the texts of a category effectively or comprehensively, which leads to unsatisfactory classification results. Since the three-way decisions framework was introduced, a growing number of researchers have applied it in various domains. The framework reduces the loss caused by wrong decisions by extending traditional two-way decisions to three-way decisions. Based on this framework, this paper improves traditional feature selection in order to extract category features efficiently and proposes a three-way decision feature selection algorithm. To reduce the number of feature words and make them more representative of their categories, the algorithm combines unsupervised and supervised feature extraction algorithms and uses a double evaluation function to assess features more comprehensively.

Finally, experiments on the Fudan University corpus show that the proposed three-way decision feature selection algorithm has certain advantages over traditional feature selection algorithms and that, compared with feature selection algorithms proposed in the last three years, it has a clear advantage in low-dimensional feature spaces. In summary, the proposed algorithm not only reduces the number of feature words and compresses the feature space but also improves the accuracy of Chinese text classification.
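The abstract does not give the evaluation functions or thresholds, so the following is only a minimal sketch of how a three-way decision over feature words might look, not the authors' algorithm. It assumes a chi-square statistic as the supervised score and document frequency as the unsupervised score. The thresholds `alpha`, `beta`, and `gamma` and all function names are hypothetical. Terms with a high supervised score are accepted, terms with a low score are rejected, and the remaining boundary terms are deferred to the second evaluation function.

```python
from collections import Counter


def document_frequency(docs):
    """Unsupervised score (assumed): fraction of documents containing each term."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: count / len(docs) for term, count in df.items()}


def chi_square(docs, labels):
    """Supervised score (assumed): max chi-square statistic over classes."""
    n = len(docs)
    doc_sets = [set(tokens) for tokens in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        best = 0.0
        for cls in set(labels):
            pairs = list(zip(doc_sets, labels))
            a = sum(1 for d, y in pairs if term in d and y == cls)
            b = sum(1 for d, y in pairs if term in d and y != cls)
            c = sum(1 for d, y in pairs if term not in d and y == cls)
            d_ = n - a - b - c
            denom = (a + c) * (b + d_) * (a + b) * (c + d_)
            if denom:
                best = max(best, n * (a * d_ - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def normalize(scores):
    """Scale scores into [0, 1] by dividing by the largest score."""
    top = max(scores.values(), default=0.0)
    return {t: (s / top if top else 0.0) for t, s in scores.items()}


def three_way_partition(scores, alpha, beta):
    """Split terms into accept (>= alpha), reject (<= beta), and boundary."""
    accept = {t for t, s in scores.items() if s >= alpha}
    reject = {t for t, s in scores.items() if s <= beta}
    boundary = set(scores) - accept - reject
    return accept, boundary, reject


def select_features(docs, labels, alpha=0.7, beta=0.2, gamma=0.5):
    """Hypothetical two-stage selection: chi-square first, DF for deferred terms."""
    accept, boundary, _ = three_way_partition(
        normalize(chi_square(docs, labels)), alpha, beta
    )
    df = document_frequency(docs)
    # Second evaluation function resolves the deferred (boundary) terms.
    accept |= {t for t in boundary if df[t] >= gamma}
    return sorted(accept)


# Toy, pre-segmented corpus (illustrative data only).
docs = [
    ["football", "match", "win"],
    ["football", "team", "win"],
    ["stock", "market", "win"],
    ["stock", "bank", "team"],
]
labels = ["sports", "sports", "finance", "finance"]
print(select_features(docs, labels))  # ['football', 'stock', 'win']
```

On this toy corpus, `football` and `stock` are accepted outright, `team` is rejected, and the four boundary terms are re-scored by document frequency, which keeps only `win`. The order of the two scores and the use of a deferred region are design assumptions made for illustration. A real pipeline for Chinese text would also need a word segmentation step before feature scoring.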
