Chinese text classification algorithm based on Three-way Decisions

Jin Yilin1,2*, Hu Feng1,2

Journal of Nanjing University(Natural Sciences) ›› 2018, Vol. 54 ›› Issue (4) : 794.

Abstract

With the continuous development of information technology, the volume of available information keeps growing, and how to obtain the most valuable information quickly and effectively has become a focus of attention. As a branch of natural language processing, Chinese text classification helps users quickly find the information they need in massive collections by assigning texts to known topic categories. With the rise of machine learning and artificial intelligence, people increasingly prefer to let machines carry out this tedious process in order to improve classification accuracy and retrieval speed. At present, Chinese text classification still has significant shortcomings: the extracted features are sparse and high-dimensional and cannot represent the text of a category effectively and comprehensively, which leads to unsatisfactory classification results. Since the Three-way Decisions framework was introduced, more and more scholars have applied it in various fields. This work has shown that the framework reduces the loss caused by wrong decisions by converting the traditional two-way decision into a three-way decision. Based on this theoretical framework, this paper improves the traditional feature selection algorithm in order to extract text category features efficiently, and proposes a Three-way Decision feature selection algorithm. To reduce the number of feature words and improve their representativeness, we combine unsupervised and supervised feature extraction algorithms and use a double evaluation function to evaluate features more comprehensively. Finally, experiments on the Fudan University corpus show that the proposed Three-way Decision feature selection algorithm has certain advantages over traditional feature selection algorithms and, compared with the feature selection algorithms proposed by other scholars in the last three years, has an obvious advantage in low-dimensional feature space. In summary, the proposed algorithm not only effectively reduces the number of feature words and compresses the feature space, but also improves the accuracy of Chinese text classification.
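The abstract does not state the evaluation functions or thresholds used, so the following is only a minimal sketch of the general three-way idea applied to feature words, not the authors' algorithm. It assumes each candidate word already has a primary score and, where needed, a secondary score. The function names, the thresholds `alpha`, `beta` and `gamma`, and the sample scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def three_way_select(
    scores: Dict[str, float], alpha: float, beta: float
) -> Tuple[List[str], List[str], List[str]]:
    """Split candidate feature words into accept / defer / reject regions.

    Hypothetical thresholds: score >= alpha is accepted, score <= beta is
    rejected, and anything in between is deferred for further evaluation.
    """
    if not beta < alpha:
        raise ValueError("expected beta < alpha")
    accept: List[str] = []
    defer: List[str] = []
    reject: List[str] = []
    for word, score in scores.items():
        if score >= alpha:
            accept.append(word)
        elif score <= beta:
            reject.append(word)
        else:
            defer.append(word)
    return accept, defer, reject


def select_features(
    primary: Dict[str, float],
    secondary: Dict[str, float],
    alpha: float,
    beta: float,
    gamma: float,
) -> List[str]:
    """Keep accepted words, plus deferred words that pass a second evaluation.

    The second evaluation (secondary score >= gamma) stands in for the
    "double evaluation function" mentioned in the abstract; its actual
    form is an assumption here.
    """
    accept, defer, _ = three_way_select(primary, alpha, beta)
    rescued = [w for w in defer if secondary.get(w, 0.0) >= gamma]
    return accept + rescued


# Illustrative scores only; they are not taken from the paper.
primary_scores = {"stock": 0.9, "market": 0.55, "today": 0.5, "the": 0.1}
secondary_scores = {"market": 0.8, "today": 0.2}
selected = select_features(
    primary_scores, secondary_scores, alpha=0.7, beta=0.3, gamma=0.6
)
print(selected)
```

In this sketch the deferred region is what distinguishes a three-way decision from a two-way one: borderline words are neither kept nor discarded immediately, and only the second evaluation settles them.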

Cite this article

Jin Yilin, Hu Feng. Chinese text classification algorithm based on Three-way Decisions[J]. Journal of Nanjing University(Natural Sciences), 2018, 54(4): 794.
