Journal of Nanjing University (Natural Sciences) ›› 2011, Vol. 47 ›› Issue (5): 544–550.


 Association text classification of mining ItemSet significance*

 Cai Jin-Feng, Bai Qing-Yuan**

  • Online: 2015-04-30 Published: 2015-04-30
  • About author: (College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350108, China)
  • Supported by:
     National Natural Science Foundation of China (61070062)

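Summary: Associative classification algorithms consider only whether a feature word is present when building the classifier and ignore the weights of text features. To address this problem, the paper proposes ISARC (ItemSet Significance-based ARC),
an algorithm built on the association-rule-based text classification method ARC-BC that improves the accuracy of associative text classification. The algorithm uses feature-item weights to define the significance of a k-itemset, generates association rules
by mining significant itemsets, and takes into account the effect of lift on the document to be classified. Experimental results show that ISARC, by mining significant itemsets, improves the accuracy of associative text classification.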

Abstract: Text classification is an important foundation of information retrieval and text mining; its main task is to assign categories to documents from a given category set. Text classification is widely applied in
natural language processing and understanding, information organization and management, information filtering, and other areas. At present, text classification methods fall into three main groups: statistical methods,
connectionist methods, and rule-based methods. The basic idea of the traditional association text classification algorithm, the associative rule-based classifier by category (ARC-BC), is to use the association rule mining
algorithm Apriori to generate frequent itemsets, that is, feature items or itemsets that appear frequently, then to use these frequent itemsets as rule antecedents and the categories as rule consequents to form a rule set,
and finally to build a classifier from these rules. When a test sample is classified, the confidence of every rule whose antecedent matches the sample is added to the cumulative confidence counter of that rule's class,
and the sample is assigned to the category whose counter is largest. However, the ARC-BC algorithm has two main drawbacks: (1) When constructing the classifier, it considers only the
presence of feature words and ignores the weights of text features, so the frequent itemsets mined and the association rules generated may harm the classification results; (2) In the class prediction stage, it places too much emphasis
on rule confidence. The mining process produces rules that have the same antecedent but different consequents, and if prediction considers only rule confidence, without
considering the correlation between a rule's antecedent and consequent, classification accuracy also suffers. To solve these two problems, this paper proposes a new algorithm, the itemset significance-based association rule-based
categorizer (ISARC), on the basis of ARC-BC. ISARC uses feature weights to define the significance of a k-itemset and mines
significant itemsets to generate association rules. The main idea of ISARC is as follows. First, the text is preprocessed: each document corresponds to a transaction, and each unique keyword is assigned an item. Second, the
features are weighted by term frequency-inverse document frequency (TF-IDF), the weight of each transaction, WT, is calculated according to the defined formula, and the 1-significant itemsets are obtained as L1. The function Ucne-CSigitem
generates candidate itemsets and the function weightsig calculates the significance of each candidate; candidates that meet the minimum threshold are kept as significant itemsets. Then, in the classification stage, the algorithm considers the effect of lift
on the classification of the text. Finally, the experimental results show that ISARC, by mining significant itemsets, can improve the accuracy of association classification.
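
The abstract does not give the formulas for the transaction weight WT, for k-itemset significance, or for how lift enters the prediction. The sketch below is therefore only one plausible reading, not the authors' implementation. It assumes that WT is the mean TF-IDF weight of a document's terms, that significance is the share of total transaction weight held by the documents containing the itemset, and that each matching rule adds confidence times lift to its class score. All function names, the toy corpus, and the threshold value are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy corpus: each document is a transaction mapping term -> TF-IDF weight.
DOCS = [
    {"goal": 0.8, "match": 0.6},
    {"goal": 0.4, "team": 0.2},
    {"market": 0.9, "goal": 0.1},
    {"market": 0.5, "stock": 0.5},
]
LABELS = ["sports", "sports", "finance", "finance"]


def transaction_weight(doc):
    """ASSUMPTION: WT is the mean TF-IDF weight of the terms in the document."""
    return sum(doc.values()) / len(doc)


def significance(itemset, docs):
    """ASSUMPTION: weight of the documents containing the itemset / total weight."""
    total = sum(transaction_weight(d) for d in docs)
    covered = sum(transaction_weight(d) for d in docs if itemset.issubset(d))
    return covered / total


def confidence_and_lift(itemset, label, docs, labels):
    """Standard definitions: conf = P(label | itemset), lift = conf / P(label)."""
    n_x = sum(1 for d in docs if itemset.issubset(d))
    if n_x == 0:
        return 0.0, 0.0
    n_xc = sum(1 for d, y in zip(docs, labels) if y == label and itemset.issubset(d))
    confidence = n_xc / n_x
    prior = labels.count(label) / len(labels)
    return confidence, confidence / prior


def build_rules(candidates, docs, labels, min_sig):
    """Keep candidates whose significance reaches min_sig and emit one rule per class."""
    rules = []
    for itemset in candidates:
        if significance(itemset, docs) < min_sig:
            continue
        for label in sorted(set(labels)):
            conf, lift = confidence_and_lift(itemset, label, docs, labels)
            if conf > 0:
                rules.append((itemset, label, conf, lift))
    return rules


def classify(terms, rules):
    """ASSUMPTION: each matching rule adds confidence * lift to its class score."""
    scores = defaultdict(float)
    for itemset, label, conf, lift in rules:
        if itemset.issubset(terms):
            scores[label] += conf * lift
    return max(scores, key=scores.get) if scores else None


CANDIDATES = [frozenset({"goal"}), frozenset({"market"}), frozenset({"team"})]
RULES = build_rules(CANDIDATES, DOCS, LABELS, min_sig=0.4)
```

In this toy corpus the itemset {team} appears only in a low-weight document, so it falls below the assumed threshold and produces no rule. The antecedent {goal} yields rules for both classes, which is the same-antecedent, different-consequent case that the abstract says confidence alone handles poorly. Weighting by lift reduces the contribution of the rule whose antecedent is negatively correlated with its class.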




