Journal of Nanjing University (Natural Sciences) ›› 2016, Vol. 52 ›› Issue (2): 353.
韩彦昭1,乔亚男1*,范亚平1,李孟超2,万迪昉3
Han Yanzhao1,Qiao Yanan1*,Fan Yaping2,Li Mengchao2,Wan Difang3
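Abstract: To address the characteristics of microblog data, this paper applies a denoising algorithm and a conditional random field (CRF) model to part-of-speech (POS) tagging of microblog text, and uses a Bayesian method to perform a second-pass POS correction on homophone words, which make up a large share of the data. First, raw microblog data are collected through the Sina platform API and a web crawler, and are then denoised with rules written by hand according to the characteristics of the noise. Because CRFs have an advantage in feature extraction for Chinese POS tagging, a CRF model is used to tag the denoised microblog corpus. On this basis, exploiting the high proportion of homophone words in the corpus, microblog words are converted to pinyin, candidate original words for each homophone are computed with a Bayesian method, a mapping between the homophone and its original word is established from the word's context, and the known POS of the original word is used to correct the POS of the homophone. Experimental results show that the method tags out-of-vocabulary microblog words well, reaching a POS tagging accuracy of 95.23%.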
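The homophone correction step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the frequency counts, the pinyin lookup table, the function names, and the 0.5 threshold are hypothetical and do not come from the paper. The sketch also simplifies the Bayesian step by treating P(pinyin | word) as 1 for an exact pinyin match, so the posterior reduces to a normalized word frequency, and it omits the context-based mapping entirely.

```python
from collections import defaultdict

# Hypothetical toy lexicon of known words:
# word -> (toneless pinyin, POS tag, corpus frequency). Counts are made up.
LEXICON = {
    "和谐": ("he xie", "a", 600),
    "压力": ("ya li", "n", 500),
    "微博": ("wei bo", "n", 800),
    "微波": ("wei bo", "n", 200),
}

# Hypothetical pinyin lookup for homophone tokens seen in microblogs.
# A real system would use a pinyin converter instead of a fixed table.
TOKEN_PINYIN = {"河蟹": "he xie", "鸭梨": "ya li", "围脖": "wei bo"}


def build_index(lexicon):
    """Group known words by their pinyin string."""
    index = defaultdict(list)
    for word, (pinyin, pos, freq) in lexicon.items():
        index[pinyin].append((word, pos, freq))
    return index


def candidates(pinyin, index):
    """Rank original-word candidates for a pinyin string.

    Bayes' rule gives P(word | pinyin) proportional to
    P(pinyin | word) * P(word). With P(pinyin | word) = 1 for an exact
    match, the posterior is the word frequency normalized over all
    words sharing that pinyin.
    """
    entries = index.get(pinyin, [])
    total = sum(freq for _, _, freq in entries)
    ranked = [(word, pos, freq / total) for word, pos, freq in entries]
    return sorted(ranked, key=lambda c: c[2], reverse=True)


def correct_tag(token, first_pass_tag, index, threshold=0.5):
    """Replace the first-pass tag with the tag of the best original word.

    The first-pass tag is kept when the token has no known pinyin, no
    candidate exists, or the best posterior falls below the threshold.
    """
    pinyin = TOKEN_PINYIN.get(token)
    if pinyin is None:
        return first_pass_tag
    ranked = candidates(pinyin, index)
    if not ranked or ranked[0][2] < threshold:
        return first_pass_tag
    return ranked[0][1]


index = build_index(LEXICON)
corrected = correct_tag("河蟹", "n", index)  # first-pass tag "n" is assumed
```

Here the token 河蟹 shares its pinyin with 和谐, so the assumed first-pass noun tag is replaced by the adjective tag of the known word. A fuller version would weigh candidates by surrounding context, as the abstract describes, rather than by frequency alone.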