Part-of-Speech Recognition of New Microblog Words Based on a Conditional Random Field Model and Text Error Correction

Journal of Nanjing University (Natural Sciences), 2016, Vol. 52, Issue 2: 353–.

Han Yanzhao¹, Qiao Yanan¹*, Fan Yaping¹, Li Mengchao², Wan Difang³

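Publication date: 2016-03-27  Release date: 2016-03-27
Author affiliations: 1. School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049; 2. School of Software, Xi'an Jiaotong University, Xi'an, 710049; 3. School of Management, Xi'an Jiaotong University, Xi'an, 710049
Funding: National Natural Science Foundation of China (61202181), Postdoctoral Science Foundation (2012M512006), Fundamental Research Funds for the Central Universities (XJJ2013097)
Received: 2015-10-10
*Corresponding author. Email: qiaoyanan@mail.xjtu.edu.cn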

Part-of-speech tagging of microblog unknown words based on conditional random fields and error correction

Han Yanzhao¹, Qiao Yanan¹*, Fan Yaping², Li Mengchao², Wan Difang³

Online: 2016-03-27  Published: 2016-03-27
Author affiliations: 1. School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, China; 2. School of Software, Xi'an Jiaotong University, Xi'an, 710049, China; 3. School of Management, Xi'an Jiaotong University, Xi'an, 710049, China

Abstract

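Abstract (translated from Chinese): Given the characteristics of microblog data, a denoising algorithm and a conditional random field (CRF) model are used to tag the parts of speech of microblog text, and a Bayesian method is used to correct, in a second pass, the POS tags of homophonic words, which make up a large share of the data. First, raw microblog data are collected through the Sina platform API and a crawler, and noise is removed using rules written by hand according to the characteristics of the noise. Because CRFs have advantages in feature extraction for Chinese POS tagging, a CRF model is used to tag the denoised microblog corpus. On this basis, exploiting the large proportion of homophonic words in the microblog corpus, microblog words are converted to pinyin, and candidate original words for each homophonic word are computed with a Bayesian method. A mapping between homophonic words and original words is then built from each word's context, and because the POS of the original word is known, it is used to correct the POS of the homophonic word. Experimental results show that the method tags microblog out-of-vocabulary words well, reaching a POS tagging accuracy of 95.23%.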

Abstract: This work addresses part-of-speech (POS) tagging for microblog text. POS tagging of Chinese new words is a difficult, important and widely studied sequence modeling problem. This paper describes a hybrid model that combines a rule-based model with linear-chain conditional random fields (CRFs) and a Bayes algorithm for POS tagging of microblog unknown words. First, microblog data are obtained using the Sina API and a web spider. Based on the characteristics of microblog text, a rule-based method is presented to reduce the impact of noisy data on POS tagging. Then, because CRFs have an advantage in feature extraction for POS tagging, they are used to obtain initial POS tags for microblog new words. We also present a probabilistic POS tagging method that further improves performance. Homophonic words account for a large proportion of microblog new words. Because homophonic words and their original words are pronounced similarly or identically, the Chinese Phonetic Alphabet (pinyin) is used to build a bridge between them, and this bridge is built with the Naive Bayes algorithm. Evaluation on a microblog test set shows that the system tags microblog new words well, with a best precision of 95.23%.
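The homophone correction step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the features, smoothing, or probabilities used, so everything below is an assumption made for illustration. The toy resources `PINYIN`, `LEXICON_POS` and `CORPUS`, the function names, the context window of two words, and the add-one smoothing are all hypothetical. The sketch only accepts candidates with identical toneless pinyin, whereas the paper also covers similar pronunciations.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy resources (not from the paper). A real system would use a
# full pronunciation dictionary, a POS lexicon, and a large standard corpus.
PINYIN = {"杯具": "bei ju", "悲剧": "bei ju", "被拒": "bei ju",
          "鸭梨": "ya li", "压力": "ya li"}
LEXICON_POS = {"悲剧": "n", "被拒": "v", "压力": "n"}  # known words and tags
CORPUS = [  # segmented sentences written with the original (standard) words
    ["这", "真", "是", "一", "个", "悲剧"],
    ["人生", "是", "一", "场", "悲剧"],
    ["申请", "又", "被拒", "了"],
    ["工作", "压力", "很", "大"],
]


def train(corpus, window=2):
    """Count priors and context co-occurrences for every lexicon word."""
    prior, ctx, vocab = Counter(), defaultdict(Counter), set()
    for sent in corpus:
        vocab.update(sent)
        for i, word in enumerate(sent):
            if word not in LEXICON_POS:
                continue
            prior[word] += 1
            left = sent[max(0, i - window):i]
            right = sent[i + 1:i + 1 + window]
            ctx[word].update(left + right)
    return prior, ctx, vocab


def candidates(word):
    """Lexicon words whose toneless pinyin equals that of `word`."""
    py = PINYIN.get(word)
    if py is None:
        return []
    return [w for w in LEXICON_POS if PINYIN.get(w) == py]


def best_original(word, context, model):
    """Naive Bayes: argmax over candidates of log P(w) + sum log P(c | w)."""
    prior, ctx, vocab = model
    total = sum(prior.values())
    best, best_score = None, -math.inf
    for cand in candidates(word):
        # Add-one smoothing for both the prior and the context likelihood.
        score = math.log((prior[cand] + 1) / (total + len(LEXICON_POS)))
        denom = sum(ctx[cand].values()) + len(vocab) + 1
        for c in context:
            score += math.log((ctx[cand][c] + 1) / denom)
        if score > best_score:
            best, best_score = cand, score
    return best


def correct_pos(word, context, initial_tag, model):
    """Replace the initial tag with the tag of the most likely original word."""
    original = best_original(word, context, model)
    return LEXICON_POS[original] if original is not None else initial_tag


if __name__ == "__main__":
    model = train(CORPUS)
    # "杯具" (literally "cup") is a common homophone of "悲剧" ("tragedy").
    print(correct_pos("杯具", ["一", "个"], "v", model))
```

Here `initial_tag` stands in for the tag produced by the CRF stage. A word with no pinyin match in the lexicon keeps that tag unchanged, so the correction only overrides the CRF output when a known original word is found.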

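Han Yanzhao¹, Qiao Yanan¹*, Fan Yaping¹, Li Mengchao², Wan Difang³. Part-of-Speech Recognition of New Microblog Words Based on a Conditional Random Field Model and Text Error Correction[J]. Journal of Nanjing University (Natural Sciences), 2016, 52(2): 353–.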

Han Yanzhao¹, Qiao Yanan¹*, Fan Yaping², Li Mengchao², Wan Difang³. Part-of-speech tagging of microblog unknown words based on conditional random fields and error correction[J]. Journal of Nanjing University (Natural Sciences), 2016, 52(2): 353–.
