Journal of Nanjing University (Natural Sciences) ›› 2016, Vol. 52 ›› Issue (2): 289–.


Research on Tibetan name recognition technology under CRF

Zhu Jie1,2*, Li Tianrui1, Liu Shengjiu1

  • Online: 2016-03-26  Published: 2016-03-26
  • About author: 1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, 610031, China; 2. Department of Computer Science, Tibet University, Lhasa, 850000, China
  • Supported by:
    National Natural Science Foundation of China (61262058)
    Received: 2015-09-10
    *Corresponding author, E-mail: 790139756@qq.com

Abstract: Named entity recognition is an important research topic in text mining, and statistical methods achieve relatively high recognition rates for it. This paper studies Tibetan person-name recognition using conditional random fields (CRF), focusing on the internal structure of Tibetan names, contextual features, feature selection, and data preprocessing, and evaluates the effectiveness of different features through experiments. First, a name recognition method based on characters (syllables) and their position information is presented. Second, different combinations and optimizations of features built from trigger words, function words, a name dictionary, and person-denoting noun suffixes are studied, and the contribution of individual function words to name recognition is analyzed in finer detail. Finally, experiments on the different feature combinations show that: 1) trigger-word and ergative-particle features have a positive effect on Tibetan name recognition; 2) the size of the feature window has some influence on name recognition; 3) CRF-based Tibetan name recognition reaches an F1 value of about 80%, but because two-syllable Tibetan names are highly ambiguous, it does not yet match the recognition performance achieved for other languages.
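The syllable-plus-position labelling and the windowed context features described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the B/M/E/S/O tag set, the function names, the feature names, the romanized example syllables, and the choice to test the previous syllable for a trigger word and the next syllable for an ergative particle are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple, Union


def position_tags(length: int) -> List[str]:
    """Position tags for a name of `length` syllables (assumed B/M/E/S scheme)."""
    if length == 1:
        return ["S"]
    return ["B"] + ["M"] * (length - 2) + ["E"]


def label_sequence(syllables: Sequence[str],
                   name_spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Label every syllable; `name_spans` holds (start, end) with end exclusive."""
    labels = ["O"] * len(syllables)
    for start, end in name_spans:
        labels[start:end] = position_tags(end - start)
    return labels


def window_features(syllables: Sequence[str], i: int, window: int,
                    trigger_words: Set[str],
                    ergative_particles: Set[str]) -> Dict[str, Union[str, bool]]:
    """Features for syllable i: neighbours within `window` plus two lexical cues."""
    def at(j: int) -> str:
        # Positions outside the sentence are padded.
        return syllables[j] if 0 <= j < len(syllables) else "<PAD>"

    feats: Dict[str, Union[str, bool]] = {}
    for offset in range(-window, window + 1):
        feats[f"syl[{offset}]"] = at(i + offset)
    # Hypothetical cue placement: trigger word before, ergative particle after.
    feats["prev_is_trigger"] = at(i - 1) in trigger_words
    feats["next_is_ergative"] = at(i + 1) in ergative_particles
    return feats


# Illustrative romanized syllables only; not data from the paper.
sentence = ["rgan", "bkra", "shis", "kyis", "bris"]
labels = label_sequence(sentence, [(1, 3)])
feats = window_features(sentence, 2, 1, {"rgan"}, {"kyis"})
```

Changing the `window` argument widens or narrows the context each syllable sees, which is the kind of setting the abstract reports as affecting recognition. The resulting feature dictionaries and labels could then be passed to any CRF toolkit.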
