Journal of Nanjing University (Natural Sciences) ›› 2016, Vol. 52 ›› Issue (2): 289.
Zhu Jie (珠杰)1,2*, Li Tianrui (李天瑞)1, Liu Shengjiu (刘胜久)1
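Abstract: Named entity recognition is an important research topic in text mining, and statistical methods achieve relatively high recognition rates on this task. This paper uses conditional random fields (CRF) to study Tibetan person name recognition, focusing on the internal structural features of Tibetan person names, contextual features, feature selection, and data preprocessing, and analyzes the effectiveness of different features through experiments. First, a person name recognition method based on characters (syllables) and character-position information is presented. Second, different combinations of features, namely trigger words, function words, a person-name dictionary, and person-denoting noun suffixes, are studied and optimized, and the contribution of individual function words to person name recognition is examined in detail. Finally, experiments on the different combinations show that: 1) trigger-word and ergative-particle features contribute positively to Tibetan person name recognition; 2) the size of the feature window has some effect on person name recognition; 3) CRF-based Tibetan person name recognition reaches an F1 score of about 80%, but because two-character (two-syllable) Tibetan names are highly ambiguous, it does not yet match the recognition performance achieved for other languages.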
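To make the syllable-plus-position idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a B/M/E/S/O position tag set, half-open name spans, and a symmetric feature window; the function names, the feature keys, and the toy Wylie-transliterated input are all hypothetical choices made for illustration. The abstract itself does not specify the tag set or the feature template.

```python
from typing import Dict, List, Set, Tuple


def position_tags(n_syllables: int, name_spans: List[Tuple[int, int]]) -> List[str]:
    """Assign a position tag to every syllable.

    Assumed tag set (not stated in the abstract): B/M/E for the first,
    middle, and last syllable of a multi-syllable name, S for a
    one-syllable name, O for everything else.
    Spans are half-open [start, end) syllable indices.
    """
    tags = ["O"] * n_syllables
    for start, end in name_spans:
        if end - start == 1:
            tags[start] = "S"
            continue
        tags[start] = "B"
        for k in range(start + 1, end - 1):
            tags[k] = "M"
        tags[end - 1] = "E"
    return tags


def window_features(
    syllables: List[str],
    i: int,
    window: int,
    triggers: Set[str],
    ergative_particles: Set[str],
) -> Dict[str, object]:
    """Collect features for syllable i from a symmetric context window.

    For each offset in [-window, window] the sketch records the syllable
    itself and whether it is a trigger word or an ergative particle.
    Positions outside the sentence are padded with "<PAD>".
    """
    feats: Dict[str, object] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        syl = syllables[j] if 0 <= j < len(syllables) else "<PAD>"
        feats[f"syl[{offset}]"] = syl
        feats[f"trigger[{offset}]"] = syl in triggers
        feats[f"ergative[{offset}]"] = syl in ergative_particles
    return feats


# Toy input (illustrative only, not from the paper): a two-syllable name
# followed by an ergative particle and a verb, in Wylie transliteration.
syllables = ["bkra", "shis", "kyis", "bris"]
tags = position_tags(len(syllables), [(0, 2)])
feats = window_features(
    syllables, i=1, window=1, triggers=set(), ergative_particles={"kyis"}
)
```

Each syllable's feature dictionary and tag would then be passed to a CRF trainer. Changing `window` is one way to probe the window-size effect that the abstract reports.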