Journal of Nanjing University (Natural Sciences) ›› 2017, Vol. 53 ›› Issue (2): 357–.


Construct the Vietnamese phrase Treebank by fusion of Vietnamese grammatical features and improved PCFG

 Li Ying1,2, Guo Jianyi1,2*, Yu Zhengtao1,2*, Xian Yantuan1,2, Chen Wei1,2

  • Online: 2017-03-27 Published: 2017-03-27
  • About author: 1. The School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China; 2. The Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming, 650500, China
  • Supported by:
    Foundation item: National Natural Science Foundation of China (61262041, 61363044, 61472168)
    Received: 2016-10-27
    *Corresponding author, E-mail: gjade86@hotmail.com

Abstract:  A phrase treebank is an important resource for natural language processing (NLP), both for linguistic research in general and for the development of practical applications. Vietnamese still lacks this kind of resource, which hinders Chinese-Vietnamese bilingual information processing. This paper presents a method for constructing a Vietnamese phrase treebank that fuses Vietnamese grammatical features with an improved probabilistic context-free grammar (PCFG) model. The method automatically derives the phrase structure trees of Vietnamese sentences and thereby addresses the problem of building a Vietnamese phrase treebank automatically. First, a Vietnamese grammatical feature set is established by analyzing the linguistic features of Vietnamese. Then, the grammar rule set of the PCFG model is obtained with the Inside-Outside algorithm from a small number of manually annotated Vietnamese phrase trees. At the same time, the traditional PCFG model is improved by adding more contextual semantic information, namely the pre co-occurrence probability and the post co-occurrence probability. Finally, the grammatical feature set is fused into the improved PCFG model as a supplement to the grammar rule set, and the resulting model is used to complete the construction of the Vietnamese phrase treebank. The improved PCFG model obtains good results for Vietnamese syntactic analysis: it improves accuracy and also reduces parsing time. Automatic syntactic analysis of Vietnamese in turn supports the construction of the treebank. Experimental results show that the accuracy of the proposed PCFG model for Vietnamese phrase treebank construction reaches 81.14%, which is 2% to 3% higher than that of the conventional PCFG model and of the maximum entropy-based treebank construction method.
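As a rough illustration of what a PCFG rule set looks like, the sketch below reads rule probabilities off a handful of bracketed trees and scores one tree with them. It is not the paper's method: the abstract states that the rules are obtained with the Inside-Outside algorithm and that the model is extended with pre and post co-occurrence probabilities, whose definitions are not given here. This sketch instead assumes plain relative-frequency estimation. The three toy trees, the category labels, and the function names `rules`, `estimate_pcfg`, and `tree_probability` are all hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Toy bracketed trees (hypothetical, for illustration only).
# A tree is (label, child, ...); a child is either a subtree or a word string.
TREES = [
    ("S", ("NP", ("N", "tôi")), ("VP", ("V", "đọc"), ("NP", ("N", "sách")))),
    ("S", ("NP", ("N", "tôi")), ("VP", ("V", "ăn"), ("NP", ("N", "cơm")))),
    ("S", ("NP", ("N", "tôi")), ("VP", ("V", "ngủ"))),
]

def rules(tree):
    """Yield (lhs, rhs) for every production used in the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def estimate_pcfg(trees):
    """Relative-frequency estimate: P(A -> beta) = count(A -> beta) / count(A).

    Assumption: this replaces the Inside-Outside estimation named in the
    abstract, which is needed when trees are only partially observed.
    """
    rule_counts = Counter(r for t in trees for r in rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {r: Fraction(n, lhs_counts[r[0]]) for r, n in rule_counts.items()}

def tree_probability(tree, pcfg):
    """Probability of a tree is the product of its rule probabilities."""
    p = Fraction(1)
    for r in rules(tree):
        p *= pcfg.get(r, Fraction(0))  # unseen rules get probability 0
    return p

pcfg = estimate_pcfg(TREES)
print(pcfg[("VP", ("V", "NP"))])          # probability of VP -> V NP
print(tree_probability(TREES[0], pcfg))   # probability of the first toy tree
```

Exact fractions are used so that the probabilities for each left-hand side sum to exactly 1, which makes the normalization easy to check by hand. A real system would use smoothed floating-point or log probabilities.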
