Journal of Nanjing University (Natural Sciences) ›› 2015, Vol. 51 ›› Issue (4): 707–713.


Research on automatic Chinese-English term extraction based on order and position feature of words

Zhang Li 1*, Liu Yuxian 2

  • Online: 2015-07-08 Published: 2015-07-08
  • About author: (1. Department of Computer Science and Technology, Nanjing University, Nanjing, 210093, China; 2. Institute of International Relations, Renmin University of China, Beijing, 100872, China)
  • Supported by:
    National Social Science Fund of China (11CYY031)

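Abstract (from the Chinese): Bilingual term extraction and alignment play an important role in research areas such as cross-language retrieval, bilingual dictionary construction, and machine translation. This paper proposes an automatic alignment algorithm for Chinese-English term pairs based on word order and position features. The algorithm improves the term alignment stage of the two-step strategy for bilingual term extraction by integrating the word order and position features used in phrase-based machine translation into the term alignment algorithm. Compared with the baseline method, the new method significantly improves the precision of term alignment, with the largest gains when the term translation probability is low, while avoiding the low computational efficiency of phrase-based machine translation.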

Abstract: With the explosion of information in today's society, knowledge is spread across many fields and many languages. This makes it hard for people to understand, retrieve, and exchange ideas. Bilingual terminology is an important language resource for natural language processing tasks such as machine translation, data mining, and bilingual information retrieval. Collecting bilingual terminology is often challenging and time-consuming because the texts to be aligned are in different languages, such as Chinese and English, which in many cases differ significantly. Bilingual terminology extraction and alignment has therefore attracted growing attention in information processing, and it plays an important role in cross-language retrieval, bilingual dictionary construction, and machine translation research. Progress in bilingual terminology extraction and alignment also benefits the construction of translation memories for machine-assisted translation, and adding bilingual terminology information can improve the quality of machine translation output. We propose an automatic Chinese-English terminology alignment algorithm based on word order and position features. The algorithm improves the term alignment stage of the two-step strategy for bilingual term extraction by integrating the word order and position features used in phrase-based machine translation. The experimental corpus consists of journals indexed in CSSCI from 1998 to 2012, mainly the Chinese and English titles and abstracts of papers. In our experiment, 37,206 complete English titles and abstracts are used, with the corpus totaling 1.63 million words in Chinese and 1.91 million words in English. The algorithm improves the precision of term alignment, especially when the term translation probability is low, while avoiding the computational inefficiency of phrase-based machine translation.
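The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea, not the authors' method: a candidate term pair is scored by combining a lexical translation probability with a position feature that rewards terms occurring at similar relative positions in the two sentences. The function names, the toy probability table, the weight `lam`, and the linear relative-position penalty are all assumptions made for illustration.

```python
import math

def relative_position(start, length, sent_len):
    """Centre of a term span as a fraction of the sentence length."""
    return (start + length / 2.0) / sent_len

def position_score(zh_span, zh_len, en_span, en_len):
    """Assumed position feature: close to 1.0 when both terms sit at the
    same relative position, decreasing linearly as they drift apart."""
    zh_pos = relative_position(zh_span[0], zh_span[1], zh_len)
    en_pos = relative_position(en_span[0], en_span[1], en_len)
    return 1.0 - abs(zh_pos - en_pos)

def pair_score(trans_prob, pos_score, lam=0.5, floor=1e-6):
    """Log-linear mix of translation probability and position feature.
    `lam` is a hypothetical interpolation weight; `floor` avoids log(0)."""
    return ((1.0 - lam) * math.log(max(trans_prob, floor))
            + lam * math.log(max(pos_score, floor)))

def align_terms(zh_terms, en_terms, zh_len, en_len, trans_table, lam=0.5):
    """For each Chinese term, pick the highest-scoring English candidate.

    Terms are (text, start_token, length_in_tokens) tuples and
    trans_table maps (zh_text, en_text) to a translation probability.
    """
    alignment = {}
    for zh_text, zh_start, zh_n in zh_terms:
        best_text, best_score = None, float("-inf")
        for en_text, en_start, en_n in en_terms:
            prob = trans_table.get((zh_text, en_text), 0.0)
            pos = position_score((zh_start, zh_n), zh_len,
                                 (en_start, en_n), en_len)
            score = pair_score(prob, pos, lam)
            if score > best_score:
                best_text, best_score = en_text, score
        alignment[zh_text] = best_text
    return alignment

# Toy data (invented): a 10-token Chinese sentence and a 12-token English one.
zh_terms = [("机器翻译", 0, 1), ("术语对齐", 7, 1)]
en_terms = [("machine translation", 0, 2), ("term alignment", 8, 2)]
trans_table = {
    ("机器翻译", "machine translation"): 0.6,
    # Equally low probabilities: translation evidence alone cannot decide.
    ("术语对齐", "term alignment"): 0.01,
    ("术语对齐", "machine translation"): 0.01,
}
alignment = align_terms(zh_terms, en_terms, 10, 12, trans_table)
```

In this toy case the two candidates for "术语对齐" have the same low translation probability, so the position feature alone breaks the tie. That is the low-probability situation in which the abstract reports the largest gains.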
