Journal of Nanjing University (Natural Science) ›› 2018, Vol. 54 ›› Issue (3): 604–611.


An automatic keyword extraction method for papers based on Scopus retrieval and TFIDF

Chen Lielei, Fang Hui*

  • Publication date: 2018-05-23 Release date: 2018-05-23
  • About the authors: School of Electronic Science and Engineering, Nanjing University, Nanjing, 210023

Keyphrases automatic extraction from the abstracts of English scientific papers based on Scopus retrieval

Chen Lielei, Fang Hui*   

  • Online: 2018-05-23 Published: 2018-05-23
  • About author: School of Electronic Science and Engineering, Nanjing University, Nanjing, 210023

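Abstract (translated from the Chinese): Objective and accurate keywords help electronic databases classify scientific literature and help researchers narrow the scope of a literature search. This paper proposes a method based on TFIDF and Scopus database retrieval for automatically extracting keywords from English scientific papers. All documents contained in the Scopus database serve as the corpus, and the Scopus API is used to search the database automatically. Compared with the traditional approach of manually building and annotating a corpus, this method is more convenient and has richer data available. Exploiting the fact that abstracts carry little redundant information, the method extracts keywords from the abstract in combination with statistical features of the full text. It identifies and establishes structural feature words of abstracts, introduces and weights a position feature for phrases derived from statistics, and extends two classes of stop-word lists for filtering out noise words. Experimental results show that the method performs well.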

Abstract: Automatic keyphrase extraction has gradually been adopted for scientific publications. Objective and accurate keyphrases are used to cluster documents in databases, and suitable keyphrases help researchers find relevant papers. This paper proposes a method based on TFIDF and Scopus database retrieval to extract keyphrases automatically from the abstracts of English scientific papers. The method treats all documents indexed in the Scopus database as the corpus and uses the Scopus API to retrieve candidates from the database automatically. Compared with traditional approaches that rely on a manually established and annotated corpus, the method is more convenient and has richer data available. Taking advantage of the fact that abstracts contain less redundant information, keyphrases are extracted from the abstracts based on the statistical features of the full text. We constructed structural feature words for abstracts and introduced a position feature for candidates. Moreover, two types of stop-word lists were developed to exclude noise candidates and improve performance. The experimental results show that the method performs well.
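The pipeline described in the abstract (candidate phrases from the abstract, TFIDF weighting against a large external corpus, a position feature, and stop-word filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word set, the `first_sentence_boost` value, the function names, and the `doc_freq` callback are all assumptions made for illustration. In particular, `doc_freq` stands in for whatever document count a Scopus search would return for a phrase; no real API call is made here.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stop-word list; the paper's two extended lists are not reproduced here.
STOP_WORDS: Set[str] = {
    "a", "an", "the", "of", "in", "on", "for", "and", "to",
    "is", "are", "we", "this", "that", "with", "by", "from",
}


def candidate_phrases(text: str, max_len: int = 3) -> List[Tuple[str, int]]:
    """Return (phrase, sentence_index) pairs for word n-grams of length
    1..max_len that neither start nor end with a stop word."""
    out: List[Tuple[str, int]] = []
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    for idx, sentence in enumerate(sentences):
        words = re.findall(r"[a-z][a-z\-]*", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                out.append((" ".join(gram), idx))
    return out


def score_keyphrases(
    abstract: str,
    doc_freq: Callable[[str], int],   # assumed: corpus documents containing the phrase
    corpus_size: int,                 # assumed: total documents in the corpus
    first_sentence_boost: float = 1.5,  # illustrative position weight, not from the paper
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank candidates by term frequency * inverse document frequency * position weight."""
    cands = candidate_phrases(abstract)
    tf = Counter(phrase for phrase, _ in cands)

    # Record the sentence in which each phrase first appears.
    first_pos: Dict[str, int] = {}
    for phrase, idx in cands:
        first_pos.setdefault(phrase, idx)

    scores: Dict[str, float] = {}
    for phrase, count in tf.items():
        idf = math.log(corpus_size / (1 + doc_freq(phrase)))
        weight = first_sentence_boost if first_pos[phrase] == 0 else 1.0
        scores[phrase] = count * idf * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

To try it, pass any function that maps a phrase to a document count, for example `lambda p: counts.get(p, 1000)` over a small dictionary. Injecting the lookup keeps the scoring logic independent of how the counts are obtained, so a database-backed lookup could be substituted without changing the ranking code.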
