南京大学学报(自然科学版) ›› 2023, Vol. 59 ›› Issue (4): 620–628. doi: 10.13232/j.cnki.jnju.2023.04.009


融入领域知识的跨境民族文化生成式摘要方法

赵冠博1,2,3, 张勇丙1,2,3, 毛存礼1,2,3, 高盛祥1,2,3, 王奉孝1,2,3

  1. 南亚东南亚语言语音信息处理教育部工程研究中心,昆明理工大学,昆明,650500
    2. 昆明理工大学信息工程与自动化学院,昆明,650500
    3. 云南省人工智能重点实验室,昆明理工大学,昆明,650500
  • 收稿日期:2023-05-29 出版日期:2023-07-31 发布日期:2023-08-18
  • 通讯作者: 张勇丙 E-mail:zhangyongbing419@163.com
  • 基金资助:
    国家自然科学基金(62166023);云南省重大科技专项计划(202103AA080015);云南省自然科学基金重点项目(2019FA023)

A generative summary method of cross-border ethnic culture incorporating domain knowledge

Guanbo Zhao1,2,3, Yongbing Zhang1,2,3, Cunli Mao1,2,3, Shengxiang Gao1,2,3, Fengxiao Wang1,2,3

  1. South Asia and Southeast Asia Languages Voice Information Processing Engineering Research Center under the Ministry of Education, Kunming University of Science and Technology, Kunming, 650500, China
    2. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China
    3. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, 650500, China
  • Received: 2023-05-29 Online: 2023-07-31 Published: 2023-08-18
  • Contact: Yongbing Zhang E-mail: zhangyongbing419@163.com

摘要:

从跨境民族文化文本中生成具有领域知识的摘要对进一步开展跨境民族文化文本检索、问答等任务具有重要的支撑作用,当前基于深度学习的生成式文本摘要取得了较好的效果,但直接用于跨境民族文化文本摘要任务会导致生成的摘要出现领域词汇丢失的问题.为此,提出一种融入领域知识的跨境民族文化生成式摘要方法(Domain Knowledge-Culture-Generative Summary,DKCGS),在编码端将跨境民族文化领域词典编码与原文本编码融合,以此增强模型对领域词汇的表征能力;在解码端,基于指针生成网络将具有同义或跨境关系的领域词汇分布与原文本分布结合,提高模型生成文化领域词汇的准确率.同时,在通用领域文本上进行预训练并进一步初始化参数,以缓解数据稀缺导致模型训练效果不佳的问题.实验结果表明,提出的方法在跨境民族文本摘要数据集上比基线模型的Rouge-1提升了0.95,有效提升了跨境民族文化文本摘要生成的质量.

关键词: 跨境民族文化, 领域知识, 指针生成网络, 预训练, 文本摘要

Abstract:

Generating summaries with domain knowledge from cross-border ethnic culture texts plays an important supporting role in downstream tasks such as cross-border ethnic culture text retrieval and question answering. Generative text summarization based on deep learning has shown promising results, but applying it directly to cross-border ethnic culture texts tends to omit domain-specific words from the generated summary. Therefore, a Domain Knowledge-Culture-Generative Summary (DKCGS) method for cross-border ethnic culture summarization is proposed. At the encoder, the encoding of a cross-border ethnic culture domain dictionary is fused with the encoding of the source text to strengthen the model's representation of domain words. At the decoder, a pointer-generator network combines the distribution over domain words with synonymous or cross-border relations and the distribution over the source text, improving the accuracy with which the model generates cultural domain words. In addition, the model is pre-trained on general-domain text and the resulting parameters are used for initialization, alleviating the poor training caused by data scarcity. Experimental results show that the proposed method raises Rouge-1 by 0.95 over the baseline model on the cross-border ethnic text summarization dataset, effectively improving the quality of generated cross-border ethnic culture summaries.
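The decoder-side fusion described above can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration of one pointer-generator step extended with a domain-dictionary distribution; the tensor names and the two soft switches p_gen and p_dict are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def final_distribution(vocab_logits, attn_weights, src_ids,
                       dict_scores, dict_ids, p_gen, p_dict):
    """Mix generation, copy and domain-dictionary distributions.

    vocab_logits: (batch, vocab) decoder output logits
    attn_weights: (batch, src_len) attention over source tokens (sums to 1)
    src_ids:      (batch, src_len) vocabulary ids of source tokens
    dict_scores:  (batch, n_dict) scores over dictionary entries that are
                  synonyms / cross-border aliases of source-side terms
    dict_ids:     (batch, n_dict) vocabulary ids of those entries
    p_gen, p_dict: (batch, 1) soft switches in [0, 1]
    """
    p_vocab = F.softmax(vocab_logits, dim=-1)
    p_dictionary = F.softmax(dict_scores, dim=-1)

    # Weights of the three sources; they sum to 1 by construction.
    w_gen = p_gen
    w_copy = (1.0 - p_gen) * (1.0 - p_dict)
    w_dict = (1.0 - p_gen) * p_dict

    final = w_gen * p_vocab
    # Copy mechanism: add attention mass onto the source token ids.
    final = final.scatter_add(1, src_ids, w_copy * attn_weights)
    # Dictionary mechanism: add mass onto alias / synonym ids.
    final = final.scatter_add(1, dict_ids, w_dict * p_dictionary)
    return final

# Tiny smoke test with random tensors.
if __name__ == "__main__":
    b, v, s, d = 2, 50, 7, 4
    out = final_distribution(
        torch.randn(b, v),
        F.softmax(torch.randn(b, s), dim=-1),
        torch.randint(0, v, (b, s)),
        torch.randn(b, d),
        torch.randint(0, v, (b, d)),
        torch.rand(b, 1),
        torch.rand(b, 1),
    )
    print(out.sum(dim=-1))  # each row sums to approximately 1.0
```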

Key words: cross-border ethnic culture, domain knowledge, pointer-generator network, pre-training, text summarization

CLC number: TP391

Figure 1  Example of a cross-border ethnic culture text summary

Figure 2  Model architecture of DKCGS

Figure 3  Fusion of the domain dictionary with the input document

Table 1  Examples from the cross-border ethnic culture domain dictionary

Domain term                          Aliases
泼水节 (Water-Splashing Festival)     摆爽南、佛诞节、宋干节
开门节 (Door-Opening Festival)        出洼节、出夏节
叫谷魂 (calling the rice spirit)      招谷魂
婚礼 (wedding)                        金欠
关门节 (Door-Closing Festival)        进洼
花街节 (Flower Street Festival)       赶花街
…                                    …
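A minimal sketch of how such a dictionary might be stored and queried during decoding, using the entries of Table 1; the data structure and the lookup helper are illustrative assumptions, not the authors' code.

```python
# Illustrative structure for the Table 1 dictionary: each cultural term
# maps to its synonymous or cross-border aliases.
DOMAIN_DICT = {
    "泼水节": ["摆爽南", "佛诞节", "宋干节"],  # Water-Splashing Festival
    "开门节": ["出洼节", "出夏节"],            # Door-Opening Festival
    "叫谷魂": ["招谷魂"],                      # calling the rice spirit
    "婚礼": ["金欠"],                          # wedding
    "关门节": ["进洼"],                        # Door-Closing Festival
    "花街节": ["赶花街"],                      # Flower Street Festival
}

def candidate_domain_words(text, dictionary=DOMAIN_DICT):
    """Aliases of every dictionary term found in `text`: the extra words
    the decoder should be able to place probability mass on."""
    found = []
    for term, aliases in dictionary.items():
        if term in text:
            found.extend(aliases)
    return found

print(candidate_domain_words("泼水节是傣族一年中最盛大的传统节日"))
# -> ['摆爽南', '佛诞节', '宋干节']
```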

Table 2  Hyperparameter settings of the proposed model

Parameter                   Value
Model hidden size           512
Feed-forward hidden size    1024
Encoder/decoder layers      6
Attention heads             4
Batch size                  12
Epochs                      20
Dropout                     0.1
Beam size                   3
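For reference, a sketch of how the Table 2 settings map onto PyTorch's stock `torch.nn.Transformer` constructor; this mapping is an illustrative assumption and the paper's own implementation may differ.

```python
import torch.nn as nn

# Table 2 settings mapped onto the standard PyTorch Transformer.
model = nn.Transformer(
    d_model=512,           # model hidden size
    nhead=4,               # attention heads
    num_encoder_layers=6,  # encoder layers
    num_decoder_layers=6,  # decoder layers
    dim_feedforward=1024,  # feed-forward hidden size
    dropout=0.1,
)

BATCH_SIZE = 12  # training batch size
EPOCHS = 20      # training epochs
BEAM_SIZE = 3    # beam width used at decoding time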

Table 3  Comparison between the proposed method and seven baseline methods

Category      Model           Rouge-1   Rouge-2   Rouge-L
Extractive    Lead-1-First    23.53     11.68     22.83
              TextRank        23.86     11.54     22.46
Abstractive   Pointer         24.64     12.71     23.36
              RNN Context     24.01     12.56     23.12
              UniLM           25.83     12.61     24.37
              FAME            25.67     12.96     24.64
              Tri-PCN         24.88     12.64     24.17
              DKCGS           26.62     13.63     25.41
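All scores in Tables 3–6 are ROUGE F-scores (ref. 19). The snippet below is a self-contained, simplified sketch of a character-level ROUGE-N computation for Chinese text; it is a stand-in for the standard toolkit, not necessarily the evaluation script used in the paper.

```python
from collections import Counter

def rouge_n_f1(reference, candidate, n=1):
    """Character-level ROUGE-N F1 for Chinese text (simplified sketch)."""
    def ngram_counts(text):
        chars = [c for c in text if not c.isspace()]
        return Counter(tuple(chars[i:i + n])
                       for i in range(len(chars) - n + 1))

    ref, cand = ngram_counts(reference), ngram_counts(candidate)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if not ref or not cand or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge_n_f1("泼水节是傣族的传统节日",
                       "泼水节为傣族传统节日") * 100, 2))  # Rouge-1 = 85.71
```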

Table 4  Results with different dictionary encoding initializations

Dictionary encoding      Rouge-1   Rouge-2   Rouge-L
Random initialization    22.57     11.43     21.11
GloVe                    26.31     13.47     24.86
Word2vec                 26.62     13.62     25.41
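A sketch of initializing the dictionary-encoding embedding table from pretrained Word2vec vectors with gensim; the file path, sample term list, and zero fallback for out-of-vocabulary entries are placeholder assumptions.

```python
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

# "word2vec_zh.vec" is a placeholder path to pretrained Chinese vectors.
kv = KeyedVectors.load_word2vec_format("word2vec_zh.vec", binary=False)

terms = ["泼水节", "开门节", "关门节"]      # sample dictionary entries
weights = torch.zeros(len(terms), kv.vector_size)
for i, term in enumerate(terms):
    if term in kv.key_to_index:             # gensim >= 4 vocabulary lookup
        weights[i] = torch.from_numpy(kv[term].copy())
    # terms missing from Word2vec keep the zero init here; random init
    # (the Table 4 baseline) would draw from a normal distribution instead.

embedding = nn.Embedding.from_pretrained(weights, freeze=False)
```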

Table 5  Ablation results

Method                                           Rouge-1   Rouge-2   Rouge-L
DKCGS                                            26.62     13.63     25.41
Transformer                                      13.56     7.02      11.22
Transformer + pre-training                       24.23     11.07     22.48
Transformer + pre-training + domain dictionary   26.14     13.21     24.56

Table 6  Effect of dictionary size on model performance

Dictionary size   Rouge-1   Rouge-2   Rouge-L
1000              24.35     11.37     22.54
1500              25.24     12.37     22.96
2000              25.67     12.72     23.98
2500              26.43     13.53     25.11
3000              26.62     13.62     25.40
3123              26.62     13.63     25.41

Figure 4  Sample summaries generated by the compared models

1 Hu B T, Chen Q C, Zhu F Z. LCSTS:A large scale Chinese short text summarization dataset∥Proceedings of 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon,Portugal:ACL,2015:1967-1972.
2 Vaswani A, Shazeer N, Parmar N,et al. Attention is all you need∥Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach,CA,USA:Curran Associates Inc.,2017:6000-6010.
3 Dong L, Yang N, Wang W H,et al. Unified language model pre-training for natural language understanding and generation∥Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver,Canada:Curran Associates Inc.,2019:13063-13075.
4 Aralikatte R, Narayan S, Maynez J,et al. Focus attention:Promoting faithfulness and diversity in summarization∥Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics,the 11th International Joint Conference on Natural Language Processing. Bangkok,Thailand:ACL,2021:6078-6095.
5 Zhu J N, Zhou Y, Zhang J J,et al. Attend,translate and summarize:An efficient method for neural cross-lingual summarization∥Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Washington DC,WA,USA:ACL,2020:1309-1321.
6 Nallapati R, Zhou B W, Dos Santos C,et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond∥Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Berlin,Germany:Association for Computational Linguistics,2016:280-290.
7 Dou Z Y, Liu P F, Hayashi H,et al. GSum:A general framework for guided neural abstractive summarization∥Proceedings of 2021 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies. Online:Association for Computational Linguistics,2021:4830-4842.
8 Mihalcea R, Tarau P. TextRank:bringing order into text∥Proceedings of 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona,Spain:ACL,2004:404-411.
9 Liu Y, Lapata M. Text summarization with pretrained encoders∥Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,the 9th International Joint Conference on Natural Language Processing. Hong Kong,China:Association for Computational Linguistics,2019:3730-3740.
10 Zhou Q Y, Yang N, Wei F R,et al. Selective encoding for abstractive sentence summarization∥Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Volume 1:Long Papers. Vancouver,Canada:Association for Computational Linguistics,2017:1095-1104.
11 Rush A M, Chopra S, Weston J. A neural attention model for abstractive sentence summarization∥Proceedings of 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon,Portugal:Association for Computational Linguistics,2015:379-389.
12 Chopra S, Auli M, Rush A M. Abstractive sentence summarization with attentive recurrent neural networks∥Proceedings of 2016 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies. San Diego,CA,USA:ACL,2016:93-98.
13 See A, Liu P J, Manning C D. Get to the point:Summarization with pointer-generator networks∥Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Volume 1:Long Papers. Vancouver,Canada:ACL,2017:1073-1083.
14 Manakul P, Gales M. Long-span summarization via local attention and content selection∥Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics,the 11th International Joint Conference on Natural Language Processing. Bangkok,Thailand:ACL,2021:6026-6041.
15 Li C L, Xu W R, Li S,et al. Guiding generation for abstractive text summarization based on key information guide network∥Proceedings of 2018 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies. Volume 2:Short Papers. New Orleans,LA,USA:Association for Computational Linguistics,2018:55-60.
16 Afzal M, Alam F, Malik K M,et al. Clinical context-aware biomedical text summarization using deep neural network:Model development and validation. Journal of Medical Internet Research,2020,22(10):e19810.
17 蔡中祥, 孙建伟. 融合指针网络的新闻文本摘要模型. 小型微型计算机系统,2021,42(3):462-466.
Cai Z X, Sun J W. News text summarization model integrating pointer network. Journal of Chinese Computer Systems,2021,42(3):462-466.
18 Kingma D P, Ba J. Adam:A method for stochastic optimization∥Proceedings of the 3rd International Conference on Learning Representations. San Diego,CA,USA:ICLR,2015.
19 Lin C Y. ROUGE:a package for automatic evaluation of summaries∥Proceedings of the Text Summarization Branches Out. Barcelona,Spain:ACL,2004:74-81.
20 Koehn P. Statistical significance tests for machine translation evaluation∥Proceedings of 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona,Spain:ACL,2004:388-395.