Journal of Nanjing University (Natural Science), 2023, Vol. 59, Issue (4): 610–619. doi: 10.13232/j.cnki.jnju.2023.04.008


A Chinese-Vietnamese cross-border ethnic text retrieval method based on domain knowledge graph and contrastive learning

Siyuan Liu (1,2,3), Cunli Mao (1,2,3), Yongbing Zhang (1,2,3)

  1. South Asia and Southeast Asia Languages Voice Information Processing Engineering Research Center under the Ministry of Education, Kunming, 650000, China
  2. School of Information and Automation, Kunming University of Science and Technology, Kunming, 650000, China
  3. Key Laboratory of Artificial Intelligence in Yunnan Province, Kunming University of Science and Technology, Kunming, 650000, China
  • Received: 2023-05-24  Online: 2023-07-31  Published: 2023-08-18
  • Contact: Yongbing Zhang  E-mail: zhangyongbing419@163.com
  • Funding: National Natural Science Foundation of China (62166023); Key Project of the Yunnan Provincial Natural Science Foundation (2019FA023)

Abstract:

Chinese-Vietnamese cross-border ethnic text retrieval is a domain-oriented cross-lingual retrieval task: a query in one language is used to retrieve documents in the other language about cross-border ethnic groups, covering topics such as ethnicity, religion, and cultural customs. The task involves many uncommon domain entities whose surface forms vary widely, and Chinese and Vietnamese domain entities have no direct correspondence. This makes cross-lingual alignment of domain words and of semantics difficult, which in turn degrades the performance of retrieval models. To address this, this paper proposes a Chinese-Vietnamese cross-border ethnic text retrieval method based on a domain knowledge graph and contrastive learning. First, a multi-head attention mechanism integrates the Chinese-Vietnamese cross-border ethnic domain knowledge graph into queries and documents, enriching the information they carry about uncommon domain entities. Then, contrastive learning is introduced to ease the alignment of semantic representations between cross-lingual queries and documents. Finally, the similarity between the knowledge-enhanced query and document representations is used as the relevance score. Experiments show that the proposed method outperforms the baseline model by 4.1%.

Key words: cross-border ethnic culture, cross-border ethnic knowledge graph, cross-lingual retrieval, contrastive learning, information retrieval

Chinese Library Classification (CLC) number:

  • TP301
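
The fusion and scoring steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names `multi_head_fuse`, `cosine`, and `rank_documents` are hypothetical, the attention uses no learned projection matrices, cosine similarity is assumed as the similarity function (the abstract does not name one), and the toy vectors stand in for encoder outputs and knowledge-graph entity embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_fuse(text_vec, entity_vecs, num_heads=2):
    """Add an attention-weighted mix of entity vectors to a text vector.

    Assumption: each head works on its own slice of the vector and there are
    no learned query/key/value projections, unlike a trained model.
    """
    dim = len(text_vec)
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = dim // num_heads
    fused = []
    for head in range(num_heads):
        lo, hi = head * head_dim, (head + 1) * head_dim
        query = text_vec[lo:hi]
        keys = [vec[lo:hi] for vec in entity_vecs]
        # Scaled dot-product attention of the text slice over entity slices.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
                  for key in keys]
        weights = softmax(scores)
        context = [sum(w * key[i] for w, key in zip(weights, keys))
                   for i in range(head_dim)]
        # Residual connection keeps the original text information.
        fused.extend(q + c for q, c in zip(query, context))
    return fused


def cosine(a, b):
    """Cosine similarity, used here as the relevance score (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by descending relevance score."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy vectors only; real inputs would come from a multilingual encoder.
    entities = [[0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
    query = multi_head_fuse([1.0, 0.0, 0.0, 1.0], entities)
    docs = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, -1.0, 1.0]]
    print(rank_documents(query, docs))
```

In a trained model the per-head slices would first pass through learned projections; the slicing here only shows how several heads attend to the same entity set independently.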

Table 1. Data sample for Chinese-Vietnamese cross-border ethnic text retrieval

No. | Query: 傣族的楞贺桑勘 (楞贺桑勘 is a Dai-language name for the Dai Water-Splashing Festival)
1 | Lễ hội té nước là lễ hội quốc gia trang trọng và có tầm ảnh hưởng lớn nhất của người dân người Dai ...
2 | Lễ hội té nước Người Dai phổ biến ở Yunnan Dehong,Xishuangbanna và những nơi khác...
3 | Lễ hội té nước thể hiện nét văn hóa truyền thống của Người Dai như văn hóa sông nước...

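Table 2. Number of Chinese-Vietnamese cross-border ethnic knowledge triples after expansion

Category | Chinese knowledge triples | Vietnamese knowledge triples
Religious culture | 718 | 568
Architectural culture | 491 | 402
Clothing culture | 623 | 538
Food culture | 558 | 444
Art culture | 488 | 376
Customs culture | 646 | 350
Total | 3524 | 2678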

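Table 3. Examples of positive and negative samples for Chinese-Vietnamese cross-border ethnic culture

Chinese text | Vietnamese text | Type
傣族人在泼水节期间看龙舟赛. | Người Dai xem đua thuyền rồng trong Lễ hội té nước. | Original data
傣族人看龙舟比赛. | Người Dai có thuyền rồng trong Lễ hội Songkran. | Positive sample
傣族人在火把节期间看龙舟赛. | Người Dai xứ Đài xem đua thuyền rồng trong Lễ hội đuốc. | Negative sample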
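
Table 3 pairs an original sentence with a positive sample that keeps its meaning and a negative sample that swaps the Water-Splashing Festival (泼水节) for the Torch Festival (火把节). A common way to train on such pairs is the InfoNCE loss. The sketch below assumes that loss, which this page does not confirm; `info_nce` is a hypothetical helper name, the similarity values are made up for illustration, and the default temperature is the value listed in the parameter table.

```python
import math


def info_nce(sim_positive, sim_negatives, temperature=0.05):
    """InfoNCE loss for one anchor.

    sim_positive: similarity between the anchor and its positive sample.
    sim_negatives: similarities between the anchor and each negative sample.
    """
    logits = [sim_positive / temperature]
    logits += [s / temperature for s in sim_negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum - logits[0]


if __name__ == "__main__":
    # Illustrative similarities only, not measured values.
    well_aligned = info_nce(0.9, [0.1, 0.2])
    poorly_aligned = info_nce(0.5, [0.45, 0.6])
    print(well_aligned < poorly_aligned)
```

The loss shrinks as the positive pair becomes more similar than the negatives, which is the pressure that pulls a Chinese sentence and its Vietnamese counterpart together in the shared space.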

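Figure 1. The Chinese-Vietnamese cross-border ethnic text retrieval method based on domain knowledge graph and contrastive learning

Figure 2. Model for embedding the Chinese-Vietnamese cross-border ethnic knowledge graph

Figure 3. Sample distribution of the Chinese-Vietnamese cross-border ethnic culture dataset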

Table 4. Parameter settings of the proposed model in the experiments

Parameter | Value
queue size | 23768
temperature | 0.05
momentum | 0.999
learning rate | 0.00005
ratio_max | 0.5
ratio_min | 0.1
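
The parameters in Table 4 resemble those of a momentum-contrast training setup, in which a fixed-size queue holds embeddings from earlier batches for reuse as negatives. That reading is an assumption; the page lists the values without describing how they are used. The sketch below copies the values into a configuration dictionary whose key names are this sketch's own, and shows a first-in, first-out queue under the hypothetical class name `NegativeQueue`.

```python
from collections import deque

# Values copied from Table 4; the key names are chosen for this sketch.
CONFIG = {
    "queue_size": 23768,
    "temperature": 0.05,
    "momentum": 0.999,
    "learning_rate": 0.00005,
    "ratio_max": 0.5,
    "ratio_min": 0.1,
}


class NegativeQueue:
    """Fixed-size FIFO store of key embeddings reused as negatives."""

    def __init__(self, size):
        # deque(maxlen=...) discards the oldest items automatically.
        self._items = deque(maxlen=size)

    def enqueue(self, keys):
        """Add the newest keys; the oldest are dropped once the queue is full."""
        self._items.extend(keys)

    def negatives(self):
        """Return the stored keys, oldest first."""
        return list(self._items)

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    queue = NegativeQueue(CONFIG["queue_size"])
    queue.enqueue([[0.1, 0.2], [0.3, 0.4]])  # toy 2-d embeddings
    print(len(queue))
```

A queue decouples the number of negatives from the batch size, which is why its size can be far larger than any batch that fits in memory.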

Table 5. Comparison of the proposed model with other models

Model | Recall@100 | MRR@100
Our method | 0.909 | 0.658
UnsupCLIR | 0.752 | 0.392
Wasserstein | 0.813 | 0.457
EncoderCLIR (mBERT) | 0.859 | 0.524
mDPR | 0.884 | 0.579
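
For readers unfamiliar with the two metrics in Table 5, the sketch below shows one common way to compute Recall@k and MRR@k from ranked result lists. The page does not give the exact definitions used in the paper, so these formulas are an assumption, and the function names and document identifiers are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=100):
    """1 / rank of the first relevant document in the top k, else 0."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, k=100):
    """Average a per-query metric.

    runs: list of (ranked_ids, relevant_ids) pairs, one pair per query.
    """
    if not runs:
        return 0.0
    return sum(metric(ranked, rel, k) for ranked, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical document IDs for two queries.
    runs = [
        (["d3", "d1", "d7"], ["d1"]),
        (["d2", "d5", "d9"], ["d2", "d8"]),
    ]
    print(mean_over_queries(recall_at_k, runs, k=100))
    print(mean_over_queries(reciprocal_rank_at_k, runs, k=100))
```

Recall@100 rewards finding relevant documents anywhere in the top 100, while MRR@100 rewards placing the first relevant document near the top, which is why the two columns can move by different amounts.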

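Table 6. Ablation results

Model | Recall@100 | MRR@100
Our model | 0.909 | 0.658
mContriever (mBERT) | 0.878 | 0.594
mContriever (XLM-R) | 0.887 | 0.617
mContriever (XLM-independent cropping) (contrastive learning) | 0.889 | 0.645
mContriever (XLM-span) (contrastive learning) | 0.894 | 0.651
mContriever (XLM-two_view) (contrastive learning) | 0.898 | 0.656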
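
One ablation row in Table 6 is labeled independent cropping (独立剪裁). In contrastive retrieval training this usually means sampling two token spans independently from the same text and treating them as a positive pair. The sketch below illustrates that idea under the assumption that the `ratio_min` and `ratio_max` parameters bound the span length as a fraction of the text; the page does not state this, and the function names are hypothetical.

```python
import random


def random_crop(tokens, ratio_min=0.1, ratio_max=0.5, rng=random):
    """Return one contiguous span whose length is a random fraction of the text."""
    if not tokens:
        return []
    ratio = rng.uniform(ratio_min, ratio_max)
    length = max(1, int(len(tokens) * ratio))
    # randint is inclusive on both ends, so the span always fits.
    start = rng.randint(0, len(tokens) - length)
    return tokens[start:start + length]


def independent_crop_pair(tokens, ratio_min=0.1, ratio_max=0.5, rng=random):
    """Two spans sampled independently from one text form a positive pair."""
    first = random_crop(tokens, ratio_min, ratio_max, rng)
    second = random_crop(tokens, ratio_min, ratio_max, rng)
    return first, second


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for repeatable sampling
    tokens = [f"tok{i}" for i in range(40)]  # placeholder tokens
    view_a, view_b = independent_crop_pair(tokens, rng=rng)
    print(len(view_a), len(view_b))
```

Because the two spans are drawn independently, they may overlap or be disjoint; both cases still count as a positive pair in this scheme.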

Table 7. Results with different multilingual models

Model | Recall@100 | MRR@100
Our model (XLM-R) | 0.909 | 0.658
Our model (mBERT) | 0.884 | 0.625
Our model (XLM) | 0.895 | 0.631

Table 8. Performance of the proposed model with different momentum values

Model | Recall@100 | MRR@100
Momentum = 0.995 | 0.887 | 0.574
Momentum = 0.996 | 0.892 | 0.603
Momentum = 0.997 | 0.899 | 0.629
Momentum = 0.998 | 0.901 | 0.641
Momentum = 0.999 | 0.909 | 0.658
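
The momentum value in Table 8 is typically the coefficient of an exponential moving average that lets a key encoder trail a query encoder. The sketch below assumes that standard update rule, which the page does not spell out; `momentum_update` is a hypothetical name and the one-element parameter lists are toy stand-ins for model weights.

```python
def momentum_update(key_params, query_params, momentum=0.999):
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


if __name__ == "__main__":
    # Compare how far the key weights drift toward the query weights
    # after 100 updates for the smallest and largest values in Table 8.
    for m in (0.995, 0.999):
        key = [0.0]
        for _ in range(100):
            key = momentum_update(key, [1.0], momentum=m)
        print(m, key[0])
```

A larger momentum makes the key encoder change more slowly, so the embeddings produced across training steps stay more consistent with one another.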

Table 9. Case analysis

Rank | Query text: 傣族泼水节 (Dai Water-Splashing Festival) | Query text: Lễ hội té nước Dai
1 | Đây là hình thức biểu diễn không thể thiếu trong Lễ hội Songkran của người Tay. | 傣族泼水节,傣语称桑勘比迈或楞贺桑勘,时间在傣历6月下旬或7月初(公历4月中旬).
2 | Người Shan gọi "Lễ hội tắm Phật" là "Bimai", có nghĩa là năm mới. | 每逢泰族宋干节,人们开始互相泼,你泼我,我泼你,一朵花在空中绽放,象征吉祥、幸福、健康.
3 | Lễ hội té nước là chữ viết dân tộc nhất của người Dai. | 宋干节是泰国泰族、缅甸掸族、老挝佬族以及中国傣族的传统节日.
4 | Lễ hội té nước chỉ được tổ chức ở những làng người Dai theo đạo Phật Nam tông. | 掸族最隆重的节日是浴佛节,也称“宋干节”,掸族都会在浴佛节期间举办一定规模的庆祝活动.