Journal of Nanjing University (Natural Science), 2023, Vol. 59, Issue (4): 610-619. doi: 10.13232/j.cnki.jnju.2023.04.008
Siyuan Liu1,2,3, Cunli Mao1,2,3, Yongbing Zhang1,2,3
Abstract:
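Chinese-Vietnamese cross-border ethnic text retrieval is a domain-specific cross-lingual retrieval task: given a query in one language, it retrieves cross-border ethnic documents in the other language, covering topics such as ethnicity, religion, and cultural customs. The task involves many uncommon domain entities with diverse surface forms, and domain entities in Chinese and Vietnamese have no direct correspondence. This makes cross-lingual domain word alignment and semantic alignment difficult, which in turn degrades the performance of Chinese-Vietnamese cross-border ethnic text retrieval models. To address this, a Chinese-Vietnamese cross-border ethnic text retrieval method based on a domain knowledge graph and contrastive learning is proposed. First, a multi-head attention mechanism integrates the Chinese-Vietnamese cross-border ethnic domain knowledge graph into queries and documents, enriching their information about uncommon cross-border ethnic domain entities. Then, contrastive learning is introduced to address the difficulty of aligning the semantic representations of cross-lingual queries and documents. Finally, the similarity between the knowledge-graph-enhanced query and document representations is used as the relevance score. Experiments show that the proposed method improves performance by 4.1% over the baseline model.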
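The three steps in the abstract (attention-based fusion of knowledge-graph entities, contrastive alignment, similarity scoring) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the projection-free attention with a residual sum, the in-batch InfoNCE-style loss, the `temperature` value, and the use of cosine similarity are all assumptions made for illustration; the abstract only states that multi-head attention, contrastive learning, and a similarity score are used.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity, assumed here as the relevance score."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_entities(text_vec, entity_vecs, num_heads=2):
    """Hypothetical fusion step: the text vector attends over KG entity
    vectors with multi-head scaled dot-product attention (no learned
    projections), and the attended context is added back as a residual."""
    if not entity_vecs:
        return list(text_vec)
    dim = len(text_vec)
    assert dim % num_heads == 0, "dimension must be divisible by num_heads"
    head_dim = dim // num_heads
    fused = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = text_vec[lo:hi]
        keys = [e[lo:hi] for e in entity_vecs]
        weights = softmax([dot(q, k) / math.sqrt(head_dim) for k in keys])
        context = [sum(w * k[i] for w, k in zip(weights, keys))
                   for i in range(head_dim)]
        fused.extend(x + c for x, c in zip(q, context))
    return fused


def info_nce(queries, docs, temperature=0.1):
    """Assumed in-batch contrastive loss: docs[i] is the positive for
    queries[i]; every other document in the batch is a negative."""
    losses = []
    for i, q in enumerate(queries):
        probs = softmax([cosine(q, d) / temperature for d in docs])
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


# Toy usage: a Chinese query and a Vietnamese document share one KG entity.
entity = [0.0, 1.0, 0.0, 0.0]
query = fuse_entities([1.0, 0.0, 0.0, 1.0], [entity])
doc = fuse_entities([0.0, 0.0, 1.0, 1.0], [entity])
relevance = cosine(query, doc)
```

In this toy setting the shared entity vector is added to both representations, so the fused query and document end up closer than the raw vectors. A real system would use learned projections and a multilingual encoder rather than hand-written vectors.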