Journal of Nanjing University (Natural Science) ›› 2023, Vol. 59 ›› Issue (2): 273–281. doi: 10.13232/j.cnki.jnju.2023.02.010


Multi-label text classification method combining label embedding and knowledge-aware

Hai Feng1, Jialin Ma1, Linjie Xu1, Yu Yang1, Qian Xie1,2

  1. Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huaian, 223001, China
  2. Jiangsu Eazytec Company Limited, Wuxi, 214200, China
  • Received: 2022-12-03 Online: 2023-03-31 Published: 2023-04-07
  • Contact: Jialin Ma E-mail: majl@hyit.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61602202)

Abstract:

Multi-label text classification is an important task in natural language processing. The semantic information carried by a text's labels is closely related to the content of the document, yet traditional multi-label text classification methods either ignore the semantics of the labels themselves or work with labels whose semantic information is insufficient. To address these problems, we propose LEKA (Label Embedding and Knowledge-Aware), a multi-label text classification method that fuses label embedding with knowledge awareness. LEKA takes the document text and its corresponding labels as input and uses label embedding to obtain label-related attention. By taking label semantics into account, it establishes the relationship between labels and document content and applies the labels to text classification. In addition, to enrich the semantic information of the labels, external knowledge is introduced through knowledge graph embedding, which semantically expands the label text. LEKA is compared with other classification models on the public AAPD and RCV1-V2 datasets. The experimental results show that, compared with the LCFA (Label Combination and Fusion of Attentions) model, LEKA improves the F1 value by 3.5% and 2.1% on the two datasets, respectively.

Key words: multi-label text classification, label embedding, knowledge graph, attention mechanism
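
The idea of label-related attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' LEKA architecture: it assumes plain dot-product attention between label vectors and word vectors, random embeddings, and a fixed 0.5 threshold, and the function names `label_attention` and `predict` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def label_attention(word_vecs, label_vecs):
    """Hypothetical label-wise attention over the words of one document.

    word_vecs:  (n_words, dim) word embeddings of the document
    label_vecs: (n_labels, dim) label embeddings
    """
    scores = label_vecs @ word_vecs.T      # (n_labels, n_words) similarity
    weights = softmax(scores, axis=1)      # each label attends over all words
    doc_per_label = weights @ word_vecs    # (n_labels, dim) label-specific views
    return weights, doc_per_label

def predict(doc_per_label, label_vecs, threshold=0.5):
    # One independent sigmoid score per label (multi-label, not softmax).
    logits = np.sum(doc_per_label * label_vecs, axis=1)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, probs >= threshold

# Toy data: 6 words, 3 labels, 4-dimensional random embeddings (assumption).
rng = np.random.default_rng(0)
words = rng.normal(size=(6, 4))
labels = rng.normal(size=(3, 4))
weights, doc_per_label = label_attention(words, labels)
probs, predicted = predict(doc_per_label, labels)
```

Each row of `weights` sums to 1, so every label builds its own weighted summary of the document. In a knowledge-aware variant, the label vectors would presumably be enriched with knowledge graph embeddings before this step; how LEKA does that is not specified on this page.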

CLC number: TP391

Figure 1. Model framework of LEKA

Figure 2. Word embedding module

Figure 3. Label embedding module

Table 1. Overview of the datasets used in the experiments

Dataset    Total samples    Total labels    Avg. labels per text    Avg. words per text
AAPD       55840            54              2.41                    163.42
RCV1-V2    804414           103             3.24                    123.94
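
The "average labels per text" statistic in the dataset table is the mean number of labels attached to a document. A small sketch of that computation follows; the toy label sets are made up for illustration and are not taken from AAPD or RCV1-V2, and `label_cardinality` is a hypothetical helper name.

```python
def label_cardinality(label_sets):
    # Mean number of labels per document in a multi-label dataset.
    return sum(len(labels) for labels in label_sets) / len(label_sets)

# Three toy documents with 2, 1, and 3 labels (illustrative only).
toy_label_sets = [
    {"cs.LG", "stat.ML"},
    {"cs.CL"},
    {"cs.IT", "math.IT", "cs.NI"},
]
cardinality = label_cardinality(toy_label_sets)  # (2 + 1 + 3) / 3 = 2.0
```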

Table 2. Experimental results of the proposed LEKA model and the baseline models on the AAPD dataset

Model    P        R        F1
LEKA     0.796    0.712    0.752
BR       0.644    0.648    0.646
LP       0.662    0.608    0.634
LEAM     0.765    0.596    0.670
LSAN     0.777    0.646    0.706
LCFA     0.783    0.695    0.726
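
For reference, F1 is conventionally the harmonic mean of precision (P) and recall (R). The sketch below assumes that standard definition, since this page does not state how the reported F1 was computed, and recomputes the value from the P and R reported for LEKA on AAPD; `f1_score` is a hypothetical helper written here, not a library call.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# P and R reported for LEKA on AAPD.
leka_aapd_f1 = f1_score(0.796, 0.712)
print(round(leka_aapd_f1, 3))  # prints 0.752
```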

Figure 4. Experimental process of LEKA and the baseline algorithms on the AAPD dataset

Table 3. Experimental results of the proposed LEKA model and the baseline models on the RCV1-V2 dataset

Model    P        R        F1
LEKA     0.912    0.873    0.892
BR       0.904    0.816    0.858
LP       0.896    0.824    0.858
LEAM     0.871    0.841    0.856
LSAN     0.913    0.841    0.875
LCFA     0.906    0.849    0.877

Figure 5. Experimental process of LEKA and the baseline algorithms on the RCV1-V2 dataset

Table 4. Ablation results on the AAPD dataset

Model      P        R        F1
LE-noKA    0.885    0.831    0.857
LEKA       0.912    0.873    0.892

Figure 6. Per-label F1 scores on the AAPD dataset
