南京大学学报(自然科学版) ›› 2020, Vol. 56 ›› Issue (4): 549–560. doi: 10.13232/j.cnki.jnju.2020.04.013


一种基于嵌入式的弱标记分类算法

李亚重1,杨有龙1,仇海全1,2

  1.西安电子科技大学数学与统计学院,西安,710126
    2.安徽科技学院信息与网络工程学院,蚌埠,233030
  • 收稿日期:2020-03-03 出版日期:2020-07-30 发布日期:2020-08-06
  • 通讯作者: 杨有龙 E-mail:ylyang@mail.xidian.edu.cn
  • 基金资助:
    国家自然科学基金(61573266);安徽省高校自然科学研究重点项目(KJ2019A0816)

Label embedding for weak label classification

Yachong Li1,Youlong Yang1,Haiquan Qiu1,2

  1.School of Mathematics and Statistics,Xidian University,Xi'an,710126,China
    2.College of Information & Network Engineering,Anhui Science and Technology University,Bengbu,233030,China
  • Received:2020-03-03 Online:2020-07-30 Published:2020-08-06
  • Contact: Youlong Yang E-mail:ylyang@mail.xidian.edu.cn

摘要:

对于高维标签的分类问题,标签嵌入法已经受到广泛关注.现有的嵌入方法大都需要完整的标签信息,也没有将特征空间考虑在内;同时,由于数据人工标注成本高以及噪声干扰等原因,往往仅能获得数据的部分标签信息,这使得含有缺失标签的高维标签分类问题变得更加复杂.为解决这一问题,提出一种弱标记嵌入算法(Label Embedding for Weak Label Classification,LEWL).该算法利用矩阵的低秩分解模型,结合样本的流形结构恢复缺失标签;同时采用希尔伯特-施密特独立标准(Hilbert-Schmidt Independence Criterion,HSIC)使特征和标签相互作用,联合学习获得一个低维的嵌入空间,可以有效地减少模型的训练时间.在七个多标签数据集上与其他算法的对比实验表明了所提算法的有效性.

关键词: 弱标记学习, 标签嵌入, 低秩分解, 希尔伯特-施密特独立标准, 缺失标签
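摘要中提到的"低秩分解恢复缺失标签",其基本思想可用下面的极简示例说明.注意这只是示意性草图,并非论文中的LEWL算法:省略了流形结构项与HSIC项,秩、步长、正则化系数等参数均为假设值.

```python
import numpy as np

def recover_labels(Y, mask, rank=3, lam=0.1, lr=0.05, n_iter=500, seed=0):
    """在低秩分解 Y ≈ U V 的假设下, 仅利用观测到的标签
    (mask==1 的位置) 做梯度下降, 恢复缺失标签的得分."""
    rng = np.random.default_rng(seed)
    n, q = Y.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((rank, q))
    for _ in range(n_iter):
        R = mask * (U @ V - Y)  # 仅在观测项上计算残差
        U, V = (U - lr * (R @ V.T + lam * U),
                V - lr * (U.T @ R + lam * V))
    return U @ V  # 连续得分, 可按阈值 (如 0.5) 离散化为标签
```

恢复出的得分矩阵在观测位置应逼近已知标签,缺失位置则由低秩结构补全.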

Abstract:

For the classification of high-dimensional labels,label embedding has attracted extensive attention from researchers in recent years. However,most existing embedding methods require complete label information and do not take the feature space into consideration. Meanwhile,due to the high cost of manual labeling and the interference of noise,often only part of the label information can be obtained,which makes the classification of high-dimensional labels with missing labels even more complicated. To this end,a Label Embedding method for Weak Label classification (LEWL) is proposed in this paper. The algorithm uses a low-rank factorization model of the label matrix,combined with the manifold structure of the samples,to recover the missing labels. In the meantime,the Hilbert-Schmidt Independence Criterion (HSIC) is adopted to make the features and labels interact with each other,and a low-dimensional embedding space is obtained by joint learning,which effectively reduces the training time of the model. Comprehensive experiments against other methods on seven multi-label data sets validate the effectiveness of the proposed approach.

Key words: weak label classification, label embedding, low-rank matrix factorization, Hilbert-Schmidt Independence Criterion (HSIC), missing labels
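As a quick illustration of the HSIC criterion named above, the standard biased empirical estimate HSIC = tr(KHLH)/(n-1)² can be computed as follows. This is a minimal sketch with an assumed Gaussian (RBF) kernel and bandwidth, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(A, gamma=0.5):
    # Gaussian kernel matrix from pairwise squared distances
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.exp(-gamma * np.maximum(d2, 0))

def hsic(X, Y, gamma=0.5):
    """Biased empirical HSIC estimate: tr(K H L H) / (n-1)^2,
    where H = I - 11^T/n is the centering matrix."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = rbf_kernel(X, gamma)
    L = rbf_kernel(Y, gamma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

A larger HSIC value indicates stronger statistical dependence between the two views, which is why maximizing it couples the feature space with the learned label embedding.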

中图分类号: TP301.6

表1

数据集基本信息

数据集     样本数  特征数  标签数  领域     标签基数
Emotions   593    72     6      music    1.869
Yeast      2417   103    14     biology  4.237
CAL500     502    68     174    music    26.044
Medical    978    1449   45     text     1.245
Langlog    1460   1004   75     images   1.18
Enron      1702   1001   53     text     3.378
Corel5k    5000   499    374    images   3.522
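表1中的"标签基数"指每个样本正标签个数的平均值,可由0/1标签矩阵直接算得(示意代码,其中的小矩阵为虚构示例,并非表中数据集):

```python
import numpy as np

def label_cardinality(Y):
    """标签基数: 每个样本正标签个数在全体样本上的平均值."""
    return float(np.mean(np.sum(Y, axis=1)))

# 虚构的 3 样本 x 3 标签示例: 各行正标签数为 2, 1, 3
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1]])
print(label_cardinality(Y))  # (2+1+3)/3 = 2.0
```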

表2

缺失标签在平均精度上的恢复结果

Data set  ρ    LEWL         MLML         LRML         CPLST        BR
Emotions  0.3  0.918±0.008  0.920±0.004  0.843±0.011  0.886±0.005  0.886±0.005
          0.7  0.839±0.011  0.846±0.012  0.772±0.013  0.781±0.010  0.781±0.010
Yeast     0.3  0.894±0.003  0.891±0.001  0.792±0.010  0.869±0.002  0.869±0.002
          0.7  0.811±0.004  0.805±0.003  0.716±0.011  0.738±0.005  0.738±0.005
CAL500    0.3  0.832±0.006  0.791±0.005  0.784±0.014  0.816±0.004  0.816±0.004
          0.7  0.663±0.006  0.633±0.007  0.652±0.010  0.623±0.004  0.623±0.004
Medical   0.3  0.790±0.019  0.776±0.029  0.772±0.042  0.766±0.021  0.766±0.021
          0.7  0.602±0.024  0.588±0.029  0.531±0.044  0.542±0.009  0.542±0.009
Langlog   0.3  0.860±0.005  0.811±0.002  0.852±0.031  0.827±0.002  0.827±0.002
          0.7  0.709±0.006  0.667±0.003  0.704±0.022  0.643±0.006  0.643±0.006
Enron     0.3  0.819±0.007  0.823±0.006  0.764±0.051  0.795±0.006  0.795±0.006
          0.7  0.632±0.020  0.667±0.028  0.543±0.058  0.585±0.013  0.585±0.013
Corel5k   0.3  0.749±0.005  0.735±0.006  0.742±0.004  0.733±0.002  0.733±0.002
          0.7  0.515±0.008  0.510±0.011  0.513±0.009  0.511±0.004  0.511±0.004

表3

缺失标签在Macro F1上的恢复结果

Data set  ρ    LEWL         MLML         LRML         CPLST        BR
Emotions  0.3  0.882±0.009  0.905±0.006  0.857±0.048  0.847±0.013  0.847±0.013
          0.7  0.775±0.010  0.801±0.009  0.814±0.035  0.666±0.011  0.666±0.011
Yeast     0.3  0.883±0.007  0.874±0.006  0.805±0.015  0.850±0.003  0.850±0.003
          0.7  0.749±0.008  0.738±0.007  0.717±0.023  0.660±0.002  0.660±0.002
CAL500    0.3  0.845±0.004  0.776±0.005  0.755±0.021  0.849±0.008  0.849±0.008
          0.7  0.667±0.007  0.633±0.007  0.643±0.020  0.657±0.012  0.657±0.012
Medical   0.3  0.804±0.053  0.783±0.029  0.755±0.005  0.796±0.049  0.796±0.049
          0.7  0.566±0.042  0.593±0.029  0.565±0.006  0.557±0.038  0.557±0.038
Langlog   0.3  0.859±0.002  0.856±0.002  0.855±0.041  0.850±0.001  0.850±0.001
          0.7  0.685±0.003  0.683±0.003  0.682±0.066  0.663±0.005  0.663±0.005
Enron     0.3  0.845±0.013  0.822±0.006  0.754±0.016  0.822±0.015  0.822±0.015
          0.7  0.633±0.033  0.648±0.028  0.630±0.013  0.651±0.014  0.651±0.014
Corel5k   0.3  0.817±0.012  0.785±0.007  0.801±0.025  0.813±0.017  0.813±0.017
          0.7  0.619±0.011  0.605±0.013  0.614±0.021  0.612±0.016  0.612±0.016

表4

缺失标签在排序损失上的恢复结果

Data set  ρ    LEWL         MLML         LRML         CPLST        BR
Emotions  0.3  0.132±0.010  0.133±0.013  0.229±0.017  0.262±0.012  0.262±0.012
          0.7  0.274±0.005  0.271±0.011  0.324±0.016  0.498±0.018  0.498±0.018
Yeast     0.3  0.116±0.003  0.138±0.002  0.209±0.008  0.261±0.003  0.261±0.003
          0.7  0.227±0.004  0.258±0.007  0.288±0.007  0.502±0.007  0.502±0.007
CAL500    0.3  0.130±0.001  0.239±0.006  0.249±0.011  0.259±0.005  0.259±0.005
          0.7  0.248±0.001  0.394±0.011  0.339±0.009  0.503±0.006  0.503±0.006
Medical   0.3  0.107±0.007  0.254±0.011  0.155±0.008  0.257±0.022  0.257±0.022
          0.7  0.148±0.022  0.444±0.012  0.205±0.012  0.505±0.011  0.505±0.011
Langlog   0.3  0.170±0.004  0.233±0.004  0.204±0.051  0.251±0.002  0.251±0.002
          0.7  0.332±0.012  0.339±0.007  0.342±0.018  0.493±0.007  0.493±0.007
Enron     0.3  0.129±0.008  0.191±0.010  0.213±0.043  0.257±0.008  0.257±0.008
          0.7  0.272±0.029  0.363±0.021  0.291±0.043  0.513±0.011  0.513±0.011
Corel5k   0.3  0.108±0.002  0.271±0.008  0.178±0.008  0.259±0.003  0.259±0.003
          0.7  0.403±0.004  0.496±0.014  0.356±0.007  0.503±0.004  0.503±0.004

表5

不同算法关于恢复结果的t检验

指标               MLML    LRML    CPLST   BR
Average Precision  0.0455  0.0004  0.0001  0.0001
Macro F1           0.2273  0.0179  0.0230  0.0230
Rank Loss          0.0015  0.0008  0.0003  0.0003
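表5和表9中的p值来自成对t检验.以表2中ρ=0.3时LEWL与CPLST两列的平均精度为例,t统计量可按如下方式计算(仅用numpy的示意实现;临界值2.447为df=6、双侧α=0.05下的t分布分位数,精确p值需t分布的CDF,此处从略):

```python
import numpy as np

# 表2中 ρ=0.3 时 LEWL 与 CPLST 在七个数据集上的平均精度
lewl  = np.array([0.918, 0.894, 0.832, 0.790, 0.860, 0.819, 0.749])
cplst = np.array([0.886, 0.869, 0.816, 0.766, 0.827, 0.795, 0.733])

d = lewl - cplst                                   # 成对差值
t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # 成对t统计量
print(t > 2.447)  # True: 在 df=6, 双侧 α=0.05 下差异显著
```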

表6

测试数据在平均精度上的预测结果

Data set  ρ    LEWL         MLML         LRML         CPLST        BR
Emotions  0.3  0.732±0.026  0.716±0.019  0.729±0.022  0.662±0.038  0.645±0.020
          0.7  0.708±0.012  0.687±0.039  0.682±0.016  0.558±0.023  0.568±0.021
Yeast     0.3  0.708±0.006  0.654±0.005  0.441±0.024  0.587±0.004  0.573±0.005
          0.7  0.698±0.005  0.648±0.008  0.440±0.026  0.448±0.008  0.450±0.003
CAL500    0.3  0.365±0.017  0.310±0.013  0.202±0.011  0.248±0.012  0.248±0.004
          0.7  0.369±0.013  0.307±0.013  0.197±0.009  0.170±0.008  0.175±0.009
Medical   0.3  0.805±0.025  0.734±0.009  0.775±0.008  0.675±0.049  0.289±0.029
          0.7  0.718±0.039  0.685±0.018  0.721±0.038  0.332±0.030  0.112±0.011
Langlog   0.3  0.543±0.019  0.419±0.016  0.410±0.017  0.461±0.048  0.469±0.019
          0.7  0.530±0.016  0.531±0.019  0.399±0.017  0.308±0.029  0.309±0.011
Enron     0.3  0.536±0.041  0.395±0.051  0.348±0.007  0.331±0.045  0.296±0.043
          0.7  0.493±0.054  0.346±0.055  0.252±0.011  0.213±0.020  0.197±0.025
Corel5k   0.3  0.038±0.003  0.027±0.008  0.027±0.005  0.030±0.007  0.027±0.007
          0.7  0.022±0.002  0.025±0.007  0.026±0.004  0.027±0.007  0.023±0.007

表7

测试数据在Macro F1上的预测结果

Data set  ρ    LEWL         MLML         LRML         CPLST        BR
Emotions  0.3  0.606±0.057  0.576±0.051  0.616±0.036  0.429±0.069  0.389±0.021
          0.7  0.596±0.038  0.523±0.061  0.610±0.036  0.152±0.033  0.160±0.041
Yeast     0.3  0.461±0.004  0.334±0.011  0.367±0.008  0.251±0.009  0.234±0.009
          0.7  0.452±0.006  0.313±0.008  0.366±0.009  0.058±0.009  0.062±0.004
CAL500    0.3  0.234±0.012  0.073±0.007  0.212±0.011  0.035±0.005  0.046±0.003
          0.7  0.232±0.008  0.066±0.008  0.205±0.010  0.008±0.002  0.015±0.002
Medical   0.3  0.358±0.032  0.311±0.012  0.325±0.002  0.310±0.023  0.098±0.018
          0.7  0.306±0.020  0.258±0.008  0.287±0.019  0.171±0.029  0.010±0.003
Langlog   0.3  0.448±0.018  0.164±0.011  0.389±0.028  0.171±0.025  0.185±0.014
          0.7  0.434±0.017  0.232±0.009  0.374±0.029  0.048±0.009  0.059±0.012
Enron     0.3  0.166±0.011  0.065±0.013  0.120±0.015  0.063±0.009  0.042±0.008
          0.7  0.149±0.010  0.051±0.014  0.101±0.016  0.022±0.004  0.013±0.011
Corel5k   0.3  0.025±0.001  0.018±0.005  0.020±0.003  0.010±0.003  0.009±0.003
          0.7  0.019±0.001  0.015±0.006  0.016±0.003  0.007±0.004  0.006±0.002

表8

测试数据在排序损失上的预测结果

Data set  ρ    LEWL         MLML         LRML         CPLST        BR
Emotions  0.3  0.446±0.024  0.509±0.042  0.441±0.019  0.692±0.052  0.731±0.018
          0.7  0.457±0.030  0.561±0.071  0.453±0.016  0.924±0.023  0.915±0.013
Yeast     0.3  0.385±0.008  0.484±0.009  0.709±0.007  0.652±0.008  0.679±0.011
          0.7  0.401±0.011  0.496±0.013  0.722±0.005  0.940±0.011  0.933±0.011
CAL500    0.3  0.461±0.011  0.577±0.006  0.518±0.012  0.890±0.014  0.885±0.003
          0.7  0.457±0.011  0.584±0.018  0.528±0.008  0.972±0.008  0.969±0.005
Medical   0.3  0.139±0.037  0.476±0.014  0.302±0.019  0.345±0.050  0.785±0.027
          0.7  0.168±0.005  0.529±0.016  0.347±0.017  0.462±0.040  0.977±0.006
Langlog   0.3  0.468±0.014  0.716±0.017  0.629±0.057  0.716±0.053  0.689±0.025
          0.7  0.487±0.015  0.761±0.021  0.625±0.050  0.811±0.031  0.887±0.019
Enron     0.3  0.315±0.038  0.500±0.056  0.548±0.008  0.792±0.041  0.835±0.041
          0.7  0.349±0.038  0.569±0.055  0.771±0.019  0.948±0.015  0.967±0.023
Corel5k   0.3  0.538±0.001  0.745±0.002  0.697±0.002  0.825±0.001  0.889±0.001
          0.7  0.549±0.001  0.783±0.004  0.721±0.002  0.889±0.004  0.903±0.002
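表4与表8中的排序损失(Rank Loss)衡量被错误排序的(正标签,负标签)对所占的比例,取值越小越好.下面是该指标的一个示意实现(非论文代码,并列得分按半个错误计):

```python
import numpy as np

def ranking_loss(Y, S):
    """平均每个样本中, 得分矩阵 S 将 (正标签, 负标签) 对
    排错顺序的比例; Y 为 0/1 标签矩阵."""
    losses = []
    for y, s in zip(Y, S):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue  # 跳过全正或全负的样本
        wrong = np.sum(pos[:, None] < neg[None, :])
        ties = np.sum(pos[:, None] == neg[None, :])
        losses.append((wrong + 0.5 * ties) / (len(pos) * len(neg)))
    return float(np.mean(losses))
```

完美排序(所有正标签得分高于负标签)时损失为0,完全颠倒时为1.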

表9

不同算法关于预测结果的t检验

指标               MLML    LRML    CPLST   BR
Average Precision  0.0046  0.0054  0.0000  0.0007
Macro F1           0.0008  0.0007  0.0001  0.0000
Rank Loss          0.0007  0.0016  0.0001  0.0000

图1

LEWL在Emotions(上图)和Yeast(下图)数据集上的收敛性

图2

在Emotions(上图)和Medical(下图)数据集上的平均精度

图3

λ和β对Emotions数据集的影响

1 Katakis I,Tsoumakas G,Vlahavas I. Multilabel text classification for automated tag suggestion∥Proceedings of the ECML-PKDD 2008 Workshop on Discovery Challenge. Antwerp,Belgium:Springer,2008,18:5.
2 Jia X,Sun F M,Li H J,et al. Image multi-label annotation based on supervised nonnegative matrix factorization with new matching measurement. Neurocomputing,2017,219:518-525.
3 Elisseeff A,Weston J. A kernel method for multi-labelled classification∥Proceedings of the 14th International Conference on Neural Information Processing Systems:Natural and Synthetic. Vancouver,Canada:MIT Press,2001:681-687.
4 Boutell M R,Luo J B,Shen X P,et al. Learning multi-label scene classification. Pattern Recognition,2004,37(9):1757-1771.
5 Tsoumakas G,Vlahavas I. Random k-labelsets:An ensemble method for multilabel classification∥European Conference on Machine Learning. Springer Berlin Heidelberg,2007:406-417.
6 Read J,Pfahringer B,Holmes G,et al. Classifier chains for multi-label classification. Machine Learning,2011,85(3):333-359.
7 Zhang M L,Zhou Z H. ML-KNN:a lazy learning approach to multi-label learning. Pattern Recognition,2007,40(7):2038-2048.
8 Freund Y,Schapire R. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence,1999,14(5):771-780.
9 马宏亮,万建武,王洪元. 一种嵌入样本流形结构与标记相关性的多标记降维算法. 南京大学学报(自然科学),2019,55(1):92-101.
Ma M L,Wan J W,Wang H Y. A multi-label dimensionality reduction algorithm embedded sample manifold structure and label correlation. Journal of Nanjing University (Natural Science),2019,55(1):92-101.
10 彭成伦. 多义性机器学习中的标记嵌入方法研究. 硕士学位论文. 南京:东南大学,2018.
Peng C L. Research on label embedding in ambiguous machine learning. Master Dissertation. Nanjing:Southeast University,2018.
11 Hsu D J,Kakade S M,Langford J,et al. Multi-label prediction via compressed sensing. 2009,arXiv:0902.1284.
12 Tai F,Lin H T. Multilabel classification with principal label space transformation. Neural Computation,2012,24(9):2508-2542.
13 Chen Y N,Lin H T. Feature-aware label space dimension reduction for multi-label classification∥Advances in Neural Information Processing Systems. Lake Tahoe,NV,USA:Neural Information Processing Systems Foundation,Inc.,2012,2:1529-1537.
14 Lin Z J,Ding G G,Han J G,et al. End-to-end feature-aware label space encoding for multilabel classification with many classes. IEEE Transactions on Neural Networks and Learning Systems,2018,29(6):2472-2487.
15 刘阳. 多标签数据分类技术研究. 博士学位论文. 西安:西安电子科技大学,2018.
Liu Y. Research on multi-label data classification technology. Ph.D. Dissertation. Xi'an:Xidian University,2018.
16 Sun Y Y,Zhang Y,Zhou Z H. Multi-label learning with weak label∥Proceedings of the 24th AAAI Conference on Artificial Intelligence. Atlanta,GA,USA:AAAI Press,2010:593-598.
17 Wu B Y,Liu Z L,Wang S F,et al. Multi-label learning with missing labels∥2014 22nd International Conference on Pattern Recognition. Stockholm,Sweden:IEEE,2014:1964-1968.
18 Guo B,Hou C,Shan J,et al. Low rank multi-label classification with missing labels∥2018 24th International Conference on Pattern Recognition (ICPR2018). Beijing,China:IEEE,2018:417-422.
19 Han Y F,Sun G L,Shen Y,et al. Multi-label learning with highly incomplete data via collaborative embedding∥Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London,United Kingdom:ACM Press,2018:1494-1503.
20 Xu M,Jin R,Zhou Z H. Speedup matrix completion with side information:application to multi-label learning∥Proceedings of the 27th Annual Conference on Neural Information Processing Systems. Montreal,Canada:MIT Press,2013:2301-2309.
21 Xu L L,Wang Z,Shen Z F,et al. Learning low-rank label correlations for multi-label classification with missing labels∥2014 IEEE International Conference on Data Mining. Shenzhen,China:IEEE,2014:1067-1072.
22 Candès E J,Tao T. The power of convex relaxation:near-optimal matrix completion. IEEE Transactions on Information Theory,2010,56(5):2053-2080.
23 Wen Z W,Yin W T,Zhang Y. Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation,2012,4(4):333-361.
24 Zhang Y,Schneider J. Multi-label output codes using canonical correlation analysis∥Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. Fort Lauderdale,FL,USA:JMLR,2011:873-882.
25 Gretton A,Bousquet O,Smola A,et al. Measuring statistical dependence with Hilbert-Schmidt norms∥International Conference on Algorithmic Learning Theory. Springer Berlin Heidelberg,2005:63-77.
26 Han S J,Qubo C,Meng H. Parameter selection in SVM with RBF kernel function∥World Automation Congress 2012. Puerto Vallarta,Mexico:IEEE,2012:1-4.
27 Lin Z J,Ding G G,Hu M Q,et al. Multi-label classification via feature-aware implicit label space encoding∥Proceedings of the 31st International Conference on International Conference on Machine Learning. Beijing,China:JMLR.org,2014:325-333.
28 Han Y H,Wu F,Jia J Z,et al. Multi-task sparse discriminant analysis (MtSDA) with overlapping categories∥Proceedings of the 24th AAAI Conference on Artificial Intelligence. Atlanta,GA,USA:AAAI Press,2010:469-474.
29 Pacharawongsakda E,Theeramunkong T. Towards more efficient multi-label classification using dependent and independent dual space reduction∥Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer Berlin Heidelberg,2012:383-394.
30 Zhang M L,Zhou Z H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering,2013,26(8):1819-1837.