Journal of Nanjing University (Natural Science), 2020, Vol. 56, Issue (4): 515–523. doi: 10.13232/j.cnki.jnju.2020.04.009


Soft subspace clustering algorithm based on transfer learning

Lijuan Wang1,2, Shifei Ding1, Ling Ding1

  1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
  2. School of Information and Electrical Engineering, Xuzhou College of Industrial Technology, Xuzhou 221400, China
  • Received: 2020-06-20  Online: 2020-07-30  Published: 2020-08-06
  • Contact: Shifei Ding  E-mail: dingsf@cumt.edu.cn
  • Supported by: National Natural Science Foundation of China (61672522); the Qing Lan Project of Jiangsu Higher Education Institutions (2020)

Abstract:

With the advent of the big data era, high-dimensional data are everywhere. Clustering is a technique for analyzing and describing data and grouping them according to some measure of similarity. Traditional clustering algorithms are often unable to handle high-dimensional data effectively. Soft subspace clustering assigns feature weights to describe the uncertainty of samples belonging to different clusters; however, the accuracy and efficiency of existing soft subspace clustering algorithms degrade considerably when the data are incomplete or the information is inaccurate. Starting from these problems, this paper proposes an improved soft subspace clustering algorithm. To counter incomplete and insufficient data, transfer learning is introduced to weaken the impact of the limited amount of data on the clustering analysis, and the concept of information entropy is introduced to determine the feature weights of the high-dimensional data. Experiments show that combining transfer learning with information entropy effectively improves the precision and accuracy of the soft subspace clustering algorithm.

Key words: subspace clustering, transfer learning, information entropy, high-dimensional data
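Since only the abstract is reproduced on this page, the following is a minimal sketch of the kind of procedure it describes: an entropy-weighted soft subspace clustering loop (in the spirit of EWKM-style algorithms) with an optional transfer term that pulls the current-domain cluster centers toward centers learned on a source ("history") domain. All names and parameters here (gamma, lam, source_centers, n_iter) are illustrative assumptions, not the authors' exact TSC formulation.

import numpy as np

def entropy_weighted_ssc(X, n_clusters, gamma=1.0, lam=0.0,
                         source_centers=None, n_iter=50, seed=0):
    """Sketch of entropy-weighted soft subspace clustering with an
    optional transfer term (hypothetical, for illustration only).

    X              : (n_samples, n_features) data matrix
    gamma          : entropy regularization strength for the feature weights
    lam            : strength of the transfer term pulling centers toward
                     source_centers learned on the history data
    source_centers : (n_clusters, n_features) source-domain centers, or None
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=n_clusters, replace=False)].copy()
    weights = np.full((n_clusters, d), 1.0 / d)   # per-cluster feature weights

    for _ in range(n_iter):
        # 1) assign each sample to the cluster with the smallest weighted distance
        sq = (X[:, None, :] - centers[None, :, :]) ** 2          # (n, k, d)
        dist = np.einsum('nkd,kd->nk', sq, weights)
        labels = dist.argmin(axis=1)

        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) == 0:
                continue
            # 2) update centers; the transfer term shrinks them toward the source centers
            if source_centers is not None and lam > 0:
                centers[k] = (members.sum(axis=0) + lam * source_centers[k]) / (len(members) + lam)
            else:
                centers[k] = members.mean(axis=0)
            # 3) update feature weights with the entropy (softmax) rule
            D = ((members - centers[k]) ** 2).sum(axis=0)        # per-feature dispersion
            w = np.exp(-(D - D.min()) / gamma)
            weights[k] = w / w.sum()

    return labels, centers, weights

In the setting the abstract describes, one would presumably first cluster the history data, keep the resulting centers, and then cluster the (possibly incomplete) current data with those centers passed as source_centers; lam would then control how strongly knowledge is transferred, and gamma how concentrated the feature weights become.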

CLC number: TP301

Figure 1  Flowchart of the WFCM algorithm

Figure 2  Knowledge transfer

Table 1  Details of the UCI data sets

No.  Name        Samples N  Dimensions D  Clusters C
1    Iris        150        4             3
2    Wine        178        13            3
3    Vehicle     208        18            4
4    Australian  690        14            2
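For reference, Iris and Wine are available directly from scikit-learn, while Vehicle and Australian would have to be fetched from the UCI repository. The sketch below shows only one hypothetical way to build a "history"/"current" split of the kind the result tables refer to (Xhistory, Xcurrent-all, Xcurrent-lost); the actual split procedure used in the paper is not described on this page, and whether "lost" means removed samples or removed attribute values is an assumption here.

import numpy as np
from sklearn.datasets import load_iris

def history_current_split(X, y, history_frac=0.5, lost_frac=0.3, seed=0):
    """Hypothetical split into a source ('history') set, a target ('current')
    set, and a reduced 'lost' variant of the current set with some samples
    randomly removed, mimicking the incomplete-data scenario in the abstract."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(history_frac * len(X))
    hist_idx, curr_idx = idx[:cut], idx[cut:]
    keep = rng.random(len(curr_idx)) > lost_frac       # randomly drop a fraction
    return (X[hist_idx], y[hist_idx],                  # Xhistory
            X[curr_idx], y[curr_idx],                  # Xcurrent-all
            X[curr_idx[keep]], y[curr_idx[keep]])      # Xcurrent-lost

iris = load_iris()
Xh, yh, Xc_all, yc_all, Xc_lost, yc_lost = history_current_split(iris.data, iris.target)
print(Xh.shape, Xc_all.shape, Xc_lost.shape)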

Table 2  Clustering results on Xhistory (RI index)

Data set    EWKM    ESSC    FSC
Iris        0.8523  0.8720  0.8423
Wine        0.8415  0.8975  0.8358
Vehicle     0.3747  0.5261  0.3854
Australian  0.7552  0.7123  0.7348

Table 3  Clustering results on Xhistory (NMI index)

Data set    EWKM    ESSC    FSC
Iris        0.7523  0.7441  0.7105
Wine        0.7015  0.7025  0.7158
Vehicle     0.1042  0.1225  0.1156
Australian  0.4835  0.3454  0.3855

Table 4  Clustering results on Xcurrent (RI index)

                       Xcurrent-all                    Xcurrent-lost
Data set    EWKM    ESSC    FSC     TSC      EWKM    ESSC    FSC     TSC
Iris        0.6235  0.6358  0.6135  0.7852   0.5442  0.5317  0.5423  0.7561
Wine        0.6552  0.6884  0.6451  0.7245   0.6075  0.6023  0.5997  0.8024
Vehicle     0.6578  0.6021  0.6077  0.8245   0.3871  0.3561  0.3534  0.6122
Australian  0.6021  0.5975  0.5988  0.7846   0.4223  0.4125  0.4108  0.7241

Table 5  Clustering results on Xcurrent (NMI index)

                       Xcurrent-all                    Xcurrent-lost
Data set    EWKM    ESSC    FSC     TSC      EWKM    ESSC    FSC     TSC
Iris        0.5247  0.5365  0.5286  0.6807   0.4275  0.4562  0.3925  0.5803
Wine        0.6218  0.6452  0.6102  0.7534   0.5842  0.6231  0.5714  0.6744
Vehicle     0.1204  0.1078  0.1107  0.1608   0.0214  0.0451  0.0168  0.1256
Australian  0.2536  0.2237  0.2496  0.4532   0.2453  0.1431  0.1087  0.3998
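Tables 2-5 evaluate the clusterings with the Rand index (RI) and normalized mutual information (NMI). A minimal sketch of how these two scores can be computed with scikit-learn, given ground-truth labels y_true and predicted cluster labels y_pred from any of the compared algorithms (the function name is illustrative):

from sklearn.metrics import rand_score, normalized_mutual_info_score

def evaluate_clustering(y_true, y_pred):
    # y_true: ground-truth class labels; y_pred: cluster assignments
    # (both 1-D integer sequences of the same length)
    return {
        "RI": rand_score(y_true, y_pred),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
    }

print(evaluate_clustering([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]))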