南京大学学报(自然科学版) ›› 2020, Vol. 56 ›› Issue (1): 9–18.doi: 10.13232/j.cnki.jnju.2020.01.002

• • 上一篇    下一篇

半监督平面聚类算法设计

杨红鑫,杨绪兵(),张福全,业巧林   

  1. 南京林业大学信息科学技术学院,南京,210037
  • 收稿日期:2019-07-02 出版日期:2020-01-30 发布日期:2020-01-10
  • 通讯作者: 杨绪兵 E-mail:xbyang@njfu.edu.cn
  • 基金资助:
    国家自然科学基金(31670554);江苏省自然科学基金(BK20171453);2019江苏省研究生科研创新项目(SJKY19_0907)

Semi⁃supervised plane clustering algorithm

Hongxin Yang,Xubing Yang(),Fuquan Zhang,Qiaolin Ye   

  1. School of Computer Science and Technology,Nanjing Forestry University,Nanjing,210037,China
  • Received:2019-07-02 Online:2020-01-30 Published:2020-01-10
  • Contact: Xubing Yang E-mail:xbyang@njfu.edu.cn

摘要:

采用以平面为原型来拟合样本的思想设计学习机,已在机器学习和数据挖掘等领域引起广泛关注,然而,如何利用少量标记样本,兼顾平面原型特点实现聚类,鲜见报道.以kPC(k?Plane Clustering)为切入点,在有标样本极端少的情况下,设计了半监督型平面聚类算法semi?kPC.考虑到L1范数较L2范数更为鲁棒的事实,在已有工作L1kPC (L1 norm kPC)的基础上,提出基于L1范数的半监督聚类方法semi?L1kPC.从每类仅有一个已标样本出发,在人工数据集和UCI数据集上的实验表明:(1)在XOR(Exclusive OR)问题上,平面型的聚类方法的聚类准确率均显著高于k?means算法,因为k?means无法利用平面特性;(2)在引入少量监督信息后,半监督型聚类方法semi?kPC和semi?L1kPC比其他聚类方法的聚类准确率更高;(3)采用L1范数的semi?L1kPC比semi?kPC的鲁棒性更好.

关键词: 半监督聚类, 平面分布, 鲁棒性, L1范数度量

Abstract:

Unsupervised learning machine,typically generated from plane?shaped data clustering,has been widely applied in the fields of machine learning,data mining. Instead of point?prototype in classic k?means or FCM,the so?called plane clustering methods aim to seek multiple plane?prototype as cluster centers,and then group the data into clusters by minimizing the distance between fitting planes and data points in their corresponding clusters. There has been increasing interest in plane?based methods in the family of data clustering in last decades,including kPC (k Plane Clustering),PPC (Proximal Plane Clustering) and TWSVC (Twin Support Vector Clustering). However,it is rarely reported how to utilize plane characteristics to design semi?supervised plane clustering algorithms,even in the era of the rising of semi?supervised learning. Moreover, due to the fact that L1 norm is more robust than L2 norm,in this paper,we propose a robust semi?supervised plane clustering method based on L1 norm,termed as semi?L1kPC for shortly. Compared with the state?to?the?art methods,the advantages of our proposed lie in three folds: (1) similar to kPC, it has clear geometrical intuition. That is,the data is grouped into clusters by minimizing the point?to?plane distance measured by infinite norm,the dual of L1 norm. (2) The algorithm is designed on the small number of labeled samples, even only need ONE labeled sample per class. (3) The leading problem can be solved by linear programming,rather than eigenvalue problems in kPC,quadratic programming problems in TWSVC. The experimental comparisons on artificial and benchmark datasets show that: (1) on exclusive XOR problem,the clustering accuracies of plane?prototype methods are higher than that of point?prototype ones. (2) When introducing fewer supervised information,i.e.,labeled samples,our proposed semi?kPC and semi?L1kPC outperform the foresaid unsupervised methods in cluster accuracy. (3) As for kPC itself,semi?L1kPC receives more robustness than L2 norm based semi?kPC.

Key words: semi?supervised clustering, plane distribution, robustness, L1 norm metric

中图分类号: 

  • TP391

图1

数据集的原始分布(a)和k?means (b),semi?kPC (c),semi?L1kPC (d)三种方法在XOR上的聚类效果"

图2

六种算法的鲁棒性测试"

表1

UCI数据集的基本信息"

数据集 样本数 维数 类别信息
Ecoil 332 6 8
Glass 214 9 6
Haberman 306 3 2
Vowel 528 10 11
Liver 345 6 2
Monk1 432 6 2
PID 768 8 2
User 403 5 4
Mushroom 8124 22 2

表2

七种聚类方法的准确率比较"

数据集 k?means kPC L1kPC TWSVC FRTWSVC semi?kPC semi?L1kPC 无监督平均值 半监督平均值
Ecoil 71.17 50.80 75.00 84.97 87.32 89.25 94.87 74.52 92.06
Glass 25.00 62.65 56.67 65.56 66.92 72.45 64.00 62.95 68.23
Haberman 75.00 54.57 70.81 61.26 62.54 73.50 63.33 62.30 68.42
Vowel 22.73 83.13 87.50 83.09 83.25 53.13 56.67 84.24 54.90
Liver 51.28 50.31 62.32 51.32 51.54 50.31 63.64 53.87 56.98
Monk1 50.00 49.88 54.55 50.50 52.36 57.68 64.29 51.82 60.99
PID 59.52 58.80 67.74 54.43 55.94 68.80 72.62 59.23 70.71
User 44.23 62.09 45.98 61.17 68.36 55.65 51.16 59.40 53.41
Mushroom 85.00 89.40 90.94 89.49 90.24 90.40 92.45 90.02 91.43

图3

有标样本的个数对聚类准确率的影响"

1 业巧林,许等平,张冬 . 基于深度学习特征和支持向量机的遥感图像分类. 林业工程学报,2019,4(02):119-125.
Ye Q L,Xu D P,Zhang D. Remote sensing image classification based on deep learning features and support vector machine. Journal of Forestry Engineering,2019,4(02):119-125.
2 许博鸣,刘晓峰,业巧林 等 . 面向移动平台的深度学习复杂场景目标识别应用. 陕西师范大学学报(自然科学版),2019,47(05):10-15.
Xu B M , Liu X F , Ye Q L ,et al . A deep learning based object detection application for mobile platform in complex scenes. Journal of Shaanxi Normal University (Nature Science Edition),2019,47(5):10-15.
3 刘建伟,刘媛,罗雄麟 . 半监督学习方法. 计算机学报,2015,38(8):1592-1617.
Liu J W,Liu Y,Luo X L. Semi?supervised learning methods. Chinese Journal of Computers,2015,38(8):1592-1617.
4 Li Y F , Kwok J T , Zhou Z H . Semi?supervised learning using label mean∥Proceedings of the 26th Annual International Conference on Machine Learning.Montreal,Canada:ACM Press,2009:633-640.
5 张云斌,张春梅,周千琪 等 . 基于L1范数和k近邻叠加图的半监督分类算法. 模式识别与人工智能,2016,29(9):850-855.
Zhang Y B , Zhang C M , Zhou Q Q ,et al . Semi?supervised classification algorithm based on L1?norm and KNN superposition graph. Pattern Recognition and Artificial Intelligence,2016,29(9):850-855.
6 马蕾,汪西莉 . 基于支持向量机协同训练的半监督回归. 计算机工程与应用,2011,47(3):177-180.
Ma L,Wang X L. (Semi?supervised regression based on support vector machine co?training. Computer Engineering and Application,2011,47(3):177-180.
7 吕峰,柴变芳,李文斌 等 . 一种主动半监督K?means聚类算法的改进策略. 南京师范大学学报(工程技术版),2018,18(2):56-62.
Lü F , Chai B F , Li W B ,et al . An Improved strategy of active semi?supervision k?means clustering algorithm. Journal of Nanjing Normal University (Engineering and Technology Edition),2018,18(2):56-62.
8 方玲,陈松灿 . 结合特征偏好的半监督聚类学习. 计算机科学与探索,2015,9(1):105-111.
Fang L,Chen S C. Semi?supervised clustering learning combined with feature preferences. Journal of Frontiers of Computer Science & Technology,2015,9(1):105-111.
9 张春涛,郭皎,徐家良 . 基于稀疏表示的半监督降维方法. 计算机工程与应用,2011,47(20):181-183,187.
Zhang C T,Guo J,Xu J L. Semi?supervised dimensionality reduction based on sparsity representation. Computer Engineering and Applications,2011,47(20):181-183,187.
10 Basu S , Banerjee A , Mooney R J . Semi?supervised clustering by seeding∥Machine Learning,Proceedings of the Nineteenth International Conference. Sydney,Australia:University of New South Wales,2002.
11 高滢,刘大有,齐红 等 . 一种半监督K均值多关系数据聚类算法. 软件学报,2008,19(11):2814-2821.
Gao Y , Liu D Y , Qi H ,et al . Semi?supervised k?means clustering algorithm for multi?type relational data. Journal of Software,2008,19(11):2814-2821.
12 Bradley P S , Mangasarian O L . K?plane clustering. Journal of Global Optimization,2000,16(1):23-32.
13 Fung G M , Mangasarian O L . Multicategory proximal support vector machine classifiers. Machine Learning,2005,59(1-2):77-97.
14 Mangasarian O L , Wild E W . Multisurface proximal support vector machine classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence,2006,28(1):69-74.
15 Jayadeva, Khemchandani R , Chandra S . Twin support vector machines for pattern classification. IEEE Transactions on Pattern Analysis and Machine Intelligence,2007,29(5):905-910.
16 Wang Z , Shao Y H , Bai L ,et al . Twin support vector machine for clustering. IEEE Transactions on Neural Networks and Learning Systems,2015,26(10):2583-2588.
17 Ye Q L , Zhao H H , Li Z C ,et al . L1?norm distance minimization?based fast robust twin support vector k?plane clustering. IEEE Transactions on Neural Networks and Learning Systems,2018,29(9):4494-4503.
18 徐庆伶,汪西莉 . 一种基于支持向量机的半监督分类方法. 计算机技术与发展,2010,20(10):115-117,121.
Xu Q L,Wang X L. A Novel semi?supervised classification method based on SVM. Computer Technology and Development,2010,20(10):115-117,121.
19 杨绪兵,潘志松,陈松灿 . 半监督型广义特征值最接近支持向量机. 模式识别与人工智能,2009,22(3):349-353.
Yang X B,Pan Z S,Chen S C. Semi?supervised proximal support vector machine via generalized eigenvalues. Pattern Recognition and Artificial Intelligence,2009,22(3):349-353.
20 Rastogi R , Pal A . Fuzzy semi?supervised weighted linear loss twin support vector clustering. Knowledge?Based Systems,2019,165:132-148.
21 寇振宇,杨绪兵,张福全 等 . L1范数最大间隔分类器设计. 南京师范大学学报(自然科学版),2018,41(4):59-64.
Kou Z Y , Yang X B , Zhang F Q ,et al . Design of L1 norm Maximum margin classifier. Journal of Nanjing Normal University (Natural science Edition),2018,41(4):59-64.
22 Yang H X , Yang X B , Zhang F Q ,et al . Infinite norm large margin classifier. International Journal of Machine Learning and Cybernetics,2019,doi:10.1007/s13042?018?0881?y .
doi: 10.1007/s13042?018?0881?y
23 Halkidi M , Batistakis Y , Vazirgiannis M . On clustering validation techniques. Journal of Intelligent Information Systems,2001,17(2-3):107-145.
[1] 柴变芳,魏春丽,曹欣雨,王建岭. 面向网络结构发现的批量主动学习算法[J]. 南京大学学报(自然科学版), 2019, 55(6): 1020-1029.
[2] 卢文凯*,景丽萍*,杨 柳 . 截断式鲁棒非负矩阵分解算法[J]. 南京大学学报(自然科学版), 2016, 52(4): 714-.
[3] 涂 臻*,卢 晶 . 散射条件下小尺度扬声器阵列声聚焦算法鲁棒性研究[J]. 南京大学学报(自然科学版), 2016, 52(2): 382-.
[4]  常瑜1.2** ,梁吉业1, 2,高嘉伟1,2,杨静1·2
.  一种基于Seeds集和成对约束的半监督聚类算法*[J]. 南京大学学报(自然科学版), 2012, 48(4): 405-411.
[5]  申 彦**,宋顺林,朱玉全
.  一种基于半监督的大规模数据集聚类算法*
[J]. 南京大学学报(自然科学版), 2011, 47(4): 372-382.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 刘瑞红,王晖,陶农建. 电位调制下单纳米颗粒的等离激元成像研究[J]. 南京大学学报(自然科学版), 2019, 55(5): 813 -818 .
[2] 李勤,陆现彩,张立虎,程永贤,刘鑫. 蒙脱石层间阳离子交换的分子模拟[J]. 南京大学学报(自然科学版), 2019, 55(6): 879 -887 .
[3] 朱荀,刘国强,丁华平,沈庆宏. 一种通过支持向量机对交通拥堵情况进行分类的方法[J]. 南京大学学报(自然科学版), 2020, 56(2): 278 -283 .