南京大学学报(自然科学版) ›› 2011, Vol. 47 ›› Issue (4): 372–382.


 一种基于半监督的大规模数据集聚类算法*

 申 彦**,宋顺林,朱玉全
  

  • 出版日期:2015-04-21 发布日期:2015-04-21
  • 作者简介: (江苏大学计算机科学与通信工程学院,镇江,212013)
  • 基金资助:
     国家科技支撑计划项目(20108A工88800),江苏省自然科学基金(BK2010331),博士研究生创新计划(CX10B016X),江苏大学高级人才基金(08JDG057)

 A clustering algorithm for scalable datasets based on semi-supervision technology

 Shen Yan1, Song Shun-Lin1, Zhu Yu-Quan1
  

  • Online:2015-04-21 Published:2015-04-21
  • About author: (1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, 212013, China)

摘要:  随着待挖掘数据集规模的不断增长,以往的聚类算法由于需要多次扫描原始数据集而不再适用.现阶段,一遍扫描原始数据集即完成聚类的算法成为了首要的研究目标.但是,现有针对大规模数据集的算法容易受到初始化参数以及原始数据集分布的影响,聚类结果质量不高,并且也不稳定.对此,吸收半监督聚类的思想,提出了基于标记集的半监督一遍扫描K均值算法,该算法利用驻留主存的标记集指导聚类过程,使得聚类效率以及聚类结果的质量得到了进一步的提高.在人工生成数据集以及1998 KDD数据集上验证了该算法的有效性.

Abstract:  As the size of datasets to be mined constantly increases, traditional clustering algorithms that require repeated scans of the original dataset are no longer suitable. Consequently, clustering algorithms that scan the original large-scale dataset only once have become a primary research target. However, existing one-scan algorithms for large-scale datasets are easily affected by their initial parameters and by the distribution of the original dataset, so the quality of their results is both low and unstable. To address this, a novel algorithm called semi-supervised labels one-scan k-means is presented in this paper, which integrates the main ideas of semi-supervised clustering. The algorithm uses a labeled set that resides in main memory to guide the clustering process, which substantially improves both the efficiency of clustering and the quality of its results. Experiments on synthetic datasets and the real-world 1998 KDD dataset also support the effectiveness of the algorithm.
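The abstract only outlines the approach, so the following is a minimal illustrative sketch of how a memory-resident labeled (seed) set can guide a single-pass k-means over a large dataset. It is not the paper's exact algorithm: the function name seeded_one_scan_kmeans, the chunked data_stream interface, and the running-mean centroid update are assumptions made for illustration only.

import numpy as np

def seeded_one_scan_kmeans(data_stream, seed_points, seed_labels, n_clusters):
    """Label-guided, single-pass k-means sketch (illustrative, not the paper's exact method)."""
    # The labeled (seed) set stays in main memory and fixes the initial centroids,
    # so clustering is guided by the labels rather than by random initialization.
    centroids = np.array([seed_points[seed_labels == k].mean(axis=0)
                          for k in range(n_clusters)])
    counts = np.array([(seed_labels == k).sum() for k in range(n_clusters)], dtype=float)

    for chunk in data_stream:  # each chunk of the large dataset is read from disk only once
        # Assign every point in the chunk to its nearest current centroid.
        dists = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Fold the chunk into the centroids with a running mean; the points are then discarded.
        for k in range(n_clusters):
            members = chunk[assign == k]
            if len(members):
                total = counts[k] + len(members)
                centroids[k] = (centroids[k] * counts[k] + members.sum(axis=0)) / total
                counts[k] = total
    return centroids

# Example: two labeled seed points per cluster guide a one-scan clustering of streamed chunks.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seeds = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
    labels = np.array([0, 0, 1, 1])
    chunks = (rng.normal(loc=[0, 0] if i % 2 == 0 else [5, 5], size=(100, 2)) for i in range(10))
    print(seeded_one_scan_kmeans(chunks, seeds, labels, n_clusters=2))

In this sketch the seed set both fixes the number of clusters and provides label-guided initial centroids, and each disk-resident chunk is read only once while the centroids are updated incrementally, mirroring the one-scan, label-guided idea described in the abstract.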
