Journal of Nanjing University (Natural Sciences) ›› 2011, Vol. 47 ›› Issue (4): 372-382.
申 彦**,宋顺林,朱玉全
Shen Yun1, Song Shun-Lin1, Zhu Yu-Quan1
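Abstract: As the data sets to be mined keep growing, earlier clustering algorithms are no longer suitable because they must scan the original data set multiple times.
Algorithms that complete clustering in a single scan of the original data set have therefore become a primary research goal. However, existing algorithms for large-scale data sets
are sensitive to their initialization parameters and to the distribution of the original data, so their clustering results are of low quality and unstable. To address this, we draw on
the idea of semi-supervised clustering and propose a semi-supervised single-pass K-means algorithm based on a labeled set. The algorithm uses a labeled set resident in main memory
to guide the clustering process, which further improves both clustering efficiency and the quality of the clustering results. The effectiveness of the algorithm is verified on synthetically generated data sets and on the 1998 KDD data set.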
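As a rough illustration of the idea summarized in the abstract, the sketch below shows a generic seeded single-pass K-means in Python. It is not the authors' algorithm, whose details the abstract does not give. It assumes that the labeled set fits in memory, that labels are integers 0..k-1, that every cluster has at least one labeled point, and that Euclidean distance is used. The function name `seeded_single_pass_kmeans` and its parameters are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def seeded_single_pass_kmeans(labeled, stream, k):
    """Cluster `stream` in one pass, starting from a labeled set.

    labeled: list of (point, label) pairs held in memory (assumption).
    stream:  iterable of points, read exactly once.
    k:       number of clusters.
    """
    dim = len(labeled[0][0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k

    # Initialise each centroid as the mean of its labeled points,
    # instead of picking random starting centroids.
    for point, label in labeled:
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += point[d]
    if min(counts) == 0:
        raise ValueError("every cluster needs at least one labeled point")
    centroids = [[s / c for s in row] for row, c in zip(sums, counts)]

    assignments = []
    for x in stream:  # single scan of the unlabeled data
        j = min(range(k), key=lambda i: dist(x, centroids[i]))
        assignments.append(j)
        counts[j] += 1
        # Incremental mean update: no need to revisit earlier points.
        for d in range(dim):
            centroids[j][d] += (x[d] - centroids[j][d]) / counts[j]
    return centroids, assignments
```

Seeding the centroids from labeled points removes the dependence on random initialization, and the incremental mean update lets each unlabeled point be discarded after it is read, so the original data set is scanned only once.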