目前经典的聚类算法在内存空间有限的情况下,聚类受到时间、空间等各方面的限制,提出一种基于代表点的快速聚类算法FCBRP(fast clustering based representative points).首先,判定数据集中所有节点的属性,当节点的D临域内存在大于等于K个邻居节点时,将其定义为代表点,代表点T)临域
内所有邻居节点与该代表点之间的平均欧氏距离即为该代表点的相关密度RT),所有的代表点组成代表点集合;将所有在代表点的D临域内的节点定义为能被代表的节点,并将其进行存储;既不是代表点、又不能被其它节点所代表的节点,将其定义为噪音节点;其次,对代表点集合进行聚类,对于给定的密度
标准a,如果两个代表点满足密度相关,即两个代表点的相关密度分别乘以密度标准a后同时大于等于两者之间的欧氏距离,则将其划分到同一类簇中,通过对代表点的聚类,达到对数据的区域划分,得到所有类簇的基木形状;最后,对于被其它代表点所代表的节点,通过检测代表它们的代表点所属的类簇,判
定被代表的节点所属的类簇,对于少数位于不同类簇中的代表点的D临域内的节点,将其划分到相对距离较近的代表点所属的类簇中.实验证明,FCYRP算法对空间需求较小,效率快,精度高,鲁棒性更佳.
Abstract
The existing classical clustering algorithms’effect is not good enough by various restrictions in limited memory, such as time complexity and space complexity, this paper presents an algorithrrrfast clustering based representative points(FCBRP).Firstly, judge the attribute of the points of the dataset, if the point that has more
than K neighbors in its radius of D,define it representative point,the representative point’s related density RD is defined that the average Euclidean distance from the representative point to the neighbors of it,all the representative points constitute the set of representative. Define the point represented-point which is the neighbor of the representative points and store all of them. Define the other points noises. Secondly, cluster the set of representative, as the given density standard a,if the distance of the two representative points is both smaller than their related density RD multiplied the density standard a,define them belong one cluster, by cluster the representative points to separate the data region, in order to get the shape of the clusters. Finally, find the cluster number of the points which can be represented by other representative points, that is to say, find the cluster number of their representative points, as the represented-points those are the neighbors of representative points which are belong to deferent clusters,separate them to the cluster of representative point which are closer to the represented- point. The experiment result proved that the FCBRP algorithm is less space required,higlrcfficicncy, high precision, more robust.
{{custom_sec.title}}
{{custom_sec.title}}
{{custom_sec.content}}
参考文献
[1]Sun J G, Liu J,Zhao L Y. Clustering algo- rithms research. Journal of Software, 2008,19 (1):48-61.(孙吉贵,刘杰,赵连宇.聚类算
法研究.软件学报,2008, 19(1); 48-61).
[2]Jain A K,Flynn P J. Image segmentation using clustering. Ahuja N, Bowyer K. Advances in image Understanding; A Festchrift for Azricl
Rosenfcld. Piscataway; IEEE Press, 1996,65一83.
[3]Cadcs I,Smyth P,Mannila H. Probabilistic modeling of transactional data with applications to profiling, visualization and prediction, sig-
mod. Proceedings of the 7th ACM Special Inter-est Group on Knowledge Discovery and Data Mining. San Francisco; ACM Press,2001,37~46.
[4]Jain A K,Murty M N, Flynn P J. Data clustc ring:A review. ACM Computing Surveys,1999 ,31(3):264一323.
[5]Marques J P. Pattern recognition concepts, methods and applications.The 2nd Edition. Bei- jing:Tsinghua University Press, 2002,51一74.
[6]Huang Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. Proceedings of the Workshop on Research 1s-
sues on Data Mining and Knowledge Discovery. Tucson, 1997,146一151.
[7]Zheng M M, Ji G L. An improved density based distributed clustering. Journal of Nanjing University( Natural Sciences),2008,44(5):
536-543.(郑苗苗,吉根林.一种基于密度的分布式聚类算法.南京大学学报(自然科学),2008,44(5):536一543).
[8]Niu X Z, She K. Study of fast parallel cluste- ring partition algorithm for large data sets. Computer Science, 2012,39(1):134一137,
151.(牛新征,余垄.面向大规模数据的快速并行聚类划分算法研究.计算机科学,2012, 39(1):134一137,151).
[9]Yan K,He X M. Anonymous calculation of da- to based on local clustering. Application Re search of Computer, 2012,29(1):148一151.
(焉凯,何贤L.基于局部聚类的数据匿名化算法.计算机应用研究,2012, 29(1); 148- 151).
[10]Zhu C J,Zhang Y. Research of improved fuzzy C-means clustering algorithm. Journal of Henan University(Natural Sciences),2012,42(1):
92-95.(朱长江,张缨.模糊C一均值聚类算法的改进研究.河南大学学报(自然科学),2012,42(1):92一95).
[11]Chen Q Z,Han J H,Ghang W Y,et al. Clus- tering with the adaptive genetic algorithm. Journal of Nanjing University(Natural Sci-
ences),2005,41(5):749一754.(陈庆章,韩江洪,张维一等.采用适应性遗传算法进行数据聚类研究.南京大学学报(自然科学),2005,41 (5):749一 754).
[12]Sebastian L, Mihai L, Incremental clustering of dynamic data streams using connectivity based representative points. Data and Knowledge En-
gineering, 2009,68(1):1一27.
[13]Chen E H,Wang S F, Ning Y,et al. The de sign and implementation of clustering algorithm using representative data. Pattern Recognition
and Artificial intelligence, 2001,14(04): 417-422.陈恩红,王上飞,宁岩等.一种利用代表点的有效聚类算法设计与实现.模式识
别与人J智能,2001, 14(04); 417-422).
[14]Huang T Q, Qin X L,Wang J D. Multi-repre sentation fature tree and spatial clustering algo rithm. Computer Scicnce,2006,33(12):189一
195.(黄添强,秦小麟,王金栋.多代表点特征树与空间聚类算法.计算机科学,2006, 33(12); 189一195).
[15]Jia R Y,Geng J W, Ning Z Z, et al. Fast clus tering algorithm based on representative points. Computer Engineering and Applications,2010
46(33) : 121一123.贾瑞玉,耿锦威,宁再早等. 基于代表点的快速聚类算法.计算机工程与应用,2010 , 46 (33):121一123).
[16]Qin Q W , Liang J Y,Qian Y H. Clustering feature selection method nased on neighborhood distance. Computer Science,2012,39(1):
175-177.(秦奇伟,梁吉业,钱字华.一种基于邻域距离的聚类特征选择方法.计算机科学,2012,39(1):175一177).
{{custom_fnGroup.title_cn}}
脚注
{{custom_fn.content}}
基金
教育部高等学校博士学科点专项科研基金(20110095110010)
{{custom_fund}}