南京大学学报(自然科学版) ›› 2017, Vol. 53 ›› Issue (3): 483–.

• • 上一篇    下一篇

 基于近期最远遍历的支撑点选择

 李兴亮1,毛 睿2*   

  • 出版日期:2017-05-30 发布日期:2017-05-30
  • 作者简介: 1.中国科学技术大学计算机科学与技术学院,合肥,230026;
    2.深圳大学计算机与软件学院,广东省普及型高性能计算机重点实验室,深圳,518060
  • 基金资助:
     基金项目:国家863项目(2015AA015305),国家自然科学基金(U1301252,61471243),广东省重点实验室项目(2012A061400024),深圳市基础研究项目(JCYJ20140418095735561,JCYJ20150731160834611,JCYJ20150625101524056),深港创新圈项目(SGLH20131010163759789),广东省教育厅项目(2015KQNCX143)
    收稿日期:2017-03-02
    *通讯联系人,E­mail:mao@szu.edu.cn

 Pivot selection on recent farthest traversal

 Li Xingliang1,Mao Rui2*   

  • Online:2017-05-30 Published:2017-05-30
  • About author: 1.School of Computer Science and Technology,University of Science and Technology of China,Hefei,230026,China;
    2.Guangdong Province Key Laboratory of Popular High Performance Computers,College of Computer Science and Software Engineering,Shenzhen University,Shenzhen,518060,China

摘要:  度量空间数据管理分析方法把数据抽象成度量空间中的点,具有高度的通用性,是应对大数据多样性挑战的有效手段之一.由于度量空间没有坐标,很多数学工具无法直接使用,一般以数据到参考点(也称作支撑点)的距离作为坐标.支撑点的好坏对于度量空间数据管理分析的性能发挥着关键性的影响.最远优先遍历(Farthest First Traversal,FFT)可以选出数据拐角的点,具有线性的时间复杂度和空间复杂度,是使用最广泛的支撑点选取算法之一.但是,实验表明最好的支撑点往往不是最拐角的点,故FFT很难选出最好的支撑点.提出近期最远遍历(Recent Farthest Traversal,RFT)算法,只以近期的几个支撑点来选择下一个支撑点,能够更快地选出性能更优的支撑点.同时,实验表明FFT还可以在数据内部均匀抽样.提出支撑点集合选择算法(Pivot Set Selection,PSS),可以一次性选出所有支撑点.以RFT选择候选集,以FFT选择评价集,选出支撑点并构建相似性索引,PSS使得索引构建代价大大降低,索引性能得到一定提升.实验表明,RFT选出好的支撑点的速度远快于FFT,准确率高于FFT,而FFT的抽样效果良好.

Abstract:  Metric­space data management and analysis method abstracts data to the points of metric space has a high generality,and is one of effective approaches for the variety challenge of big data.Since coordinates do not exist in metric space,many mathematical tools can’t be used directly and it is common to consider the distances from a data object to pre­selected reference points(or called pivots) as its coordinates.The quality of pivots is critical to the performance of data management and analysis in metric space.Farthest First Traversal(FFT) algorithm is able to select corners of the data set with linear time and space complexity,and is one of the most widely used pivot selection methods.However,experiments show that the best pivots are usually not the most corners,so it is very hard to select best pivots by FFT.This work proposes the Recent Farthest Traversal(RFT) algorithm.It selects next pivot only based on several recent pivots and is able to select better pivots faster.At the same time,experiments also show that FFT can sample points evenly in the data set.This work also proposes the Pivot Set Selection(PSS) algorithm,which is able to get all pivots just with one selection.With exploiting RFT to select a candidate set and exploiting FFT to sample an evaluation set,PSS can select a better set of pivots to create a similarity index.As a result,the construction cost of similarity index is greatly reduced and the index performance has some improvements.Experiments show that RFT selects good pivots far more rapidly than FFT and the accuracy of RFT is higher than FFT,while FFT generates a good sample of data.

 [1] Roman S.Advanced linear algebra.The 2nd Edition.New York,USA:Springer,2005,283-286.
[2] Iqbal Q,Aggarwal J K.Image retrieval via isotropic and anisotropic mappings.Pattern Recognition Journal,2002,35(12):2673-2686.
[3] Chavez E,Navarro G,Baeza­Yates R,et al.Searching in metric spaces.ACM Computing Surveys(CSUR),2001,33:273-321.
[4] Ramakrishnan S R,Mao R,Nakorchevskiy A A,et al.A fast coarse filtering method for protein identification by mass spectrometry.Bioinformatics,2006,22(12):1524-1531.
[5] Ciaccia P,Patella M,Zezula P.M­tree:An efficient access method for similarity search in metric spaces.In:The 23rd International Conference on Very Large Data Bases(VLDB’97).Athens,Greece:Morgan Kaufmann Publishers Inc.,1997:426-435.
[6] Navarro G.Searching in metric spaces by spatial approximation.VLDB Journal,1999,11(1):141-148.
[7] Amato G,Esuli A,Falchi F.Pivot selection strategies for permutation­based similarity search.Springer,2013,8199:91-102.
[8] Mao R,Miranker W L,Miranker D P.Pivot selection:Dimension reduction for distance­based indexing.Journal of Discrete Algorithm,2012,13:32-46.
[9] Ma K W,Liu Y J,Xu H L,et al.Pivot selection methods based on covariance and correlation for metric­space indexing.2012 National Conference on Information Technology and Computer Science (CITCS2012).Lanzhou,China:Atlantis Press,2012,672-677.
[10] Bustos B,Navarro G,Chavez E.Pivot selection techniques for proximity searching in metric spaces.Pattern Recognition Letters,2003,24(14):2357-2366.
[11] Gonzalez T F.Clustering to minimize the maximum intercluster distance.Theoretical Computer Science,1985,38:293-306.
[12] Berman A,Shapiro L G.Selecting good keys for triangle­inequality­based pruning algorithms.In:Proceedings of the 1998 International Workshop on Content­based Access of Image and Video Databases(CAIVD’98).Bombay,India:IEEE,1998:12-19.
[13] Vleugels J,Veltkamp R C.Efficient image retrieval through vantage objects.Pattern Recognition,2002,35(1):69-80.
[14] Mao R,Xu W,Singh N,et al.An assessment of a metric space database index to support sequence homology.IEEE Symposium on Bioinformatics & Bioengineering,2003,14(5):375-382.
[15] Henniga C,Lateckic L J.The choice of vantage objects for image retrieval.Pattern Recognition,2003,36(9):2187-2196.
[16] Hochbaum D S,Shmoys D B.A best possible heuristic for the k­center problem.Mathematics of Operational Research,1985,10(2):180-184.
[17] Yianilos P N.Data structures and algorithms for nearest neighbor search in general metric spaces.ACM­SIAM Symposium on Discrete Algorithms,1993:311-321.
[18] Mao R,Zhang P H,Li X L,et al.Pivot selection for metric­space indexing.International Journal of Machine Learning and Cybernetics,2016,7(2):311-323.
[19] The address of UMAD project:https://github.com/ruimao/UMAD,2015-11.
[20] Mao R,Iqbal Q,Liu W,et al.Case Study:Distance­based image retrieval in the MoBIoS DBMS.In:The Proceedings of The 5th International Conference on Computer and Information Technology(CIT2005).Shanghai,China:IEEE,2005:49-55.
[21] Needleman S B,Wunsch C D.A general method applicable to the search for similarities in the amino acid sequence of two proteins.Journal of Molecular Biology,2015,48(3):443-453.
[22] Xu W,Miranker D P.A metric model of amino acid substitution.Bioinformatics,2004,20(8):1214-1221.
[23] Navarro G.Analyzing metric space indexes:What for?In:The Proceedings of the 2nd International Conference on Similarity Search & Applications(SISAP2009).Washington D C,USA:IEEE Computer Society,2009:3-10.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!