Journal of Nanjing University (Natural Sciences) ›› 2017, Vol. 53 ›› Issue (6): 991–.


A record linkage algorithm based on entity evolution

Liu Hong, Shen Derong*, Kou Yue, Nie Tiezheng, Yu Ge

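  • Publication date: 2017-11-26 Release date: 2017-11-26
  • About the authors: School of Computer Science and Engineering, Northeastern University, Shenyang, 110169
  • Funding:
    National Natural Science Foundation of China (61472070, 61672142). Received: 2017-09-16. *Corresponding author, E-mail: shendr@mail.neu.edu.cn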

A temporal record matching based on entity evolution

Liu Hong, Shen Derong*, Kou Yue, Nie Tiezheng, Yu Ge

  • Online: 2017-11-26 Published: 2017-11-26
  • About author: School of Computer Science and Engineering, Northeastern University, Shenyang, 110169, China

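Summary: Entity resolution is the task of deciding whether two different records, in one or more data sources, describe the same entity. It is sometimes also called record linkage, and in data integration it underlies operations such as data cleaning, deduplication and similarity joins. Entity resolution techniques are widely applicable to fields such as census processing, citation matching, Web search, data cleaning and plagiarism detection. In the real world, however, an entity's attributes change over time: two records with different attribute values do not necessarily correspond to different entities, and two records with identical attribute values do not necessarily correspond to the same entity. Temporal record linkage is the task of matching timestamped records that describe the same entity. Existing approaches rely on temporal models to capture entity evolution, but these models yield low matching accuracy when predicting how entities evolve, and the clustering step has high computational complexity. To address this, the paper proposes a finer-grained model for capturing entity evolution and a new two-stage fast clustering algorithm. Experiments on three real-world datasets show that the proposed temporal model captures entity evolution in finer detail and that the proposed clustering algorithm clusters records describing the same entity more quickly and accurately, improving both the accuracy and the efficiency of resolution.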

Abstract: Entity resolution, also known as record linkage, is the task of judging whether two different records in one or more data sources refer to the same entity. In data integration, entity resolution is widely used for data cleaning, deduplication and similarity joins. It can also be applied to census processing, citation matching, web search, data cleaning and plagiarism detection. In reality, however, entity attributes change over time. That is, two records with different attribute values do not necessarily belong to different entities and, conversely, two records with the same attribute values do not necessarily refer to the same entity. This gives rise to the problem of temporal record linkage, which aims to link timestamped records that describe the same entity. Most state-of-the-art methods rely on temporal models to capture entity evolution. However, these temporal models achieve low matching accuracy, and the associated clustering has a high computational cost. In this paper, we first present a finer-grained temporal model for capturing entity evolution. Then, a two-stage fast clustering algorithm is presented. Finally, experimental results on three real-world datasets demonstrate that our temporal model captures entity evolution more precisely, and that our clustering algorithm is faster and more accurate in solving temporal record linkage.
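The observation that attribute evidence weakens as the time gap between two records grows can be illustrated with a small sketch. This is not the temporal model or the clustering algorithm proposed in the paper, whose details the abstract does not give. It is a generic time-decay weighting, and the `Record` fields, the `half_life` parameter and the scoring formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout: one stable attribute, one evolving attribute.
    name: str
    affiliation: str
    year: int


def evidence_weight(gap_years: float, half_life: float = 5.0) -> float:
    """Weight of attribute evidence: 1.0 at gap 0, halving every half_life years."""
    return 0.5 ** (gap_years / half_life)


def temporal_similarity(a: Record, b: Record, half_life: float = 5.0) -> float:
    """Score in [0, 1]; 0.5 means the evolving attribute tells us nothing.

    Assumption: names must match exactly. Agreement or disagreement on the
    affiliation moves the score away from 0.5, but less so for large gaps.
    """
    if a.name != b.name:
        return 0.0
    w = evidence_weight(abs(a.year - b.year), half_life)
    if a.affiliation == b.affiliation:
        return 0.5 + 0.5 * w  # agreement is strong evidence only when recent
    return 0.5 - 0.5 * w      # disagreement is a weak penalty after a long gap


if __name__ == "__main__":
    r1 = Record("Wang Wei", "University A", 2005)
    r2 = Record("Wang Wei", "University B", 2005)
    r3 = Record("Wang Wei", "University B", 2010)
    print(temporal_similarity(r1, r2))  # same year, different affiliation
    print(temporal_similarity(r1, r3))  # five years apart, different affiliation
```

With this weighting, a mismatch in the same year counts fully against a match, while the same mismatch five years later is only a partial penalty, since the person may have moved in between. A real system would learn such weights from labeled data rather than fix a half-life by hand.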
