
[1] Liu Hong, Shen Derong*, Kou Yue, et al. A temporal record matching based on entity evolution[J]. Journal of Nanjing University (Natural Sciences), 2017, 53(6): 991. [doi:10.13232/j.cnki.jnju.2017.06.001]

Record linkage algorithm based on entity evolution

Journal of Nanjing University (Natural Sciences) [ISSN: 0469-5097 / CN: 32-1169/N]

Volume:
53
Issue:
No. 6, 2017
Page:
991
Publication date:
2017-12-01

Article Info

Title:
A temporal record matching based on entity evolution
Author(s):
Liu Hong, Shen Derong*, Kou Yue, Nie Tiezheng, Yu Ge
School of Computer Science and Engineering, Northeastern University, Shenyang, 110169, China
Keywords:
entity evolution; record linkage; temporal model; clustering algorithm
Classification number:
TP181
DOI:
10.13232/j.cnki.jnju.2017.06.001
Document code:
A
Abstract:
Entity resolution, also known as record linkage, is the task of deciding whether two different records in one or more data sources describe the same entity. In data integration it underlies operations such as data cleaning, deduplication and similarity joins, and it is widely applied in census processing, citation matching, web search, data cleaning and plagiarism detection. In the real world, however, the attributes of an entity change over time: two records with different attribute values do not necessarily refer to different entities, and two records with identical attribute values do not necessarily refer to the same entity. Temporal record linkage is the problem of matching time-stamped records that describe the same entity. Existing methods rely on temporal models to capture entity evolution, but these models predict evolution with low matching accuracy, and the associated clustering has a high computational cost. This paper proposes a finer-grained temporal model for capturing entity evolution, together with a new two-stage fast clustering algorithm. Experiments on three real-world datasets show that the proposed temporal model captures entity evolution in finer detail and that the proposed clustering algorithm groups records describing the same entity faster and more accurately, improving both the accuracy and the efficiency of resolution.
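The observation that differing attribute values need not indicate different entities can be illustrated with a minimal sketch. It is not the temporal model proposed in the paper: it assumes a simple exponential decay, in which a mismatch on an evolving attribute is penalised less as the time gap between two records grows. The `Record` fields, the `rate` parameter, the equal weights and the function names are all hypothetical choices made for illustration, and the sketch covers only the disagreement side (it does not handle identical values that belong to different entities).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A time-stamped record (hypothetical schema for illustration)."""
    name: str
    affiliation: str
    year: int


def disagreement_decay(gap_years: float, rate: float = 0.3) -> float:
    """Assumed chance that an evolving attribute changed within gap_years.

    Exponential form chosen for simplicity; rate is an illustrative value.
    """
    return 1.0 - math.exp(-rate * gap_years)


def temporal_similarity(a: Record, b: Record, rate: float = 0.3) -> float:
    """Score in [0, 1]: name is treated as stable, affiliation as evolving."""
    name_score = 1.0 if a.name == b.name else 0.0
    gap = abs(a.year - b.year)
    if a.affiliation == b.affiliation:
        aff_score = 1.0
    else:
        # A mismatch counts less against the pair when the records are
        # far apart in time, because the attribute may simply have changed.
        aff_score = disagreement_decay(gap, rate)
    # Equal weights are an arbitrary assumption of this sketch.
    return 0.5 * name_score + 0.5 * aff_score
```

With this scoring, two records that share a name but differ in affiliation score higher when they are ten years apart than when they are one year apart, which is the behaviour a time-aware model is meant to capture.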



Memo:
Funding: National Natural Science Foundation of China (61472070, 61672142). Received: 2017-09-16. *Corresponding author, E-mail: shendr@mail.neu.edu.cn
Last Update: 2017-11-26