Journal of Nanjing University (Natural Sciences), 2017, Vol. 53, Issue 6: 991.
Liu Hong, Shen Derong*, Kou Yue, Nie Tiezheng, Yu Ge
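Abstract: Entity resolution is the task of deciding whether two different records from one or more data sources describe the same entity. It is sometimes also called record linkage, and in data integration it underlies operations such as data cleaning, deduplication, and similarity joins. Entity resolution techniques are widely applicable to areas such as census processing, citation matching, Web search, data cleaning, and plagiarism detection. In the real world, however, an entity's attributes change over time: two records with different attribute values do not necessarily refer to different entities, and two records with identical attribute values do not necessarily refer to the same entity. Temporal record linkage matches timestamped records that describe the same entity. Existing approaches to temporal record linkage rely on temporal models to capture entity evolution, but these models yield low matching accuracy when predicting how entities evolve, and the associated clustering is computationally expensive. This paper therefore proposes a finer-grained model for capturing entity evolution and a new two-stage fast clustering algorithm. Experiments on three real-world datasets show that the proposed temporal model captures entity evolution in finer detail and that the proposed clustering algorithm groups records describing the same entity more quickly and accurately, improving both the accuracy and the efficiency of resolution.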
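The point that differing attribute values need not imply different entities can be illustrated with a small sketch. This is not the temporal model proposed in the paper, whose details the abstract does not give; it is a hypothetical exponential-decay scheme in which a mismatch on a changeable attribute counts less against a match as the time gap between two records grows. The `Record` fields, the `half_life` parameter, and the equal attribute weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A timestamped record (hypothetical schema for illustration)."""
    name: str
    affiliation: str
    year: int


def time_aware_similarity(a: Record, b: Record, half_life: float = 5.0) -> float:
    """Score two records in [0, 1], discounting affiliation mismatches over time.

    Assumption: the chance that an affiliation stays unchanged halves
    every `half_life` years. Only disagreement is decayed here; agreement
    is taken at face value to keep the sketch short.
    """
    gap = abs(a.year - b.year)
    persist = 0.5 ** (gap / half_life)  # probability the value is unchanged

    name_sim = 1.0 if a.name == b.name else 0.0
    if a.affiliation == b.affiliation:
        aff_sim = 1.0
    else:
        # A mismatch after a long gap is weak evidence of different entities.
        aff_sim = 1.0 - persist

    # Equal weights for the two attributes (an arbitrary choice).
    return 0.5 * name_sim + 0.5 * aff_sim


same_year = time_aware_similarity(
    Record("Wang Wei", "Univ A", 2000), Record("Wang Wei", "Univ B", 2000)
)
ten_years = time_aware_similarity(
    Record("Wang Wei", "Univ A", 2000), Record("Wang Wei", "Univ B", 2010)
)
print(same_year, ten_years)
```

With these assumed settings, two records that differ in affiliation score higher when they are ten years apart than when they carry the same timestamp, which is the behavior a temporal model is meant to capture.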