Multi-pass sorted neighborhood blocking with MapReduce

Authors: 
Kolb, L; Thor, A; Rahm, E

Abstract: Cloud infrastructures enable the efficient parallel
execution of data-intensive tasks such as entity resolution on
large datasets. We investigate challenges and possible solutions
in using the MapReduce programming model for parallel entity
resolution with Sorted Neighborhood blocking (SN). We propose
and evaluate two efficient MapReduce-based implementations for
single- and multi-pass SN that either use multiple MapReduce jobs
or apply a tailored data replication. We also propose an automatic
data partitioning approach for multi-pass SN to achieve load
balancing. Our evaluation based on real-world datasets shows the
high efficiency and effectiveness of the proposed approaches.
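
For readers unfamiliar with SN, the following is a minimal single-machine sketch of the general sorted neighborhood idea, not the MapReduce implementations proposed in the paper: entities are sorted by a blocking key and only entities within a sliding window of size w are paired for comparison. A multi-pass variant repeats this with several keys and unions the candidate pairs. All function names, the sample records, and the key functions are hypothetical and chosen for illustration only.

```python
def sorted_neighborhood_pairs(records, key_func, window):
    """One SN pass: return candidate pairs as (i, j) index tuples, i < j."""
    if window < 2:
        raise ValueError("window must be at least 2")
    # Sort record indices by the blocking key (stable for equal keys).
    order = sorted(range(len(records)), key=lambda i: key_func(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Pair each entity with the next (window - 1) entities in sort order.
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass_pairs(records, key_funcs, window):
    """Multi-pass SN: union of the candidate pairs of one pass per key."""
    pairs = set()
    for key_func in key_funcs:
        pairs |= sorted_neighborhood_pairs(records, key_func, window)
    return pairs


if __name__ == "__main__":
    # Hypothetical records in "surname firstname" form.
    records = ["smith john", "smyth john", "jones mary", "smith jon"]
    by_surname = lambda r: r.split()[0][:3]
    by_firstname = lambda r: r.split()[1][:3]
    print(sorted(multi_pass_pairs(records, [by_surname, by_firstname], 2)))
```

A second pass with a different key can recover pairs that a single key misses; in this sample, "smith john" and "smyth john" are not adjacent when sorted by surname prefix but become neighbors when sorted by first-name prefix.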

Year: 
2011
Venue: 
CSRD 2011
URL: 
http://www.springerlink.com/index/57H4677326NH4G27.pdf
Citations: 
0
Attachment: 
multi_pass_sn_with_mr.pdf (739.14 KB)