Sliding window

Multi-pass sorted neighborhood blocking with MapReduce

Authors: 
Kolb, L; Thor, A; Rahm, E

Cloud infrastructures enable the efficient parallel execution of
data-intensive tasks such as entity resolution on large datasets.
We investigate challenges and possible solutions of using the
MapReduce programming model for parallel entity resolution using
Sorted Neighborhood blocking (SN). We propose and evaluate two
efficient MapReduce-based implementations for single- and
multi-pass SN that either use multiple MapReduce jobs or apply a
tailored data replication. We also propose an automatic data
partitioning approach for multi-pass SN to achieve load balancing. Our

Year: 
2011

Parallel Sorted Neighborhood Blocking with MapReduce

Authors: 
Kolb, L; Thor, A; Rahm, E

Cloud infrastructures enable the efficient parallel execution of data-intensive
tasks such as entity resolution on large datasets. We investigate challenges and
possible solutions of using the MapReduce programming model for parallel entity
resolution. In particular, we propose and evaluate two MapReduce-based
implementations for Sorted Neighborhood blocking that either use multiple
MapReduce jobs or apply a tailored data replication.

Year: 
2011
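
For readers unfamiliar with the technique named in both entries, the following is a minimal single-machine sketch of Sorted Neighborhood blocking: records are sorted by a blocking key, and a window of fixed size slides over the sorted order so that only records inside the same window are paired for comparison. It is not the MapReduce implementation from either paper. The function names `sorted_neighborhood_pairs` and `multi_pass_pairs`, the sample records, and the choice of keys are assumptions made for illustration only.

```python
from typing import Callable, Iterable, Sequence, Set, Tuple

Pair = Tuple[int, int]


def sorted_neighborhood_pairs(
    records: Sequence[str],
    key: Callable[[str], object],
    window: int,
) -> Set[Pair]:
    """Return candidate pairs (as index pairs) for one blocking key.

    Hypothetical helper: sorts record indices by the blocking key and
    pairs each record with the next (window - 1) records in that order.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    # Sort indices rather than records so pairs refer to original positions.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: Set[Pair] = set()
    for pos, i in enumerate(order):
        # Slice covers the remaining records of the window starting at pos.
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass_pairs(
    records: Sequence[str],
    keys: Iterable[Callable[[str], object]],
    window: int,
) -> Set[Pair]:
    """Union of the candidate pairs from one pass per blocking key."""
    pairs: Set[Pair] = set()
    for key in keys:
        pairs |= sorted_neighborhood_pairs(records, key, window)
    return pairs


# Illustrative data only; not taken from the papers.
records = ["kolb", "thor", "rahm", "kolbe"]
single = sorted_neighborhood_pairs(records, key=lambda s: s, window=2)
multi = multi_pass_pairs(records, keys=[lambda s: s, len], window=2)
```

With a window of size w over n records, a single pass yields at most (n - w/2)(w - 1) candidate pairs, instead of the n(n - 1)/2 pairs of an exhaustive comparison. A multi-pass run repeats this with several keys and merges the results, so that records missed under one sort order can still be paired under another.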