Block-based Load Balancing for Entity Resolution with MapReduce

Kolb, L; Thor, A; Rahm, E

The effectiveness and scalability of MapReduce-based
implementations of complex data-intensive tasks depend on an
even redistribution of data between map and reduce tasks.
In the presence of skewed data, sophisticated redistribution
approaches thus become necessary to achieve load balancing
among all reduce tasks to be executed in parallel. For
the complex problem of entity resolution with blocking, we
propose BlockSplit, a load balancing approach that supports
blocking techniques to reduce the search space of entity res-


Multi-pass sorted neighborhood blocking with MapReduce

Kolb, L; Thor, A; Rahm, E

Cloud infrastructures enable the efficient parallel
execution of data-intensive tasks such as entity resolution on
large datasets. We investigate challenges and possible solutions
of using the MapReduce programming model for parallel
entity resolution using Sorted Neighborhood blocking
(SN). We propose and evaluate two efficient MapReduce-based
implementations for single- and multi-pass SN that
either use multiple MapReduce jobs or apply a tailored data
replication. We also propose an automatic data partitioning
approach for multi-pass SN to achieve load balancing. Our
