Efficient Parallel Set-Similarity Joins Using MapReduce

Vernica, Rares; Carey, Michael J.; Li, Chen

In this paper we study how to efficiently perform set-simi-
larity joins in parallel using the popular MapReduce frame-
work. We propose a 3-stage approach for end-to-end set-
similarity joins. We take as input a set of records and output
a set of joined records based on a set-similarity condition.
We efficiently partition the data across nodes in order to
balance the workload and minimize the need for replication.
We study both self-join and R-S join cases, and show how to
carefully control the amount of data kept in main memory

Syndicate content