Efficient Parallel Set-Similarity Joins Using MapReduce

Authors: 
Vernica, Rares; Carey, Michael J.; Li, Chen
Author: 
Vernica, R
Li, C
Carey, M

In this paper we study how to efficiently perform set-simi-
larity joins in parallel using the popular MapReduce frame-
work. We propose a 3-stage approach for end-to-end set-
similarity joins. We take as input a set of records and output
a set of joined records based on a set-similarity condition.
We efficiently partition the data across nodes in order to
balance the workload and minimize the need for replication.
We study both self-join and R-S join cases, and show how to
carefully control the amount of data kept in main memory
on each node. We also propose solutions for the case where,
even if we use the most fine-grained partitioning, the data
still does not fit in the main memory of a node. We report
results from extensive experiments on real datasets, synthet-
ically increased in size, to evaluate the speedup and scaleup
properties of the proposed algorithms using Hadoop.

Year: 
2010
Venue: 
SIGMOD 2010
URL: 
http://asterix.ics.uci.edu/pub/sigmod10-vernica-long.pdf
Citations range: 
n/a
AttachmentSize
http://asterix.ics.uci.edu/pub/sigmod10-vernica.pdf691.9 KB