In this paper we study how to eﬃciently perform set-simi-
larity joins in parallel using the popular MapReduce frame-
work. We propose a 3-stage approach for end-to-end set-
similarity joins. We take as input a set of records and output
a set of joined records based on a set-similarity condition.
We eﬃciently partition the data across nodes in order to
balance the workload and minimize the need for replication.
We study both self-join and R-S join cases, and show how to
carefully control the amount of data kept in main memory
on each node. We also propose solutions for the case where,
even if we use the most ﬁne-grained partitioning, the data
still does not ﬁt in the main memory of a node. We report
results from extensive experiments on real datasets, synthet-
ically increased in size, to evaluate the speedup and scaleup
properties of the proposed algorithms using Hadoop.