Demystifying JOIN Algorithms
Joins are an important class of operations in databases and data pipelines. There
are many kinds of joins in SQL databases, but this post will
focus only on the INNER JOIN
. 1
It is possible to make the JOIN
between two large tables very efficient if the
result is going to be small. That being a common scenario in practice, makes
JOIN
s an interesting topic of research.
A naive JOIN
algorithm can be very slow because JOIN
s are essentially nested
loops.
1 2 3 4 |
|
A better alternative to the algorithm above is the HashJoin
. In the first step
of a HashJoin
, a hash table of one
of the JOIN
operands — table_a
— is built. Then a cheap lookup can
replace the expensive innermost loop doing a sequential scan on the whole
table_b
for each row of table_a
.
1 2 3 4 5 6 7 |
|
This code snippet above is very common in hand-written backend and UI code stitching data together to show something useful to users. So it should be no surprise that this is what databases commonly do to solve the same problem.
A Practical Example
Let’s look at an e-commerce database containing two tables: product
and
cart_item
.
The product
table contains all the product the store sells.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
The cart_item
contains products added to the cart of all the users.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
An useful view of the cart of a user will need more information about the product
other than just the id
(e.g. the name
of the product). That is when a JOIN
becomes necessary.
1 2 3 4 5 6 |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
PostgreSQL might use a HashJoin
to execute this query. The EXPLAIN
command
shows the query plan.
1 2 3 4 5 |
|
The sequential scan on cart_item
is necessary to build the hash table mapping
cart_item.product_id
to the cart_item
record. That is the first for
loop
in the pseudo-code above. The second for
loop is the sequential scan on
product
that queries the hash table with each product.id
it finds.
If you mess with the PostgreSQL query planner really hard 2, you can
convince it to use the NestedLoop
join based on nested sequential scans of
both tables.
1 2 3 4 |
|
And this is the most naive way to JOIN
two tables as shown in the first code
snippet — two nested loops doing sequential scans.
Indexing
The hash table built in the HashJoin
is a form of indexing (temporary index)
that speeds up the second for
loop. Hash tables allow efficient random queries
(i.e. keys used in a sequence of lookups do not obbey any particular order) for
an specific key value.
An issue with the HashJoin
query plan above is that the hash table is built
at the time of the query. Indexing in the context of databases more commonly
means that an index is already built when a query comes in (persistent index).
This requires knowing which index is more suitable for the queries that will
come later. Creating all possible indexes is not a good strategy because they
would be many and indexes are not free. When the table changes, indexes have to
be updated to reflect the changes. So each index adds an extra cost to writes
and deletes.
Let’s disable HashJoin
3 and see how our JOIN
query is planned.
1 2 3 4 |
|
This query plan execution is similar to the second for loop in the HashJoin
algorithm, but instead of using a temporary hash table that needs to be built at
the time of the query, it uses the readily available primary
key index of the product
table on
the id
column.
1 2 3 4 |
|
Indexing with B-Trees
Classical relational databases were designed at a time when RAM was expensive and small. A lot of effort went into making sure that queries could be performed without bringing too much data into memory. Optimizing the disk access patterns is crucial for good performance of the algorithms operating on data stored in disks. Unlike main memory, disk are very bad at serving random access since the cost of seeking to the right position is considerable. 4
Sequentially scanning the randomly sorted product_id
s in the cart_item
table 5 and querying the product_pkey1
index stored in disk every
time is not very good because it might require the disk to seek to random
locations far too often.
The product_pkey1
index stored in disk is not a hash table, but a B-Tree.
B-Trees (and other types of ordered indexes) allow more interesting queries than
the specific key lookup provided by hash tables. In addition to key lookups,
B-trees allow range queries.
1 2 3 4 5 6 7 8 9 10 |
|
B-Trees shine when a range of keys is read sequentially. This access pattern minimizes the number of disk seeks and number of page loads into memory (storage blocks containing sorted keys).
Let’s create a B-Tree index for the product_id
column in the cart_item
table
and see what happens to the query plan. 6
1 2 3 4 |
|
Now the two indexes are being used. This is very good, because the scan for
product_id
on the cart_product_id_idx
index will not be in the random order
they appear in the cart_item
table. The lookups for product
by the product
id
will happen in the order of the product_pkey1
index. This will maximize
the use of the page cache. Entry 43
of the product_pkey1
B-Tree is very
likely to be in the same page of the entry 42
and so on. This locality is
what guarantees the Nested Loop
will not be
quadratic
in time. The cost of the innermost lookup will be amortized. Some will be
slower than others when a page has to be loaded, but on average they will take a
constant time. This is how the algorithm looks like in pseudo-code.
1 2 3 |
|
Joining Sorted Relations
MergeJoin
can take advantage of the two JOIN
operands (relations) being
sorted to efficiently produce output. The pseudo-code for the MergeJoin
is not
as straightforward as NestedLoop
and HashJoin
, but is not very complicated.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
On every loop iteration at least one record is consumed from one of the two
operands. This guarantees that MergeJoin
will take time proportional to the
sum of the size of the two sorted JOIN
operands.
Since B-Tree indexes are sorted, they lend themselves very well to MergeJoin
.
This is a possible query plan to our example query. The query planner is taking
advantage of two sorted indexes for the MergeJoin
– cart_product_id_idx
and
product_pkey1
.
1 2 3 4 |
|
If a persistent index is not available (or data is small enough to be completely
in-memory), PostgreSQL might sort one or both operands to be able to run a
MergeJoin
on two sorted relations. This is possible query plan when the
cart_product_id_idx
index is not available.
1 2 3 4 5 6 |
|
Conclusion
JOIN
s are essentially nested loops.- Multiple techniques are used to improve the performance of
JOIN
s. - When datasets are small or no appropriate index is being maintained, query
planners will likely decide to use memory to make queries run faster. The
query executor can build hash tables (or sort records in memory, or…) and
use
HashJoin
orMergeJoin
to more efficiently join the two operands according to the join-condition in the query. - Maintaining index data structures like B-Trees up to date allows
Nested Loop
algorithm to run much faster even when the dataset doesn’t conveniently fit in memory. IndexedJOIN
s take time proportional to the size of the result and are not affected by the size of the tables.
There’s much more about JOIN
s and techniques to speed them up, but these are the
basics that will allow you to start thinking about how data layout and choice of
algorithms can greatly affect query performance.
Understanding the three classical JOIN
algorithms – NestedLoop
, HashJoin
,
MergeJoin
– how they take advantage of different indexes and how they behave
when there is no index can give you a lot of insight on how databases run
queries.
Further Reading
- Query evaluation techniques for large databases by Goetz Graefe, 1993.
- Modern B-Tree Techniques by Goetz Graefe, 2011.
- New algorithms for join and grouping operations by Goetz Graefe, 2012.
- Readings in Database Systems, 5th Edition by Peter Bailis, Joseph M. Hellerstein, and Michael Stonebraker, 2015.
-
The
INNER
keyword is optional in SQL queries.JOIN
alone meansINNER JOIN
. ↩ -
Set
enable_material
,enable_mergejoin
,enable_bitmapscan
,enable_indexscan
,enable_hashjoin
query planner options tooff
. ↩ -
HashJoin
can be disabled withset enable_hashjoin = off;
in PostgreSQL. By default, the query planner is smart enough to realize that our example table is so small that it is better to build a temporary in-memory hash table than to load the, possibly not cached, disk-persistent primary key index. SoHashJoin
is disabled to illustrate the behavior when tables do not fit in RAM. ↩ -
The rows in the
cart_item
table follow the order at which users added products to cart. ↩ -
To force the query planner to not use memory for everything, I had to also
set enable_sort = off;
. This wouldn’t be necessary if the tables didn’t fit in RAM. ↩