Demystifying JOIN Algorithms
Joins are an important class of operations in databases and data pipelines. There
are many kinds of joins in SQL databases, but this post will
focus only on the INNER JOIN. 1
It is possible to make the JOIN between two large tables very efficient if the
result is going to be small. That being a common scenario in practice, makes
JOINs an interesting topic of research.
A naive JOIN algorithm can be very slow because JOINs are essentially nested
loops.
1 2 3 4 | |
A better alternative to the algorithm above is the HashJoin. In the first step
of a HashJoin, a hash table of one
of the JOIN operands — table_a — is built. Then a cheap lookup can
replace the expensive innermost loop doing a sequential scan on the whole
table_b for each row of table_a.
1 2 3 4 5 6 7 | |
This code snippet above is very common in hand-written backend and UI code stitching data together to show something useful to users. So it should be no surprise that this is what databases commonly do to solve the same problem.
A Practical Example
Let’s look at an e-commerce database containing two tables: product and
cart_item.
The product table contains all the product the store sells.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | |
The cart_item contains products added to the cart of all the users.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | |
An useful view of the cart of a user will need more information about the product
other than just the id (e.g. the name of the product). That is when a JOIN
becomes necessary.
1 2 3 4 5 6 | |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | |
PostgreSQL might use a HashJoin to execute this query. The EXPLAIN command
shows the query plan.
1 2 3 4 5 | |
The sequential scan on cart_item is necessary to build the hash table mapping
cart_item.product_id to the cart_item record. That is the first for loop
in the pseudo-code above. The second for loop is the sequential scan on
product that queries the hash table with each product.id it finds.
If you mess with the PostgreSQL query planner really hard 2, you can
convince it to use the NestedLoop join based on nested sequential scans of
both tables.
1 2 3 4 | |
And this is the most naive way to JOIN two tables as shown in the first code
snippet — two nested loops doing sequential scans.
Indexing
The hash table built in the HashJoin is a form of indexing (temporary index)
that speeds up the second for loop. Hash tables allow efficient random queries
(i.e. keys used in a sequence of lookups do not obbey any particular order) for
an specific key value.
An issue with the HashJoin query plan above is that the hash table is built
at the time of the query. Indexing in the context of databases more commonly
means that an index is already built when a query comes in (persistent index).
This requires knowing which index is more suitable for the queries that will
come later. Creating all possible indexes is not a good strategy because they
would be many and indexes are not free. When the table changes, indexes have to
be updated to reflect the changes. So each index adds an extra cost to writes
and deletes.
Let’s disable HashJoin 3 and see how our JOIN
query is planned.
1 2 3 4 | |
This query plan execution is similar to the second for loop in the HashJoin
algorithm, but instead of using a temporary hash table that needs to be built at
the time of the query, it uses the readily available primary
key index of the product table on
the id column.
1 2 3 4 | |
Indexing with B-Trees
Classical relational databases were designed at a time when RAM was expensive and small. A lot of effort went into making sure that queries could be performed without bringing too much data into memory. Optimizing the disk access patterns is crucial for good performance of the algorithms operating on data stored in disks. Unlike main memory, disk are very bad at serving random access since the cost of seeking to the right position is considerable. 4
Sequentially scanning the randomly sorted product_ids in the cart_item
table 5 and querying the product_pkey1 index stored in disk every
time is not very good because it might require the disk to seek to random
locations far too often.
The product_pkey1 index stored in disk is not a hash table, but a B-Tree.
B-Trees (and other types of ordered indexes) allow more interesting queries than
the specific key lookup provided by hash tables. In addition to key lookups,
B-trees allow range queries.
1 2 3 4 5 6 7 8 9 10 | |
B-Trees shine when a range of keys is read sequentially. This access pattern minimizes the number of disk seeks and number of page loads into memory (storage blocks containing sorted keys).
Let’s create a B-Tree index for the product_id column in the cart_item table
and see what happens to the query plan. 6
1 2 3 4 | |
Now the two indexes are being used. This is very good, because the scan for
product_id on the cart_product_id_idx index will not be in the random order
they appear in the cart_item table. The lookups for product by the product
id will happen in the order of the product_pkey1 index. This will maximize
the use of the page cache. Entry 43 of the product_pkey1 B-Tree is very
likely to be in the same page of the entry 42 and so on. This locality is
what guarantees the Nested Loop will not be
quadratic
in time. The cost of the innermost lookup will be amortized. Some will be
slower than others when a page has to be loaded, but on average they will take a
constant time. This is how the algorithm looks like in pseudo-code.
1 2 3 | |
Joining Sorted Relations
MergeJoin can take advantage of the two JOIN operands (relations) being
sorted to efficiently produce output. The pseudo-code for the MergeJoin is not
as straightforward as NestedLoop and HashJoin, but is not very complicated.
1 2 3 4 5 6 7 8 9 10 11 12 13 | |
On every loop iteration at least one record is consumed from one of the two
operands. This guarantees that MergeJoin will take time proportional to the
sum of the size of the two sorted JOIN operands.
Since B-Tree indexes are sorted, they lend themselves very well to MergeJoin.
This is a possible query plan to our example query. The query planner is taking
advantage of two sorted indexes for the MergeJoin – cart_product_id_idx and
product_pkey1.
1 2 3 4 | |
If a persistent index is not available (or data is small enough to be completely
in-memory), PostgreSQL might sort one or both operands to be able to run a
MergeJoin on two sorted relations. This is possible query plan when the
cart_product_id_idx index is not available.
1 2 3 4 5 6 | |
Conclusion
JOINs are essentially nested loops.- Multiple techniques are used to improve the performance of
JOINs. - When datasets are small or no appropriate index is being maintained, query
planners will likely decide to use memory to make queries run faster. The
query executor can build hash tables (or sort records in memory, or…) and
use
HashJoinorMergeJointo more efficiently join the two operands according to the join-condition in the query. - Maintaining index data structures like B-Trees up to date allows
Nested Loopalgorithm to run much faster even when the dataset doesn’t conveniently fit in memory. IndexedJOINs take time proportional to the size of the result and are not affected by the size of the tables.
There’s much more about JOINs and techniques to speed them up, but these are the
basics that will allow you to start thinking about how data layout and choice of
algorithms can greatly affect query performance.
Understanding the three classical JOIN algorithms – NestedLoop, HashJoin,
MergeJoin – how they take advantage of different indexes and how they behave
when there is no index can give you a lot of insight on how databases run
queries.
Further Reading
- Query evaluation techniques for large databases by Goetz Graefe, 1993.
- Modern B-Tree Techniques by Goetz Graefe, 2011.
- New algorithms for join and grouping operations by Goetz Graefe, 2012.
- Readings in Database Systems, 5th Edition by Peter Bailis, Joseph M. Hellerstein, and Michael Stonebraker, 2015.
-
The
INNERkeyword is optional in SQL queries.JOINalone meansINNER JOIN. ↩ -
Set
enable_material,enable_mergejoin,enable_bitmapscan,enable_indexscan,enable_hashjoinquery planner options tooff. ↩ -
HashJoincan be disabled withset enable_hashjoin = off;in PostgreSQL. By default, the query planner is smart enough to realize that our example table is so small that it is better to build a temporary in-memory hash table than to load the, possibly not cached, disk-persistent primary key index. SoHashJoinis disabled to illustrate the behavior when tables do not fit in RAM. ↩ -
The rows in the
cart_itemtable follow the order at which users added products to cart. ↩ -
To force the query planner to not use memory for everything, I had to also
set enable_sort = off;. This wouldn’t be necessary if the tables didn’t fit in RAM. ↩