Demystifying JOIN Algorithms
Joins are an important class of operations in databases and data pipelines. There are many kinds of joins in SQL databases, but this post will focus only on the INNER JOIN.¹
It is possible to make a JOIN between two large tables very efficient when the result is small. Since that is a common scenario in practice, JOINs are an interesting topic of research.
The naive JOIN algorithm can be very slow because JOINs are essentially nested loops.
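In Python-flavored pseudo-code, the naive nested-loop approach might look like this (rows as dicts; the names are illustrative, not from the original snippet):

```python
def nested_loop_join(table_a, table_b, key_a, key_b):
    """Naive join: for each row of table_a, sequentially scan all of table_b."""
    result = []
    for row_a in table_a:          # outer loop: full scan of table_a
        for row_b in table_b:      # inner loop: full scan of table_b, every time
            if row_a[key_a] == row_b[key_b]:
                result.append((row_a, row_b))
    return result
```

The inner loop runs once per row of the outer table, so the total cost grows with the product of the two table sizes.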
A better alternative to the algorithm above is the HashJoin. In the first step of a HashJoin, a hash table of one of the JOIN operands — table_a — is built. Then a cheap lookup can replace the expensive innermost loop doing a sequential scan on the whole of table_b for each row of table_a.
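A minimal Python sketch of the two-phase HashJoin, assuming the hash table is built over table_a and probed with table_b (names are illustrative):

```python
def hash_join(table_a, table_b, key_a, key_b):
    # Build phase: hash table over table_a, keyed by the join column.
    hashed = {}
    for row_a in table_a:
        hashed.setdefault(row_a[key_a], []).append(row_a)
    # Probe phase: one cheap hash lookup per row of table_b replaces
    # the full sequential scan the nested-loop version would do.
    result = []
    for row_b in table_b:
        for row_a in hashed.get(row_b[key_b], []):
            result.append((row_a, row_b))
    return result
```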
The code snippet above is very common in hand-written backend and UI code stitching data together to show something useful to users. So it should be no surprise that this is what databases commonly do to solve the same problem.
A Practical Example
Let’s look at an e-commerce database containing two tables:
The product table contains all the products the store sells.
The cart_item table contains the products added to each user's cart.
A useful view of a user's cart will need more information about each product than just its id (e.g. the name of the product). That is when a JOIN comes in handy.
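To make this concrete, here is a hypothetical, much-simplified version of the two tables and the joined cart view as plain Python data (the real schema in the example surely has more columns):

```python
# Hypothetical rows; illustrative only.
products = [
    {"id": 1, "name": "Keyboard"},
    {"id": 2, "name": "Mouse"},
]
cart_items = [
    {"user_id": 7, "product_id": 2},
    {"user_id": 7, "product_id": 1},
]

# The cart view needs the product name, not just its id,
# so the two relations are joined on the product id.
cart_view = [
    {"user_id": item["user_id"], "product_name": product["name"]}
    for item in cart_items
    for product in products
    if item["product_id"] == product["id"]
]
```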
PostgreSQL might use a HashJoin to execute this query. The output of EXPLAIN shows the query plan.
The sequential scan on cart_item is necessary to build the hash table mapping cart_item.product_id to the cart_item record. That is the first for loop in the pseudo-code above. The second for loop is the sequential scan on product that queries the hash table with each product.id it finds.
If you mess with the PostgreSQL query planner really hard², you can convince it to use a NestedLoop join based on nested sequential scans of the two tables.
And this is the most naive way to
JOIN two tables as shown in the first code
snippet — two nested loops doing sequential scans.
The hash table built in the HashJoin is a form of indexing (a temporary index) that speeds up the second for loop. Hash tables allow efficient random queries (i.e. keys used in a sequence of lookups do not obey any particular order) for a specific key value.
An issue with the HashJoin query plan above is that the hash table is built at the time of the query. Indexing in the context of databases more commonly means that an index is already built when a query comes in (a persistent index). This requires knowing which indexes are more suitable for the queries that will come later. Creating all possible indexes is not a good strategy because there would be many of them, and indexes are not free. When a table changes, its indexes have to be updated to reflect the changes, so each index adds an extra cost to writes.
Let's disable HashJoin³ and see how our query is planned.
This query plan's execution is similar to the second for loop in the HashJoin algorithm, but instead of using a temporary hash table that needs to be built at the time of the query, it uses the readily available primary key index of the product table on its id column.
Indexing with B-Trees
Classical relational databases were designed at a time when RAM was expensive and small. A lot of effort went into making sure that queries could be performed without bringing too much data into memory. Optimizing disk access patterns is crucial for good performance of algorithms operating on data stored on disk. Unlike main memory, disks are very bad at serving random access, since the cost of seeking to the right position is considerable.⁴
Sequentially scanning the randomly ordered product_ids in the cart_item table⁵ and querying the disk-resident product_pkey1 index every time is not very good, because it might require the disk to seek to random locations far too often.
The product_pkey1 index stored on disk is not a hash table, but a B-Tree.
B-Trees (and other types of ordered indexes) allow more interesting queries than the specific key lookup provided by hash tables. In addition to key lookups, B-Trees allow range queries.
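A sorted array queried with binary search captures the flavor: point lookups and range queries over ordered keys. A sketch using Python's standard bisect module (the keys are made up):

```python
import bisect

# Stand-in for the sorted keys of a B-Tree index.
keys = [3, 7, 12, 18, 25, 31, 40]

def contains(key):
    # Point query, O(log n): the kind of lookup a hash table also serves.
    i = bisect.bisect_left(keys, key)
    return i < len(keys) and keys[i] == key

def range_query(lo, hi):
    # Range query over [lo, hi]: cheap on ordered keys, but impossible
    # to answer efficiently with a plain hash table.
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]
```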
B-Trees shine when a range of keys is read sequentially. This access pattern minimizes the number of disk seeks and number of page loads into memory (storage blocks containing sorted keys).
Let’s create a B-Tree index for the product_id column in the cart_item table and see what happens to the query plan.⁶
Now the two indexes are being used. This is very good, because the scan for product_id on the cart_product_id_idx index will not happen in the random order the rows appear in the cart_item table. The lookups for product by product id will happen in the order of the product_pkey1 index. This maximizes the use of the page cache. Entry 43 of the product_pkey1 B-Tree is very likely to be in the same page as entry 42, and so on. This locality is what guarantees the Nested Loop will not be quadratic in time. The cost of the innermost lookup will be amortized: some lookups will be slower than others when a page has to be loaded, but on average they will take constant time. This is what the algorithm looks like in pseudo-code.
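A Python sketch of that indexed Nested Loop, with a plain dict standing in for the persistent product_pkey1 index (the real index is a B-Tree on disk):

```python
def index_nested_loop_join(cart_items, product_index):
    # Scan cart_items in product_id order (what cart_product_id_idx provides),
    # so successive index lookups land on neighboring B-Tree pages.
    result = []
    for item in sorted(cart_items, key=lambda row: row["product_id"]):
        product = product_index.get(item["product_id"])  # indexed lookup, no scan
        if product is not None:
            result.append((item, product))
    return result
```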
Joining Sorted Relations
MergeJoin can take advantage of the two JOIN operands (relations) being sorted to produce output efficiently. The pseudo-code for MergeJoin is not as straightforward as HashJoin's, but it is not very complicated.
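A Python sketch of MergeJoin over two lists already sorted on the join key (duplicate keys on the right side are handled by re-scanning the matching run; names are illustrative):

```python
def merge_join(left, right, key):
    """Join two lists of dicts, each already sorted on `key`."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1          # advance whichever side has the smaller key
        elif lk > rk:
            j += 1
        else:
            # Emit a pair for every right-side row sharing this key, then
            # advance the left side; j stays put so the next left row with
            # the same key re-scans the matching run.
            k = j
            while k < len(right) and right[k][key] == lk:
                result.append((left[i], right[k]))
                k += 1
            i += 1
    return result
```

Apart from re-scanning runs of duplicate keys, every iteration consumes at least one record from one side, so the total work is proportional to the sum of the two input sizes plus the output size.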
On every loop iteration at least one record is consumed from one of the two operands. This guarantees that MergeJoin will take time proportional to the sum of the sizes of the two sorted relations.
Since B-Tree indexes are sorted, they lend themselves very well to MergeJoins. This is a possible query plan for our example query. The query planner is taking advantage of the two sorted indexes for the MergeJoin.
If a persistent index is not available (or the data is small enough to fit completely in memory), PostgreSQL might sort one or both operands to be able to run a MergeJoin on two sorted relations. This is a possible query plan when the cart_product_id_idx index is not available.
- JOINs are essentially nested loops.
- Multiple techniques are used to improve the performance of JOINs.
- When datasets are small or no appropriate index is being maintained, query planners will likely decide to use memory to make queries run faster. The query executor can build hash tables (or sort records in memory for a MergeJoin, or…) to more efficiently join the two operands according to the join condition in the query.
- Maintaining index data structures like B-Trees up to date allows the Nested Loop algorithm to run much faster even when the dataset doesn’t conveniently fit in memory. Indexed JOINs take time proportional to the size of the result and are largely unaffected by the size of the tables.
There’s much more to JOINs and techniques to speed them up, but these are the basics that will allow you to start thinking about how data layout and choice of algorithms can greatly affect query performance.
Understanding the three classical JOIN algorithms – NestedLoop, HashJoin, and MergeJoin – how they take advantage of different indexes, and how they behave when there is no index can give you a lot of insight into how databases run queries.
- Query evaluation techniques for large databases by Goetz Graefe, 1993.
- Modern B-Tree Techniques by Goetz Graefe, 2011.
- New algorithms for join and grouping operations by Goetz Graefe, 2012.
- Readings in Database Systems, 5th Edition by Peter Bailis, Joseph M. Hellerstein, and Michael Stonebraker, 2015.
1. The INNER keyword is optional in SQL queries; JOIN is equivalent to INNER JOIN. ↩
2. For example, by setting the enable_hashjoin and related query planner options to off. ↩
3. HashJoin can be disabled with set enable_hashjoin = off; in PostgreSQL. By default, the query planner is smart enough to realize that our example table is so small that it is better to build a temporary in-memory hash table than to load the, possibly not cached, disk-persistent primary key index. So HashJoin is disabled to illustrate the behavior when tables do not fit in RAM. ↩
5. The rows in the cart_item table follow the order in which users added products to their carts. ↩
6. To force the query planner not to use memory for everything, I also had to set enable_sort = off;. This wouldn’t be necessary if the tables didn’t fit in RAM. ↩