Demystifying JOIN Algorithms
Joins are an important class of operations in databases and data pipelines. There are many kinds of joins in SQL databases, but this post will focus only on the INNER JOIN.¹
It is possible to make a JOIN between two large tables very efficient when the result is small. Since that is a common scenario in practice, JOINs are an interesting topic of research.
The naive JOIN algorithm can be very slow because JOINs are essentially nested loops.
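In Python-flavored pseudo-code, the naive nested-loop approach might look like this (rows as dicts; the names are illustrative, not from the original snippet):

```python
def nested_loop_join(table_a, table_b, key_a, key_b):
    """Naive join: for each row of table_a, sequentially scan all of table_b."""
    result = []
    for row_a in table_a:          # outer loop: full scan of table_a
        for row_b in table_b:      # inner loop: full scan of table_b, every time
            if row_a[key_a] == row_b[key_b]:
                result.append((row_a, row_b))
    return result
```

The inner loop runs once per row of the outer table, so the total cost grows with the product of the two table sizes.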
A better alternative to the algorithm above is the HashJoin. In the first step of a HashJoin, a hash table of one of the JOIN operands — table_a — is built. Then a cheap lookup can replace the expensive innermost loop doing a sequential scan on the whole of table_b for each row of table_a.
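A minimal Python sketch of the two-phase HashJoin, assuming the hash table is built over table_a and probed with table_b (names are illustrative):

```python
def hash_join(table_a, table_b, key_a, key_b):
    # Build phase: hash table over table_a, keyed by the join column.
    hashed = {}
    for row_a in table_a:
        hashed.setdefault(row_a[key_a], []).append(row_a)
    # Probe phase: one cheap hash lookup per row of table_b replaces
    # the full sequential scan the nested-loop version would do.
    result = []
    for row_b in table_b:
        for row_a in hashed.get(row_b[key_b], []):
            result.append((row_a, row_b))
    return result
```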
The code snippet above is very common in hand-written backend and UI code stitching data together to show something useful to users. So it should be no surprise that this is what databases commonly do to solve the same problem.
A Practical Example
Let’s look at an e-commerce database containing two tables:
The product table contains all the products the store sells.
The cart_item table contains the products added to each user's cart.
A useful view of a user's cart will need more information about each product than just its id (e.g. the name of the product). That is when a JOIN comes in handy.
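To make this concrete, here is a hypothetical, much-simplified version of the two tables and the joined cart view as plain Python data (the real schema in the example surely has more columns):

```python
# Hypothetical rows; illustrative only.
products = [
    {"id": 1, "name": "Keyboard"},
    {"id": 2, "name": "Mouse"},
]
cart_items = [
    {"user_id": 7, "product_id": 2},
    {"user_id": 7, "product_id": 1},
]

# The cart view needs the product name, not just its id,
# so the two relations are joined on the product id.
cart_view = [
    {"user_id": item["user_id"], "product_name": product["name"]}
    for item in cart_items
    for product in products
    if item["product_id"] == product["id"]
]
```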
PostgreSQL might use a HashJoin to execute this query. The output of EXPLAIN shows the query plan.
The sequential scan on cart_item is necessary to build the hash table mapping cart_item.product_id to the cart_item record. That is the first for loop in the pseudo-code above. The second for loop is the sequential scan on product that queries the hash table with each product.id it finds.
If you mess with the PostgreSQL query planner really hard², you can convince it to use a NestedLoop join based on nested sequential scans of the two tables.
And this is the most naive way to
JOIN two tables as shown in the first code
snippet — two nested loops doing sequential scans.
The hash table built in the HashJoin is a form of indexing (a temporary index) that speeds up the second for loop. Hash tables allow efficient random queries (i.e. keys used in a sequence of lookups do not obey any particular order) for a specific key value.
An issue with the HashJoin query plan above is that the hash table is built at the time of the query. Indexing in the context of databases more commonly means that an index is already built when a query comes in (a persistent index). This requires knowing which indexes are more suitable for the queries that will come later. Creating all possible indexes is not a good strategy because there would be many of them, and indexes are not free. When a table changes, its indexes have to be updated to reflect the changes, so each index adds an extra cost to writes.
Let's disable HashJoin³ and see how our query is planned.
This query plan's execution is similar to the second for loop in the HashJoin algorithm, but instead of using a temporary hash table that needs to be built at the time of the query, it uses the readily available primary key index of the product table on its id column.
Indexing with B-Trees
Classical relational databases were designed at a time when RAM was expensive and small. A lot of effort went into making sure that queries could be performed without bringing too much data into memory. Optimizing disk access patterns is crucial for good performance of algorithms operating on data stored on disk. Unlike main memory, disks are very bad at serving random access, since the cost of seeking to the right position is considerable.⁴
Sequentially scanning the randomly ordered product_ids in the cart_item table⁵ and querying the disk-resident product_pkey1 index every time is not very good, because it might require the disk to seek to random locations far too often.
The product_pkey1 index stored on disk is not a hash table, but a B-Tree.
B-Trees (and other types of ordered indexes) allow more interesting queries than the specific key lookup provided by hash tables. In addition to key lookups, B-Trees allow range queries.
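A sorted array queried with binary search captures the flavor: point lookups and range queries over ordered keys. A sketch using Python's standard bisect module (the keys are made up):

```python
import bisect

# Stand-in for the sorted keys of a B-Tree index.
keys = [3, 7, 12, 18, 25, 31, 40]

def contains(key):
    # Point query, O(log n): the kind of lookup a hash table also serves.
    i = bisect.bisect_left(keys, key)
    return i < len(keys) and keys[i] == key

def range_query(lo, hi):
    # Range query over [lo, hi]: cheap on ordered keys, but impossible
    # to answer efficiently with a plain hash table.
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]
```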
B-Trees shine when a range of keys is read sequentially. This access pattern minimizes the number of disk seeks and number of page loads into memory (storage blocks containing sorted keys).
Let’s create a B-Tree index for the product_id column in the cart_item table and see what happens to the query plan.⁶
Now the two indexes are being used. This is very good, because the scan for product_id on the cart_product_id_idx index will not happen in the random order the rows appear in the cart_item table. The lookups for product by product id will happen in the order of the product_pkey1 index. This maximizes the use of the page cache. Entry 43 of the product_pkey1 B-Tree is very likely to be in the same page as entry 42, and so on. This locality is what guarantees the Nested Loop will not be quadratic in time. The cost of the innermost lookup will be amortized: some lookups will be slower than others when a page has to be loaded, but on average they will take constant time. This is what the algorithm looks like in pseudo-code.
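A Python sketch of that indexed Nested Loop, with a plain dict standing in for the persistent product_pkey1 index (the real index is a B-Tree on disk):

```python
def index_nested_loop_join(cart_items, product_index):
    # Scan cart_items in product_id order (what cart_product_id_idx provides),
    # so successive index lookups land on neighboring B-Tree pages.
    result = []
    for item in sorted(cart_items, key=lambda row: row["product_id"]):
        product = product_index.get(item["product_id"])  # indexed lookup, no scan
        if product is not None:
            result.append((item, product))
    return result
```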
Joining Sorted Relations
MergeJoin can take advantage of the two JOIN operands (relations) being sorted to produce output efficiently. The pseudo-code for MergeJoin is not as straightforward as HashJoin's, but it is not very complicated.
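A Python sketch of MergeJoin over two lists already sorted on the join key (duplicate keys on the right side are handled by re-scanning the matching run; names are illustrative):

```python
def merge_join(left, right, key):
    """Join two lists of dicts, each already sorted on `key`."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1          # advance whichever side has the smaller key
        elif lk > rk:
            j += 1
        else:
            # Emit a pair for every right-side row sharing this key, then
            # advance the left side; j stays put so the next left row with
            # the same key re-scans the matching run.
            k = j
            while k < len(right) and right[k][key] == lk:
                result.append((left[i], right[k]))
                k += 1
            i += 1
    return result
```

Apart from re-scanning runs of duplicate keys, every iteration consumes at least one record from one side, so the total work is proportional to the sum of the two input sizes plus the output size.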
On every loop iteration at least one record is consumed from one of the two operands. This guarantees that MergeJoin will take time proportional to the sum of the sizes of the two sorted relations.
Since B-Tree indexes are sorted, they lend themselves very well to MergeJoins. This is a possible query plan for our example query. The query planner is taking advantage of the two sorted indexes for the MergeJoin.
If a persistent index is not available (or the data is small enough to fit completely in memory), PostgreSQL might sort one or both operands to be able to run a MergeJoin on two sorted relations. This is a possible query plan when the cart_product_id_idx index is not available.
- JOINs are essentially nested loops.
- Multiple techniques are used to improve the performance of JOINs.
- When datasets are small or no appropriate index is being maintained, query planners will likely decide to use memory to make queries run faster. The query executor can build hash tables (or sort records in memory for a MergeJoin, or…) to more efficiently join the two operands according to the join condition in the query.
- Maintaining index data structures like B-Trees up to date allows the Nested Loop algorithm to run much faster even when the dataset doesn’t conveniently fit in memory. Indexed JOINs take time proportional to the size of the result and are largely unaffected by the size of the tables.
There’s much more to JOINs and techniques to speed them up, but these are the basics that will allow you to start thinking about how data layout and choice of algorithms can greatly affect query performance.
Understanding the three classical JOIN algorithms – NestedLoop, HashJoin, and MergeJoin – how they take advantage of different indexes, and how they behave when there is no index can give you a lot of insight into how databases run queries.
- Query evaluation techniques for large databases by Goetz Graefe, 1993.
- Modern B-Tree Techniques by Goetz Graefe, 2011.
- New algorithms for join and grouping operations by Goetz Graefe, 2012.
- Readings in Database Systems, 5th Edition by Peter Bailis, Joseph M. Hellerstein, and Michael Stonebraker, 2015.
1. The INNER keyword is optional in SQL queries; JOIN is equivalent to INNER JOIN. ↩
2. For example, by setting the enable_hashjoin and related query planner options to off. ↩
3. HashJoin can be disabled with set enable_hashjoin = off; in PostgreSQL. By default, the query planner is smart enough to realize that our example table is so small that it is better to build a temporary in-memory hash table than to load the, possibly not cached, disk-persistent primary key index. So HashJoin is disabled to illustrate the behavior when tables do not fit in RAM. ↩
5. The rows in the cart_item table follow the order in which users added products to their carts. ↩
6. To force the query planner not to use memory for everything, I also had to set enable_sort = off;. This wouldn’t be necessary if the tables didn’t fit in RAM. ↩