SQL inner and outer joins - Eli Bendersky's website

If you store data in a relational database, it's good practice to have the data normalized. This typically requires splitting data to multiple tables that are logically connected through keys. As a result, most non-trivial queries require joins on multiple tables to gather all the interesting columns. This post is a brief tour of SQL joins, focusing on the differences between inner and outer joins.

Cross join

To understand SQL joins, it's best to start with cross joins, because they are the simplest combination of tables supported by SQL. A cross join occurs when we write:

select * from t1, t2;

Throughout this post, we'll be working with two sample tables called t1 and t2:

    t1                         t2

 id |   name              code | id
----+----------          ------+----
  1 | Joanne              x    |  2
  2 | Sam                 z    |  3
  3 | Emmanuel            a    |  7
  4 | Brayden

The SQL code to create these tables and run all the examples in this post is available here. All the code was tested on PostgreSQL 9.5.

Running the cross join on these tables results in:

 id |   name   | code | id
----+----------+------+----
  1 | Joanne   | x    |  2
  2 | Sam      | x    |  2
  3 | Emmanuel | x    |  2
  4 | Brayden  | x    |  2
  1 | Joanne   | z    |  3
  2 | Sam      | z    |  3
  3 | Emmanuel | z    |  3
  4 | Brayden  | z    |  3
  1 | Joanne   | a    |  7
  2 | Sam      | a    |  7
  3 | Emmanuel | a    |  7
  4 | Brayden  | a    |  7

The cross join performs a cross product (or Cartesian product) between the two tables. For each row in t1, it adds all possible rows from t2. The resulting table has all the columns of t1 and of t2, and its number of rows is the product of numbers of rows in t1 and t2.

I find cross joins to be a good starting point because they make inner joins much easier to understand. They are also the basis of joins in relational algebra.

SQL also supports a more explicit way to invoke a cross join:

select * from t1 cross join t2;

This is equivalent to the first statement.

Inner join

An important component of SQL queries is filtering results with a where clause. For example, we can create the following (slightly nonsensical) filter on the cross join shown earlier:

select * from t1, t2 where t2.code = 'x' and t1.name like '%d%'

Resulting in:

 id |   name   | code | id
----+----------+------+----
  4 | Brayden  | x    |  2

One filter that's particularly useful when crossing two tables is checking whether there's a match on some column value. Both t1 and t2 have an id column; let's assume these IDs refer to the same thing, and that we want to find all combinations of rows from the two tables where the IDs match. We can do:

select * from t1, t2 where t1.id = t2.id;

Resulting in:

 id |   name   | code | id
----+----------+------+----
  2 | Sam      | x    |  2
  3 | Emmanuel | z    |  3

This kind of filtering is so useful that it has its own concept: the inner join [1]:

select * from t1 inner join t2 on t1.id = t2.id;

It produces the exact same result table. When the names of the columns we compare are the same in the two tables, there's an even shorter syntax that can be used:

select * from t1 inner join t2 using (id);

The result of this will only have a single id column, since we're making it explicit that ids match between the tables:

 id |   name   | code
----+----------+------
  2 | Sam      | x
  3 | Emmanuel | z

I find the filtering equivalence very useful to understand inner joins. Just remember that it's a cross product of the two tables where only rows that satisfy a certain condition are returned. You may be wondering what's the difference between using where filtering and inner join ... on. While the two are logically equivalent, some things to keep in mind:

At least theoretically, inner join ... on is more efficient because in multi-table joins (which is common) we get to apply the filtering per join and not at the end on one huge table. With modern SQL query optimizers it's not clear whether this is a real advantage, however. It's quite likely that the optimizer will generate exactly the same sequence of low-level operations for the two.
In terms of readability, it's much nicer to be able to see what the join is on close to the join itself, rather than in the end of the query in one large where filter. This can be significant for multi-table joins.

As an example, consider customers making orders, with order details in a separate table (since customers could have multiple orders). We could have a complex join done as:

select *
  from customers, orders, order_details
  where customers.id = order_details.customerid and
        orders.id = order_details.orderid

Compared with:

select *
  from customers
    inner join order_details on customer.id = order_details.customerid
    inner join orders on orders.id = order_details.orderid

In the latter it's much clearer what the criteria for each join is.

Finally, I'll mention that some databases support the natural join, which is a shortcut for "inner join tables on the columns that have the same name". The following query is equivalent to the variant with using shown above:

select * from t1 natural join t2;

Natural join is a term from relational algebra, and it's not commonly used in SQL queries.

Outer join

While the inner join is simple to understand as a special case of the cross product, outer join is a bit trickier. Luckily, it's not hard to grok outer joins once you undererstand inner joins, so we can build this knowledge step by step.

Let's get back to our tables t1 and t2. We could assign a logical meaning to the inner join using (id) as "show me all the codes (from t2) matching names (from t1)". The result is two rows where a match on id was found in the two tables. However, sometimes we want something slightly different; we want to ask "show me all the names (from t1) and all the codes (from t2) that match them, if any". In other words, we want all the names to be in the results, perhaps with null values for code where no match was found in the t2 table [2].

Let's break this request to pieces. We want:

All names from t1 that have a match in t2, with the code from t2
All names from t1 that have no match in t2, with null for the code

In SQL we can express this as follows:

select id, name, code
    from t1 inner join t2 using (id)
  union
select id, name, null
    from t1 where id not in (select id from t2);

Some things to note:

The first query is precisely our inner join from the previous section, and it's answering the first piece.
The second query lists all the names that don't have a match in t2 using a subquery.
We're listing the column names explicitly here because column names must match exactly for the two tables being union-ed.

What we just wrote is called a left outer join in SQL [3], and can be more easily written as:

select * from t1 left outer join t2 using (id);

The result is:

 id |   name   | code
----+----------+------
  2 | Sam      | x
  3 | Emmanuel | z
  4 | Brayden  |
  1 | Joanne   |

This is the left outer join because we want all the rows from the left-hand side table to appear in the result. As you may have guessed, there's also a right outer join:

select * from t1 right outer join t2 using (id);

 id |   name   | code
----+----------+------
  2 | Sam      | x
  3 | Emmanuel | z
  7 |          | a

Here all the rows from the right-hand side table appear in the result, with a matching column (name) from the left-hand side if found, null otherwise.

Finally, we may want rows from both sides of the join to always appear in the result table. That's called a full outer join:

select * from t1 full outer join t2 using (id);

Resulting in:

 id |   name   | code
----+----------+------
  2 | Sam      | x
  3 | Emmanuel | z
  7 |          | a
  4 | Brayden  |
  1 | Joanne   |

A full outer join is straightforward to express using a union of left and right joins:

select * from t1 left join t2 using (id)
  union
select * from t1 right join t2 using (id);

There's a slight caveat, though. While union removes duplicates, full outer join does not; therefore, the results can be different in some special cases. In the event that you care about seeing duplicates in the output and the database doesn't support a full outer join, this is a more accurate (though less efficient) translation:

select * from t1 left join t2 using (id)
  union all
select * from t1 right join t2 using (id) where t1.id is null;

Joins on multiple columns

The examples so far showed joins on a single shared column - id. While this is the most common case, sometimes more complex matching criteria are used. SQL doesn't restrict the syntax of join to a single condition, so we can join on multiple columns and arbitrary conditions. Let's add another column to our two tables:

        t1                             t2

 id |   name   | ranking        code | id | ranking
----+----------+---------      ------+----+--------
  1 | Joanne   |       7        x    |  2 |       8
  2 | Sam      |       7        z    |  3 |       6
  3 | Emmanuel |       6
  4 | Brayden  |       2

We can run joins on both id and ranking:

select * from t1 inner join t2 on t1.id = t2.id and t1.ranking = t2.ranking;

Resulting in:

 id |   name   | ranking | code | id | ranking
----+----------+---------+------+----+---------
  3 | Emmanuel |       6 | z    |  3 |       6

And with using:

select * from t1 inner join t2 using (id, ranking);

Resulting in:

 id | ranking |   name   | code
----+---------+----------+------
  3 |       6 | Emmanuel | z

Similarly, we can run outer joins:

select * from t1 left outer join t2 using (id, ranking);

Resulting in:

 id | ranking |   name   | code
----+---------+----------+------
  3 |       6 | Emmanuel | z
  2 |       7 | Sam      |
  4 |       2 | Brayden  |
  1 |       7 | Joanne   |

And so on.

Joins on multiple tables

In real-life databases, data is often split to multiple tables; it's not uncommon for queries to probe 4-5 or more tables to gather all the interesting information. Let's use three table for an example. We'll have a table of customers and a table of items:

         customers                       items

 customerid |   name          itemid | description | price
------------+----------      --------+-------------+-------
          1 | Robert               1 | Napkins     |   1.5
          2 | Jennifer             2 | Granola     |  4.25
          3 | Yoshi                3 | Cheese      |     3
          4 | Xi

In addition we'll have a linking table to record orders made by customers:

 customerid | itemid | orderdate
------------+--------+------------
          1 |      2 | 2019-03-02
          1 |      3 | 2019-03-02
          1 |      1 | 2019-03-03
          2 |      1 | 2019-02-22
          3 |      3 | 2019-01-15
          3 |      2 | 2019-02-20
          4 |      3 | 2019-02-21
          4 |      3 | 2019-02-22

We may be interested in all the customers who ordered cheese, and the date of the order. This requires joining all three tables:

select name, orderdate, description
    from (customers
    inner join orders using (customerid))
    inner join items using (itemid)
    where items.description = 'Cheese';

Resulting in:

  name  | orderdate  | description
--------+------------+-------------
 Robert | 2019-03-02 | Cheese
 Yoshi  | 2019-01-15 | Cheese
 Xi     | 2019-02-21 | Cheese
 Xi     | 2019-02-22 | Cheese

Note the parens around the first join. This is not strictly necessary for this query, but I find it useful to control the order of joining explicitly. We can join as many tables as we want, but the order has to make sense. Each join produces a new logical table that participates in other joins, and for some queries the order of joins is important.

While it will be more common to see sequences of inner joins in such queries, it's also possible to mix and match with outer joins; whatever makes sense.

[1]	When we use the `join` keyword in SQL, inner join is the default, so the keyword `inner` is optional. That said, to distinguish inner joins from outer joins IMHO it's preferable to be explicit.

[2] This sounds contrived with our simplistic tables, but in reality it's an extremely common database query. Imagine our t1 is customers with unique IDs and names, and our t2 is some code assigned to each customer. Suppose we want to display all our customers, regardless of who already has a code assigned. For customers that do have a code we want to show it.

[3]	With the keyword `left` before a `join`, the keyword `outer` is optional, so we could just say `left join` instead of `left outer join`. I like the explicitness of having `outer` there. The same applies for `right` and `full` joins.