til/sqlite/subqueries-in-select.md

# Subqueries in select expressions in SQLite - also window functions
 
I figured out a single SQL query for the following today. Given a table of GitHub repositories, for each repository return:
 
1. The repository name
2. The date of the most recent release from that repository (the `releases` table is a many-to-one against `repos`)
3. The total number of releases
4. **The three most recent releases** (as a JSON array of objects)
 
Then sort by the total number of releases and return the top 10.
 
Normally I would use a join, group by and count for the first three - but the fourth question here isn't possible to solve using a regular join.
 
Instead, I ended up using a pattern that involves subqueries in the `select` part of the query. Here's what that looks like:
 
```sql
select
  repos.full_name,
  (
    select
      max(created_at)
    from
      releases
    where
      repo = repos.id
  ) as max_created_at,
  (
    select
      count(*)
    from
      releases
    where
      repo = repos.id
  ) as releases_count,
  (
    select
      json_group_array(
        json_object(
          'id',
          id,
          'name',
          name,
          'created_at',
          created_at
        )
      )
    from
      (
        select
          *
        from
          releases
        where
          repo = repos.id
        order by
          created_at desc
        limit
          3
      )
  ) as recent_releases
from
  repos
order by
  releases_count desc
limit
  10
```
[Try that here](https://latest-with-plugins.datasette.io/github?sql=select%0D%0A++repos.full_name%2C%0D%0A++%28%0D%0A++++select%0D%0A++++++max%28created_at%29%0D%0A++++from%0D%0A++++++releases%0D%0A++++where%0D%0A++++++repo+%3D+repos.id%0D%0A++%29+as+max_created_at%2C%0D%0A++%28%0D%0A++++select%0D%0A++++++count%28*%29%0D%0A++++from%0D%0A++++++releases%0D%0A++++where%0D%0A++++++repo+%3D+repos.id%0D%0A++%29+as+releases_count%2C%0D%0A++%28%0D%0A++++select%0D%0A++++++json_group_array%28%0D%0A++++++++json_object%28%0D%0A++++++++++%27id%27%2C%0D%0A++++++++++id%2C%0D%0A++++++++++%27name%27%2C%0D%0A++++++++++name%2C%0D%0A++++++++++%27created_at%27%2C%0D%0A++++++++++created_at%0D%0A++++++++%29%0D%0A++++++%29%0D%0A++++from%0D%0A++++++%28%0D%0A++++++++select%0D%0A++++++++++*%0D%0A++++++++from%0D%0A++++++++++releases%0D%0A++++++++where%0D%0A++++++++++repo+%3D+repos.id%0D%0A++++++++order+by%0D%0A++++++++++created_at+desc%0D%0A++++++++limit%0D%0A++++++++++3%0D%0A++++++%29%0D%0A++%29+as+recent_releases%0D%0Afrom%0D%0A++repos%0D%0Aorder+by%0D%0A++releases_count+desc%0D%0Alimit%0D%0A++10).
 
Here are the first three rows of the output. The `recent_releases` column includes a JSON array with details of the three most recent releases for each of those repositories.
 
| full_name                  | max_created_at       |   releases_count | recent_releases                                                                                                                                                                                             |
|----------------------------|----------------------|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| simonw/datasette           | 2022-12-15T02:02:42Z |              121 | [{"id":86103928,"name":"1.0a2","created_at":"2022-12-15T02:02:42Z"},{"id":84755750,"name":"1.0a1","created_at":"2022-12-01T21:30:39Z"},{"id":84496148,"name":"1.0a0","created_at":"2022-11-29T19:57:54Z"}]  |
| simonw/sqlite-utils        | 2022-10-25T22:34:30Z |              104 | [{"id":80981028,"name":"3.30","created_at":"2022-10-25T22:34:30Z"},{"id":75560168,"name":"3.29","created_at":"2022-08-28T03:48:36Z"},{"id":72130482,"name":"3.28","created_at":"2022-07-15T22:56:01Z"}]     |
| dogsheep/twitter-to-sqlite | 2021-09-21T17:39:08Z |               28 | [{"id":50003635,"name":"0.22","created_at":"2021-09-21T17:39:08Z"},{"id":48150315,"name":"0.21.4","created_at":"2021-08-20T00:14:08Z"},{"id":28876263,"name":"0.21.3","created_at":"2020-07-23T14:56:02Z"}] |
 
Note that this query doesn't use a regular join at all - it's `select complex-set-of-things from repos order by releases_count desc limit 10`.
 
The first two subqueries in the select statement are relatively simple:
 
```sql
  (
    select
      max(created_at)
    from
      releases
    where
      repo = repos.id
  ) as max_created_at,
  (
    select
      count(*)
    from
      releases
    where
      repo = repos.id
  ) as releases_count,
```
These could be handled by a regular join, but I'm including them here to illustrate the general concept - you can run a query inside the `select` that references the outer query (`where repo = repos.id`) and then give it an `as max_created_at` alias to provide a name for the resulting column.
 
The third subquery is more complex.
 
I want to return a single string value with JSON representing the most recent releases for the repository.
 
I select those three most recent releases like this:
 
```sql
select
  *
from
  releases
where
  repo = repos.id
order by
  created_at desc
limit
  3
```
Then I use the SQLite `json_group_array()` aggregate function to combine those three results into a JSON array string - and the `json_object()` scalar function to create JSON objects for each release.
 
```sql
  (
    select
      json_group_array(
        json_object(
          'id',
          id,
          'name',
          name,
          'created_at',
          created_at
        )
      )
    from
      (
        select
          *
        from
          releases
        where
          repo = repos.id
        order by
          created_at desc
        limit
          3
      )
  ) as recent_releases
```
The nested `select ... limit 3` is here purely to allow me to limit to three releases - I could write the query without it, but I'd end up with a JSON array of _every_ release for each repo - which would be a lot of data, since `simonw/datasette` has 121 releases!
 
This trick is really powerful. It's reminiscent of some of the things you can do with GraphQL - and in fact the [Super Graph](https://github.com/Hc747/super-graph) project uses PostgreSQL queries [that look something like this](https://twitter.com/dosco/status/1250886122974130182) compiled from GraphQL queries.
 
## Exploring this with explain query plan
 
While experimenting with this query I built a new prototype Datasette plugin called [datasette-explain](https://github.com/simonw/datasette-explain) to show the output of SQLite's `explain query plan select ...` command, re-formatted into a nested list.
 
The "Try that here" link above demonstrates that plugin. Here's the formatted `explain query plan` output:
 
```
SCAN repos
CORRELATED SCALAR SUBQUERY 2
    SEARCH releases USING COVERING INDEX idx_releases_repo (repo=?)
CORRELATED SCALAR SUBQUERY 1
    SEARCH releases USING INDEX idx_releases_repo (repo=?)
CORRELATED SCALAR SUBQUERY 4
    CO-ROUTINE (subquery-3)
        SEARCH releases USING INDEX idx_releases_repo (repo=?)
        USE TEMP B-TREE FOR ORDER BY
    SCAN (subquery-3)
USE TEMP B-TREE FOR ORDER BY
```
Some jargon definitions here to help decipher this:
 
- `SCAN` means "scan through every row in the entire table" - we start by scanning all of `repos`.
- `SEARCH ... USING COVERING INDEX` means a query can be answered using just data in the index itself. This is used for the `count(*)` subquery, which can be answered using only that data - because the `releases.repo` column is indexed.
- `SEARCH ... USING INDEX` answers a question using an index, but still has to look up details in the associated table rows. The `max(created_at)` subquery uses this, because `created_at` is not indexed.
- `CORRELATED SCALAR SUBQUERY`: a scalar subquery is a subquery that returns a single value - all three of our subqueries in the `select` clauses do this. `CORRELATED` means that the subquery is correlated with the outer query - it references the outer query's table (`repos` in this case).
- `CO-ROUTINE`: I don't fully understand these. Here's [the SQLite documentation explanation](https://www.sqlite.org/optoverview.html#subquery_co_routines).
- `USE TEMP B-TREE FOR ORDER BY` means that the query is using a temporary B-tree to sort the results.
 
## Achieving the same thing with window functions
 
[Anton Zhiyanov](https://twitter.com/ohmypy/status/1623586489358553088) provided this alternative SQL function which achieves the same result using window functions instead of subqueries:
 
```sql
with cte as (
  select
    repos.full_name,
    max(releases.created_at) over (partition by repos.id) as max_created_at,
    count(*) over (partition by repos.id) as releases_count,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    join releases on releases.repo = repos.id
)
select
  full_name,
  max_created_at,
  releases_count,
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) as recent_releases
from cte
where rel_rank <= 3
group by full_name
order by releases_count desc
limit 10;
```
[Try that out here](https://latest-with-plugins.datasette.io/github?sql=with+cte+as+%28%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++max%28releases.created_at%29+over+%28partition+by+repos.id%29+as+max_created_at%2C%0D%0A++++count%28*%29+over+%28partition+by+repos.id%29+as+releases_count%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank%28%29+over+%28partition+by+repos.id+order+by+releases.created_at+desc%29+as+rel_rank%0D%0A++from+repos%0D%0A++++join+releases+on+releases.repo+%3D+repos.id%0D%0A%29%0D%0Aselect%0D%0A++full_name%2C%0D%0A++max_created_at%2C%0D%0A++releases_count%2C%0D%0A++json_group_array%28%0D%0A++++json_object%28%0D%0A++++++%27id%27%2C+rel_id%2C%0D%0A++++++%27name%27%2C+rel_name%2C%0D%0A++++++%27created_at%27%2C+rel_created_at%0D%0A++++%29%0D%0A++%29+as+recent_releases%0D%0Afrom+cte%0D%0Awhere+rel_rank+%3C%3D+3%0D%0Agroup+by+full_name%0D%0Aorder+by+releases_count+desc%0D%0Alimit+10%3B).
 
The key trick here is the `rank() over` line:
 
```sql
rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
```
This adds an integer called `rel_rank` to each row in that CTE showing the relative ranking of each release, ordered by their created date. [This version of the query](https://latest-with-plugins.datasette.io/github?sql=with+cte+as+%28%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++max%28releases.created_at%29+over+%28partition+by+repos.id%29+as+max_created_at%2C%0D%0A++++count%28*%29+over+%28partition+by+repos.id%29+as+releases_count%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank%28%29+over+%28partition+by+repos.id+order+by+releases.created_at+desc%29+as+rel_rank%0D%0A++from+repos%0D%0A++++join+releases+on+releases.repo+%3D+repos.id%0D%0A%29%0D%0Aselect+*+from+cte) shows the contents of the `cte` table, which starts like this:
 
| full_name                      | max_created_at       |   releases_count |   rel_id | rel_name                                                       | rel_created_at       |   rel_rank |
|--------------------------------|----------------------|------------------|----------|----------------------------------------------------------------|----------------------|------------|
| simonw/datasette               | 2022-12-15T02:02:42Z |              121 | 86103928 | 1.0a2                                                          | 2022-12-15T02:02:42Z |          1 |
| simonw/datasette               | 2022-12-15T02:02:42Z |              121 | 84755750 | 1.0a1                                                          | 2022-12-01T21:30:39Z |          2 |
| simonw/datasette               | 2022-12-15T02:02:42Z |              121 | 84496148 | 1.0a0                                                          | 2022-11-29T19:57:54Z |          3 |
 
Then later in the query the `from cte where rel_rank <= 3` filters to just the first three releases for each repository, such that the `group by` can be used in conjunction with `json_group_array()` to generate the JSON for just those three.
 
The `explain query plan` for this variant looks like this:
 
```
CO-ROUTINE cte
    CO-ROUTINE (subquery-3)
        CO-ROUTINE (subquery-4)
            SCAN releases
            SEARCH repos USING INTEGER PRIMARY KEY (rowid=?)
            USE TEMP B-TREE FOR ORDER BY
        SCAN (subquery-4)
    SCAN (subquery-3)
SCAN cte
USE TEMP B-TREE FOR GROUP BY
USE TEMP B-TREE FOR ORDER BY
```
 
## Using a left join to include repos with no releases
 
There's a catch with this query though: since it's using a regular join between `repos` and `releases` it won't return a row at all for any repository that does not have at least one associated release.
 
We can fix that using a `left join`, like this:
 
```sql
  from repos
    left join releases on releases.repo = repos.id
```
This almost has the desired effect, but if [you run it](https://latest-with-plugins.datasette.io/github?sql=with+cte+as+(%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++max(releases.created_at)+over+(partition+by+repos.id)+as+max_created_at%2C%0D%0A++++count(*)+over+(partition+by+repos.id)+as+releases_count%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank()+over+(partition+by+repos.id+order+by+releases.created_at+desc)+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A)%0D%0Aselect%0D%0A++full_name%2C%0D%0A++max_created_at%2C%0D%0A++releases_count%2C%0D%0A++json_group_array(%0D%0A++++json_object(%0D%0A++++++%27id%27%2C+rel_id%2C%0D%0A++++++%27name%27%2C+rel_name%2C%0D%0A++++++%27created_at%27%2C+rel_created_at%0D%0A++++)%0D%0A++)+as+recent_releases%0D%0Afrom+cte%0D%0Awhere+rel_rank+%3C%3D+3%0D%0Agroup+by+full_name%0D%0Aorder+by+releases_count+desc) you'll notice that it returns rows that look like this:
 
| full_name                                                            | max_created_at       |   releases_count | recent_releases                                                    |
|----------------------------------------------------------------------|----------------------|------------------|--------------------------------------------------------------------|
| zerok/covid19-aut-stats                                              |                      |                1 | [{"id":null,"name":null,"created_at":null}]                        |
| yml/sybarval_dataviz                                                 |                      |                1 | [{"id":null,"name":null,"created_at":null}]                        |
 
There are two things wrong here: `releases_count` is 1 when it should be 0, and the `recent_releases` returns an array with an object full of nulls.
 
We can fix the first problem by changing this line:
```sql
count(*) over (partition by repos.id) as releases_count,
```
To this:
```sql
count(releases.id) over (partition by repos.id) as releases_count,
```
This will only increment the counter for rows where that column isn't null, which fixes the problem and returns a 0 value.
 
For the second problem, we can add a filter to the end of the `json_group_array()` aggregation like this:
 
```sql
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) filter (where rel_id is not null) as recent_releases
```
This excludes the left joined release data with the null values, which results in an empty `[]` for repos with no releases.
 
The finished query looks like this:
 
```sql
with cte as (
  select
    repos.full_name,
    max(releases.created_at) over (partition by repos.id) as max_created_at,
    count(releases.id) over (partition by repos.id) as releases_count,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select
  full_name,
  max_created_at,
  releases_count,
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) filter (where rel_id is not null) as recent_releases
from cte
where rel_rank <= 3
group by full_name
order by releases_count desc
```
[Try that finished query here](https://latest-with-plugins.datasette.io/github?sql=%0D%0Awith+cte+as+%28%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++max%28releases.created_at%29+over+%28partition+by+repos.id%29+as+max_created_at%2C%0D%0A++++count%28releases.id%29+over+%28partition+by+repos.id%29+as+releases_count%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank%28%29+over+%28partition+by+repos.id+order+by+releases.created_at+desc%29+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A%29%0D%0Aselect%0D%0A++full_name%2C%0D%0A++max_created_at%2C%0D%0A++releases_count%2C%0D%0A++json_group_array%28%0D%0A++++json_object%28%0D%0A++++++%27id%27%2C+rel_id%2C%0D%0A++++++%27name%27%2C+rel_name%2C%0D%0A++++++%27created_at%27%2C+rel_created_at%0D%0A++++%29%0D%0A++%29+filter+%28where+rel_id+is+not+null%29+as+recent_releases%0D%0Afrom+cte%0D%0Awhere+rel_rank+%3C%3D+3%0D%0Agroup+by+full_name%0D%0Aorder+by+releases_count+desc).
 
## Simplified to use just one window function
 
The `max_created_at` and `releases_count` columns are being created by window functions, but it's actually possible to rearrange the query to use regular aggregations for those. Here's what I figured out:
 
```sql
with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select
  full_name,
  max(created_at) as max_created_at,
  count(rel_id) as releases_count,
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
from cte
group by full_name
order by releases_count desc
```
[Try that version here](https://latest-with-plugins.datasette.io/github?sql=%0D%0Awith+cte+as+%28%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++releases.created_at%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank%28%29+over+%28partition+by+repos.id+order+by+releases.created_at+desc%29+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A%29%0D%0Aselect%0D%0A++full_name%2C%0D%0A++max%28created_at%29+as+max_created_at%2C%0D%0A++count%28rel_id%29+as+releases_count%2C%0D%0A++json_group_array%28%0D%0A++++json_object%28%0D%0A++++++%27id%27%2C+rel_id%2C%0D%0A++++++%27name%27%2C+rel_name%2C%0D%0A++++++%27created_at%27%2C+rel_created_at%0D%0A++++%29%0D%0A++%29+filter+%28where+rel_id+is+not+null+and+rel_rank+%3C%3D+3%29+as+recent_releases%0D%0Afrom+cte%0D%0Agroup+by+full_name%0D%0Aorder+by+releases_count+desc).
 
The trick here is to select the `created_at` and `releases.id` columns in the initial CTE, then use `max(created_at) as max_created_at` and `count(rel_id) as releases_count` in the second section of the query, in order to total things up based on the group by.
 
I also modified the filter on `json_group_array()` to look like this:
 
```sql
) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
```
This was needed so that the `releases_count` number would reflect ALL releases for the repo, even while the JSON array only showed the first three ordered by `created_at` descending (as specified in the `rank()` window function).