The PostgreSQL team announced recently a new release of the most advanced open source relational database - PostgreSQL 12. As usual it comes with an impressive list of improvements (generated columns ♥️), one of them being long awaited by dozens of developers: improved Common Table Expressions.
I should first explain what are the Common Table Expressions for those who are unfamiliar with them:
the CTE’s, often called
“WITH queries”, are SQL constructs giving a possibility of creating temporal data views for a sake of a query execution.
Essentially CTE is an additional query which results can be referenced in the subsequent CTE’s or the main query before which it is being placed. It should be clear enough with an example - here’s a sample query with two CTE’s taken from Postgres docs:
WITH regional_sales AS ( SELECT region, SUM(amount) AS total_sales FROM orders GROUP BY region ), top_regions AS ( SELECT region FROM regional_sales WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales) ) SELECT region, product, SUM(quantity) AS product_units, SUM(amount) AS product_sales FROM orders WHERE region IN (SELECT region FROM top_regions) GROUP BY region, product;
Now we know what they are, but what purpose can they serve us? Well - we could parry here and say: for the same purpose as ordinary database views serve. That’s of course a dramatic simplification - CTE’s are much more powerful and should not be treated as a simple database views. Nevertheless I would like to keep the collation for the sake of this article.
Database views are absolutely optional - one can simply substitute them with a subquery and achieve identical results. Indeed that’s what modern database engines do these days - once a database view is being used they inline it’s query as a subquery. Why to bother then? We are able to deal completely without database views and even if we did use one the database engine would get rid of it anyway.
What benefits do views give us then? And why the heck are they inlined?
The most appropriate explanation here is that database views help us achieve better readability. They offer an elegant way of abstracting some parts of a database into a meaningful object, often matching closely with the domain.
Instead of creating giant and ugly looking queries it is possible to extract some of its parts into an appealing view which is easier to browse and select data from. "Divide and conquer" rule in it’s true form.
Still, we didn’t answer the fact that the underlying view’s query is most of the times inlined while it is being referenced. The reason is performance of course. Smart guys found out that lazy evaluation helps the optimizer a lot - by delaying the execution we could take advantage of a context of the actual query.
This in turn allows many clever optimization techniques, like: pushing down predicates (WHERE filters), eliminating unnecessary JOINS, accessing only subset of columns etc. In other words - a database is smart enough to do as little work as possible when evaluating database views, in the context of the issued query.
Personally I love this pattern: aggregating all the data into views and letting a database to optimize my queries - these folks are really good in it and my queries are dead simple too.
I have mentioned the CTE’s at the beginning, saying they are able to create temporal data views. I still conform to the comparison with database views - they both are in many cases similar. The main difference is that CTE results are temporary and are reachable only in the context of a query which CTE is being part of.
It may make sense to use a CTE in a place where a regular database view is not justified (e.g. it makes sense only in the context of a query and not in the whole domain), expecting similar behavior.
Besides many remarkable advantages there’s at least one disadvantage which disqualifies CTE's for most use cases - before PostgreSQL 12 it was implemented as an optimization fence. What does it mean?
Easy to imagine an example with a generic data view aggregating lots of data. If the view is then used to select just few records it could mean a tremendous waste of computation, if the aggregation is executed immediately. Instead the aggregation should be executed only on a small subset of data, which can be deduced from the outer query.
Unfortunately such counter intuitive behavior was true for CTE’s for a long time - their results were materialized only to be accessible for the rest of the query afterwards.
Not to blame anybody - the creators had quite good reasons to implement such behavior (i.e. guarantee of exactly one evaluation, possibility of a recursive CTE’s and more) but this still feels like focusing on corner cases instead of optimizing the happy path. That’s why the community insisted for a long time for changing the status quo by giving the possibility of disabling the fence and unlocking the full potential of CTE’s.
The SQL gods listen to their prayers and here it is - PostgreSQL 12 with updated Common Table Expressions. By default, when few constraints are met, the queries will be inlined allowing joint optimizations.
It is still possible to force the old behavior - by defining the CTE
AS MATERIALIZED the engine would execute it immediately.
It is also possible to hint the optimizer that we definitely want the CTE to be inlined, e.g. when the CTE is being referenced twice the engine won't inline it by default.
This is truly a game changer for many developers who care about their queries’ readability, allowing them to substitute not very well liked subqueries with elegant
Don’t get me wrong - not every subquery should be immediately replaced, they still have their strengths and in some cases they should be chosen over CTE’s. It is just convenient to have two distinct tools in a toolbox, isn’t it?