After a few years of working on data analytics and processing, I have come to see a contradiction in the way SQL supports large and complex queries, and I've heard similar echoes from several others. I'll call this the Vanishing Schema Paradox:
In a large SQL query, fundamental concepts such as «this is the table of all widgets» vanish after the first few lines.
Or, in slightly more formal terms:
In a large SQL query, the proportion of tables that come from the database schema decreases as the SQL query grows.
This is quite disappointing. After time and effort spent cleaning and documenting the database schema, most tables only last for a few short lines in a typical query.
The cause should not surprise anyone familiar with SQL. The simple concept of «compute this value for each widget» is actually not supported by SQL queries: there is no way to add a column to the widget table. Instead, SQL lets you create a new table with SELECT, a clone table that includes the new column in addition to the old ones. The rest of the query then manipulates the clone and never touches the original table again.
To give a more concrete example, consider a simple report:
Given a Products table with a Target column and an Orders table with a Quantity column, find the ratio of products for which the total 2019 sales exceeded their target.
Here's a solution in three steps, numbered from the inside out:
3> SELECT AVG(CASE WHEN P.Total2019Sales >= P.Target THEN 1 ELSE 0 END)
3> FROM (
2> SELECT Products.Target, SUM(Orders.Quantity) AS Total2019Sales
1> FROM Products
1> LEFT JOIN Orders ON Products.PId = Orders.PId AND Orders.Year = 2019
2> GROUP BY Products.PId, Products.Target
3> ) AS P
Left-join the products and orders tables. This creates a new table, because SQL is about relational algebra, and relational algebra is about creating new relations from old relations, and the distinction between tables and relations is irrelevant to the paradox.
Aggregate by product the Orders.Quantity values from the new table, resulting in a new table P. We will pretend that table P is the same as Products, because it contains the same lines and can (if we choose to) include the same columns as well, in addition to the new column Total2019Sales.
Aggregate table P to the ratio of products where Total2019Sales >= Target.
Out of three named tables (Products, Orders and P), only two tables appear in the database schema.
I'm not complaining! For a small query, and with a few comments, this is fine. Most relational algebra operations are intuitive, given enough SQL experience.
The pain appears when growing beyond that. Let's say an additional request comes for the report:
Using the Category column, compute the ratio of products which exceeded their target across:
all categories where over 50% of products reached their targets,
all other categories.
If the Total2019Sales column was part of the Products table, this would be easy:
3> SELECT AVG(CASE WHEN P.Total2019Sales >= P.Target THEN 1 ELSE 0 END), C.Over50
2> FROM Products P
2> INNER JOIN (
1> SELECT
1> P.Category,
1> AVG(CASE WHEN P.Total2019Sales >= P.Target THEN 1 ELSE 0 END) > 0.5 AS Over50
1> FROM Products P
1> GROUP BY P.Category
2> ) C ON P.Category = C.Category
3> GROUP BY C.Over50
Group the Products table by category, resulting in a new table C. The column Over50 is true for categories where over 50% of products reached their targets.
Join the Products table with table C. This results in a new table which we will again pretend is the same as the Products table, with the addition of the Over50 column.
Aggregate that relation into the requested ratios.
The problem is that Total2019Sales is not part of the Products table. It is part of table P from the first example, which is not in the schema.
SQL almost solves this problem with Common Table Expressions, whereby a table is given a temporary name for the rest of the query:
WITH P AS (
-- The subquery from example 1:
SELECT Products.Category, Products.Target, SUM(Orders.Quantity) AS Total2019Sales
FROM Products
LEFT JOIN Orders ON Products.PId = Orders.PId AND Orders.Year = 2019
GROUP BY Products.PId, Products.Category, Products.Target)
-- Schema tables no longer appear after this line! --
-- The full query from example 2:
SELECT AVG(CASE WHEN P.Total2019Sales >= P.Target THEN 1 ELSE 0 END), C.Over50
FROM P
INNER JOIN (
SELECT
P.Category,
AVG(CASE WHEN P.Total2019Sales >= P.Target THEN 1 ELSE 0 END) > 0.5 AS Over50
FROM P
GROUP BY P.Category
) C ON P.Category = C.Category
GROUP BY C.Over50
I say almost, because we are doing a lot of pretending here.
The heart of the query is to compute two values for each product: the total sales for that product in 2019, and whether that product belongs to a category where over 50% of products reached their targets in 2019.
The proper way to model «two values for each product» in a database schema is to add two columns to the Products table.
However, SQL does not support adding columns to a table in the middle of a SELECT query—modeling data and querying data are two completely independent modes in SQL. Instead, we settle for the next best thing: to create a clone of the Products table, which contains all the columns from the original, plus our additional two columns. This is achieved by a left-join-then-group-by (for Total2019Sales) or by an inner-join (for Over50).
This is a dirty hack! It does not express the intent behind the operation: that the new table should be the Products table with additional columns. It's easy to make a mistake that drops products or introduces duplicate lines for some products: using an inner-join instead of a left-join, or the wrong keys for a group-by, or a misplaced WHERE. This is compounded by the fact that foreign keys (and other consistency checks) are no longer present or enforced on the clone.
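For instance, take the first example and change a single keyword; the query still runs, but it silently drops every product that had no 2019 orders, and nothing in the query flags that the clone is no longer «the table of all products»:
SELECT AVG(CASE WHEN P.Total2019Sales >= P.Target THEN 1 ELSE 0 END)
FROM (
SELECT Products.Target, SUM(Orders.Quantity) AS Total2019Sales
FROM Products
INNER JOIN Orders ON Products.PId = Orders.PId AND Orders.Year = 2019 -- was: LEFT JOIN
GROUP BY Products.PId, Products.Target
) AS P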
A complex report might involve five or six tables, and add tens of variables to each of them. It's possible to reach a hundred table clones, each clone an opportunity to make a mistake.
Picture a 200-line SQL query. No tables from the schema appear after line 10, except that one on line 190 which is joined on a column that may or may not still be a foreign key. There are five tables named Products (or some variation thereof), some of which might still represent «the table of all products», though three of them look like they might have dropped some products from the original table, and a fourth is based on a suspicious INNER JOIN that could duplicate a few lines.
Not many people complain about this. Developers—those who have a strong background in other programming languages, and can realize how silly this all is—tend to hide SQL queries behind an Object-Relational Mapper, or write very small transactional queries that are more concerned with traversing a graph of relations. Data analysts—those who need to write elaborate SQL queries, involving complex business logic—often have no point of comparison in terms of language expressiveness.
The SQL query model can be improved by adding operations that preserve the schema. In particular, adding a column to a table should be a basic building block. It does not need to actually add the column to the table: it is enough to create a clone (with an absolute guarantee that no lines appear or disappear), and maybe give it the same name as the original (for convenience). The same relational algebra semantics, but wrapped in a different syntax with additional guarantees.
Data Frames are a similar concept.
Let's think about what such a model would look like. The concept is:
Rather than adding tables to the schema, the query should add columns to schema tables. New tables, when necessary, should remain unnamed and thus of very local scope.
Common operations add columns to schema tables and guarantee that no rows are added or dropped.
A map stays in the same table, applying a line-by-line operation to compute one new value for each row.
Products.TargetReached = Products.Total2019Sales >= Products.Target
After the assignment, the Products table is still the table of all products.
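In SQL, the closest equivalent is a clone of Products with one extra computed column (a sketch, pretending Total2019Sales were already a column of Products):
-- Map: one new value per row, computed from other columns of the same row.
SELECT Products.*,
CASE WHEN Products.Total2019Sales >= Products.Target THEN 1 ELSE 0 END AS TargetReached
FROM Products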
A contraction (borrowing the term from tensors) performs an aggregation into an output table that already exists, along a one-to-many relation. In terms of SQL, this would be a LEFT JOIN between the output table and the input table (so there are no missing lines), followed by a GROUP BY on the primary key of the output table (so there are no duplicates).
Products.Total2019Sales = sum(Orders.Quantity) where Orders.Year == 2019
Since fewer tables are created in this model, most operations involve schema tables, making it possible to rely more on natural joins (above, between Products and Orders) to shorten the code.
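Spelled out in SQL, the contraction above expands to roughly the inner query of the first example:
-- Contraction: LEFT JOIN from the output table (Products) so no product is lost,
-- GROUP BY its primary key (PId) so no product is duplicated.
SELECT Products.PId, SUM(Orders.Quantity) AS Total2019Sales
FROM Products
LEFT JOIN Orders ON Products.PId = Orders.PId AND Orders.Year = 2019
GROUP BY Products.PId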
Also, ratio(myBool) as a shorthand for avg(myBool ? 1 : 0) should be standard SQL.
A broadcast (again, from tensors) pulls data through a many-to-one relation. In terms of SQL, this would be an INNER JOIN with a foreign-primary key pair (the foreign key ensures that the join finds at least one row per row, the primary key ensures that it finds no more than one row).
Products.Over50 = Categories.Over50
Natural joins make this quite short.
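In SQL, the broadcast above could be sketched as follows (assuming a Categories table keyed by Category, with Products.Category declared as a foreign key to it):
-- Broadcast: the foreign key guarantees a matching category exists,
-- the primary key on Categories guarantees it is unique.
SELECT Products.*, Categories.Over50
FROM Products
INNER JOIN Categories ON Products.Category = Categories.Category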
A seek is a broadcast which uses a primary key, but not a corresponding foreign key (and thus, a LEFT JOIN instead of an INNER JOIN). There is still a guarantee that there will be one value per row in the table, but some of these values will be null.
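The SQL sketch for a seek differs from the broadcast by a single keyword:
-- Seek: without a foreign key, a product's category may be missing from Categories;
-- the LEFT JOIN keeps every product and yields NULL for Over50 in that case.
SELECT Products.*, Categories.Over50
FROM Products
LEFT JOIN Categories ON Products.Category = Categories.Category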
New tables, if named, should fit into the schema. They should come with appropriate defaults for primary and foreign keys, and should support adding new keys as necessary.
table Categories = by Products.Category
Categories.Over50 = ratio(Products.TargetReached) > 0.5
Products.Over50 = Categories.Over50
This creates a foreign key in Products that allows contracting into and broadcasting from Categories.
Conversely, it should be possible to create a local, unnamed table.
Products.Over50 = ratio(Products.TargetReached) by Products.Category > 0.5
This creates a temporary table with a single column ratio(Products.TargetReached) contracted from Products. This column is then broadcast back into Products by the assignment. The temporary table then disappears, never to be used again.
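For comparison, in plain SQL that single line corresponds roughly to the following sketch (pretending TargetReached were already a 0/1 column, and using ReachedRatio as a throwaway name for the intermediate column):
-- The unnamed table is the subquery C; it exists only for the duration of the join.
SELECT Products.*, CASE WHEN C.ReachedRatio > 0.5 THEN 1 ELSE 0 END AS Over50
FROM Products
INNER JOIN (
SELECT Category, AVG(CASE WHEN TargetReached = 1 THEN 1.0 ELSE 0.0 END) AS ReachedRatio
FROM Products
GROUP BY Category
) C ON Products.Category = C.Category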
The syntax in the above (non-SQL) examples comes from Envision, an internal language designed and used by my company (we have good reasons for creating our own language, and «make a better SQL» is not one of them).
In Envision, our complete report (both the global ratio, and the ratio for "over 50%" and "under 50%" categories) would be written as:
Products.Total2019Sales = sum(Orders.Quantity) where Orders.Year == 2019
Products.TargetReached = Products.Total2019Sales >= Products.Target
show scalar "Ratio of products that reached target" with
ratio(Products.TargetReached)
Products.Over50 = ratio(Products.TargetReached) by Products.Category > 0.5
show table "Ratios by categories" with
ratio(Products.TargetReached) as "Ratio that reached target"
Products.Over50 as "In category that reached over 50%"
group by Products.Over50
From beginning to end, Products retains its meaning as «the table of all products». There is no risk of dropped rows or duplicate rows. The only new tables are unnamed (one for by Products.Category, the other for group by Products.Over50) and consumed immediately.