A Taxonomy of Problematic SQL
Most bad SQL follows recurring patterns, and there are systematic
ways to recognize those patterns and to avoid or fix the problems they signal.
Just as a zoologist needs taxonomy to classify and recognize
the species in an ecosystem, and to understand their relationships, a developer
needs a systematic method to understand the ways that SQL can be broken, and to
repair those problems, especially when maintaining someone
else's SQL, often SQL from a source that cannot be reached to explain the
purpose behind the SQL and how it was written. Even without knowing the purpose
behind the SQL, certain patterns point to recognizable, common problems
that need repair. This presentation, borrowing from a chapter in the author's
book, SQL Tuning, describes a taxonomy of abnormal
SQL: SQL that often requires repair or points to opportunities for database
design improvements. (O'Reilly retains the copyright on the material taken directly
from the book.)
Many of the rules for classifying SQL are vastly easier to
express and understand in terms of join trees, abstract representations of the
SQL that I explain in much more detail in the book, SQL Tuning. In case you are
not already familiar with these, and lack a copy of the book, I'll introduce
join trees briefly here.
Consider a query:
SELECT ...
FROM Orders O, Order_Details OD, Products P, Customers C,
     Shipments S, Addresses A, Code_Translations ODT, Code_Translations OT
WHERE UPPER(C.Last_Name) LIKE :Last_Name||'%'
  AND OD.Order_ID = O.Order_ID
  AND O.Customer_ID = C.Customer_ID
  AND OD.Product_ID = P.Product_ID
  AND OD.Shipment_ID = S.Shipment_ID
  AND S.Address_ID = A.Address_ID(+)
  AND O.Status_Code = OT.Code
  AND OT.Code_Type = 'ORDER_STATUS'
  AND OD.Status_Code = ODT.Code
  AND ODT.Code_Type = 'ORDER_DETAIL_STATUS'
ORDER BY ...;
The join tree that abstractly represents the above query is as follows:
Figure 1, a join tree
representing the above query
In this diagram:
Each table is a node, represented by its alias.
Each join is a link, with (usually
downward-pointing) arrows pointing toward any side of the join that is unique.
Midpoint arrows point toward the optional side of any outer join.
For example, the table Shipments
is represented by the node labeled S, and the outer join S.Address_ID
= A.Address_ID(+) is represented by the downward-pointing arrow from S to
A, with the midpoint arrow pointing toward the (+) side of that join and the
end-point arrow pointing toward A because Address_ID is unique for the table Addresses.
Following these rules to create
join diagrams, we find a number of regularities in most well-written SQL:
The query maps to one tree.
The tree has one root, exactly one table with no
join to its primary key.
All joins have downward-pointing arrows (joins
unique on one end).
Outer joins point down (toward the primary key, that is), with only outer joins below outer joins.
The question that the query answers is basically a
question about the entity represented at the top (root) of the tree (or about
aggregations of that entity).
The other tables just provide reference data stored
elsewhere for normalization.
If we know in advance that good
SQL tends to follow these patterns, the main taxonomic grouping of SQL should
be simply normal queries that precisely meet the above conditions. The rest of
our taxonomy will consist of query classes that fail to meet precisely these
conditions in specific ways. I will describe each of these special query
classes in terms of the atypical features seen in the query diagrams
themselves. These join trees provide a convenient shorthand for generalizing
the type of abnormality in a form far easier to recognize (once you really
understand join trees) than any description that focuses on the SQL itself. In
each case, I will use a convention where I show the essential abnormal feature
of the abnormal diagram, but I hide (behind gray clouds) the parts of the
diagram that have no bearing on the abnormality. Each atypical feature will
tend to point to problems or opportunities for changes to the SQL or the
database design.
Figure 2, a case-1 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey2
and T1.fkey1=T3.pkey3,
with T2.pkey2=T3.pkey3, explicitly also
in the WHERE clause, or just implied by transitivity. This type of join tree is
an opportunity for optimization, not a problem, allowing all possible join
orders between these three tables. Depending on the inclusion or exclusion of
conditions implied by transitivity, the direct representation of the SQL as
written might look like any of the following three subcases:
Figure 3, alternative join
trees where the cyclic join graph may just be implied
Subcases A, B, and C show where the missing
link in the graphs may only be implied by transitivity. In subcases
A and B, it is easy to see from the diagram alone that the foreign key pointing
to T3 or T2 must equally point to the primary key of the other table, since
those two tables, joining one-to-one, presumably share primary key values. In subcase C, the diagram alone does not imply a hidden cycle;
you must notice in the SQL itself that the foreign key pointing from T1 to T2
is the same foreign key that points from T1 to T3.
Note that although Case 1 of cyclic join trees does not
automatically create a problem, this case does involve a one-to-one join, which
we discuss further later in this paper.
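As a concrete sketch (the table and column names here are invented purely for illustration), SQL producing a case-1 cycle might look like:
SELECT ...
FROM Order_Details OD, Orders O, Order_Summaries OS
WHERE OD.Order_ID = O.Order_ID      -- T1.fkey1 = T2.pkey2
  AND OD.Order_ID = OS.Order_ID     -- T1.fkey1 = T3.pkey3
  AND O.Order_ID = OS.Order_ID      -- T2.pkey2 = T3.pkey3, the one-to-one join closing the cycle
Since Orders and Order_Summaries (in this invented example) join one-to-one on a shared primary key, any two of these three conditions imply the third by transitivity, and the optimizer is free to choose any join order among the three tables.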
Figure 4, a case-2 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey2 and T1.fkey2=T3.pkey3 and
T2.fkey2=T3.pkey3.
Note that you will almost certainly find that
T1.fkey2=T2.fkey2, systematically, whenever T2 is a master row to the T1
detail (in other words, wherever T1.fkey1=T2.pkey2). This is because fkey2 in T1
is actually a denormalization here: for example, a Customer_ID column in Order_Lines,
when the normalized home of Customer_ID is
Orders. This denormalization, like most denormalizations, is likely not needed (that is, the fkey2
column likely should be dropped from T1), and it certainly isn't likely that
both the join to the normalized foreign key and the join to the denormalized
copy of that key are needed from a functional perspective. The extra denormalized join here is likely to fool the optimizer
into expecting that almost no rows will satisfy all three joins, since the
optimizer views these joins as statistically independent conditions (which is
the opposite of the truth). This false expectation is likely to lead the
optimizer to make bad choices, since it will likely drastically underestimate
how many rows will remain after performing all three joins.
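As a concrete sketch of the Order_Lines example just described (the column names are assumed for illustration), a case-2 cycle might look like:
SELECT ...
FROM Order_Lines OL, Orders O, Customers C
WHERE OL.Order_ID = O.Order_ID         -- T1.fkey1 = T2.pkey2
  AND O.Customer_ID = C.Customer_ID    -- T2.fkey2 = T3.pkey3, the normalized join
  AND OL.Customer_ID = C.Customer_ID   -- T1.fkey2 = T3.pkey3, joining through the denormalized copy
Here OL.Customer_ID is a denormalized copy of O.Customer_ID, so the last condition eliminates nothing that the first two do not already eliminate, even though the optimizer is likely to treat it as an independent condition.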
Figure 5, a case-3 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey1 and
T1.fkey2=T3.pkey2, and T2.col3=T3.col4.
Though this last condition is
technically a join, it is best to treat the last join as a constrained filter
condition that we cannot evaluate without both T2 and T3. The constrained
filter condition requires that you join to both tables to pick up the filter
selectivity. In choosing a join order, I generally ignore these constrained
filters until the join order reaches one of the two tables involved (choosing that
much of the join order without regard to the constrained filter), and from that point on I treat the
filter as a single-table filter on the other table.
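As a hypothetical sketch (the tables and columns are invented for illustration), a case-3 cycle might look like:
SELECT ...
FROM Order_Details OD, Shipments S, Warehouses W
WHERE OD.Shipment_ID = S.Shipment_ID     -- T1.fkey1 = T2.pkey1
  AND OD.Warehouse_ID = W.Warehouse_ID   -- T1.fkey2 = T3.pkey2
  AND S.Region_Code = W.Region_Code      -- T2.col3 = T3.col4, the constrained filter condition
The last condition cannot be evaluated until the join order reaches both S and W; once the join order reaches either of those tables, the condition acts as a filter on the other.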
Figure 6, a case-4 cyclic
join tree
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey1 and T1.fkey2=T3.pkey_col1, and
T2.col=T3.pkey_col2.
Note that T3 has a two-part primary key, here, and that the
joins to the two-part key come from two different tables that are themselves
joined in a master-detail relationship. This is uncommon, but happens sometimes,
especially where the database tries to take advantage of natural keys, keys
built from columns having some external meaning (as opposed to just being
arbitrary numbers generated from some sequence). I don't generally recommend
this practice of using non-arbitrary primary key columns, but SQL developers
often have no control over these design decisions, so you may find yourself
with cases like this. Treat both joins to T3 as a constrained join: don't join
to T3 until you have reached both T1
and T2 in the join order. For purposes of diagramming, treat the join to T3 as
a single arrow having three end points:
Figure 7, a case-4 cyclic
join tree, transformed to show the 3-ended unique join
To make this case concrete, here
is an example of a query against a hypothetical data dictionary that would see
this case:
SELECT TC.Column_Name
FROM Indexes Ind, Index_Columns IC, Table_Columns TC
WHERE Ind.Index_Name = 'EMPLOYEES_X1'
  AND IC.Index_ID = Ind.Index_ID
  AND TC.Table_ID = Ind.Table_ID
  AND TC.Column_Number = IC.Column_Number
ORDER BY IC.Index_Position ASC
Note that the database designer
has chosen to use the non-arbitrary value of Column_Number
(which indicates where in the column order for the table the column lies, a
meaningful property) to serve as the second part of a two-part key for the Table_Columns table. Since it would be a denormalization to store the Table_ID
(which is an arbitrary key pointing to the
Tables table) in the Index_Columns table, the full
unique join to Table_Columns must combine Table_ID, from the Indexes table, and the (non-arbitrary) Column_Number, from the Index_Columns
table. If, on the other hand, the design simply used a sequence-generated
one-part Column_ID for the primary key of Table_Columns, then we would have a simple many-to-one join
pointing from Index_Columns to Table_Columns,
and there would be no need for a join between Indexes and Table_Columns
(although each might share a matching foreign key pointing to Table_ID, still).
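Under that alternative design, the same question could be answered by a query like the following sketch (the Index_ID and Column_ID join columns are assumed for illustration):
SELECT TC.Column_Name
FROM Indexes Ind, Index_Columns IC, Table_Columns TC
WHERE Ind.Index_Name = 'EMPLOYEES_X1'
  AND IC.Index_ID = Ind.Index_ID
  AND TC.Column_ID = IC.Column_ID      -- a simple many-to-one join; no join between Ind and TC is required
ORDER BY IC.Index_Position ASC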
Figure 8, multiple trees:
sub-graphs with no join between them
Here, treat each tree (or single, isolated node) as an
independent query, and find how many rows each of these independent queries
would return. (In other words, produce a query that strips out all references
(including references in the FROM clause) to T2 and any table that joins,
directly or indirectly, to T2, and a separate query that strips out all
references to T1 and any table that joins, directly or indirectly, to T1.)
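As a sketch (using the generic table names from the figure, with invented column names), a disconnected query might look like:
SELECT T1.Col_A, T2.Col_B
FROM T1, T2
WHERE T1.Col_C = :some_value
with no join condition connecting T1 and T2, and the two independent queries would then be:
SELECT T1.Col_A FROM T1 WHERE T1.Col_C = :some_value
SELECT T2.Col_B FROM T2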
Considering these separate queries, there are three cases to consider:
In case 1, the full query can be seen as two single-row
queries cleverly combined to return a longer single row. From a performance
perspective, this is harmless, or even very slightly helpful (since it avoids a
round trip to the database) versus the alternative of separate queries. It may
be somewhat confusing from a code-maintenance perspective, though.
In case 2, the original query returns a Cartesian product
between a single-row set and a multi-row set, which has exactly the same number
of result rows as the multi-row set. This saves a round trip to the database
versus the result if we used the separate queries, and it should not require
extra work inside the database, if the optimizer makes the right choices, but
it also means that the expressions queried for the single-row set (in the
SELECT list) are passed across the network multiple times, once for each row in
the multi-row set, using somewhat more bandwidth. Neither choice, neither the
original query nor the separated queries, is likely to be significantly better
than the other choice from a performance perspective, as long as the optimizer
makes the right choices. Separated queries may be clearer from a
code-maintenance perspective, though.
In case 3, the original query will at least sometimes return
a Cartesian product of two multi-row sets, all possible combinations of the
members of those two sets, that is. If the multi-row sets are large, this can
be disastrously expensive and slow, compared to the alternative of returning
each set in a separate, normal query. It is also likely that the code to
correctly handle the Cartesian product is more complex than the code that would
handle the separated queries. Three solutions are likely: find and add the join condition that was unintentionally left out and that connects the two sub-graphs; break the query up into separate queries, one per independent sub-graph; or change the database design to provide the missing link between the entities.
Figure 9, a tree with
multiple roots
Consider the query for Figure 9,
above, and imagine that we begin the execution plan by reading a row of the
table Master. Here I've introduced a new convention, the join ratios, shown as
numbers next to the join arrows. In each case, the number compares the rowcounts before and after performing that join. Thus, for
example, if we found 100 rows from Master, we would expect 3000 rows after
joining next to Root2, or 500 rows if we joined those 100 rows from Master to Root1.
(On the unique end of a join, the join ratio is at most 1, less if the join
doesn't always succeed.) Consider what happens to each row from Master, on
average: when it joins to Root1, the rowcount is
multiplied by 5, and when it joins to Root2, the rowcount
is multiplied by 30. As the result of joining Master to both Root1 and Root2, we find an average of 150
(5*30) combinations of the 5 rows
from Root1 and the 30 rows from Root2 that match each row from Master: a
Cartesian product matched to each row we read from Master. These
query result rows do not map one-to-one to any single table or business entity,
but rather to combinations of Root1 entities and Root2 entities, and it is
unlikely that this is a useful business result. Even if it is in some way a
useful result, it contains no more data than you would find by separately
reading the 5 matches from Root1 for each Master row, and the 30 matches from Root2
for each Master row, which is surely less work than reading the 150
combinations. There are two likely solutions:
Figure 10, first solution
for a tree with multiple roots
Here, the developer examines the
two joins to the two roots and finds that one of the joins is incomplete,
lacking some condition that would make the match unique on one end of the join. In the case shown, some added condition (or
perhaps some condition already in the WHERE clause that the developer did not
view as part of the join) allows the join from Master to reach a unique row in R1 (formerly called
Root1, but no longer a root). Addition of this extra condition eliminates the
application complexity and performance issues of the Cartesian products.
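As a hedged sketch (the Master_ID and Record_Type columns are invented for illustration), the incomplete join
R1.Master_ID = Master.Master_ID
might become unique on the R1 end once the forgotten qualifying condition is included:
R1.Master_ID = Master.Master_ID and R1.Record_Type = 'CURRENT'   -- assuming (Master_ID, Record_Type) is unique for R1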
Figure 11, second solution
for a tree with multiple roots
Here, there were no missing join
conditions found, but the developer recognizes that the application can better
handle the original request for data as two queries, one returning the single
tree having the single root Root1 (with 5 rows for every Master row), and the
other returning the single tree having the single root Root2 (with 30 rows for
every Master row).
There is a special case of the
tree with multiple roots. (I'm being a little loose with my language here;
this case is not technically a tree from the graph-theoretic point of view,
since a tree has one root by definition.) Sometimes, one of the detail
tables that makes up the extra root is usually
one-to-one or even (zero-to-one)-to-one in its relationship with the shared
Master table, such as is the case with Root1, below:
Figure 12, a less-problematic case of a tree with
multiple roots
From a performance perspective, this case is much less
worrisome. The Cartesian product between Root1 rows and Root2 rows for each
Master row will be much better behaved on
average, since the average multiplier for that product when joining to
Root1 will be less than one. There may still be occasional cases where we don't
see the average, which may create an occasional performance problem, though,
since the join to Root1 (lacking an arrow on the Root1 end) is not guaranteed unique. More importantly,
since the join to Root1 is not guaranteed unique, there can sometimes be
results that fail to map one-to-one to the apparent true root, Root2, and this
is likely to create mischief from a functional
perspective. For example, if the software is adding up values, prices for
example, for the rows this query returns, it will occasionally count the same
Root2 entity multiple times, and this will almost certainly be dangerously
wrong. Furthermore, while such a problem would likely be obvious and would get
repaired in early testing in the case where both roots always join
many-to-many with the shared Master, in this more subtle case the behavior may
be correct often enough that the incorrect case escapes notice all too long,
and the defect finds its way into production. Therefore, even in this case
(perhaps especially in this case!),
you should look for the types of solutions shown in Figures 10 and 11, where
functional changes rigorously prevent
the many-to-many Cartesian products even in the rare worst cases.
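To see the functional danger concretely, consider a sketch like the following, which sums a price column (the Master_ID and Price columns are invented for illustration):
SELECT SUM(R2.Price)
FROM Master M, Root1 R1, Root2 R2
WHERE R1.Master_ID = M.Master_ID
  AND R2.Master_ID = M.Master_ID
For any Master row that happens to have two matching Root1 rows, every matching Root2 price is counted twice in the sum, even though Root1 contributes nothing to the value being summed.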
Figure 13, a many-to-many join
Here, we have a many-to-many join, another case where we get
a many-to-many Cartesian product, this time a Cartesian product for each value
of the non-unique (in both directions) join key that joins T1 and T2 directly.
Technically, this also invariably creates a diagram with multiple roots, so it
is a special case of the multiple-roots problem, with all the same implications
and similar solutions: either break up the query, change the database design,
or (more often) find the missing join condition that guarantees uniqueness on at least one side of the join. Just as for
the earlier case, there is a case that is less-problematic from the performance
perspective, where the join is usually unique on one side:
Figure 14, a many-to-many join that is usually unique
on one end
Just as in the earlier multiple-roots case, this case is
likely OK from a performance perspective, but it may be insidiously wrong
functionally, with a hard-to-find defect that occasionally returns wrong
results in the rare true many-to-many case. Just as before, the usual correct
solution is to make the join rigorously unique on at least one end, either with
a change to the SQL or a change to the database design (perhaps a new foreign key
in T1 that points to the true primary key of T2, for example).
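As a sketch of that design-change option (the T2_ID column is hypothetical): if the true primary key of T2 is T2_ID, adding a T2_ID foreign-key column to T1 allows the join to be written as
T1.T2_ID = T2.T2_ID      -- T2_ID here is the hypothetical single-column primary key of T2
which is guaranteed unique on the T2 end, in place of the original many-to-many join on the non-unique key.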
Remember
the earlier case of a one-to-one join creating an implied cycle in the join
diagram:
Figure 15, a one-to-one join that creates an implied
cycle
I promised
I would return to this case with a more-general discussion of all one-to-one
joins:
Figure 16, a one-to-one join
In general, tables that join one-to-one represent the same
entity, or perhaps subsets of the same entity. In such cases, it is fair to ask
why we should not simply consolidate these tables into a single larger table
that comprehensively represents all the data for that entity in a single table,
saving joins. This is often useful, from a performance perspective. Unlike most
database design changes, consolidating one-to-one tables is not a denormalization. Even if the tables are not combined, they
should at least take full advantage of the one-to-one relationship: the primary
key of each table should be reused as the foreign key of the other table,
rather than having redundant unique keys. There are several subcases
of this opportunity for design improvement by consolidating tables:
Figure 17, rigorously
one-to-one tables
In the case where T1 and T2 are each guaranteed to have a
match in the other table, the case for consolidation into a single table is
particularly strong and simple. The only case where you might choose to keep
the tables separate (if you have complete control of the design, and cost of
making the change isn't a problem) would be the case where one of these tables
is read frequently, but the columns of the other table all contain
rarely-referenced, perhaps-high-volume, supplementary data. In this case, by
keeping the tables separate, you can cache more rows of the frequently-used
table in a given amount of cache memory, helping performance of the most
frequent queries, while only rarely paying the cost of the join to the
less-well-cached supplementary-data table. The benefits of better caching on
average may exceed the cost of the occasional extra join and the extra insert
for each row created.
The problem is less simple if the join is
one-to-(zero-to-one):
Figure 18, a one-to-(zero-to-one) join
Here, the case for consolidating tables is still strong,
though strictly speaking T2 represents a subset of T1, not quite the whole set.
There is a special caution here, though: the constraints of T2 (such as a NOT
NULL constraint on a column) likely apply only to the subset of T1 entities
that are also in T2, so if we add the T2 columns to T1 to combine the tables,
the new constraints that enforce the old T2-entity constraints will be more
complex, only applying to the subset of
rows that would have been in T2 when these were separate tables. There is also
likely to be a new is_a_type_T2_flag column required just to specify that the
entity would have been in the T2 subset as well as in T1 under the old design.
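As a sketch of what the combined-table design might require (the T2_Col1 column is invented for illustration; Oracle syntax is assumed, consistent with the (+) notation used elsewhere in this paper):
ALTER TABLE T1 ADD (is_a_type_T2_flag CHAR(1) DEFAULT 'N' NOT NULL, T2_Col1 VARCHAR2(30));
ALTER TABLE T1 ADD CONSTRAINT T2_Col1_required_for_T2_rows
  CHECK (is_a_type_T2_flag = 'N' OR T2_Col1 IS NOT NULL);
The check constraint enforces what was formerly a simple NOT NULL constraint on T2, but only for the rows flagged as belonging to the old T2 subset.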
There is yet another case where the subset is a particularly small subset of the larger set:
Figure 19, a one-to-(zero-to-one) join that is usually one-to-zero
In this case, at least from a performance perspective, it is likely not worth loading up the larger
table T1 with columns from the much smaller table T2, since those columns apply
only to a tiny subset of the T1 entities. In fact, having the small T2 subset consolidated into its own
table can be a major performance win, if that is the subset we really wish to
query, often. In such cases, a query that drives from T2, with nested loops to
any large tables, is likely to be very fast, whereas optimization of a query with
the data consolidated onto T1 would be harder. Yet another case comes when the join
is optional in both directions:
Figure 20, a (zero-to-one)-to-(zero-to-one) join,
optional in both directions
Consolidating tables is less simple in this case, since
neither table's entity is a subset of the other. However, a case like this
implies that there is a third more-inclusive set of entities that contains both
subsets, and that both these tables could be merged into a new table based on
the more-inclusive entity. (Think, for example, of merging an Employees table
and a Spouses table into a Persons table.)
Figure 21, a filtered outer join
Here, I've introduced another refinement to the join
diagram, the number 0.5 alongside the table T2. This is the filter ratio, and
it signifies that some WHERE-clause condition on T2 is true only for half of
the rows in that table. Almost all such conditions, however, will defeat the
usual purpose of the outer join, which is to preserve (in the query result) all
rows from T1 that meet the conditions on the inner-joined tables, regardless of
what we find, or do not find, in T2. The only likely case where a filtered
outer join may be correct (though misleading) is where the filter discards all
the outer-case rows, the rows that fail to find T2 rows matching T1 rows, and
the join is then effectively and correctly an inner join. Of course, if the
intent is only to preserve the inner-joined rows, then it is far clearer to
express the query as an explicit inner join, and this is the right choice:
filtered inner joins are perfectly normal.
An example of a filtered outer join that might seem correct would have conditions such as:
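T1.FKey=T2.PKey(+) and NVL(T2.Type,'TX')='TX'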
In this case, the pure-outer case, where we fail to find any
match in T2 having the given PKey value, is preserved,
since in that case T2.Type will be assigned a null value, and
NVL(T2.Type,'TX') will evaluate to 'TX'. However, consider that we might find a
matching row in T2 having the wrong
value of T2.Type, and in this case we will discard the row from T1 that
matched that row in T2: we keep the row in T1 if we find no match at all, but
discard it if we find the wrong sort of match. This might be the functional
intent, but it is not typically the intent where we have an outer join. Much
more likely, the intent is correctly expressed by:
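T1.FKey=T2.PKey(+) and T2.Type(+)='TX'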
which preserves the T1 rows regardless of
what we find in T2.
One uncommon but possible case where a filtered outer join makes
sense has conditions like:
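T1.FKey=T2.PKey(+) and T2.PKey IS NULL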
At first glance this seems bizarre, since it is probable
that T2.PKey is positively forbidden
to have a null value by a constraint on T2, making it appear that the query
demands an impossible condition and will never return rows! However, when we
take the outer join into consideration, we can see that this is really a query
that requests only the outer-case
rows, the rows in T1 that fail to find a matching row in T2, discarding the
inner-case rows that find a match. (The outer case rows will assign a null to
T2.PKey in the result, making the second condition true.) This is really just
an alternate way to express a NOT EXISTS condition! It is occasionally useful
to express the condition in this unusual way to gain added control of when the
NOT EXISTS condition is evaluated, to force the fastest possible execution
plan. Of course, since it is an unusual and non-intuitive way to express that NOT
EXISTS condition, this deserves a comment added to the SQL to explain what is
going on, so later developers maintaining the code are not confused and do not
undo the correct functionality and performance.
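For reference, the same rows could be requested with an explicit NOT EXISTS (a sketch using the same generic names), though the outer-join form above gives more control over when and how the condition is evaluated:
SELECT ...
FROM T1
WHERE NOT EXISTS (SELECT null FROM T2 WHERE T2.PKey = T1.FKey)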
Figure 22, an inner join below an outer join
This case is always a mistake: the inner join from T2 to T3
will only ever be possible in the inner case of the join from T1 to T2, so, if
the current functionality is correct, both joins should be explicitly inner, to
make the intent clearer:
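T1.FKey1=T2.PKey1 and T2.FKey2=T3.PKey2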
Alternatively, if the original apparent intent to preserve
rows from T1 regardless of what happens to the join to T2 was correct, then we
need to make both joins outer:
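T1.FKey1=T2.PKey1(+) and T2.FKey2=T3.PKey2(+)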
Figure 23, an upward outer join
The join in this case looks like:
T2.PKey=T1.FKey(+)
The (+) side of the join, in other words, is on the side of
the non-unique foreign key, not on the primary-key side. This is subtly
unlikely to be correct because of the usual rule that rows from a query should
map one-to-one to some subset of a single table (normally the table at the
single root of the tree). In this case, however, where the join finds a match (or
multiple matches) in T1, in the inner case of the outer join, in other words,
the result maps one-to-one to rows of T1, but in the outer case, where there is
no match in T1, the rows map one-to-one to rows of T2. It is just possible that
in a single query you actually want this sort of heterogeneous result, mapping
partly to T2 and partly to T1, but it is uncommon and unlikely. More likely, the
upward outer join is simply a mistake, functionally.
Figure 24, a filtered upward outer join
This case seems to have two problems: a filter on the outer
side of the join, and an upward outer join. Surprisingly enough, occasionally
two wrongs do make a right! If the
filter on T1 is something like
T1.Fkey IS NULL
which is only true in the outer case of
the outer join:
T2.PKey=T1.FKey(+)
then we are actually discarding the
entire inner case of the join, and just preserving the outer case, the
equivalent of a NOT EXISTS condition on a correlated subquery
of T1. In this case, the query result will map rigorously one-to-one to rows of
T2, only, so the result follows the rule for a single table mapping one-to-one
to all result rows. Of course, if the filter has any effect other than to
discard all the inner-case rows from the join, then this has both the usual
problems with upward outer joins and the usual problems with filtered outer
joins.
Recurring types of query anomalies are easy to recognize and
to describe when the query is shown as a join diagram. This is one of the
principal benefits of diagramming a query after it is written, or even of beginning with a diagram before
writing the SQL. This paper would have been far harder to write, and,
I believe, far harder to understand, if I had tried to describe and generalize
these anomalies purely in SQL form. Join diagrams are often useful for SQL
tuning, but the way they make anomalies instantly obvious is a major
side-benefit of the method, and is reason enough to diagram complex queries that
are otherwise all too likely to contain these insidious defects.
Most join-diagram anomalies point to a mistake in the query,
in the database design, or in the application. If the mistake is in the
application layer, or, even worse, in the design layer, the solution can be
expensive and time-consuming. Consider, though, that if the solution is further
delayed, it is likely to grow even more
expensive to fix later, so do not let cost automatically put these fixes on
hold!