A Taxonomy of Problematic SQL
Most bad SQL follows recurring patterns, and there are systematic
ways to recognize those patterns and to avoid or fix the problems they signal.
Just as a zoologist needs taxonomy to classify and recognize
the species in an ecosystem, and to understand their relationships, a developer
needs a systematic method to understand the ways that SQL can be broken, and to
repair those problems, especially when maintaining someone
else's SQL, often SQL from a source that cannot be reached to explain the
purpose behind the SQL and how it was written. Even without knowing the purpose
behind the SQL, certain patterns point to recognizable, common problems
that need repair. This presentation, borrowing from a chapter in the author's
book, SQL Tuning, describes a taxonomy of abnormal
SQL: SQL that often requires repair or points to opportunities for database
design improvements. (O'Reilly retains the copyright on the material taken directly
from the book.)
Many of the rules for classifying SQL are vastly easier to
express and understand in terms of join trees, abstract representations of the
SQL that I explain in much more detail in the book, SQL Tuning. In case you are
not already familiar with these, and lack a copy of the book, I'll introduce
join trees briefly here.
Consider a query:
SELECT ...
FROM Orders O, Order_Details OD, Products P, Customers C,
     Shipments S, Addresses A, Code_Translations ODT, Code_Translations OT
WHERE UPPER(C.Last_Name) LIKE :Last_Name||'%'
  AND OD.Order_ID = O.Order_ID
  AND O.Customer_ID = C.Customer_ID
  AND OD.Product_ID = P.Product_ID
  AND OD.Shipment_ID = S.Shipment_ID
  AND S.Address_ID = A.Address_ID(+)
  AND O.Status_Code = OT.Code
  AND OT.Code_Type = 'ORDER_STATUS'
  AND OD.Status_Code = ODT.Code
  AND ODT.Code_Type = 'ORDER_DETAIL_STATUS'
ORDER BY ...;
The join tree that abstractly represents the above query is as follows:
Figure 1, a join tree
representing the above query
In this diagram:
Each table is a node, represented by its alias.
Each join is a link, with (usually
downward-pointing) arrows pointing toward any side of the join that is unique.
Midpoint arrows point toward the optional side of any outer join.
For example, the table Shipments
is represented by the node labeled S, and the outer join S.Address_ID
= A.Address_ID(+) is represented by the downward-pointing arrow from S to
A, with the midpoint arrow pointing toward the (+) side of that join and the
end-point arrow pointing toward A because Address_ID is unique for the table Addresses.
Following these rules to create
join diagrams, we find a number of regularities in most well-written SQL:
The query maps to one tree.
The tree has one root, exactly one table with no
join to its primary key.
All joins have downward-pointing arrows (joins
unique on one end).
Outer joins point down (toward the primary key, that is), with only outer joins below outer joins.
The question that the query answers is basically a
question about the entity represented at the top (root) of the tree (or about
aggregations of that entity).
The other tables just provide reference data stored
elsewhere for normalization.
If we know in advance that good
SQL tends to follow these patterns, the main taxonomic grouping of SQL should
be simply normal queries that precisely meet the above conditions. The rest of
our taxonomy will consist of query classes that fail to meet precisely these
conditions in specific ways. I will describe each of these special query
classes in terms of the atypical features seen in the query diagrams
themselves. These join trees provide a convenient shorthand for generalizing
the type of abnormality in a form far easier to recognize (once you really
understand join trees) than any description that focuses on the SQL itself. In
each case, I will use a convention where I show the essential abnormal feature
of the abnormal diagram, but I hide (behind gray clouds) the parts of the
diagram that have no bearing on the abnormality. Each atypical feature will
tend to point to problems or opportunities for changes to the SQL or the
database design.
Figure 2, a case-1 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey2
and T1.fkey1=T3.pkey3,
with T2.pkey2=T3.pkey3, explicitly also
in the WHERE clause, or just implied by transitivity. This type of join tree is
an opportunity for optimization, not a problem, allowing all possible join
orders between these three tables. Depending on the inclusion or exclusion of
conditions implied by transitivity, the direct representation of the SQL as
written might look like any of the following three subcases:
Figure 3, alternative join
trees where the cyclic join graph may just be implied
Subcases A, B, and C show where the missing
link in the graphs may only be implied by transitivity. In subcases
A and B, it is easy to see from the diagram alone that the foreign key pointing
to T3 or T2 must equally point to the primary key of the other table, since
those two tables, joining one-to-one, presumably share primary key values. In subcase C, the diagram alone does not imply a hidden cycle;
you must notice in the SQL itself that the foreign key pointing from T1 to T2
is the same foreign key that points from T1 to T3.
Note that although Case 1 of cyclic join trees does not
automatically create a problem, this case does involve a one-to-one join, which
we discuss further later in this paper.
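As a concrete sketch (the table and column names here are invented purely for illustration), SQL producing a case-1 cycle might look like:
SELECT ...
FROM Order_Details OD, Orders O, Order_Summaries OS
WHERE OD.Order_ID = O.Order_ID      -- T1.fkey1 = T2.pkey2
  AND OD.Order_ID = OS.Order_ID     -- T1.fkey1 = T3.pkey3
  AND O.Order_ID = OS.Order_ID      -- T2.pkey2 = T3.pkey3, the one-to-one join closing the cycle
Since Orders and Order_Summaries (in this invented example) join one-to-one on a shared primary key, any two of these three conditions imply the third by transitivity, and the optimizer is free to choose any join order among the three tables.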
Figure 4, a case-2 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey2 and T1.fkey2=T3.pkey3 and
T2.fkey2=T3.pkey3.
Note that you will almost certainly find that
T1.fkey2=T2.fkey2, systematically, whenever T2 is a master row to the T1
detail (in other words, wherever T1.fkey1=T2.pkey2). This is because fkey2 in T1
is actually a denormalization here: for example, a Customer_ID column in Order_Lines,
when the normalized home of Customer_ID is
Orders. This denormalization, like most denormalizations, is likely not needed (that is, the fkey2
column likely should be dropped from T1), and it certainly isn't likely that
both the join to the normalized foreign key and the join to the denormalized
copy of that key are needed from a functional perspective. The extra denormalized join here is likely to fool the optimizer
into expecting that almost no rows will satisfy all three joins, since the
optimizer views these joins as statistically independent conditions (which is
the opposite of the truth). This false expectation is likely to lead the
optimizer to make bad choices, since it will likely drastically underestimate
how many rows will remain after performing all three joins.
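As a concrete sketch of the Order_Lines example just described (the column names are assumed for illustration), a case-2 cycle might look like:
SELECT ...
FROM Order_Lines OL, Orders O, Customers C
WHERE OL.Order_ID = O.Order_ID         -- T1.fkey1 = T2.pkey2
  AND O.Customer_ID = C.Customer_ID    -- T2.fkey2 = T3.pkey3, the normalized join
  AND OL.Customer_ID = C.Customer_ID   -- T1.fkey2 = T3.pkey3, joining through the denormalized copy
Here OL.Customer_ID is a denormalized copy of O.Customer_ID, so the last condition eliminates nothing that the first two do not already eliminate, even though the optimizer is likely to treat it as an independent condition.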
Figure 5, a case-3 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey1 and
T1.fkey2=T3.pkey2, and T2.col3=T3.col4.
Though this last condition is
technically a join, it is best to treat the last join as a constrained filter
condition that we cannot evaluate without both T2 and T3. The constrained
filter condition requires that you join to both tables to pick up the filter
selectivity. In choosing a join order, I generally ignore these constrained
filters until the join order reaches one of the two tables involved (choosing that
much of the join order without regard to the constrained filter), and from that point on I treat the
filter as a single-table filter on the other table.
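As a hypothetical sketch (the tables and columns are invented for illustration), a case-3 cycle might look like:
SELECT ...
FROM Order_Details OD, Shipments S, Warehouses W
WHERE OD.Shipment_ID = S.Shipment_ID     -- T1.fkey1 = T2.pkey1
  AND OD.Warehouse_ID = W.Warehouse_ID   -- T1.fkey2 = T3.pkey2
  AND S.Region_Code = W.Region_Code      -- T2.col3 = T3.col4, the constrained filter condition
The last condition cannot be evaluated until the join order reaches both S and W; once the join order reaches either of those tables, the condition acts as a filter on the other.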
Figure 6, a case-4 cyclic
join tree
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey1 and T1.fkey2=T3.pkey_col1, and
T2.col=T3.pkey_col2.
Note that T3 has a two-part primary key, here, and that the
joins to the two-part key come from two different tables that are themselves
joined in a master-detail relationship. This is uncommon, but happens sometimes,
especially where the database tries to take advantage of natural keys, keys
built from columns having some external meaning (as opposed to just being
arbitrary numbers generated from some sequence). I don't generally recommend
this practice of using non-arbitrary primary key columns, but SQL developers
often have no control over these design decisions, so you may find yourself
with cases like this. Treat both joins to T3 as a constrained join: don't join
to T3 until you have reached both T1
and T2 in the join order. For purposes of diagramming, treat the join to T3 as
a single arrow having three end points:
Figure 7, a case-4 cyclic
join tree, transformed to show the 3-ended unique join
To make this case concrete, here
is an example of a query against a hypothetical data dictionary that would see
this case:
SELECT TC.Column_Name
FROM Indexes Ind, Index_Columns IC, Table_Columns TC
WHERE Ind.Index_Name = 'EMPLOYEES_X1'
  AND IC.Index_ID = Ind.Index_ID
  AND TC.Table_ID = Ind.Table_ID
  AND TC.Column_Number = IC.Column_Number
ORDER BY IC.Index_Position ASC
Note that the database designer
has chosen to use the non-arbitrary value of Column_Number
(which indicates where in the column order for the table the column lies, a
meaningful property) to serve as the second part of a two-part key for the Table_Columns table. Since it would be a denormalization to store the Table_ID
(which is an arbitrary key pointing to the
Tables table) in the Index_Columns table, the full
unique join to Table_Columns must combine Table_ID, from the Indexes table, and the (non-arbitrary) Column_Number, from the Index_Columns
table. If, on the other hand, the design simply used a sequence-generated
one-part Column_ID for the primary key of Table_Columns, then we would have a simple many-to-one join
pointing from Index_Columns to Table_Columns,
and there would be no need for a join between Indexes and Table_Columns
(although each might share a matching foreign key pointing to Table_ID, still).
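Under that alternative design, the same question could be answered by a query like the following sketch (the Index_ID and Column_ID join columns are assumed for illustration):
SELECT TC.Column_Name
FROM Indexes Ind, Index_Columns IC, Table_Columns TC
WHERE Ind.Index_Name = 'EMPLOYEES_X1'
  AND IC.Index_ID = Ind.Index_ID
  AND TC.Column_ID = IC.Column_ID      -- a simple many-to-one join; no join between Ind and TC is required
ORDER BY IC.Index_Position ASC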
Figure 8, multiple trees:
sub-graphs with no join between them
Here, treat each tree (or single, isolated node) as an
independent query, and find how many rows each of these independent queries
would return. (In other words, produce a query that strips out all references
(including references in the FROM clause) to T2 and any table that joins,
directly or indirectly, to T2, and a separate query that strips out all
references to T1 and any table that joins, directly or indirectly, to T1.)
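As a sketch (using the generic table names from the figure, with invented column names), a disconnected query might look like:
SELECT T1.Col_A, T2.Col_B
FROM T1, T2
WHERE T1.Col_C = :some_value
with no join condition connecting T1 and T2, and the two independent queries would then be:
SELECT T1.Col_A FROM T1 WHERE T1.Col_C = :some_value
SELECT T2.Col_B FROM T2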
Considering these separate queries, there are three cases to consider:
In case 1, the full query can be seen as two single-row
queries cleverly combined to return a longer single row. From a performance
perspective, this is harmless, or even very slightly helpful (since it avoids a
round trip to the database) versus the alternative of separate queries. It may
be somewhat confusing from a code-maintenance perspective, though.
In case 2, the original query returns a Cartesian product
between a single-row set and a multi-row set, which has exactly the same number
of result rows as the multi-row set. This saves a round trip to the database
versus the result if we used the separate queries, and it should not require
extra work inside the database, if the optimizer makes the right choices, but
it also means that the expressions queried for the single-row set (in the
SELECT list) are passed across the network multiple times, once for each row in
the multi-row set, using somewhat more bandwidth. Neither choice, neither the
original query nor the separated queries, is likely to be significantly better
than the other choice from a performance perspective, as long as the optimizer
makes the right choices. Separated queries may be clearer from a
code-maintenance perspective, though.
In case 3, the original query will at least sometimes return
a Cartesian product of two multi-row sets, all possible combinations of the
members of those two sets, that is. If the multi-row sets are large, this can
be disastrously expensive and slow, compared to the alternative of returning
each set in a separate, normal query. It is also likely that the code to
correctly handle the Cartesian product is more complex than the code that would
handle the separated queries. Three solutions are likely: find and add the join condition that was unintentionally left out and that connects the two sub-graphs; break the query up into separate queries, one per independent sub-graph; or change the database design to provide the missing link between the entities.
Figure 9, a tree with
multiple roots
Consider the query for Figure 9,
above, and imagine that we begin the execution plan by reading a row of the
table Master. Here I've introduced a new convention, the join ratios, shown as
numbers next to the join arrows. In each case, the number compares the rowcounts before and after performing that join. Thus, for
example, if we found 100 rows from Master, we would expect 3000 rows after
joining next to Root2, or 500 rows if we joined those 100 rows from Master to Root1.
(On the unique end of a join, the join ratio is at most 1, less if the join
doesn't always succeed.) Consider what happens to each row from Master, on
average: when it joins to Root1, the rowcount is
multiplied by 5, and when it joins to Root2, the rowcount
is multiplied by 30. As the result of joining Master to both Root1 and Root2, we find an average of 150
(5*30) combinations of the 5 rows
from Root1 and the 30 rows from Root2 that match each row from Master: a
Cartesian product matched to each row we read from Master. These
query result rows do not map one-to-one to any single table or business entity,
but rather to combinations of Root1 entities and Root2 entities, and it is
unlikely that this is a useful business result. Even if it is in some way a
useful result, it contains no more data than you would find by separately
reading the 5 matches from Root1 for each Master row, and the 30 matches from Root2
for each Master row, which is surely less work than reading the 150
combinations. There are two likely solutions:
Figure 10, first solution
for a tree with multiple roots
Here, the developer examines the
two joins to the two roots and finds that one of the joins is incomplete,
lacking some condition that would make the match unique on one end of the join. In the case shown, some added condition (or
perhaps some condition already in the WHERE clause that the developer did not
view as part of the join) allows the join from Master to reach a unique row in R1 (formerly called
Root1, but no longer a root). Addition of this extra condition eliminates the
application complexity and performance issues of the Cartesian products.
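As a hedged sketch (the Master_ID and Record_Type columns are invented for illustration), the incomplete join
R1.Master_ID = Master.Master_ID
might become unique on the R1 end once the forgotten qualifying condition is included:
R1.Master_ID = Master.Master_ID and R1.Record_Type = 'CURRENT'   -- assuming (Master_ID, Record_Type) is unique for R1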
Figure 11, second solution
for a tree with multiple roots
Here, there were no missing join
conditions found, but the developer recognizes that the application can better
handle the original request for data as two queries, one returning the single
tree having the single root Root1 (with 5 rows for every Master row), and the
other returning the single tree having the single root Root2 (with 30 rows for
every Master row).
There is a special case of the
tree with multiple roots. (I'm being a little loose with my language here;
this case is not technically a tree from the graph-theoretic point of view,
since a tree has one root by definition.) Sometimes, one of the detail
tables that makes up the extra root is usually
one-to-one or even (zero-to-one)-to-one in its relationship with the shared
Master table, such as is the case with Root1, below:
Figure 12, a less-problematic case of a tree with
multiple roots
From a performance perspective, this case is much less
worrisome. The Cartesian product between Root1 rows and Root2 rows for each
Master row will be much better behaved on
average, since the average multiplier for that product when joining to
Root1 will be less than one. There may still be occasional cases where we don't
see the average, which may create an occasional performance problem, though,
since the join to Root1 (lacking an arrow on the Root1 end) is not guaranteed unique. More importantly,
since the join to Root1 is not guaranteed unique, there can sometimes be
results that fail to map one-to-one to the apparent true root, Root2, and this
is likely to create mischief from a functional
perspective. For example, if the software is adding up values, prices for
example, for the rows this query returns, it will occasionally count the same
Root2 entity multiple times, and this will almost certainly be dangerously
wrong. Furthermore, while such a problem would likely be obvious and would get
repaired in early testing in the case where both roots always join
many-to-many with the shared Master, in this more subtle case the behavior may
be correct often enough that the incorrect case escapes notice all too long,
and the defect finds its way into production. Therefore, even in this case
(perhaps especially in this case!),
you should look for the types of solutions shown in Figures 10 and 11, where
functional changes rigorously prevent
the many-to-many Cartesian products even in the rare worst cases.
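To see the functional danger concretely, consider a sketch like the following, which sums a price column (the Master_ID and Price columns are invented for illustration):
SELECT SUM(R2.Price)
FROM Master M, Root1 R1, Root2 R2
WHERE R1.Master_ID = M.Master_ID
  AND R2.Master_ID = M.Master_ID
For any Master row that happens to have two matching Root1 rows, every matching Root2 price is counted twice in the sum, even though Root1 contributes nothing to the value being summed.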
Figure 13, a many-to-many join
Here, we have a many-to-many join, another case where we get
a many-to-many Cartesian product, this time a Cartesian product for each value
of the non-unique (in both directions) join key that joins T1 and T2 directly.
Technically, this also invariably creates a diagram with multiple roots, so it
is a special case of the multiple-roots problem, with all the same implications
and similar solutions: either break up the query, change the database design,
or (more often) find the missing join condition that guarantees uniqueness on at least one side of the join. Just as for
the earlier case, there is a case that is less-problematic from the performance
perspective, where the join is usually unique on one side:
Figure 14, a many-to-many join that is usually unique
on one end
Just as in the earlier multiple-roots case, this case is
likely OK from a performance perspective, but it may be insidiously wrong
functionally, with a hard-to-find defect that occasionally returns wrong
results in the rare true many-to-many case. Just as before, the usual correct
solution is to make the join rigorously unique on at least one end, either with
a change to the SQL or a change to the database design (perhaps a new foreign key
in T1 that points to the true primary key of T2, for example).
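As a sketch of that design-change option (the T2_ID column is hypothetical): if the true primary key of T2 is T2_ID, adding a T2_ID foreign-key column to T1 allows the join to be written as
T1.T2_ID = T2.T2_ID      -- T2_ID here is the hypothetical single-column primary key of T2
which is guaranteed unique on the T2 end, in place of the original many-to-many join on the non-unique key.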
Remember
the earlier case of a one-to-one join creating an implied cycle in the join
diagram:
Figure 15, a one-to-one join that creates an implied
cycle
I promised
I would return to this case with a more-general discussion of all one-to-one
joins:
Figure 16, a one-to-one join
In general, tables that join one-to-one represent the same
entity, or perhaps subsets of the same entity. In such cases, it is fair to ask
why we should not simply consolidate these tables into a single larger table
that comprehensively represents all the data for that entity in a single table,
saving joins. This is often useful, from a performance perspective. Unlike most
database design changes, consolidating one-to-one tables is not a denormalization. Even if the tables are not combined, they
should at least take full advantage of the one-to-one relationship: the primary
key of each table should be reused as the foreign key of the other table,
rather than having redundant unique keys. There are several subcases
of this opportunity for design improvement by consolidating tables:
Figure 17, rigorously
one-to-one tables
In the case where T1 and T2 are each guaranteed to have a
match in the other table, the case for consolidation into a single table is
particularly strong and simple. The only case where you might choose to keep
the tables separate (if you have complete control of the design, and cost of
making the change isn't a problem) would be the case where one of these tables
is read frequently, but the columns of the other table all contain
rarely-referenced, perhaps-high-volume, supplementary data. In this case, by
keeping the tables separate, you can cache more rows of the frequently-used
table in a given amount of cache memory, helping performance of the most
frequent queries, while only rarely paying the cost of the join to the
less-well-cached supplementary-data table. The benefits of better caching on
average may exceed the cost of the occasional extra join and the extra insert
for each row created.
The problem is less simple if the join is
one-to-(zero-to-one):
Figure 18, a one-to-(zero-to-one) join
Here, the case for consolidating tables is still strong,
though strictly speaking T2 represents a subset of T1, not quite the whole set.
There is a special caution here, though: the constraints of T2 (such as a NOT
NULL constraint on a column) likely apply only to the subset of T1 entities
that are also in T2, so if we add the T2 columns to T1 to combine the tables,
the new constraints that enforce the old T2-entity constraints will be more
complex, only applying to the subset of
rows that would have been in T2 when these were separate tables. There is also
likely to be a new is_a_type_T2_flag column required just to specify that the
entity would have been in the T2 subset as well as in T1 under the old design.
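As a sketch of what the combined-table design might require (the T2_Col1 column is invented for illustration; Oracle syntax is assumed, consistent with the (+) notation used elsewhere in this paper):
ALTER TABLE T1 ADD (is_a_type_T2_flag CHAR(1) DEFAULT 'N' NOT NULL, T2_Col1 VARCHAR2(30));
ALTER TABLE T1 ADD CONSTRAINT T2_Col1_required_for_T2_rows
  CHECK (is_a_type_T2_flag = 'N' OR T2_Col1 IS NOT NULL);
The check constraint enforces what was formerly a simple NOT NULL constraint on T2, but only for the rows flagged as belonging to the old T2 subset.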
There is yet another case where the subset is a particularly small subset of the larger set:
Figure 19, a one-to-(zero-to-one) join that is usually one-to-zero
In this case, at least from a performance perspective, it is likely not worth loading up the larger
table T1 with columns from the much smaller table T2, since those columns apply
only to a tiny subset of the T1 entities. In fact, having the small T2 subset consolidated into its own
table can be a major performance win, if that is the subset we really wish to
query, often. In such cases, a query that drives from T2, with nested loops to
any large tables, is likely to be very fast, whereas optimization of a query with
the data consolidated onto T1 would be harder. Yet another case comes when the join
is optional in both directions:
Figure 20, a (zero-to-one)-to-(zero-to-one) join,
optional in both directions
Consolidating tables is less simple in this case, since
neither table's entity is a subset of the other. However, a case like this
implies that there is a third more-inclusive set of entities that contains both
subsets, and that both these tables could be merged into a new table based on
the more-inclusive entity. (Think, for example, of merging an Employees table
and a Spouses table into a Persons table.)
Figure 21, a filtered outer join
Here, I've introduced another refinement to the join
diagram, the number 0.5 alongside the table T2. This is the filter ratio, and
it signifies that some WHERE-clause condition on T2 is true only for half of
the rows in that table. Almost all such conditions, however, will defeat the
usual purpose of the outer join, which is to preserve (in the query result) all
rows from T1 that meet the conditions on the inner-joined tables, regardless of
what we find, or do not find, in T2. The only likely case where a filtered
outer join may be correct (though misleading) is where the filter discards all
the outer-case rows, the rows that fail to find T2 rows matching T1 rows, and
the join is then effectively and correctly an inner join. Of course, if the
intent is only to preserve the inner-joined rows, then it is far clearer to
express the query as an explicit inner join, and this is the right choice:
filtered inner joins are perfectly normal.
An example of a filtered outer join that might seem correct would have conditions such as:
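T1.FKey=T2.PKey(+) and NVL(T2.Type,'TX')='TX'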
In this case, the pure-outer case, where we fail to find any
match in T2 having the given PKey value, is preserved,
since in that case T2.Type will be assigned a null value, and
NVL(T2.Type,'TX') will evaluate to 'TX'. However, consider that we might find a
matching row in T2 having the wrong
value of T2.Type, and in this case we will discard the row from T1 that
matched that row in T2: we keep the row in T1 if we find no match at all, but
discard it if we find the wrong sort of match. This might be the functional
intent, but it is not typically the intent where we have an outer join. Much
more likely, the intent is correctly expressed by:
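T1.FKey=T2.PKey(+) and T2.Type(+)='TX'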
which preserves the T1 rows regardless of
what we find in T2.
One uncommon but possible case where a filtered outer join makes
sense has conditions like:
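T1.FKey=T2.PKey(+) and T2.PKey IS NULL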
At first glance this seems bizarre, since it is probable
that T2.PKey is positively forbidden
to have a null value by a constraint on T2, making it appear that the query
demands an impossible condition and will never return rows! However, when we
take the outer join into consideration, we can see that this is really a query
that requests only the outer-case
rows, the rows in T1 that fail to find a matching row in T2, discarding the
inner-case rows that find a match. (The outer case rows will assign a null to
T2.PKey in the result, making the second condition true.) This is really just
an alternate way to express a NOT EXISTS condition! It is occasionally useful
to express the condition in this unusual way to gain added control of when the
NOT EXISTS condition is evaluated, to force the fastest possible execution
plan. Of course, since it is an unusual and non-intuitive way to express that NOT
EXISTS condition, this deserves a comment added to the SQL to explain what is
going on, so later developers maintaining the code are not confused and do not
undo the correct functionality and performance.
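For reference, the same rows could be requested with an explicit NOT EXISTS (a sketch using the same generic names), though the outer-join form above gives more control over when and how the condition is evaluated:
SELECT ...
FROM T1
WHERE NOT EXISTS (SELECT null FROM T2 WHERE T2.PKey = T1.FKey)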
Figure 22, an inner join below an outer join
This case is always a mistake: the inner join from T2 to T3
will only ever be possible in the inner case of the join from T1 to T2, so, if
the current functionality is correct, both joins should be explicitly inner, to
make the intent clearer:
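T1.FKey1=T2.PKey1 and T2.FKey2=T3.PKey2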
Alternatively, if the original apparent intent to preserve
rows from T1 regardless of what happens to the join to T2 was correct, then we
need to make both joins outer:
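T1.FKey1=T2.PKey1(+) and T2.FKey2=T3.PKey2(+)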
Figure 23, an upward outer join
The join in this case looks like:
T2.PKey=T1.FKey(+)
The (+) side of the join, in other words, is on the side of
the non-unique foreign key, not on the primary-key side. This is subtly
unlikely to be correct because of the usual rule that rows from a query should
map one-to-one to some subset of a single table (normally the table at the
single root of the tree). In this case, however, where the join finds a match (or
multiple matches) in T1, in the inner case of the outer join, in other words,
the result maps one-to-one to rows of T1, but in the outer case, where there is
no match in T1, the rows map one-to-one to rows of T2. It is just possible that
in a single query you actually want this sort of heterogeneous result, mapping
partly to T2 and partly to T1, but it is uncommon and unlikely. More likely, the
upward outer join is simply a mistake, functionally.
Figure 24, a filtered upward outer join
This case seems to have two problems: a filter on the outer
side of the join, and an upward outer join. Surprisingly enough, occasionally
two wrongs do make a right! If the
filter on T1 is something like
T1.Fkey IS NULL
which is only true in the outer case of
the outer join:
T2.PKey=T1.FKey(+)
then we are actually discarding the
entire inner case of the join, and just preserving the outer case, the
equivalent of a NOT EXISTS condition on a correlated subquery
of T1. In this case, the query result will map rigorously one-to-one to rows of
T2, only, so the result follows the rule for a single table mapping one-to-one
to all result rows. Of course, if the filter has any effect other than to
discard all the inner-case rows from the join, then this has both the usual
problems with upward outer joins and the usual problems with filtered outer
joins.
Recurring types of query anomalies are easy to recognize and
to describe when the query is shown as a join diagram. This is one of the
principal benefits of diagramming a query after it is written, or even of beginning with a diagram before
writing the SQL. This paper would have been far harder to write, and,
I believe, far harder to understand, if I had tried to describe and generalize
these anomalies purely in SQL form. Join diagrams are often useful for SQL
tuning, but the way they make anomalies instantly obvious is a major
side-benefit of the method, and is reason enough to diagram complex queries that
are otherwise all too likely to contain these insidious defects.
Most join-diagram anomalies point to a mistake in the query,
in the database design, or in the application. If the mistake is in the
application layer, or, even worse, in the design layer, the solution can be
expensive and time-consuming. Consider, though, that if the solution is further
delayed, it is likely to grow even more
expensive to fix later, so do not let cost automatically put these fixes on
hold!