Getting SQL Right the First Try
(Most of the Time!)
Dan Tow
©2007 Dan Tow, All Rights Reserved
A lot has been written about
how to fix poorly-performing SQL, including a 300+-page book, SQL
Tuning, published by O’Reilly, that I wrote myself. In contrast, there is
relatively little material on how to write SQL in the first place that will
likely perform well the first try and that will be easy to fix even if it does
not perform well. This paper will address getting SQL right the first try, even
before the first performance and functionality test.
Inevitably, even following
the best possible practices, we’ll still need to tune some SQL, and the person
doing that tuning work will often not be the person
who first wrote the SQL. Often the person tuning the SQL will not even be able
to find the person who wrote the SQL,
and later efforts at SQL tuning will usually be performed by someone who is
less familiar with the context, the database design, and the requirements of
the SQL than the person who first wrote that SQL (assuming at least that the
person who first wrote the SQL knew as much as he or she should know to do the job right!). Therefore, one of the most
important roles of the initial SQL is to document precisely, correctly, and
clearly what the application requires of the SQL at the point in the flow of
control when the SQL executes. In other words, it is imperative that a query
return precisely the correct rows,
even in the most obscure corner cases, and that any update makes exactly the right changes to the data.
If a query is precisely
functionally correct, a future SQL tuner can take the SQL as a perfect spec for
what rows the application needs at that point in its flow of control, and the
SQL tuner need not even understand why the application needs those rows, nor
even what the rows represent, as long as the SQL tuner can find a way to reach
those rows faster. A SQL statement could even specify table and column names in
a language unknown to the SQL tuner, and the SQL tuner should still be able to
make safe changes to well-written SQL to reach the specified rows faster. On
the other hand, if the original SQL fails even to do the right thing,
functionally, any functionally neutral performance fix still leaves a
functionally broken result, which cannot safely be repaired without finding
someone familiar enough with the detailed functional requirements to understand
that the change proposed is really a correct functional fix. Often, functional
errors in the SQL are at least part of the reason the SQL performs poorly, so
the performance fix is frequently not even possible until we find and fix the
more complicated functional error.
All too often, SQL is
initially written by trial and error, in a process that creates SQL that
returns results that approximately
match the developer’s approximate
understanding of the requirements! The result is SQL that either remains
permanently, subtly wrong,
functionally, or SQL that must later be repaired by a developer who is,
compared to the initial developer, in a poorer position to fully understand the
precise functional requirements of that SQL. Clearly, getting SQL functionally
right in the first place, through precise understanding of the requirements, as well as through thorough testing
(which won’t usually uncover all corner-case defects, however thorough,
unfortunately, especially if the person creating the tests didn’t precisely
understand the requirements), is the most cost-effective approach!
(When I state that the
developer should have a precise understanding of the functional requirements, I
should myself be precise: It is not absolutely necessary to know every column
that will be selected, from the start – it is easy and safe enough to add
columns to the select list, if use of a prototype shows more columns are needed.
Even if a join must later be added, to serve some lookup that is later found to
be necessary, that’s usually not a big problem. However, it is necessary to know precisely which rows of the main entity
being selected are needed, and precisely what the query output rows represent!)
As it happens, clean SQL that
returns precisely the right rows under even the most obscure cases is easier
for the cost-based optimizer to tune well than SQL that has many of the complex
but subtle functional errors that often come from poorly-thought-out SQL, so
clean SQL usually doesn’t require manual tuning. Even when clean SQL does require manual tuning, it’s usually
an easier tuning exercise than the tuning of poorly-written SQL, often just
requiring a new index, or a minor and fairly obvious tweak to the SQL, which is
easy to make, given that the clean SQL is a clear and correct functional spec
for the SQL required at that point in the application flow of control.
I’ll begin with a couple of
first principles that I’ve found repeatedly in my experience, and that I hope
appear at least plausible to the reader:
Note
that the rules above only apply strictly to the initial coder of the
SQL, provided the initial coder does his or her job well. Given well-coded
initial SQL, those who later make performance enhancements to the SQL, where
needed, can take the initial SQL as a trustworthy spec for the rows required at
that point in the program, and they need not understand the business context of
those rows, or even what the tables and columns represent. Of course, if later functional
changes prove necessary, the coder of those changes must understand how to
achieve the correct new functionality.
To understand the database
design well enough to write functionally-correct code likely to perform well
from the start, you should be able to answer a series of questions with
confidence:
It is surprising how often
owners of broken code cannot answer these very basic questions, but it is
hardly a surprise that the result, without this understanding, is broken code!
Many of the rules for writing
clean SQL the first try are vastly easier to express and understand in terms of
join trees, abstract representations of the SQL that I explain in much more
detail in SQL
Tuning. In case you are not already familiar with these, and sadly lack a
copy of the book, I’ll introduce join trees briefly, here.
Consider a query:
SELECT …
FROM Orders O, Order_Details OD, Products P, Customers C,
Shipments S, Addresses A,
Code_Translations ODT, Code_Translations OT
WHERE UPPER(C.Last_Name) LIKE :Last_Name||'%'
AND UPPER(C.First_Name) LIKE
:First_Name||'%'
AND OD.Order_ID = O.Order_ID
AND O.Customer_ID = C.Customer_ID
AND OD.Product_ID =
P.Product_ID(+)
AND OD.Shipment_ID =
S.Shipment_ID(+)
AND S.Address_ID =
A.Address_ID(+)
AND O.Status_Code = OT.Code
AND OT.Code_Type = 'ORDER_STATUS'
AND OD.Status_Code = ODT.Code
AND ODT.Code_Type =
'ORDER_DETAIL_STATUS'
AND O.Order_Date > :Now - 366
ORDER BY …;
The
join tree that abstractly represents this query reflects the following:
The
join tree that represents the above query, then, is as follows:
Figure 1, a join tree
representing the above query
In
this diagram:
•
Each table is a node, represented by its alias.
•
Each join is a link, with (usually downward-pointing) arrows pointing
toward any side of the join that is unique.
•
Midpoint arrows point to optional side of any outer join.
For
example, the table Shipments is represented by the node labeled “S”, and the
outer join “S.Address_ID = A.Address_ID(+)” is represented by the downward-pointing arrow from
S to A, with the midpoint arrow pointing toward the “(+)” side of that join and
the end-point arrow pointing toward A because ADDRESS_ID is unique for the
table Addresses.
Following
these rules to create join diagrams, we find a number of regularities among
most well-written SQL:
•
The query maps to one tree.
•
The tree has one root, exactly one table with no join to its primary key.
•
All joins have downward-pointing arrows (joins unique on one end).
•
Outer joins point down (toward the primary key, that is), with only outer
joins below outer joins.
•
The question that the query answers is basically a question about the
entity represented at the top (root) of the tree (or about aggregations of that
entity).
•
The other tables just provide reference data stored elsewhere for
normalization.
If
we know in advance that good SQL tends to follow these patterns, we can steer
the SQL toward matching these patterns in the first place, or at least to
understand clearly when and why we deviate from these patterns.
Understand the nature and purpose of your joins:
Understand the structure
of the join tree, and how it maps to the desired query result:
When I first wrote the presentation that goes with
this paper, I shared an early draft with a good friend, Fahd Mirza, and he
asked a question I was ashamed I hadn’t thought through before: “Does this mean
that you should create the query diagram before you begin to write the
SQL?!” As a SQL “fixer,” who almost never originates the SQL I work on, I
always produce my diagrams for already-existing SQL, but the question is
brilliant – of course, if you create the diagram first, it will clarify
the requirements at an abstract level, and it will be almost impossible to
produce first-time SQL that isn’t clean and correct! Furthermore, as Fahd
suggested, the diagrams could make an excellent documentation tool. I gladly
acknowledge that the notion of documenting the join tree before writing the
first line of SQL is a radical change in the usual hit-or-miss process of
writing the first draft of a SQL statement, and likely to generate resistance,
but I honestly can’t find a good argument against it, and I find the idea
compelling!
Here’s what the process should look like, if you
start with creating the join tree, before writing the first line of SQL:
If you use a view, the view-defining SQL is part
of your SQL, and your SQL is only right if the whole combination
is right. Understand the structure of the views you use:
Understand how to write
the query direct to the base tables:
Consider this, too: If you
do understand (as you should) the desired functionality, tables, and views well
enough to write the query as a direct query to simple tables, then why don’t you? (Yes, there are good
reasons to use views, but laziness about understanding the schema and desired
functionality are not among these good reasons, and short-term laziness with views
costs effort in the long run!)
Understand the context of
the query and the rows it returns:
Understand a clean path
to the data from the best (most selective) filter:
Make the SQL
transparently understandable and clear:
Know the value of getting
it right in the first place:
It is sheer folly to imagine
that out of ignorance of the database design or of the functional requirements,
useful SQL can emerge. Detailed knowledge at the first stage is the solution,
for there will never be a better, easier time to answer the questions that
require answering to build functionally correct SQL that stands a good chance
to perform well from the start. Later developers will be left guessing what the
true requirements are, if the initial SQL doesn’t rigorously map to a well-understood
and correct functional spec, based on a clear and correct understanding of the
database design. Someone must communicate at least a rough idea to the initial
developer regarding what is required, functionally, and there is no better time
than this initial communication to ask the questions that will turn a vague
spec into a precise and correct spec.
•
All SQL can be built right!
•
All SQL should be built right!
•
SQL that is built following the right principles rarely
needs fixing, and can always be fixed easily when necessary!