Getting SQL Right the First Try

(Most of the Time!)

Dan Tow

A lot has been written about how to fix poorly-performing SQL, including a 300+-page book, SQL Tuning, published by O’Reilly, that I wrote myself. In contrast, there is relatively little material on how to write SQL in the first place that will likely perform well the first try and that will be easy to fix even if it does not perform well. This paper will address getting SQL right the first try, even before the first performance and functionality test.

Inevitably, even following the best possible practices, we’ll still need to tune some SQL, and the person doing that tuning work will often not be the person who first wrote the SQL. Often the person tuning the SQL will not even be able to find the person who wrote the SQL, and later efforts at SQL tuning will usually be performed by someone who is less familiar with the context, the database design, and the requirements of the SQL than the person who first wrote that SQL (assuming at least that the person who first wrote the SQL knew as much as he or she should know to do the job right!). Therefore, one of the most important roles of the initial SQL is to document precisely, correctly, and clearly what the application requires of the SQL at the point in the flow of control when the SQL executes. In other words, it is imperative that a query return precisely the correct rows, even in the most obscure corner cases, and that any update makes exactly the right changes to the data.

If a query is precisely functionally correct, a future SQL tuner can take the SQL as a perfect spec for what rows the application needs at that point in its flow of control, and the SQL tuner need not even understand why the application needs those rows, nor even what the rows represent, as long as the SQL tuner can find a way to reach those rows faster. A SQL statement could even specify table and column names in a language unknown to the SQL tuner, and the SQL tuner should still be able to make safe changes to well-written SQL to reach the specified rows faster. On the other hand, if the original SQL fails even to do the right thing, functionally, any functionally neutral performance fix still leaves a functionally broken result, which cannot safely be repaired without finding someone familiar enough with the detailed functional requirements to understand that the change proposed is really a correct functional fix. Often, functional errors in the SQL are at least part of the reason the SQL performs poorly, so the performance fix is frequently not even possible until we find and fix the more complicated functional error.

All too often, SQL is initially written by trial and error, in a process that creates SQL that returns results that approximately match the developer’s approximate understanding of the requirements! The result is SQL that either remains permanently, subtly wrong, functionally, or SQL that must later be repaired by a developer who is, compared to the initial developer, in a poorer position to fully understand the precise functional requirements of that SQL. Clearly, getting SQL functionally right in the first place, through precise understanding of the requirements, as well as through thorough testing (which won’t usually uncover all corner-case defects, however thorough, unfortunately, especially if the person creating the tests didn’t precisely understand the requirements), is the most cost-effective approach!

(When I state that the developer should have a precise understanding of the functional requirements, I should myself be precise: It is not absolutely necessary to know every column that will be selected, from the start – it is easy and safe enough to add columns to the select list, if use of a prototype shows more columns are needed. Even if a join must later be added, to serve some lookup that is later found to be necessary, that’s usually not a big problem. However, it is necessary to know precisely which rows of the main entity being selected are needed, and precisely what the query output rows represent!)

As it happens, clean SQL that returns precisely the right rows under even the most obscure cases is easier for the cost-based optimizer to tune well than SQL that has many of the complex but subtle functional errors that often come from poorly-thought-out SQL, so clean SQL usually doesn’t require manual tuning. Even when clean SQL does require manual tuning, it’s usually an easier tuning exercise than the tuning of poorly-written SQL, often just requiring a new index, or a minor and fairly obvious tweak to the SQL, which is easy to make, given that the clean SQL is a clear and correct functional spec for the SQL required at that point in the application flow of control.

I’ll begin with a couple of first principles that I’ve found repeatedly in my experience, and that I hope appear at least plausible to the reader:

If you (the initial coder of the SQL) don’t really understand what the tables and/or views represent, the SQL will likely perform poorly,… and even if it doesn’t, it will likely be functionally wrong!
If you (the initial coder of the SQL) don’t really understand what the SQL is supposed to accomplish, it will likely perform poorly,… and even if it didn’t, it will likely be functionally wrong… and even if it isn’t, it will be hard to understand and maintain!

Note that the rules above only apply strictly to the initial coder of the SQL, provided the initial coder does his or her job well. Given well-coded initial SQL, those who later make performance enhancements to the SQL, where needed, can take the initial SQL as a trustworthy spec for the rows required at that point in the program, and they need not understand the business context of those rows, or even what the tables and columns represent. Of course, if later functional changes prove necessary, the coder of those changes must understand how to achieve the correct new functionality.

Code What You Know

To understand the database design well enough to write functionally-correct code likely to perform well from the start, you should be able to answer a series of questions with confidence:

What set of entities does each table represent?
What is the complete primary key to each table?
What set of entities does each view represent?
What is the virtual primary key of each view?
Roughly how many rows in production will there be in each table or view?

It is surprising how often owners of broken code cannot answer these very basic questions, but it is hardly a surprise that the result, without this understanding, is broken code!

Join Trees, a Crash Introduction

Many of the rules for writing clean SQL the first try are vastly easier to express and understand in terms of join trees, abstract representations of the SQL that I explain in much more detail in SQL Tuning. In case you are not already familiar with these, and sadly lack a copy of the book, I’ll introduce join trees briefly, here.

Consider a query:

SELECT …

FROM Orders O, Order_Details OD, Products P, Customers C,

Shipments S, Addresses A, Code_Translations ODT, Code_Translations OT

WHERE UPPER(C.Last_Name) LIKE :Last_Name||'%'

AND UPPER(C.First_Name) LIKE :First_Name||'%'

AND OD.Order_ID = O.Order_ID

AND O.Customer_ID = C.Customer_ID

AND OD.Product_ID = P.Product_ID(+)

AND OD.Shipment_ID = S.Shipment_ID(+)

AND S.Address_ID = A.Address_ID(+)

AND O.Status_Code = OT.Code

AND OT.Code_Type = 'ORDER_STATUS'

AND OD.Status_Code = ODT.Code

AND ODT.Code_Type = 'ORDER_DETAIL_STATUS'

AND O.Order_Date > :Now - 366

ORDER BY …;

The join tree that abstractly represents this query reflects the following:

What tables are being joined, represented as aliases, since the same table can appear multiple times in the FROM clause under different aliases.
What is the nature of each join – which side or sides of the join are unique, and what joins to what.
Which joins are outer joins, and which side of each outer join is the optional side (the side with the “(+)” in old-style Oracle join notation).

The join tree that represents the above query, then, is as follows:

Figure 1, a join tree representing the above query

In this diagram:

• Each table is a node, represented by its alias.

• Each join is a link, with (usually downward-pointing) arrows pointing toward any side of the join that is unique.

• Midpoint arrows point to optional side of any outer join.

For example, the table Shipments is represented by the node labeled “S”, and the outer join “S.Address_ID = A.Address_ID(+)” is represented by the downward-pointing arrow from S to A, with the midpoint arrow pointing toward the “(+)” side of that join and the end-point arrow pointing toward A because ADDRESS_ID is unique for the table Addresses.

Following these rules to create join diagrams, we find a number of regularities among most well-written SQL:

• The query maps to one tree.

• The tree has one root, exactly one table with no join to its primary key.

• All joins have downward-pointing arrows (joins unique on one end).

• Outer joins point down (toward the primary key, that is), with only outer joins below outer joins.

• The question that the query answers is basically a question about the entity represented at the top (root) of the tree (or about aggregations of that entity).

• The other tables just provide reference data stored elsewhere for normalization.

If we know in advance that good SQL tends to follow these patterns, we can steer the SQL toward matching these patterns in the first place, or at least to understand clearly when and why we deviate from these patterns.

Know What You Code

Understand the nature and purpose of your joins:

Are the tables joined in a tree (no loops, no missing joins)? If not, why, exactly?
What is the direction of every join? (Which side or sides are unique?) If any join is many-to-many, why is it, and how can that be right? (Foreign keys should point to primary keys, so a many-to-many join is usually wrong or incomplete!)
What is the purpose of each join? Are you using anything from the joined-to table?

Understand the structure of the join tree, and how it maps to the desired query result:

What is the entity that each result row (or each pre-aggregated result row) should represent? What table or view maps to that entity?
Is the table or the view mapping to the results the (only!) root detail table, the only table not joined to the other tables with its primary key? (Are all joins from the single root node downward-pointing?)

When I first wrote the presentation that goes with this paper, I shared an early draft with a good friend, Fahd Mirza, and he asked a question I was ashamed I hadn’t thought through before: “Does this mean that you should create the query diagram before you begin to write the SQL?!” As a SQL “fixer,” who almost never originates the SQL I work on, I always produce my diagrams for already-existing SQL, but the question is brilliant – of course, if you create the diagram first, it will clarify the requirements at an abstract level, and it will be almost impossible to produce first-time SQL that isn’t clean and correct! Furthermore, as Fahd suggested, the diagrams could make an excellent documentation tool. I gladly acknowledge that the notion of documenting the join tree before writing the first line of SQL is a radical change in the usual hit-or-miss process of writing the first draft of a SQL statement, and likely to generate resistance, but I honestly can’t find a good argument against it, and I find the idea compelling!

Here’s what the process should look like, if you start with creating the join tree, before writing the first line of SQL:

Find the table that represents the same entity that is represented by each row (or each pre-aggregation row, if the query is a group-by) of the query result. Call this the “base” table. Place an alias representing the base table at the top of the join tree, representing the root detail node.
Consider which columns the query requires to filter undesired rows with where-clause conditions. Where these columns do not belong to the base table, you’ll need one or more unique lookups to reach the reference (or master) tables holding the necessary columns. Add these to the diagram, hanging below the base table. For example, if you want to filter a query of Order_Details based on the customer name, you’ll need a many-to-one join from Order_Details to Orders, and another many-to-one join from Orders to Customers, to find the customer name that goes with each order detail. If a lookup has already been added to the query diagram for some earlier purpose, reuse it, rather than including redundant links to the same rows of the same table (reached through the same foreign key).
Consider which columns the query must select, including those columns that are stored some number of joins away from the base table for purposes of normalization. Add unique lookups as needed to reach these normalized columns you must select. Some of these lookups will be “n-away” from the base table, requiring intermediate unique joins to reach the necessary table. In many cases like this, where the columns concerned are not a part of any filter, just columns being selected, outer joins are appropriate, because you’ll want the base-table row whether or not the lookup is reachable. If a lookup has already been added to the query diagram for some earlier purpose, reuse it (and leave the join as an inner join, if that is what it already was), rather than including redundant links to the same rows of the same table (reached through the same foreign key).
As you add nodes, make sure each has a mnemonic alias that is unique within the diagram.
If at any point in this process, you find yourself seemingly needing to perform a join that is not unique on the end pointing away from the base table, then reconsider whether your understanding of the requirements was correct – perhaps the new entity you are joining to is the true base table, the table that maps one-to-one with the desired result rows of the query. If you find yourself with a many-to-many join, or with multiple root detail nodes, you are building an unlikely query where the results fail to map one-to-one with any entity important enough to have its own table, instead mapping to unlikely combinations of entities. This is usually a mistake!
When the join tree is complete, write the FROM clause to include every node in the diagram, with the aliases shown in the diagram. (Include, too, the full unique joins (LEFT OUTER, where needed) in the FROM clause, a join clause for each link in the diagram, if you are using the new SQL join syntax.)
Write the WHERE clause to include all necessary row filters, and to include old-style, full unique joins (outer, where needed) for each link in the diagram, if you are not using the new-style join syntax that shows the joins in the FROM clause. Include aliases before every column reference to be clear for later maintenance, and in case new columns are added to other tables that would make a formerly-unique column name ambiguous later in the evolution of the application.
Add the select list, order-by, group-by clauses as needed. Include aliases before every column reference.
Add subqueries as needed (These should also look like clean trees, if diagrammed separately.)

If you use a view, the view-defining SQL is part of your SQL, and your SQL is only right if the whole combination is right. Understand the structure of the views you use:

Do rows from the view map cleanly, one-to-one, to a single entity? (Are the view-defining join trees clean? That is, if you diagram the view-defining query, is the result a clean join tree?) Is that exactly the right entity for purposes of your query?
Are there elements in the view (joins, subqueries) that are unnecessary or redundant to your query?
Will use of the view still be correct when and if the view changes?

Understand how to write the query direct to the base tables:

Would the complete join diagram, which explodes the view-defining queries into the view-using query, still be a normal, clean join tree? If not, exactly what would be the corner-case behaviors of that query, and would those behaviors be correct?
How could you code the query to simple tables, only? If the resulting simple-table query would be unusual or complex, is that complexity necessary and correct?

Consider this, too: If you do understand (as you should) the desired functionality, tables, and views well enough to write the query as a direct query to simple tables, then why don’t you? (Yes, there are good reasons to use views, but laziness about understanding the schema and desired functionality are not among these good reasons, and short-term laziness with views costs effort in the long run!)

Understand the context of the query and the rows it returns:

Roughly, how many rows should the query return? Is that a useful result set (not too large for human consumption)?
Does the query return any rows or columns that aren’t really needed?
When, and how often, does the query run? How fast does it need to be? Can it perform well in that context, given reasonable per-row performance, or do we have an example of an “unsolvable” tuning problem? (500 rows/second would be reasonable per-row performance, though this varies greatly in both directions depending on the number of tables being joined and how well-cached the tables are.)

Understand a clean path to the data from the best (most selective) filter:

What is the main, most-selective filter that will trim the result set from the unfiltered rowcount of the root detail table to the desired rowcount? If this isn’t obvious to the casual SQL reviewer, make it obvious, with comments.
Is there a clean, indexed path to that that best filter, and to the other big tables, from there, and have you provided it?

Make the SQL transparently understandable and clear:

Is the SQL clear enough that the next developer (who will know far less about that module of the application), or even a SQL tuning specialist (who may know almost nothing about the application) can easily understand it?
Are the unusual features of the SQL, if any, well commented with clear explanations?
Is every column reference preceded by its alias? (This is clearer, and besides, the column name that is unique, today, may not be later in the evolution of the database design!)

Know the value of getting it right in the first place:

Are these rules seemingly impossible to follow on your database? Problems coding clean SQL often point to schema-design flaws – push back on these flaws when you find them.
Does is seem impossible to make the deadline with clean SQL, following these rules? Would it be better to release the wrong SQL on time?! (NO!)

Conclusion

It is sheer folly to imagine that out of ignorance of the database design or of the functional requirements, useful SQL can emerge. Detailed knowledge at the first stage is the solution, for there will never be a better, easier time to answer the questions that require answering to build functionally correct SQL that stands a good chance to perform well from the start. Later developers will be left guessing what the true requirements are, if the initial SQL doesn’t rigorously map to a well-understood and correct functional spec, based on a clear and correct understanding of the database design. Someone must communicate at least a rough idea to the initial developer regarding what is required, functionally, and there is no better time than this initial communication to ask the questions that will turn a vague spec into a precise and correct spec.

• All SQL can be built right!

• All SQL should be built right!

• SQL that is built following the right principles rarely needs fixing, and can always be fixed easily when necessary!