Semi-Annual SingingSQL Newsletter #7
I hope you won’t object to a bit of personal business unrelated to SQL tuning: Many of my friends in SQL tuning enjoy the field because they find SQL tuning, as I do, to be an interesting puzzle. If that means you like puzzles in general, like I do, then you might like my new invention, Suduro, a variant on Sudoku, for those who want a bit of extra challenge beyond the usual Sudoku. If you might be interested, check out my new online Suduro site, and years from now, when Suduro is published in thousands of newspapers daily ;-), you can say you were one of the very first to solve one!
You may have noticed that these “quarterly” newsletters aren’t exactly making it out 4 times/year – I’ve been swamped, so I’m going to try to get these out at least twice a year, returning to a quarterly schedule if time allows – better to have good material twice a year than something hacked together more often, I hope.
If you wish to unsubscribe at any time, just drop me a line at dantow@singingsql.com.
I’m taking a lowest-common-denominator approach to the format of this newsletter, sending it as a straight-ASCII note so anyone at all can access it securely, but if you’d rather read something more formatted, go to OnlineNewsletter07. I’ll also have links scattered throughout for supplementary materials.
Although this material is copyrighted, feel free to forward this newsletter, as is, to any friends who you think might be interested.
If you have any comments, ideas for future newsletter content, or suggestions, feel free to drop me a line, at dantow@singingsql.com – I’m always happy to hear from you!
To see a cumulative listing of all news, including old news from past newsletters, see all news.
To see a cumulative listing of all links, including links already listed in past newsletters, see all links.
See the links above under Old News.
Views, from Forty-Thousand Feet
From my perspective as a SQL tuning consultant, I have to give three cheers for views. I'd estimate that 75% of my time spent tuning really tricky, non-routine SQL tuning problems is consumed dealing with complex SQL built on complex views – these are the problems that can consume hours, even days, of the best experts' time, assuring continued, lucrative employment for professionals like myself. Of course, if you are on the employer's side, looking for efficiency in a development organization, or if you are an overworked developer on a fixed, non-hourly salary, your perspective may be a little different!
SQL is a fabulously flexible query language, but this flexibility is often a curse, spooling out plenty of rope to hang yourself with: a plausible-looking query that returns pure nonsense, or worse, one that appears to return the desired results in a simple test, but fails in a dozen or more corner cases that the test fails to exercise. Such nonsense SQL is easy enough to construct when dealing with a simple, flat join of tables, but it becomes almost a certainty when dealing casually with complex SQL built on top of complex views.
A business application database stores data about all the business entities relevant to that application. Except in the rarest of cases, a query should return rows that represent a single type of entity, mapping rigorously one-to-one to instances of that entity (or to aggregations of that entity, in the case of group-by queries). If the entity is important enough for a query, it is almost certainly represented by a physical table, so the query result, however many tables are in the FROM clause, should map rigorously one-to-one to a subset of rows from a single, physical table (which I'll call the "primary" table, here, but which I call the "root detail table" in my book). All the other tables or views mentioned in the query should simply provide lookup data (data pertaining to the primary table, master data under the primary detail data, stored in master tables since the database is normalized) or data necessary to filter out primary-table rows that don't belong in the desired, queried subset. The actual best path to the data might be complex, but one very simple path (which might not be the fastest) should almost always be possible: reach the rows of the primary table first, all of them if necessary, then reach every single other table (with the possible exception of tables in subquery conditions such as EXISTS and NOT EXISTS) through its primary key, from one or more joins beginning with each row of the primary table. This can be shown as a tree, with the primary table at the root, as I elaborate in my book, SQL Tuning, and it can be expressed easily with a simple, flat query against simple tables.
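For instance, here is a minimal sketch of such a query, with Orders as the primary table and Customers and Regions as master lookup tables (the table and column names here are hypothetical, purely for illustration):

SELECT O.Order_ID, O.Order_Date, C.Customer_Name, R.Region_Name
FROM Orders O,        -- the primary ("root detail") table
     Customers C,     -- master table, reached through its primary key
     Regions R        -- master table above Customers, also reached through its primary key
WHERE O.Order_Date > <last Sunday>    -- filter discarding unwanted Orders rows
  AND C.Customer_ID = O.Customer_ID   -- each Orders row matches at most one Customers row
  AND R.Region_ID = C.Region_ID       -- each Customers row matches at most one Regions row

Each result row maps one-to-one to a row of Orders; Customers and Regions only supply lookup columns and filters, so the join tree has Orders at its root.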
Now, here is the point that gets lost when predefined and inline views come into play: use of views doesn’t significantly expand the universe of queries that make sense, which can be expressed simply with a flat set of joins to simple tables; complex views just make it easier to write nonsense queries that do not belong to the set of queries that make sense, and they make it simpler to obscure the fact that the query is nonsense, and they make it much harder to fix the query when it must finally be fixed, and they make it much more likely that the query will be slow as molasses, and will require the services of a top SQL tuner!
Here are some rules for using views relatively safely:
1. The view-defining query must, itself, be a sensible query, with results rigorously mapping one-to-one to a single physical table, the “primary table”, or to aggregations of rows of that table, in the case of a group-by view. It should be obvious what that primary table is, if necessary through explicit comments built into the view-defining query.
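For example (a sketch, with hypothetical table and column names), a view-defining query that follows this rule might look like this:

CREATE VIEW Recent_Orders_V AS
SELECT O.Order_ID, O.Order_Date, C.Customer_Name
  /* Primary table: Orders - each view row maps one-to-one to an Orders row;
     Customers only supplies lookup columns */
FROM Orders O, Customers C
WHERE C.Customer_ID = O.Customer_ID
  AND O.Order_Date > <some cutoff date>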
When you can't follow these rules and still get the necessary result, this is usually a clue that the schema design is broken, and should be fixed, if possible, before it does more damage and becomes even harder to disentangle.
When you don't follow these rules, but you could, this is surely a clue that you are building a query that will be hard to maintain, and that is all too likely functionally wrong today, or will become functionally wrong as the application and the data evolve. Queries that fail to follow these rules are very likely to provide many hours of profitable employment for SQL tuning specialists like myself, as they will likely perform very poorly, and they are horribly difficult to repair safely. While this is a benefit for SQL tuning consultants paid by the hour, it is not so good for anyone else.
The rules for using views safely and correctly are complicated and hard to use, without a doubt, particularly considering that most people using views are doing so in search of simplicity in coding, not in search of complexity! Casual use of complex views, without regard for the above rules, may seem simple, but it almost invariably creates horribly complex problems in the future. If the rules to use complex views correctly are prohibitively difficult, this is likely a sign that the query, however difficult, will be easier to write and maintain correctly against base tables (or at most against simple views) than against complex views!
If the only objective is to pass a simple functionality test against a small, artificial development database, especially if the data schema is flawed, then a query hacked together with complex views, without regard to the above rules, is often the simplest solution. Most likely, however, such a hacked-together query will fail in current or future production corner cases unanticipated by the simple test cases against the simple, artificial data. These functional failures may be noticed soon and repaired at high cost. Worse still, they may not be noticed for a long time, and the company will simply make ultimately expensive business mistakes for months or years based on the unnoticed functional errors! Frequently, such a hacked-together query will also perform badly against production data volumes, and it will be horribly hard to fix.
By the time such a query needs fixing, to resolve performance problems, to fix corner-case errors, to cope with schema changes, or to modify functionality as the application evolves, the person fixing it faces a horrendously difficult set of questions: What information, exactly, did the person who wrote the original query intend it to return? Does that intended result actually match the actual result, which may be extremely hard to decipher under all those layers of indirection provided by the views? And how may the query be modified to deliver the correct result, which may match neither the intended result (not clear from the query itself) nor the actual result (also hard to decipher)? Frequently, by the time such a query needs to be fixed, it is very hard even to find anyone who understands what the query needs to return to meet the business need – the query itself, however flawed and likely incorrect, is the only readily available "specification" for the rows that the application apparently requires at that point in the application flow of control, and a major project is necessary to work out the correct, undocumented requirement from scratch!
A well-written query, on the other hand, may be much more work to write at first, but it delivers a result that is much more transparent to understand and modify, that will likely perform much better, and that is more likely to match the results actually needed, even in obscure corner cases that might never be tested.
To see a cumulative listing of all research problems (the one below, and the earlier ones from the earlier newsletters), see all research problems. There has been no solution submitted yet for the earlier-published research problems.
Most databases allow definition of intricate constraints on tables. Where a database design includes denormalizations, these ought, ideally, to be formally declared, with constraints that enforce that the denormalizations are followed rigorously. If this is done, does any database actually use these declared constraints to refine the cost-based optimizer's cardinality and cost estimates? Consider an example: We have a multi-org implementation of an order entry system, where each transaction table includes an org_id column. Users belonging to a given organization can only see data having the appropriate org_id, using single-org views on top of each table that automatically add a condition "and tab.org_id = <myorg>" to every query against every transaction table. Thus, a query:
SELECT …
FROM Orders O, Order_Details OD, …
WHERE O.Order_Date > <last Sunday>
uses views Orders and Order_Details, built on top of tables All_Orders and All_Order_Details, making the original query equivalent to
SELECT …
FROM All_Orders O, All_Order_Details OD, …
WHERE O.Order_Date > <last Sunday>
  AND O.Org_ID = <myorg>
  AND OD.Org_ID = <myorg>
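For concreteness, the single-org views in this example might be defined roughly as follows (a sketch, in which <myorg> stands for whatever mechanism the application uses to supply the current user's organization):

CREATE VIEW Orders AS
SELECT * FROM All_Orders
WHERE Org_ID = <myorg>

CREATE VIEW Order_Details AS
SELECT * FROM All_Order_Details
WHERE Org_ID = <myorg>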
Let's say that there are a hundred organizations. This query ought to return, on average, 1% of the orders (and their details) made since last Sunday, based on the two statistically independent filters on the Orders table. Assuming that the query drives from the Orders table, it will immediately pick up the condition filtering Order rows for only the desired organization. These orders, naturally, will join to Order_Details belonging to the same organization, so the view-imposed condition on OD.Org_ID is wholly redundant, and discards no further rows. If a constraint declares OD.Org_ID as a denormalization, inheriting its value from the parent order reached through the join on Order_ID, then the optimizer ought to know that all Order_Details reached through that join will already satisfy the filter on OD.Org_ID, so that filter will have no effect on the estimated cardinality. On the other hand, if the cost-based optimizer fails to take account of declared denormalizations such as this, it will fall back on its usual assumptions, consider the condition on OD.Org_ID to be statistically independent of the other conditions, and calculate that the joins shown, with the filter conditions, will return just 0.01% of the order details of orders placed since last Sunday – a hundred-fold error in the resulting cardinality estimate, with corresponding effects on the cost calculation and on the chances that the optimizer will find the true lowest-cost alternative.
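To make the hundred-fold error concrete, here is a back-of-the-envelope calculation using hypothetical row counts (the numbers are assumptions chosen purely for illustration):

  Assume 100 organizations, 10,000 orders placed since last Sunday (spread
  evenly across organizations), and 10 details per order.

  Actual result:
    Orders passing both filters:   10,000 x 1/100 = 100 orders
    Order_Details joined to them:  100 x 10       = 1,000 detail rows
    (the redundant OD.Org_ID condition discards nothing further)

  Optimizer estimate, treating OD.Org_ID as independent:
    1,000 detail rows x 1/100                     = 10 detail rows

The estimate comes out a hundred times too low, purely because the redundant, denormalization-driven filter on OD.Org_ID was treated as an independent filter.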
Potentially, two sorts of constraint declarations would be valuable here: constraints that are actively enforced by the database, and declared denormalizations that are (hopefully) enforced by the application rather than by the database, but that the optimizer is nevertheless instructed to accept on faith for purposes of calculating and comparing cardinalities and costs.
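As a sketch of what such declarations might look like in this example (the syntax is Oracle-flavored, the constraint names are hypothetical, and other databases offer only rough equivalents):

-- Enforced declaration: a composite foreign key forces each detail row's Org_ID
-- to match its parent order's Org_ID, so the database itself guarantees the
-- denormalization.
ALTER TABLE All_Orders
  ADD CONSTRAINT All_Orders_U UNIQUE (Order_ID, Org_ID);
ALTER TABLE All_Order_Details
  ADD CONSTRAINT All_OD_Org_FK FOREIGN KEY (Order_ID, Org_ID)
  REFERENCES All_Orders (Order_ID, Org_ID);

-- Alternative, unenforced declaration: the application maintains the rule, and
-- the same constraint is created in RELY DISABLE NOVALIDATE state, costing
-- nothing to maintain, while (ideally) the optimizer would still be free to
-- trust it when estimating cardinalities.
ALTER TABLE All_Order_Details
  ADD CONSTRAINT All_OD_Org_FK FOREIGN KEY (Order_ID, Org_ID)
  REFERENCES All_Orders (Order_ID, Org_ID)
  RELY DISABLE NOVALIDATE;

Whether any optimizer actually exploits either sort of declaration to correct the cardinality estimate above is exactly the open question.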