Quarterly SingingSQL Newsletter #1
Although this material is copyrighted, feel free to forward this newsletter, as is, to any friends who you think might be interested.
If you have any comments or suggestions, feel free to drop me a line, at
Range-Keyed Queries: In this O’Reilly-sponsored article, I describe strategies to handle a recurring class of problems where part of the table key is a plain-vanilla ID, but another part of the key specifies a range, most-commonly a date range that specifies the dates for which the record applies. For example, time-dependent historical data is frequently specified with the date on which the record first appeared, and the date on which the record ceased to be current.
Wrong-Errors Bugs: A New Class of Bug: In this O’Reilly-sponsored article, I describe ways to eliminate a new class of bug I’ve described, dubbed the wrong-errors bug, and I propose RDBMS enhancements that would make these bugs much less common and problematic.
When Good Optimizers Make Bad Choices (Sometimes): In this presentation to the Northern California Oracle Users Group (NoCOUG), I described why human SQL tuners can still “beat” cost-based optimizers at the “game” of SQL tuning, by exploiting enormous “unfair” advantages that human tuners have over even the best optimizers. The presentation is accompanied by a White Paper and files with supplementary material, including all of the material necessary to reproduce the examples presented. This is an enhanced version of a presentation I gave earlier, at the 2005 Hotsos Symposium.
With this being the first SingingSQL newsletter, I think I’ll cover the big picture in this first article. The material covered is empirical, not theoretical, reflecting my personal experiences in 16 years of tuning. Your own experiences may differ – add them to mine, and you may approach something closer to the universal tuning-world truth.
I come from a physical-sciences background, and we always had to pay very close attention to the units that went with any given measurement – 36 miles is a very different distance than 36 millimeters, and you cannot even properly compare 36 millimeters to 36 grams – comparing millimeters to grams amounts to comparing apples-and-oranges, so to speak! If you gather together a roomful of performance specialists (as often happens when an enterprise develops a really desperate performance problem!), you are likely to hear a babble of units, “context switches per second, CPU cycles, network round trips, bits per second bandwidth consumed, megabytes memory consumed, system calls per second, pages per second, swap-ins per second, logical reads, physical reads,” as specialists in hardware, UNIX administration, network administration, and database performance all view the problems from their own perspectives, like so many blind men each feeling a piece of an elephant. It can seem impossible to determine what the highest-priority problem is when each narrow specialist expresses the problem in units that mean very little to the other specialists, and when there is no obvious way to compare two “problems” that are each expressed in apparently-incompatible units. The key to resolving this problem is to recognize that there is ultimately only one thing that really matters, in tuning and performance:
MONEY
If the “performance problem” does not (and will not, in the future) somehow cost the business money, then it is no problem at all! If one performance problem costs the business twice as much as a second performance problem, then solving the first problem is twice as important! There are three components that make a performance problem cost money:
Working out the relative costs of the performance problems is very useful because there are typically hundreds, even thousands, of tiny problems and only a handful of significant problems, and you’ll save enormous losses by solving the problems in order of there importance. (Even the few significant problems often differ in importance by large factors. Separate problems take significant time to resolve, typically, so you cannot just solve them a hundred at a time. You can, however, generally expect that any given problem will be solved efficiently once it is the clear focus of a qualified expert’s attention – I’ve never yet found a high-priority problem that couldn’t be solved!) Unfortunately, it is impractical to directly measure the money lost during a series of IT-system delays. We must substitute something we can measure, in the place of money, preferably something that maps very closely to true money costs. There are two sorts of costs to consider:
In the first case, it is imperative to focus on the time spent by the business-critical processes that are causing extraordinary pain to the business. Time spent by non-critical processes, in the first case, can be initially ignored. In the second case, you can treat the time spent in all processes roughly equally, and the objective is to reduce all delays, across the whole system. There is a trade-off involved in the choice between focusing on individual processes or on the whole system:
Whether we focus on problems in a single process (extraordinary costs) or perform whole-system analysis (routine costs), we can adopt a single stand-in for money, a stand-in that we can measure:
TIME
If a problem does not keep an end user, or the whole business, waiting for some result to appear or some significant process to complete for more time than the business can very comfortably afford, then there is no performance problem! If the problem does cause a long wait for a result or a process completion, then how much it matters is simply a function of how long the wait takes, and the business importance of the business process being delayed – the ultimate importance of the problem does not depend on whether the delay is for two minutes of network latency, two minutes of paging, two minutes of physical reads, or of logical reads, or any other operation! The end user is unlikely to know the reason for the wait, and is even less likely to care! Once we recognize that time is our focus, the units problem resolves itself. We may choose seconds, or minutes, or hours, or hours of delay per hour of wall-clock time, but we no longer have a problem of comparing apples and oranges once all performance issues under attack are expressed, and compared, in terms of units of time. (A self-appointed performance expert may persist in describing an alleged problem in units that do not map to time. I recommend that you insist that the problem be expressed in terms of the time lost, at least approximately, and that the translation to time be justified. If the “expert” has no idea how much time a problem costs, or even how to find out, but still insists that the problem is important, I would suggest finding a new expert.)
Almost always, we find that a very modest number of problems are responsible for a large fraction of the delays we need to get rid of. Often, we need repair only a single source of delay to enormously speed a given business-critical process, and even looking at a whole, complex IT system, we can usually eliminate over half the runtime with fewer than 20 fixes, and we can safely ignore all other performance issues as insignificant. If we do not express problems in terms of runtime lost, however, we may need hundreds of guesses before we stumble upon the set of fixes that eliminates most performance-related business losses, or we may, more likely, never find the right problems to focus on.
In my own career, I did not even begin, in 1989, with a focus on database delays, which I knew nothing about measuring. I happened to know a little bit about looking at UNIX system CPU and I/O statistics, so that was naturally my first focus. While searching for, and finding, performance problems in hardware, in operating systems, in local-area and wide-area networks, in special database bottlenecks (such as locking and latching), and in routine database runtime (performing excess physical and logical I/O to execute inefficient SQL), I’ve developed a simple rule of thumb regarding where to look for business-applications performance problems first, how to reduce what could be a very complex search to something that is usually much simpler, focusing, as needed, on either a single process that requires tuning, or on the whole system at once:
A typical breakdown of Top-10 SQL, after tuning, would look roughly like:
Is it possible to store two different outlines for the same SQL under the DEFAULT category, where that SQL is interpreted differently, depending on which SQL it is interpreted from, and to have each schema use the appropriate outline? For example, in one schema, "Orders," which is referenced in a query, might be a simple, local table, while in another schema, "Orders" is a UNION-ALL view of the local table and a remote-linked archive table, and the execution plans of the query must necessarily depend on the schema from which the query runs. If it is impossible to store two outlines under the DEFAULT category, what happens when the user attempts to execute the SQL, with user_stored_outlines=TRUE, from the wrong schema, the schema to which the outline does not apply? Does Oracle simply ignore the outline, or does it generate a plan that is wrong for that schema, even a plan that results in a wrong-rows bug? Categories seem to be a mechanism to let the process point to the appropriate set of outlines, depending on which schema the application attaches to, but this is inconvenient if the application has not been specifically written to know that it needs to ALTER SESSION to set user_stored_outlines=<The Correct Category for That Schema> when it connects to a particular schema. Is there some way to automate an ALTER SESSION command so that it happens automatically (some sort of trigger on connect?) when the schema is set or changed?
Copyright 2006 Dan Tow, All rights reserved