Quarterly SingingSQL Newsletter #3
If you wish to unsubscribe at any time, just drop me a line at dantow@singingsql.com. I’ll likely eventually get around to setting up a proper email group, through some third party, so that eventually people can subscribe and unsubscribe without communicating with me, but this is simplest for a start, I think.
I’m taking a lowest-common-denominator approach to the format of this newsletter, sending it as a straight-ASCII note so anyone at all can access it securely, but if you’d rather read something more formatted, go to OnlineNewsletter03. I’ll also have links scattered throughout for supplementary materials.
Although this material is copyrighted, feel free to forward this newsletter, as is, to any friends who you think might be interested.
If you have any comments, ideas for future newsletter content, or suggestions, feel free to drop me a line, at dantow@singingsql.com – I’m always happy to hear from you!
To see a cumulative listing of all news, including old news from past newsletters, see all news.
To see a cumulative listing of all links, including links already listed in past newsletters, see all links.
I took part in a group discussion on the topic of “Does the Optimizer Need a Hint” as part of the new “Ask the Oracles” series in the latest Northern California Oracle Users Group Journal. Here’s an expanded version of my take on the question.
Some of the longest-running queries you are likely to encounter while tuning application SQL are not application SQL at all, but instead are the queries you put together to learn more about the tuning problem, as you must beat the optimizer at its own game and find a better execution plan than the optimizer found. I’ll lay out some tricks I use regularly in the Oracle environment to greatly speed my own tuning. In some cases, these generalize easily to other environments; in other cases, it is not so simple. If you’d like to write up equivalent tricks for your own non-Oracle environment, I’ll be happy to post them on singingsql.com, attributed to you (with a link to your site, if you have one), and I’ll mention the addendum(s) to this article in my next newsletter. Anyway, here is my take on the recursive problem of how to make the process of making application SQL faster, faster, without wasting too much time tuning the non-application test SQL.
The most important shortcut to avoid spending too much time gathering tuning stats is:
Don’t waste time tuning the wrong SQL!
In Chapter 5 of my book, I discuss at some length the inputs needed to complete a full query diagram. Once you find the right SQL to tune, which I discuss in my first newsletter, completing the query diagram is the start of any solution following my method. The by-the-book stats you need to complete the full query diagram are the rowcount of each table in the query, the rowcount of the join of each inner-joined pair of tables, and the selectivity (the fraction of rows kept) of each filter condition.
Getting all this data precisely could slow the tuning process considerably when the tables are large, so it is nice to have some shortcuts. A great starting point is to know the table sizes at least approximately:
select num_rows from all_tables where table_name = upper('&&1');
This one-liner script, which I put in a file called qcntr.sql (quick-counter) and use constantly, will give a quick picture of the approximate table size as of the last time the table was analyzed, which is usually sufficient. If that script tells me that the table is small (under a million rows is small enough for these purposes), or if stats are missing (which should be rectified), I’ll sometimes take the time to get a perfect, up-to-the-minute rowcount with cntr.sql:
set heading off
set feedback off
set verify off
set timing off
spool tmpcntr.sql
select 'select count(*) from '||'&&1'||';' from dual;
spool off
set timing on
@tmpcntr
This is typically surprisingly fast even for a million-row table, because the cost-based optimizer will usually find a path that reads a whole index, rather than the whole table, to get a perfect rowcount. Even a full table scan of a million-row table is faster than most people think, if nothing else is required. Often, if I must run a moderately long-running query, I’ll leave it running in one window while working out other stats in another window, so the runtime needn’t translate into a waste of your own time.
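For illustration, a session using these two scripts on a hypothetical CUSTOMERS table (the table name and counts are invented) might look like this:

SQL> @qcntr customers

  NUM_ROWS
----------
     49400

SQL> @cntr customers

     50023

Elapsed: 00:00:01.31

The first number is the rowcount recorded at the last analysis; the second is the exact current count from the generated count(*) query.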
The next statistic, the rowcount of the join of each inner-joined pair of tables, is almost never necessary to measure directly. If it were, you’d need an often-long-running query something like:
Select count(*) from Details D, Masters M where D.Fkey_ID = M.PKey_ID;
This will usually be much slower than just a full table scan of the largest table, but it is fortunately almost never necessary. Instead, it is usually an excellent, sufficiently accurate guess that the above query will return the same count as the rowcount of the Details table. You can guarantee that the count of the join will equal the rowcount of the detail table if the foreign key pointing to the master table is guaranteed not-null, and referential integrity is enforced for that foreign key. You can check whether the foreign key is guaranteed not-null with a quick DESCRIBE on the table, and usually referential integrity is perfect, or near enough to perfect for our purposes even if it is not enforced, so where you find a not-null foreign key, you don’t usually need to find a join count. Looking at the worst case, let’s say we find that FKey_ID is allowed to be null, but we don’t know how often it actually is null, and the table is large enough that it is inconvenient to check the full count of NULLs. A quick-and-dirty check just looks for NULLs near the start of the table:
Select rownum from Details where FKey_ID is null;
I’d run this for 10 or 15 seconds (and hit control-C to bail out) and see whether it scrolls rows onto my screen quickly, slowly, or not at all. If I see no rows at all, NULLs are rare or non-existent, even if they are allowed, at least for the oldest rows in the table. (If I had some reason to suspect that NULLs might be more common for new rows, I’d look for some indexed path to sample the newer rows, for example, Order_Date > sysdate-1, if this were an ORDERS table and Order_Date were indexed, or FKey_ID > {Maximum FKey_ID} - 10000, if FKey_ID were indexed and I suspected that FKey_ID grew with time (as most primary and foreign keys do), and I had first run a fast
Select max(FKey_ID) from Details;
.) Usually, being assured that NULLs aren’t too common, I’ll just assume referential integrity and go with the assumption that the join count equals the Details rowcount. If I’m feeling super-cautious, though, I might do a quick look for missing referential integrity as follows:
Select rownum from Details D where D.FKey_ID is not null
and not exists (select null from Masters M where D.FKey_ID = M.PKey_ID);
By letting this run for just a few seconds (and then hitting control-C to bail out), you can see, by how quickly it scrolls (or fails to scroll) numbers onto the screen, whether failed referential integrity is common, rare, or super-rare-to-non-existent in the part of the table you’re sampling, without looking at the whole table. (If it makes more sense to sample recent rows, add a clause to the outer query, as above.)
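Another quick check, instead of sampling, is to ask the data dictionary whether the foreign key is declared NOT NULL and whether a foreign-key constraint is enforced at all. Here is a minimal sketch, assuming the standard Oracle dictionary views, with &&1 standing for the table name and &&2 for the foreign-key column:

select column_name, nullable from all_tab_columns
where table_name = upper('&&1') and column_name = upper('&&2');

select constraint_name, status from all_constraints
where table_name = upper('&&1') and constraint_type = 'R';

NULLABLE = 'N' in the first query means the column is declared not-null; the second query shows whether any referential (foreign-key) constraints on the table are enabled at all.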
If, instead, NULL foreign-key values are common, I’ll usually still assume referential integrity for the non-null values, and I just need to know the frequency of NULLs. If
Select rownum from Details where FKey_ID is null;
returned rows slowly or not at all, I’d just treat NULLs as too rare to change anything. If NULLs looked common, though, I might need a real NULL count. If the table was not too large, I’d usually run
Select count(*) from Details where FKey_ID > 0;
and I’d usually just assume the join count equaled that count of positive FKey_IDs. (Note that this assumes that the key values are numeric and positive, which is a good assumption for most applications.) If I weren’t sure the values were all positive, and FKey_ID were indexed, I could first run:
Select min(Fkey_ID) from Details;
Then
Select count(*) from Details where FKey_ID >= {minimum FKey_ID};
By expressing the queries this way, I can get the count of not-null foreign keys just by hitting the index on the foreign key (a path the optimizer will usually choose), which is normally surprisingly fast even for very large tables.
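To make the arithmetic concrete (with invented numbers): if qcntr.sql reported roughly 30,000,000 Details rows and the indexed count of not-null FKey_IDs came back as 27,000,000, then about 10% of the foreign keys are null, and the join-count estimate to carry into the query diagram would be 27,000,000, not 30,000,000.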
The last statistic we need is the selectivity of each filter condition, at least roughly. The very best trick here is an informed guess! Even if you’re working on an application you’ve never seen before, as long as the table and column names aren’t based on some language you don’t know, you can usually make a pretty fair guess at what the tables represent and what sort of property is stored in each filtered column. (Hint: schema designs that actively hide that sort of information are generally poor and result in hard-to-maintain SQL!) Given such guesses, in useful real-world queries (which do not normally return excessive rowcounts) you can also usually spot a filter that is highly selective, usually the most selective or at worst the second-most-selective filter in the query. If that filter is also indexed, or if it can be rephrased so that it can follow an index, you’re almost home free!
From here, there are a couple of fast paths to an answer; here’s an illustration, using a 4-way join:
Select …
From Orders O, Details D, TableX X, TableY Y
Where O.Customer_ID = :1
And O.Order_ID = D.Order_ID
And D.ColD = 'A'
And O.X_ID = X.X_ID
And X.ColX = 'B'
And O.Y_ID = Y.Y_ID
And Y.ColY = 'C';
We might not guess what ColD, ColX, or ColY represent, or how selective the filters on those columns are, but we’d likely guess that we have lots of customers, so the condition on O.Customer_ID is highly selective, and that is our starting point. If we’re being conservative, we might first get stats supporting our guess that a condition on Customer_ID is highly selective. Here’s a script, qdistr.sql, that I use regularly for that sort of purpose:
select num_distinct from all_tab_columns where column_name =
upper('&&1') and table_name = upper('&&2');
SQL> @qdistr customer_id orders
This would quickly give me the estimated number of distinct Customer_IDs, based on the last table analysis, even if the table is huge. If the table is not large, I might even get more-precise information with distr.sql:
set heading off
set lines 160
set feedback off
set verify off
set timing off
spool tmpdistr.sql
select 'select count(*), count('||'&&1'||'), count(distinct '||'&&1'||
') from '||'&&2'||';' from dual;
spool off
@tmpdistr
set timing on
This also gives me information about how often that column has any non-null value at all, which can be useful to know for columns that are extra selective just because they are so often null, and as a bonus I get a complete rowcount on the table at the same time.
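For example, a hypothetical run against the ORDERS table of the example below (practical only if we’re willing to count the whole table or index) might produce something like this, the three numbers being count(*), count(Customer_ID), and count(distinct Customer_ID), with values invented to match the example:

SQL> @distr customer_id orders

  10000000   10000000      50000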
As a next step, it’s nice to choose a real value for the bind variable :1 to assume for purposes of the rest of the tuning exercise. The best choice is generally one that is less selective than average, but not usually the very worst case. Let’s say that we found 50,000 distinct Customer_IDs, verified an index on that column (or created one, if needed) and ran
Select min(Customer_ID) from Orders;
to find the lowest one, 1001, with a quick index hit. A fast way to choose a good semi-conservative value to tune with would be
Select Customer_ID, count(*) from Orders where Customer_ID < 1001+1000
Group by Customer_ID order by 2;
This should run fast, off just the index on Customer_ID, and the last few rows it returns would be the worst-case customers out of up to the first thousand (and oldest customers tend to be biggest – for most other columns, you’d look at highest values), and you might go with the fifth-worst-case customer you see, or so, for a fairly conservative guess.
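The tail of that output might look something like this (hypothetical values, chosen to be consistent with the example that follows), with the fifth-worst case, Customer_ID 1547, five rows from the end:

CUSTOMER_ID   COUNT(*)
----------- ----------
       1547      20000
       1893      21400
       1210      22750
       1652      24300
       1342      26900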
Let’s say you found from the estimation approaches above that Details had 30,000,000 rows, Orders had 10,000,000 rows, TableX had 2,000,000 rows, and TableY had 1,000,000 rows. Let’s further stipulate that all foreign keys are not-null and have perfect referential integrity, and that the fifth-worst Customer_ID value you found below 2001, which was 1547, returned 20,000 rows. In this case, your query diagram, so far, would be:
D ?
| 3
|
| 1
V
O 0.002
|\
| \
|5 \10
|   \
|1   \1
V     V
X ?   Y ?
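(For reference, the numbers on the diagram come straight from the rowcounts and the sampled filter: the detail join ratio on the link from D down to O is 30,000,000/10,000,000 = 3; on the links from O down to X and Y it is 10,000,000/2,000,000 = 5 and 10,000,000/1,000,000 = 10, respectively; the master join ratios at the lower ends are all 1, since the foreign keys are not-null with perfect referential integrity; and the filter ratio on O is 20,000/10,000,000 = 0.002, from the 20,000 rows found for Customer_ID 1547.)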
The selectivities of the filters on D, X, and Y still matter in our effort to find the best join order, but we want to find those selectivities without full table scans on these large tables, and we suspect that the filters are not very selective, so they likely don’t have indexes, nor should they. We already know that
Select count(*) from Orders where Customer_ID=1547;
will return 20,000. Next, we can use this selective path to sample just 0.2% of the rows of D, and small fractions of the rows of X and Y as follows:
Select rn from
(Select /*+ ordered use_nl(O X) */ rownum rn from Orders O, TableX X
where O.Customer_ID = 1547
and O.X_ID = X.X_ID
and X.ColX = 'B')
where mod(rn,10)=0;
The objective above is just to find out how much further the filter on ColX reduces the rowcount below the 20,000 rows coming from O, but note a couple of tricks I’ve used: First, I’m looking at rownum values, instead of a single count, so I can get results as they come, to give an idea of my rate of progress before getting the final result. For really long-running checks this can be a big help. Second, I’ve arranged an inline view that gives me only every tenth row from the inner query, so I don’t have to wait for potentially as many as 20,000 rows to scroll by, but I still get a result within 10 of the real answer. (I’d choose an even bigger number inside the mod() expression if I expected even more rows.) Third, I’ve gone ahead and added possibly-unneeded hints just so I don’t have to even worry about whether I’m going to get a fast plan to this statistical result. Let’s say that the above query returned 500 rows, with the number 5,000 as the biggest value of rn returned. In this case, we know that the filter X.ColX = 'B' must have discarded 75% of the 20,000 rows coming from Orders, so the filter ratio on X is 0.25. Note that we found this out just by sampling 20,000 rows from the 2-million-row TableX, because we ran a query driving from what we think is the best driving table. (It is possible that, out of the 2 million rows of TableX, the filter has a different selectivity than it has among the rows we’re sampling, but if the filters are statistically correlated, the right filter fraction to use is really the one we’re getting in the way I’m describing, since what we really want to know is how selective the next filter is *after* we apply the earlier filters!) An even faster sample would go as follows:
Select rownum from
(Select X_ID from
(Select rownum rn, O.X_ID from Orders O
where O.Customer_ID = 1547)
where mod(rn,10)=0) O, TableX X
where O.X_ID = X.X_ID
and X.ColX = 'B';
Note that this approach filters out 90% of the 20,000 rows meeting the Customer_ID condition before doing the work of the join to TableX, so this could be almost as fast as just the single-table query to Orders. The drawback would come if the rowcount being joined to the second table were too small, resulting in a poor sample size.
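For example (with invented numbers), if only 2,000 Orders rows met the Customer_ID condition, keeping every tenth row would leave only about 200 rows to join to TableX, and a filter ratio measured from a couple of hundred sampled rows could easily be off by quite a bit.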
Next, let’s find the filter ratio on Y, just sampling all the rows that survive the first two joins:
Select rn from
(Select /*+ ordered use_nl(O X Y) */ rownum rn from Orders O, TableX X, TableY Y
where O.Customer_ID = 1547
and O.X_ID = X.X_ID
and X.ColX = 'B'
and O.Y_ID = Y.Y_ID
and Y.ColY = 'C')
where mod(rn,10)=0;
Let’s say this returns 50 rows, with the highest value of rn returned equal to 500. In this case, we know that the filter on ColY discarded 90% of the 5,000 rows coming from the join of O and X, so the filter ratio on Y is 0.1. Note that we learned this by sampling just 0.5% of the million rows in that table. Now that we know that the filter on Y is better than the filter on X, we know to go with the better join order and run the final query:
Select /*+ ordered use_nl(O X Y D) */ rownum from Orders O, TableY Y, TableX X, Details D
Where O.Customer_ID = 1547
and O.X_ID = X.X_ID
and X.ColX = 'B'
and O.Y_ID = Y.Y_ID
and Y.ColY = 'C'
and O.Order_ID = D.Order_ID
and D.ColD = 'A';
If this came back with, for example, 50 rows, we’d know that the filter ratio on D was roughly 0.03 (the 500 rows surviving the joins and filters on O, Y, and X pick up about three Details rows each, so roughly 1,500 rows reach the ColD filter), and that the current join order (O, Y, X, D) was correct. In fact, the filter ratio on D is irrelevant to the choice of plan unless it turns out to be significantly better than the filter ratio on O, so unless the final query returned just a row or two, we’d stick with the current join order.
If we couldn’t even guess which filter might be selective, or suspected we had no really selective filter at all, we could still employ a sampling strategy such as the one above to progressively improve our guesses. For example, we’d start by checking the number of distinct values for the column in each equality condition, or the size of the range compared to the minimum and maximum values for each range condition. Then, we might formulate an artificial query for tuning purposes, adding an extra selective filter just to make the tests run faster. For example, if the query was
Select …
From Orders O, Details D, TableX X, TableY Y
Where O.ColO = 'D'
And O.Order_ID = D.Order_ID
And D.ColD = 'A'
And O.X_ID = X.X_ID
And X.ColX = 'B'
And O.Y_ID = Y.Y_ID
And Y.ColY = 'C';
We might just choose some small random range of Customer_IDs as an artificial driver to more or less randomly sample roughly a thousandth of the rows (50 of the 50,000 Customer_IDs):
Select …
From Orders O, Details D, TableX X, TableY Y
Where O.ColO = 'D'
And O.Order_ID = D.Order_ID
And D.ColD = 'A'
And O.X_ID = X.X_ID
And X.ColX = 'B'
And O.Y_ID = Y.Y_ID
And Y.ColY = 'C'
And O.Customer_ID between 1001 and 1050;
We’d start by learning how many orders fall in that customer range, then add the filter on ColO, then add one table and its filter at a time to find the sampled filter ratios for each other table, until we had a picture of the whole query and its estimated filter ratios, then strip off the condition on Customer_ID to test the final choice of best plan for the real application query. At each stage, we’d just query for rownum (or every 10th, 100th,… rownum) so we could see early results as they appear, without waiting for the whole query to run.
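As a sketch of the first couple of those steps, under the same assumptions (the artificial range of Customer_IDs, with an index on Customer_ID), the queries might look like:

Select rownum from Orders O
where O.Customer_ID between 1001 and 1050;

Select rownum from Orders O
where O.Customer_ID between 1001 and 1050
and O.ColO = 'D';

Comparing how far the rownum values get in the second query versus the first gives a sampled filter ratio for ColO, and each further join and filter gets added the same way, one table at a time.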
To see a cumulative listing of all research problems (the one below, and the earlier ones from the earlier newsletters), see all research problems. There has been no solution submitted yet for the earlier-published research problems.
If a human being were to execute SQL manually, against a series of tables and indexes that the human could see, he or she would likely choose a good-looking path to the data and start to follow that path. If, however, the human found that the results of following that path violated the assumptions he or she made when choosing that path, and appeared likely to take much longer than originally expected, he or she would probably revisit the question of which execution plan was best and potentially would start over with a better plan before wasting too much time on the bad plan. Does any database do this? Propose a good algorithm for a database that would.
©2006 Dan Tow, All rights reserved