Dan Tow
©2008 Dan Tow, All Rights Reserved
A surprising amount of
tuning work goes into overriding optimizers’ tendency to join tables with hash
joins and sort-merge joins to full table scans. It’s not obvious why
optimizers, which have been refined for years, should still seem to under-favor
the nested-loops alternative, but I have some thoughts on the subject:
Most data in a business
application consists of bundles of information detailing a business event, an
entity that can be located on a timeline in the history of the business. The
most obvious prototype for the business event is an order, but information
related to that event will be tracked in several tables that would join
directly or indirectly to the Orders table, including tables tracking order
details, shipments, invoices, payments, commissions, et cetera. The entire
cluster of information related to the highest-level master table in the
hierarchy will likely be created within a short time window, as one business
event triggers rapid follow-up events, and many rows in master-detail
relationships typically will be created in a single transaction, such as orders
and order details, which would each be meaningless without the other.
When we query business data
for most purposes, we typically need to see data related to quite recent
business events. These recent events may be important to query because we need
to monitor the current health of the business or because the events are still
unfinished, with tasks required to fully complete the business process related
to the order or other business event. Business applications boil down to tools
to trigger the business to make the right actions and decisions, today.
“Ancient history,” which is to say an event older than about a year, is rarely
relevant to business actions and decisions that need to be made today, and even
events a month or two old are not often relevant to day-to-day business
operations.
In a typical heap table, the
data for these recent events resides together at the top of the table, or occasionally
together in older blocks made freshly empty by being purged of data so old that
it is no longer needed. Because the rows representing the most recent data reside
together in a small subset of the table blocks, an execution plan for a query
of these rows tends to find multiple rows needed by the query in each of these
blocks. This self-caching effect is very useful, where blocks needed by the
query are cached early in the query, then reused from the cache multiple times
during the query. Even early in the query, the first time these blocks are
needed, they are likely already to be cached by other recent queries, which also
tend to need blocks from that small subset of the table that holds the recent
rows.
When we query a tiny subset
of an events-data table, and the database can see that the subset is tiny, then
the optimizer has no trouble figuring out that the joins to related-events
tables will also read just a tiny subset of those tables, and nested loops
plans to those tables are an easy choice. There are two cases, though, where the choice stymies the typical optimizer:
1) The filters, in combination, are far more selective than they appear, so the optimizer grossly overestimates the rowcount that will reach the joined-to tables.
2) The rowcount reaching the joined-to tables really is quite large, even for a well-designed query.
Consider each of these
cases, in turn. As savvy human tuners, we should know that a typical business
report provides a small enough set of return data that a human would find the
report useful. Ten or twenty pages is about the outside limit of report length
a human is likely to read from end to end, corresponding to the result of a
maximum of about 1000 rows returned from a query. Even in a data-warehousing
context, really huge query results should be the exception, rather than the
rule. Therefore, if the filters on a query appear unselective, the savvy human
tuner should suspect that either the report is poorly designed, from the
perspective of the user-interface, or the filters, in combination, are actually
more selective than they appear. This happens surprisingly often in business
queries, which tend to look for exceptions that could be described abstractly
as:
“Rows that fail to satisfy X should at least satisfy Y, but we need to see the exceptions to this rule.”
The query for exceptions to
the rule, which would call for business action (either action to fix the
specific exception, or, better still, action to analyze and prevent future
occurrences of the exception), would look like
Select …
where not <X> and not <Y>
and <joins to other tables with supporting data>;
My favorite example of this
sort of query is a query looking for orders that are neither closed nor recent,
since orders should not stay in the “open” state long. Often (assuming
well-designed business processes), exceptions are very rare, but each of the
conditions (not <X> and not <Y>) by itself may be quite common.
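To make this concrete, here is a hedged sketch of such an exception query for the orders example (the orders table and its order_status and order_date columns are illustrative assumptions, not objects from the test scripts later in this paper):

select o.order_id, o.customer_id, o.order_date, o.order_status
from orders o
where o.order_status <> 'CLOSED'              -- not <X>: the order is not closed
and o.order_date < add_months(sysdate, -1)    -- not <Y>: the order is not recent
order by o.order_date;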
Since optimizers almost invariably assume statistical independence for pairs of
conditions like this (a very wrong assumption in these cases!), they frequently
grossly overestimate how many rows they’ll reach at the point where both
conditions can be applied, and this causes a gross overestimate in the rows to
be joined to subsequent tables, case #1, above. In these cases, if the
optimizer understood what the savvy human tuner understands, that the query
probably won’t return over a thousand rows, then the optimizer would tend to
favor nested loops following the join key, rather than an incorrect choice to
hash join or sort-merge join to a full table scan of the later table.
Unfortunately, the optimizer has no preconception that rowcounts
from reasonable queries tend to be small, and it makes the choice to favor a
plan that is only justified if that result turns out to be unreasonably large,
as it simplistically appears it will be.
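Where you suspect this kind of correlation, it is easy to measure how far off the independence assumption is. A minimal sketch, using the same assumed orders columns as above:

select count(*)                                                   total_orders,
       sum(case when order_status <> 'CLOSED' then 1 else 0 end)  not_closed,
       sum(case when order_date < add_months(sysdate, -1)
                then 1 else 0 end)                                not_recent,
       sum(case when order_status <> 'CLOSED'
                 and order_date < add_months(sysdate, -1)
                then 1 else 0 end)                                exceptions
from orders;
-- Independence predicts exceptions ~= total_orders * (not_closed/total_orders)
--                                                  * (not_recent/total_orders);
-- with healthy business processes, the real exceptions count is usually far smaller.

When the measured exceptions count comes back orders of magnitude below the independence-based prediction, that gap is exactly the error that misleads the optimizer.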
Now consider case #2 above,
the case that the rowcount really is quite large, or
at least it is large before some group-by sums up a large result set. (Reports
of over 1000 rows should be rare, but short reports that sum over 1000 rows make perfect sense in the business context!) As
our example, let’s say we want to look at the most recent month of data out of
a 4-year history. Let’s further assume that there are 96 rows of data per
database block in the driving table. Now, consider the join to a related events
table, which we’ll assume has a three-deep index tree on the join key, and
twice as many blocks as the driving table. We won’t count the root block in the
index tree in our calculations, because we assume that root blocks are
perfectly cached.
If we follow nested loops to
the second table, for each row from the driving table, we’ll do 2 logical I/Os
(not counting the root block) to the join-key index, and a logical I/O to the
joined-to table block, or 3 logical I/Os in all per
joined-from row. Since we are reading 1/48th of the joined-from
table, but have 96 rows/block in that table, we’re reading twice as many rows
as we have blocks in that table, which is the same number of rows as the number
of blocks in the joined-to table. Since we counted 3 logical I/Os (not counting
the root index blocks) per joined-from row, we count 3 times as many logical
I/Os to reach the joined-to table by nested loops as the total block count in that same table! All the optimizer has to do
to favor a hash join or a sort-merge join to a full table scan of that
joined-to table is to decide that reading the table once in a single pass is
better than reading three times as
many blocks one block at a time, with logical I/Os driving through a join-key index,
in nested loops! Unquestionably, we should expect a higher logical I/O count
with the nested-loops plan here, so does that mean the nested-loops plan is
inferior?
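To recap the arithmetic compactly, writing B1 for the driving-table block count and B2 = 2*B1 for the joined-to table's block count (the assumptions stated above):

\[ \text{rows driven} = \frac{B_1}{48}\times 96 = 2B_1 = B_2, \qquad \text{LIO}_{\text{nested loops}} \approx 3\times\text{rows driven} = 3B_2 . \]

So the nested-loops plan is charged roughly three logical I/Os for every block of the joined-to table, while the full-scan alternative reads each of those blocks just once.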
Assume, first, that all
blocks are cached. There is a CPU overhead simply to perform a logical I/O, and
the nested-loops plan will surely cost more CPU for its logical I/Os. Once we have performed the logical I/O, however, there
is also CPU overhead for what we do
with the block. In the nested-loops plan, we read a small fraction of the block.
The runtime for both the logical I/O and reading this small fraction is
typically under 10 microseconds. Unless
the rowcounts reached are truly huge, or the
join-order is inefficient, reaching far more rows than the rowcount
that satisfies all the filters, the CPU costs for these nested loops are
insignificant! (With bad join orders,
hash joins to full table scans are much more likely to be a significant “win” than with correct
join orders!) Consider the blocks we read for the join alternative that reaches
the joined-to table with a full table scan, however: For these blocks, the
database must view the entire block, every row in the block, and this typically
takes much longer than the 10 microseconds per block we need for the nested-loops alternative. In all, CPU efficiency shows a trade-off – more CPU just for the individual logical I/Os for the nested-loops alternative, here, but less CPU for the work done inside each
block, and the nested-loops plan is better than it appears, from just a
CPU-consumption perspective.
Now assume that the blocks
are not all so well cached. This is much more realistic, if these tables are
big enough that we really need to worry about the tuning problem. The full table scan encounters blocks (or, in the case of Oracle, multi-block
groups of blocks, typically 8-block
groups of 8K blocks) from the entire table, including the majority of the table
that qualifies as “ancient history.” These old blocks will be extremely poorly
cached, so physical I/O will be high for the non-nested-loops alternative.
Consider the nested-loops alternative, though: The joined-to key values and
joined-to related-events rows will tend to be roughly one month old, or less,
just like the rows in the driving table. There are likely no more than a few
hundred index blocks, at the most, holding the necessary recent join-key data.
These are likely well-cached before the query even starts, but even if they
aren’t they will be read in and cached (and re-used many times) for the
remainder of the query during the first couple of seconds, likely. The table
blocks for the most-recent month of the related-events table will be slightly
less-well cached, at first, but even these blocks will tend to be read in
roughly the same order they are stored, at most a physical-I/O count equal to
about 1/48th of the blocks in the table. Even this physical I/O
count exaggerates the work on disk, because typically read-ahead performed in
the disk subsystem will act like a multi-block read, reaching just the blocks
needed next before they are needed, and caching those blocks in the disk
subsystem memory so that most of what the database thinks is a physical I/O request turns out actually to be a
super-fast read from the disk-subsystem cache. In all, the expected time for
the physical I/Os to follow the nested-loops alternative is vastly better than
the non-nested-loops alternatives, in this example, and the time-savings is
overwhelmingly more important than any potential cost for extra logical I/Os.
Time-related data tends to
cluster together in tables, and time-related data tends to join to other time-related data in
well-clustered ways, as well. I refer to this as natural data clustering. From the physical I/O perspective, two
tables that have joined rows generally created at about (or exactly) the same
time act almost as if their rows were stored together in the same blocks – the
number of physical I/Os and physical I/O time
(which tends to dominate in well-tuned cases like this) necessary to read
joined rows grouped well in creation time is roughly the same as it would be if
the tables were reorganized into Oracle multi-table clusters, even though no
one has lifted a finger to formally cluster the tables. The data read from
these tables are naturally clustered
as an automatic consequence of master and detail event-related rows generally
being created at about the same time, and as an automatic consequence of most
applications queries reaching mainly recent data at the top of the heap tables.
Consider a join of the most
recent 100 blocks of Orders rows with the most recent 200 blocks of Order_Details: However high the logical I/O for a
nested-loops plan might be, the query will hit just 300 distinct table blocks
and perhaps a dozen or so join-key blocks (which were likely cached, anyway),
and even a pessimist shouldn’t expect more than 300 physical I/Os. If we
reasoned that Orders and Order_Details are invariably
joined to one another, and belong in a two-table Oracle cluster, the very same
rows would fit in roughly the last 300 blocks of such a cluster, for no
significant net savings in physical I/O! In precisely the scenarios where
physical multi-table clusters look attractive (rows of related tables are
predictably created together), the savings in physical I/O turn out to be trivial
or non-existent! There is a large
savings in logical I/O, but in well-designed queries like this, this matters
little. The biggest difference between the clustered and non-clustered example
is in the optimizer’s (mistaken) estimate, in cases like this – the optimizer
will correctly estimate a low cost for the physically clustered case, but it
will (incorrectly) estimate a high cost for the case of the nested-loops join
between single tables, although the clustering factor in Oracle’s data dictionary will enable the optimizer to correctly estimate a low cost for the read of the driving table’s well-clustered blocks.
Is there any case where the
optimizer is right to be pessimistic? It turns out that there is such a case –
if the driving condition is not correlated with row-creation time (or if the
table rows have become scrambled with respect to row-creation time, see Killing Cache Efficiency with
Parallel Table Rebuilds), then nested loops may need nearly as many physical
I/Os as logical I/Os. For example, if we query all the orders for a particular
customer over all time, the driving-table rows will be scattered, and so will
the joined-to Order_Details. Consider, though, that
well-designed application queries should rarely look like this – such a query
will get progressively slower as the tables accumulate more and more history,
and why would we want to see “ancient” records of a customer’s orders, anyway?
Most queries should reach some time-correlated condition (often combined with a non-time-correlated
condition, such as customer ID) in the first multi-row table read (which may be
preceded by one or more single-row reads, such as a read of a single customer
record), which then drives to related event-type records created around the
same time, using nested loops.
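A hedged sketch of such a query (illustrative table and column names, with a bind variable for the customer): the time-correlated condition combines with the customer condition on the driving table, and the related detail rows are then reached by nested loops through the join key:

select /*+ use_nl(o od) */
       o.order_id, o.order_date, od.line_number, od.quantity
from orders o, order_details od
where o.customer_id = :customer_id              -- non-time-correlated condition
and o.order_date >= add_months(sysdate, -1)     -- time-correlated condition
and od.order_id = o.order_id;                   -- join key followed by nested loops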
I want to demonstrate the
point physically with specific tables, to place some numbers behind the
abstract theoretical argument:
Consider three tables, each
with one-million rows, in a three-way one-to-one relationship, two of them
maximally co-clustered, with the third maximally unclustered
with respect to the other two. (If you wish to make this concrete in your mind
with real applications tables, think of the first two tables as an Orders table
built into the generic application, and an orders-extension table added as a
customization, to allow extended information on every order needed by the
specific application site, but not anticipated in the original
generic-application design. The third table is trickier, since almost any real
example of a table joined one-to-one to an events table would naturally
co-cluster with that table, since the rows of one-to-one tables almost have to
be created at the same time. The closest example, though, would be to imagine
that the company rarely sells to the same customer twice, making customer data
almost one-to-one with orders data, but at some very recent date, the business
has physically rebuilt the customer table so that rows are stored in
alphabetical order by customer name, not in the order the orders and their
customers were entered at all. (Something like this would also happen on a
single-table cluster clustered on a non-time-related column, or on an
index-organized table arranged on a non-time-related column.))
Here are some reproducible
table-creation scripts for demonstration purposes, for tests I ran on 10g, with
8KB blocks (I got similar results on 9i.):
create table dtow_test_10r(a number);
insert into dtow_test_10r values(0);
insert into dtow_test_10r values(1);
insert into dtow_test_10r values(2);
insert into dtow_test_10r values(3);
insert into dtow_test_10r values(4);
insert into dtow_test_10r values(5);
insert into dtow_test_10r values(6);
insert into dtow_test_10r values(7);
insert into dtow_test_10r values(8);
insert into dtow_test_10r values(9);
commit;
CREATE TABLE dtow_test_1000000r1
 (pkey_id      NUMBER(18),
  recent_flag  CHAR(1) NOT NULL,
  val          VARCHAR2(80),
  CONSTRAINT dtow_test_1000000r1_u1 PRIMARY KEY (pkey_id));

create index dtow_test_1000000r1_n1 on dtow_test_1000000r1(recent_flag);

insert into dtow_test_1000000r1
select /*+ noparallel ORDERED */
       t1.a*100000+t2.a*10000+t3.a*1000+t4.a*100+t5.a*10+t6.a,
       decode(t1.a*10+t2.a,99,'Y','N'),
       '81234567897123456789612345678951234567894123456789312345678921234567891123456789'
from dtow_test_10r t6,
     dtow_test_10r t5,
     dtow_test_10r t4,
     dtow_test_10r t3,
     dtow_test_10r t2,
     dtow_test_10r t1;
commit;

ANALYZE TABLE dtow_test_1000000r1 COMPUTE STATISTICS;

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ('PERF11I','DTOW_TEST_1000000R1',
    METHOD_OPT => 'FOR COLUMNS SIZE 254 RECENT_FLAG');
END;
/
CREATE TABLE dtow_test_1000000r2
 (pkey_id  CONSTRAINT fk_r1 REFERENCES dtow_test_1000000r1(pkey_id),
  val2     VARCHAR2(80),
  CONSTRAINT dtow_test_1000000r2_u1 PRIMARY KEY (pkey_id));

insert into dtow_test_1000000r2
select /*+ noparallel ORDERED */
       t1.a*100000+t2.a*10000+t3.a*1000+t4.a*100+t5.a*10+t6.a,
       '81234567897123456789612345678951234567894123456789312345678921234567891123456789'
from dtow_test_10r t6,
     dtow_test_10r t5,
     dtow_test_10r t4,
     dtow_test_10r t3,
     dtow_test_10r t2,
     dtow_test_10r t1;
commit;

ANALYZE TABLE dtow_test_1000000r2 COMPUTE STATISTICS;
CREATE TABLE dtow_test_1000000r3
 (pkey_id      NUMBER(18),
  recent_flag  CHAR(1),
  val          VARCHAR2(80),
  CONSTRAINT dtow_test_1000000r3_u1 PRIMARY KEY (pkey_id));

create index dtow_test_1000000r3_n1 on dtow_test_1000000r3(recent_flag);

insert into dtow_test_1000000r3
select /*+ noparallel ORDERED */
       t1.a*100000+t2.a*10000+t3.a*1000+t4.a*100+t5.a*10+t6.a,
       decode(t1.a*10+t2.a,99,'Y','N'),
       '81234567897123456789612345678951234567894123456789312345678921234567891123456789'
from dtow_test_10r t1,
     dtow_test_10r t2,
     dtow_test_10r t3,
     dtow_test_10r t4,
     dtow_test_10r t5,
     dtow_test_10r t6;
commit;

ANALYZE TABLE dtow_test_1000000r3 COMPUTE STATISTICS;

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ('PERF11I','DTOW_TEST_1000000R3',
    METHOD_OPT => 'FOR COLUMNS SIZE 254 RECENT_FLAG');
END;
/
-- Set both of these parameters to the defaults:
alter session set optimizer_index_caching=0;
alter session set optimizer_index_cost_adj=100;
Here is some information
about the optimizer’s perspective on these tables:
SQL> select blocks, table_name from dba_tables
  2  where table_name LIKE 'DTOW_TEST_1000000R%';

    BLOCKS TABLE_NAME
---------- ------------------------------
     12822 DTOW_TEST_1000000R1
     12657 DTOW_TEST_1000000R2
     12822 DTOW_TEST_1000000R3
SQL> select LEAF_BLOCKS, BLEVEL, CLUSTERING_FACTOR, INDEX_NAME
  2  from dba_indexes where TABLE_NAME LIKE 'DTOW_TEST_1000000R%'
  3  order by INDEX_NAME;

LEAF_BLOCKS     BLEVEL CLUSTERING_FACTOR INDEX_NAME
----------- ---------- ----------------- ------------------------------
       1815          2             12820 DTOW_TEST_1000000R1_N1
       1879          2             12819 DTOW_TEST_1000000R1_U1
       1879          2             12657 DTOW_TEST_1000000R2_U1
       3228          2             22819 DTOW_TEST_1000000R3_N1
       2059          2           1000000 DTOW_TEST_1000000R3_U1
For space reasons, I’ll only
briefly mention Oracle’s handling of single-table queries of these tables: The
clustering factors enable Oracle to have a generally good picture of the true
cost of these queries, in terms of the actual physical I/O we should expect,
although the estimate falls down somewhat, even with a histogram, on handling
of a skewed data distribution such as for the RECENT_FLAG in DTOW_TEST_1000000R3.
Generally, though, self-caching of blocks hit early in the single-table query
that will be reused later is well handled by the optimizer’s apparent use of
CLUSTERING_FACTOR.
Self-caching in the
joined-to table is another matter, however! Compare a simple co-clustered join of the first two tables, with and without dynamic sampling, to the similar but wholly non-co-clustered join of the first and third tables. These three queries are stored in j1.sql, j2.sql, and j3.sql, and are probed by my script exq8.sql (which you can find, along with some other useful scripts used here, on my site):
SQL> @j1
 1  select /*+ ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
 2         max(t2.val2)
 3  from dtow_test_1000000r1 t1, dtow_test_1000000r2 t2
 4  where t1.pkey_id=t2.pkey_id
 5* and t1.recent_flag = 'Y'

1 SELECT STATEMENT  c=20337 r=1
2   SORT AGGREGATE  c=_ r=1
3     NESTED LOOPS  c=20337 r=10028
4       TABLE ACCESS BY INDEX ROWID 1*DTOW_TEST_1000000R1  c=150 r=10058
5         INDEX RANGE SCAN DTOW_TEST_1000000R1_N1  c=21 r=10058
4       TABLE ACCESS BY INDEX ROWID 2*DTOW_TEST_1000000R2  c=2 r=1
5         INDEX UNIQUE SCAN DTOW_TEST_1000000R2_U1  c=1 r=1

SQL> @j2
 1  select /*+ dynamic_sampling(t1 10) dynamic_sampling(t2 10)
 2             ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
 3         max(t2.val2)
 4  from dtow_test_1000000r1 t1, dtow_test_1000000r2 t2
 5  where t1.pkey_id=t2.pkey_id
 6* and t1.recent_flag = 'Y'

1 SELECT STATEMENT  c=20221 r=1
2   SORT AGGREGATE  c=_ r=1
3     NESTED LOOPS  c=20221 r=9971
4       TABLE ACCESS BY INDEX ROWID 1*DTOW_TEST_1000000R1  c=150 r=10000
5         INDEX RANGE SCAN DTOW_TEST_1000000R1_N1  c=21 r=10058
4       TABLE ACCESS BY INDEX ROWID 2*DTOW_TEST_1000000R2  c=2 r=1
5         INDEX UNIQUE SCAN DTOW_TEST_1000000R2_U1  c=1 r=1

SQL> @j3
 1  select /*+ ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
 2         max(t2.val)
 3  from dtow_test_1000000r1 t1, dtow_test_1000000r3 t2
 4  where t1.pkey_id=t2.pkey_id
 5* and t1.recent_flag = 'Y'

1 SELECT STATEMENT  c=20337 r=1
2   SORT AGGREGATE  c=_ r=1
3     NESTED LOOPS  c=20337 r=10058
4       TABLE ACCESS BY INDEX ROWID 1*DTOW_TEST_1000000R1  c=150 r=10058
5         INDEX RANGE SCAN DTOW_TEST_1000000R1_N1  c=21 r=10058
4       TABLE ACCESS BY INDEX ROWID 2*DTOW_TEST_1000000R3  c=2 r=1
5         INDEX UNIQUE SCAN DTOW_TEST_1000000R3_U1  c=1 r=1
What is notable about these three execution plans, and about the costs the optimizer estimates for each step, is how similar they are – evidently the optimizer, even with dynamic sampling at the highest level, sees no difference between the optimally co-clustered case and the wholly non-clustered case. In both cases, almost every logical I/O “counts” in the cost calculation, as if logical I/Os had the same runtime cost whether they were
self-cached early in the query or not. In the co-clustered case, “cost” exceeds
expected physical I/O by a factor of about 75, even assuming no caching at all
from other, previously-run SQL. Of course, Oracle has never claimed that its
cost function estimates physical I/O. Note, though, that where physical I/O dominates runtime, the cost function does predict physical I/O fairly well in the hash-join case but overestimates it by a factor of about 75 in the nested-loops co-clustered case, and this creates a striking tendency for the optimizer to favor hash-join alternatives that will run far longer than the nested-loops option!
Therefore, it isn’t surprising that the hints shown are needed to get the nested-loops plans – left to its own devices, the optimizer will choose a hash join to a full table scan of the second table, and will estimate the cost of that plan to be lower by a factor of 5(!). Here, though, its “cost” calculation is largely based on a count of multi-block I/Os to perform that full table scan, I/Os that likely will be physical, so its estimate of the number of likely physical I/Os will be just about right (not seeing the factor-of-75 mismatch, that is):
 1  select max(t2.val2)
 2  from dtow_test_1000000r1 t1, dtow_test_1000000r2 t2
 3  where t1.pkey_id=t2.pkey_id
 4* and t1.recent_flag = 'Y'

1 SELECT STATEMENT  c=3737 r=1
2   SORT AGGREGATE  c=_ r=1
3     HASH JOIN  c=3737 r=10028
4       TABLE ACCESS BY INDEX ROWID 1*DTOW_TEST_1000000R1  c=150 r=10058
5         INDEX RANGE SCAN DTOW_TEST_1000000R1_N1  c=21 r=10058
4       TABLE ACCESS FULL 2*DTOW_TEST_1000000R2  c=3542 r=997051
How do these cost estimates compare with the reality of actual runtimes for these cases, where there are no rows pre-cached by previous queries? To measure the runtimes and I/O statistics, with long waits prior to each test to clear the cache, I ran the following script:
alter session set optimizer_index_caching=0;
alter session set optimizer_index_cost_adj=100;
set heading off
@mysid10
select /*+ ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
max(t2.val2)
from dtow_test_1000000r1 t1,
dtow_test_1000000r2 t2
where t1.pkey_id=t2.pkey_id
and t1.recent_flag = 'Y';
@reads10
select /*+ ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
max(t2.val)
from dtow_test_1000000r1 t1,
dtow_test_1000000r3 t2
where t1.pkey_id=t2.pkey_id
and t1.recent_flag = 'Y';
@reads10
--Wait a while for the cache to clear out.
<pause here, at least an
hour, before proceeding with the rest of the test.>
select max(t2.val2)
from dtow_test_1000000r1 t1,
dtow_test_1000000r2 t2
where t1.pkey_id=t2.pkey_id
and t1.recent_flag = 'Y';
@reads10
select max(t2.val)
from dtow_test_1000000r1 t1,
dtow_test_1000000r3 t2
where t1.pkey_id=t2.pkey_id
and t1.recent_flag = 'Y';
@reads10
With the results:

330
13756 7886
Elapsed: 00:00:00.01

81234567897123456789612345678951234567894123456789312345678921234567891123456789
Elapsed: 00:00:00.74
Logical Reads = 30302 Physical Reads = 314 MB PIO = 0
Elapsed: 00:00:00.02

81234567897123456789612345678951234567894123456789312345678921234567891123456789
Elapsed: 00:00:49.17
Logical Reads = 30223 Physical Reads = 10029 MB PIO = 0
Elapsed: 00:00:00.02

<pause here>

81234567897123456789612345678951234567894123456789312345678921234567891123456789
Elapsed: 00:00:06.65
Logical Reads = 12972 Physical Reads = 1751 MB PIO = 1583
Elapsed: 00:00:00.02

81234567897123456789612345678951234567894123456789312345678921234567891123456789
Elapsed: 00:00:06.75
Logical Reads = 12899 Physical Reads = 1592 MB PIO = 1584
Elapsed: 00:00:00.02
As
you can see, the first query, joining perfectly co-clustered heap tables DTOW_TEST_1000000R1
and DTOW_TEST_1000000R2, runs in just 0.74 seconds, with just 314 physical
I/Os, which it turns out is just about what we’d expect hitting a
well-clustered 1% of each of these tables’ blocks, and a handful of index
blocks. In the second query, on the other hand, we’d expect the optimizer’s
estimate of the cost of hitting the second table to be right, since these
tables’ matching rows are laid out completely differently, with clustered rows
in one not being the least bit clustered in the other. Here, we see over 10,000
physical I/Os, even though the first query has just cached all the blocks
needed from the driving table and index, just what we’d expect for the 10000
row-lookups in the second table, matching well the CBO’s
estimate of the cost of reaching that table. Even here, though, the number of
index blocks physically read for the join-key index must surely have been lower
than the CBO’s estimated index cost of about 10,000,
since the entire index has just about
1900 blocks, all of which will end up cached early in the query. As expected,
this second query runs many times slower than the first, in 49 seconds,
although the optimizer estimated its cost as essentially identical to the first!
The
third query gets the plan the optimizer wants for the first query when the hints are taken away: a hash join to a full table scan of
DTOW_TEST_1000000R2. The multi-block physical I/O count is just what we’d
expect for a table with the given number of table blocks, and the query runs
many times (9 times) slower than query one, even though the optimizer gave it a
cost that was over 5 times lower than
the calculated cost of the first plan! The net error in relative runtimes
compared to relative calculated costs is 45! The fourth query, though, compared
to the second (the same query with hints), is
faster, by about the same factor that the optimizer’s estimated costs predict,
showing that when the joined tables are not
co-clustered, and blocks are not pre-cached, the cost function can be
reasonably predictive of relative runtimes. (I find similar results on 9i.)
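The mysid10 and reads10 probe scripts used above are among the scripts on my site and are not reproduced here. A rough, assumed equivalent of the reads10 probe, using only the standard v$mystat and v$statname views, would look something like this (the real script evidently also reports per-query deltas and multi-block physical I/O; this sketch just dumps the session’s cumulative counters):

select sn.name, st.value
from v$mystat st, v$statname sn
where st.statistic# = sn.statistic#
and sn.name in ('session logical reads', 'physical reads');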
Oracle’s
optimizer must make certain assumptions to optimize efficiently with a
manageable set of statistics. For example, Oracle maintains data about the
distribution of data for each table column, and uses these data to predict the
selectivity of any given condition on any single column. However, Oracle quite
reasonably has no data on the combined selectivities
of multiple conditions on multiple columns, because this would require an
enormously detailed analysis and a huge volume of correlated distribution
statistics. For example, if we query for an open order for a given customer,
Oracle (assuming a histogram has been generated) has a good idea of the number
of orders with Open_Flag=’Y’, and a good idea of the
average number of orders for a randomly chosen customer, but no direct data
pertaining to the combination of these conditions. Oracle applies a standard
statistical assumption, in the face of this difficulty – Oracle assumes that
the two conditions are statistically independent, that is, that the probability
of the two conditions being true is the product of the probabilities of each
condition being true. Most often, this assumption is just fine, and even in the
cases where it turns out to be dramatically wrong (as it sometimes is), it is
easy to see that Oracle has little alternative, unless it gathers data at the
time of the query parse, specifically for that parse. In fact, Oracle does gather such data at parse time, when we choose high levels of dynamic_sampling – at the cost of some data sampling prior to the parse, it can find the correlated frequency of multiple conditions on the same table being true.
It
is natural to assume that co-clustering is a problem similar to correlated
distributions – just one of those things the optimizer can’t be expected to
know, given its limited data. However, I argue that this is not the case! If you look at the block
counts, above, for DTOW_TEST_1000000R1 and DTOW_TEST_1000000R2, and the clustering_factors for each of the three indexes on those
tables, you find that the co-clustering is provable with data the optimizer already has - DTOW_TEST_1000000R1
clusters perfectly on Recent_Flag, and those
perfectly clustered blocks cluster equally well to PKey_ID,
on that table, so Recent_Flag must belong to a narrow
range of PKey_ID values. That narrow range in turn
will join to the very same narrow range of PKey_ID
values in DTOW_TEST_1000000R2, which is in turn clustered perfectly to the
narrowest possible range of blocks in that table, all steps in this chain of
logic being verifiable with clustering_factors known to
the optimizer! One might object that the situation is artificial – a function
of the artificial way I generated these tables. However, nothing could be
further from the truth! In fact, excellent co-clustering of large transaction
tables is the rule, not the
exception, because most large master-detail table pairs and one-to-one table pairs have matching rows on both sides of the join created at almost
exactly the same time! (Consider the case of Orders and Order_Details,
discussed above, for a concrete example.) Consider the following typical, predictable features of a well-chosen
nested-loops plan joining two large transaction tables:
1) The driving index likely takes advantage of at least
one column that correlates well with recent rows, since the business
application is unlikely to often query ancient history, and any condition that
fails to correlate with the recentness of data will gradually become
unselective, as enough data accumulate. This means that we most often drive
from well-clustered indexes, if the SQL and indexes are well designed. The
optimizer’s data show it how well clustered the index is, so it likely will
accurately assess the number of blocks it will reach in the first table, and
the number of rows it will find in that table.
2) In reaching the most-recent blocks of the first
table, through the most-recent blocks of the driving index, the optimizer will
find blocks that are cached far better than average, since these most-recent
blocks are also of interest to other portions of the application and other
users. The optimizer (with default system parameters) does not appear to take
into account this tendency to find well-cached blocks with correctly chosen
indexed access when comparing costs with, for example, a full table scan
alternative.
3) In reading the well-clustered most recent blocks of
the driving table, where these aren’t already cached in the database cache, the
disk subsystem will likely perform read-ahead of the nearby blocks when it must
perform disk reads. Because the plan has a high probability of requesting those
next blocks very soon, as it performs its clustered range scan, subsequent
“physical reads” that Oracle requests are likely to be satisfied from the disk
subsystem’s read-ahead cache, meaning that the average time to perform these
“physical” single-block read requests by Oracle is far less than we’d expect
for true physical reads. This has the effect of making a clustered series of
single-block reads perform much more like a series of multi-block reads than
the cost function accounts for. The cost function expects a perfectly clustered
range scan of a whole table to take about 8 times longer than a full table
scan, for example, where the multi-block reads read 8 blocks at a time, but the
actual performance of this range scan is far better than that, owing to typical
read-ahead going on outside of Oracle’s control.
4) In the nested loop, the series of index scans on the
well co-clustered join key will almost always hit the very same index blocks
that Oracle reached for the row before. Even in cases where they do not hit
blocks recently reached earlier by the same query, these index blocks are very
likely to be cached by other applications or by read-ahead in the disk
subsystem. Although logical I/Os to the index are not free, and can have
significant cost where rowcounts reached are high,
these CPU-time costs are frequently minor in well-optimized queries to large
tables, compared to the physical-I/O-time costs. The optimizer cost function,
on the other hand, takes poor account of this.
5) In the nested loop, the series of table access by
index rowid steps to reach the joined-to table see
superb self-caching, requiring no more physical I/O per row than required for
the well-clustered driving table, although the optimizer likely expects a cost for reaching the second table
close to a hundred times higher than the cost of reaching the first table! Even
the first time the query hits these well-clustered, probably-recent blocks, it
likely finds them cached, for the same reason that the matching blocks in the
driving table were well-cached.
The
optimizer finding a plan without hints will compare its estimate of the cost of
this nested-loops plan with a plan that reaches both tables with full table
scans and performs a hash join, and a plan that reaches only the first table
through the well-clustered index, then performs a hash join to a full table
scan of the second table. While the cost function implicitly expects most
logical I/O in the nested-loops plan to be uncached
physical I/O, when the contrary is true, the cost-function-implied estimate of
true physical I/O for full table scans is likely right on the money – most old
blocks of these large tables will be uncached, and real, physical multi-block reads will cover
most of the tables. Features 2 through 5 of the nested-loops alternative,
above, will predictably lead to high cost estimates, compared to the true
physical I/O count, typically by something on the order of a factor of a
hundred, especially for the dominant cost estimate of reaching the second
table, while the alternative cost estimates will map well to the true physical
I/O count for full-table-scan alternatives. This leads to a severe tendency to
over-favor full table scans in queries of co-clustered tables, especially hash
joins to full table scans. If the true cost of the nested-loops alternative is overwhelmingly lower, the optimizer will
still likely do the right thing, but if the alternatives are within a factor of
a hundred or so, in true runtime cost, the danger of it making the wrong choice
is very real.
I’m
occasionally accused of being a nested-loops bigot, since my book focuses so
much material on optimization of nested-loops plans. Many people besides myself have noticed the tendency of the optimizer to
over-favor hash joins to full table scans, however. As a result, there are
common hacks to overcome this tendency, especially overriding the default
settings for optimizer_index_caching and optimizer_index_cost_adj to make indexed access appear more
favorable to the optimizer. I hope this paper clarifies why these hacks so
often seem to help, and why the optimizer cost function does not just automatically
find the right costs for so many queries. At the same time, it should be clear
that these hacks are a two-edged sword – the default optimizer-calculated costs
are much more trustworthy in cases where tables being joined are not
co-clustered, for example, a join of Orders and Customers, on Customer_ID, in a scenario where the average customer
places many orders spread over years of history. Therefore, for such a join,
default settings of these parameters will work about right, or at least come
much closer to the right result. Adjusted cost functions also have a tendency
to blur the difference between alternative nested-loops plans, producing cost-based
ties when one join order would be correctly preferred to another if the costs were
unadjusted. What is really needed is a better accounting for the true tendency
to co-cluster, where it applies, and the lack of co-clustering, where it does
not apply, and the optimizer_index settings can’t
correctly handle both needs.
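For concreteness, the common hack looks like the following; the specific values are only illustrative of settings often seen in the field, not a recommendation:

-- Defaults are optimizer_index_caching=0 and optimizer_index_cost_adj=100,
-- as used in the tests above; overrides like these make indexed, nested-loops
-- access look much cheaper to the optimizer, with the side effects just discussed.
alter session set optimizer_index_caching = 90;
alter session set optimizer_index_cost_adj = 20;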
When
tuning manually, this isn’t hard to handle. If necessary, you can look at clustering_factors, and compare them to blockcounts
for that table, to deduce when clustering is good on the driving index, and
co-clustering is good between the joined tables. However, with just a tiny bit
of intuition about the application, assuming the table and column names make
any sense in a business context, it is usually easy to guess what columns
cluster well, and what tables co-cluster well, having their joined rows
generally created at about the same time. When you find such co-clustering
opportunities, you almost certainly want those tables joined by nested loops,
following the full join key, unless the query returns a huge number of rows.
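A minimal sketch of that dictionary check, comparing clustering_factors to table block counts (shown against the test tables above; substitute your own table names):

select i.table_name, i.index_name, i.clustering_factor, t.blocks,
       round(i.clustering_factor / nullif(t.blocks, 0), 1) cf_per_block
from dba_indexes i, dba_tables t
where i.table_owner = t.owner
and i.table_name = t.table_name
and i.table_name like 'DTOW_TEST_1000000R%'
order by i.table_name, i.index_name;
-- Ratios near 1 mean the table rows are stored in nearly the index's order
-- (well clustered); ratios approaching the table's rowcount mean the index
-- is effectively unclustered.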
When, on the other hand, you find yourself driving from an unclustered
filter condition, it is fair to ask whether a clustered filter condition, even
if it is slightly less selective, in terms of the fraction of rows reached,
might perform better. The only queries likely to favor unclustered
driving filters are either:
1) Queries having a highly selective unclustered
filter, such as a query for just Orders for a single customer. These filters
are so selective that they likely favor nested loops to other large tables even
without the benefits of co-clustering. Even though these queries may cover data
over a lot of history, lacking any selective, well-clustered time-dependent
component, they return a very selective history, so the rowcounts
are reasonable. Or,
2) Queries lacking selective filters, altogether, and
having no filter at all that is well-clustered. These queries tend to favor
hash joins to full table scans even of big tables. (Hash joins to full table
scans of small tables are quite commonly favorable, although the
nested-loops alternative in these cases is commonly not much worse, since
neither alternative likely requires any physical I/O, nor much CPU time
compared to the rest of the query time.) However, queries of large tables
without selective filters, and without well-clustered filters, tend to return
more rows than are useful in most business contexts, so the focus here is on
avoiding these queries, altogether.
Hash
joins to full table scans of large tables commonly look much better than they
are. Occasionally, the hash joins make good sense when we join large tables
that are not well co-clustered. More often, where large hash joins beat a
nested-loops plan, the query commonly either reads far more rows than are truly
useful in a business context, or the nested-loops plan being compared is the wrong nested-loops plan, likely with the
wrong join order, or missing necessary join-key indexes, or with join-key
indexes disabled by a type conversion or some other function on the join key.