Dan Tow
©2008 Dan Tow, All Rights Reserved
A surprising amount of
tuning work goes into overriding optimizers’ tendency to join tables with hash
joins and sort-merge joins to full table scans. It’s not obvious why
optimizers, which have been refined for years, should still seem to under-favor
the nested-loops alternative, but I have some thoughts on the subject:
Most data in a business
application consists of bundles of information detailing a business event, an
entity that can be located on a timeline in the history of the business. The
most obvious prototype for the business event is an order, but information
related to that event will be tracked in several tables that would join
directly or indirectly to the Orders table, including tables tracking order
details, shipments, invoices, payments, commissions, et cetera. The entire
cluster of information related to the highest-level master table in the
hierarchy will likely be created within a short time window, as one business
event triggers rapid follow-up events, and many rows in master-detail
relationships typically will be created in a single transaction, such as orders
and order details, which would each be meaningless without the other.
When we query business data
for most purposes, we typically need to see data related to quite recent
business events. These recent events may be important to query because we need
to monitor the current health of the business or because the events are still
unfinished, with tasks required to fully complete the business process related
to the order or other business event. Business applications boil down to tools
to trigger the business to make the right actions and decisions, today.
“Ancient history,” which is to say an event older than about a year, is rarely
relevant to business actions and decisions that need to be made today, and even
events a month or two old are not often relevant to day-to-day business
operations.
In a typical heap table, the
data for these recent events resides together at the top of the table, or occasionally
together in older blocks made freshly empty by being purged of data so old that
it is no longer needed. Because the rows representing the most recent data reside
together in a small subset of the table blocks, an execution plan for a query
of these rows tends to find multiple rows needed by the query in each of these
blocks. This self-caching effect is very useful, where blocks needed by the
query are cached early in the query, then reused from the cache multiple times
during the query. Even early in the query, the first time these blocks are
needed, they are likely already to be cached by other recent queries, which also
tend to need blocks from that small subset of the table that holds the recent
rows.
When we query a tiny subset
of an events-data table, and the database can see that the subset is tiny, then
the optimizer has no trouble figuring out that the joins to related-events
tables will also read just a tiny subset of those tables, and nested loops
plans to those tables are an easy choice. There are two cases, though, where the choice stymies the typical optimizer:
1) The filters, in combination, are far more selective than they appear, so the optimizer grossly overestimates the rowcount that will reach the joined-to tables.
2) The rowcount reaching the joined-to tables really is quite large, even for a well-designed query.
Consider each of these
cases, in turn. As savvy human tuners, we should know that a typical business
report provides a small enough set of return data that a human would find the
report useful. Ten or twenty pages is about the outside limit of report length
a human is likely to read from end to end, corresponding to the result of a
maximum of about 1000 rows returned from a query. Even in a data-warehousing
context, really huge query results should be the exception, rather than the
rule. Therefore, if the filters on a query appear unselective, the savvy human
tuner should suspect that either the report is poorly designed, from the
perspective of the user-interface, or the filters, in combination, are actually
more selective than they appear. This happens surprisingly often in business
queries, which tend to look for exceptions that could be described abstractly
as:
“Rows that fail to satisfy X should at least satisfy Y, but we need to see the exceptions to this rule.”
The query for exceptions to
the rule, which would call for business action (either action to fix the
specific exception, or, better still, action to analyze and prevent future
occurrences of the exception), would look like
Select …
where not <X> and not <Y>
and <joins to other tables with supporting data>;
My favorite example of this
sort of query is a query looking for orders that are neither closed nor recent,
since orders should not stay in the “open” state long. Often (assuming
well-designed business processes), exceptions are very rare, but each of the
conditions (not <X> and not <Y>) by itself may be quite common.
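To make this concrete, here is a hedged sketch of such an exception query for the orders example (the orders table and its order_status and order_date columns are illustrative assumptions, not objects from the test scripts later in this paper):

select o.order_id, o.customer_id, o.order_date, o.order_status
from orders o
where o.order_status <> 'CLOSED'              -- not <X>: the order is not closed
and o.order_date < add_months(sysdate, -1)    -- not <Y>: the order is not recent
order by o.order_date;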
Since optimizers almost invariably assume statistical independence for pairs of
conditions like this (a very wrong assumption in these cases!), they frequently
grossly overestimate how many rows they’ll reach at the point where both
conditions can be applied, and this causes a gross overestimate in the rows to
be joined to subsequent tables, case #1, above. In these cases, if the
optimizer understood what the savvy human tuner understands, that the query
probably won’t return over a thousand rows, then the optimizer would tend to
favor nested loops following the join key, rather than an incorrect choice to
hash join or sort-merge join to a full table scan of the later table.
Unfortunately, the optimizer has no preconception that rowcounts
from reasonable queries tend to be small, and it makes the choice to favor a
plan that is only justified if that result turns out to be unreasonably large,
as it simplistically appears it will be.
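Where you suspect this kind of correlation, it is easy to measure how far off the independence assumption is. A minimal sketch, using the same assumed orders columns as above:

select count(*)                                                   total_orders,
       sum(case when order_status <> 'CLOSED' then 1 else 0 end)  not_closed,
       sum(case when order_date < add_months(sysdate, -1)
                then 1 else 0 end)                                not_recent,
       sum(case when order_status <> 'CLOSED'
                 and order_date < add_months(sysdate, -1)
                then 1 else 0 end)                                exceptions
from orders;
-- Independence predicts exceptions ~= total_orders * (not_closed/total_orders)
--                                                  * (not_recent/total_orders);
-- with healthy business processes, the real exceptions count is usually far smaller.

When the measured exceptions count comes back orders of magnitude below the independence-based prediction, that gap is exactly the error that misleads the optimizer.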
Now consider case #2 above,
the case that the rowcount really is quite large, or
at least it is large before some group-by sums up a large result set. (Reports
of over 1000 rows should be rare, but short reports that sum over 1000 rows make perfect sense in the business context!) As
our example, let’s say we want to look at the most recent month of data out of
a 4-year history. Let’s further assume that there are 96 rows of data per
database block in the driving table. Now, consider the join to a related events
table, which we’ll assume has a three-deep index tree on the join key, and
twice as many blocks as the driving table. We won’t count the root block in the
index tree in our calculations, because we assume that root blocks are
perfectly cached.
If we follow nested loops to
the second table, for each row from the driving table, we’ll do 2 logical I/Os
(not counting the root block) to the join-key index, and a logical I/O to the
joined-to table block, or 3 logical I/Os in all per
joined-from row. Since we are reading 1/48th of the joined-from
table, but have 96 rows/block in that table, we’re reading twice as many rows
as we have blocks in that table, which is the same number of rows as the number
of blocks in the joined-to table. Since we counted 3 logical I/Os (not counting
the root index blocks) per joined-from row, we count 3 times as many logical
I/Os to reach the joined-to table by nested loops as the total block count in that same table! All the optimizer has to do
to favor a hash join or a sort-merge join to a full table scan of that
joined-to table is to decide that reading the table once in a single pass is
better than reading three times as
many blocks one block at a time, with logical I/Os driving through a join-key index,
in nested loops! Unquestionably, we should expect a higher logical I/O count
with the nested-loops plan here, so does that mean the nested-loops plan is
inferior?
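To recap the arithmetic compactly, writing B1 for the driving-table block count and B2 = 2*B1 for the joined-to table's block count (the assumptions stated above):

\[ \text{rows driven} = \frac{B_1}{48}\times 96 = 2B_1 = B_2, \qquad \text{LIO}_{\text{nested loops}} \approx 3\times\text{rows driven} = 3B_2 . \]

So the nested-loops plan is charged roughly three logical I/Os for every block of the joined-to table, while the full-scan alternative reads each of those blocks just once.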
Assume, first, that all
blocks are cached. There is a CPU overhead simply to perform a logical I/O, and
the nested-loops plan will surely cost more CPU for its logical I/Os. Once we have performed the logical I/O, however, there
is also CPU overhead for what we do
with the block. In the nested-loops plan, we read a small fraction of the block.
The runtime for both the logical I/O and reading this small fraction is
typically under 10 microseconds. Unless
the rowcounts reached are truly huge, or the
join-order is inefficient, reaching far more rows than the rowcount
that satisfies all the filters, the CPU costs for these nested loops are
insignificant! (With bad join orders,
hash joins to full table scans are much more likely to be a significant “win” than with correct
join orders!) Consider the blocks we read for the join alternative that reaches
the joined-to table with a full table scan, however: For these blocks, the
database must view the entire block, every row in the block, and this typically
takes much longer than the 10 microseconds per block we need for the nested-loops alternative. In all, CPU efficiency shows a trade-off – more CPU just for the individual logical I/Os for the nested-loops alternative, here, but less CPU for the work done inside each
block, and the nested-loops plan is better than it appears, from just a
CPU-consumption perspective.
Now assume that the blocks
are not all so well cached. This is much more realistic, if these tables are
big enough that we really need to worry about the tuning problem. The full table scan encounters blocks (or, in the case of Oracle, multi-block
groups of blocks, typically 8-block
groups of 8K blocks) from the entire table, including the majority of the table
that qualifies as “ancient history.” These old blocks will be extremely poorly
cached, so physical I/O will be high for the non-nested-loops alternative.
Consider the nested-loops alternative, though: The joined-to key values and
joined-to related-events rows will tend to be roughly one month old, or less,
just like the rows in the driving table. There are likely no more than a few
hundred index blocks, at the most, holding the necessary recent join-key data.
These are likely well-cached before the query even starts, but even if they
aren’t they will be read in and cached (and re-used many times) for the
remainder of the query during the first couple of seconds, likely. The table
blocks for the most-recent month of the related-events table will be slightly
less-well cached, at first, but even these blocks will tend to be read in
roughly the same order they are stored, at most a physical-I/O count equal to
about 1/48th of the blocks in the table. Even this physical I/O
count exaggerates the work on disk, because typically read-ahead performed in
the disk subsystem will act like a multi-block read, reaching just the blocks
needed next before they are needed, and caching those blocks in the disk
subsystem memory so that most of what the database thinks is a physical I/O request turns out actually to be a
super-fast read from the disk-subsystem cache. In all, the expected time for
the physical I/Os to follow the nested-loops alternative is vastly better than
the non-nested-loops alternatives, in this example, and the time-savings is
overwhelmingly more important than any potential cost for extra logical I/Os.
Time-related data tends to
cluster together in tables, and time-related data tends to join to other time-related data in
well-clustered ways, as well. I refer to this as natural data clustering. From the physical I/O perspective, two
tables that have joined rows generally created at about (or exactly) the same
time act almost as if their rows were stored together in the same blocks – the
number of physical I/Os and physical I/O time
(which tends to dominate in well-tuned cases like this) necessary to read
joined rows grouped well in creation time is roughly the same as it would be if
the tables were reorganized into Oracle multi-table clusters, even though no
one has lifted a finger to formally cluster the tables. The data read from
these tables are naturally clustered
as an automatic consequence of master and detail event-related rows generally
being created at about the same time, and as an automatic consequence of most
applications queries reaching mainly recent data at the top of the heap tables.
Consider a join of the most
recent 100 blocks of Orders rows with the most recent 200 blocks of Order_Details: However high the logical I/O for a
nested-loops plan might be, the query will hit just 300 distinct table blocks
and perhaps a dozen or so join-key blocks (which were likely cached, anyway),
and even a pessimist shouldn’t expect more than 300 physical I/Os. If we
reasoned that Orders and Order_Details are invariably
joined to one another, and belong in a two-table Oracle cluster, the very same
rows would fit in roughly the last 300 blocks of such a cluster, for no
significant net savings in physical I/O! In precisely the scenarios where
physical multi-table clusters look attractive (rows of related tables are
predictably created together), the savings in physical I/O turn out to be trivial
or non-existent! There is a large
savings in logical I/O, but in well-designed queries like this, this matters
little. The biggest difference between the clustered and non-clustered example
is in the optimizer’s (mistaken) estimate, in cases like this – the optimizer
will correctly estimate a low cost for the physically clustered case, but it
will (incorrectly) estimate a high cost for the case of the nested-loops join
between single tables, although the clustering factor in Oracle’s data dictionary will enable the optimizer to correctly estimate a low cost for the read of the driving table’s well-clustered blocks.
Is there any case where the
optimizer is right to be pessimistic? It turns out that there is such a case –
if the driving condition is not correlated with row-creation time (or if the
table rows have become scrambled with respect to row-creation time, see Killing Cache Efficiency with
Parallel Table Rebuilds), then nested loops may need nearly as many physical
I/Os as logical I/Os. For example, if we query all the orders for a particular
customer over all time, the driving-table rows will be scattered, and so will
the joined-to Order_Details. Consider, though, that
well-designed application queries should rarely look like this – such a query
will get progressively slower as the tables accumulate more and more history,
and why would we want to see “ancient” records of a customer’s orders, anyway?
Most queries should reach some time-correlated condition (often combined with a non-time-correlated
condition, such as customer ID) in the first multi-row table read (which may be
preceded by one or more single-row reads, such as a read of a single customer
record), which then drives to related event-type records created around the
same time, using nested loops.
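A hedged sketch of such a query (illustrative table and column names, with a bind variable for the customer): the time-correlated condition combines with the customer condition on the driving table, and the related detail rows are then reached by nested loops through the join key:

select /*+ use_nl(o od) */
       o.order_id, o.order_date, od.line_number, od.quantity
from orders o, order_details od
where o.customer_id = :customer_id              -- non-time-correlated condition
and o.order_date >= add_months(sysdate, -1)     -- time-correlated condition
and od.order_id = o.order_id;                   -- join key followed by nested loops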
I want to demonstrate the
point physically with specific tables, to place some numbers behind the
abstract theoretical argument:
Consider three tables, each
with one-million rows, in a three-way one-to-one relationship, two of them
maximally co-clustered, with the third maximally unclustered
with respect to the other two. (If you wish to make this concrete in your mind
with real applications tables, think of the first two tables as an Orders table
built into the generic application, and an orders-extension table added as a
customization, to allow extended information on every order needed by the
specific application site, but not anticipated in the original
generic-application design. The third table is trickier, since almost any real
example of a table joined one-to-one to an events table would naturally
co-cluster with that table, since the rows of one-to-one tables almost have to
be created at the same time. The closest example, though, would be to imagine
that the company rarely sells to the same customer twice, making customer data
almost one-to-one with orders data, but at some very recent date, the business
has physically rebuilt the customer table so that rows are stored in
alphabetical order by customer name, not in the order the orders and their
customers were entered at all. (Something like this would also happen on a
single-table cluster clustered on a non-time-related column, or on an
index-organized table arranged on a non-time-related column.))
Here are some reproducible
table-creation scripts for demonstration purposes, for tests I ran on 10g, with
8KB blocks (I got similar results on 9i.):
create table dtow_test_10r(a number);
insert into dtow_test_10r values(0);
insert into dtow_test_10r values(1);
insert into dtow_test_10r values(2);
insert into dtow_test_10r values(3);
insert into dtow_test_10r values(4);
insert into dtow_test_10r values(5);
insert into dtow_test_10r values(6);
insert into dtow_test_10r values(7);
insert into dtow_test_10r values(8);
insert into dtow_test_10r values(9);
commit;
CREATE TABLE dtow_test_1000000r1
 (pkey_id      NUMBER(18),
  recent_flag  CHAR(1) NOT NULL,
  val          VARCHAR2(80),
  CONSTRAINT dtow_test_1000000r1_u1 PRIMARY KEY (pkey_id));

create index dtow_test_1000000r1_n1 on dtow_test_1000000r1(recent_flag);

insert into dtow_test_1000000r1
select /*+ noparallel ORDERED */
       t1.a*100000+t2.a*10000+t3.a*1000+t4.a*100+t5.a*10+t6.a,
       decode(t1.a*10+t2.a,99,'Y','N'),
       '81234567897123456789612345678951234567894123456789312345678921234567891123456789'
from dtow_test_10r t6,
     dtow_test_10r t5,
     dtow_test_10r t4,
     dtow_test_10r t3,
     dtow_test_10r t2,
     dtow_test_10r t1;
commit;

ANALYZE TABLE dtow_test_1000000r1 COMPUTE STATISTICS;

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ('PERF11I','DTOW_TEST_1000000R1',
    METHOD_OPT => 'FOR COLUMNS SIZE 254 RECENT_FLAG');
END;
/
CREATE TABLE dtow_test_1000000r2
 (pkey_id  CONSTRAINT fk_r1 REFERENCES dtow_test_1000000r1(pkey_id),
  val2     VARCHAR2(80),
  CONSTRAINT dtow_test_1000000r2_u1 PRIMARY KEY (pkey_id));

insert into dtow_test_1000000r2
select /*+ noparallel ORDERED */
       t1.a*100000+t2.a*10000+t3.a*1000+t4.a*100+t5.a*10+t6.a,
       '81234567897123456789612345678951234567894123456789312345678921234567891123456789'
from dtow_test_10r t6,
     dtow_test_10r t5,
     dtow_test_10r t4,
     dtow_test_10r t3,
     dtow_test_10r t2,
     dtow_test_10r t1;
commit;

ANALYZE TABLE dtow_test_1000000r2 COMPUTE STATISTICS;
CREATE TABLE dtow_test_1000000r3
 (pkey_id      NUMBER(18),
  recent_flag  CHAR(1),
  val          VARCHAR2(80),
  CONSTRAINT dtow_test_1000000r3_u1 PRIMARY KEY (pkey_id));

create index dtow_test_1000000r3_n1 on dtow_test_1000000r3(recent_flag);

insert into dtow_test_1000000r3
select /*+ noparallel ORDERED */
       t1.a*100000+t2.a*10000+t3.a*1000+t4.a*100+t5.a*10+t6.a,
       decode(t1.a*10+t2.a,99,'Y','N'),
       '81234567897123456789612345678951234567894123456789312345678921234567891123456789'
from dtow_test_10r t1,
     dtow_test_10r t2,
     dtow_test_10r t3,
     dtow_test_10r t4,
     dtow_test_10r t5,
     dtow_test_10r t6;
commit;

ANALYZE TABLE dtow_test_1000000r3 COMPUTE STATISTICS;

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS ('PERF11I','DTOW_TEST_1000000R3',
    METHOD_OPT => 'FOR COLUMNS SIZE 254 RECENT_FLAG');
END;
/
-- Set both of these parameters to the defaults:
alter session set optimizer_index_caching=0;
alter session set optimizer_index_cost_adj=100;
Here is some information
about the optimizer’s perspective on these tables:
SQL> select blocks, table_name from dba_tables
  2  where table_name LIKE 'DTOW_TEST_1000000R%';

    BLOCKS TABLE_NAME
---------- ------------------------------
     12822 DTOW_TEST_1000000R1
     12657 DTOW_TEST_1000000R2
     12822 DTOW_TEST_1000000R3
SQL> select LEAF_BLOCKS, BLEVEL, CLUSTERING_FACTOR, INDEX_NAME
  2  from dba_indexes where TABLE_NAME LIKE 'DTOW_TEST_1000000R%'
  3  order by INDEX_NAME;

LEAF_BLOCKS     BLEVEL CLUSTERING_FACTOR INDEX_NAME
----------- ---------- ----------------- ------------------------------
       1815          2             12820 DTOW_TEST_1000000R1_N1
       1879          2             12819 DTOW_TEST_1000000R1_U1
       1879          2             12657 DTOW_TEST_1000000R2_U1
       3228          2             22819 DTOW_TEST_1000000R3_N1
       2059          2           1000000 DTOW_TEST_1000000R3_U1
For space reasons, I’ll only
briefly mention Oracle’s handling of single-table queries of these tables: The
clustering factors enable Oracle to have a generally good picture of the true
cost of these queries, in terms of the actual physical I/O we should expect,
although the estimate falls down somewhat, even with a histogram, on handling
of a skewed data distribution such as for the RECENT_FLAG in DTOW_TEST_1000000R3.
Generally, though, self-caching of blocks hit early in the single-table query
that will be reused later is well handled by the optimizer’s apparent use of
CLUSTERING_FACTOR.
Self-caching in the
joined-to table is another matter, however! Compare a simple co-clustered join of the first two tables, with and without dynamic sampling, to the similar but wholly non-co-clustered join of the first and third tables. These three queries are stored in j1.sql, j2.sql, and j3.sql, and are probed by my script exq8.sql (which you can find, along with some other useful scripts used here, on my site):
SQL> @j1
 1  select /*+ ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
 2         max(t2.val2)
 3  from dtow_test_1000000r1 t1, dtow_test_1000000r2 t2
 4  where t1.pkey_id=t2.pkey_id
 5* and t1.recent_flag = 'Y'

1 SELECT STATEMENT  c=20337 r=1
2   SORT AGGREGATE  c=_ r=1
3     NESTED LOOPS  c=20337 r=10028
4       TABLE ACCESS BY INDEX ROWID 1*DTOW_TEST_1000000R1  c=150 r=10058
5         INDEX RANGE SCAN DTOW_TEST_1000000R1_N1  c=21 r=10058
4       TABLE ACCESS BY INDEX ROWID 2*DTOW_TEST_1000000R2  c=2 r=1
5         INDEX UNIQUE SCAN DTOW_TEST_1000000R2_U1  c=1 r=1

SQL> @j2
 1  select /*+ dynamic_sampling(t1 10) dynamic_sampling(t2 10)
 2             ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
 3         max(t2.val2)
 4  from dtow_test_1000000r1 t1, dtow_test_1000000r2 t2
 5  where t1.pkey_id=t2.pkey_id
 6* and t1.recent_flag = 'Y'

1 SELECT STATEMENT  c=20221 r=1
2   SORT AGGREGATE  c=_ r=1
3     NESTED LOOPS  c=20221 r=9971
4       TABLE ACCESS BY INDEX ROWID 1*DTOW_TEST_1000000R1  c=150 r=10000
5         INDEX RANGE SCAN DTOW_TEST_1000000R1_N1  c=21 r=10058
4       TABLE ACCESS BY INDEX ROWID 2*DTOW_TEST_1000000R2  c=2 r=1
5         INDEX UNIQUE SCAN DTOW_TEST_1000000R2_U1  c=1 r=1

SQL> @j3
 1  select /*+ ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
 2         max(t2.val)
 3  from dtow_test_1000000r1 t1, dtow_test_1000000r3 t2
 4  where t1.pkey_id=t2.pkey_id
 5* and t1.recent_flag = 'Y'

1 SELECT STATEMENT  c=20337 r=1
2   SORT AGGREGATE  c=_ r=1
3     NESTED LOOPS  c=20337 r=10058
4       TABLE ACCESS BY INDEX ROWID 1*DTOW_TEST_1000000R1  c=150 r=10058
5         INDEX RANGE SCAN DTOW_TEST_1000000R1_N1  c=21 r=10058
4       TABLE ACCESS BY INDEX ROWID 2*DTOW_TEST_1000000R3  c=2 r=1
5         INDEX UNIQUE SCAN DTOW_TEST_1000000R3_U1  c=1 r=1
What is notable about these three execution plans, and about the costs the optimizer estimates for each step, is how similar they are – evidently the optimizer, even with dynamic sampling at the highest level, sees no difference between the optimally co-clustered case and the wholly non-clustered case. In both cases, almost every logical I/O “counts” in the cost calculation, as if logical I/Os had the same runtime cost whether they were
self-cached early in the query or not. In the co-clustered case, “cost” exceeds
expected physical I/O by a factor of about 75, even assuming no caching at all
from other, previously-run SQL. Of course, Oracle has never claimed that its
cost function estimates physical I/O. Note, though, that where physical I/O dominates runtime, the cost function does predict physical I/O fairly well in the hash-join case but overestimates it by a factor of about 75 in the nested-loops co-clustered case, and this creates a striking tendency for the optimizer to favor hash-join alternatives that will run far longer than the nested-loops option!
Therefore, it isn’t surprising that the hints shown are needed to get the nested-loops plans – left to its own devices, the optimizer will choose a hash join to a full table scan of the second table, and will estimate the cost of that plan to be lower by a factor of 5(!). Here, though, its “cost” calculation is largely based on a count of multi-block I/Os to perform that full table scan, I/Os that likely will be physical, so its estimate of the number of likely physical I/Os will be just about right (not seeing the factor-of-75 mismatch, that is):
 1  select max(t2.val2)
 2  from dtow_test_1000000r1 t1, dtow_test_1000000r2 t2
 3  where t1.pkey_id=t2.pkey_id
 4* and t1.recent_flag = 'Y'

1 SELECT STATEMENT  c=3737 r=1
2   SORT AGGREGATE  c=_ r=1
3     HASH JOIN  c=3737 r=10028
4       TABLE ACCESS BY INDEX ROWID 1*DTOW_TEST_1000000R1  c=150 r=10058
5         INDEX RANGE SCAN DTOW_TEST_1000000R1_N1  c=21 r=10058
4       TABLE ACCESS FULL 2*DTOW_TEST_1000000R2  c=3542 r=997051
How do these cost estimates compare with the reality of actual runtimes for these cases, where there are no rows pre-cached by previous queries? To measure the runtimes and I/O statistics, with long waits prior to each test to clear the cache, I ran the following script:
alter session set optimizer_index_caching=0;
alter session set optimizer_index_cost_adj=100;
set heading off
@mysid10
select /*+ ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
max(t2.val2)
from dtow_test_1000000r1 t1,
dtow_test_1000000r2 t2
where t1.pkey_id=t2.pkey_id
and t1.recent_flag = 'Y';
@reads10
select /*+ ORDERED use_nl(t1 t2) index(t1 dtow_test_1000000r1_n1) */
max(t2.val)
from dtow_test_1000000r1 t1,
dtow_test_1000000r3 t2
where t1.pkey_id=t2.pkey_id
and t1.recent_flag = 'Y';
@reads10
--Wait a while for the cache to clear out.
<pause here, at least an
hour, before proceeding with the rest of the test.>
select max(t2.val2)
from dtow_test_1000000r1 t1,
dtow_test_1000000r2 t2
where t1.pkey_id=t2.pkey_id
and t1.recent_flag = 'Y';
@reads10
select max(t2.val)
from dtow_test_1000000r1 t1,
dtow_test_1000000r3 t2
where t1.pkey_id=t2.pkey_id
and t1.recent_flag = 'Y';
@reads10
With the results:

330
13756 7886
Elapsed: 00:00:00.01

81234567897123456789612345678951234567894123456789312345678921234567891123456789
Elapsed: 00:00:00.74
Logical Reads = 30302 Physical Reads = 314 MB PIO = 0
Elapsed: 00:00:00.02

81234567897123456789612345678951234567894123456789312345678921234567891123456789
Elapsed: 00:00:49.17
Logical Reads = 30223 Physical Reads = 10029 MB PIO = 0
Elapsed: 00:00:00.02

<pause here>

81234567897123456789612345678951234567894123456789312345678921234567891123456789
Elapsed: 00:00:06.65
Logical Reads = 12972 Physical Reads = 1751 MB PIO = 1583
Elapsed: 00:00:00.02

81234567897123456789612345678951234567894123456789312345678921234567891123456789
Elapsed: 00:00:06.75
Logical Reads = 12899 Physical Reads = 1592 MB PIO = 1584
Elapsed: 00:00:00.02
As
you can see, the first query, joining perfectly co-clustered heap tables DTOW_TEST_1000000R1
and DTOW_TEST_1000000R2, runs in just 0.74 seconds, with just 314 physical
I/Os, which it turns out is just about what we’d expect hitting a
well-clustered 1% of each of these tables’ blocks, and a handful of index
blocks. In the second query, on the other hand, we’d expect the optimizer’s
estimate of the cost of hitting the second table to be right, since these
tables’ matching rows are laid out completely differently, with clustered rows
in one not being the least bit clustered in the other. Here, we see over 10,000
physical I/Os, even though the first query has just cached all the blocks
needed from the driving table and index, just what we’d expect for the 10000
row-lookups in the second table, matching well the CBO’s
estimate of the cost of reaching that table. Even here, though, the number of
index blocks physically read for the join-key index must surely have been lower
than the CBO’s estimated index cost of about 10,000,
since the entire index has just about
1900 blocks, all of which will end up cached early in the query. As expected,
this second query runs many times slower than the first, in 49 seconds,
although the optimizer estimated its cost as essentially identical to the first!
The
third query gets the plan the optimizer wants for the first query when the hints are taken away: a hash join to a full table scan of
DTOW_TEST_1000000R2. The multi-block physical I/O count is just what we’d
expect for a table with the given number of table blocks, and the query runs
many times (9 times) slower than query one, even though the optimizer gave it a
cost that was over 5 times lower than
the calculated cost of the first plan! The net error in relative runtimes
compared to relative calculated costs is 45! The fourth query, though, compared
to the second (the same query with hints), is
faster, by about the same factor that the optimizer’s estimated costs predict,
showing that when the joined tables are not
co-clustered, and blocks are not pre-cached, the cost function can be
reasonably predictive of relative runtimes. (I find similar results on 9i.)
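The mysid10 and reads10 probe scripts used above are among the scripts on my site and are not reproduced here. A rough, assumed equivalent of the reads10 probe, using only the standard v$mystat and v$statname views, would look something like this (the real script evidently also reports per-query deltas and multi-block physical I/O; this sketch just dumps the session’s cumulative counters):

select sn.name, st.value
from v$mystat st, v$statname sn
where st.statistic# = sn.statistic#
and sn.name in ('session logical reads', 'physical reads');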
Oracle’s
optimizer must make certain assumptions to optimize efficiently with a
manageable set of statistics. For example, Oracle maintains data about the
distribution of data for each table column, and uses these data to predict the
selectivity of any given condition on any single column. However, Oracle quite
reasonably has no data on the combined selectivities
of multiple conditions on multiple columns, because this would require an
enormously detailed analysis and a huge volume of correlated distribution
statistics. For example, if we query for an open order for a given customer,
Oracle (assuming a histogram has been generated) has a good idea of the number
of orders with Open_Flag=’Y’, and a good idea of the
average number of orders for a randomly chosen customer, but no direct data
pertaining to the combination of these conditions. Oracle applies a standard
statistical assumption, in the face of this difficulty – Oracle assumes that
the two conditions are statistically independent, that is, that the probability
of the two conditions being true is the product of the probabilities of each
condition being true. Most often, this assumption is just fine, and even in the
cases where it turns out to be dramatically wrong (as it sometimes is), it is
easy to see that Oracle has little alternative, unless it gathers data at the
time of the query parse, specifically for that parse. In fact, Oracle does gather such data at parse time, when we choose high levels of dynamic_sampling – at the cost of some data sampling prior to the parse, it can find the correlated frequency of multiple conditions on the same table being true.
It
is natural to assume that co-clustering is a problem similar to correlated
distributions – just one of those things the optimizer can’t be expected to
know, given its limited data. However, I argue that this is not the case! If you look at the block
counts, above, for DTOW_TEST_1000000R1 and DTOW_TEST_1000000R2, and the clustering_factors for each of the three indexes on those
tables, you find that the co-clustering is provable with data the optimizer already has - DTOW_TEST_1000000R1
clusters perfectly on Recent_Flag, and those
perfectly clustered blocks cluster equally well to PKey_ID,
on that table, so Recent_Flag must belong to a narrow
range of PKey_ID values. That narrow range in turn
will join to the very same narrow range of PKey_ID
values in DTOW_TEST_1000000R2, which is in turn clustered perfectly to the
narrowest possible range of blocks in that table, all steps in this chain of
logic being verifiable with clustering_factors known to
the optimizer! One might object that the situation is artificial – a function
of the artificial way I generated these tables. However, nothing could be
further from the truth! In fact, excellent co-clustering of large transaction
tables is the rule, not the
exception, because most large master-detail table pairs and one-to-one table pairs have matching rows on both sides of the join created at almost
exactly the same time! (Consider the case of Orders and Order_Details,
discussed above, for a concrete example.) Consider the following typical, predictable features of a well-chosen
nested-loops plan joining two large transaction tables:
1) The driving index likely takes advantage of at least
one column that correlates well with recent rows, since the business
application is unlikely to often query ancient history, and any condition that
fails to correlate with the recentness of data will gradually become
unselective, as enough data accumulate. This means that we most often drive
from well-clustered indexes, if the SQL and indexes are well designed. The
optimizer’s data show it how well clustered the index is, so it likely will
accurately assess the number of blocks it will reach in the first table, and
the number of rows it will find in that table.
2) In reaching the most-recent blocks of the first
table, through the most-recent blocks of the driving index, the optimizer will
find blocks that are cached far better than average, since these most-recent
blocks are also of interest to other portions of the application and other
users. The optimizer (with default system parameters) does not appear to take
into account this tendency to find well-cached blocks with correctly chosen
indexed access when comparing costs with, for example, a full table scan
alternative.
3) In reading the well-clustered most recent blocks of
the driving table, where these aren’t already cached in the database cache, the
disk subsystem will likely perform read-ahead of the nearby blocks when it must
perform disk reads. Because the plan has a high probability of requesting those
next blocks very soon, as it performs its clustered range scan, subsequent
“physical reads” that Oracle requests are likely to be satisfied from the disk
subsystem’s read-ahead cache, meaning that the average time to perform these
“physical” single-block read requests by Oracle is far less than we’d expect
for true physical reads. This has the effect of making a clustered series of
single-block reads perform much more like a series of multi-block reads than
the cost function accounts for. The cost function expects a perfectly clustered
range scan of a whole table to take about 8 times longer than a full table
scan, for example, where the multi-block reads read 8 blocks at a time, but the
actual performance of this range scan is far better than that, owing to typical
read-ahead going on outside of Oracle’s control.
4) In the nested loop, the series of index scans on the
well co-clustered join key will almost always hit the very same index blocks
that Oracle reached for the row before. Even in cases where they do not hit
blocks recently reached earlier by the same query, these index blocks are very
likely to be cached by other applications or by read-ahead in the disk
subsystem. Although logical I/Os to the index are not free, and can have
significant cost where rowcounts reached are high,
these CPU-time costs are frequently minor in well-optimized queries to large
tables, compared to the physical-I/O-time costs. The optimizer cost function,
on the other hand, takes poor account of this.
5) In the nested loop, the series of table access by
index rowid steps to reach the joined-to table see
superb self-caching, requiring no more physical I/O per row than required for
the well-clustered driving table, although the optimizer likely expects a cost for reaching the second table
close to a hundred times higher than the cost of reaching the first table! Even
the first time the query hits these well-clustered, probably-recent blocks, it
likely finds them cached, for the same reason that the matching blocks in the
driving table were well-cached.
The
optimizer finding a plan without hints will compare its estimate of the cost of
this nested-loops plan with a plan that reaches both tables with full table
scans and performs a hash join, and a plan that reaches only the first table
through the well-clustered index, then performs a hash join to a full table
scan of the second table. While the cost function implicitly expects most
logical I/O in the nested-loops plan to be uncached
physical I/O, when the contrary is true, the cost-function-implied estimate of
true physical I/O for full table scans is likely right on the money – most old
blocks of these large tables will be uncached, and real, physical multi-block reads will cover
most of the tables. Features 2 through 5 of the nested-loops alternative,
above, will predictably lead to high cost estimates, compared to the true
physical I/O count, typically by something on the order of a factor of a
hundred, especially for the dominant cost estimate of reaching the second
table, while the alternative cost estimates will map well to the true physical
I/O count for full-table-scan alternatives. This leads to a severe tendency to
over-favor full table scans in queries of co-clustered tables, especially hash
joins to full table scans. If the true cost of the nested-loops alternative is overwhelmingly lower, the optimizer will
still likely do the right thing, but if the alternatives are within a factor of
a hundred or so, in true runtime cost, the danger of it making the wrong choice
is very real.
I’m
occasionally accused of being a nested-loops bigot, since my book focuses so
much material on optimization of nested-loops plans. Many people besides myself have noticed the tendency of the optimizer to
over-favor hash joins to full table scans, however. As a result, there are
common hacks to overcome this tendency, especially overriding the default
settings for optimizer_index_caching and optimizer_index_cost_adj to make indexed access appear more
favorable to the optimizer. I hope this paper clarifies why these hacks so
often seem to help, and why the optimizer cost function does not just automatically
find the right costs for so many queries. At the same time, it should be clear
that these hacks are a two-edged sword – the default optimizer-calculated costs
are much more trustworthy in cases where tables being joined are not
co-clustered, for example, a join of Orders and Customers, on Customer_ID, in a scenario where the average customer
places many orders spread over years of history. Therefore, for such a join,
default settings of these parameters will work about right, or at least come
much closer to the right result. Adjusted cost functions also have a tendency
to blur the difference between alternative nested-loops plans, producing cost-based
ties when one join order would be correctly preferred to another if the costs were
unadjusted. What is really needed is a better accounting for the true tendency
to co-cluster, where it applies, and the lack of co-clustering, where it does
not apply, and the optimizer_index settings can’t
correctly handle both needs.
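For concreteness, the common hack looks like the following; the specific values are only illustrative of settings often seen in the field, not a recommendation:

-- Defaults are optimizer_index_caching=0 and optimizer_index_cost_adj=100,
-- as used in the tests above; overrides like these make indexed, nested-loops
-- access look much cheaper to the optimizer, with the side effects just discussed.
alter session set optimizer_index_caching = 90;
alter session set optimizer_index_cost_adj = 20;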
When
tuning manually, this isn’t hard to handle. If necessary, you can look at clustering_factors, and compare them to blockcounts
for that table, to deduce when clustering is good on the driving index, and
co-clustering is good between the joined tables. However, with just a tiny bit
of intuition about the application, assuming the table and column names make
any sense in a business context, it is usually easy to guess what columns
cluster well, and what tables co-cluster well, having their joined rows
generally created at about the same time. When you find such co-clustering
opportunities, you almost certainly want those tables joined by nested loops,
following the full join key, unless the query returns a huge number of rows.
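A minimal sketch of that dictionary check, comparing clustering_factors to table block counts (shown against the test tables above; substitute your own table names):

select i.table_name, i.index_name, i.clustering_factor, t.blocks,
       round(i.clustering_factor / nullif(t.blocks, 0), 1) cf_per_block
from dba_indexes i, dba_tables t
where i.table_owner = t.owner
and i.table_name = t.table_name
and i.table_name like 'DTOW_TEST_1000000R%'
order by i.table_name, i.index_name;
-- Ratios near 1 mean the table rows are stored in nearly the index's order
-- (well clustered); ratios approaching the table's rowcount mean the index
-- is effectively unclustered.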
When, on the other hand, you find yourself driving from an unclustered
filter condition, it is fair to ask whether a clustered filter condition, even
if it is slightly less selective, in terms of the fraction of rows reached,
might perform better. The only queries likely to favor unclustered
driving filters are either:
1) Queries having a highly selective unclustered
filter, such as a query for just Orders for a single customer. These filters
are so selective that they likely favor nested loops to other large tables even
without the benefits of co-clustering. Even though these queries may cover data
over a lot of history, lacking any selective, well-clustered time-dependent
component, they return a very selective history, so the rowcounts
are reasonable. Or,
2) Queries lacking selective filters, altogether, and
having no filter at all that is well-clustered. These queries tend to favor
hash joins to full table scans even of big tables. (Hash joins to full table
scans of small tables are quite commonly favorable, although the
nested-loops alternative in these cases is commonly not much worse, since
neither alternative likely requires any physical I/O, nor much CPU time
compared to the rest of the query time.) However, queries of large tables
without selective filters, and without well-clustered filters, tend to return
more rows than are useful in most business contexts, so the focus here is on
avoiding these queries, altogether.
Hash
joins to full table scans of large tables commonly look much better than they
are. Occasionally, the hash joins make good sense when we join large tables
that are not well co-clustered. More often, where large hash joins beat a
nested-loops plan, the query commonly either reads far more rows than are truly
useful in a business context, or the nested-loops plan being compared is the wrong nested-loops plan, likely with the
wrong join order, or missing necessary join-key indexes, or with join-key
indexes disabled by a type conversion or some other function on the join key.