Is it possible to store two different
outlines for the same SQL under the DEFAULT category, where that SQL is
interpreted differently, depending on which SQL it is interpreted from, and to
have each schema use the appropriate outline? For example, in one schema,
"Orders," which is referenced in a query, might be a simple, local
table, while in another schema, "Orders" is a UNION-
If you look at stored outlines in Oracle, it appears that Oracle has some undocumented hints that it uses, at least internally. What undocumented hints are available on each RDBMS (Oracle, SQL Server, et cetera), what do they do, and can they be used in user-created SQL, or just internally? (This is an open-ended problem, clearly, so I am not so much looking for a complete answer all at once as any information out there that represents progress toward an answer.)
On most systems, vast system
resources are simply wasted, in terms of disk space, memory, and CPU. This
likely represents an opportunity, since these resources could be used in
conservative ways that would never significantly affect ordinary load, while
performing valuable background services. For example, during times of low CPU
and I/O consumption, index blocks, in B*-tree indexes, could be continually,
incrementally rebuilt, getting rid of enormous amounts of unused space
(especially in indexes with common deletes and updates), resulting in
more-compact, better-cached indexes without DBA intervention. As another
example (which Oracle 10g already does, to a limited degree), post-execution
optimization could be performed on high-resource-consumption SQL,
automatically, in the background, to find better execution plans for that SQL
the next time it executes.
If a human being were to
execute SQL manually, against a series of tables and indexes that the human
could see, he or she would likely choose a good-looking path to the data and
start to follow that path. If, however, the human found that the results of
following that path violated the assumptions he or she made when choosing that
path, and appeared likely to take much longer than originally expected, he or
she would probably revisit the question of which execution plan was best and
potentially would start over with a better plan before wasting too much time on
the bad plan. Does any database do this? Propose a good algorithm for a
database that would.
Here’s an extension of last quarter’s research problem, which was to consider cases when it might be worthwhile to halt and retune a query in the middle of its execution. Even if it never changed plans “midstream,” the optimizer could still gather information relevant to whether it should choose a new plan the next time that SQL executes. What sort of data should the database gather during execution pertaining to finding a better plan for a later execution. When should it bother? What sort of data should the database gather offline? When should it bother? How and when should it use that data to find a better plan for a later execution? (I have a number of ideas regarding possible answers, but I’d first like to open up the discussion.) Does any database besides Oracle10g (which has some capability along these lines) do something like this? Propose a good algorithm for a database that would. How well can we apply the answers to these questions to finding more effective methods to choose which SQL to tune manually and to improving the actual process of manual tuning?
Does Oracle in its very latest incarnations “look across the link” to see statistics from the other end of the link when optimizing the execution plan of a query that combines data on more than one database? Does any non-Oracle RDBMS do this?
Will any version of Oracle (or any other RDBMS) parse and optimize a raw (PL/SQL-function-free) query remotely, then pass the raw rows to the local database to perform any function calls and any GROUP BY operation that might depend on those function calls?
What data would optimizers need to calculate the tendency for two specific tables queried on a specific filter to co-cluster, so that the optimizer could correctly estimate self-caching in the joined-to table? What would be the formula to apply that data to a correct estimate of the performance benefits of co-clustering??
Most databases allow definition of intricate constraints on tables. Where a database design has denormalizations, these ought to be formally declared, ideally, with constraints that preferably enforce that the denormalizations are followed rigorously. If this is done, is there any database that actually uses these defined constraints to refine cost and cardinality estimates by the cost based optimizer? For example, consider an example: We have a mutli-org implementation of an order entry system, where each transaction table includes an org_id column. Users belonging to a given organization can only see data having the appropriate org_id, using single-org views on top of each table that automatically add a condition “and tab.org_id = <myorg>” to every query for every transaction table. Thus, a query:
SELECT …
FROM Orders O, Order_Details OD, …
WHERE O.Order_Date > <last
Sunday>
uses views Orders and Order_Details built on top of tables All_Orders and All_Order_Details making the original query equivalent to
SELECT …
FROM All_Orders O, All_Order_Details OD, …
WHERE O.Order_Date > <last
Sunday>
Let’s say that there are a hundred organizations. This query ought to return, on average, 1% of the orders (and their details) made since last Sunday, based on the two statistically independent filters on the Orders table. Assuming that the query drives from the Orders table, it will immediately pick up the condition filtering Order rows for only the desired organization. These orders, naturally, will join to Order_Details belonging to the same organization, so the view-imposed condition on OD.Org_ID is wholly redundant, and discards no further rows. If a constraint declares OD.Org_ID as a denormalization, inheriting it’s value from the parent order reached through the join on Order_ID, then the optimizer ought to know that all Order_Details reached though that join will already satisfy the filter on OD.Org_ID, so that filter will have no effect on the estimated cardinality. On the other hand, if the cost-based optimizer fails to take account of declared denormalizations such as this, it will fall back on its usual assumptions, and consider the condition on OD.Org_ID to be statistically independent of the other conditions, and it will then calculate that the joins shown, with the filter conditions, will return just 0.01% of the order details of orders placed since last Sunday, a hundred-fold error in the resulting cost calculation, with corresponding effects on the chances that the optimizer will find the true lowest-cost alternative.
Potentially, there would be valuable use for two sorts of constraint declarations, here: Constraints that are actively enforced by the database, and declared denormalizations that are (hopefully) enforced by the application, but not by the database, but that the optimizer is still instructed to accept on faith for purposes of calculating and comparing cardinalities and costs.
Occasionally, cost-based optimizers conclude that the most-efficient path to the data in a SQL query is to read two un-joined rowsets, independently, and to find the Cartesian product, that is, to find all possible combinations of the rows in the two sets, where the product, in general, has m*n rows, where m is the number of rows in the first set, and n is the number of rows in the second set. (After the Cartesian product is performed, there may be more joins, usually conventional joins done as nested loops or hash joins, and it is often the case that after these additional joins are performed, there will eventually be a link-up between the two sets originally joined by Cartesian product, where the later-joined tables ultimately join to each other and to both Cartesian-joined sets. The result of these later joins that join the two Cartesian sets to each other can discard many, or even most, of the rows joined in the intermediate results prior to completing the execution plan.
In general, I have found that these Cartesian joins look attractive to the optimizer when the optimizer estimates that one or both of the rowsets being joined has only a single row (or even less), on average. When at least one of the Cartesian joined sets does have just one row (or none, even better!), this works out nicely, in general. The problem is that when the estimate is wrong, even by a little bit, disaster can result. Consider the following hypothetical case:
SELECT …
FROM Order_Details OD, Orders O, Order_Sources OS
WHERE OD.Ship_Date>:yesterday
AND OD.Order_ID=O.Order_ID
AND O.Order_Source_ID=OS.Order_Source_ID
AND OS.Source_Country=’JAPAN’
AND OS.Source_Type=’RESELLER’
Let’s say that there are 10-million order details, with 1-thousandth of those shipped since yesterday (and the optimizer knows this), 1-million orders, and 500 order sources, with histograms reflecting the facts that 20 order sources have Source_Country=’JAPAN’ and 20 order sources have Source_Type=’RESELLER’. The optimizer makes its usual assumption of statistical independence between these two conditions, calculating that most likely there is at most one row meeting both conditions. Under this assumption, the best possible path to the data is to go first to the Order_Sources table, discarding all but one row (or discarding all rows, and halting execution right there!), then to read the 10000 recently-shipped order details with an index on Ship_Date, forming a Cartesian product between those details and the one Order_Source, then reach the orders through its primary key, with nested loops from the 10000 rows in the Cartesian product. If all goes well, that really is the best plan! What happens if just one assumption is a bit off, though? What if it happens that all the Japanese order sources are resellers? (Perhaps the business model in Japan is rather different than for the rest of the company’s operations!) Well, in this case, the Cartesian product will find 200-thousand combinations of Order_Detail and Order_Source, and the optimizer will perform 200-thousand nested loops to Orders. (These will reach any given order at least 20 times, possibly with pretty good caching, but surely with very wasteful logical I/O!) The plan is now a terrible idea, nowhere near as good as a simple plan following nested loops from Order_Details, to Orders, to Order_Sources, and worse still compared to a plan that pre-cached the rows from Order_Sources, then did a hash join of those rows to the result of the nested-loops join between the other two tables. Even if the correlation between these two conditions was only very weak, and there were just two Order_Sources meeting both conditions, the Cartesian product would be a dramatically bad idea.
So, simply, the research problem is how to avoid all the cases of harmful, non-robust Cartesian products, without losing their occasional benefits?
©2008 Dan Tow, All rights reserved