
Sunday, October 25, 2009

Tuning SQL Queries

About 80 percent of database query performance problems can be solved by adjusting the SQL statement. However, you must understand how the particular DBMS being used processes SQL statements in order to know what to tweak.

The query optimizer is the software component in the RDBMS that analyzes an SQL statement to determine the best way to execute it. Most modern optimizers are cost-based, which means that they estimate the cost of all possible ways to execute a statement and then choose the one with the lowest cost.

Here are some considerations regarding query optimizers:

    * Order of table names: Does the order of the table names in the FROM or JOIN clause have any influence on the order in which tables are accessed when performing the joins? This is more likely to be the case with a rule-based optimizer. Ideally, the DBMS should access the most selective table (the one that will eliminate the greatest number of rows from the result set) first.

    * Order of search predicates: Does the order of the predicates in the WHERE clause have any influence on the order in which the predicates are evaluated? Ideally, the most restrictive predicate (the one that eliminates the greatest number of rows) should be evaluated first.

    * Lack of statistics: If a cost-based optimizer is being used, what does it do when statistics have not been collected for one or more tables? Some optimizers, such as Oracle's, fall back to rule-based operation, while others assume default values for the required statistics or simply refuse to use any indexes and do full table scans of all the tables. In a full table scan, the DBMS reads every row in the table to find the desired ones, which of course can be a performance disaster for tables with very large numbers of rows.

    * Query rewrites: Are queries rewritten into more efficient forms by the optimizer? For example, many optimizers automatically rewrite subselects (subqueries) into equivalent joins in order to simplify subsequent processing. In some cases, you may find that certain DBMS options must be enabled in order to allow the optimizer to rewrite queries.
    * View definition merges: For queries that use views, at what point does the DBMS merge the view definition (the query that defines the view) into the SQL statement submitted by the database user? This has obvious implications for the optimizer: the sooner it can evaluate the entire SQL statement, the smarter its decisions should be.
    * Other criteria: What other criteria influence the optimizer? For example, some optimizers will favor unique indexes over nonunique ones, and some will favor the use of an index to sequence rows over sorting the result set.
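
A practical way to find out what your particular optimizer decided is to ask it for its plan. Here is a minimal sketch using SQLite (via Python's sqlite3 module) as a stand-in DBMS; the table and index names are made up for illustration, and other products expose the same idea through EXPLAIN (MySQL, PostgreSQL) or EXPLAIN PLAN (Oracle):

```python
import sqlite3

# In-memory database; SQLite stands in for whatever DBMS you use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customers (city)")

# EXPLAIN QUERY PLAN reports the access path the optimizer chose
# for each table in the statement.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE city = 'New York'"
).fetchall()
for row in plan:
    print(row[-1])  # here, a SEARCH using idx_city rather than a full SCAN
```

Checking the plan before and after a change is the quickest way to confirm that a tweak actually altered the optimizer's decision.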

Here are some considerations regarding query design:

    * Know your data.  When writing the SQL statement, you should have some idea of how many rows are in each table, how selective your WHERE predicates are, and how many rows you expect in the result set. The larger the number of rows involved, the more time you should spend thinking about the best way to write the SQL statement.
    * Minimize returned rows.  The fewer the rows in the result set, the more efficiently the query will run.
    * Avoid scans of large tables.  For tables over 1000 rows or so, scanning all the rows in the table instead of using an index can be expensive in terms of the resources required. And, of course, the larger the table, the more expensive a table scan becomes. Full table scans occur in the following situations:

          o The query does not contain a WHERE clause to limit rows.
          o None of the columns referenced in the WHERE clause matches the leading column of an index on the table.
          o Index and table statistics have not been updated. Most RDBMS query optimizers use statistics to evaluate available indexes, and without statistics, a table scan may be seen as more efficient than using an index.
          o At least one column in the WHERE clause does match the first column of an available index, but the comparison used precludes the use of the index. These cases include the following:

                + Use of the NOT operator (for example, WHERE NOT CITY = 'New York'). In general, indexes can be used to find what is in a table, but cannot be used to find what is not in a table.
                + Use of the NOT EQUAL operator (for example, WHERE CITY <> 'New York').
                + Use of a wildcard in the first position of a comparison string (for example, WHERE CITY LIKE '%York%').
                + Use of an SQL function in the comparison (for example, WHERE UPPER(CITY) = 'NEW YORK').

    * Avoid unnecessary columns.  The wider the data in each row of the result set, the more disk space and memory are required for intermediate operations such as sorts, and to hold the result set itself.
    * Avoid unnecessary tables.  The fewer the tables, the more efficient the query.
    * Avoid sorts of large result sets.  Sorts are expensive, especially when the rows being sorted will not fit in memory. When the result set is expected to be very large, sorts should be avoided. Most optimizers will use an index if it can eliminate the need for a sort, but creating an index solely to avoid a sort of a large number of rows is probably not wise because of the overhead required to maintain the index.
    * Match data types in predicates.  Whether a predicate compares two column values as is done with joins, or a column value and a literal as is done when filtering rows, it is important for the data types to match. When the data types do not match, the DBMS must convert one of them before performing the comparison, and while the work to do this is relatively small, it mounts quickly when it has to be done for each row in a large table.
    * Use IN instead of OR when possible.  The IN operator can be rewritten as a JOIN, but the OR operator often requires multiple queries to be run with the results combined by the DBMS. The former is far more efficient.
    * Use GROUP BY instead of DISTINCT.  In most DBMSs, a GROUP BY is a more efficient way to eliminate duplicate rows compared with the DISTINCT keyword. The reason for this is that a GROUP BY invokes the sort required to find the duplicates earlier in the processing of the query, while a DISTINCT applies the sort as the very last step (applied to the final result set). The sooner the duplicate rows are eliminated, the more efficiently the remainder of the processing on that result set can be performed.
    * Use hints if you must.  Hints are special syntax that can be placed in the SQL statement to direct the optimizer to take certain actions, such as forcing the use of a particular index or a particular method to join tables. While this can be a very attractive option, it should only be used as a last resort because hints are not portable across database vendors, and they sometimes stop working when the DBMS software is upgraded to a newer version. The Oracle, MySQL, and Microsoft SQL Server optimizers all respond to hints, but the syntax accepted by each is different.
    * Temporary tables may help.  Temporary tables can help in some situations, such as assembling large result sets and then adding an index or two to support multiple subsequent queries against the temporary table. However, remember that you're doubling up the reads and writes when you do this because all the data selected from the original (base) tables must be written to the temporary table(s) and then read back from there. In short, there are no free lunches.
    * Views may help.  Views can help because they hide complex operations such as nested aggregate functions. And with DBMSs that don't have an SQL statement cache, views may process more efficiently than ordinary queries because the SQL statement that defines the view has already been parsed and optimized, which means this work does not have to be done every time the view is accessed. But above all, remember that views are also SQL queries, so they are subject to all the tuning considerations you apply to any SQL statement.
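
The "comparisons that defeat an index" cases above can be seen directly in a query plan. A minimal sketch, again using SQLite as a stand-in DBMS with made-up names; the plan wording varies by version, but SEARCH (index lookup) versus SCAN (full table scan) is stable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customers (city)")

def access_path(sql):
    # First line of the plan: how SQLite will visit the customers table.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# A plain equality on the indexed column is satisfied by a SEARCH ...
print(access_path("SELECT id FROM customers WHERE city = 'New York'"))
# ... but a leading wildcard or a function on the column forces a SCAN.
print(access_path("SELECT id FROM customers WHERE city LIKE '%York%'"))
print(access_path("SELECT id FROM customers WHERE UPPER(city) = 'NEW YORK'"))
```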
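
The point about indexes eliminating sorts is also visible in the plan: SQLite flags an explicit sort step as "USE TEMP B-TREE FOR ORDER BY". A sketch with illustrative names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")

def needs_sort(sql):
    # True if any plan step is an explicit ORDER BY sort.
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return any("ORDER BY" in row[-1] for row in plan)

print(needs_sort("SELECT city FROM orders ORDER BY city"))  # sort required
conn.execute("CREATE INDEX idx_ord_city ON orders (city)")
print(needs_sort("SELECT city FROM orders ORDER BY city"))  # index supplies the order
```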
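
Mismatched data types in predicates can do worse than slow the query down. In SQLite, a numeric literal compared against a TEXT column has text affinity applied to it, so the comparison becomes textual and can silently miss rows; in other DBMSs the mismatch typically forces a per-row conversion that defeats indexes. A sketch with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (zip TEXT)")
conn.execute("INSERT INTO addresses (zip) VALUES ('07030')")

# Matching types: the text literal finds the row.
matched = conn.execute(
    "SELECT COUNT(*) FROM addresses WHERE zip = '07030'").fetchone()[0]
# Mismatched types: the integer is coerced to the text '7030',
# which does not equal '07030', so no row is found.
mismatched = conn.execute(
    "SELECT COUNT(*) FROM addresses WHERE zip = 7030").fetchone()[0]
print(matched, mismatched)
```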
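
The IN-versus-OR advice rests on the two predicate forms being logically equivalent, which is what makes the rewrite safe. A sketch with illustrative names and data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO customers (city) VALUES (?)",
                 [("New York",), ("Boston",), ("Chicago",), ("Miami",)])

# The OR form ...
or_rows = conn.execute(
    "SELECT id FROM customers WHERE city = 'Boston' OR city = 'Chicago' "
    "ORDER BY id").fetchall()
# ... and the equivalent, usually friendlier IN form.
in_rows = conn.execute(
    "SELECT id FROM customers WHERE city IN ('Boston', 'Chicago') "
    "ORDER BY id").fetchall()
print(or_rows == in_rows)  # True: the same result set either way
```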
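
The GROUP BY-versus-DISTINCT swap works because both forms eliminate duplicates; whether GROUP BY is actually faster is DBMS-dependent, and many modern optimizers plan the two forms identically. A sketch with illustrative names and data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO orders (city) VALUES (?)",
                 [("Boston",), ("Boston",), ("Chicago",)])

distinct_rows = conn.execute(
    "SELECT DISTINCT city FROM orders ORDER BY city").fetchall()
grouped_rows = conn.execute(
    "SELECT city FROM orders GROUP BY city ORDER BY city").fetchall()
print(distinct_rows == grouped_rows)  # True: both eliminate the duplicate row
```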
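
As a taste of what hints do, SQLite's INDEXED BY and NOT INDEXED clauses play a similar role (the Oracle, MySQL, and SQL Server hint syntaxes all differ, which is exactly the portability problem noted above). A sketch with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customers (city)")

def access_path(sql):
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# Require the named index, then forbid any index, and compare the plans.
print(access_path(
    "SELECT id FROM customers INDEXED BY idx_city WHERE city = 'Boston'"))
print(access_path(
    "SELECT id FROM customers NOT INDEXED WHERE city = 'Boston'"))
```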
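
The temporary-table tactic looks like this in miniature: one pass over the base table writes the temp table (the extra I/O warned about above), and an index on it then supports multiple follow-up queries. Names and data are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (city, amount) VALUES (?, ?)",
                 [("New York", 10.0), ("Boston", 20.0), ("New York", 30.0)])

# Materialize the expensive aggregate once ...
conn.execute("""CREATE TEMPORARY TABLE city_totals AS
                SELECT city, SUM(amount) AS total FROM orders GROUP BY city""")
# ... then index it for the queries that follow.
conn.execute("CREATE INDEX temp.idx_total ON city_totals (total)")
big = conn.execute("SELECT city FROM city_totals WHERE total > 25").fetchall()
print(big)  # [('New York',)]
```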
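
Finally, a sketch of a view hiding a nested-aggregate computation, with illustrative names and data. The view is still a query underneath, so the usual tuning rules apply to it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (city, amount) VALUES (?, ?)",
                 [("New York", 10.0), ("Boston", 20.0), ("New York", 30.0)])

# The view encapsulates the per-city aggregation ...
conn.execute("""CREATE VIEW city_totals AS
                SELECT city, SUM(amount) AS total FROM orders GROUP BY city""")
# ... so the awkward "max of a sum" becomes an ordinary-looking query.
top = conn.execute("SELECT MAX(total) FROM city_totals").fetchone()[0]
print(top)  # 40.0
```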
