Sunday, October 25, 2009

Tuning SQL Queries

About 80 percent of database query performance problems can be solved by adjusting the SQL statement. However, you must understand how the particular DBMS being used processes SQL statements in order to know what to tweak.

The query optimizer is the software component in the RDBMS that analyzes an SQL statement to determine the best way to execute it. Most modern optimizers are cost-based, which means that they estimate the cost of all possible ways to execute a statement and then choose the one with the lowest cost.

Here are some considerations regarding query optimizers:

    * Order of table names: Does the order of the table names in the FROM or JOIN clause have any influence on the order in which tables are accessed when performing the joins? This is more likely to be the case with a rule-based optimizer. Ideally, the DBMS should access the most selective table (the one that will eliminate the greatest number of rows from the result set) first.

    * Order of search predicates: Does the order of the predicates in the WHERE clause have any influence on the order in which the predicates are evaluated? Ideally, the most restrictive predicate (the one that eliminates the greatest number of rows) should be evaluated first.

    * Lack of statistics: If a cost-based optimizer is being used, what does it do when statistics have not been collected for one or more tables? Some optimizers, such as Oracle's, fall back to rule-based optimization, while others assume default values for the required statistics or simply refuse to use any indexes and do full table scans of all the tables. A full table scan is where the DBMS reads every row in the table to find the desired ones, which of course can be a performance disaster for tables with very large numbers of rows.

    * Query rewrites: Are queries rewritten into more efficient forms by the optimizer? For example, many optimizers automatically rewrite subselects into equivalent joins in order to simplify subsequent processing. In some cases, you may find that certain DBMS options must be enabled in order to allow the optimizer to rewrite queries.

    * View definition merges: For queries that use views, at what point does the DBMS merge the view definition (the query that defines the view) into the SQL statement submitted by the database user? This has obvious implications for the optimizer: the sooner it can evaluate the entire SQL statement, the smarter its decisions should be.

    * Other criteria: What other criteria influence the optimizer? For example, some optimizers will favor unique indexes over non-unique ones, and some will favor the use of an index to sequence rows over sorting the result set.
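
Most DBMSs let you inspect the optimizer's decisions directly. As a minimal sketch, here is how you might do it with SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; the table, index, and data are invented for illustration, and Oracle, MySQL, and PostgreSQL each offer an analogous EXPLAIN command:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customers (city)")
conn.executemany("INSERT INTO customers (city) VALUES (?)",
                 [("New York",), ("Boston",), ("Chicago",)])

# Ask the optimizer how it plans to execute the query; the last column
# of each result row is a human-readable plan step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE city = 'Boston'"
).fetchall()
for row in plan:
    # Typically reports a SEARCH using idx_city rather than a full SCAN.
    print(row[-1])
```

Reading the plan output before and after adding an index (or rewriting a predicate) is the quickest way to see which of the optimizer behaviors above your DBMS actually exhibits.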

Here are some considerations regarding query design:

    * Know your data.  When writing the SQL statement, you should have some idea of how many rows are in each table, how selective your WHERE predicates are, and how many rows you expect in the result set. The larger the number of rows involved, the more time you should spend thinking about the best way to write the SQL statement.
    * Minimize returned rows.  The fewer the rows in the result set, the more efficiently the query will run.
    * Avoid scans of large tables.  For tables over 1000 rows or so, scanning all the rows in the table instead of using an index can be expensive in terms of resources required. And, of course, the larger the table, the more expensive a table scan becomes. Full table scans occur in the following situations:
    * The query does not contain a WHERE clause to limit rows.
    * None of the columns referenced in the WHERE clause matches the leading column of an index on the table.
    * Index and table statistics have not been updated. Most RDBMS query optimizers use statistics to evaluate available indexes, and without statistics, a table scan may be seen as more efficient than using an index.
    * At least one column in the WHERE clause does match the first column of an available index, but the comparison used precludes the use of the index. These cases include the following:

          o Use of the NOT operator (for example, WHERE NOT CITY = 'New York'). In general, indexes can be used to find what is in a table, but cannot be used to find what is not in a table.
          o Use of the NOT EQUAL operator (for example, WHERE CITY <> 'New York').
          o Use of a wildcard in the first position of a comparison string (for example, WHERE CITY LIKE '%York%').
          o Use of an SQL function in the comparison (for example, WHERE UPPER(CITY) = 'NEW YORK').
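
The effect of these predicate forms is easy to verify with your DBMS's plan output. Below is a minimal sketch using SQLite via Python's sqlite3 (table and index names are illustrative): the plain equality predicate can use the index, while the leading wildcard and the function applied to the column both force a scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE INDEX idx_city ON customers (city)")

def plan(sql):
    # Return the optimizer's plan summary as one string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Sargable predicate: the index on city can be used (SEARCH ... idx_city).
print(plan("SELECT id FROM customers WHERE city = 'New York'"))

# Leading wildcard defeats the index: full table scan (SCAN).
print(plan("SELECT id FROM customers WHERE city LIKE '%York%'"))

# A function applied to the column also defeats the index (SCAN).
print(plan("SELECT id FROM customers WHERE UPPER(city) = 'NEW YORK'"))
```
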

    * Avoid unnecessary columns.  The wider the data in each row of the result set, the more disk space and memory are required for intermediate operations such as sorts, as well as to hold the result set.
    * Avoid unnecessary tables.  The fewer the tables, the more efficient the query.
    * Avoid sorts of large result sets.  Sorts are expensive, especially when the rows being sorted will not fit in memory. When the result set is expected to be very large, sorts should be avoided. Most optimizers will use an index if it can eliminate the need for a sort, but creating an index solely to avoid a sort of a large number of rows is probably not wise because of the overhead required to maintain the index.
    * Match data types in predicates.  Whether a predicate compares two column values as is done with joins, or a column value and a literal as is done when filtering rows, it is important for the data types to match. When the data types do not match, the DBMS must convert one of them before performing the comparison, and while the work to do this is relatively small, it mounts quickly when it has to be done for each row in a large table.
    * Use IN instead of OR when possible.  The IN operator can be rewritten as a JOIN, but the OR operator often requires multiple queries to be run with the results combined by the DBMS. The former is far more efficient.
    * Use GROUP BY instead of DISTINCT.  In most DBMSs, a GROUP BY is a more efficient way to eliminate duplicate rows compared with the DISTINCT keyword. The reason for this is that a GROUP BY invokes the sort required to find the duplicates earlier in the processing of the query, while a DISTINCT applies the sort as the very last step (applied to the final result set). The sooner the duplicate rows are eliminated, the more efficiently the remainder of the processing on that result set can be performed.
    * Use hints if you must.  Hints are special syntax that can be placed in the SQL statement to direct the optimizer to take certain actions, such as forcing the use of a particular index or a particular method to join tables. While this can be a very attractive option, it should only be used as a last resort because hints are not portable across database vendors, and they sometimes stop working when the DBMS software is upgraded to a newer version. The Oracle, MySQL, and Microsoft SQL Server optimizers all respond to hints, but the syntax accepted by each is different.
    * Temporary tables may help.  Temporary tables can help in some situations, such as assembling large result sets and then adding an index or two to support multiple subsequent queries against the temporary table. However, remember that you're doubling up the reads and writes when you do this because all the data selected from the original (base) tables must be written to the temporary table(s) and then read back from there. In short, there are no free lunches.
    * Views may help.  Views can help because they hide complex operations such as nested aggregate functions. And with DBMSs that don't have an SQL statement cache, views may process more efficiently than ordinary queries because the SQL statement that defines the view has already been parsed and optimized, which means this work does not have to be done every time the view is accessed. But above all, remember that views are also SQL queries, so they are subject to all the tuning considerations you apply to any SQL statement.
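
The temporary-table technique described above can be sketched as follows, here using SQLite via Python's sqlite3 (the schema and data are hypothetical): an expensive aggregate is materialized once, indexed, and then queried repeatedly at low cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                 [("east", 10.0), ("west", 25.0), ("east", 5.0)])

# Materialize an expensive intermediate result once...
conn.execute("""
    CREATE TEMP TABLE region_totals AS
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
""")
# ...then index it to support several follow-up queries cheaply.
conn.execute("CREATE INDEX idx_totals ON region_totals (region)")

east_total = conn.execute(
    "SELECT total FROM region_totals WHERE region = 'east'"
).fetchone()[0]
print(east_total)  # 15.0
```

Remember the caveat from the list above: every row written into the temporary table is read from the base tables first, so this only pays off when the materialized result is queried more than once.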

Postgres monitoring 8.4

I like the monitoring enhancements in PostgreSQL 8.4. The following two are very useful if you are debugging a performance issue.

pg_stat_statements can show you all the queries and their response times.

    * pg_stat_statements (contrib module): Allows real-time monitoring of summary query statistics, letting you see at a glance which of your normalized queries is executing the most often and using the most system time. pg_stat_statements will speed up the process of "bad query" troubleshooting by an order of magnitude.

auto_explain is good if you are debugging application performance. Just enable it for a few hours, collect the SQL performance data, and then turn it off.

    * auto_explain (contrib module): Lets you automatically log EXPLAIN plans to the PostgreSQL log for later analysis. It will help a great deal with troubleshooting slow stored procedures, as well as queries that perform differently between production and testing. It also allows logging of EXPLAIN plans for nested statements, that is, statements executed from within a function.
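
Assuming the contrib modules have been installed (in 8.4 that means running their SQL install scripts, and for pg_stat_statements also adding it to shared_preload_libraries and restarting), typical usage looks something like this:

```sql
-- pg_stat_statements: find the queries consuming the most time.
-- Requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf.
SELECT query, calls, total_time, total_time / calls AS avg_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- auto_explain: log plans for slow statements in the current session.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 250;      -- only statements over 250 ms
SET auto_explain.log_nested_statements = on;  -- include statements run inside functions
```

The same auto_explain settings can be put in postgresql.conf instead if you want plan logging across all sessions for a few hours.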


Sunday, October 4, 2009

MySQL vs Postgres

We have two good, mature open source databases: MySQL and PostgreSQL.

Both systems have much to offer in terms of stability, flexibility, and performance. MySQL has features that PostgreSQL lacks, and vice versa. However, my primary focus is to help you determine which of the two databases to use in your own development.

For example, some of the features I use are foreign key references, triggers, and views. They allow me to hide the complexity of the database from the application, thus avoiding the creation of complicated SQL commands. I know many developers who prefer the rich functionality of PostgreSQL's SQL commands. One of the most notable differences between MySQL and PostgreSQL is the handling of complex subqueries: MySQL only gained subselects in version 4.1, and its support still lags behind PostgreSQL's. PostgreSQL follows many of the ANSI SQL standards, thus allowing the creation of complex SQL commands.
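
As a small illustration of hiding complexity behind a view, here is a sketch using SQLite through Python's sqlite3 (the schema and names are purely illustrative; PostgreSQL and, since 5.0, MySQL support the same idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    -- The view hides the join and aggregation from the application.
    CREATE VIEW customer_totals AS
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name;
""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Alice')")
conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 [(1, 20.0), (1, 30.0)])

# The application issues a simple SELECT instead of the full join.
row = conn.execute("SELECT name, total_spent FROM customer_totals").fetchone()
print(row)  # ('Alice', 50.0)
```
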

                             POSTGRESQL            MYSQL
    ANSI SQL compliance      Closer to ANSI SQL    Follows some of the ANSI standard
    Performance              Slower                Faster
    Sub-selects              Yes                   Yes (since 4.1)
    Transactions             Yes                   Yes (InnoDB table type required)
    Database replication     Yes                   Yes
    Foreign key support      Yes                   Yes (InnoDB only)
    Views                    Yes                   Yes (since 5.0)
    Stored procedures        Yes                   Yes (since 5.0)
    Triggers                 Yes                   Yes (since 5.0)
    Unions                   Yes                   Yes (since 4.0)
    Full (outer) joins       Yes                   No
    Constraints              Yes                   Partial (CHECK is parsed but ignored)
    Windows support          Yes                   Yes
    Vacuum (cleanup)         Required              Not needed
    ODBC                     Yes                   Yes
    JDBC                     Yes                   Yes
    Different table types    No                    Yes

Let me suggest some reasons for using PostgreSQL over MySQL:

* Complex database design
* Moving away from Oracle, Sybase, or MSSQL
* Complex rule sets (i.e., business rules)
* Use of procedural languages on the server
* Transactions
* Use of stored procedures
* Use of geographical data
* R-Trees (i.e., used in indexes)

Let me suggest some reasons for using MySQL over PostgreSQL:

* MySQL is generally faster than PostgreSQL.
* Database design will be simpler.
* You can easily create a basic database-driven Web site.
* MySQL's replication has been thoroughly tested.
* There's no need for cleanups in MySQL (no Vacuum).

I think you can now make your decision more easily. If you have a complex system and need a rich database feature set, PostgreSQL is obviously the good choice; if your system is simple and needs fast response times, MySQL is the first choice.