I have interviewed hundreds of developers over the years and it is amazing how little developers know about this key essential skill called SQL. In my humble opinion it is the one skill that every full stack developer must know. Transactional systems are not going away even though there is a lot of talk about big data. This is especially true for mission critical ACID (Atomicity, Consistency, Isolation, Durability) transactions. Let us dive in to these key gems that will seem so obvious once my reasons are revealed. The performance can increase dramatically especially when dealing in the millions of rows. The objective is to squeeze a few more transactions per minute, to get data back to the user ASAP in this era of “I want it now”. Although many of my tricks were first developed on DB2 because I am IBM DB2 certified, many of the points below hold true on many databases. With that said let us start exploring.
- Select only the columns needed by your program. Do not use SELECT *. There are 2 main reasons, you are making the database force more network traffic to your application and you now have to use more CPU to move each column into your program.
- AVOID USING SELECT DISTINCT without a supporting index unless the number of qualifying rows is less than 5000. The reason being you will be sorting all the qualifying rows and this can be expensive as the data set grows.
- Use the AS feature when selecting derived columns. The reason is documentation.
- Use the COALESECE feature to avoid null indicators variables for columns allowing nulls and results in sets that can be null. SUM (COL1) is null if null rows qualify. Here are facts from the one IBM Gold Consultant and my mentor Richard Yevich (RIP). It reduces the code to maintain and improves access paths if used in predicates.
- USE TABLE EXPRESSION feature over VIEWS. The reason is for documentation and performance analysis.
- Use WHERE Clause to select the rows needed. There are 2 main reasons, you are making the database force more network traffic to your application and you now have to use several times more CPU to process the result set.
- Use <> rather than NOT EQUAL. The reason is <> is universally accepted.
- Only use Literal in a SQL statement when it never changes. If it changes then use a host variable to the move the literal to the host variable in the host language prior to the SQL statement. The reason is the program will be easier to maintain when the variable changes.
- Avoid arithmetic expression (including concatenation) in predicates where there are no other stage 1 predicates. Instead, store expression in host variable and use the host variable in the WHERE statement. The Reason is to make the predicate stage 1. Complex predicates are Stage 2. If they only further filter then they are great however if they are the only predicate index/table scans are the only access paths possible.
- Avoid negative logic if possible. The reason is that positive logic is much easier to comprehend.
- If more than 25% of the rows will be returned, induce a tablespace scan by using the +0 or CONCAT ‘ ‘ techniques. The reason is that due to the prefetching, the database can often scan the entrie table faster than reading 25% of the index pages + 25% of the data pages.
- Ensure the use of a join predicate to avoid a Cartesian product of rows returned. Reason: Missing Join predicates cost CPU and it might not not give the expected results.
- Use Joins instead of subselects. The reason is that the database join operation algorithms are always faster.
- Do not code RIGHT OUTER JOINS. Switch the order of the tables in the from clause and code LEFT OUTER JOINS. This will reduce the confusion with EXPLAIN output (only displays Left and Full joins)
- Do not code INNER JOINTS with the ON clauses. Older version of many databases process the where clauses after the join is complete unless table expressions are used. Stay with the “,” and no ON clause so that normal join algorithms are used.
ORDER BY/GROUP BY
- Use the column name in the ORDER BY and GROUP BY clauses. Do not use the column relative number from the SELECT clause. Reason is documentation and ease of reading code.
- Do not ORDER BY a column that has only one value in the return rows. There is no need to use CPU and I/O to sort something that does not need sorting.
- use UNION ALL when duplicates do not exist or when the number of duplicates is small and can be eliminated in the program of if duplicates are needed. Reason is to eliminate the sort.
- Use Select * for correlated subqueries when using EXISTS or NOT EXISTS. The reason is that no rows are returned, only a flag set to true or false
- Avoid using IN phrase. Use JOIN or EXISTS in a correlated subquery. The reason is to avoid creating and sorting a work file.
EXISTANCE CHECKING TECHNIQUES
Do not use SELECT COUNT(*) to check for existance. Reason is it cost CPU and I/O to count all the rows when the actual count is not needed. While COUNT(*) scans the entire table, counting up all entries matching your condition, EXISTS() will exit as soon as it sees the result it needs.
SELECT columns needed
FROM tables needed
( SELECT * FROM table WHERE condition)
correlate the subquery if the checking should be done for each row. Leave it non correlated if the checking only needs to be done once.
- Close cursor when the last row is processed rather than at the end of the program unless it is the last action of the program. If no more processing happens after the cursor then do not close the cursor but let the program close the cursor to release the locks sooner and improve concurrency.
- Retain cursor positioning after Commit processing to reduce program code by eliminating the need for browsing logic.
- If cursors are use to drive updates use FOR UPDATE or WHERE CURRENT OF CURSOR to create row level processing. If updates have to be ordered switch to WHERE KEY =:HVKEY. The reason FOR UPDATE OF and ORDER BY are mutually exclusive.
- Always check the row counter or timestamp in the UPDATE statement to warn users if the data has changed. The checking the counter or time stamp in a SELECT statement allows data integrity problems to slip in.
- Build small tables, such as those used for code definition, in memory or application tables if they are used repetitively. The reason is that is save CPU and I/O by having the database read it only once.
- Ensure restart logic is coded for applications requiring high availability.
- For SQL used in applications requiring high concurrency, the SQL should be retried a few times before sending an error message. 5 times is a good start.
- Do not start coding SQL with out knowing all the indexes available on the table. Reason is that all possible index-able local and join predicates should be coded to ensure index usage.
- Run an EXPLAIN on all SQL statements built prior to the release from development. Granted that production and development can have different plans depending on the data. Analyze results. The reason is that you can steer the optimizer in the right direction using the two tuning techniques of (OPTIMIZE FOR n ROWS and +0, CONCAT ‘ ‘)
Views can be used for these requests. The reasons is that in insulates the user from tables changes over time. It can hide SQL complexity, enforce security or pre-join tables.
Stephen Choo Quan
Stephen is a double threat holding both SAP business objects certified architect as well as being a certified IBM DB2 Database Developer. read More…