Friday, November 1, 2019

SQL Performance Explained

This time I read SQL Performance Explained by Markus Winand. The main theme of the book is database indexing that is probably the most important thing developers should know on SQL database performance.

Note that there is "free web-edition" of the book Use the index, Luke - the web edition has sections matching the book sections (linked to below).

Anyway, I warmly recommend to buy the ebook if you're interested on the subject - It's has a pretty affordable price and has a good bunch of knowledge in pretty small package.

1 - Anatomy of an index

  • Database index is redundant structure referring to the actual information that is stored usually in a different place.
  • Two main data structures:
    • Balanced search tree (B-tree)
    • Doubly linked list
  • Balanced search tree contains leaf nodes.
    • Each leaf node is stored in a database block (smallest storage unit).
    • Each leaf node has index entries sorted.
    • Index entries contain the indexed columns and a reference to the "physical" table row.
  • There is also a doubly linked list connecting the leaf nodes.
  • Index lookup has three steps
    • Tree traversal
    • Following the leaf node chain
    • Fetching the table data from the "physical" table

2 - The Where Clause

  • Execution plan shows how the database actually executes an SQL statement. See DB-specific instructions
    • Created by query optimizer / query planner - usually "cost-based optimizer"
    • Optimization is usually based on statistics
      • Table size (both in rows and blocks)
      • Column level statistics - number of distinct values, range (smallest/largest values), NULL occurrences etc.
      • Index statistics - mainly tree depth
  • Author calls B-Tree traversal first power of indexing.
  • Concatenated indexes
    • Index across multiple columns
    • Note that the order of columns matters - the first column is always usable
  • Full table scan
    • Going through the whole table
    • Note that in some cases that can be the most efficient operation
      • DB might be able to read larger chunks at a time
      • No need for "random access"
    • Note: It is important how many rows are got from the index for filtering
  • Function-based indexing
    • Summa summarum has some details to be aware of!
    • By default aim to index the original data - usually that is the most useful info
  • Parameterized queries
    • Queries with bind parameters (also referred to as dynamic parameters or bind variables)
    • DBs with execution plan cache can reuse execution plan when using bind parameters
    • When using bind parameters, the optimizer has no concrete values to determine their frequency -> will always select the same execution plan
      • There are not too many cases where that affects the execution plan
  • Searching for ranges
    • Can utilize indexes
    • As a basic rule, keep the scanned range as small as possible.
    • As a rule of thumb, index first for equality and then for ranges.
  • LIKE expressions
    • can only use the characters before the first wildcard.
      • -> Avoid LIKE expressions with leading wildcard
    • Be aware that with bind parameters the DB does not know about the leading wildcard
      • Work-around append hack: Instead of (foo LIKE ?) use (foo || '' LIKE ?). (Use only as a last resort)
  • Index merge
    • In most cases one index with multiple columns is better than multiple single-column indexes
    • Exception: Queries with multiple range conditions (E.g. FOO < ? AND BAR < ?)
    • Methods for combining indexes
      • Index join
      • Bitmap indexes (DWH, mainly not suitable for OLTP)
  • Partial indexes
    • Index only part of the rows
      • E.g. CREATE INDEXES foo_bar ON foo (bar) WHERE baz = 123
    • Very common in queuing systems where most of processing is for unprocessed entries.
  • Oracle NULL has some surprises, e.g.:
    • Oracle treats an empty string quite much as NULL
    • As a gotcha: Oracle does not include rows in an index if all indexed columns are NULL.
    • Can be used to emulate partial indexes.
  • Obfuscating conditions - where clause patterns that prevent index usage
    • Watch out indexing with DATE types - Summa summarum: Write queries for continuous periods as explicit range conditions.
    • Numeric strings - Summa summarum: Use numeric types for numbers.
    • Combining columns - Summa summarum: When combining multiple columns in a range condition, use a redundant condition.
    • ...

3 - Performance and Scalability

  • Environmental parameters that affect performance
    • Data volume
    • System load
    • Hardware
  • Scalability - Impacts of data volume
    • Indexing
    • Pay attention to predicate information of the query plans
    • Pay attention to filter predicates
  • Scalability - Impacts of system load
    • Bigger hardware is not always faster but it can usually handle more load.
      • Same with scaling horizontally (more servers)

4 - The Join Operation

Three approaches:

  • Nested loops - Kind of server-side N+1 selects
  • Hash join - Query candidate records from another table into a hash table
    • This is the usually used one.
  • Sort merge - Kind of a zipper
    • Works well if the inputs are already sorted

5 - Clustering Data

  • "Clustering" here: Storing data that is accessed together close to each other to make data access require fewer IO operations.
  • The author refers to clustering as the second power of indexing.
  • Index predicates
    • Having a filtered value in the index can reduce table access.
    • Table access is not so big deal if accessed rows are stored in same table block. (All rows will be read together in one read operation)
    • Index clustering factor - Correlation between order of row entries in the index and order of rows in the table.
    • As a tip, don't introduce a new index for filter predicates but extend an existing index.
  • Index-only scan
    • Performance difference depends on the number of accessed rows and index clustering factor.
  • As a special case, some databases support index-organized tables
    • That is, a B-tree without a heap table
    • Note that secondary indexes come very inefficient to use.

6 - Sorting and Grouping

  • Indexed ORDER BY execution
    • Saves sorting effort
    • Allows pipelined execution - returning first results without processing the all data.
    • -> The author calls pipelined order by third power of indexing
  • Remember that databases can read indexes in both directions.
  • GROUP BY - two different approaches
    • Hash approach - Use a temporary hash table
    • Sort/group approach - Sort the data first by the grouping key
      • Can use index to avoid sorting -> enables pipelined GROUP BY

7 - Partial Results

  • E.g. querying for top N rows.
  • Pipelined ORDER BY is very powerful to optimize these.
    • -> DB can optimize for partial result only if it knows it when planning the query
    • -> Inform the DB when you don't need all rows.
  • Paging
    • Deterministic sort order is needed
    • Two approaches
      • Offset method - Number of rows from the beginning
        • Vulnerable to drifting
        • Gets slower when going further.
      • Seek method - Search for last entry of previous page

8 - Insert, Delete and Update

  • Avoid redundant indexes whenever possible
  • Note that the first index makes the greatest difference
  • With INSERT & DELETE, each index adds time spent.
  • With UPDATE - indexes on columns that are not updated don't generally add so much time.

Bonus

The author of the book has also a nice website modern-sql.com on interesting newer SQL features.

No comments: