Very Noisy

I'm Mark Kennedy. I'm part of the furniture at Eurogamer Network Ltd. and I do my best to lead the web development team there.

This website is mainly about food, though there's a few technical things here too.

Projects

Pals

SQL Indexing For Dummies

I wrote this guide to try to encourage some people I work with that carefully chosen database indexes are really important in real world database applications, such as data-driven websites. They believe me now, so I figured somebody else might want to read it, or wave it in front of their non-indexing colleagues.

Contents

Introduction

This is a little article explaining why we need a database table index, how it works, and how we can deduce what indexes are required for a certain table and the queries we know will be run against it.

We’ll be specifically talking about MySQL’s indexing capabilities, however everything extends to other database products (many of which have more extensive indexing features).

Our example database

This is the table we’ll be looking at.

Table: items

field type
item_id int
title varchar
long_text text
item_date datetime
deleted enum( ‘Y’, ‘N’ )
category int

It is populated with a few thousand rows, where ‘long_text’ field contains a large amount of data. Each record is therefore a few kilobytes.

MySQL’s MyISAM storage engine organises the data on disk like this http://forge.mysql.com/wiki/MySQL_Internals_MyISAM. Don’t get too bogged down reading it, just know that the rows are variable length therefore access cannot be random – some records will also be split creating a fragmented data file.

How the database finds records normally

If all we have is the above table spec, then let’s see how SQL goes about matching rows and returning data for a query.

Query:

SELECT * FROM items WHERE category=4;

We will ignore the effects of query caching, since it’s not a feature you can rely upon.

In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat.

In the process of performing this query, the database server will read in the entire data file off of the disk regardless of how many rows satisfy the where clause! If your query matches only 2% of the rows, this is hopelessly inefficient.

Surpise!

This query:

SELECT * FROM items WHERE item_id=2356;

...with no indexing, is just as bad. It will read in the entire table trying find all rows with this ID. You can improve matters slightly by setting a limit:

SELECT * FROM items WHERE item_id=2356 LIMIT 1

In this case it will stop when it’s found the correct row. Still, if your row is at the end of the table, that’s a lot of wasted scanning.

Also, the data is ordered in whatever way the database likes. Initially it will be in whatever order the data was inserted. Then as data is deleted and re-inserted, the data will just get slapped wherever it fits (although the database server usually has algorithms to avoid fragmentation of records as much as possible).

Most programmers might be surprised that the last query was very slow. After all, they often select items by some row id and get the result back very quickly. Why is this type of query normally fast then?

Primary key

Well, in this case, it would be normal to create a primary (which is also a unique) key on the item_id. Most database designers would have done so without even thinking about it, just to enforce uniqueness upon that row.

But the database is not just enforcing uniqueness upon the primary keyed field, it’s built a fast search index on it too!

The index it builds is in tree form. MyISAM indexes are always B-trees (except for FULLTEXT – more later). http://en.wikipedia.org/wiki/B-tree will tell you all about B-trees, but I'll try to present a simplified version that gets across the fundamentals without getting too bogged down in the maths or exact number of bytes used here or there.

The index built for our primary key on item_id would look like this:

Simple b-tree

(Note: this could be incorrect in details related to real MyISAM implementation, but the overall concept is valid)

It’s probably quite easy to see how this search tree works. There are certain properties of B-trees that help make it fast to insert and delete records, but right now it’s only important to understand how the SQL server goes about finding the records it’s after.

The top level node here contains 4 sorted values representing the range of item_ids to be found below this node. The child nodes have the same range values showing the values taken by its own children. At the last level we have nodes containing the final item_id value, and a pointer to the byte in the disk file the final record lies.

So, say I’m looking for item_id 4. The server looks at the top node, finds the range containing 4 (the first range). It then descends to the child and finds in this case it’s the second range it wants. It then follows this to the final set of items and easily plucks the correct item_id from the 3 available leaves. Now it has a disk pointer and can rapidly seek to the correct byte on disk without reading the datafile in. It now has the correct record without any disk scanning!

Of couse it had to scan the index in some way. But if you think carefully about the index, you’ll see that

  1. It’s a much smaller volume of data. Each node is really only a half dozen integers. The total number of bytes is a converging series that is certainly a lot smaller than the huge disk record containing all of the text for each record. Typically, the index for a table can be kept in RAM for super fast access due to its small relative size.
  2. There is no need to look at all of the index file. Each node contains pointers to the child nodes, and each level contains pointers between sibling nodes creating a linked list (helps if you want leaves within a range). So, to find item 4 we did 3 hops. Not a great saving. To find record 20 we also do 3 hops. This is more like it! The number of hops required increases in a sort-of logarithmic manner with respect to database size (i.e. opposite to exponential growth, which is a good thing!).

So now you should understand why the common PRIMARY KEY index applied to database tables makes lookups by this field very fast. Try removing such an index on a very large table and seeing just how slow WHERE item_id=xxx becomes!

Keying other fields

So, what if you want to use a different type of search in your table. What if you wanteded all rows WHERE category=11? When you designed the table, you didn’t specific category_id as a UNIQUE or PRIMARY KEY since it isn’t a unique field, but that doesn’t mean you can’t build an index on it. An index doesn’t HAVE to contain only unique values. Here is the same diagram of a btree, but indexing a the non-unique field of category id:

b-tree non unique

Note that there is no implied order of items with the same category.

You would create this type of index simply with:

ALTER TABLE items ADD KEY idx_category( category );

or

ALTER TABLE items ADD INDEX idx_category( category );

KEY and INDEX are synonyms.

You could also specific the index at table creation time:

CREATE TABLE items (
...<snip>...
category_id INT,
INDEX idx_category( category)
)

Note the naming convention I’ve used for index names. You can call an index whatever you like, as long as it’s a unique index name. Indexes can share names with fields, so you’ll often see INDEX category( category ) but I don’t recommend this since it just causes confusion down the line. Please avoid this!

Proving that indexes really work

Create the table described above:

CREATE TABLE items (
item_id INT PRIMARY KEY AUTO_INCREMENT,
title VARCHAR(255),
long_text TEXT,
item_date DATETIME,
deleted ENUM( 'Y', 'N' ),
category INT
);

Now populate with 100,000 rows of data (run :make_db.sh.txt to make some SQL inserts you can use, which in turn calls :random_string.sh.txt which should be in current dir – this script takes a while to run!).

The data contains randomly generated titles and long_text, with a random category from 0 to 4. I wonder how many items have category = 3?

mysql> select count(*) from items where category=3;
+----------+
| count(*) |
+----------+
|    25012 |
+----------+
1 row in set (0.25 sec)

That’s 1/4 second. Pretty slow! Let’s add an index on category_id:

mysql> ALTER TABLE items ADD INDEX idx_category( category );
Query OK, 100002 rows affected (3.80 sec)
Records: 100002  Duplicates: 0  Warnings: 0

Weeeeeeee. Ok, to test again:

mysql> select count(*) from items where category=3;
+----------+
| count(*) |
+----------+
|    25012 |
+----------+
1 row in set (0.09 sec)

That’s better! On a decent mysql server (I’m testing this on noddy dev box) you would expect the index to be held entirely in memory, thus access should be <10ms. It doesn’t help that the table is rather small, so scanning it isn’t *that* expensive compared to index scanning (which still has a lot of data in it, regardless of how efficiently items are linked) and in both cases we’re dealing with roughly 1/4 of the whole table, which is rather a lot.

Perhaps a better example would perhaps be to locate a single row (no other rows have cat 4 and 3 is the current max):

mysql> update items set category=4 where item_id=90000;

(Note that this update happened very quickly. If there was no primary key index on item_id, it would have to scan the table to find the row to update, taking as long as the following test. Updates are just selects to find the rows, followed by some writing to the table once row pointers are obtained. If you have a table with a lot of updates, indexes are very valuable.)

I’ll drop that index:

mysql> ALTER TABLE items DROP index idx_category;
Query OK, 100002 rows affected (2.79 sec)
Records: 100002  Duplicates: 0  Warnings: 0

And test again:

mysql> select * from items where category=4;
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|   90000 | rt5f  | N4OzO3brz0 | 2007-07-17 12:45:24 | N       |        4 |
+---------+-------+------------+---------------------+---------+----------+
1 row in set (0.25 sec)

Ok, just as slow as counting all the rows! Probably because the row I updated was right at the end.

Add index again:

ALTER TABLE items ADD INDEX idx_category( category );

Test again:

mysql> select * from items where category=4;
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|   90000 | rt5f  | N4OzO3brz0 | 2007-07-17 12:45:24 | N       |        4 |
+---------+-------+------------+---------------------+---------+----------+
1 row in set (0.00 sec)

Wow! 0.00 secs :)

Ordering

Ok, so we’ve seen that indexes help MySQL locate rows quickly. We also know that indexes are stored with the values in a sorted order, so you’d hope that MySQL would be able to use this info.

Without index:

mysql> SELECT * FROM items order by category limit 10;
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|       4 | R GB  | dom3DuEUCL | 2007-07-17 11:36:46 | N       |        0 |
|       7 | K bL  | PwiN004A5v | 2007-07-17 11:36:46 | N       |        0 |
|      11 | QI q  | aRyu pEEv0 | 2007-07-17 11:36:46 | N       |        0 |
|      15 | i9lx  | 4VeQ u 6S  | 2007-07-17 11:36:46 | N       |        0 |
|      18 | vZ h  | fHX RxPb9h | 2007-07-17 11:36:46 | N       |        0 |
|      24 | Kdss  | Pi6nKisTtU | 2007-07-17 11:36:47 | N       |        0 |
|      28 | QFoT  | kE4ABlZg o | 2007-07-17 11:36:47 | N       |        0 |
|      32 | it B  |  414 RmuyF | 2007-07-17 11:36:47 | N       |        0 |
|      35 | vM    | fCPNac1sgQ | 2007-07-17 11:36:47 | N       |        0 |
|      39 | ykVw  | 7yUuk meJ5 | 2007-07-17 11:36:47 | N       |        0 |
+---------+-------+------------+---------------------+---------+----------+
10 rows in set (0.44 sec)

Add that category index:

mysql> SELECT * FROM items order by category limit 10;
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|       4 | R GB  | dom3DuEUCL | 2007-07-17 11:36:46 | N       |        0 |
|       7 | K bL  | PwiN004A5v | 2007-07-17 11:36:46 | N       |        0 |
|      11 | QI q  | aRyu pEEv0 | 2007-07-17 11:36:46 | N       |        0 |
|      15 | i9lx  | 4VeQ u 6S  | 2007-07-17 11:36:46 | N       |        0 |
|      18 | vZ h  | fHX RxPb9h | 2007-07-17 11:36:46 | N       |        0 |
|      24 | Kdss  | Pi6nKisTtU | 2007-07-17 11:36:47 | N       |        0 |
|      28 | QFoT  | kE4ABlZg o | 2007-07-17 11:36:47 | N       |        0 |
|      32 | it B  |  414 RmuyF | 2007-07-17 11:36:47 | N       |        0 |
|      35 | vM    | fCPNac1sgQ | 2007-07-17 11:36:47 | N       |        0 |
|      39 | ykVw  | 7yUuk meJ5 | 2007-07-17 11:36:47 | N       |        0 |
+---------+-------+------------+---------------------+---------+----------+
10 rows in set (0.01 sec)

See the difference?

Why did finding ten rows in the non-indexed table take 0.5 secs?

Despite only wanting 10 rows, the database had no way of knowing what the first 10 rows would be. It had to take all the rows, sort them in memory (if there was enough – otherwise it would have to copy all the rows to a temporary disk file and do a sort operation on this, slowing it down much more) then take the first 10. Adding the index means it already has a presorted list of rows.

Why not keep the rows presorted in the main table?

Say you regularly have to run both of these queries:

SELECT * FROM items order by category limit 10;
SELECT * FROM items order by title limit 10;

Which sort would you use to order your table? The answer is neither, plus it’s not in the slightest bit efficient to actually insert new rows into the middle of the table – you are much better off plonking them on the end unless an earlier delete has freed up space in the middle of the table file. Remeber that the database table is just a great big file on disk and moving a large a chunk of data a few bytes back in a single file is a slow operation.

Showing that MySQL is actually using your index (EXPLAIN)

Ok, so you have some empirical evidence that indexes speed up SELECTs. But how can you know that MySQL is using the index you created? EXPLAIN helps you to do this by showing how MySQL is approaching your query.

Official ''EXPLAIN'' docs.

Let’s look at what EXPLAIN has to say when running our first query on the table without the index (DROP it first):

mysql> EXPLAIN SELECt * FROM items WHERE category=4;
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | items | ALL  | NULL          | NULL |    NULL | NULL | 100002 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+

Hmm!

EXPLAIN output can be extremely useful and extremely confusing. It’s best to learn to read the most important bits, and then try to decipher the remaining fields and/or values if the obvious stuff doesn’t tell you why your query isn’t as fast as you think it should be.

Reading EXPLAIN

''id''

This will increment for each additional step in a JOIN‘d query. This query has no joins, so there is one stage.

''select_type''

This will tell you something about where the joined data is being sourced from. SIMPLE means a standard query directly from a permanent table in the DB. In a joined query, it would say PRIMARY for the outermost join, or could say ‘SUBQUERY’ if that’s what you’re using. See the docs for a decent explaination.

''table''

Where the data is sourced from. Self-explantory really. Doesn’t always contain a table name – if subqueries are used this could say DERIVED.

''type''

This tells you the sort of condition you’re joining on. Again see the docs. Best is const, which is primary key = CONSTANT VALUE type joins (e.g. item_id=90000). It can also show const in some other situations, but nearly always cos the value being joined is available at the beginning of the query and doesn’t need to be read or recalculated again. For most indexed joins of the type field1=field2, it will say ref or eg_ref. range is when your indexed join is of the form field1 < val and so on. ‘ALL’ is the worst possible value here: it means the database had to scan the entire table, i.e. couldn’t find an index to use.

''possible_keys''

This is a list of indexes on the data source used in this stage of the join.

''key''

This is the index the query optimiser actually used in the query. Often, you will have a list of possible keys and none will be chosen since an appropriate one wasn’t found!

''ref''

This tells you what value the indexed field was being compared against. It could be const or the name of a field on another stage of the query.

''rows''

How many rows have to be considered by this part of the join. You may also hear this referred to as ‘cardinality’ by people wanting to show off. The smaller the number in this column, the less work the database has to do. High numbers mean large lists of data need considering. It does not relate to the number of rows returned by the DB to the client, just the number it is having to ‘consider’. Large cardinality and no index usage is generally a recipe for slow queries.

''Extra''

Anything else? For a column that contains just ‘extra’ info, there is often very important stuff here. It will often say Using where, meaning that you have requested a subset of the rows found by a join, using a WHERE clause to achieve this. If it says Using temporary, MySQL is having to copy the data from query result into a temporary table, often on disk, in order to perform the next part of the query. Subselects often trigger this behaviour. Using filesort means you’ve asked for the data to be returned in a particular order and it is achieving this by collating all the returned records and performing a sort algorithm (Quicksort, bubblesort, something like that). Temporary tables and filesorts are often the causes of slow queries.

Decoding the non-indexed EXPLAIN example

So taking another look at our non-indexed explain, you should be able to see straight away that it’s considering 100001 rows (it always seems to add 1 to rows and I don’t know why), there are no possible keys and it’s not using any (obviously), hence type ALL. This is quite bad!

An indexed EXPLAIN

If we add our category index once more, the explain becomes:

mysql> EXPLAIN SELECT * FROM items WHERE category=4;
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key          | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------------+
|  1 | SIMPLE      | items | ref  | idx_category  | idx_category |       5 | const |    2 | Using where |
+----+-------------+-------+------+---------------+--------------+---------+-------+------+-------------+

Now we can that MySQL has realised that it can use the idx_category index, with a ref join against constant 2 to limit the number of considered rows to just the 1 (hence it prints 2!).

Run some of the other queries mentioned in this doc and then take a look at their EXPLAIN output. Try to see if you can match this with how you think the database is performing your query.

Non integer types

So what about the other fields? So far we’ve only indexed integers (ENUMs and DATETIMEs are really just INTs once converted) – what about text fields? Well it’s no different to how you’d file documents in a library, or the index you’d find in the back of a book. You index by the first letter, then the second, then the 3rd, and so on. In the btree diagram above, just change the top level to ABCD, the first item in the second level to AA AB AC, and so on until you get to the correct leaf node.

However, you can only index VARCHARs. TEXTs can only be indexed using fulltext search indexes (not discussed here). If you’re doing WHERE <text type field>=blah your code then you should consider whether you actually need a text field for the data and could instead use a VARCHAR, or whether you really want to be doing exact matches on possible enormous CLOBs (which is what a TEXT is really) when a fulltext engine might suit better.

Since the indexing of VARCHAR starts at the beginning of the word, LIKE “abc%” can also be optimised. You go as far in the index tree as you need to find all items starting ‘abc’ and then return everything. However, LIKE “%abc%” isn’t optimisable since any number of characters can be present at the start of the word – you don’t know what depth to start at in the index tree!

Very inaccurate, but nevertheless useful visualisation of text btree:

Text b-tree

In use...

mysql> select * from items where title like 'dta%';
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|    5802 | dTAr  | 8m pFcp42J | 2007-07-17 11:41:10 | N       |        1 |
|   18899 | dTAv  | tz RAJ nO  | 2007-07-17 11:51:09 | N       |        2 |
|   51670 | dTAv  | tz RAJ nO  | 2007-07-17 12:16:10 | N       |        2 |
+---------+-------+------------+---------------------+---------+----------+
3 rows in set (0.35 sec)

mysql> explain select * from items where title like 'dta%';
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | items | ALL  | NULL          | NULL |    NULL | NULL | 100002 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)

Let’s add an index:

mysql> ALTER TABLE items ADD INDEX idx_title( title );
Query OK, 100002 rows affected (4.70 sec)
Records: 100002  Duplicates: 0  Warnings: 0

And try again:

mysql> select * from items where title like 'dta%';
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|    5802 | dTAr  | 8m pFcp42J | 2007-07-17 11:41:10 | N       |        1 |
|   18899 | dTAv  | tz RAJ nO  | 2007-07-17 11:51:09 | N       |        2 |
|   51670 | dTAv  | tz RAJ nO  | 2007-07-17 12:16:10 | N       |        2 |
+---------+-------+------------+---------------------+---------+----------+
3 rows in set (0.00 sec)

mysql> explain select * from items where title like 'dta%';
+----+-------------+-------+-------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+-----------+---------+------+------+-------------+
|  1 | SIMPLE      | items | range | idx_title     | idx_title |     256 | NULL |    3 | Using where |
+----+-------------+-------+-------+---------------+-----------+---------+------+------+-------------+
1 row in set (0.00 sec)

It’s clearly a huge improvement.

Multi column indexes

There are many cases where you will only require an index on one field. The common case being the primary key. However, on a large, production, database-driven website, most queries will be selecting rows based on multiple criteria and then ordering the output.

Example (after checking that I’ve deleted all the indexes):

mysql> select * from items where category=3 order by title desc limit 2;
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|   77668 | ZzwX  | TiV1 2SL   | 2007-07-17 12:36:02 | N       |        3 |
|   98326 | ZzWe  | SusoT bag7 | 2007-07-17 12:51:43 | N       |        3 |
+---------+-------+------------+---------------------+---------+----------+
2 rows in set (0.36 sec)

Again, remember that data transfer overheads aside, LIMIT 2 doesn’t significantly speed up the query. LIMIT 1000 would take only marginally longer. Either way all the rows matched by the WHERE have to be put somewhere, then sorted.

The EXPLAIN:

mysql> explain select * from items where category=3 order by title desc limit 2;
+----+-------------+-------+------+---------------+------+---------+------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra                       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-----------------------------+
|  1 | SIMPLE      | items | ALL  | NULL          | NULL |    NULL | NULL | 100002 | Using where; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+--------+-----------------------------+

The reiterate, the database is:

  1. Finding the rows that match WHERE category=3
  2. Sorting the results by title

Would adding an index to category help? No. In fact on a poorly specced database machine (such as the dev box here) it will actually make matters worse:

mysql> select * from items where category=3 order by title desc limit 2;                                                                                                                                             
+---------+-------+------------+---------------------+---------+----------+                                                                                                                                                         | item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|   77668 | ZzwX  | TiV1 2SL   | 2007-07-17 12:36:02 | N       |        3 |
|   98326 | ZzWe  | SusoT bag7 | 2007-07-17 12:51:43 | N       |        3 |
+---------+-------+------------+---------------------+---------+----------+
2 rows in set (0.78 sec)

mysql> explain select * from items where category=3 order by title desc limit 2;
+----+-------------+-------+------+---------------+--------------+---------+-------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key          | key_len | ref   | rows  | Extra                       |
+----+-------------+-------+------+---------------+--------------+---------+-------+-------+-----------------------------+
|  1 | SIMPLE      | items | ref  | idx_category  | idx_category |       5 | const | 29097 | Using where; Using filesort |
+----+-------------+-------+------+---------------+--------------+---------+-------+-------+-----------------------------+

It’s worse because it had to load the index file from disk to find the matching rows, on top of scanning the main table to get the 29097 rows for doing the sort. In this case, reading from storage was the main issue, not efficiency of locating rows. On a larger, more capable database server, you would expect better relative performance.

Anyway, how can we avoid the filesort on 29000 rows? What about adding an index to title?

mysql> alter table items add index idx_title( title );
Query OK, 100002 rows affected (5.87 sec)
Records: 100002  Duplicates: 0  Warnings: 0

mysql> select * from items where category=3 order by title desc limit 2;
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|   77668 | ZzwX  | TiV1 2SL   | 2007-07-17 12:36:02 | N       |        3 |
|   98326 | ZzWe  | SusoT bag7 | 2007-07-17 12:51:43 | N       |        3 |
+---------+-------+------------+---------------------+---------+----------+
2 rows in set (0.80 sec)

Didn’t help at all! In fact it’s 0.02 secs slower from having to read a large index file (it’s now got 2 indexes in it). Remember, if you’re looking for only a few rows, the elements of the index are well linked enabling the database to quickly locate the correct entry, but if you’re selecting tens of thousands of rows for a sort, then your efficiency is severely impacted by having to essential scan a huge chunk of the index to copy the rows numbers pointed to by the matched leaf nodes.

mysql> explain select * from items where category=3 order by title desc limit 2;
+----+-------------+-------+------+---------------+--------------+---------+-------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key          | key_len | ref   | rows  | Extra                       |
+----+-------------+-------+------+---------------+--------------+---------+-------+-------+-----------------------------+
|  1 | SIMPLE      | items | ref  | idx_category  | idx_category |       5 | const | 29097 | Using where; Using filesort |
+----+-------------+-------+------+---------------+--------------+---------+-------+-------+-----------------------------+
1 row in set (0.00 sec)

Interesting. It’s not even using the new index! Why? Well:

  1. MySQL will only use 1 index per table (unless you’re using a table twice via an alias, which isn’t discussed here).
  2. Once MySQL has found the 29000 rows matched by the WHERE how could it possible utilise the title index? The title index was built for a dataset that included the ENTIRE table, not a subset of 29000 rows. It could possible go through the complete title index, cross referencing it with the rows it had in temporary storage, copying them in the order it hit them in the index. This would be like having 4 words, and instead of sorting them using a normal algorithm, looking at every word in a dictionary in order, copying from your list of 4 to a new list every time you found one in the dictionary. A filesort would in most cases probably be faster.

So what to do? Well it’s easy really. Imagine the category btree above, with the title stuck on the bottom:

Multi-column b-tree

It should be clear that if I ask for category=0, order by title, it can just read down as far as the beginning of the title index, then read left to right. I could also ask for category=3, title like ‘da%’ order by title, and it would still be able to read down as far as cat=0, title like ‘da%’ and then just read the remaining, lower branches from left to right (effectively skipping the more detailed parts of the index).

Implement this with:

mysql> alter table items add index idx_category_title( category, title );
Query OK, 100002 rows affected (8.17 sec)
Records: 100002  Duplicates: 0  Warnings: 0

It takes longer to build that other indexes. But all the work it does in this one time operation pays off later:

mysql> select * from items where category=3 order by title desc limit 2;
+---------+-------+------------+---------------------+---------+----------+
| item_id | title | long_text  | item_date           | deleted | category |
+---------+-------+------------+---------------------+---------+----------+
|   77668 | ZzwX  | TiV1 2SL   | 2007-07-17 12:36:02 | N       |        3 |
|   98326 | ZzWe  | SusoT bag7 | 2007-07-17 12:51:43 | N       |        3 |
+---------+-------+------------+---------------------+---------+----------+
2 rows in set (0.00 sec)

Ace, huh?

EXPLAIN it:

mysql> explain select * from items where category=3 order by title desc limit 2;
+----+-------------+-------+------+---------------------------------+--------------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys                   | key                | key_len | ref   | rows  | Extra       |
+----+-------------+-------+------+---------------------------------+--------------------+---------+-------+-------+-------------+
|  1 | SIMPLE      | items | ref  | idx_category,idx_category_title | idx_category_title |       5 | const | 29097 | Using where |
+----+-------------+-------+------+---------------------------------+--------------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)

The cardinality is still high because your WHERE really did match this number of rows, index or no index. However, since the data was presorted (it used the idx_category_title index) all it had to do was read off the first two and then just stop. It knew there were 29000 rows just by asking the index, but it didn’t have to go anywhere near 28998 of them!

More to come!