
The Power of Database Indexes

Index
Use indexes to improve the performance of your databases

In one of my past projects, we were handling very large amounts of data. Performance was not critical there, and the application was running absolutely fine - however, one of my developers noted that we could improve performance somewhat with indexes. Since I’m all for fine-tuning our apps, I decided to give it a try. Wow, the results were insane! Lookups were sped up by a factor of 8!

Here I will cover what a database index is, how it can help you, and show some examples on large datasets. The examples use a dummy table of books, in a database running inside a Postgres Docker container.

What is an index?

In short, an index is a tradeoff between performance and required storage space. A database index improves the speed of data retrieval operations on tables, but it will cost some additional storage space. However, with the cost of storage being very low, this is often not an actual problem. It is something to keep in mind though.

Indexing a table means the database creates helpful structures internally (typically a B-tree) to locate the data without having to search every record in the table. Without indexes, lookups are linear in the number of rows (O(N)), but most indexes enable sub-linear lookups, often to the tune of O(log(N)). This means that the biggest differences will be felt on fairly large datasets.
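To see which strategy the database actually picks for a query, you can prefix it with EXPLAIN. As a rough sketch (the exact plan and cost numbers depend on your data and Postgres version), a lookup on a non-indexed column falls back to scanning the whole table, while an indexed one walks the tree:

```sql
-- Without an index on title, every row must be visited:
explain select * from book where title = 'abc';
-- -> Seq Scan on book ...

-- With "create index title_idx on book(title);" in place,
-- the planner can use the B-tree instead:
explain select * from book where title = 'abc';
-- -> Index Scan using title_idx on book ...
```

Note that on very small tables the planner may still prefer a sequential scan, since reading a handful of pages directly is cheaper than traversing an index.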

In addition to the performance benefits, indexes can be used to enforce database constraints, using keywords such as UNIQUE. Keep in mind that simply having PRIMARY KEY on a field will usually create an index on that field implicitly. These constraints can also be put on the table directly, but with an index, you can also enforce them on combinations of fields.
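As a sketch of the constraint side (these indexes are illustrative only and not part of the tutorial setup below), a unique index both enforces uniqueness and speeds up lookups, and it can span a combination of fields:

```sql
-- Enforce that no two books share the same title:
create unique index book_title_uniq on book(title);

-- Or enforce uniqueness over a combination of fields:
-- the same title may repeat, but not within one category.
create unique index book_title_category_uniq on book(title, category);
```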

Indexes are often forgotten by developers. This makes sense - they are not data engineers, they would rather focus on building new features, and usually, databases feel crazy fast anyway. So what difference does that little index really make?

Examples in action

Now, in order to see the power of indexes, we will spin up a Postgres Docker container, create a table called book with a couple of randomly filled fields, and then run query lookups. I will run those lookups with and without indexes, and on tables containing varying numbers of records - 50, 500k and 5M.

Setup

To create the container, run the following command:

docker run --name tutorial-psql -p 5432:5432 -e POSTGRES_DB=tutorial -e POSTGRES_PASSWORD=changeme -v /data:/var/lib/postgresql/data -d postgres:12.4-alpine

Next, we exec into the container using docker exec -it tutorial-psql psql -U postgres. Note that the POSTGRES_DB=tutorial flag in the run command already created a database called “tutorial” for us; if you left it out, create the database manually: create database tutorial;.

To verify that everything has gone according to plan, check that your database is listed when you type \l. If so, switch to it by typing \c tutorial (it should be empty, something you can verify using \d).

Database setup
You should now be connected to your Database "Tutorial"

Great, now we add the book table:

create table book(
  id uuid primary key,
  title varchar(40) not null,
  price integer not null,
  insertion_date date not null,
  category varchar(20) not null
);

Then, finally, we need some data to check the speed! For this, we’ll insert some random values into the table using a small PL/pgSQL script. In order to generate the random UUIDs, you may need to run this command first: CREATE EXTENSION IF NOT EXISTS "uuid-ossp";.

Afterwards, you can create however many records you want using this script (by changing the upper bound of the loop):

do $$
begin
	for counter in 1..50 loop
		insert into book(id, title, price, insertion_date, category)
		values(
			uuid_generate_v4(),
			substr(md5(random()::text), 0, 11),
			floor(random() * 10 + 1)::int,
			'2021-01-01',
			('{Crime,Comedy,Romance,Thriller,Horror,Documentary,Biography,Nature}'::text[])[ceil(random()*8)]
		);
	end loop;
end; 
$$;

This script will insert 50 records into the book table. The primary key is a UUID, the title is 10 random characters, the price is a random integer between 1 and 10, the date is hard-coded (just to have an additional column), and the category is one of 8 values.
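To double-check that the script did what we expect, you can count the rows and peek at a few of the generated values:

```sql
select count(*) from book;   -- should report 50
select * from book limit 3;  -- a quick look at the random data
```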

Checks

Right, now we will perform the following lookups, first without indexes, and then with an index.

  • Count all books of the Biography category -> select count(*) from book where category='Biography';
  • Get the book for a random UUID -> select * from book where id='{id}';
  • Get the book for a specific title -> select * from book where title='{title}';
  • Look up all books whose title starts with 'a' -> select * from book where title like 'a%';
  • Find all books of category Nature with a price of 3 -> select * from book where category='Nature' and price=3;

Performance timing

In order to see how long the lookups take, simply type \timing in psql; every query will then also report how long it took to complete. The results of all queries are summarized in a table below.
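For example, after toggling it on, every statement reports its duration:

```sql
\timing
-- psql answers: Timing is on.
select count(*) from book;
-- ...result rows, followed by a line like: Time: 0.768 ms
```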

50 records

Without index
  • select count(*) from book where category='Biography'; -> 0.768ms
  • select * from book where id='2ad832d9-d66e-4281-984c-14509827d820'; -> 1.121ms
  • select * from book where title='2b1385bcd7'; -> 0.812ms
  • select * from book where title like 'a%'; -> 8.39ms
  • select * from book where category='Nature' and price=3; -> 1.069ms
Queries on 50 records without index
With indexes

Now, let’s create some indexes. It seems like we want to index the titles and the categories. On the largest dataset, we’ll additionally create the index for the category along with the price.

To create both these indexes, let’s run

create index title_idx on book(title);
create index category_idx on book(category);

You may notice that the indexes were created really quickly. This is because there aren’t many records yet, so building the index doesn’t take much additional space or time.
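If you want to double-check which indexes now exist on the table (besides looking at \d book), the pg_indexes catalog view lists them together with the statements that would recreate them:

```sql
select indexname, indexdef
from pg_indexes
where tablename = 'book';
-- should list book_pkey, title_idx and category_idx
```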

Now, let’s run the queries again:

  • select count(*) from book where category='Biography'; -> 2.483ms / 0.633ms
  • select * from book where id='2ad832d9-d66e-4281-984c-14509827d820'; -> 0.735ms
  • select * from book where title='2b1385bcd7'; -> 0.79ms
  • select * from book where title like 'a%'; -> 1.51ms
  • select * from book where category='Nature' and price=3; -> 1.244ms
Queries on 50 records with index

You can already see that the query for books with titles starting with 'a' was more than 5 times faster. However, the first query was much slower, percentage-wise. This can be due to several factors, such as caching or the machine’s load. We’re talking about really fast operations, so these differences don’t necessarily mean much yet.

500,000 records

Now, let’s first delete all records and the indexes, and then crank the table up by inserting half a million entries!

delete from book;
drop index category_idx;
drop index title_idx;

The insertion alone already took 80s, so maybe we’ll get some more representative results here!

Without index

In order to get some values to look up for the id and title queries, let’s check out the first two entries by running select * from book limit 2;. This gives us the title 4b519a9dc8 (first entry) and the id 0b13b51e-176b-4208-8d88-05b1baaa4e09 (second entry).

  • select count(*) from book where category='Biography'; -> 66.695ms
  • select * from book where id='0b13b51e-176b-4208-8d88-05b1baaa4e09'; -> 0.768ms
  • select * from book where title='4b519a9dc8'; -> 52.334ms
  • select * from book where title like 'a%'; -> 89.156ms
  • select * from book where category='Nature' and price=3; -> 58.386ms

Since the result lists are now so long, I will not show the images anymore.

With indexes

Now, let’s recreate the earlier indexes. You may notice that this index creation took noticeably longer than before (1.49s & 0.86s on my machine).

Now, let’s run the queries again:

  • select count(*) from book where category='Biography'; -> 41.005ms
  • select * from book where id='0b13b51e-176b-4208-8d88-05b1baaa4e09'; -> 0.736ms
  • select * from book where title='4b519a9dc8'; -> 0.872ms
  • select * from book where title like 'a%'; -> 90.06ms
  • select * from book where category='Nature' and price=3; -> 48.302ms

Okay, so we can already see some interesting results, but we’ll save the final comparison for after the next test - 5 million entries!

5,000,000 records

Now, rather than deleting the entire table again, I’ll simply drop the indexes one final time, and then add 4.5M more entries. The waiting is getting to me, and I’m already expecting this to take over 10 minutes…

(14 minutes later)

Without index
  • select count(*) from book where category='Biography'; -> 2146ms
  • select * from book where id='0b13b51e-176b-4208-8d88-05b1baaa4e09'; -> 1.451ms
  • select * from book where title='4b519a9dc8'; -> 1880ms
  • select * from book where title like 'a%'; -> 6589ms
  • select * from book where category='Nature' and price=3; -> 2882ms

This is precisely the reason why developers often don’t think much about databases - even with 5M records, which is quite a lot, these operations are still quite fast. However, if they have to be performed often, improvements will be greatly appreciated by your users. Let’s check out the results with the indexes.

With indexes

Now, let’s recreate the earlier indexes one final time. This time around, it really takes much longer to create them, especially the index on the title. That makes sense, of course: there are far more distinct titles than categories. On my machine, the index creations took 1 minute and 36s, respectively.

The longer creation time for indexes in large tables

Now, let’s run the queries again:

  • select count(*) from book where category='Biography'; -> 1142.79ms
  • select * from book where id='0b13b51e-176b-4208-8d88-05b1baaa4e09'; -> 1.513ms
  • select * from book where title='4b519a9dc8'; -> 1.658ms
  • select * from book where title like 'a%'; -> 5999ms
  • select * from book where category='Nature' and price=3; -> 2295ms

Summary

In this table, we see a summary of the response times in milliseconds, with non-indexed and indexed times side by side (non-indexed / indexed).

Entries | Biography       | ID            | Title          | Title with 'a' | Category and price
50      | 0.768 / 0.633   | 1.121 / 0.735 | 0.812 / 0.79   | 8.39 / 1.51    | 1.069 / 1.244
500k    | 66.695 / 41.005 | 0.768 / 0.736 | 52.334 / 0.872 | 89.156 / 90.06 | 58.386 / 48.302
5M      | 2146 / 1142     | 1.451 / 1.513 | 1880 / 1.658   | 6589 / 5999    | 2882 / 2295

Now, we can notice certain points:

  • The difference between non-indexed and indexed becomes much more visible on larger datasets
  • The non-indexed category lookup took about 60% longer than the indexed one on 500k items, and almost twice as long on 5M
  • The like statement was hardly faster with the index. This is partly because the provided value ('a') is not very selective
  • The increase in lookup time by ID was tiny in both cases
  • The difference in performance for the lookup by title is a factor of 1000!
  • The query on category and price barely benefits from the index on category alone

Due to the last point, let’s now additionally create an index over multiple fields, namely the category and the price. We should expect some noticeable difference now. The time to beat is 2295ms.

So, let’s create the next index: create index category_price_idx on book(category, price);. This creation took about 44s on my machine, so let’s hope it’s worth it (technically, if we had created the index right at the beginning, before inserting the data, it would have been cheap and I could be more relaxed about this, but still…)

The lookup for a more complex index
Yes, indeed worth it!

Wow! The new lookup time was 197.557ms -> an improvement of around 91.5% against the index on category alone, and 93% against no index!

As you can see, indexing the correct combinations can be massively important. Imagine: you could serve 10 requests with that latest index in the time it would take to serve a single request without indexes. For the title lookup, that number rises to 1000!
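One caveat before adding multi-column indexes everywhere: in a B-tree index the column order matters, because the entries are sorted by the first column first. A sketch of which lookups our category_price_idx can and cannot serve:

```sql
-- Uses the full index: filter on the leading column, then price.
select * from book where category = 'Nature' and price = 3;

-- Can still use the index via its leading column alone.
select * from book where category = 'Nature';

-- Cannot use it efficiently: price is not the leading column,
-- so this degrades to a scan (or would need its own index).
select * from book where price = 3;
```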

Now, should you just index every single column and every combination, always, by default?

No, of course you should not. Keep in mind that every index increases the storage space required by the DB, and that writes get slower, since each insert or update has to maintain the indexes as well. However, since you usually know during development what kind of lookups you will be making, you should absolutely index the fields that will be queried often!
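If you are ever unsure whether an existing index still earns its keep, Postgres tracks how often each one is used; an index that never serves a query only costs you storage and write overhead:

```sql
select relname as table_name, indexrelname as index_name, idx_scan
from pg_stat_user_indexes
order by idx_scan;
-- idx_scan = 0 means the index has not served a single query
-- since the statistics were last reset
```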

One small note regarding the performance of lookups by ID. If we write \d book, we will see the description of the table.

Table Book description
The indexes improving the speed of the table lookups

As we can see in the description of the table, there is already an index on the ID. Since we never created it ourselves, Postgres took it upon itself to define that index, simply because the column is the primary key. This explains the fast performance of those lookups.

Adding indexes with Liquibase

Now that I’ve shown the performance improvements that indexes bring, you may want to start adding them in your own projects. You could of course add them manually, but since we’ve used Liquibase for all other DB schema updates, we will do the same here.

Of course, since we do not yet have that many fields in our Account class, there may not be that much use. However, I can imagine some filtering feature where we look up all accounts a user has in a certain currency. So let’s add an index to that column.

Adding an index with JPA is quite easy. In the table definition, where we currently only have @Table(name = "account"), we add the index by changing it to @Table(name = "account", indexes = [Index(columnList = "currency")]). Keep in mind that this is the Kotlin way, and Java may require you to define it as @Index.

After having added this, we use JPABuddy to generate the changelog. I’ve changed the generated index name, and then the file looks as follows:

<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
        xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
                      http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-4.3.xsd"
        objectQuotingStrategy="QUOTE_ONLY_RESERVED_WORDS">
    <changeSet id="1639326925580-1" author="cedric (generated)">
        <createIndex indexName="account_currency_idx" tableName="account">
            <column name="currency"/>
        </createIndex>
    </changeSet>

</databaseChangeLog>

If I now start my application in the integrated mode, I’ll see that the migration has indeed taken place. I will do a final check on the DB itself, so I exec into the image, and I run \d account, where I can see that my index has now been added to that table with precisely the name I gave it!

Currency index on account
The currency is now indexed on my accounts!

In summary - always try to think ahead about which lookups you will require. Don’t simply brush these thoughts aside with the idea that DB operations are fast anyway, as the performance improvements can be incredible. Your customers will definitely thank you for it!