Database Indexes in Relational Databases
🤔 Motivation
Before we get into what an index is, let's start with an example to understand why this concept is even needed.
Let's assume we have two tables in a relational database: Company and Employees. Here's what they look like:
The company table stores information about the company. For example:
On the other hand, the Employees table contains information about all the employees of these given companies. For example:
Note that Company <-> Employee is a one-to-many relationship, which means that one company can have multiple employees, but an employee can be associated with only one company. Usually, the way to model this in the database would be to have the company ID as a column of the employees table.
Let's assume that the company table stores hundreds of thousands of records and every company on average has 10 employees. This would mean that the Employees table will have millions of records.
With this in mind, say your system wants to query the list of all Employees belonging to company ID 2. A very simple way of doing that is to scan every row of the Employees table, match the company ID column with the value we want (2 in this case) and then return all the rows where we get a successful match. This will work fine, but the downside is that for every such query, we will have to scan the entire Employees table which means that just to get a few rows every single time, we will end up scanning millions of records every single time. This is a huge performance bottleneck for systems with a high query rate and large tables. This is where the concept of Indexes comes in.
Now imagine you had some sort of data structure to help you speed up your query. For simplicity, imagine you had a hash map keyed by company ID, and whose values would be a list of all the employee IDs belonging to that company. If you had this data structure, retrieving all employees of a company would be much faster, more efficient and wouldn't require you to scan all the rows of the table. In practice, this is what an index is.
To further explain this concept with another example, say you have a 500-page book on databases, and you want to look up the concept of joins. The fastest way to do that would be to look up the index of the book and search for the chapter on joins and directly go to that page, rather than manually scanning all the 500 pages of the book.
🗂️ Definition of an Index
So now that we have a brief idea on why an Index is needed and what it does, let's try to make a slightly more formal definition of an index. We can say that an index is a data structure that provides a quick and efficient way to look up records in a database table based on the values in one or more columns. It works much like an index in a book, allowing you to quickly find the desired information without scanning the entire contents.
An index can be on one column or together on a set of columns. When a column is indexed, the database maintains a data structure in a way that reduces the need for full table scans. By doing so, the database can swiftly locate specific rows when a query filters by that indexed column. This efficiency results in quicker query responses and improved overall performance.
💱 Trade-offs
🤔 If indexes significantly speed up read queries, why not index all the columns?
The short answer is that there is no free lunch.
Slower writes: It is true that indexes can significantly improve reads. However, if you account for all DB operations, writes become slower because the index needs to be updated after every write. If there are many insert, update, or delete operations, or if the index itself is already large, adding indexes can introduce a bottleneck in the write operations.
Increased memory: The index data structure itself needs to consume memory. The more indexes you build and the larger your data is, the more memory you would need to store these indexes.
❓ As a follow-up to the previous question, what are some common indexing mistakes people make in practice?
The answer is simple: not getting the right balance on what to index on. Many times, people either land up over-indexing or under-indexing.
Over-indexing: Indexing too many columns, even ones that are not going to be queried. This will have a high impact on write operations as well as potentially waste a lot of storage space.
Under-indexing: Indexing too little, in particular missing out on query patterns that often occur, and can stand to gain a lot of performance from an Index. This creates a poor read performance and missed optimization opportunities.
❓ Okay, so then how does one get the right balance of what to index?
Getting the right balance and figuring out the right indexes is not a one time simple task. As time passes, the database changes not only in size, but in the kind of data it stores and the query patterns it services. If we want to maintain the right balance, the indexes need to evolve with the changing database as well.
Analyze query patterns: Before selecting an index, analyze the types of queries frequently executed in your database. Understand the SELECT, UPDATE, DELETE, and INSERT operations that are prevalent in your application. For queries involving specific filters or sorting, consider indexes on the columns used in the WHERE clause or ORDER BY clause. This can significantly improve the performance of these operations.
Tracking cardinality: Cardinality refers to the uniqueness of values in a column. High cardinality means that the column has mostly unique values, while low cardinality indicates a smaller set of unique values. Columns with high cardinality, such as primary keys or columns with mostly unique values, are good candidates for indexing. Indexing such columns enhances search performance. However, when indexing columns with low cardinality, it may result in inefficient use of index space. For instance, indexing a Boolean column might not be beneficial due to its limited unique values.
Regular monitoring and tuning: Regularly profile and analyze the performance of your database using monitoring tools. Identify slow queries, high-traffic tables, and areas for improvement. On the other hand, also keep track of index usage statistics to understand which indexes are actively contributing to query performance. You can throw out the ones not needed.
Additional Reading
Designing Data Intensive Applications. Chapter 3 Storage and Retrieval - Data Structures That Power Your Database
: Deep dive on how MySql uses and implements indexes