Relational vs. Non-Relational Databases

#persistent-storage #system-design

So now that we have seen relational and non-relational databases, let us list out some of the things we need to look out for while selecting the right type of DB for our system.

🙋 Question: Is it true that “most” simple use cases can be implemented using relational databases?

In general, it is mostly true that most simple use cases will work with relational DBs. However, the key thing to note here is that relational DBs might not be the best fit for your system and in some cases might be extremely suboptimal and lead to a degraded and poor performance.

If we wanted to understand this briefly, imagine we have a system that allows users to simply view and update their own names. This system needs to be able to support billions of users and a very high write/update traffic. The system also needs to scale horizontally to support more and more traffic.

Now, if we use a relational database, we are trading off high write performance in order to get transactional support. However, this system does not need transactional support (no multi object writes per operation) and therefore we would be giving away write performance for something we don't really need at all.

Secondly, this simplistic system does not need data to be stored in structured columns (since it's simply just the name) and read queries don't require complex joins (since you only read by user ID). Therefore, using relational databases don't give us any more advantage with read queries and unnecessarily might slow us down with unneeded strict schemas and system downtime when schema changes need to be propagated to the database.

🤔 Selecting Between the Two

So the question is, how do we select the right storage system for our system? To get a more thorough understanding of this, we need to answer a set of questions about the system in a way that tells us if we can benefit from the pros of each option and are okay to trade off the cons of that particular option. Here are some of them.

❓ What are the entities that we will be storing and retrieving from the database? Are these entities strongly connected to each other? In other words, do they have relationships with each other?

In the above example, the only entity our system had was the user. Users have no relation to each other, and each user simply updates or views their own information.

Contrast this with the system that we discussed previously, the advertiser and ads data system, where we see a strong relationship among entities.

⁠

Screenshot 2024-04-29 at 2.24.48 PM.png

⁠

Therefore, if entities in a database have relationships with each other, a relational database would be a better fit for the system. Here, the advertiser system would benefit from a relational database whereas the user meta data would not really benefit from it.

❓ Is the structure of the data your system stores continuously evolving? Do you foresee more fields/attributes being added and removed all the time?

We can imagine that the advertiser and ads data system is fairly structured. We might add attributes to the entities from time to time but it is not going to happen frequently. On the other hand, it's possible that the user metadata system keeps evolving in the data it stores. It may store names today but in the near future it's totally possible that we add more data like location, email, phone, birthdays, etc. Such systems that change continuously might not want to be bogged down by strict predefined schemas, and slow schema changes that cause downtime. They benefit better with non-relational databases.

❓ Even if there are relationships between entities, do we need strong transactional support while editing and reading these entities? Are operations done on several connected entities together?

Now let's assume we have a social network where users can be friends with other users.

⁠

Screenshot 2024-04-29 at 2.44.06 PM.png

⁠

Here, there are entities that have relationships with each other. It may be tempting to straight away choose a relational DB. However, we will probably never see a case where multiple of these entities and relationships are created as part of the same operation. Users usually create their own entities, and then friendship relations are created individually and separately at a later point in time.

To be more clear, we probably will not have cases where we want to create multiple users as well as friendship connections in one go. Furthermore, we won't have atomicity constraints where the entities and relationships all need to be created at once and if one relationship fails to be created, all the other entities need to be rolled back/deleted. Such systems benefit from non-relational databases because they don't need ACID support and can also benefit from much faster writes and much better scaling, which is what we would want from a social network.

On the other hand, the advertiser system may require that a campaign and a set of ads under it are all created in one operation. Failure to create the campaign entity should result in no ads getting created. Also, advertiser entities don't scale as fast as user entities in general, and so we are not bogged down by slightly slow writes. On the other hand, we do need the ACID support and therefore a relational database in general would do well here.

❓ What kind of read queries is the system going to support? Do we have “point” queries that don't require complex joins, or do we have queries requiring joins all the time?

Point queries are queries that have simple constraints and don't require us to join data from multiple tables to get the results. For example, in the user metadata systems, all that the user can do is read their data given their ID and update their own data with the help of their ID. There is no real need for one user to query other users' information. Non-relational databases would work for such systems because they support blazing fast point queries.

On the other hand, the advertiser system does need to answer questions like “give us all the campaigns and ads of this advertiser”. Automatically, this would mean that we would have to join data from the campaigns and ads tables. In general, non-relational databases perform very poorly with joins. Remember, there is no structured set of columns and no concept of foreign keys, so the database wouldn’t be able to perform joins optimally. On the other hand, relational databases do really well with joins since the database knows the structure of the table well and can benefit from the indexes it has created for primary and foreign keys (We will talk about indexes in future chapters in more detail). Such cases benefit greatly from relational databases.

⏰ Planning for the Future

One pattern that we may see in the above Q&A is that we not only look at the current state of the system, but also plan a bit for the near future. We need to know briefly how the system is going to evolve over time, the query patterns that might change over time, the data that might change over time. Just looking at the current state of the data might give us a one-sided view, which might not be optimal.

For example, the user metadata system simply stores the names of users today. Knowing that, we may decide to go with a relational database. However, if we know that it's possible that more and more data is going to be added to this metadata store and that the fields are going to be added/removed very frequently, we might want to rethink the decision. Relational DBs will soon slow us down with slow schema changes and frequent application downtime, and therefore we might want to change our decision to using a non-relational database.

🌎 Consider the Holistic Impact

For the ease of understanding, we looked at each question individually and in isolation, but in practice, we need to look at all questions together holistically. For example, it's possible that answering one question might lead us to choose relational databases but in answering the next one, non-relational databases come out as the optimal choice. In such cases, we must:

Enlist all the pros and cons of each option with reference to the questions.

See if there are any non-negotiable requirements that come along the way. For example, if transactional support is absolutely needed, and we cannot give that away, even if the fields of the data keep changing very dynamically we might have to select relational databases and accept the disadvantage of frequent schema changes.

Finally, go with the option that maximizes how the system will benefit with the pros of that option and minimize the disadvantage we will face with the cons of the option. In all honesty, this is a world of trade-offs, and therefore we will very rarely have a system-database pair where we benefit from all the pros of the database and have no disadvantage from the cons of the database.