Commit to an option, but note the alternatives you didn't pick.
If you are not listing trade-offs, you are a junior developer.
Look for single points of failure and bottlenecks
👉 Single points of failure (SPOF)
Database outage
Systems can still go down despite being horizontally scaled because of networking issues, misconfigurations, or bad code deployments. What happens when your system can't reach your database? These outages can also affect data consistency or cause stale data.
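One way to talk about "what happens when you can't reach the database" is to degrade to the last known value and be explicit that it is stale. The sketch below is only an illustration under assumptions: `DatabaseUnavailable`, `StaleReadCache`, and the `fetch_from_db` callable are hypothetical names, not part of any real client library.

```python
import time


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) database client when the DB cannot be reached."""


class StaleReadCache:
    """Remembers the last value read per key so reads can degrade instead of failing."""

    def __init__(self, fetch_from_db, clock=time.monotonic):
        self._fetch = fetch_from_db   # assumed callable: key -> value
        self._clock = clock           # injectable so the behaviour is testable
        self._store = {}              # key -> (value, time it was stored)

    def get(self, key):
        """Return (value, age_seconds); age is 0.0 when the read came from the DB."""
        try:
            value = self._fetch(key)
        except DatabaseUnavailable:
            if key not in self._store:
                raise  # nothing cached, so the outage is visible to the caller
            value, stored_at = self._store[key]
            return value, self._clock() - stored_at
        self._store[key] = (value, self._clock())
        return value, 0.0
```

Returning the age alongside the value lets the caller decide whether stale data is acceptable for that read, which is the consistency trade-off worth naming out loud.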
Service outage
Service outages are in a similar vein to database outages. Even when it runs on many instances, a microservice is a single logical unit and can still be a single point of failure.
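A common mitigation to mention here is a circuit breaker: after repeated failures, callers stop hammering the broken service and fail fast until a cool-down passes. The following is a minimal sketch, not a production implementation; the class name, thresholds, and `clock` parameter are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()     # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each outbound call as `breaker.call(client.get_user, user_id)` (with `client.get_user` standing in for whatever call you make) keeps one failing service from tying up every caller's threads.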
🍼 Bottlenecks
Message queues get backed up
Workers can't process messages because of a mismatch in formatting or versioning. Alternatively, the worker fleet cannot be scaled up quickly enough. How urgent this is depends on the workload. If the queue feeds processing of newly uploaded videos, a backlog may be critical because each job is long-running and blocks other processes from starting. If the queue batch-processes Likes, it may be less urgent, but it is still good to point out as a problem to mitigate.
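For the format or version mismatch case, one mitigation is to park unreadable messages on a dead-letter list so they don't block everything behind them. This is a rough sketch using an in-memory `deque` as a stand-in for a real queue; `SUPPORTED_VERSIONS`, `drain`, and the `version` field are assumptions for illustration.

```python
import json
from collections import deque

SUPPORTED_VERSIONS = (1, 2)  # assumed schema versions this worker understands


def drain(queue, dead_letters, handle):
    """Process every message; park unreadable ones instead of blocking the queue.

    Returns the number of messages handled successfully.
    """
    processed = 0
    while queue:
        raw = queue.popleft()
        try:
            message = json.loads(raw)  # malformed JSON raises a ValueError subclass
            if not isinstance(message, dict) or message.get("version") not in SUPPORTED_VERSIONS:
                raise ValueError("unsupported or missing schema version")
            handle(message)
        except ValueError as exc:
            # Keep the raw payload and the reason so it can be inspected or replayed.
            dead_letters.append((raw, str(exc)))
            continue
        processed += 1
    return processed
```

A growing dead-letter list is itself a signal worth alerting on, since it usually means a producer was deployed with a format the workers don't know yet.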
High latency
Sometimes everything works, but it's just too slow. Service-level agreements (SLAs) are contracts stating that a system will perform within certain thresholds. Breaches of those thresholds often show up as higher-than-accepted latency, e.g., response times above 200ms or 50ms. No one likes to wait more than 1 second for Google to deliver the top 10 search results.
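Latency thresholds are usually checked against a high percentile rather than the average, because a few slow requests can hide behind a healthy-looking mean. Here is a small sketch of that check using the nearest-rank percentile; the function names and the 200ms default are illustrative assumptions, not a standard API.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breaches_threshold(samples_ms, threshold_ms=200, pct=99):
    """True when the chosen percentile of response times exceeds the threshold."""
    return percentile(samples_ms, pct) > threshold_ms
```

For example, with 97 requests at 20ms and three at 150ms, 900ms, and 950ms, the mean is about 39ms, yet the 99th percentile is 900ms, well past a 200ms threshold.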
Scalability
Despite our best intentions, our service may be able to handle far more load than its downstream dependencies. For example, suppose your FooService can handle 500k requests/second, but its dependencies, UserService and UserDB, can only handle 100k requests/second. Retrying failed requests can then compound UserService's scaling issues, because every retry adds load to a dependency that is already saturated.
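To make the retry problem concrete, the sketch below first bounds how much load naive retries could add, then shows one commonly used mitigation, capped exponential backoff with jitter. The function names, the retry count of 3, and the delay defaults are assumptions for illustration only.

```python
import random


def worst_case_offered_load(base_rps, max_retries):
    """Upper bound on requests/second if every request fails and is retried immediately."""
    return base_rps * (1 + max_retries)


def backoff_delay(attempt, base_s=0.1, cap_s=5.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (1-based).

    The ceiling doubles with each attempt up to cap_s, and the actual delay is a
    random fraction of that ceiling so clients don't all retry at the same moment.
    """
    ceiling = min(cap_s, base_s * 2 ** (attempt - 1))
    return rng() * ceiling


# FooService at 500k requests/second with up to 3 immediate retries per request
# could offer as much as 2,000,000 requests/second to UserService.
print(worst_case_offered_load(500_000, 3))
```

Backoff spreads the retries out over time; pairing it with a retry budget or load shedding is what actually keeps the offered load under UserService's 100k requests/second.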
Master class content:
https://www.youtube.com/watch?v=M4lR_Va97cQ