Resiliency Assessment
📚 Fault tolerance
Resiliency in distributed systems manifests in various ways: availability, security, scalability, testability, maintainability, extensibility, and more. Modern distributed-system practices passively increase fault tolerance to some degree, so employing a practical design is half the battle and inherently prevents many of these concerns.
Availability & Scalability & Durability
Horizontally scalable systems with load balancing and redundancy across multiple locations reduce the impact of isolated failures.
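As a rough illustration, here's a minimal sketch (Python, with made-up backend names and a hand-rolled LoadBalancer class, not any particular product) of how a load balancer with health checks keeps traffic flowing when one replica or zone fails:

```python
import random

class LoadBalancer:
    """Distributes requests across redundant backends and skips unhealthy ones."""

    def __init__(self, backends):
        self.backends = list(backends)     # e.g. nodes spread across zones
        self.healthy = set(self.backends)  # updated by periodic health checks

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        self.healthy.add(backend)

    def pick(self):
        # Only route to backends that passed their last health check.
        candidates = [b for b in self.backends if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends available")
        return random.choice(candidates)

lb = LoadBalancer(["app-us-east-1", "app-us-east-2", "app-us-west-1"])
lb.mark_down("app-us-east-1")  # an isolated failure in one zone
print(lb.pick())               # traffic keeps flowing to the remaining nodes
```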
Maintainability & Extensibility & Testability
Stateless-by-default systems keep computing nodes interchangeable, and a modular design keeps components independent of each other and easy to swap. Databases and long-standing connections are still stateful, however. Ideally, modular design, stateless architecture, and simple APIs keep each system relatively isolated and straightforward, so it won't require a convoluted test suite (see the sketch below). A maintainable system combines these traits to keep existing and future systems intuitive, practical, and straightforward.
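As a hedged sketch (Python, with a hypothetical like_video handler and a fake in-memory store standing in for an external datastore), a stateless handler that takes its dependencies as arguments is trivial to test, and any replica can serve the request:

```python
class InMemoryStore:
    """Stand-in for an external datastore (e.g. Redis); any node can talk to it."""
    def __init__(self):
        self.counts = {}

    def increment(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

def like_video(store, user_id, video_id):
    # No state lives on the node itself, so every replica is interchangeable,
    # and the handler can be tested with a trivial fake store.
    likes = store.increment(f"likes:{video_id}")
    return {"video_id": video_id, "user_id": user_id, "likes": likes}

def test_like_video():
    store = InMemoryStore()
    assert like_video(store, "u1", "v42")["likes"] == 1
    assert like_video(store, "u2", "v42")["likes"] == 2

test_like_video()
```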
Security & Scalability
Load balancers and API gateways can handle rate limiting, denial-of-service protection, and authentication centrally, so every service doesn't have to duplicate that effort. Preventing a system from being overwhelmed or abused takes far less effort than repairing it after the damage is done. An ounce of prevention is worth a pound of cure.
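For instance, gateway-level rate limiting is often just a token bucket per client. A minimal sketch (Python, with illustrative rate and burst numbers, not a real gateway's API):

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter, the kind a gateway might apply per client."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens proportional to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with HTTP 429 Too Many Requests

limiter = TokenBucket(rate_per_sec=100, burst=20)
if not limiter.allow():
    print("429 Too Many Requests")
```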
🏗️ In real-world systems
Generally speaking, the more cost a business can afford, the more of these distributed-system traits it can use to increase fault tolerance. A start-up might run its system on 2 application servers and 2 database servers in 1 data center, while an established tech company may have every system span 3 data centers, each with more than 3 application servers and more than 3 database servers, at a minimum. Netflix, for example, runs roughly 100,000 application servers on AWS across its 1,000 microservices in multiple data centers. Similarly, companies with a cloud arm, such as Microsoft with Azure, Google with GCP, Amazon with AWS, and Facebook with its own internal infrastructure, can affordably de-risk their systems by distributing them all over the globe!
🧠 Impact analysis
The primary areas of risk to focus on are single points of failure and bottlenecks, though others may exist that are unique to your system. These failures are bound to happen eventually due to the law of large numbers. The significance of their impact, and how you handle them, depends on their purpose within your system. A good rule of thumb for gauging impact and the appropriate response is to ask: "How does this affect the user or the business?"
👉 Single points of failure (SPOF)
Database outage: Systems can still go down despite being horizontally scaled because of networking issues, misconfigurations, or bad code deployments. What happens when your system can't reach your database? These outages can also affect data consistency or cause stale data.
Service outage: Service outages are in a similar vein to database outages. As a single logical unit, a microservice can still be a single point of failure.
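One common mitigation when a database or service is unreachable is a circuit breaker that fails fast and falls over to a (possibly stale) cache instead of hammering the dead dependency. A rough sketch, assuming hypothetical query_user_db and cached_profile helpers:

```python
import time

def query_user_db(user_id):
    raise ConnectionError("database unreachable")  # simulate the outage

def cached_profile(user_id):
    return {"user_id": user_id, "name": "cached copy", "stale": True}

class CircuitBreaker:
    """Trips after repeated failures so callers fail fast or use a fallback
    instead of piling up against a dead dependency."""

    def __init__(self, failure_threshold=5, reset_after_sec=30):
        self.failure_threshold = failure_threshold
        self.reset_after_sec = reset_after_sec
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_sec:
                return fallback()      # circuit open: skip the database entirely
            self.opened_at = None      # half-open: try the database again
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

breaker = CircuitBreaker()
profile = breaker.call(
    primary=lambda: query_user_db("u1"),
    fallback=lambda: cached_profile("u1"),
)
print(profile)  # stale but available beats unavailable
```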
🍼 Bottlenecks
Message queues get backed up: Workers can't process messages because of a mismatch in formatting or versioning, or the worker fleet can't be scaled up quickly enough. If the queue processes newly uploaded videos, the backlog may be critical because the work is long-running and blocks other processes from starting. If it batch-processes Likes, it may be less urgent, but still worth flagging as a problem to mitigate.
High latency: Sometimes everything works, but it's just too slow. Service-level agreements (SLAs) are contracts that a system will perform within certain thresholds, and breaches often show up as higher-than-accepted latency, e.g., >200ms or >50ms response times. No one likes to wait >1 second for Google to deliver the top 10 search results.
Scalability: Despite our best intentions, our service may be able to handle far more scale than a downstream dependency. For example, suppose your FooService can handle 500k requests/second, but your dependency, UserService & UserDB, can only handle 100k requests/second. You could compound UserService's scaling issues by retrying requests! (See the backoff sketch below.)
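If retries are needed at all, bounding them and spreading them out in time helps avoid amplifying a downstream bottleneck. A rough sketch (Python, with a hypothetical fetch_user call and illustrative attempt/delay numbers):

```python
import random
import time

def call_with_backoff(request, max_attempts=3, base_delay=0.05):
    """Retry a downstream call a bounded number of times with exponential backoff
    and jitter, so retries don't pile onto an already-overloaded dependency."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up; surface the error upstream
            # Exponential backoff with full jitter spreads retries out in time.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))

attempts = {"n": 0}

def fetch_user():
    # Hypothetical downstream call that fails twice before succeeding.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("UserService overloaded")
    return {"user_id": "u1"}

print(call_with_backoff(fetch_user))
```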