How do you design interactions in a distributed system to prevent failures?

Focusing on the Reliability pillar of the AWS Well-Architected Framework, this blog post explores best practices to prevent failures and improve mean time between failures (MTBF). Workloads commonly rely on distributed systems that, in turn, rely on communication networks to interconnect components such as servers or services. Your workload must operate reliably despite data loss or latency in these networks. To achieve this, components of the distributed system must operate in a way that does not negatively impact other components or the workload itself. Let’s explore the four best practices listed by AWS to aid in this process.


Identifying which kind of distributed system is required

There are two kinds of distributed systems to choose from when designing your architecture. Hard real-time distributed systems require responses to be given synchronously and rapidly, and so have more stringent reliability requirements. Soft real-time systems have a more generous time window of minutes or more for response, and handle responses through batch or asynchronous processing. Whichever kind you choose, it is key to identify the challenges that distributed systems involve, such as latency, scaling, understanding network APIs, marshalling and unmarshalling data, and the complexity of algorithms.


Implementing loosely coupled dependencies

A great way to increase resiliency and agility is to implement loosely coupled dependencies such as queuing systems, streaming systems, workflows, and load balancers. A component is loosely coupled when changes to it do not force changes on the components that rely on it. Loose coupling isolates the behavior of a component from the others that depend on it. To achieve this, you can use Amazon EventBridge, which allows you to build event-driven, loosely coupled, distributed architectures. Interactions should be made asynchronous whenever possible. This applies to any interaction that does not need an immediate response, where an acknowledgment that the request has been registered will suffice.
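
The asynchronous pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-process queue stands in for a managed service such as Amazon SQS or Amazon EventBridge, and the names submit_order, worker, and order_queue are hypothetical. The producer returns an acknowledgment as soon as the request is registered, while the consumer processes requests at its own pace.

```python
import queue
import threading

# Hypothetical in-process stand-in for a managed queue; a real workload
# would more likely use a service such as Amazon SQS or Amazon EventBridge.
order_queue = queue.Queue()
processed = []

def submit_order(order_id):
    """Producer: register the request and return an acknowledgment at once."""
    order_queue.put(order_id)
    return {"order_id": order_id, "status": "accepted"}

def worker():
    """Consumer: process orders at its own pace, independent of the producer."""
    while True:
        order_id = order_queue.get()
        if order_id is None:  # sentinel value tells the worker to stop
            order_queue.task_done()
            break
        processed.append(order_id)
        order_queue.task_done()

consumer = threading.Thread(target=worker)
consumer.start()

# The caller only waits for the acknowledgment, not for the processing.
acks = [submit_order(i) for i in range(3)]

order_queue.put(None)  # ask the worker to shut down
order_queue.join()     # wait until every queued item has been handled
consumer.join()
```

Because the producer only knows about the queue, the consumer can be slowed down, scaled out, or replaced without any change to the code that submits orders.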


Making all responses idempotent

Idempotency is the property of a service that promises each request will be completed exactly once, so that making multiple identical requests has the same effect as making a single one. You can achieve this by issuing API requests with an idempotency token attached and reusing that same token whenever the request is repeated. The service uses the token to recognize a repeated request and return the original response instead of doing the work again. This makes it easy to implement retries without fear that a request will be processed multiple times by mistake.
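
As a rough illustration of the token approach, here is a minimal Python sketch. The PaymentService class and its charge method are hypothetical, and the in-memory dictionary is an assumption made for brevity; a real service would keep tokens in a durable store and guard against concurrent requests carrying the same token.

```python
import uuid

class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency token."""

    def __init__(self):
        self._responses = {}   # idempotency token -> stored response
        self.charges_made = 0  # counts how often the real work was done

    def charge(self, token, amount):
        # A repeated token means a retry: return the original response
        # without performing the charge a second time.
        if token in self._responses:
            return self._responses[token]
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount": amount}
        self._responses[token] = response
        return response

service = PaymentService()

# The client generates one token per logical request and reuses it on retry.
token = str(uuid.uuid4())
first = service.charge(token, 25)
retry = service.charge(token, 25)  # e.g. sent again after a timeout
```

Here the retry returns the same response as the first call, and the charge itself is performed only once.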


Doing constant work

It’s very common for systems to fail when there are large, rapid changes in load. To avoid this threat to stability, engineer workloads to do constant work. An example is a health check system that monitors hundreds of servers at the same time. This system should send the same size payload (a full snapshot of the current state) every time, which avoids sharp changes in load whether no servers, some servers, or many servers are failing. In general, payload sizes should remain constant regardless of the number of successes or failures.
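
The health check example can be sketched as follows. This is a simplified illustration, not a description of how any AWS service encodes its data: the build_snapshot function and the one-byte-per-server encoding are assumptions chosen so that the payload size depends only on the number of servers being monitored.

```python
def build_snapshot(server_ids, failed):
    """Encode the full fleet state as one byte per server (1 = healthy, 0 = failed).

    Every server is always included, so the payload size does not change
    with the number of failures.
    """
    return bytes(0 if server_id in failed else 1 for server_id in server_ids)

# Hypothetical fleet of 300 servers.
servers = [f"server-{n:03d}" for n in range(300)]

none_failing = build_snapshot(servers, failed=set())
many_failing = build_snapshot(servers, failed=set(servers[:250]))
```

Both snapshots are exactly 300 bytes, so the receiver does the same amount of work during a large outage as it does when everything is healthy. A design that reported only the failed servers would instead send its largest payloads at the moment the system is under the most stress.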


Useful resources:

AWS re:Invent 2019: Moving to event-driven architectures (SVS308)

AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small (ARC337) (includes loose coupling, constant work, static stability)

AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge (MAD205)

What Is Amazon EventBridge?

What Is Amazon Simple Queue Service?

Amazon EC2: Ensuring Idempotency

The Amazon Builders' Library: Challenges with distributed systems