Solutions Architect Series – Part 8: Architectural Reliability Considerations

This is my learning note from the book Solutions Architect’s Handbook by Saurabh Shrivastava and Neelanjali Srivastav. Most of the content is distilled or copied from the book. I recommend buying the book to support the authors.

Another series: Fundamentals of Software Architecture: An Engineering Approach

Design principles for architectural reliability

The goal of reliability is to keep the impact of any failure to the smallest area possible. By preparing your system for the worst, you can implement a variety of mitigation strategies for the different components of your infrastructure and applications.

  • Making systems self-healing: Anticipate system failures in advance, and when a failure does occur, have an automated response in place for system recovery. This is called system self-healing.
  • Applying automation: Automation is the key to improving your application’s reliability. Try to automate everything from application deployment and configuration to the overall infrastructure.
  • Creating a distributed system: Monolithic applications have low reliability when it comes to system uptime, as one small issue in a particular module can bring down the entire system. Dividing your application into multiple small services reduces the impact area, so that an issue in one part of the application doesn’t affect the whole system, and the application can continue to serve critical functionality.
    However, the communication mechanism can be complicated in a distributed system. You need to take care of system dependencies by utilizing the circuit breaker pattern.
  • Monitoring capacity: Resource saturation is the most common reason for application failure. Often, you will encounter the issue where your applications start rejecting requests due to CPU, memory, or hard disk overload. Adding more resources is not always straightforward, because the additional capacity has to be available at the moment you need it.
  • Performing recovery validation: When it comes to infrastructure validation, most of the time, organizations focus on validating a happy path where everything is working. Instead, you should validate how your system fails and how well your recovery procedures work. Validate your application, assuming everything fails all the time. Don’t just expect that your recovery and failover strategies will work. Make sure to test them regularly, so you’re not surprised if something does go wrong.
  • Recoverability is sometimes overlooked as a component of availability. To improve the system’s Recovery Point Objective (RPO) and Recovery Time Objective (RTO), you should back up data and applications along with their configuration as a machine image. You will learn more about RTO and RPO in the next section. If a natural disaster makes one or more of your components unavailable or destroys your primary data source, you should be able to restore the service quickly and without losing data.
  • Start small and build as needed: Make sure to streamline the first step of taking a backup. Most of the time, organizations lose data because they did not have an efficient backup strategy. Take a backup of everything, whether it is your file server, machine image, or databases.
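
The circuit breaker pattern mentioned under distributed systems can be sketched roughly as follows. This is a minimal illustration, not code from the book: the class name CircuitBreaker, the CircuitOpenError exception, and the failure_threshold and reset_timeout parameters are all assumptions made for this example. The idea is that after several consecutive failures the breaker "opens" and fails fast instead of calling the broken dependency, then allows one trial call once a cool-down period has passed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    # All names and defaults here are hypothetical, chosen for illustration.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so the breaker can be tested
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # cool-down elapsed: one trial call is allowed
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency unavailable; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip the breaker on reaching the threshold, or re-trip it
            # when a half-open trial call fails.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A successful call closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to a downstream service in breaker.call(...) and treat CircuitOpenError as a signal to serve a fallback, so that a failing dependency does not drag down the critical functionality of the rest of the application.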

Improving reliability with the cloud

In the cloud, easy monitoring and tracking help to make sure your application is highly available as per the SLA. The cloud enables you to have fine control over IT resources, cost, and handling trade-offs for RPO/RTO requirements. Data recovery is critical for application reliability. Data resources and locations must align with RTOs and RPOs.
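
As a rough illustration of aligning a backup plan with RPO and RTO, the check below compares a backup interval and a measured restore time against the two objectives. It assumes the usual definitions (RPO as the maximum tolerable window of data loss, RTO as the maximum tolerable downtime); the names RecoveryObjectives and check_recovery_plan are hypothetical and not taken from the book.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


def check_recovery_plan(objectives, backup_interval, measured_restore_time):
    """Report whether a backup schedule and a tested restore time meet the objectives.

    Worst case, a failure happens just before the next backup, so the data
    lost can be as large as the whole backup interval.
    """
    return {
        "rpo_met": backup_interval <= objectives.rpo,
        "rto_met": measured_restore_time <= objectives.rto,
    }


objectives = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
report = check_recovery_plan(
    objectives,
    backup_interval=timedelta(minutes=30),        # backups every 30 minutes
    measured_restore_time=timedelta(minutes=40),  # taken from a recovery drill
)
print(report)
```

In this made-up example the restore time fits within the RTO, but a 30-minute backup interval cannot satisfy a 15-minute RPO, so the backup frequency (or the replication strategy) would need to change.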

With the cloud, you can design a scalable system, which can provide flexibility to add and remove resources automatically to match the current demand. Data is one of the essential aspects of any application’s reliability. The cloud provides out-of-the-box data backup and replication tools, including machine images, databases, and files. In the event of a disaster, all of your data is backed up and appropriately saved in the cloud, which helps the system to recover quickly.
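
One simple way to picture "adding and removing resources to match demand" is a proportional scaling rule: pick the instance count that would bring average utilization back to a target. The sketch below is a generic illustration under that assumption, not the algorithm of any particular cloud provider; the function name desired_instances and its parameters are hypothetical.

```python
def desired_instances(current, cpu_percent, target_percent=60, minimum=1, maximum=10):
    """Return how many instances would bring average CPU back to the target.

    Uses integer percentages to avoid floating-point rounding surprises.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    # Ceiling division: total load (current * cpu_percent) spread over
    # instances that each run at target_percent.
    wanted = (current * cpu_percent + target_percent - 1) // target_percent
    # Clamp to the configured fleet limits.
    return max(minimum, min(maximum, wanted))


print(desired_instances(current=4, cpu_percent=90))  # scale out
print(desired_instances(current=4, cpu_percent=30))  # scale in
```

The minimum and maximum bounds matter as much as the formula: the lower bound keeps enough capacity for availability, and the upper bound caps cost when a traffic spike or a bug drives utilization up.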
