When we create things we expect to last, we must design not only how they work, but how they survive. Yet in the race to deliver new things, we often overlook what’s needed to keep those things working properly.
Imagine you want to make your car resilient. You begin by reinforcing it, adding every possible safety measure, from bulletproof glass to automatic braking and supplemental restraints. Your car is now undoubtedly more robust. Robustness is a necessary, but not sufficient, element of resiliency: Just because something is robust doesn’t make it resilient.
So now you add redundancy. You buy several of these cars, in case one fails or is in the shop. You make copies of the keys, and give them to people you trust. You store extra gas at your house. Your driving system has fewer single points of failure.
Like robustness, redundancy is an important foundation for resiliency. If you have backups of an important document, you can withstand the failure of a single copy. While modern Internet systems are made of many components, we have learned how to design them redundantly. We use multiple DNS servers and route traffic across multiple autonomous systems into parallel data centers, where multiple server instances handle requests. Each layer is designed to spread load across lower layers that have the capacity to handle it.
Resiliency is more than just robustness (making something strong) or redundancy (making multiple copies of something so that we have a backup when parts fail). Truly resilient systems let us know when they’re broken, and heal themselves automatically. They are designed so that when they fail, they try to return to some sort of equilibrium state in which they are once again operating correctly.
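To make "let us know when they’re broken, and heal themselves" a little more concrete, here is a minimal sketch of one pass of a self-healing control loop. Everything in it is an assumption made for illustration: the `Service` class, the `reconcile` function, and the idea that a restart is enough to return a component to a known-good state. Real systems need far more nuance, but the shape is the same: notice the failure, report it, then try to get back to equilibrium.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    """Hypothetical stand-in for any component that can fail and be restarted."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # Assumption: a restart returns the service to a known-good state.
        self.healthy = True
        self.restarts += 1


def reconcile(services: List[Service], alert: Callable[[str], None]) -> None:
    """One pass of the loop: report each failure, then try to heal it."""
    for svc in services:
        if not svc.healthy:
            alert(f"{svc.name} is unhealthy; restarting")
            svc.restart()


# Usage: one healthy service, one broken one.
alerts: List[str] = []
fleet = [Service("web"), Service("mail", healthy=False)]
reconcile(fleet, alerts.append)
```

Note that the alert is raised before the repair is attempted, so operators still hear about the failure even when the system fixes itself.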
NIST says that a resilient system can “operate under adverse conditions or stress, even if in a degraded or debilitated state, while maintaining essential operational capabilities” and “recover to an effective operational posture in a time frame consistent with mission needs.”
Change is the only real constant, and creating something that can survive change is hard. Any resilient system must withstand four kinds of interruption:
The system must defend against outside forces that seek to change it, either intentionally or as part of some adverse event. It must also resist human error and mistakes, and restrict the activities of those who would damage or misuse it. Ultimately, the system must not only keep operating, but maintain its structure and intended functionality in spite of outside forces.
Beyond robustness and redundancy, resilient systems must include:
- Documentation for restoring service.
- Anticipation of what accidental and intentional failures the system might face, with procedures to mitigate and recover from those failures.
- Monitoring and alerting to notify operators when the system has failed.
- Failsafes or kill-switches to disable the system if it behaves in an unintended way.
- Graceful degradation (for example, an overloaded email system might send messages but not include attachments; a hospital might triage patients when the power goes out).
- Procedures for updating the procedures themselves, as the system is modified and updated.
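Three of these requirements (alerting, kill-switches, and graceful degradation) can be sketched in a few lines using the overloaded email example. This is an illustration under stated assumptions, not a real mail system: the `MailSender` class, the `max_queue` threshold, and the in-memory `alerts` list standing in for a paging system are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    body: str
    attachment: Optional[bytes] = None


class MailSender:
    """Hypothetical mail sender; names and thresholds are illustrative."""

    def __init__(self, max_queue: int = 100) -> None:
        self.max_queue = max_queue
        self.enabled = True           # kill switch an operator can flip
        self.alerts: List[str] = []   # stand-in for a real alerting system
        self.sent: List[Message] = []

    def send(self, msg: Message, queue_depth: int) -> bool:
        if not self.enabled:
            # Failsafe: refuse to act, but say so.
            self.alerts.append("send refused: kill switch is off")
            return False
        if queue_depth > self.max_queue and msg.attachment is not None:
            # Graceful degradation: deliver the text, drop the attachment.
            self.alerts.append("overloaded: attachment dropped")
            msg = Message(body=msg.body)
        self.sent.append(msg)
        return True


# Usage: normal load, then overload, then the kill switch.
sender = MailSender(max_queue=100)
sender.send(Message("hello", b"report.pdf"), queue_depth=10)
sender.send(Message("hello again", b"report.pdf"), queue_depth=500)
sender.enabled = False
delivered = sender.send(Message("too late"), queue_depth=10)
```

Under overload the message still goes out, just without its attachment, and every degraded or refused send leaves an alert behind for an operator to act on.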
As we rely more heavily on digital-first services, resilient design becomes critical. At FWD50, we’ll be discussing how to bring resiliency to service delivery and process creation, incorporating these sorts of requirements into the public service. We have already confirmed some amazing speakers for this year’s conference, like Bryon Kroger, who will be addressing this core theme in his upcoming session.