Do you run or walk? When you do, is it on a treadmill or in the street? And why does this have anything to do with cloud and datacenter operations? Surprisingly, they have more in common than might be thought. So I am asking for a few paragraphs of leeway here.
Let's start with running on a treadmill. In this scenario, you can control the lighting and climate. You will never hit rain, or have an obstacle or another runner in your path. Basically, the environment you are running in is static, and controlled by you.
Now compare that with running on the street. You may pick your path from a vastly greater set of possibilities. You may even time it to match the past history of good weather and no crowds. However, in the end, life happens. The street may now have a pothole. Other people may have decided today is their day to use your favorite running path. Mother Nature may have finally had enough with your avoiding her wrath. Basically, no matter how well you planned, the environment is not under your control. You still control the sneakers you wear and the path you run, and how you react to the changes, but not much more than that.
So what happens in the second scenario? You adapt. You react to the changes on the fly. You run around the pothole or obstacles, or adjust your speed to navigate the crowd. You may even duck under cover to wait out the rain. This reaction to change is the key difference. When running on the treadmill, the changes don't happen, and the need to react is not there.
This is where I pull this back to datacenters and clouds.
The datacenter is like a treadmill. It is controlled and static. Proactive controls will be the primary focus to ensure stability of the running systems. If some new event happens, a postmortem will drive failure mode analysis that yields new proactive controls to prevent the problem from happening again. The proactive controls become a pre-release checklist, and the reactive controls typically become runbooks. The runbooks usually identify the failures expected, and specific actions for an operations team to take. More advanced organizations will implement automated runbooks. The usual pattern for a runbook entry is to identify some change in the environment, for example a hung server, and the action to take, such as restarting the server. Over my career, most runbooks I have come across follow this model:
- Breakglass to log on to server
- Download the logs for analysis
- Try to restart the process
- If any of the above fails, restart the server
I have been in enterprises with thousands of running applications, all with runbooks that have those four basic steps for any identified problem. This response may address the immediate problem of being hung or offline, but it can lead to problems of its own. A process or server that froze because of bad data or incorrect configuration is the classic example. Restarts do not address these types of problems, which means manual investigation and remediation, and only after discovering that the prior actions failed.
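To make the limitation concrete, here is a minimal sketch of the four-step runbook run against a simulated server. Everything in it (`SimulatedServer`, `classic_runbook`, the `config_valid` flag) is hypothetical and invented for illustration; it is not any real tool's API. It assumes the simplest possible model: a process stays up only if its configuration is valid, and a reboot does not change the configuration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimulatedServer:
    """Hypothetical stand-in for a real host; not a real API."""
    config_valid: bool
    process_running: bool = False
    log: List[str] = field(default_factory=list)

    def start_process(self) -> bool:
        # Assumption: the process only stays up if its configuration is valid.
        self.process_running = self.config_valid
        outcome = "ok" if self.process_running else "failed"
        self.log.append(f"process start {outcome}")
        return self.process_running

    def reboot(self) -> bool:
        # A reboot clears runtime state but leaves the configuration untouched.
        self.log.append("server rebooted")
        return self.start_process()


def classic_runbook(server: SimulatedServer) -> str:
    """Log on, pull logs, restart the process, then restart the server."""
    server.log.append("breakglass logon")
    server.log.append("logs downloaded")
    if server.start_process():
        return "recovered by process restart"
    if server.reboot():
        return "recovered by server restart"
    return "escalate: manual investigation required"


transient_hang = SimulatedServer(config_valid=True)
bad_config = SimulatedServer(config_valid=False)
print(classic_runbook(transient_hang))
print(classic_runbook(bad_config))
```

In this toy model the transient hang is cleared by the first restart, while the bad-configuration case burns through both restarts before a human is even told there is a problem.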
The cloud is the street. Vastly more choices, but no matter which choice you make, most, if not all, of what follows is out of your control. Consider that the hardware, operating system versions, and just about everything else is always changing under your application in the cloud. That means the assumptions made in a datacenter about the static environment do not hold in the cloud. In fact, you must never assume anything about the infrastructure or services in the cloud. It is always changing, sometimes for the better, sometimes not. Not only is the hardware being replaced en masse all the time, but the software you are using is being upgraded constantly. Amazon is deploying to production at least every 11.7 seconds on average (source). You should assume that the other cloud vendors are operating at the same velocity.
The ever-changing cloud platform is why the statement from Werner Vogels needs to be core to your thinking: “Everything fails all the time”. You are building your needs-to-be-rock-solid application on digital quicksand.
I will expand upon that with my own rule here:
Dave’s First Rule of Cloud Failure: The question is not about when something fails, but do I know what is failing right now?
A key takeaway is that identifying all possible failure modes is not possible. It never really was, but you were able to get closer in a static environment. However, classes of failure are identifiable. It is now important to identify the types of problems you will need to handle, generalized patterns of how to detect those failures, and how to mitigate them. In developing these new failure modes, you need to think in three time frames: before, during, and after an incident, instead of just after it happens. Prediction allows a chance at prevention. If you don’t prevent, you still have a better chance of reacting to the real problem and not having to fall back on the kick-the-machine solution of a reboot.
Now that all that rosiness is out of the way, what do you do? Sadly, I don’t have the answer. I only have a suggestion with Reactive Controls. The idea is to treat the cloud like running in the street, and learn how to design your systems to react to your environment.
That means three things:
- Observation - collecting, storing, and distributing metrics.
- Awareness - systems must be able to receive events based on the observed metrics.
- Reaction - systems must develop actions for specific events.
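As a rough illustration of how the three pieces fit together, here is a small in-memory sketch. The `ReactiveController` class, the `queue_depth` metric, the threshold of 100, and the `shed_load` reaction are all assumptions made up for this example; a real system would use a metrics pipeline and an event bus rather than a single object.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Reaction = Callable[[str, float], str]


class ReactiveController:
    """Hypothetical in-memory sketch of observe -> aware -> react."""

    def __init__(self) -> None:
        self.metrics: Dict[str, List[float]] = defaultdict(list)  # observation
        self.limits: Dict[str, float] = {}                         # awareness
        self.reactions: Dict[str, Reaction] = {}                   # reaction
        self.actions_taken: List[str] = []

    def on_threshold(self, metric: str, limit: float, reaction: Reaction) -> None:
        """Register a reaction to run when a metric exceeds its limit."""
        self.limits[metric] = limit
        self.reactions[metric] = reaction

    def observe(self, metric: str, value: float) -> None:
        """Store the measurement, then fire the reaction if the limit is breached."""
        self.metrics[metric].append(value)
        limit = self.limits.get(metric)
        if limit is not None and value > limit:
            self.actions_taken.append(self.reactions[metric](metric, value))


def shed_load(metric: str, value: float) -> str:
    # Placeholder reaction; a real system might scale out or reroute traffic.
    return f"shed load: {metric}={value}"


controller = ReactiveController()
controller.on_threshold("queue_depth", 100, shed_load)
for depth in (20, 80, 140):
    controller.observe("queue_depth", depth)
print(controller.actions_taken)
```

Note that every measurement is stored whether or not it triggers anything. The reaction is just one consumer of the observed data.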
Think of this model as shift-left for operations. In shift-left development cycles, we create test cases as we discover failures. Similarly, in shift-left operations, we will create reactions for observed failures over time. The key is to try to broaden the failure detection to capture more than the single use case discovered. Worst case, you have a mitigation for a single failure; best case, you have a mitigation for failures you didn’t even know could happen.
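One way to picture "broadening the detection" is to compare a detector written for the exact incident with one written for the class of failure. The sketch below is purely illustrative: the event fields (`service`, `error`, `disk_used_fraction`), the service names, and the 0.9 threshold are all invented for the example.

```python
def narrow_detector(event: dict) -> bool:
    """Matches only the exact failure seen in the original incident."""
    return (
        event.get("service") == "billing"
        and event.get("error") == "disk full on /var/log"
    )


def broad_detector(event: dict, threshold: float = 0.9) -> bool:
    """Matches the class of failure: any service with a volume nearing capacity."""
    return event.get("disk_used_fraction", 0.0) >= threshold


# The incident that was actually observed (hypothetical data).
original = {"service": "billing", "error": "disk full on /var/log", "disk_used_fraction": 1.0}
# A related failure nobody has seen yet: different service, not yet an error.
unseen = {"service": "search", "error": None, "disk_used_fraction": 0.93}

for name, event in (("original", original), ("unseen", unseen)):
    print(name, narrow_detector(event), broad_detector(event))
```

The narrow detector only ever catches a repeat of the first incident. The broad one also flags the second event before it has turned into an error at all.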
This leads to my second rule:
Dave’s Second Rule of Cloud Failure: Capture every measurement you can. You will not know which metric will help identify an impending failure until after you have failed.
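A minimal sketch of what this rule implies in practice: record everything into a bounded history, and only decide what was interesting once you know when the failure happened. The `MetricRecorder` class and the metric names and values below are hypothetical; the only real library feature used is `collections.deque` with `maxlen`, which discards the oldest sample once the window is full.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class MetricRecorder:
    """Hypothetical recorder that keeps the last `window` samples per metric."""

    def __init__(self, window: int = 1000) -> None:
        self.window = window
        self.samples: Dict[str, Deque[Tuple[float, float]]] = {}

    def record(self, name: str, timestamp: float, value: float) -> None:
        """Append a (timestamp, value) sample; the oldest drops off when full."""
        series = self.samples.setdefault(name, deque(maxlen=self.window))
        series.append((timestamp, value))

    def before(self, failure_time: float, lookback: float) -> Dict[str, List[float]]:
        """Return every metric's values in the lookback period before a failure."""
        start = failure_time - lookback
        return {
            name: [value for ts, value in series if start <= ts <= failure_time]
            for name, series in self.samples.items()
        }


recorder = MetricRecorder(window=100)
for t in range(10):
    recorder.record("cpu_percent", t, 40 + t)                 # looks healthy
    recorder.record("open_file_handles", t, 100 * 2 ** t)     # doubling every tick

# After a failure at t=9, look back over everything, not just the "usual" metric.
history = recorder.before(failure_time=9, lookback=3)
for name, values in history.items():
    print(name, values)
```

In this made-up data the metric everyone watches stays flat, and the one that explains the failure is only obvious in hindsight, which is the point of capturing both.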
I would also suggest that this mindset is not limited to the cloud. There is no reason you couldn’t design your datacenter-based systems with reactive controls. It will only make your systems better.
I would love to learn how others deal with operations in the cloud. So, your homework is to share.