InfoQ recently published a new talk by Adrian Cockroft, Managing Failure Modes in Microservice Architectures.
It’s a good talk, but I think the title is a little restrictive. Many of the problems with microservices are problems with all computer systems – it’s just that microservices punish mistakes more brutally.
In a recent job interview I was asked if I could deal with the specific demands of microservices when (the interviewer felt) much of my experience involved monoliths or small groups of services. My response was that principles such as loose coupling, monitoring, and resilience are needed in all systems .
It’s a rare system that has no external dependencies. I once worked with a monolithic system that made a call to Salesforce as part of the login process. When Saleforce went down, users could not log into the system. The issue was obvious – we had to manage failures in the external dependency. Microservices, by involving many more dependencies, force people to engage with this; or else, suffer massive disruption and downtime.
One of the most interesting things about Cockroft’s talk was his observation about the problem with disaster recovery:
Your switching processes, and code, and practices are not well-tested. You’ve got an unreliable switch between your primary and backup datacenter. You might as well just not have the backup datacenter.
The answer here is to make the failure systems part of the normal workings; to be constantly switching between datacenters, for example.
There was also an interested related point about the chaos monkey. This was not just about testing resilience:
We were enforcing autoscaling. We wanted to be able to scale down. If you think of an autoscaler scaling up and scaling down, to scale down, it has to be able to kill instances. The Chaos Monkey was there to enforce the ability to scale down horizontally scaled workloads. That was actually what it was for. It was to make sure you didn’t put stateful machine, stateful workloads in autoscalers. Then you can have this badge of honor gamified a bit. “My app survived all of this chaos testing, and it’s running in this super high availability environment, and your app didn’t. Do you mean your app’s not important?” You can gamify it a bit.
Many of the principles discussed in relation to massive companies such as Netflix are needed by everyone. The good thing about this is that the tools being produced empower companies of all sizes. Microservices enforce a more mature attitude to failure; but failures occur in systems of all sizes.