InfoQ recently published a new talk by Adrian Cockroft, Managing Failure Modes in Microservice Architectures.
It’s a good talk, but I think the title is a little restrictive. Many of the problems with microservices are problems with all computer systems – it’s just that microservices punish mistakes more brutally.
In a recent job interview I was asked if I could deal with the specific demands of microservices when (the interviewer felt) much of my experience involved monoliths or small groups of services. My response was that principles such as loose coupling, monitoring, and resilience are needed in all systems .
It’s a rare system that has no external dependencies. I once worked with a monolithic system that made a call to Salesforce as part of the login process. When Saleforce went down, users could not log into the system. The issue was obvious – we had to manage failures in the external dependency. Microservices, by involving many more dependencies, force people to engage with this; or else, suffer massive disruption and downtime.
One of the most interesting things about Cockroft’s talk was his observation about the problem with disaster recovery:
Your switching processes, and code, and practices are not well-tested. You’ve got an unreliable switch between your primary and backup datacenter. You might as well just not have the backup datacenter.
The answer here is to make the failure systems part of the normal workings; to be constantly switching between datacenters, for example.
There was also an interested related point about the chaos monkey. This was not just about testing resilience:
We were enforcing autoscaling. We wanted to be able to scale down. If you think of an autoscaler scaling up and scaling down, to scale down, it has to be able to kill instances. The Chaos Monkey was there to enforce the ability to scale down horizontally scaled workloads. That was actually what it was for. It was to make sure you didn’t put stateful machine, stateful workloads in autoscalers. Then you can have this badge of honor gamified a bit. “My app survived all of this chaos testing, and it’s running in this super high availability environment, and your app didn’t. Do you mean your app’s not important?” You can gamify it a bit.
Many of the principles discussed in relation to massive companies such as Netflix are needed by everyone. The good thing about this is that the tools being produced empower companies of all sizes. Microservices enforce a more mature attitude to failure; but failures occur in systems of all sizes.
We’ve just announced the first two events for Brighton Java in 2020. On Tuesday January 28th, we have a talk on Serialization Vulnerabilities in Java from Joe Beeton and sponsored by Amex.
The talk looks at how Java applications can be vulnerable to serialization attacks, and how they can be protected. The talk will be useful to Java developers of all experience levels. Joe has done some interesting work around this topic, which I’m looking forward to hearing more about.
In February, we have Luke Whiting talking on Kubernetes in the Data Center: Theory vs Reality. This talk will be a little different to the usual format. Luke will start with a talk he gave at the start of a project, following it up with the lessons learned in hindsight.
We’re also putting together future events, including a collaboration with Silicon Brighton and an exciting speaker for April. But we’re always on the lookout for new speakers, and welcome people of all levels – introductory talks on a topic are just as useful as expert sessions. If you’ve not spoken before, we’re happy to help you prepare. Either leave a comment, or email email@example.com
I’m also hoping to do some practical first steps sessions. The JVM eco-system feels more exciting than ever right now; but it’s sometimes hard to work out how to begin with new topics. Hopefully, we can provide a collaborative environment for this. More soon.
How many open tickets do you have in your tracking software?
And how many tickets were closed in the last six months?
In a lot of places I’ve worked, the relative size of these two numbers meant it would take years to clear all the open tickets. When the company has been going a long time, there are even sometimes tickets relating to issues that were fixed years before, often as the side-effect of another change.
These other tickets turn up in searches and need to be considered, if only in a small way, when new work is prioritised. While it’s useful to have long-term plans and roadmaps, I am not convinced that the ticketing system is the right place.
The more tickets you have, the more temptation there is to throw things in the backlog that need doing one day. Things get lost, and it’s hard to be certain you’re working on the most important thing.
If something is not going to get done in the next year, there’s no point tracking it.
What about bugs and live issues?
The best platforms I’ve worked on have had a zero-defect policy. If something is worth fixing, it should be fixed as soon as possible. If you’re prepared to live with it, and you’re not going to get round to fixing it any time soon… why bother recording it?
What about technical debt?
If your technical debt is causing problems, you’ll be able to figure the most important thing to focus on. The rest of those things that you might get onto next year… Still might not be a bug enough problem to actually solve next year.
Massive backlogs are a weight we don’t need to carry. Tickets you’d love to do but won’t are clutter. They’re self-delusion.
Do some new year tidying up: delete the tickets you’re not going to work on, and focus on what you actually can do.