Gangstas Don't Scale
By Julian Browne on June 19, 2009. Filed Under: architecture, development
A common anti-pattern in distributed systems design comes about because of the belief that scale is easily achieved by making all communication asynchronous. And to be fair, there is some logic to why this misconception occurs. If you take the real-life analogy of making a phone call as a synchronous option for communication, with sending a letter as its asynchronous counterpart, then you can more efficiently increase a team’s workload (assuming the team performs simple, repetitive, tasks), in proportion to increases in its size, if they spend all day banging out letters rather than dialling numbers, redialling when busy, waiting for someone to answer, issuing instructions, waiting for confirmation, etc.
Making a phone call is, loosely speaking, a blocking) operation in that you and your communication partner can’t make, or take, another call until you’re done with the current one (so you effectively queue anything that needs doing next), whereas you can write and send letters without their direct involvement (and there’s an added bonus that identical messages can be sent in bulk).
Of course in reality both phone calls and letters “block” to some extent - often when we think we are describing parallel activities we really mean sequential activities that are completed fast enough to avoid the need for a systematised queue. We just avoid these queues by offloading some tasks to asynchronous processing, something I talked about in Scalability where I made the statement:
for scalability, orthogonal activities should be designed as asynchronous services
.. and went on to define orthogonal in a way which hopefully didn’t promote the anti-pattern.
In 2005 Gregor Hohpe wrote a short but thought-provoking article entitled ”Your Coffee Shop Doesn’t Use Two-Phase Commit” where he described the asynchronous mechanisms Starbucks judiciously employ to ensure coffee-making at scale and the techniques they use to work around the price of asynchrony.
I was thinking about this price as I wrote my CAP Theorem meets the Sex Pistols piece - concerned that I might give the impression that chilling-out when it comes to coordinated application state means that you should drop blocking, and locking, and two-phase commit, and make everything, everywhere, loose and unstructured and asynchronous.
It surely is a seductive idea. One that got a lot of buy-in from the EAI community six or seven years ago. Want a simpler architecture? Easy. Draw a big box in the middle of your application landscape diagram and call it the “enterprise bus”. All communication paths become asynchronous messages and hey presto: simplicity and scale.
Or not. Just as CAP Theorem tells us where to expect design trade-offs and not to throw money at an unattainable goal, there’s a similar admonition for the design trade-offs themselves. Kind of an add-on to CAP Theorem that simultaneously refines and supports it.
CAP says that we can’t have Consistency and Availability and Partition Tolerance and have them guaranteed and concurrent. So we work around this by loosely coupling some parts of the business process and our primary tool for this is asynchrony. But asynchrony isn’t free, because you can’t have it and guaranteed and concurrent fault tolerance. If everything works OK, all the time, you’ll be fine. It’s when they don’t that asynchronous communications can be a bad thing.
Those times might be rare, but they are worth understanding so that you can spot them early. When Hohpe wrote that piece about Starbucks he noted that they used asynchrony at a few key points in the sales order process, but not everywhere.
Order placement, payment and collection are synchronous. And for good reason.
Extreme Example 1
I once worked for a company that had deployed, somewhat blindly, an enterprise message bus, which was then mandated for all communications. All interconnects between applications were delivered via messages, which included business events, request-reply calls, workflows, transformations, data and process integration. When it worked it was great. When it went wrong the poor support teams spent hours manually fixing data consistency issues between applications, while the business got annoyed, customers got cheesed off and money got wasted.
Extreme Example 2
More recently I was talking to someone about introducing asynchronous communication into an existing application set that has hitherto been solely based on synchronous calls. On examining the volumes, timing, reliability and transactional needs our conclusion was that while we both felt an async bus would provide new capability that could be exploited in the future, perhaps even helping to shape demands that the business would otherwise remain unaware of, there was no urgent need to do so. Scalability, for now at least, is being met just fine without it.
Byzantine Fault Tolerance aims to address the unpredictable results of errors in distributed systems. It’s a product of the Byzantine Generals’ Problem, which is itself a more specific form of what’s commonly called the Two Generals’ Problem - a thought experiment in coordinating activities via a potentially unreliable asynchronous communication line.
I recently rediscovered an academic paper which must be just about the first to describe these issues as they relate to distributed systems called ”Some Constraints and Trade-offs in the Design of Network Communications” by Akkoyunlu, Ekanadham and Hubert from, would you believe, as far back as 1975.
Before the two generals were generals they were, apparently, gangsters. The paper illustrates how, under certain circumstances, asynchronous communications inhibit scale because they can lead to an infinite loop which prevents an operation from completing. I’ll recreate it here, updating the story using characters from the HBO show The Wire, which not long ago hit TV screens in the UK (six years after it aired in the US, we are so backwoods here sometimes it’s unreal). But do read the original paper if you have time as it covers a lot more than just this subject.
So this is how it goes..
Detective Jimmy McNulty and detective Kima Greggs have staked out a Barksdale drug house for a week. As supplies run low, the crew that run the place call for a re-up of stock, which must be delivered and paid for. At the moment this transaction takes place there will be the drugs, a couple of captains, a few foot soldiers, some illegal guns and a whole load of money all in the house. For the gang it’s a dangerous moment. For the police a golden opportunity to make some arrests.
Greggs covers the back of the property and McNulty the front. Police backup arrives, but not enough for comfort. There will be five gang members in the house and six cops in total - three around the back and three at the front. They can overwhelm the gang at the appropriate time, but only if they go together in a coordinated operation.
The way this would normally be done would be by radio. The drug car pulls up, McNulty lets the product enter the house and then yells “Go. Go. Go.” into his walkie talkie and all hell breaks loose - ‘cept that would be synchronous communication and their commanding officer back at headquarters, political suck-up William Rawls, fancies himself as an enterprise architect and has mandated that everything must be asynchronous.
Asynch means McNulty pays a kid $10 to run around to the back of the house and tell Greggs when to go. But McNulty’s not stupid, the kid could just run off with the $10, or get shot, or lost or whatever, so he tells the kid to come back and confirm that he’s told Greggs to go. Only then can he charge in with confidence.
But Greggs isn’t stupid either. She gets told by the kid to go, but she can’t actually go until she knows McNulty knows the kids told her to go. Because if the kid doesn’t confirm to McNulty that he’s told Greggs to go then McNulty ain’t gonna go. And Greggs is facing five armed and seriously pissed off gangsters with just two officers for backup.
If McNulty agrees to all this then he knows that until the kid has completed his second confirmation the mission is not starting. So he can’t go until that’s happened. But how does he know that it’s happened? He asks the kid to do a third confirmation so that McNulty knows that Greggs knows that McNulty knows that Greggs knows it’s time to go.
And if McNulty won’t go until this third confirmation is complete then Greggs sure as hell is not entering that house until the kids confirms for a fourth time.
And while all this is going on the gang complete the deal and go about their business. The kid gets to thinking that none of this is worth $10 and goes home.
What we’ve proved is that when you positively absolutely have to coordinate a transaction like this, where each participant quite reasonably doesn’t want to act without the other, because the consequences are just too dire, you can’t do it asynchronously. Or rather you can, but you might want to build in a lot of checks to deal with those consequences.
Starbucks could chose to take orders via slips of paper that you pick up at the front door and drop in a box. They could take orders this way without much of a queue ever forming. They could allow you to pay asynchronously too by dropping the exact money into another box. They don’t because it’s broadly a better experience for all if they know you paid, you know your order was understood and they can tempt you with a muffin and a flapjack. The price you pay when it’s busy is there’s sometimes a queue. The queues are mostly bearable because everything else is async.
There are lots of ways to achieve scale, and each has its pros and cons. Asynchronous links improve loose coupling but reduce the ability to manage distributed transactions, which is indicated by CAP Theorem as something to avoid if you want availability and partition tolerance over consistency.
But when you need consistency, usually because it provides a greater long term benefit than anything else for that particular operation, your choices are suddenly limited to something coordinated and synchronous or something that can fix problems after the fact.
Those problems can be a devil to fix. But you can’t ignore them. You got to keep the devil way down in the hole.