Moving to the Cloud
don’t be trapped by old concepts, Matthew, you’re evolving into a new lifeform
Tags: architecture, business, delivery, strategy
It may seem odd in 2023 to be talking about moving infrastructure to the cloud, but I am constantly surprised at how many businesses that could be cloud-based aren’t. AWS has been around for 17 years, Google Cloud for 15 years, Azure for 13 years and alternative cloud offerings for 10 years. Actually, I say I’m surprised, but perhaps intrigued is a better word. Like so much in tech, these things do have a sort of weird explanation to them that needs a touch of dismantling.
Let’s talk culture and body snatchers.
The vast majority of business models are absolutely fine to move to the cloud. You can ignore all the posts filled with twaddle about how you should be wary of cloud for reasons of security, outages, monitoring and support etc. Yes, things can be a bit different in the cloud. However, to put it politely, would you rather rely on services built by engineers from Google, Amazon or Microsoft, or the ones you can hire yourself? Exactly. Some specialised businesses may need to run on their own hardware but, as cloud capabilities improve, that list is dwindling.
What I think is the cause of this tardiness is not so much that the business model doesn’t fit, but that there’s a clear unaddressed lack of cloud readiness. Not just in businesses that haven’t started a migration, but in businesses that struggle to get traction on their migration plan. Some of it is technical, some is cultural. And I think being realistic about these areas would go a long way to implementing effective migration plans.
Let’s look at a major cause of technology inertia first.
The Body Snatcher Architecture
Today, many organisations have what we call a monolithic architecture, and lots of them are talking about moving (or attempting to move) to a micro-service alternative. There are some great books and posts on how to do this, and some equally helpful content on problems to be aware of.
I don’t have any fundamental problems with the mainstream view on this [1]. Micro-services are a distributed systems approach to architecture, and when you run things in the cloud that turns out to be a sound model to follow. However, distributed systems are hard, which means projects often run into serious issues. Ironically these issues tend to be easier to overcome if you start with a monolith, rather than try to design a fancy micro-services architecture from scratch.
There are two points to take away from this: a) there’s no shame in starting with a monolith. Monoliths are not an anti-pattern. And b) you’re better equipped to deal with complex architectural challenges when you know your business better.
The issues arise as the business grows. Small monoliths don’t become bigger monoliths, they become body-snatchers.
The Body Snatchers is a novel written by Jack Finney in 1954. The book was made into a film in 1956 and remade in 1978 as the definitive, all-star version (it’s been remade at least twice since, once in 1993 and again in 2007). The basic plot is that Earth is invaded by alien seeds, which grow into pods that clone nearby sleeping humans into identical copies that have no emotion or soul.
In a growing monolith, the thing that gets cloned is the data. As the demands on the database increase, a typical early step businesses take is to add clustering. Now there’s a primary server and one or more replicas. This has a couple of benefits: there’s some failover capability if the primary server has a wobble, and code that doesn’t need a real-time view of data can connect to the replicas, thus reducing load on the primary. This is Cloning Stage 1. But it’s fine. Technically it’s still a monolith application.
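To make Cloning Stage 1 concrete, here is a minimal sketch of the routing decision it introduces. The class, method and server names are all hypothetical; the only idea taken from the text is that writes and real-time reads go to the primary while everything else can go to a replica.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ClusterRouter:
    """Picks a server in a primary/replica cluster (all names are hypothetical)."""
    primary: str
    replicas: list[str] = field(default_factory=list)

    def pick(self, *, write: bool = False, needs_fresh: bool = False) -> str:
        # Writes, and reads that need a real-time view, must hit the primary.
        if write or needs_fresh or not self.replicas:
            return self.primary
        # Anything else can be served by a replica, easing load on the primary.
        return random.choice(self.replicas)


router = ClusterRouter(primary="db-primary", replicas=["db-replica-1", "db-replica-2"])
print(router.pick(write=True))        # db-primary
print(router.pick(needs_fresh=True))  # db-primary
print(router.pick())                  # one of the two replicas
```

Note that every caller now has to decide whether it needs fresh data, which is exactly the kind of decision that ends up scattered through a code base.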
Time passes and the load increases further, either by traffic or additional complexity in the data model requiring more joins. Now the primary is so busy that the replicas sometimes lag behind it by more than a few seconds. As this grows to minutes (and hours in extreme cases) reporting starts to feel untrustworthy. Indexes get added, which will help reads but may make writes take longer. However, everything is still too slow. The next step is to flatten some of the main entity data so that joins are eliminated. But not within the core tables themselves, because then we’d lose referential integrity. Instead, we’ll add some flattened/join-less tables, maintained by triggers and batch jobs. Cloning Stage 2a. Now we have core tables, flattened tables and copies of both.
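Here is a rough sketch of what Cloning Stage 2a tends to look like, using an in-memory SQLite database so it runs anywhere. The table and column names are invented for illustration, and a real system would more likely do this with triggers or a scheduler than a hand-called function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    -- The join-less copy: no foreign keys, one row per order.
    CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY, customer_name TEXT, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0);
""")


def refresh_flat_table(db: sqlite3.Connection) -> None:
    """Batch job: rebuild the flattened table from the core tables."""
    with db:  # a single transaction, committed when the block exits
        db.execute("DELETE FROM orders_flat")
        db.execute("""
            INSERT INTO orders_flat (order_id, customer_name, total)
            SELECT o.id, c.name, o.total
            FROM orders o JOIN customers c ON c.id = o.customer_id
        """)


def flat_name(db: sqlite3.Connection, order_id: int) -> str:
    row = db.execute(
        "SELECT customer_name FROM orders_flat WHERE order_id = ?", (order_id,)
    ).fetchone()
    return row[0]


refresh_flat_table(conn)
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.commit()
stale = flat_name(conn, 10)   # the flat copy still holds the old name
refresh_flat_table(conn)
fresh = flat_name(conn, 10)   # correct again only after the job has rerun
print(stale, "->", fresh)     # Ada -> Ada L.
```

The read from the flattened table needs no join, which is the point, but between job runs it quietly disagrees with the core tables.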
More time passes and the jobs that read and flatten data, plus all the reads and writes and replication and indexes take their toll. We need a NoSQL outpost to store that flattened data. Someone adds Mongo or Couch or similar and those flattening jobs write to separate (possibly clustered) servers that specialise in access to indexed and denormalised data. Load shared. Cloning Stage 2b complete.
While all this is happening, what began as a clean MVC framework for the code has bloated beyond recognition. Views, which used to be quite clean, now contain business logic and string-concatenated SQL statements that fulfil their particular data needs from whichever of the various cloned sources made sense at the time. This lack of consistency and efficiency is itself taxing the database, but there’s no time to fix it. Besides, no one is especially keen to clean it up because there’s no canonical reference of what any of the data in the database even means. The pace of change has left columns that are poorly named, tables that get populated but are no longer used, duplicate places to get the same type of data, and nulls in places no nulls should be.
But we’re not done cloning, because this is about the time someone sells the idea of a cache. So in comes Redis, Hazelcast, Gridgain, or whatever and now we have another copy to worry about. Our data is made in the relational furnace of the primary, processed, flattened, replicated and cached and all these stores are accessed willy-nilly from all parts of the code base.
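A minimal sketch of why a cache counts as yet another clone. Plain dictionaries stand in for the primary and for Redis (or similar), and the key, values and 60-second expiry are all made up for the example.

```python
import time
from typing import Optional

TTL_SECONDS = 60.0                        # hypothetical expiry, not a recommendation
database = {"user:1": "Matthew"}          # stand-in for the relational primary
cache: dict[str, tuple[str, float]] = {}  # stand-in for Redis or similar


def get(key: str, now: Optional[float] = None) -> str:
    """Cache-aside read: serve from the cache, fall back to the database."""
    now = time.monotonic() if now is None else now
    hit = cache.get(key)
    if hit is not None and hit[1] > now:
        return hit[0]  # may be stale if the database changed after caching
    value = database[key]
    cache[key] = (value, now + TTL_SECONDS)
    return value


first = get("user:1", now=0.0)     # miss: reads the database, fills the cache
database["user:1"] = "Elizabeth"   # the source of truth changes
stale = get("user:1", now=30.0)    # hit: still returns the old value
fresh = get("user:1", now=61.0)    # entry expired: re-reads the database
print(first, stale, fresh)         # Matthew Matthew Elizabeth
```

Multiply that window of disagreement by every store in the chain and you get a feel for why nobody is sure which copy is right.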
And somehow we have to move this to the cloud, in a business that’s firing new feature requests at tech like a Tommy gun, without it breaking.
We’ll cover what to do about this in a bit. For now, I just want to make the point that when companies say they have a monolith, they rarely do. There can be challenges moving a monolith to the cloud, but it’s a fairly tidy architecture. Moving a mess of data-sucking alien pods is a whole different story.
And we’ve yet to look at the business side.
Thinking Cloud is Cheaper
If you rent actual servers with a big provider like Rackspace or ANS (formerly UKFast) then you’ll typically pay a well-understood fee per month for those boxes.
You will probably moan about how it takes their support team days or weeks to set up new equipment (at least compared to cloud, which is seconds), but your cost will be predictable.
Cloud is LOADS more expensive. Whatever the vendor account management teams tell you about cloud optimisation blah blah, it’s NOT cheaper. Based on actual experience from many migrations I would say, as a rule of thumb, you should assume it’ll be initially about 3x more expensive. Yes, I said 3x. If you spend 100k today, you’ll spend 300k on day one in the cloud.
You can optimise this down a lot. Initially, though, most businesses move their body-snatcher architecture over in dribs and drabs, and that costs a packet because cloud pricing punishes wild data transfers and centralised thinking.
Saying the Words ‘Lift and Shift’
It’s very tempting to think you can copy/paste your carefully curated collection of snowflake servers to the cloud, and technically it is possible, but a) see “Thinking Cloud is Cheaper” and b) you’ll almost certainly get woeful performance and have a terrible time trying to manage it.
Not Changing the Delivery Model
Whether you have a clean monolith or a field of body-snatchers, on rented hardware you can generally release code the way you’ve always done it (which is usually a mostly manual, error-prone affair that relies on a few named individuals who ‘know the system’) because the servers are mostly static. The cloud forces you to think about scripting builds, running automated tests (the horror), and treating servers like cattle, not pets. This requires a LOT of effort, and people who care about such things above and beyond the code.
Not Educating the Exec Team
The cloud is not magic. At its simplest, it’s a way to get infrastructure set up really quickly, configure and reconfigure it as you scale, and a whole host of things you can double-click on in a console to save you from having to build things yourself. I think everyone understands the infrastructure side by now but those services come with all kinds of side effects, which are basically the product of being distributed systems. The exec team needs to understand that there will be LOTS of talk about services with weird names, trials that don’t work out, edge case understanding moments, and plenty of money spent getting there.
Not Adjusting the Workload
Whatever plans the business had, they need adjusting to give time to get all the above things done.
Having a Migration Team
Clearly, this is all intensive stuff and you’ll want people dedicated to learning what works and what doesn’t and pushing the plan forwards. But what they are building will be the new world. Old servers will be decommissioned and reinstated as multiple virtual images and a new landscape will emerge. One that will need to be understood by the whole team. Documentation isn’t enough. From the head of engineering to the most junior developers and testers, everyone needs to experience it first-hand. Otherwise, knowledge of ‘how things work’ will remain in the heads of the few, just as it was before. Developers will leave, looking to build their cloud skills elsewhere.
A Cloud Plan
OK, enough negativity. Let’s put together a plan that works. Starting with a cloud rationale.
Not everyone in the company will care about the move, but everyone should know it’s happening and why. They should also broadly know how life will be different. The rationale should be simple. It should not be about cost saving, but about agility and the ability to access services designed to operate under failing conditions.
DevOps / SRE / Platform Engineering
There has been a huge amount of nonsense talked about the team of specialists that look after your business platform. I’m not sure it hugely matters what you call them (other than to attract the right talent) but it very much matters what they do and how you treat them. I’m going to talk about this at length in a future article so for now I’ll just say that you must have an empowered team whose job it is to look after the health of your platform. This has a big overlap with the development of the features within the platform (as in, both groups should be in all conversations about projects, tech choices, strategy, direction, etc) but it is a discipline in its own right and needs nurturing accordingly.
And those platform custodians won’t get far without reliable, actionable data. The days of rooting through log files or piecing together an understanding of the cause of an issue from multiple monitoring tools (that also didn’t alert you to the problem before the customers did) should be long gone. They’re not, in part because body-snatching architectures are a major time suck, so there’s not usually any capacity to run code any differently.
Start With Knowledge
The good news is some of the body-snatcher pods are probably movable if you add a bit of scripting and some telemetry data. What you need is a formal analysis (like the one provided by StratoZone, which was acquired by Google in 2020) to break your body-snatching pods into a) simple services that will lift and shift, b) services that need ‘some work’ and c) services that need re-architecting.
Note that c) doesn’t always mean microservices, but this is a good time to be intentional about the way forward. Don’t just chop them up randomly - know how the work fits into the strategy for a future platform.
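As a rough illustration of that three-way split, the sketch below sorts services into the a), b) and c) buckets. The fields and thresholds are entirely my own assumptions, chosen only to show the shape of the exercise; a proper analysis would work from measured data rather than three hand-filled attributes.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    LIFT_AND_SHIFT = "a) lift and shift"
    SOME_WORK = "b) needs some work"
    RE_ARCHITECT = "c) needs re-architecting"


@dataclass
class Service:
    name: str
    data_stores_touched: int  # how many of the cloned stores it reads or writes
    scripted_deploy: bool     # can it be built and released without a human?
    has_telemetry: bool       # does it emit data we can monitor?


def triage(svc: Service) -> Bucket:
    # Illustrative rules only: the cut-off of two stores is an assumption.
    if svc.data_stores_touched > 2:
        return Bucket.RE_ARCHITECT
    if svc.scripted_deploy and svc.has_telemetry:
        return Bucket.LIFT_AND_SHIFT
    return Bucket.SOME_WORK


services = [
    Service("image-resizer", data_stores_touched=0, scripted_deploy=True, has_telemetry=True),
    Service("email-sender", data_stores_touched=1, scripted_deploy=False, has_telemetry=False),
    Service("reporting", data_stores_touched=4, scripted_deploy=False, has_telemetry=True),
]
for svc in services:
    print(f"{svc.name}: {triage(svc).value}")
```

The service names are hypothetical too. The useful part is writing the criteria down so the whole team can argue about them.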
Connect your hosting provider and your cloud with a fat pipe so you can run your infrastructure from both places and move live services around in real-time. Avoid big bang changes and overnight releases like the plague. Put code live during the day when everyone is around to help if it breaks.
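One way to picture moving a live service without a big bang is a weighted split of traffic that you nudge upwards during the day. This is a sketch under assumptions: the function name, the backend labels and the ramp steps are all hypothetical, and in practice the split would live in a load balancer or DNS rather than application code.

```python
import random
from collections import Counter


def choose_backend(cloud_share: float, rng: random.Random) -> str:
    """Send roughly `cloud_share` of requests to the cloud, the rest to hosting."""
    if not 0.0 <= cloud_share <= 1.0:
        raise ValueError("cloud_share must be between 0 and 1")
    return "cloud" if rng.random() < cloud_share else "hosting"


rng = random.Random(42)  # seeded so repeated runs behave the same
for share in (0.05, 0.25, 0.5, 1.0):  # hypothetical ramp, one step at a time
    counts = Counter(choose_backend(share, rng) for _ in range(10_000))
    print(f"{share:>4.0%} target -> {counts['cloud']} cloud / {counts['hosting']} hosting")
```

If a step goes badly you drop the share back down, which is a much calmer conversation than rolling back an overnight release.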
Aim to Use Every Proprietary Service That Will Buy You Time Back
Cloud providers offer tons of services to lock you into their platform. On the surface of it, it may seem like a good strategy to ignore them and build your own, so you are free to leave to get a better deal. You won’t. And it’ll cost tons more to do that than the cost of any lock-in.
Years ago I worked at Hive, the Internet of Things start-up created by British Gas in the UK. We scaled a business from a few hundred customers to millions in AWS. Initially, this was our strategy: keep the use of AWS native services (even basic ones like messaging) to a minimum so we could always be ready to move away. I’m not sure we ever thought beyond that statement, but we sure made a bed to lie in. It worked, but it took a lot of effort and engineering time. Once we changed tack and took advantage of all the services AWS had to offer, we got time back to do other things (like build better products and look after our platform).
The change of strategy was something I talked about at a few conferences at the time.
Have a Mantra
The DevOps team at Hive had a mantra. When faced with a technology choice, follow this logic:
Prefer Services to Software - if the cloud provider offers it then use it rather than build it. Then use the time freed up to do more (or just to go home on time and enjoy life).
Prefer Software to People - if someone has to do it then automate it and make it part of a toolset anyone can use. It democratises the management of the platform and reduces the chance of a hero culture.
Prefer People to Bureaucracy - but if you can’t use a cloud service or automate it then trust the people on your team, just make it as easy for them as possible to get it done so they can go home at a reasonable time too.
Prefer ChatOps for Everything - I think this one is aging a bit in an observability world, but the philosophy still holds true: always put everything in Slack or Teams or whatever tool you use. You should have one place where you can find out what’s going on.
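Read in order, the mantra is really a decision ladder, which can be sketched in a few lines. The function names and the list standing in for a Slack or Teams channel are assumptions for the sake of the example, not anything the Hive team actually ran.

```python
channel: list[str] = []  # stand-in for the one place everyone looks (Slack, Teams, ...)


def preferred_approach(managed_service_exists: bool, can_be_automated: bool) -> str:
    """Walk the mantra top to bottom and stop at the first rung that applies."""
    if managed_service_exists:
        return "service"   # Prefer Services to Software
    if can_be_automated:
        return "software"  # Prefer Software to People
    return "people"        # Prefer People to Bureaucracy


def decide(task: str, managed_service_exists: bool, can_be_automated: bool) -> str:
    choice = preferred_approach(managed_service_exists, can_be_automated)
    channel.append(f"{task}: going with {choice}")  # Prefer ChatOps for Everything
    return choice


decide("message queue", managed_service_exists=True, can_be_automated=True)
decide("certificate renewal", managed_service_exists=False, can_be_automated=True)
decide("supplier contract review", managed_service_exists=False, can_be_automated=False)
print("\n".join(channel))
```

The example tasks are invented. The ordering is the important bit: a managed service wins even when you could automate it yourself.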
The painting is called “Under a Cloud” by the American artist Albert Pinkham Ryder
The subtitle is a quote from the 1978 film “Invasion of the Body Snatchers”
[1] Actually I do. But that’s for another time. No doubt smaller, independently deployable services are great for agility. But they require some fairly serious engineering discipline which a lot of companies (and certainly a lot that are ‘doing microservices’) do not have.