Part 1 of 3: Becoming a spaceman
Tags: architecture, development
This is the short version of the story of my experience with space-based architecture (SBA). As the instigator of one of the most frequently cited commercial implementations of an SBA, and unusually one outside the financial trading sector, where the approach is more mainstream, I think it’s appropriate to cover not just the implementation but why it seemed like a good idea in the first place.
I was Head of Architecture & Design at Virgin Mobile in the UK until shortly after it was taken over by NTL:Telewest (now re-branded Virgin Media). Like all operators in the mobile telecommunications space, we had to contend with operational challenges caused by frantic growth in the late nineties, particularly in keeping our customer-facing systems reliable and highly available. The culture in mobile is fast-paced and competitive, which often means that anything that looks and smells like strategic architecture delivering non-functional requirements is hard to do. Development timescales tend to be short and aggressive and, in Virgin Mobile particularly, very focused on great customer service.
One channel we found hard to exploit was the web. The existing site was fairly dated and hardly presented us to potential customers as the funky consumer champion Virgin businesses like to be. We also knew that a simple skin-job and a bit of Ajax wasn’t going to be enough: a few graphics and cool text would not provide a pleasant user experience if our back-end systems were down.
The business sponsor and I kicked various ideas around for a while and finally convinced our board to splash out more cash than they’d originally had in mind on a new order management system (OMS) to sit behind the web front end. The benefit case was fairly easy to draw up: we could make ordering a reliable and predictable process, and we could reuse our new OMS for other channels. Why stop at the web? Once it works and works well, why not reuse it for telesales and high street stores too?
We clearly needed something extensible. We also needed something that would intrinsically support an atomic processing model. Order processing can get quite involved, with credit checks, stock checks, progress updates, warehouse communications, despatch notes, call centre updates. If any one of these legacy systems were temporarily out of action, we needed the customer’s order to be in a reliable state when it came back up.
And most of all we needed buckets and buckets of scalability. Anyone who’s worked in the mobile sector will know it’s a strange place when it comes to transactional throughput. A bank, for example, will have peaks and troughs around consistently high average levels of activity. Paradoxically, this makes design easier because, while the problem may be hard to solve, you have to solve it for every minute of the day (so your mind is focussed, your business is prepared, and the money and desire are there to support you). Not so in mobile: you can have a fairly slow day, followed by a day where order activity goes through the roof. It’s not unusual for Christmas orders to be many times a standard day’s activity. If there’s a promotion going on at the same time (and there nearly always is) you can be in real trouble if your systems can’t cope.
It’s not just the risk of poor user experience: there’s lost revenue and the incalculable impact on your brand to add to the pressures (the ‘opportunity window’ in which users change mobile operators is a narrow one, and these days brand perception is everything). What we were in effect looking for was a strategic solution that could bend to needs we weren’t yet aware of and that could be tactically implemented using the (non-EJB) Java skill base we already had.
Whilst transactional consistency was high on our list, scalability and tolerance to outages in our legacy systems ranked higher. That is to say, without putting data integrity at risk, we needed to make sure we could manage incoming orders in a user-friendly manner even if a satellite application couldn’t be contacted.
We examined all the standard approaches (light and heavyweight application servers, third-party OMS products, etc.) but settled on a space-based architecture built on Gigaspaces, principally by the following logic:
Nothing we looked at would scale easily (or at least easily enough for us). Our business had been bitten before by the don’t-worry-this-new-fangled-thing-will-save-the-world story, and we needed not to repeat it. When we started to hit hardware limits we wanted to scale out not up, and because of time constraints we wanted to avoid relying on traditional load balancing and some of those thorny problems you can get managing sessions.
Investment Bank Strength but Mobile Flexibility
My background before mobile was in investment banking in the City of London (particularly futures and options trading) where thinking in terms of ACIDic transactions, throughput, scalability and bullet-proof reliability is second nature. But what we didn’t want was a classic three+ tier architecture where transactional integrity was delivered by the RDBMS - innovative leaders like Amazon had already shown this to suffer scaling issues.
The notion of JavaSpaces, Java’s implementation of the tuple spaces blackboard pattern, is relatively well understood in finance. Possibly not as widely used as it could be, but certainly not considered rocket science.
JavaSpaces (and its underlying JINI) is very tolerant of unreliable nodes and network connections. JINI was born from Deutsch’s fallacies of distributed computing and has been used by the US Army and Formula One race teams in some pretty hostile environments.
Because it works on the principle of lots of small and cheap boxes in a grid to make a space, it’s naturally scalable horizontally; no having to migrate to a bigger box when the first runs out of capacity.
A good Java developer can pick up JavaSpaces in a matter of hours, and be proficient in days.
There are really only four verbs (actions) to learn:
put - Put something into the space (called write in the JavaSpaces API itself)
notify - Tell me when a thing I am interested in is put into the space
get - Retrieve something from the space to be processed, i.e. nobody else can then retrieve or read it (called take in the JavaSpaces API itself)
read - As for get, but a copy stays in the space for others to read, or get
As you would expect, there are subtleties that extend this, such as reading something only if it exists but, like all paradigms, these follow as your needs become more sophisticated.
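To make the four verbs concrete, here is a minimal sketch of a toy in-memory space in plain Java. It is not the JavaSpaces or Gigaspaces API: the class ToySpace, the method notifyWhen and the use of predicates as matching templates are all assumptions made for illustration, and a real space adds distribution, leases and transactions on top.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Toy in-memory space for illustration only; not the JavaSpaces or Gigaspaces API.
class ToySpace<T> {
    private final List<T> entries = new ArrayList<>();
    private final List<Predicate<T>> templates = new ArrayList<>();
    private final List<Consumer<T>> callbacks = new ArrayList<>();

    // put: place an entry in the space and tell any interested listeners.
    public synchronized void put(T entry) {
        entries.add(entry);
        for (int i = 0; i < templates.size(); i++) {
            if (templates.get(i).test(entry)) {
                callbacks.get(i).accept(entry);
            }
        }
    }

    // notify: register interest in future entries that match a template.
    // (Named notifyWhen here to avoid confusion with Object.notify().)
    public synchronized void notifyWhen(Predicate<T> template, Consumer<T> callback) {
        templates.add(template);
        callbacks.add(callback);
    }

    // get: remove and return the first match, so nobody else can get or read it.
    public synchronized Optional<T> get(Predicate<T> template) {
        Iterator<T> it = entries.iterator();
        while (it.hasNext()) {
            T entry = it.next();
            if (template.test(entry)) {
                it.remove();
                return Optional.of(entry);
            }
        }
        return Optional.empty();
    }

    // read: return the first match but leave it in the space for others.
    public synchronized Optional<T> read(Predicate<T> template) {
        return entries.stream().filter(template).findFirst();
    }

    public static void main(String[] args) {
        ToySpace<String> space = new ToySpace<>();
        Predicate<String> isOrder = e -> e.startsWith("order");
        space.notifyWhen(isOrder, e -> System.out.println("notified: " + e));
        space.put("order-1");   // matches the template, so the listener fires
        space.put("note-1");    // does not match, so no notification
        System.out.println(space.read(isOrder).isPresent()); // true: read leaves it in place
        System.out.println(space.get(isOrder).isPresent());  // true: get removes it
        System.out.println(space.read(isOrder).isPresent()); // false: it has gone
    }
}
```

The only point of the sketch is the difference between read and get: after a read the entry is still available to everyone, after a get it belongs to whoever retrieved it.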
What is important to understand is that, although it has many container properties, a space is not like an application server. Entries in it are static. They are just objects floating in a highly-available pool. They can be both data and code, but they only come to life when you get them out and activate them.
One of my earliest illuminating moments was when I first discovered that if something is removed from the space by a process which then dies, the entry reappears (provided the removal was made under a transaction, which is rolled back when the process fails to complete it). You don’t lose it. If you so wish, you can also set an entry to remove itself after a timeout (essentially a lease). The underlying JINI stuff is so cool it deserves a chapter all to itself.
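The reappearing entry can be sketched in a few lines of plain Java. This is a toy model, not the JINI transaction API: the class RecoverableSpace and its methods beginGet, commit and abort are hypothetical names, and here the abort is called by hand where a real space would roll back when the dead process’s transaction is aborted or times out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy model of "get under a transaction"; all names are illustrative assumptions.
class RecoverableSpace {
    private final Deque<String> entries = new ArrayDeque<>();
    private final Map<Integer, String> inFlight = new HashMap<>();
    private int nextTxn = 1;

    public synchronized void put(String entry) {
        entries.addLast(entry);
    }

    // Remove an entry provisionally; returns a transaction id, or -1 if the space is empty.
    public synchronized int beginGet() {
        String entry = entries.pollFirst();
        if (entry == null) {
            return -1;
        }
        int txn = nextTxn++;
        inFlight.put(txn, entry);
        return txn;
    }

    // The entry held by an open transaction, or null if the id is unknown.
    public synchronized String entryFor(int txn) {
        return inFlight.get(txn);
    }

    // Work finished: the removal becomes permanent.
    public synchronized void commit(int txn) {
        inFlight.remove(txn);
    }

    // Worker died or gave up: the entry goes back into the space.
    public synchronized void abort(int txn) {
        String entry = inFlight.remove(txn);
        if (entry != null) {
            entries.addFirst(entry);
        }
    }

    public synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        RecoverableSpace space = new RecoverableSpace();
        space.put("order-42");

        int txn = space.beginGet();
        System.out.println("visible while being processed: " + space.size()); // 0
        space.abort(txn); // simulate the worker dying before it finishes
        System.out.println("visible after the failure: " + space.size());     // 1

        int retry = space.beginGet();
        space.commit(retry); // a second worker completes the job
        System.out.println("visible after commit: " + space.size());          // 0
    }
}
```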
Generically, JavaSpaces follows the Blackboard Pattern which is a clean and elegant one to work with. It makes sense even to techno-phobic business people.
Just use the analogy of a bunch of business specialists standing around a blackboard taking turns to solve a problem - the “data” is on the board and the end result reflects the co-operation of the parties, which is better than if only one of them had worked on it.
A clear model leads to better communications. If you have open lines to your business sponsor you have a much better chance of delivering what they really want in an agile manner.
You can easily try the basics on your desktop. There’s a great open source implementation of JavaSpaces in Blitz. We needed something a little bit stronger in terms of clustering and operational support and making friends with our operational people was key to making the solution supportable. Gigaspaces has this.
Nov 2007 Update: Gigaspaces recently announced a start-up programme, which allows businesses to get access to the technology for free and continue to do so whilst company revenue stays under an agreed limit.
Once you start to see the world in terms of events and spaces, and not just services, many more liberating things become possible. There will be more on this later as I’m working on something of my own in this space (pardon the pun) right now.
Of course the final architecture adjustments are a little trickier, but then aren’t they always? I would certainly advise playing with some simple set-ups first just to get the feel of it, and thoroughly researching the various patterns available.
We chose a fairly straightforward master-worker approach: a master process collects submitted orders from the sales channel (in this case the web, but as the master is simply the provider of an order collection service, there is scope for other channel support) and places them into a space, and a host of workers go to... er... work on them.
Workers can be at whatever granularity you need (credit check, stock management, status update, etc) and can operate in the sequence you need them to (this is because it’s simple to tell each worker only to look for orders that match a certain pattern in the space). Getting orders in and out of the space safely is part of the API, and when you need to scale you just add another box to the space and off you go.
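As a rough sketch of how pattern matching gives you sequencing, the toy below has a master that submits orders and workers that each look only for orders at one stage, moving them on to the next. The stage names, the OrderPipeline class and the single-threaded in-memory list standing in for the space are all assumptions for illustration, not our production design.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy master-worker pipeline; names and stages are illustrative assumptions.
class OrderPipeline {
    enum Stage { NEW, CREDIT_CHECKED, STOCK_RESERVED }

    static final class Order {
        final String id;
        final Stage stage;
        Order(String id, Stage stage) {
            this.id = id;
            this.stage = stage;
        }
    }

    // Stand-in for the space: a shared pool of orders.
    private final List<Order> space = new ArrayList<>();

    // Master: collect an order from the sales channel and place it in the space.
    void submit(String id) {
        space.add(new Order(id, Stage.NEW));
    }

    // Take the first order at the wanted stage out of the space, or null if none matches.
    private Order take(Stage wanted) {
        Iterator<Order> it = space.iterator();
        while (it.hasNext()) {
            Order order = it.next();
            if (order.stage == wanted) {
                it.remove();
                return order;
            }
        }
        return null;
    }

    // Worker pass: take every order at 'from', do the work, put it back at 'to'.
    int runWorker(Stage from, Stage to) {
        if (from == to) {
            throw new IllegalArgumentException("a worker must move orders to a new stage");
        }
        int handled = 0;
        Order order;
        while ((order = take(from)) != null) {
            // The real credit check or stock reservation would happen here.
            space.add(new Order(order.id, to));
            handled++;
        }
        return handled;
    }

    long countAt(Stage stage) {
        return space.stream().filter(o -> o.stage == stage).count();
    }

    public static void main(String[] args) {
        OrderPipeline pipeline = new OrderPipeline();
        pipeline.submit("A");
        pipeline.submit("B");
        // The stock worker runs first but finds nothing matching its pattern yet.
        System.out.println(pipeline.runWorker(Stage.CREDIT_CHECKED, Stage.STOCK_RESERVED)); // 0
        System.out.println(pipeline.runWorker(Stage.NEW, Stage.CREDIT_CHECKED));            // 2
        System.out.println(pipeline.runWorker(Stage.CREDIT_CHECKED, Stage.STOCK_RESERVED)); // 2
        System.out.println(pipeline.countAt(Stage.STOCK_RESERVED));                         // 2
    }
}
```

Nothing tells the stock worker to wait for the credit worker; it simply finds no matching orders until the credit check has been done, which is all the sequencing the design needs.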
We assumed that legacy systems would be unavailable, rather than available, and designed the process accordingly (avoiding the common ‘fallacy of order management’ computing).
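In code terms, assuming unavailability means a worker treats a failed legacy call as normal and leaves the order where it can be retried. The sketch below shows the idea with a hypothetical OutageTolerantWorker; the legacy system is modelled as a simple predicate that returns false or throws when it is down, which is an assumption for illustration rather than how our integration actually looked.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Toy worker that expects the legacy system to be down; names are illustrative assumptions.
class OutageTolerantWorker {

    // Try each pending order once. Orders the legacy system cannot accept
    // stay pending for a later pass. Returns the number completed.
    static int drain(Deque<String> pending, Predicate<String> legacyCall) {
        int completed = 0;
        int toTry = pending.size();
        for (int i = 0; i < toTry; i++) {
            String order = pending.pollFirst();
            boolean ok;
            try {
                ok = legacyCall.test(order);
            } catch (RuntimeException e) {
                ok = false; // an outage is an expected outcome, not a fatal error
            }
            if (ok) {
                completed++;
            } else {
                pending.addLast(order); // keep the order safe for retry
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        Deque<String> pending = new ArrayDeque<>(List.of("order-1", "order-2"));

        Predicate<String> legacyDown = o -> {
            throw new IllegalStateException("legacy system down");
        };
        System.out.println(drain(pending, legacyDown) + " done, " + pending.size() + " held");

        Predicate<String> legacyUp = o -> true;
        System.out.println(drain(pending, legacyUp) + " done, " + pending.size() + " held");
    }
}
```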
There were technical challenges along the way, but nothing we couldn’t handle (if I had my time again I would have introduced the ideas and the training much earlier). The biggest challenge by far was not technical at all: it was dealing with the few developers who were against trying anything seen as ‘new’ or ‘different’ (depressingly common in teams with an entrenched love of their own suffering).
But would I do it all again? Certainly. Despite project and political annoyances, which, let’s face it, are prevalent everywhere, it became a raging commercial success. I can’t quote figures, naturally, but the development paid for itself very quickly, and went on to support a sizeable percentage of the company’s business without any downtime.
Famously, when a legacy system did go down for a short time, all the orders that otherwise would have been in jeopardy were held safely and restarted later. And whilst the architecture may only have played a supporting role, the site went on to win best telecom web site of 2006 as voted on by customers.
Not a bad day’s work, all in all.
The views in this article are entirely my own and not those of Virgin Mobile or Gigaspaces
Some commercially-sensitive details have been withheld