The rise and rise of the NoSQL Movement
By Julian Browne on November 3, 2009. Filed Under architecture, business, development
About four years ago I sat in a meeting that had finished early. We were chatting away and the subject turned, as it always does, to the lamentable state of IT. In the preceding weeks I'd asked finance to run me off a number of reports showing just what we were spending on various aspects of our integration architecture. They made pretty scary reading. Wherever we had invested significant funds to improve and mature our infrastructure we were now spending significantly more and taking longer per-project even for minor changes. So this wasn't one of those daily, common-or-garden, grumbles about the poor IT people being put upon by nasty Mr Business. We had facts. Cold financial facts that said we were starting to suck on a grand scale.
The reason I mention the meeting is that two ideas came together that day that changed my mind about what a good architecture looks like. And, more importantly, how good architecture can be delivered quickly. If I'm being honest I didn't really know what to do with the output of that meeting. I knew that we'd uncovered a major cause of architectural rot and that there were things we could do to work around it, but not what the 'right' answer was. So it's cheered me up no end to discover over the past year that a bunch of people came to the same conclusions as we did, but did know what to do about it.
And they went and built real things to address the problem. These people are the pioneers of the NoSQL Movement and they represent probably the most important change in architectural thought for ten years (notes).
Before we get to that, let's go back to some basics about relational databases.
Relational databases are really kind of beautiful. Tables with rows and keys and indexes are an elegant form within which to store your data. Using those keys to model relationships, and SQL to retrieve data is, relatively speaking, a simple process. Nearly everything we've ever asked of the RDBMS has been delivered over the years with impressive levels of performance and transactional integrity. For many, many applications they are a very good fit. This site is driven by a relational database and I have no plans to change that. But animate the data through any kind of workflow process that is subject to change over time and we start to encounter problems.
Businesses do not stand still and data doesn't make any money. Logic applied to data does. On the whole code is pretty easy to change if it's well-written. I know I am making a broad assumption there, but let's agree that good code is at least possible, even if we don't see it very often. All businesses that I have been involved with over the last ten or fifteen years have been attempting a similar feat: to make more money out of maturing and stable markets. Consumers pretty much consume TV subscriptions, mobile phones, energy, retail goods, food, etc. in predictable ways and competing on price alone is a very hard game to play.
In the trading world they use a horrible term to describe one solution to this: optionality. Optionality means selling conventional stuff, in conventional ways, to conventional markets but with enough options in the sourcing and distribution to allow for speculative activity, thereby increasing margins without focusing only on price. For example, we might contract to sell orange juice to a supermarket chain at a predetermined price, and in quantities responsive to their demands, but reserve the right to source and deliver it according to our business plan i.e. buy it from wherever we choose, blend it from multiple sources, buy in bulk and store it when the price is high, sell any excess to other vendors, and so on.
This has a dramatic effect on the business logic and the data model - both need to be able to change quickly and change often in hard-to-predict ways. Relational databases suck at this. If we change the table structure and/or relationships we have to change all corresponding SQL and any code that knows about it. This suckiness is one of the reasons that Object-Relational Mapping has become so popular. And though things are much better now, ORMs haven't been without their problems too. The Ruby on Rails ORM - ActiveRecord - illustrates the relational abstraction very well.
If in the first instance our business fulfils orange juice Orders by matching one Supplier to one Customer our Order model would look like this:
    class Order < ActiveRecord::Base
      belongs_to :supplier
      belongs_to :customer
    end
For any instance of an Order object we can find the supplier and customer with code like this:
    my_order = Order.find(42)
    my_customer = my_order.customer # returns a Customer object
    my_supplier = my_order.supplier # returns a Supplier object
In case you are not familiar with Rails, as long as conventions are followed - there are populated tables called 'orders', 'customers' and 'suppliers' with the required 'customer_id' and 'supplier_id' in 'orders' - all of this will work fine without any SQL or other code at all.
Now let's say the business wants to invoke some optionality (yuck) and source multiple orange juice suppliers per order (e.g. because they are cheaper but can only supply smaller quantities). How easy is it to change the model?
    class Order < ActiveRecord::Base
      has_and_belongs_to_many :suppliers
      belongs_to :customer
    end
and to retrieve the data:
    my_order = Order.find(42)
    my_customer = my_order.customer # returns a Customer object
    my_suppliers = my_order.suppliers # returns an array of Supplier objects
There would need to be a join table called 'orders_suppliers' containing columns for 'order_id' and 'supplier_id' but still no SQL and very little change to the code.
Marvelous. We can respond rapidly to a business demand. Good architecture, quickly. Well, sort of. It's better architecture but perhaps we should see how it responds to more complex changes first. After all, if the business need is to change often then it's our ability to match that tempo that will earn us our 'good architecture' badge.
And talking of tempo, a word about Agile in all this. I've always been a little uncomfortable with the strength with which Agilists cling to the first of the Agile Manifesto values:
Individuals and interactions over processes and tools
Dealing and communicating with people is, of course, just about the most important thing to get right on any project, but this software thing can be difficult. Problems can come up that, even when communicated well, are just not palatable. In my experience, whilst the business is grateful for transparent communications with no surprises, they're also surprisingly keen that we justify our salaries by showing some expertise in the areas we profess to be expert in, namely processes and tools.
In a way I'd rather the agile manifesto tag line was re-written to say:
That is, while there is value in the items on the right, we value the items on the left more.
Well, actually no. There's actually quite a bit of value on both sides, but honestly we recognise that you (the business) value the things on the left more so we'll conduct ourselves on that basis. Even though to do a good job for you we really have to value the things on the right pretty highly. Tell you what: we'll hide all the things on the right from you by working really hard so that all you'll ever see are the things on the left. Deal?
Seriously though, there was a lot more time to browse the web and read XKCD cartoons when all we had to do was lock you in a room with a bunch of business analysts to produce a requirement spec for us to get picky about six months later.
The ORM in Rails is good, but not perfect. Although it simplifies our code and allows us to change that quickly it hasn't done anything about the data underneath. Even in the simple example above there's quite a bit of data restructuring and migration to do behind the scenes. A major change could easily introduce significant knock-on effects that would impact our ability to match business expectations. Effects that we could communicate openly, but it would be better if we didn't need to. In short, it would be better if we could blend the code and the storage and do away with the ORM altogether. One less technical inconsistency, one more feather in the agile cap.
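To make that behind-the-scenes work concrete, here is a rough sketch in plain Ruby of the reshaping needed when an order moves from one supplier to many. The arrays of hashes are hypothetical stand-ins for table rows, and the method names are mine rather than anything Rails provides. A real migration would run against the database, not against in-memory data.

```ruby
# Hypothetical stand-ins for rows in the 'orders' table under the
# original one-supplier-per-order structure.
old_orders = [
  { id: 42, customer_id: 7, supplier_id: 3 },
  { id: 43, customer_id: 8, supplier_id: 5 }
]

# Step 1: derive one row per order for the new 'orders_suppliers' join table.
def build_join_rows(orders)
  orders.map { |o| { order_id: o[:id], supplier_id: o[:supplier_id] } }
end

# Step 2: produce order rows without the now-redundant supplier_id column.
def strip_supplier(orders)
  orders.map { |o| o.reject { |key, _| key == :supplier_id } }
end

orders_suppliers = build_join_rows(old_orders)
new_orders = strip_supplier(old_orders)
```

Even for this trivial change every existing row has to be visited and rewritten, and the two steps have to happen together if the data is to stay consistent.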
The change inhibitor is the schema.
A schema is both a formal model of the data and a kind of insurance policy that naughty code won't put incorrect things where they shouldn't be (nulls for example). If your business needs are relatively static then this schema is your friend. It sits below the code like a burly nightclub bouncer, checking everything coming in, maintaining data quality and things like referential integrity. But the example we're looking at might require a schema change for every new order structure, plus the ability to maintain backward compatibility with previous schemas/orders. That's a tough thing for our bouncer to achieve. Unless we alter the schema so that all data types are just blobs, decipherable only at runtime. That is we move the data control into the code and, in effect, remove the schema. And when you think about it this isn't as big a leap as it sounds - even with data consistency checking in the schema our code still had to deal with the times when the bouncer wouldn't let something in (e.g. because there was a null heading for a not-null cell). But what would a database that allows this kind of extensible duck-typed model be? One thing's for sure - you wouldn't want to run SQL against it.
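As a minimal sketch of what moving the bouncer into the code might look like, the following plain Ruby stores every record as an opaque JSON blob and applies the data check in application code at write time. `BlobStore` and `valid_order?` are hypothetical names invented for illustration, not part of any product mentioned here.

```ruby
require 'json'

# A hypothetical schema-less store: each record is an opaque JSON blob keyed
# by id. The store knows nothing about the fields inside a record.
class BlobStore
  def initialize
    @blobs = {}
  end

  def put(id, record)
    @blobs[id] = JSON.generate(record)
  end

  # Returns the decoded record, or nil if the id is unknown.
  def get(id)
    blob = @blobs[id]
    blob && JSON.parse(blob)
  end
end

# The bouncer now lives in code: an order needs a customer and at least
# one supplier, and nothing else is enforced.
def valid_order?(order)
  !order['customer'].nil? && Array(order['suppliers']).any?
end

store = BlobStore.new

old_style = { 'customer' => 'Acme Supermarkets', 'suppliers' => ['Grove Farms'] }
new_style = { 'customer' => 'Acme Supermarkets',
              'suppliers' => ['Grove Farms', 'Sunny Co-op'],
              'blend' => { 'Grove Farms' => 0.6, 'Sunny Co-op' => 0.4 } }

[[42, old_style], [43, new_style]].each do |id, order|
  store.put(id, order) if valid_order?(order)
end
```

Both order shapes sit side by side in the same store, so backward compatibility with older orders becomes a matter of what the reading code is prepared to handle.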
So, on this very theme, I'd been thinking that staying out of databases for certain (changeable, complex) parts of our model seemed like a good agile-enabling idea. Compute grids and the Space-based Architecture are good ways to implement this, avoiding unnecessary architectural tiers and only dropping out to the relational database when you have to.
It was while we were kicking these ideas around that Neil Wilkinson, an Enterprise Architect at Data Systems & Solutions (mentioned before here as one of the originators of Role-driven SOA), injected the idea of documents. My memory of the moment is that he said it like Mr. McGuire said the word "plastics" to Benjamin Braddock in the 1967 movie The Graduate (a quote which made number 42 on the one hundred top movie quotes of the last hundred years) though it may just be my brain editing events for dramatic effect. Neil's idea, covered to some extent in Role Models & Services, was simply that businesses are very used to dealing in documents. They like them because they can be passed around (workflow), annotated (audit trail) and are so flexible that they can cater for two instances of the same document type being treated entirely differently (prototype-based inheritance). If we could develop an alternative architecture that was document-based and not based on relational tables we might very well have something.
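The three properties of a document can be sketched in a few lines of plain Ruby. This is an illustration only, not what we built at the time. The field names, statuses and helper methods are all hypothetical.

```ruby
# A hypothetical order document as a plain Hash.
def new_order_document(customer)
  { type: 'order', customer: customer, status: 'draft', annotations: [] }
end

# Annotating appends to the audit trail rather than overwriting anything.
def annotate(doc, author, note)
  doc[:annotations] << { author: author, note: note }
  doc
end

# Passing the document along a workflow is a status change plus a note.
def hand_over(doc, to_status, author)
  annotate(doc, author, "moved from #{doc[:status]} to #{to_status}")
  doc[:status] = to_status
  doc
end

standard = new_order_document('Acme Supermarkets')
standard[:supplier] = 'Grove Farms'

# Same document type, different shape: this one carries a list of suppliers
# and a storage instruction that the first order has no notion of.
blended = new_order_document('Acme Supermarkets')
blended[:suppliers] = ['Grove Farms', 'Sunny Co-op']
blended[:store_until] = '2010-01-31'

hand_over(blended, 'approved', 'trading desk')
```

Neither order had to be declared up front. Each document simply carries whatever its own deal needs, along with its own history.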
We did develop a few good ideas, including what I now see was an inadvertent reinvention of REST (almost to the point of saying "hey if only we could do REpresentational State Transfer..") but we weren't a start-up, we were a business in need of a product and we had day jobs that involved a million other things.
The principle though is crucial. I'm a big fan of Domain-Driven Design, which, if you remove all the practices and foofaraw, is all about creating highly expressive models, models that stretch from requirements to code without creating an impedance mismatch between the business and the developers. It's an unavoidable fact that relational models do not sit well in this world, being beautiful technically but somewhat jarring when first introduced. Requirements to documents, to document-like objects, to persisted document-like entities, though, is pretty seamless. And if it's easy to model and easy to build then it's also much easier to change in the transparent agile way we're after.
A small change to the business model will lead to a small (i.e. quick) change to the code and even a big change to the business model will require a big, but natural and transparent and obvious, change to the implementation.
I'm not going to expound upon the relative technical merits of the various NoSQL products for a number of reasons. I don't think it's possible today to do a fair comparative assessment. There are a lot of products out there, at various stages of development, and their respective owners are in that difficult phase of simultaneously promoting, presenting, funding, designing and leading their developments. Whatever fault I may find in one product could be rectified within a few weeks and whatever scenario I tested any against could be addressed with a config tweak I am not aware of. This is not to say the leading products are bleeding edge or unstable. I've played around with most of them and all were eminently fit for at least a first-pass corporate adoption.
Product owners, being product owners, will always argue over whose philosophy and implementation is superior: key-value, graph, document. In choosing one over the others you have to spend a lot of time in that "it depends" kind of argument that checks the model of what you need against how the product handles that kind of incremental change. A good place to start is Seth Chisamore's summary of the recent NoSQL East conference.
To find out more it's worth taking a look at how these different products solve some of the issues experienced with relational semantics:
neo4j, captained by the witty globetrotting Emil Eifrem. The tweet of mine he used in one of his presentations made me think of expanding what I said into this article. I confess to a slight bias for the graph-based approach - allowing relationships to have attributes is such a powerful concept.
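To show why attributes on relationships appeal to me, here is a toy in-memory graph in plain Ruby. It is emphatically not the neo4j API. `TinyGraph`, the relationship type and the figures are all made up for illustration.

```ruby
# A toy graph: nodes are plain symbols and each relationship carries its
# own attribute Hash.
Relationship = Struct.new(:from, :type, :to, :attributes)

class TinyGraph
  def initialize
    @relationships = []
  end

  def relate(from, type, to, attributes = {})
    @relationships << Relationship.new(from, type, to, attributes)
  end

  # All relationships of a given type leaving a node.
  def outgoing(from, type)
    @relationships.select { |r| r.from == from && r.type == type }
  end
end

graph = TinyGraph.new
graph.relate(:order_42, :SOURCED_FROM, :grove_farms, { litres: 6000, price_per_litre: 0.41 })
graph.relate(:order_42, :SOURCED_FROM, :sunny_coop, { litres: 4000, price_per_litre: 0.38 })

# The quantities live on the relationships, not on the order or the supplier.
total_litres = graph.outgoing(:order_42, :SOURCED_FROM).sum { |r| r.attributes[:litres] }
```

In a relational model those quantities and prices would need extra columns on a join table. Here they sit on the relationship they describe.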
Be aware that there are lots of products in this area, though I think the competition is a good thing.
It's November now and time to start thinking about next year's portfolio. A wise architect would have an assessment of these products somewhere on their road-map. Not to replace relational databases, but to augment them in areas of operation where this kind of hard-to-manage rapid change is expected. The No-SQL movement was recently clarified as in fact meaning N-O-SQL (Not "no SQL", merely "not only SQL") - a healthy step which should go some way to avoid the ridiculous counterclaims that these products are attempting to usurp the position of the RDBMS. Anyone exploring the world of non-relational approaches will quickly see that there isn't one alternative but a spectrum, each with subtle characteristics that may or may not hit the spot for your particular needs.
Lastly, much has been made of how NOSQL products help deal with the kind of scale issues thrown up by CAP Theorem. One of the conclusions of that article was that isolation to maintain consistency, as part of the relational approach, directly impedes your ability to scale or maintain availability at high transaction rates. It might be tempting then to dismiss all this because very few of us have this kind of scaling problem to solve. What sometimes gets lost in the NOSQL message is that it's not only about scale - even at low transactional volumes, if your plans remain centred on relational databases you will find it harder to keep up with business change too.
Actually many of the concepts aren't new. MUMPS which dates from the 1960s has some of the features noted here. I know this because one of my first jobs was on a MUMPS system.
This article has a follow-up entitled Freedom from the Tyranny of Schemas