Freedom from the Tyranny of Schemas

An absence of obstacles to the realisation of desires

July 30, 2011 14 minute read

oatmeal cartoon — Don’t get too carried away
with your NoSQL Product

Time flies - it was nearly two years ago that I wrote Strained Relationships, an article extolling the potential benefits of NoSQL data stores. My main point then, and now, was that certain features of the new wave of non-relational products looked like a promising (partial) solution to improving speed-of-change in large enterprises. Sadly, too many articles in the NoSQL space still focus their attention on drooling fanboi speed. Whilst it’s true that NoSQL products are generally faster than their relational cousins (as long as you are prepared to shift designs and mind-set accordingly), we aren’t all Twitter. Plus there are many ways to solve the latency problem; NoSQL is just one of the newer ones.

Anyway, my conclusion, way back in the heady days of 2009, was that 2010 felt like a good year to start piloting NoSQL, even in big, recalcitrant, corporate IT.

Two years later, this is an update to that.

I took my own advice and did a pilot. Three in fact. And I spent a lot of time getting my head around the nuances of various NoSQL offerings. It’s been an interesting period. A few months ago one of those pilots turned into a commercial project. A project which is now heading for production in an enterprise organisation. And the momentum of that project begat another smaller project which is already in production. So it’s all very real now.

This update comes in two parts. The first is a direct follow-up to my last post on the whole subject of NoSQL - a sort of recap on the journey taken to get a NoSQL product accepted. The second, NoSQL in the Enterprise, is a more direct summary of the product chosen - MongoDB. I’ve not found much in the way of plain-speaking unbiased accounts of NoSQL products for corporate use and even fewer that describe what it’s like to baby-sit one from the wild-and-crazy moment where you suggest going ‘non-relational’ to actually building something with one.

So let’s wipe away the drool, put our enterprise hats on, and see about filling the gap, in some form that’s plain-speaking and unbiased, of course.

We’ll begin where all good stories start - at the beginning.

NoSQL: How would you know?

How the heck might one come to the conclusion that a NoSQL product is in one’s future?

I can’t talk much about the problem I am working on today (though I hope to share more later) but for obvious reasons it’s really not worth embarking on a journey with any NoSQL product unless you are solving a problem that requires one. The good news is that NoSQL solutions have a broad range of applications. Anyone who says that NoSQL is only for niche, cache-centric, high-end, web-scale use is wrong. Correspondingly, anyone who thinks going NoSQL automatically makes them web-scale is wrong too.

NoSQL products are just data stores with properties you can’t easily (or don’t) get in the RDBMS world. To a large extent they make explicit the trade-offs described by the CAP theorem. But, let’s be clear before we start, if your problem stacks up against the relational model and you can manage data change ok, then absolutely stick with what you’ve got. How the heck might one come to the conclusion that a NoSQL product isn’t in one’s future? Easy - you’re ok with what you have. But that may not always be the case: things change and you might want new tools to deal with new problems. So humour me for a moment and see if any of this rings a bell.

NoSQL products are not toys that will go away sometime soon. Most of them are quite mature now. The trick, as with any tool, is to know what you’re getting into. I think I experienced three phases of NoSQL enlightenment. The first was sensing we had a business problem that could be more easily managed if we didn’t put a schema in the driving seat. The second was realising that there was a viable way to design our applications so that we wouldn’t accidentally stumble into hard problems later. The last was working out how to explain (or sell the idea of) the whole caboodle to anyone who hadn’t passed through phases I and II.

Phase I : Smells like NoSQL

Realistic Scenario A - You probably aren’t Google (if you were, you wouldn’t be looking for advice here). Your transactions are of various types, but for the greater part you probably only require levels of latency that are just-good-enough. Low enough to keep customers interested, but probably not so low you need to fight over every millisecond. You probably want to leave the door open to scale without too many headaches, but that probably doesn’t mean you have to think in terms of tens of thousands of transactions per second.

So you probably wouldn’t think to consider NoSQL. And you may be right.

But low latency isn’t a bad thing. End users love it. Your business probably complains about the latency of at least some parts of your architecture. You might also feel sometimes that you should take a closer look at your transactional requirements just to see if there are better ways to handle them without so much overhead.

Realistic Scenario B - You probably aren’t Amazon. You may have only slight variations over time in your core business entities of customers, products, etc. You may be able to manage data change as it ripples through the code in affected systems.

So you probably don’t need to respond to new product launches in weeks rather than months or years. And you may be right.

But most data changes represent pain you don’t need. And it’s nigh on impossible, in most businesses, to define the One True Schema to meet future requirements. Is the schema you have limiting your ability to add new functionality? Are the business able to understand and accept that? If it’s easy for them to say “just make it do this extra thing”, then maybe it should be easy for you to make the change.

We fell mostly into scenario B, though I can’t pretend the promise of low latency went unnoticed. For one thing, the faster you can respond to business queries generally, the less likely you are to have to distribute at all, which means you can remain compliant with Martin Fowler’s First Law of Distributed Objects:

Don’t distribute your objects

Opening up to a different way of solving problems is a challenging step for most of us. But NoSQL is not about decommissioning your Oracle and MySQL databases. It’s about having more options to solve application data persistence.

Saying you will only develop software based on an RDBMS is the same as saying you want one tool to do many jobs. That’s as stupid as saying you’ll adopt NoSQL “because it’s fast”, or “because there’s no schema”. Both of these statements are ridiculous and incorrect. It’s actually quite easy to make a NoSQL solution run slower than dirt. And NoSQL doesn’t mean no schema. Everything’s got a schema - try storing 20,000 records about books with each one structured in a different way. NoSQL just means your schema presents itself in alternative ways, or alternative places.
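To make the "schema in alternative places" point concrete, here is a minimal sketch in Python of a schema that lives in application code rather than in table definitions. Everything in it (the field names, the validate_book helper, the sample records) is hypothetical and invented for illustration; it isn’t taken from any particular NoSQL product.

```python
# Hypothetical example: the store is schemaless, so the rules about what a
# "book" must contain live in the application instead of in a table definition.
REQUIRED_FIELDS = {"title": str, "authors": list}


def validate_book(doc):
    """Return a list of problems; an empty list means the document is usable."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


# Two records with different shapes can sit side by side...
old_style = {"title": "Example Book", "authors": ["A. Writer"]}
new_style = {"title": "Example Book", "authors": ["A. Writer"], "format": "ebook"}
# ...but a record that ignores the implicit schema is still a problem.
broken = {"name": "Example Book"}

for record in (old_style, new_style, broken):
    print(validate_book(record))
```

Extra fields pass untouched, which is where the room to breathe comes from. Missing core fields are still caught, just by your code rather than by the database.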

Think of it as removing the database table structure from around the rows you already understand and giving them a bit more room to breathe.

If that doesn’t feel too scary then the next question is how you will fill the responsibility gap left by removing those tables.

Phase II : Feels like NoSQL

So you have a series of issues that could do with a different approach and you need to understand what that different approach might be. This is the inside-out to outside-in switch.

To build around an RDBMS you would typically start by defining some entities. And you’d represent these entities as tables. As you normalise these entities more tables spring up. Some of these will be sub-entities and some will be tables to link entities to related entities. At the end you have a general-purpose schema that represents your data domain. Because your query language is fixed (SQL) you might try a few tests and further optimise your schema (sometimes by denormalising it because joins are computationally quite expensive).
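As a rough illustration of that inside-out process, here is a small sketch using Python’s built-in sqlite3 module. The tables, columns and sample rows are all assumptions made up for the example: one entity table each for customers and products, a sub-entity table for addresses, and a link table between customers and products.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Entities
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Sub-entity
    CREATE TABLE address (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        city TEXT NOT NULL
    );
    -- Link table relating entities to entities
    CREATE TABLE customer_product (
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        product_id  INTEGER NOT NULL REFERENCES product(id),
        PRIMARY KEY (customer_id, product_id)
    );
    INSERT INTO customer VALUES (1, 'Ada');
    INSERT INTO address  VALUES (1, 1, 'Leeds');
    INSERT INTO product  VALUES (1, 'Loan'), (2, 'Savings');
    INSERT INTO customer_product VALUES (1, 1), (1, 2);
""")

# Answering "show me this customer" means joining the pieces back together.
rows = conn.execute("""
    SELECT c.name, a.city, p.name
    FROM customer c
    JOIN address a           ON a.customer_id = c.id
    JOIN customer_product cp ON cp.customer_id = c.id
    JOIN product p           ON p.id = cp.product_id
    WHERE c.id = ?
    ORDER BY p.name
""", (1,)).fetchall()

print(rows)
```

Even for one customer, the question the client actually asks takes three joins. That cost is what pushes people towards denormalising.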

Relational models are inside-out enterprisey things - they fulfil this role by being good at suiting the needs of the many by being slightly (or very) sub-optimal for the needs of all.

Optimising core structures for one client is bound to affect others, probably in bad ways. To get around this you might use views but they don’t really solve the problem. They merely add a layer of coupling that makes change harder later. Also, relational databases can be expensive to buy. And the free ones can be expensive to maintain. For genuine enterprise repositories, with hundreds of clients, all with different profiles, this may be about as good as you can get. It’s painful and slow but there’s no magic wand. The only positive thing you can say is that the problems and the skills required are well understood.

For many applications this isn’t a compromise you have to make. Imagine you thought first not about data structures, but client convenience. What if you denormalised and restructured your data all the way?

Where does that end up? With data structures that exactly match the client’s transactional needs.

Want a customer record? It’s all there in one scoop. Want all the records that match a specific pattern? One scoop.

To define what ‘a scoop’ is you have to think outside-in: put data items that are needed together, together. You’d make your basic units of persisted data match the way clients interact with them - that is they would look like the end result of a query against normalised data, not the normalised data itself.
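Here is a minimal sketch of what one such scoop might look like, with a plain Python dict standing in for a document store. The document shape, the identifiers and the two helper functions are hypothetical; a real product would have its own API for the same idea.

```python
# Hypothetical document: shaped like the answer to "show me this customer",
# not like the normalised tables that answer would otherwise be built from.
customer_doc = {
    "_id": "cust-1",
    "name": "Ada",
    "addresses": [{"city": "Leeds"}],
    "products": [{"name": "Loan"}, {"name": "Savings"}],
}

# A plain dict keyed by document id stands in for the data store.
store = {customer_doc["_id"]: customer_doc}


def get_customer(customer_id):
    """One scoop: everything the client needs arrives in a single lookup."""
    return store[customer_id]


def customers_with_product(product_name):
    """All the records matching a pattern: still a single pass, no joins."""
    return [
        doc for doc in store.values()
        if any(p["name"] == product_name for p in doc["products"])
    ]


print(get_customer("cust-1"))
print(customers_with_product("Loan"))
```

The trade-off is that the document is tuned for this client. A different client with different access patterns may want a differently shaped scoop.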

Put another way: instead of designing your data around the fixed query language it uses on the inside, you design it to suit how it will be used on the outside. That gives the NoSQL moniker some logical basis, even if it lumps together a lot of different approaches: documents (MongoDB, CouchDB), key-values (Riak, Redis), columns (Cassandra, Hypertable) or a graph that allows clients to traverse relationships like a souped-up mind map (Neo4j).

It turns out that designing this way also makes querying super-flexible, because you can build quite complex queries when the data is simple. For the edge cases that ordinary queries can’t handle, or for large data sets, there’s usually something like map-reduce, though it’s not necessarily ideal for mainstream queries. CouchDB uses map-reduce for building views which makes it both flexible and fast (as long as you don’t need new views too frequently).
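For anyone who hasn’t met map-reduce, the idea can be sketched in a few lines of plain Python. This is an illustration of the technique only, not CouchDB’s API (CouchDB views are normally written as JavaScript functions), and the documents and function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical documents, each listing the products a customer holds.
docs = [
    {"name": "Ada", "products": ["Loan", "Savings"]},
    {"name": "Bob", "products": ["Loan"]},
]


def map_doc(doc):
    """Map step: emit one (key, value) pair per product held."""
    for product in doc["products"]:
        yield product, 1


def reduce_values(key, values):
    """Reduce step: collapse all values emitted for one key."""
    return sum(values)


def map_reduce(documents, mapper, reducer):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# The result plays the role of a "view": customers per product.
view = map_reduce(docs, map_doc, reduce_values)
print(view)
```

Because the view is computed from the documents rather than designed into them, you can add a new one later without restructuring the data, at the cost of building it.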

Transactions too can be mitigated. In the document model you can mostly design so that atomic operations are specific to one document. Occasionally you will need more complex transactions. There’s no denying that you’ll need to build something to handle that because NoSQL products are generally quite light on transactional semantics. For me the main lesson has been that you don’t always need them as much as you think you do and designing outside-in goes a long way to avoiding them.
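A sketch of what "atomic operations specific to one document" can mean in practice follows. TinyDocStore is a toy in-memory class written for this example, not the API of MongoDB or any other product, and the account document is hypothetical. The point is the design: the balance and the orders live in the same document, so a single-document update changes both or neither.

```python
import copy
import threading


class TinyDocStore:
    """Toy in-memory store (an assumption for illustration, not a real product)."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def insert(self, doc_id, doc):
        with self._lock:
            self._docs[doc_id] = copy.deepcopy(doc)

    def get(self, doc_id):
        with self._lock:
            return copy.deepcopy(self._docs[doc_id])

    def update_one(self, doc_id, change):
        """Apply `change` to one document as a single all-or-nothing step."""
        with self._lock:
            candidate = copy.deepcopy(self._docs[doc_id])
            change(candidate)  # if this raises, the stored document is untouched
            self._docs[doc_id] = candidate
            return copy.deepcopy(candidate)


def place_order(account):
    """Debit the balance and record the order in the same document."""
    if account["balance"] < 30:
        raise ValueError("insufficient funds")
    account["balance"] -= 30
    account["orders"].append({"item": "book", "price": 30})


store = TinyDocStore()
store.insert("acct-1", {"balance": 50, "orders": []})
result = store.update_one("acct-1", place_order)
print(result)
```

Had the orders lived in a separate document, the debit and the order record would need coordinating across two writes, which is exactly the kind of transaction you would then have to build yourself.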

What you need now is permission to try it out.

Phase III : Selling NoSQL

In order to put these ideas into practice you need to get buy-in for the concept of solving the problem non-relationally.

Promoting any one database solution is a weird thing to have to do. You don’t normally need to sell Oracle, for example. It might even be the ‘company standard’. Even if not, Oracle have an army of people busy selling it themselves. They may even be trying to sell it at the same time you are talking up your alternative. And they are good at selling. Creators of NoSQL products are not good at selling. At least not to corporates. They mostly interact with start-ups and internet/tech businesses, which is a different culture altogether. Heck, many NoSQL vendors are start-ups. Their products are open source.

Corporates care about brand perception, revenue, customers, and what their peers are doing, which means they want to feel that anything they do comes with a snuggly, comfortable, safety-blanket. They will pay Oracle handsomely for this. You can argue whether they get it or not but that’s beside the point.

In making the case for your NoSQL approach you have to remember to cover two aspects:

(a) What features you get with NoSQL that you don’t get with an RDBMS (and vice versa)

(b) How it’s going to ‘feel like Oracle’.

I know. Sorry about that, but if you don’t do (b) then it won’t matter whether you are right on (a) or not.

To cover (a) you only need to reiterate what you learned during phases I and II. Explain what issues you have now. Explain how an alternative change process might work. Explain how much faster it might be. Explain how that opens up opportunities even if speed isn’t a major issue. Explain how much less it’s all going to cost. Explain how long changes might take in future and why. And remember this: a sales person from a product vendor can only sell what their company makes. They have to twist your business requirements to fit their products. They have to make you believe that what you need is available from them out of the box. Explain that NoSQL products don’t actually do very much at all in comparison. They don’t profess to be able to solve all your problems and make you a cup of tea when they’re done. Your business requirements will remain intact, because there’s nothing to twist them to. They will either fit NoSQL or they won’t. If they don’t you’ll just do something else. And you won’t have wasted a lot of money. The problem with big license fees is that once the business has paid up, the money is seen as an investment to be used. Even when the investment is not well-matched to future projects.

To cover (b) you’re going to have to design it right and build it right. Then you’re going to have to replicate all of the softer features you get from a big name database, including all the post-launch support, monitoring, maintenance, etc. This is the snuggle blanket which I’ll cover, for MongoDB, in the next part.

And you have to prototype some features, so you can show something rather than just talk about it. Do not sell with slide-ware. For my current project I put together a reasonably simple example using Ruby, Sinatra and CouchDB. It didn’t take long, but it did prove the concept. It also became a useful focal point when discussing whether another NoSQL product might be a better fit. Nobody cared that it was in Ruby. They saw things that looked worth taking further. Enough that when a real project with funding came along there was momentum to support a proof of concept. In the early sprints we faced the hardest problems head-on, so that if it didn’t work we wouldn’t find out at the end.

In thinking about NoSQL choices before build started we switched from CouchDB as the main contender to MongoDB, not because it was technically superior (architecturally they’re chalk and cheese), just that the moving parts of the business problem better lined up with the way MongoDB operates.

The two best phases of a project are working out what to do and building it. I find launches a bit anti-climactic. You suddenly go from manically busy to having nothing to do. Spiking ideas in different tools is a lot of fun. Explaining what comes out of that process is great too.

But you know what they say: be careful what you wish for, because you might just get it.

If you are successful in making your case, then you haven’t actually won - all the risk lies ahead.

If you embark on a production build with any tool, you need to understand it. The next article is mostly about MongoDB. If you’re considering using it then it may be helpful. If not then the areas covered and the questions raised should apply to most NoSQL products.

Notes

The phrase ‘Freedom from the Tyranny of Schemas’ is not mine, but I do so wish it was. It’s actually from a YouTube comedy skit called ‘Hadoop and NoSQL Downfall’
The first image, as if you didn’t know, is by Matthew Inman, from The Oatmeal’s “What it’s like to own an Apple product”. You can buy it as a signed print or indeed many of the other works in the Oatmeal Shop
The image of the nose is by Thiago Costa made with the ArtRage package. Used with permission of the artist.
The hand painting, used with permission of the artist, is by Erik Irwin. You can see his more recent, and quite beautiful, figurative illustrations on his blog.
The salesman image is a repeat visit to this site by Graham McKean. It’s called “The Man Who Tried To Turn The Tide”. Apt when you’re trying to advance a technology unfamiliar to others. You can find his works for sale at Meridian Art.
A great source of knowledge on NoSQL products can be found on Alex Popescu’s site.
A lexicon of the staggering variety of NoSQL products can be found here
One thing to bear in mind is that when it comes to data stores, there are no silver bullets.
A funny, and short, Lightning Talk on NoSQL by and for skeptics.
The term One True Schema comes from a 2008 post by Ted Neward called “So You Say You Want to Kill XML”
The subtitle is a quote from Bertrand Russell. The full version is “Freedom in general may be defined as the absence of obstacles to the realisation of desires”