NoSQL in the Enterprise

.. or MongoDB for Architects

By Julian Browne on July 30, 2011. Filed Under architecture, development

Welcome to part two. Last time we looked at the experience of getting a NoSQL product accepted in an enterprise environment. Assuming you got through that, the next step is to do something useful with it. Like any tool, you will only get good stuff out if you know how to make the best of it. In this case that means not treating it too much like a relational database and understanding the internal nuances.

"Is this a schema I see before me?"

For our particular set of requirements we chose MongoDB. We tried Oracle first but either the data model became too unwieldy, which slowed down development, or we were looking at blobs, which lost us query flexibility. We tried CouchDB and these problems went away, but managing constant changes to query semantics wasn't quite as easy as we would have liked. However, it did appear we were on the right track - our business problem definitely had a NoSQL feel to it. Then, following a recommendation from Sean Reilly, who'd used it before, we gave MongoDB a whirl and everything fell into place. Ironically, from a communications perspective, that turned out to be harder to manage than I expected. MongoDB is becoming really popular and there's plenty of skepticism around NoSQL anyway. I knew a technical justification would be needed, but it was strange having to constantly make clear that we didn't make our choice because MongoDB is the new Lady Gaga.

I'm a little uncomfortable defending product choices too strongly. It can be perceived as outright promotion, which this isn't intended to be. Apart from being a satisfied customer I have nothing to do with MongoDB or 10gen (the company behind MongoDB) and I have particular admiration for CouchDB and Neo4J too. Actually, since you ask, I quite like the whole range of key-value, document, graph, and big-data tools. Part of the attraction is that the communities around them are mostly filled with interesting, funny, and knowledgeable people and this comes through in the products and vendor presentations. NoSQL is not a world of bitter FUD-spreading, not by the vendors anyway: 10gen compares MongoDB with CouchDB and Basho compares Riak with MongoDB in fair, balanced, technical terms.

Healthy rivalry in NoSQL makes a lot of sense because, underneath the excitement and confusing terminology, these tools are quite different. Frankly, the term 'NoSQL' is the thing I have an issue with. It isn't very helpful defining products by what they're not. But, though it's not always useful, there is some logic to the classification, as we saw last time when talking about the inside-out to outside-in switch. Emil Eifrem, founder of Neo4J, whose presentations tick all the boxes in the interesting, funny, and knowledgeable categories, said:

The problem in NoSQL is that it's not clearly defined. Well, that's not the problem, that's one of the challenges with NoSQL. People have varying views of this. Another challenge is that NoSQL is extremely hyped right now, so pretty much anyone wants to attach themselves to that term and it's also that it's defined by what it's not - it's not SQL. You could say "Hey, is this room NoSQL? - It doesn't support SQL"

Data persistence products are just tools. Relational databases are one category of them and I don't see what the problem is in having multiple tools at your disposal. Too many is too many, obviously, but one is clearly too few. NoSQL products are really pleasant to work with, more so if you've chosen the one that fits your needs properly. Building on a tool that's nice to use feels good. It eases the path to elegant and simple designs. It makes coming to work fun. What's not to like about that?

MongoDB: The Basics

MongoDB is a document-based data store, which means its unit of currency is a document. In MongoDB's case this means JSON documents (internally they are stored and transferred as BSON documents for efficiency), which look like this:

{
    "name"      : "Milton Waddams",
    "age"       : 42,
    "loves"     : [ "cake", "staplers", "fire" ]
}

That very basic structure tells you something about a person (and the kind of movies I like). It's a form of the master-detail pattern, i.e. there's a master 'person' and some 'detail' (about things they love) placed together. Things one person loves may be shared by other people. If you were to put that (very simple) example into a relational database you'd most likely use three tables: one for people, one for the things they love, and a link table joining the two.

Retrieving a person and their interests from a relational database requires a join. Retrieving from MongoDB means fetching back the single document. In one scoop. What you appear to have lost at first glance is some control and integrity over how "loves" data is maintained. The relational model enforces standardisation, so that the next person who loves staplers will have their record point to the same row in the "loves" table. With MongoDB (and this is a pattern that comes up again and again), you have choices: you could maintain a list of de-duped 'loves' from all person documents in another JSON document and point to it by reference, or you could delegate that management to the application domain layer above. Initially that sounds like a pain, but it's a pattern that you will only need to solve once, and the payback is that from then on you can interact with people documents in single-scoop (fast) atomic operations.

Now, say a new person needs to be added but they have an attribute that doesn't apply to anyone else (i.e. an extension to existing business functionality). Since there's no table schema telling you what to do you can go ahead and do just that:

{
    "name"          :   "Bill Lumbergh",
    "age"           :   38,
    "loves"         :   [ "himself" ],
    "douchelevel"   :   9
}

This has no impact on existing documents and, if you want to search on common attributes like 'name', your existing query syntax is unaffected too.
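
As a rough illustration, here is the same idea in plain JavaScript, with an in-memory array standing in for a collection. This is not the MongoDB API, and the helper name findByName is made up for the sketch. The point is only that a query on a shared attribute is indifferent to extra fields on some documents.

```javascript
// In-memory stand-in for a "people" collection (an assumption for this sketch,
// not a real MongoDB call).
const people = [
  { name: "Milton Waddams", age: 42, loves: ["cake", "staplers", "fire"] },
  { name: "Bill Lumbergh", age: 38, loves: ["himself"], douchelevel: 9 }
];

// Hypothetical helper: match on a field that every document happens to share.
function findByName(collection, name) {
  return collection.filter((doc) => doc.name === name);
}

const milton = findByName(people, "Milton Waddams");
const bill = findByName(people, "Bill Lumbergh");

// Both lookups work the same way, whether or not the extra field is present.
console.log(milton.length, bill.length);
console.log("douchelevel" in milton[0], "douchelevel" in bill[0]);
```

In the mongo shell the equivalent lookup would be along the lines of db.people.find({ "name" : "Milton Waddams" }).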

Understanding JSON documents is relatively trivial. Even die-hard relationalists find that within days it becomes second nature. Designing good documents is the next stage, so let's take a look at that.

Documents and Collections

Except for a few special cases, MongoDB documents have unique ids. A typical id looks like this:

68cc67093575062e3d95369e

Default ids are 12 bytes long and are generated by your chosen client driver. They are made up of four parts: a timestamp (4 bytes), a machine id (3 bytes), a process id (2 bytes) and a counter (3 bytes). MongoDB ids are kept in the _id field as a BSON ObjectId. You can use other types (e.g. UUIDs) if you so choose, though there are a few rules to follow.
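
To make the layout concrete, here is a minimal sketch that splits the id shown above into the four parts just described. The function splitObjectId is a hypothetical helper, not part of any MongoDB driver, and it assumes the byte layout given in this article.

```javascript
// Split a 24-character hex ObjectId string into the four parts described above.
// splitObjectId is a hypothetical helper, not a driver function.
function splitObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hex characters (12 bytes)");
  }
  // The first 4 bytes are a timestamp in seconds since the Unix epoch.
  const seconds = parseInt(hex.slice(0, 8), 16);
  return {
    timestamp: new Date(seconds * 1000),
    machine: hex.slice(8, 14),  // 3 bytes
    process: hex.slice(14, 18), // 2 bytes
    counter: hex.slice(18, 24)  // 3 bytes
  };
}

const parts = splitObjectId("68cc67093575062e3d95369e");
console.log(parts);
```

Because the timestamp comes first, default ids sort roughly in creation order.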

The simplest way to link MongoDB documents to one another is using ids (as in the example further down). Whether or not to link documents, or store their combined data together, is an application specific concern.

Here are some factors to help decide:

  1. Only link documents if your clients query them separately more often than they do so combined. Most of the time you are trading off performance (getting all you need in one scoop) with generality (storing separate document types separately).

  2. Split into multiple documents if combined documents would be unusually large. MongoDB documents can be really big (16MB in the current version), but big documents mean fewer of them in RAM at one time, so always look to keep them as small as possible. This includes key/field names. It sounds odd, but the size of large document collections can be reduced considerably by using very short field names, and small size generally improves memory management.

  3. Separate into multiple documents if you are going to change the size of those combined documents frequently. MongoDB pads documents with a little space to allow for growth. If that padding gets used up then it may have to move the document to a different place in the store, which has an overhead to it. Newly updated fields are then located at the end of the new document. Padding is optimised by MongoDB keeping track of how often it needs to move documents around. So rather than add more data to an existing document frequently it may be better to add the new data as a new document.
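
The point about field names in item 2 can be sketched with some rough arithmetic. The example below uses the length of the JSON text as a stand-in for stored size, which is an assumption: BSON sizes differ, but field names are repeated in every document in both formats. The collection size of one million documents is hypothetical.

```javascript
// Rough illustration only: JSON text length stands in for stored size.
const verbose = {
  author: "Milton Waddams",
  date: "2nd August 2011",
  comment: "Excuse me, I believe you have my stapler... "
};
const terse = {
  a: "Milton Waddams",
  d: "2nd August 2011",
  c: "Excuse me, I believe you have my stapler... "
};

// Approximate size of one document as the length of its JSON text.
function approxSize(doc) {
  return JSON.stringify(doc).length;
}

const docs = 1000000; // hypothetical number of documents in the collection
const savedPerDoc = approxSize(verbose) - approxSize(terse);

console.log("saved per document:", savedPerDoc);
console.log("saved across the collection:", savedPerDoc * docs);
```

The saving per document is small, but it is multiplied by every document in the collection. The cost is readability, so it is a trade-off worth measuring before adopting.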

We kept master-detail data together until the point it clearly didn't belong, which was indicated by our automated performance tests.

The canonical example of documents and links for MongoDB seems to have become the "blog post and comments" study.

For example, you could design one document structure for blog posts, and another for comments on those posts, meaning each blog post document would need to maintain an array of links to the comment documents relating to it (or comment documents maintain a link to the post they pertain to):

{
    "_id"       :   "0001",                     // ids are simplified for clarity
    "type"      :   "blogpost",
    "author"    :   "Milton Waddams",
    "title"     :   "Freedom from the Tyranny of Schemas",
    "date"      :   "30th July 2011",
    "content"   :   "Time flies - it was nearly two years ago that I wrote ..",
    "tags"      :   [ "architecture", "business" ],
    "comments"  :   [
                        { "_id" :   "1001" },
                        { "_id" :   "1002" }
                    ]
}

{
    "_id"       :   "1001",
    "type"      :   "comment",
    "author"    :   "Bill Lumbergh",
    "date"      :   "1st August 2011",
    "comment"   :   "Milt, we're gonna need to go ahead and move you downstairs."
}

{
    "_id"       :   "1002",
    "type"      :   "comment",
    "author"    :   "Milton Waddams",
    "date"      :   "2nd August 2011",
    "comment"   :   "Excuse me, I believe you have my stapler... "
}

Or, you could be more scoop-efficient and combine them:

{
    "_id"       :   "0001",
    "author"    :   "Milton Waddams",
    "title"     :   "Freedom from the Tyranny of Schemas",
    "date"      :   "30th July 2011",
    "content"   :   "Time flies - it was nearly two years ago that I wrote ..",
    "tags"      :   [ "architecture", "business" ],
    "comments"  :   [
                        {
                            "author"    :   "Bill Lumbergh",
                            "date"      :   "1st August 2011",
                            "comment"   :   "Milt, we're gonna need to go ahead and move you downstairs."
                        },
                        {
                            "author"    :   "Milton Waddams",
                            "date"      :   "2nd August 2011",
                            "comment"   :   "Excuse me, I believe you have my stapler... "
                        }
                    ]
}

The first example is what you might do to replicate something close to a relational database. It would work, but to build an HTML page containing a post and its comments you're going to make multiple calls to the database, and it feels a bit joiny. The second gets you the page in one scoop and fits with the outside-in concept discussed last time. And notice that those embedded documents don't have ids now. You could add them but they would just be arbitrary (and useless) fields to MongoDB.
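
The difference in database calls can be sketched with an in-memory stand-in that counts fetches. FakeStore and findById are hypothetical names, not driver calls, and the documents are trimmed versions of the ones above. A real client could batch the comment lookups into a single query, so treat the counts as illustrative rather than exact.

```javascript
// In-memory stand-in for a database that counts how many fetches are made.
// FakeStore is hypothetical; it is not a MongoDB driver class.
class FakeStore {
  constructor(docs) {
    this.docs = new Map(docs.map((d) => [d._id, d]));
    this.fetches = 0;
  }
  findById(id) {
    this.fetches += 1;
    return this.docs.get(id);
  }
}

// Option 1: the post links to its comments by id.
const linked = new FakeStore([
  { _id: "0001", title: "Freedom from the Tyranny of Schemas",
    comments: [{ _id: "1001" }, { _id: "1002" }] },
  { _id: "1001", author: "Bill Lumbergh" },
  { _id: "1002", author: "Milton Waddams" }
]);
const post = linked.findById("0001");
const comments = post.comments.map((ref) => linked.findById(ref._id));

// Option 2: the comments are embedded in the post.
const embedded = new FakeStore([
  { _id: "0001", title: "Freedom from the Tyranny of Schemas",
    comments: [{ author: "Bill Lumbergh" }, { author: "Milton Waddams" }] }
]);
const page = embedded.findById("0001");

console.log("linked fetches:", linked.fetches);
console.log("embedded fetches:", embedded.fetches);
```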

If we had opted for option 1 above then it would make sense to separate posts from comments, so that when we searched for posts we didn't have to trawl through the comments as well (we could assume that there will be many more comments than posts). MongoDB supports this through collections. Collections partition data within the database - in this case you would have a collection called "posts" and one called "comments". When querying one you won't (can't) get access to the other. This makes collections a bit like tables. Be careful if using a lot of collections though as there's a default namespace limit of 24,000 (collections and indexes both count towards this). The limit can be raised with the nssize command line option.

If all the documents in one collection adhered to the same schema then it would feel very like a table. Here's a screen shot of the posts collection in MongoVue, a Windows MongoDB client, looking for all the world like a regular table:

[Screenshot: the posts collection in MongoVue]

I used MongoVue quite a bit in early presentations. A lot of questions seemed to fall away once people saw a familiar-looking interface with familiar-looking data sitting in it. I guess there's a common misconception out there that NoSQL databases munge up your data in arcane ways.

Sub collections are permitted too (e.g. posts.milton, posts.bill), though these are for syntactic convenience only.

Another collection feature is capped collections - special pre-allocated (by size and/or number of documents), fixed-size storage areas. Capped collections are convenient if you are happy with the constraints (can't delete or grow the size of documents and eventually the earlier documents will be overwritten with newer ones). Capped collections have very stable write speeds because there's no need for dynamic space allocation.
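
The overwrite behaviour is easy to picture as a ring buffer. The toy model below is capped by document count only. CappedCollection is a hypothetical class written for this sketch, whereas real capped collections are created on the server and are also bounded by a size in bytes.

```javascript
// Toy model of a capped collection limited by document count only.
// CappedCollection is hypothetical and is not how the server implements it.
class CappedCollection {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.slots = new Array(maxDocs); // pre-allocated, never grows
    this.next = 0;                   // slot the next insert will use
    this.count = 0;
  }

  insert(doc) {
    this.slots[this.next] = doc;     // overwrites the oldest document once full
    this.next = (this.next + 1) % this.maxDocs;
    this.count = Math.min(this.count + 1, this.maxDocs);
  }

  // Return documents in insertion order, oldest first.
  find() {
    const start = this.count < this.maxDocs ? 0 : this.next;
    const out = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.slots[(start + i) % this.maxDocs]);
    }
    return out;
  }
}

const log = new CappedCollection(3);
["a", "b", "c", "d", "e"].forEach((msg) => log.insert({ msg }));

// Only the three most recent documents remain.
console.log(log.find().map((d) => d.msg));
```

Since the storage never grows, each insert does the same small amount of work, which is the intuition behind the stable write speeds.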

Documents and collections are stored in databases (there's a surprise). One MongoDB instance can support multiple databases simultaneously. By convention the currently selected database is denoted by "db" in shell commands. Databases are accessed and deleted using terms that will be familiar to most:

show dbs                    // list available database names

use officespaceblog         // select the blog post database

db.dropDatabase()           // delete the blog post database

Before covering how to interact with documents and collections in more detail it's worth taking a short detour. Prior to using MongoDB directly I'd heard a few myths about its reliability, particularly in respect of losing data. I'm not sure why confusion exists about how the MongoDB server works - the source is available to browse and it's well commented.

Let's take a quick peek under the bonnet.

The Server

MongoDB is written in C++. The core server daemon (mongod) is very small. With so much less to do than a conventional RDBMS this is something MongoDB has in common with many other NoSQL products. It's a pleasant experience when you first download it and realise there isn't some arcane landscape of files and directories to understand. There's a bin directory and everything's in there - the main daemon, the shard manager process (mongos), a monitoring process (mongostat), a command-line query shell (mongo) and some tools for managing data imports and the like.

Here's the list of tools taken from the README on github:

mongodump       - MongoDB dump tool - for backups, snapshots, etc..
mongorestore    - MongoDB restore a dump
mongoexport     - Export a single collection to test (JSON, CSV)
mongoimport     - Import from JSON or CSV
mongofiles      - Utility for putting and getting files from MongoDB GridFS
mongostat       - Show performance statistics

It's fair to say that the server works in an unusual way when compared to traditional databases. Some of its features have implications for your design.

Here are three parts to understand:

  1. Take the risk for this one transaction type. Not always a bad option if the data's not valuable.

  2. Wait for the journal write using getLastError with the j parameter. And note that calling getLastError can also be used to make sure different connections see written data consistently.

  3. Use fsync to write all files to disk. When journalling is on this actually just waits until the next journal write.

  4. Where you have multiple nodes in a replica set you can insist that data is sent to more machines before the operation returns (using the w parameter), though you need to be careful that w isn't greater than the number of nodes up at that moment otherwise the call will just wait.
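
Options 2 to 4 above can be sketched as the command documents one might pass to db.runCommand() in the mongo shell of the time. They are built here as plain objects so the sketch runs without a server. The helper names and the example timeout value are assumptions made for illustration, and newer releases express the same choices as write concerns rather than through getLastError.

```javascript
// Command documents for options 2 to 4, built as plain objects.
// The helper names are hypothetical; j, fsync, w and wtimeout are getLastError options.

// Option 2: wait until the write has reached the journal.
function waitForJournal() {
  return { getLastError: 1, j: true };
}

// Option 3: wait until data files have been flushed to disk.
function waitForDiskFlush() {
  return { getLastError: 1, fsync: true };
}

// Option 4: wait until `nodes` members of the replica set have the write.
// wtimeout (milliseconds) stops the call waiting indefinitely if too few nodes are up.
function waitForReplicas(nodes, timeoutMs) {
  return { getLastError: 1, w: nodes, wtimeout: timeoutMs };
}

console.log(JSON.stringify(waitForJournal()));
console.log(JSON.stringify(waitForDiskFlush()));
console.log(JSON.stringify(waitForReplicas(2, 5000)));
```

Setting a timeout alongside w is one way to avoid the indefinite wait mentioned in option 4.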

Basically you have a lot of choices and you want to think through your requirements for each type of write operation. If you deploy a write-heavy application to a single node, that saves business-critical data, and you do not use journalling then there's a real risk (in the long term a certainty) you will lose information in the case of a hard server crash. But if you run a production environment in that way then one might say you haven't really thought much about anything.

There are plenty of tuning options in MongoDB but how safe your data is and whether your application flies along, or stutters and chokes, is going to be down to application design, document design, operating system tweaks and how you deploy it. That's not a good or a bad thing - it's no different in principle than it would be with many databases, it just means you go looking for answers in different places.

Hard part over. Let's relax for a moment and talk about the other parts.

Interacting with MongoDB

An update takes up to four arguments:

  1. criteria - specifies the document we want to make a change to. This might just be the id but can in fact be anything. If multiple documents match, though, only the first will actually get modified unless you use the 'multi' option.

  2. changes - the new fields you want to add/change in the matched document(s) or a sub-command indicating the change you wish to make. Notable sub-commands include: $inc (increment a numeric field by a defined amount), $push (add a value to the end of an array, or start an array containing the value), and $rename (change a field name).

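        // change age to 43, leaving the other fields alone; without $set the
        // matched document would be replaced by { "age" : 43 }
        db.people.update( { "name" : "Milton Waddams" }, { $set : { "age" : 43 } } );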
    
  3. upsert? - a true/false flag: when true, a new document is inserted if nothing matches the criteria. It makes the save upsert functionality discussed above more explicit

  4. multi? - allow multiple updates

    i.e. if multiple documents match the criteria then MongoDB will attempt to apply the change to all of them. Each operation will be atomic but, because each document change is distinct (or more accurately the lock may be yielded periodically), other concurrent write operations may be interwoven, which could affect one or more of the matched documents, with last-write-wins. Remember too that some changes may fail for certain reasons (e.g. an array push to a field that is not an array in one of the matching documents). You can make this a bit more isolated by using the $atomic operator:

        // all people aged 42 become 43 (multi updates need a sub-command such as $set)
        db.people.update( { "age" : 42 , $atomic : 1 }, { $set : { "age" : 43 } }, false , true );
    

    this will make the update pseudo atomic - i.e. no concurrent/conflicting writes will be allowed while it's happening (need to be careful using this with very many documents though because of the lock).
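
    Returning to the changes argument for a moment: the difference between passing plain fields and passing sub-commands is easy to get wrong, so here is a toy model of how it is interpreted. applyUpdate is a hypothetical helper covering only $set, $inc and $push. It is a sketch of the semantics, not the server's implementation.

```javascript
// Toy model of how the "changes" argument of an update is interpreted.
// applyUpdate is hypothetical and handles only $set, $inc and $push.
function applyUpdate(doc, changes) {
  const usesOperators = Object.keys(changes).some((k) => k.startsWith("$"));
  if (!usesOperators) {
    // No sub-commands: the matched document is replaced (its _id is kept).
    return { _id: doc._id, ...changes };
  }
  const result = { ...doc };
  for (const [field, value] of Object.entries(changes.$set || {})) {
    result[field] = value;
  }
  for (const [field, amount] of Object.entries(changes.$inc || {})) {
    result[field] = (result[field] || 0) + amount;
  }
  for (const [field, value] of Object.entries(changes.$push || {})) {
    if (result[field] !== undefined && !Array.isArray(result[field])) {
      throw new Error(`cannot $push to non-array field "${field}"`);
    }
    // Append to the array, or start one containing the value.
    result[field] = [...(result[field] || []), value];
  }
  return result;
}

const milton = { _id: 1, name: "Milton Waddams", age: 42, loves: ["cake"] };

// Plain fields replace the document: name and loves are gone afterwards.
const replaced = applyUpdate(milton, { age: 43 });

// Sub-commands change only the fields they name.
const modified = applyUpdate(milton, { $inc: { age: 1 }, $push: { loves: "staplers" } });

console.log(replaced);
console.log(modified);
```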

    For some writes (decrement stock level for instance) you really need to know that two writes didn't both try and change a value such that invalid data remains in the document. If we have one copy of a book in stock and no back orders possible then two clients should not both be able to add it to their basket. For this there's a pattern which uses a neat trick with the criteria to update called 'update if current' which allows you to fetch a document, change it (remove the last book from stock), and then write it back only if the fields you are interested in (stock level) haven't been changed in the meantime:

    // find a book, assumes isbn is unique, with at least 1 in stock
    // (findOne returns the document itself; find would return a cursor)
    book = db.books.findOne({ "isbn" : 12345, "in_stock" : { $gt : 0 } });

    // remember current stock level, e.g. 1
    book_stock = book.in_stock;

    // decrement stock level, e.g. to 0
    --book.in_stock;

    // matches nothing, and so changes nothing, if stock has been changed in the meantime
    db.books.update({ "isbn" : 12345, "in_stock" : book_stock }, book);
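
The race this pattern guards against can be played out in memory. In the sketch below, findOne and updateIfCurrent are hypothetical stand-ins for the shell calls above, operating on a plain array, and two clients both read the last copy before either writes.

```javascript
// In-memory model of "update if current". Both helpers are hypothetical
// stand-ins for the shell calls, operating on a plain array.
const books = [{ isbn: 12345, in_stock: 1 }];

function findOne(isbn) {
  const found = books.find((b) => b.isbn === isbn && b.in_stock > 0);
  return found ? { ...found } : null; // a copy, as a client would receive
}

// Replace the stored document only if its stock level still equals expectedStock.
// Returns the number of documents changed (0 or 1).
function updateIfCurrent(isbn, expectedStock, replacement) {
  const index = books.findIndex(
    (b) => b.isbn === isbn && b.in_stock === expectedStock
  );
  if (index === -1) return 0;
  books[index] = replacement;
  return 1;
}

// Two clients read the same single copy before either writes.
const basketA = findOne(12345);
const basketB = findOne(12345);

const changedA = updateIfCurrent(12345, basketA.in_stock,
  { ...basketA, in_stock: basketA.in_stock - 1 });
const changedB = updateIfCurrent(12345, basketB.in_stock,
  { ...basketB, in_stock: basketB.in_stock - 1 });

console.log(changedA, changedB, books[0].in_stock);
```

The second client changes nothing because the stock level it read is no longer current. It can then re-read and retry, or report the book as out of stock.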
    

That about covers the basics. There's a lot more to explore in the online manual.

Building with MongoDB

Here are a few things we learned about getting the best out of MongoDB in development.

MongoDB works really well if you are following agile practices. Playing stories in short sprints with incremental database changes can be an onerous job. MongoDB's lazy creation of fields and collections (if you refer to one that's not there it gets created) makes for easy testing because it puts the code in control of the data rather than the other way around. We followed quite strict TDD, CI and QA practices so were able to write minimum amounts of test-passing code, which we could then safely grow and refactor as we understood our architectural needs better. Because we also adopted automated performance tests (nightly runs that put MongoDB under load) we could tune and adapt as we went. I would strongly recommend this approach; up-front design with a tool like MongoDB is quite risky. Far better to keep it simple, test often, and change your design incrementally in a highly-controlled way. This makes for the best kind of emergent architecture and dramatically reduces the chances of nasty surprises in production. I would do this with any database but it's particularly important in this case.

We made the choice to expose our application services via REST. It turned out that MongoDB supported this easily too. Collections and documents fit nicely with resources as do the semantics of interacting with them, and JSON on the inside plus JSON as the default content-type made it easier still.

Commercialising MongoDB

In an enterprise organisation the story doesn't stop at deployment. For commercial support we engaged 10gen once development had got to the point where we knew we would be going live (I had thought about doing that immediately we started, but with MongoDB's popularity I didn't want to sound like another maybe proposition that would waste their time). 10gen develop MongoDB and provide a commercial wrapper in the shape of support, training and health checks so you can get your designs underwritten by the same team that creates the product. One of the best things about 10gen is you're never too far from an engineer if you need help. Plus they're a likeable lot and very easy to work with.

The MongoDB license structure is in two parts - the core server is covered by the GNU AGPL v3.0 and the drivers by Apache License v2.0. Commercial support comes at a cost (can't say what we paid obviously, but it's very competitive indeed considering what you get).

Like most innovative tools these days MongoDB is open source. In enterprise IT open source is still a fairly novel concept and, to some, an untrusted concept. This goes back to what I said last time about the curious situation of having to promote a database tool to your IT department and sometimes to your business. I can understand the fear of moving away from the comforting feel that a well-heeled account manager from a commercial organisation brings. I know that comforting feeling is misplaced and that in reality you can make just as big a mess with an expensive product as a free one. But to be successful you can't ignore the fear, even if you believe it to be unfounded.

Many years ago I wrote up some thoughts on what I called the Third Way of software sourcing - instead of two choices: buy expensive stuff and tailor the bejesus out of it, or write it all yourself, I suggested a third option for corporates is to engage with open source communities in a more respectful way. That is, don't see open source as 'free' (and risky) but an asset to use and contribute back to. You get a great head start and mitigate risks by helping resolve defects and improve the software. The community benefits by having a brand name to associate with and it keeps the OSS project alive. We've tried to follow this with MongoDB and have submitted a few fixes where we found the need. We couldn't release our main application to the community for intellectual property reasons, and no doubt hard-core open sourcers would have an issue with that, but we did release back what we could - build tools such as our schema validator have no direct competitive advantage, so that's going to be available for others to use and extend. In turn we can benefit from that work. It's a slow process but one with good outcomes for everyone and I hope it goes some way to changing attitudes, because it's a much better way to work than handing over millions for a license fee and only then starting the torturous process of customising a product that was supposed to do everything 'out of the box'.

Scale and Growth

This subject will be covered later. The primary tools for scaling in MongoDB - Sharding (to scale out writes) and Replica Sets (to scale out reads) - haven't given us much to say yet. Given where we are with our projects right now I suspect the more interesting challenges around scale are yet to come and I'd rather talk about that having gone through it in production. There are plenty of references out there to cover this (some in the notes below) for now.

Summary

I hope that's provided some food for thought. I've covered quite a bit of ground and also left out some big areas (I've not covered map-reduce, indexing, cursors, GridFS or authentication for example).

In closing I would say that there's no doubt in my mind that NoSQL works in the enterprise if you follow good development practices. If you can work responsibly with the teams that build them then so much the better. MongoDB was a great choice for our requirement. It's been easy to work on and our weekly demos to the product owner have gone down extremely well. Sometimes we've surprised even ourselves at how quickly we can deliver new features. There are many cases where I would not use a NoSQL solution, but plenty of others where I would now say it's a perfectly valid and long-term approach.

Notes