The Big Data Deception

What was I supposed to do - call him for cheating better than me, in front of the others?

You can’t go to a conference, read a blog (ahem) or open a tech mag without someone talking about Big Data these days. Now I’m as excited as the next person whenever new techniques, approaches, tools, frameworks, whatever come along, but equally, given our industry’s penchant for hype, it’s important to keep one eye out for denuded emperors keen to show off their new wardrobe or vendors with sales targets to hit.

About three seconds after it was announced that Barack Obama had won the US election, Twitter was awash with tweets also congratulating Nate Silver of the New York Times. Nate had, along with only a handful of others, predicted the result with 100% accuracy. He hadn’t used a psychic octopus, a gifted cat, Halloween masks or eyebrow thickness; he used Big Data.

I read all the posts and along with everyone else admired Nate Silver’s analytical skills. To collate and filter all that data, adjust for various biases and survey limitations, was no mean feat. Surely if you needed a business case for Big Data that was it. This is a guy who can predict who’s going to win an election and he’s 100% right for every state. On the flip side, it also crossed my mind that I wouldn’t want to be him at the next election, because the media will whip up such high expectations that, even if he gets the overall result right, any variances from 100% at state level will feel like disappointment. Who knows? Maybe he’ll even pull it off a second time. It’s a tough challenge because electorates leading up to election days can be an inscrutable bunch. In 1992 at the UK general election, the Conservative party landed back in office with 8% more of the vote than polls had predicted (if the polls had been accurate they would have lost by 1%). The phenomenon is well known - as the “Shy Tory Factor” or the “Bradley Effect” - people telling pollsters that they’ll vote for the candidate they perceive to be the more socially acceptable, when in truth they intend to vote for someone else.

I see two sources of disappointment with Big Data. The first is that it’s really hard to make sense of large data sets in meaningful ways. Nate Silver has clearly done his homework, but that doesn’t mean it’s something anyone can do on any data set. Having data on a grand scale seems to me to be only the beginning. It certainly isn’t of any value in and of itself. Big Data? Big Fucking Deal. Smart Analysis skills are clearly the deciding factor and that’s not really anything new. What is new is the application of smart analysis skills, plenty of data to apply them to, and an improvement in the software to manage huge sets of data.

There probably are weird correlations to discover about people who like marmite, play lacrosse on Thursdays and drive a Honda. And I am sure knowing that individuals in this group are 20% more likely to buy electric hedge trimmers is a boon to marketing teams at Black and Decker. But there’s a lot of speculative data mining that happens before these gems are unearthed and I note that Data Scientists are commanding large salaries to indulge themselves in this mining.

And good luck to them. No doubt there will be plenty of companies who will ultimately get back all they invest in Big Data and more. Pharmaceutical manufacturers, for example, might benefit from an enhanced ability to trawl large data sets looking for patterns. But my second big issue is the one that bothers me most and that’s the adoption of Big Data by mainstream enterprise IT.

Because whereas I can leave it to old-fashioned (real) scientists to pick through data sets infused with bias, self-conscious moderation, politics, subjectivity, and correlation vs causation mix-ups, I find it hard to sit quietly by whilst the enterprise purveyors of Business Intelligence and Data Warehousing hoover up more cash as duped CIOs succumb to yet another grand-scale technology swindle.

You might think that sentiment too harsh, but before you make final judgement let’s take a quick history tour.

Architecture as a specialism in large companies came into being in the late 1990s. I remember because, like most people, I was seduced by the job title and the notion that it might be possible to earn a ton of money whilst not actually being accountable for any of the code or operational characteristics of the system. What wasn’t to like?

One of the first enterprise problems selected by this new breed of self-styled architect was point-to-point integration. Up until that time naughty developers had been opening up connections from new applications to old applications whenever they needed to. Worse still, a lot of these connections were running hard-coded SQL commands against back-office databases to get at data. This meant changes in one application created knock-on (i.e. expensive, hard to manage, hard to test) changes in many other applications. It was called ‘tight coupling’ and as Mr Garrison would say ‘Tight coupling is bad, m’kay?’

So architecture teams introduced Enterprise Application Integration (EAI) - a fancy way of saying that instead of allowing Application A to talk to Application B directly, new governance rules would dictate that Application A would now have to talk to message bus M, via adaptor Am, which enriched, validated and converted said message into canonical data format Dm, then sent Dm to Application B, at which point it got converted into a suitable local format via adaptor Bm. Oh, and because these message buses were asynchronous, various other governing applications were also bolted on to M to ensure predictable message-order delivery (Mo), once and only once delivery (Md), guaranteed delivery (Mg), etc, etc.
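For anyone who never had the pleasure, the hop-by-hop journey described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `adaptor_am`, `adaptor_bm`, `CanonicalMessage` and `MessageBus`, and the field names in each format, are hypothetical stand-ins for Am, Bm, Dm and M, not the API of any real EAI product. The in-memory queue stands in for the bus and gives nothing more than first-in, first-out ordering.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CanonicalMessage:
    """Dm: the canonical data format every application must translate to and from."""
    kind: str
    payload: dict


def adaptor_am(local_row: dict) -> CanonicalMessage:
    """Am: validate Application A's local record and convert it to Dm.

    The field names ("order_id", "amt") are invented for this example.
    """
    if "order_id" not in local_row or "amt" not in local_row:
        raise ValueError("Application A sent an incomplete order")
    return CanonicalMessage(
        kind="order",
        payload={"id": local_row["order_id"], "total": local_row["amt"]},
    )


class MessageBus:
    """M: a toy in-memory bus. A FIFO queue is the only ordering guarantee here."""

    def __init__(self) -> None:
        self._queue: deque = deque()

    def publish(self, message: CanonicalMessage) -> None:
        self._queue.append(message)

    def deliver(self) -> CanonicalMessage:
        return self._queue.popleft()


def adaptor_bm(message: CanonicalMessage) -> dict:
    """Bm: convert Dm into Application B's local format (also invented)."""
    return {"OrderRef": message.payload["id"], "Amount": message.payload["total"]}


# Application A -> Am -> M -> Bm -> Application B
bus = MessageBus()
bus.publish(adaptor_am({"order_id": 7, "amt": 12.5}))
received_by_b = adaptor_bm(bus.deliver())
print(received_by_b)
```

Even in toy form that is two translations and a queue to move one record, and the sketch leaves out guaranteed and once-only delivery entirely.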

Before EAI appeared, application integration was certainly problematic. EAI solved this by adding a ton of extra complexity and ensuring enterprise architects could buy nice cars.

It wasn’t long before EAI became unmanageable. It was expensive (those adaptors weren’t cheap and there were hundreds of them), required a centralised command and control governance structure, and didn’t do much to address the original problem. Message bus vendors liked it though - Tibco launched an entire business off the back of EAI. In a logical world we’d have had a good hard retrospective, examined the issues, and made plans to adjust EAI, removing complexity whilst focusing on the original problem of reducing coupling.

Unfortunately there were those cars that needed paying for. Rather than face up to what was a set of extremely bad choices that did a great disservice to the businesses that paid for it, architects and vendors adopted Service-Oriented Architecture (SOA) as their get-out-of-jail-free card. SOA - deriving a set of clearly bounded, abstract, autonomous, business services across all applications - was, and is, a reasonable way to model integration needs. The downside was that SOA required patience, investment, and the business driving it. It also wasn’t particularly prescriptive about implementation. SOAP was though (and it had the same first three letters) so sharp vendors added XML-powered web services to their old EAI products and took positions on lumbering industry committees, to ensure ‘enterprise standards’ proliferated enough that consultant income could roughly equate to the cost of a new Mercedes. In the process of doing this they birthed and baptised the Enterprise Service Bus.

Note that all this fuss was about integration, not building applications, deploying applications, or more importantly, understanding customers. Companies were spending a fortune making systems talk to each other just as a new breed of web businesses were springing up (without legacy IT, without an ESB, without armies of consultants). These upstarts knew things about analytics and customer interaction. They were made of web. They talked of REST and HTTP. This was precisely the time old-school businesses needed to understand the dynamics of their market better. But to find out what was actually going on inside the heads of customers, architects and vendors needed a nice house with a drive to park their cars on, and maybe a couple of Caribbean holidays a year. So we got the Enterprise Data Warehouse (EDW).

The logic to the EDW is implacable - put all a company’s trading data into a massive centralised repository which is maintained by a secret society in the sure and certain hope that it will be resurrected one day in the form of invaluable insight. EDW tools are expensive, consultancy services are expensive. And they want all data from all systems, because it’s impossible to say which of the billion items generated each day will be required at that special future moment when all that investment will suddenly be repaid.

Except it won’t ever be repaid. Because Enterprise Data Warehouses are like inverse Mayan apocalyptic predictions, subscribed to by those with poor logical judgement or a commercial interest in preying on the fears of others.

However, vendors, architects, strategists, consultants, the CIOs who appointed them, and anyone else fingered in this ten-year conspiracy need not worry. Because now we have Big Data. Big Data is new and cool and hip, and it’s not a three-letter acronym for a term with ‘enterprise’ in it. No sir. Big Data is made of Web. It’s Big, it’s Data, it’s Cloud, it’s Agile, it’s Something as a Service. It’s open source. And now the vendors won’t come in suits, they’ll come in dockers and a polo shirt, because this is start-up cool, baby.

Except the prices. Just as the ESB was a fake-SOA fur coat applied to cover the embarrassment of EAI we now see the same happening to EDW with Big Data.

This makes me sad because I’m a big fan of many of the open-source tools being misappropriated into the Big Data bandwagon. Storage is cheaper than ever, cloud servers are easy to fire up and getting easier to manage all the time. Businesses are generating more data than ever. NoSQL stores, applied wisely, have a lot to offer. Great things are possible. And I am not saying it would be cheap necessarily. Projects are expensive things. I am simply saying that Data Warehouses are not the basis for an effective Big Data strategy (that’s even assuming one is going to provide value). Far better to give the money to teams already building real things and let them architect close-to-the-edge real-time analysis plus easy data access when that’s not possible.

Nobody is likely to listen though. The Enterprise Data Warehouse team, who rebranded as Business Intelligence, now lay claim to Big Data. Vendors throw in some Hadoop integration. Everybody gets a MongoDB mug.

At the heart of this deception lies a very simple con trick - by describing technical problems using overwrought and evocative language, budget holders have been led to believe that architecture complexity is the only way to secure the enterprise against the forces of chaos. IT Management has come to see the world in terms of a false dilemma: allow chaos to reign unbridled, or accept the yoke of complex control systems.

And all this was achieved without anyone ever once asking the basic question:

Is the solution, and its ongoing management, going to cost more or less than the cost of the problem we’re trying to solve?

It’s not that technical problems should be ignored. But think of it this way - a really shitty IT problem that results in 100 people having to be hired to solve it manually would cost a business substantially less than 2 million UK pounds (3.2m US dollars) a year. I’ve not heard of many problems that need 100 people, and I’m not saying 2 million pounds is chicken feed. But I’ve heard of many technology solutions that cost far more than that just for the software licenses. Even without license costs you’d be lucky to get 7 large SI consultants for that. And forget about ROI with data warehouses and enterprise service buses - there isn’t any. They actually raise the cost of each and every project that they come into contact with (as compared with a do-nothing option). Is this really a good time to sprinkle Big Data pixie dust into the mix and expect things to get better?
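The back-of-envelope sums above are easy to check. The sketch below uses only the figures quoted in the paragraph (the 2 million pound ceiling, 100 people, 7 consultants); the per-head numbers are plain division, and the licence cost passed to the comparison at the end is a made-up figure for illustration, not a quote from any vendor.

```python
# Figures taken from the paragraph above. The text treats 2 million pounds
# a year as an upper bound on the cost of the manual workaround.
MANUAL_COST_CEILING_GBP = 2_000_000
MANUAL_STAFF = 100
SI_CONSULTANTS = 7


def cost_per_head(total_gbp: float, heads: int) -> float:
    """Yearly budget available per person if the total is split evenly."""
    return total_gbp / heads


def solution_is_cheaper(solution_cost_gbp: float, problem_cost_gbp: float) -> bool:
    """The basic question: does the fix cost less than the problem it fixes?"""
    return solution_cost_gbp < problem_cost_gbp


per_manual_worker = cost_per_head(MANUAL_COST_CEILING_GBP, MANUAL_STAFF)
per_consultant = cost_per_head(MANUAL_COST_CEILING_GBP, SI_CONSULTANTS)

print(f"Per manual worker: £{per_manual_worker:,.0f} a year")
print(f"Per SI consultant: £{per_consultant:,.0f} a year")

# Hypothetical yearly licence bill, assumed purely for the comparison.
HYPOTHETICAL_LICENCE_COST_GBP = 3_000_000
print(solution_is_cheaper(HYPOTHETICAL_LICENCE_COST_GBP, MANUAL_COST_CEILING_GBP))
```

The same budget stretches to roughly £20,000 a head across 100 people or about £286,000 a head across 7 consultants, which is the whole point of the comparison.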

Point-to-point integration certainly gave IT departments headaches. And message buses used to solve asynchronous data distribution problems are elegant and scalable. SOA remains a valuable contribution to thinking about large enterprise application estates and how to fit them better to business needs. Intelligence reporting across large data sets is critical to smart decision making. Sometimes in solving these problems we will have to deal with complex solutions, though as an old colleague of mine used to say: the best you’ll ever get is an IT solution as complex as the business model itself is. You’ll never get an IT solution that’s simpler than the business rules and data it’s managing (that’s kind of self-evident).

Finding the simplest possible solution is surely why we come to work every day. It’s why we studied software at college, why we read blogs, study the work of others, read about disasters, attempt code katas, argue for hours about semantic nuances, run technical spikes, fight against unnecessary requirements, go to conferences, buy books, obsess about craftsmanship.

I think we do these things so that we never lose sight of our goal in the pursuit of a solution and so that we never hand over millions in license fees for third-party products without asking hundreds of difficult questions first. And never accept any approach as correct above all others on the basis of a plausible PowerPoint presentation.