In college, one of my favorite courses was Probability and Statistics. I was particularly fond of Simpson’s Paradox, an obscure effect in which aggregated data masks the truth: a trend that holds within every subgroup can reverse once the subgroups are combined. Here’s a quick example from Wikipedia on two kidney stone treatments:
The paradoxical conclusion is that treatment A is more effective when used on small stones, and also when used on large stones, yet treatment B is more effective when considering both sizes at the same time.
In this example the “lurking” variable (or confounding variable) of the stone size was not previously known to be important until its effects were included.
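To make the reversal concrete, here is a minimal Python sketch of the arithmetic. The counts are the ones I recall from the kidney stone table on Wikipedia’s Simpson’s Paradox page (successes out of patients treated); the names `data`, `rate`, and `overall` are just illustrative helpers, not part of any library.

```python
# (successes, patients) per treatment and stone size, as given in the
# kidney stone example on Wikipedia's Simpson's paradox page.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Success rate as a fraction between 0 and 1."""
    return successes / patients


def overall(treatment):
    """Pool both stone sizes for one treatment: (successes, patients)."""
    groups = data[treatment].values()
    return sum(s for s, _ in groups), sum(n for _, n in groups)


# Within each stone size, treatment A has the higher success rate.
for size in ("small", "large"):
    for treatment in ("A", "B"):
        s, n = data[treatment][size]
        print(f"{size} stones, treatment {treatment}: {s}/{n} = {rate(s, n):.0%}")

# Pooled across sizes, treatment B comes out ahead.
for treatment in ("A", "B"):
    s, n = overall(treatment)
    print(f"all stones, treatment {treatment}: {s}/{n} = {rate(s, n):.0%}")
```

The reversal comes from how the cases are distributed: treatment A was used mostly on large stones, which are harder to treat, while treatment B was used mostly on small ones. Pooling the groups mixes the effect of the treatment with the effect of stone size.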
I was thinking about this the other day in the context of Big Data. Yes, Simpson’s Paradox existed well before the advent of YouTube, Instagram, and Twitter. But with so much information now available to us, what truths will it mask? What’s more, an increasing percentage of this data is unstructured and, I would argue, subject to some level of interpretation. Big Data will lead to some big discoveries and insights, but also some big mistakes.
Raising the Stakes
As Julie B. Hunt pointed out on this site, the Internet of Things is coming, and soon. Machines and sensors will only add to the amount of information available to us. Wearable technology and the quantified self mean that we will generate even more data on what we’re doing.
It’s obvious to me that organizations will need not only better tools for storing petabytes of unstructured data (read: Hadoop), but also improved mechanisms for analyzing all of this data. Will existing data warehouses, datamarts, and cubes be sufficient?
The Politics of Technology
Throughout my consulting career, I have been both frustrated and amazed at the organizational politics behind internal systems and technology. Early on, I learned that homegrown applications die hard. People who built systems usually didn’t want to see them retired, although I did see pleasant exceptions to that rule. For the most part, people tended to associate “their” applications with their own jobs and even their own identities. Systems and reporting tools that were more than a little long in the tooth survived for years after they had stopped being useful.
As I write these words, I know that someone, somewhere is reading about the benefits of Big Data. Maybe two people are having a conversation at a conference or at a coffee shop. One of the two shares my concurrent sense of frustration and opportunity. There’s so much that can be done with Big Data, but only if the organization recognizes its potential and makes the attendant investments to realize it.
Big Data doesn’t happen overnight and there’s no magic to it. Deploying Big Data tools doesn’t guarantee anything, as Simpson’s Paradox shows: more data can still point to the wrong conclusion. By the same token, though, it’s impossible to benefit from Big Data with insufficient tools.
Does your organization possess the tools to handle Big Data?
Why or why not?