Who needs statisticians when you have big data?

Depending on whether you are a glass-half-full or a glass-half-empty kind of person, the "big data" revolution is either a tremendous windfall for the career of a statistician or the makings of a real existential crisis. As with most things, it’s probably a bit of both.

From a need to expand…

The field of statistics began as an inferential approach, arising largely from the aspiration to generalize what is observed in a small group to some larger group. In other words, statistics arose out of the need to draw careful conclusions from incomplete but representative data.

Three hundred years ago, the statistical approach got its start in an effort to estimate the population of London during the plague without counting every person (which was impossible with the resources of the time, not to mention a health risk).

The concept of a representative sample was used, and with it a very rigorous statistical methodology. Statistics consisted (and still consists) of evaluating what the data actually say (e.g., counts, averages, correlations, trends, regressions), but at least as important was the strict analysis of whether the sample was representative of the greater population.

… to a need to interpret

At first glance, with today’s ability to cost-effectively store and explore ever-growing amounts of data, the need to sample seems to have been buried for good. But does that mean that sampling and generalization considerations are passé in a big data world?

Let’s consider for a moment: is any dataset really complete? A tax authority can, in theory, run analyses on all taxpayers, but does it have all the relevant data on taxpayer assets, expenses, socio-economics, etc.? And can it compare its findings to other regions or countries? An oil company can, in theory, analyze every sensor on an oil rig, but is that rig representative of all rigs? And doesn’t the sensor data need to be enriched with weather data? Even music services like Spotify can evaluate all listeners’ behavior, but can they really predict individual preferences based solely on co-occurrences?

The only correct answer to any of these questions will always start with: "It depends."
And that’s exactly why statisticians and critically minded analysts still have a crucial role in big data analysis and the innovative organization. Statistics, or being statistically inclined, means having a critical mindset about what role numbers can play in describing the real world.

Your dataset is bigger than mine? So what?

Just because my dataset is bigger than yours doesn’t mean I’ll get more information out of it: bigger data and bigger analysis only increase the potential for learning. It will still take analytically minded thinkers to draw the appropriate conclusions.
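
To make that point concrete, here is a minimal Python sketch. Everything in it is a made-up assumption for illustration: the population is simulated from a skewed distribution with arbitrary parameters, and the function name compare_samples is hypothetical. It compares how far two estimates of the population average land from the true value: one from a small random sample, the other from a sample 100 times larger that only ever sees the upper half of the population.

```python
import random
import statistics


def compare_samples(seed=0):
    """Compare a small random sample with a large but biased one."""
    rng = random.Random(seed)

    # Hypothetical population of 200,000 skewed values (made-up parameters).
    population = [rng.lognormvariate(10, 0.5) for _ in range(200_000)]
    true_mean = statistics.fmean(population)

    # Small but representative: a simple random sample of 1,000 values.
    small_random = rng.sample(population, 1_000)

    # Big but biased: 100,000 values, all taken from the upper half only.
    big_biased = sorted(population)[len(population) // 2:]

    return {
        "true_mean": true_mean,
        "small_random_n": len(small_random),
        "small_random_error": abs(statistics.fmean(small_random) - true_mean),
        "big_biased_n": len(big_biased),
        "big_biased_error": abs(statistics.fmean(big_biased) - true_mean),
    }


if __name__ == "__main__":
    for key, value in compare_samples().items():
        print(f"{key}: {value:,.0f}")
```

In this toy setup, the bigger sample misses the true average by a much wider margin than the small random one, because no amount of extra rows corrects for who was left out. Real datasets are rarely biased this crudely, so treat the sketch as an illustration of the principle rather than a measurement of anything.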

So, hold on to your critical edge, fellow statisticians. The potential of big data is truly awesome, but it’s up to us to make it work. And if we get paid for it, well, that’s just a nice side effect of a job well done.