Valid data management: your ticket to paradise!

Ok, so we have found paradise: it is located in West Perth, Australia. Analytically speaking anyway, as discovered through the Paradise Found project and as described in the previous blog post. It took us a lot of machine learning and analytics to pinpoint this location, and even more data: over five million data points collected on 148,233 locales around the world from 1,124 different sources guided us in the right direction. To me, data management is every bit as important as analytics. If we want to get the data to tell us anything, we need data management and analytics to work together optimally.

The real big data challenge: V as in variety

The challenge in analytics projects (such as Paradise Found) often lies less in the volume of data than in the variety of source systems and access pathways, and in the diversity of data structures, including data that has no structure at all. Paradise Found is a perfect illustration of how important it is to have an open analytics platform that can transparently access nearly every data source and acquire the data without problems.

Diverse data sources and heterogeneous data structures require the most sophisticated data quality tools. In this project, we had to standardize and consolidate city names from a dizzying variety of formats around the world – in terms of language and the alphabets used – and that was just the easy part! We had to apply standard data quality methods like profiling, parsing, and cleansing, but also more advanced capabilities such as analytical data enrichment. We did not exclude missing or incorrect data from the analysis, but rather applied processes like machine learning to improve the usefulness of the data.
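To make the city-name problem a little more concrete, here is a minimal Python sketch of the standardization step. It is not the tooling used in Paradise Found; it only assumes that names are first reduced to a comparison key (Unicode-normalized, accents stripped, case-folded, whitespace collapsed) and then looked up in an alias table. The function names and the tiny `ALIASES` table are hypothetical and exist purely for illustration.

```python
import unicodedata

def normalize_city(name: str) -> str:
    """Build a comparison key: decompose, drop accents, case-fold, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())

# Hypothetical alias table: normalized spelling -> canonical English name.
# A real project would need a far larger, curated reference list.
ALIASES = {
    "munchen": "Munich",
    "munich": "Munich",
    "москва": "Moscow",
    "moscow": "Moscow",
    "west perth": "West Perth",
}

def standardize(name: str):
    """Return the canonical city name, or None if the name cannot be resolved."""
    return ALIASES.get(normalize_city(name))

print(standardize("München"))
print(standardize("  WEST   perth "))
```

Names that come back as `None` are the interesting ones: rather than being dropped, they are candidates for fuzzy matching or manual review.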

Success factors: speed and simplicity

This project has not only highlighted the huge importance of having the right data management tools, but also demonstrated – once again – how important it is to closely integrate data management and analytics. Only an iterative, integrated process makes it possible to make rapid progress and to enrich the analyses with additional data. This also means that the traditional roles – data scientist, data architect, business analyst, IT department, … – and the associated task assignments and processes have mostly become obsolete. These separate processes must be merged into a single iterative one to generate innovation. Only an integrated platform like the SAS platform, which covers these iterative steps in a complete overall process, allows you to implement a project like this in just a few weeks’ time.

Key aspects here are:

  • consistent use of analytics and machine learning algorithms throughout the process, even at the earliest stage of data preparation
  • constant transparency of the existing data, data quality, and any information already generated from the data in the form of models
  • an intuitive front-end that, in combination with the two points above, enables a broad range of users to get the data to speak to them very quickly, in a “self-service” process
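As a rough illustration of analytics being used as early as data preparation, the Python sketch below fills a missing value from the most similar complete records instead of discarding the record. This is a simple nearest-neighbour imputation chosen for brevity, not the method used in Paradise Found, and the field names and numbers are made up for the example.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill missing `target` values with the mean of the k most similar complete rows."""
    features = [key for key in rows[0] if key != target]
    complete = [r for r in rows if r[target] is not None]
    if not complete:
        raise ValueError("no complete rows to learn from")

    def distance(a, b):
        # Plain Euclidean distance; real data would need feature scaling first.
        return math.dist([a[f] for f in features], [b[f] for f in features])

    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        nearest = sorted(complete, key=lambda c: distance(row, c))[:k]
        estimate = sum(c[target] for c in nearest) / len(nearest)
        filled.append({**row, target: estimate})
    return filled

# Made-up illustrative records, not project data.
locales = [
    {"sunshine_hours": 3200, "avg_temp_c": 18.0, "rent_index": 60.0},
    {"sunshine_hours": 3100, "avg_temp_c": 19.0, "rent_index": 64.0},
    {"sunshine_hours": 1500, "avg_temp_c": 9.0, "rent_index": 90.0},
    {"sunshine_hours": 3150, "avg_temp_c": 18.5, "rent_index": None},
]

print(knn_impute(locales, "rent_index")[3]["rent_index"])
```

The function returns new records and leaves the input untouched, so the original gaps stay visible for anyone who wants to check how much of the data was estimated.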

When applied this way, big data management is more than a simple finger exercise, but it doesn’t have to be an onerous chore. It is the only way to obtain a clear, undistorted picture of the data and to derive models – and the success or failure of every analysis depends on that.

You won’t find paradise without good data management – at least not an analytically approved paradise. In the case of Paradise Found, valid and interesting results are certainly nice to have. But in business, machine learning will generate entirely new realms of potential.