Machine learning: a different (dashboard) light on the Paradise Papers

No fewer than 380 journalists, members of the ICIJ (International Consortium of Investigative Journalists), have been investigating 13 million documents since the start of this year. The results of their investigation are now being released as the so-called Paradise Papers. It took almost a year to sift through these documents looking for connections between, among others, President Putin and US Secretary of Commerce Wilbur Ross. An impressive feat, and one in which data science and state-of-the-art machine learning algorithms could play an important role.

Guest blog by Véronique Van Vlasselaer, analytical consultant at SAS

She co-authored the book Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection.

Who is the better investigator: humans or computers?

As some journalists phrased it: it was like looking for a needle in a giant haystack. The Belgian journalists in this investigation searched the data for anything connected to our country. Often they followed a specific trail for a long while, only to find out in the end that it was a dead end. This is precisely where data science should come in: to search for patterns efficiently and effectively in huge amounts of data.

If machine learning and data science had been used during this investigation, they would probably have led to useful results much more quickly, even with a much smaller team than the 380 journalists worldwide who have been struggling for months. Data science and machine learning algorithms can support the investigative process by pointing out potentially ‘suspicious’ patterns to the journalists. Machine learning cannot replace the human factor, but it can speed up the research significantly: instead of searching the mountains of data for interesting patterns themselves, journalists can focus on validating the patterns that the machine has discovered.

Where the computer makes a difference: speed

Such use of data science is far from new. Every day, these algorithms are applied to countless transactions without us even noticing. Just think of the finance industry. Financial transactions are no longer analyzed by human experts but by computers, which use machine learning algorithms to perform these analyses at superspeed. For any transaction on a payment terminal in a retail shop, the computer has to decide within six seconds whether the transaction is valid or not. Within that short timeframe, all relevant data are gathered and examined for patterns, and the transaction is flagged if an anomaly is detected. Throughout these processes, there is a continuous learning curve in understanding how fraudsters operate. Machine learning also allows the computer systems to identify specific patterns and to adapt the algorithms accordingly.
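To make the idea of flagging an anomaly a little more concrete, here is a minimal sketch in Python. It is not how any real payment system works: it simply assumes that a transaction is suspicious when its amount lies far from the cardholder's usual spending. The function name, the threshold of three standard deviations and the sample amounts are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, amount, threshold=3.0):
    """Return True if `amount` lies more than `threshold` standard
    deviations from the mean of the past amounts in `history`.

    Hypothetical rule for illustration only; production systems combine
    many more signals (location, merchant, timing, network links).
    """
    if len(history) < 2:
        return False  # too little history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts are identical: any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Invented past amounts for one cardholder.
past_amounts = [12.5, 40.0, 23.9, 31.0, 18.2, 27.5]

print(is_anomalous(past_amounts, 29.0))   # an ordinary purchase
print(is_anomalous(past_amounts, 950.0))  # far outside the usual range
```

A check this simple runs in a fraction of a millisecond, which is why even far richer models can still answer within the few seconds a payment terminal allows.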

This machine learning technology could contribute a great deal to the Paradise Papers investigation. Machines can spot and analyze transactions between organizations and individuals in a fraction of the time needed by human investigators, leading to much faster results. Network analysis, one of the techniques in the machine learning toolbox, automatically investigates all connections between ventures, individuals and organizations. It is a valuable aid for analyzing and visualizing networks: when performed manually, these tasks can take ages.
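As a rough illustration of what network analysis does under the hood, the sketch below searches for the shortest chain of links between two parties. It uses a plain breadth-first search, which is only one of many possible techniques, and the people, companies and links are entirely made up; nothing here comes from the Paradise Papers themselves.

```python
from collections import deque


def build_graph(edges):
    """Turn a list of (a, b) links into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of entities that
    links `start` to `goal`, or None if they are not connected."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):  # sorted for repeatable output
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical links (director of, shareholder of, owns, ...).
edges = [
    ("Person A", "Company X"),
    ("Company X", "Trust Y"),
    ("Trust Y", "Company Z"),
    ("Company Z", "Person B"),
    ("Person C", "Company X"),
]
graph = build_graph(edges)

print(shortest_path(graph, "Person A", "Person B"))
```

On five links this is trivial, but the same search scales to millions of links, and that is exactly where a manual investigation starts to take ages.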

Where the journalist makes a difference: interpretation

Using text analytics, you can automatically retrieve persons, ventures, connections and other interesting information from a large number of documents. These unstructured documents are then transformed into structured information. The computer takes care of all preparations and the journalists can focus on the further analysis.
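The step from unstructured text to structured information can be sketched in a few lines of Python. Real text analytics relies on trained language models; the example below only assumes that the documents contain sentences of the form "X is a director of Y Ltd" and picks them up with a regular expression. The pattern, the list of roles and company suffixes, and the sample sentences are all invented for illustration.

```python
import re

# Hypothetical pattern: "<First Last> is a/the <role> of <Company> <suffix>".
RELATION = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) is (?:a|the) "
    r"(?P<role>director|shareholder|owner) of "
    r"(?P<company>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)* (?:Ltd|Inc|SA))\b"
)


def extract_relations(documents):
    """Return (person, role, company) tuples found in the documents."""
    records = []
    for text in documents:
        for match in RELATION.finditer(text):
            records.append(
                (match.group("person"), match.group("role"), match.group("company"))
            )
    return records


# Invented snippets standing in for leaked documents.
documents = [
    "Jane Doe is a director of Blue Harbor Ltd. The company was registered in 2009.",
    "According to the filing, John Smith is a shareholder of Blue Harbor Ltd "
    "and Jane Doe is the owner of Red Rock Holdings Inc.",
]

for person, role, company in extract_relations(documents):
    print(f"{person} | {role} | {company}")
```

Each tuple is a ready-made link between a person and a venture, so the output of this step can feed directly into a network analysis.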

Detecting large-scale, organized fraud networks is not limited to (science) fiction. The powerful combination of man and machine enables a quick and effective dismantling of such networks. As the data processing becomes more mature, we will hopefully be able to discover and address such anomalies even without data leaks.

A single machine can replace the manual work of a hundred individuals. But don’t get me wrong: ultimately, you can only be successful with the right interaction between man and machine. A computer can uncover correlations, but not (yet) causal relationships. It can discover lots of trails very swiftly, but it still needs humans to guide the search and to handle the discoveries. Because in the end, even when powered by machine learning, a computer still lacks the capacity for good old-fashioned human interpretation.