Commercial versus Open-Source Software for Analytics

To set up an analytics environment, firms need to decide which hardware and software technologies to adopt. In terms of hardware, big data requires specialized infrastructure to store, integrate, clean and manage the data. As for software, many vendors such as SAS, IBM, Microsoft, Oracle and MathWorks (MATLAB) currently provide commercial solutions for big data and analytics. We also see more and more free, open-source solutions (e.g. R, Python, Weka, RapidMiner) being offered in the market. In fact, the popularity of open-source analytical software has sparked a debate about the added value of commercial tools. Commercial and open-source software each have their merits, which should be thoroughly evaluated before any analytical software investment decision is made. In this contribution, we elaborate further on this trade-off.

First of all, the key advantage of open-source software is that it is available for free, which significantly lowers the barrier to entry. However, this poses a danger as well, since anyone can contribute to it without any quality assurance or extensive prior testing. In heavily regulated environments such as credit risk (Basel Accord), insurance (Solvency Accord) and pharmaceuticals (FDA regulation), analytical models are subject to external supervisory review because of their strategic impact on society, which is now bigger than ever before. Hence, in these settings many firms prefer to rely on mature commercial solutions that have been thoroughly engineered, extensively tested and validated, and completely documented. Many of these solutions also include automatic reporting facilities to generate compliant reports in each of the settings mentioned. Open-source software solutions come without any kind of quality control or warranty, which increases the risk of using them in a regulated environment.

Another key advantage of commercial solutions is that the software offered is no longer centered on dedicated analytical workbenches (e.g. for data preprocessing or data mining) but on well-engineered, business-focused solutions that automate the end-to-end activities. As an example, consider credit risk modeling, which runs from framing the business problem (e.g. modeling default risk for a mortgage portfolio) through data preprocessing (e.g. taking care of missing values and outliers), analytical model development (e.g. estimating logistic regression or decision tree models), backtesting (e.g. using traffic light indicator approaches), benchmarking (e.g. using FICO scores) and stress testing (e.g. based on sensitivity and scenario analysis) to regulatory capital calculation. Automating this entire chain of activities with open-source tools would require various scripts, likely originating from heterogeneous sources, to be matched and connected together. The result may be a patchwork of software whose overall functionality and transparency can become unstable or unclear.
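To make one link of this chain concrete, the backtesting step can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a regulatory method: the function names (binomial_tail, traffic_light), the one-sided binomial test, and the 5% and 1% cut-offs for the yellow and red zones are all choices made for this example. It also assumes defaults are independent, which real portfolios generally violate.

```python
from math import exp, lgamma, log


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each probability mass is computed in log space so that large
    portfolios do not overflow the binomial coefficient.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p)
        )
        total += exp(log_pmf)
    return total


def traffic_light(n_obligors: int, n_defaults: int, predicted_pd: float,
                  yellow: float = 0.05, red: float = 0.01) -> str:
    """Classify a rating grade by how surprising its default count is.

    The yellow/red cut-offs are illustrative assumptions, not values
    prescribed by any regulation.
    """
    # Probability of seeing at least this many defaults if the
    # predicted probability of default (PD) were correct.
    p_value = binomial_tail(n_obligors, n_defaults, predicted_pd)
    if p_value < red:
        return "red"      # predicted PD looks clearly too low
    if p_value < yellow:
        return "yellow"   # borderline, warrants closer monitoring
    return "green"        # observed defaults consistent with the PD


if __name__ == "__main__":
    # Hypothetical grade: 1,000 obligors with a predicted PD of 2%.
    for observed in (20, 29, 40):
        print(observed, traffic_light(1000, observed, 0.02))
```

Even this small step involves design decisions (which test, which thresholds, how to treat correlated defaults) that would have to be documented and validated in a regulated setting, which is the integration effort the paragraph above refers to.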

In contrast to open-source projects, commercial software vendors also offer extensive help facilities such as FAQs, technical support hotlines, newsletters and professional training courses. Another key advantage of commercial software vendors is business continuity. More specifically, centralized R&D teams (as opposed to worldwide, loosely connected open-source developers) that closely follow new analytical and regulatory developments provide a better guarantee that software upgrades will deliver the facilities required. In an open-source environment, you need to rely on the community to contribute voluntarily, which provides less of a guarantee.

A disadvantage of commercial software is that it usually comes as pre-packaged, black-box routines which, although extensively tested and documented, cannot be inspected by the more sophisticated data scientist. This is in contrast to open-source solutions, which provide full access to the source code of each of the scripts contributed.

As a final note, we currently see more and more small and medium-sized enterprises (SMEs) becoming interested in leveraging big data and analytics. Since these firms typically have only limited budgets, they are particularly interested in open-source or freeware solutions that can be used directly to analyze their data. The most popular technologies in use here are web analytics tools (e.g. Google Analytics), which SMEs use to study how their websites are being used and found, to improve their search engine ranking, or to decide on their optimal mix of organic versus paid search in online marketing.

Given the above discussion, it is clear that commercial and open-source software each have their strengths and weaknesses. Hence, it is likely that both will continue to coexist, and interfaces should be provided so that they can work together, as is the case for e.g. SAS and R/Python. The optimal mix also depends on the size of the firm (e.g. large corporate versus SME) and the maturity of its big data and analytics projects.