Algorithmic Black Boxes, Data Scientists and Trade-offs
There is a thought-provoking article published in Forbes on the directions in data science and their probable implications. Its title is quite catchy: "How Data Scientists turned against Statistics". Going by the title alone, it seems to capture an ongoing conflict between data scientists relying on their tools and the statistical methods and kit one had grown used to in the days of small datasets. The article seems to posit a trade-off: the larger the dataset, the lower the statistical rigour of the individual analyst, who will instead prefer automated tools. The sophisticated construction of statistical methodology and the inferences that follow from it give way to automated tools for data collection, analysis and inference, with little control left in the hands of the data scientist. In the words of the author, the giant leap into the world of data has been accompanied by a giant leap of faith, the outcome of which is that the core tenets of statistics become irrelevant in a universe of large datasets.
In the days preceding the arrival of big data, it was not possible to collect data for the entire population. The via media was to collect a sample, which would then be extrapolated to the population. The sample had to be representative of the population, and hence great effort went into making it so. Any error in sampling would dilute the findings, skewing the very objectives being pursued. The arrival of big data changed the scenario. It was no longer essential to work with samples, as enough tools emerged to gain insights from the whole population. Yet as these tools emerged, the manual had to give way to the automated, and this brought with it several challenges.
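The point about representativeness is easy to make concrete. The short Python sketch below is not from the article; the population, the sample size and the bias mechanism are all invented for illustration. It draws two samples from the same simulated population and shows how the unrepresentative one skews the estimate of the mean.

```python
# A minimal sketch of why representativeness matters: a biased sample skews
# an estimate even when the full population is, in principle, knowable.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: incomes of 1,000,000 people (arbitrary log-normal parameters).
population = rng.lognormal(mean=10.0, sigma=0.8, size=1_000_000)

# Representative sample: every member equally likely to be drawn.
representative = rng.choice(population, size=1_000)

# Unrepresentative sample: selection probability grows with income,
# e.g. a survey that mostly reaches the well-off.
weights = population / population.sum()
biased = rng.choice(population, size=1_000, p=weights)

print(f"Population mean:            {population.mean():,.0f}")
print(f"Representative sample mean: {representative.mean():,.0f}")
print(f"Biased sample mean:         {biased.mean():,.0f}")  # noticeably higher
```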
As data scientists sought to generate big data and unearth insights from it, they were forced to rely on automated systems, which over time have led to the emergence and growth of artificial intelligence, data mining, deep learning and so on. The context was ripe for the rise of data mining and data warehousing firms, as it was for artificial intelligence and deep learning firms. Demand created its supply. Of course, it could equally be argued that the rise of these firms and their offerings created supply in the market, the corollary of which was rising demand among data scientists seeking to improve their findings; thus supply created its demand. However, as in the proverbial chicken-and-egg story, it is less important to know the origins than to understand the changes that followed.
The firms sought to leverage the increasing demand. This necessitated the development of algorithms whose complexity increased exponentially with each iteration. The algorithms were their key differentiators, compelling the firms to look at intellectual property protection. This led not just to copyrighting the expression of the code but to patenting the algorithms themselves. A product of this was the increasing opacity of the models underlying the algorithms. The foundations of these algorithms, from sentiment analysis to demographic estimation to network construction and analysis, all trace to an opaque black box, or more precisely a set of black boxes. Analysts and inferential scientists have no access to these black boxes. The implication is a forced dependence on these invisible hands, which govern the insights unearthed from the data and thus key inputs into decision making. As the author points out, it is not unusual to find different results at different times for similar queries, which should cast doubt on the efficacy of the underlying algorithms and software tools. There is hardly any insight into the models being used, the samples collected, the quantum of missing data, the quantum of data searched, or the computational mechanisms behind these enrichment estimates, among other things. Far from it: there is hardly any significant worry about efficacy or the lack of it. This should in fact be an important consideration, since vendor lock-in is quite significant in these analytics industries, with high switching costs once firm-specific or sector-specific algorithms have been designed and implemented. The time-tested paradigm of trust but verify has all but disappeared.
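What "trust but verify" might look like in practice can be sketched briefly. The Python example below is purely illustrative: black_box_score is a hypothetical stand-in for an opaque vendor model, not a real API, and the lexicon baseline is deliberately crude. The point is only that even without access to the black box, an analyst can check its outputs for consistency and compare them against something fully auditable.

```python
# Hypothetical sketch of "trust but verify" checks around an opaque model.
import random
import statistics

def black_box_score(text: str) -> float:
    """Stand-in for an opaque vendor sentiment score in [-1, 1] (hypothetical)."""
    # Deliberately non-deterministic, mimicking the drift described in the article.
    return random.uniform(-1, 1)

def transparent_baseline(text: str) -> int:
    """A trivially inspectable lexicon count: crude, but fully auditable."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "terrible"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

query = "the product was great but the support was terrible"

# Check 1: does the black box give consistent answers to the same query?
runs = [black_box_score(query) for _ in range(5)]
print("black-box runs:", [round(r, 2) for r in runs],
      "| spread:", round(statistics.pstdev(runs), 2))

# Check 2: does it at least agree in direction with an auditable baseline?
print("auditable baseline score:", transparent_baseline(query))
```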
The author traces this to what he terms the encroachment of non-traditional disciplines into data science. To be an expert in data science, one needs an understanding of mathematics and statistics, combined with strong programming skills. What would be indispensable is a thorough knowledge of the subject being explored, or the subject for which the algorithm is being designed. In a universe of modularization and specialization, each task is assigned to a different group, which may or may not have cross-functional expertise. Algorithmic results are a function of implementation details, yet data scientists often lack understanding of the programming dimensions, and vice versa. Those with programming backgrounds often find themselves short of understanding the intricacies of numerical methods and algorithm design, creating asymmetries, and those with deep programming skills lack statistical knowledge in depth. There is a failure to understand how small data, once aggregated, transforms at scale into big data.
Furthermore, the scientists who have to decode the results come from disciplines with little grounding in either statistics or programming. Their core competence lies in understanding the questions for which they want answers. They know the questions they need to ask, they know the questions they need to frame, and they know the data that is perhaps required, both in quality and in scale. But they are heavily dependent on the results the algorithms generate. Most have no access to the parent dataset and rely instead on derived child datasets, potentially creating problems in identifying and validating the results. Yet they have little clue whether the answers generated are relevant or potentially misleading. They are perhaps too trusting of the results, given that they lack the tools for verification. The growing commercialisation, accompanied by increasingly streamlined, turnkey offerings, is perhaps the cause of these challenges.
The resolution of these issues would lie in integrating the disciplines of numerical science, data science and programming. Furthermore, there needs to be an emphasis on reducing the black boxes. Users must have a right to access the models, and to validate and replicate them, so that they can make informed decisions. Open-source models have not taken off, for multiple reasons. Even in their absence, there needs to be transparency in algorithms. The debate on transparency has just begun.