Algorithmic Black Boxes, Data Scientists and Trade-offs
There is a thought-provoking article published in Forbes on the directions in data science and their probable implications. Its title is quite catchy: "How Data Scientists turned against Statistics". Going by the title alone, it seems to capture an ongoing conflict between data scientists relying on their tools and the statistical methods and kit one had grown used to in the days of small datasets. The article seems to posit a trade-off: the larger the dataset, the lower the statistical rigour of the individual analyst, who will instead prefer automated tools. The sophisticated construction of statistical methodology and the inferences that follow from it give way to automated tools for data collection, analysis and inference, with little control left in the hands of the data scientist. In the words of the author, the giant leap into the world of data has been accompanied by a giant leap of faith, the outcome of which is that the core tenets of statistics become irrelevant in a universe of large datasets.
In the days preceding the arrival of big data, it was not possible to collect data for the entire population. The via media was to collect a sample, which would then be extrapolated to the population. The sample had to be representative of the population, and hence great effort went into making it so. Any error in sampling would dilute the findings, skewing the very objectives being pursued. The arrival of big data changed the scenario. It was no longer essential to work with samples, as enough tools emerged to gain insights from the whole population. Yet as these tools emerged, the manual had to give way to the automated, and this brought with it several challenges.
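The point about representativeness is easy to make concrete. The short Python sketch below is not from the article; the population, the sample size and the bias mechanism are all invented for illustration. It draws two samples from the same simulated population and shows how the unrepresentative one skews the estimate of the mean.

```python
# A minimal sketch of why representativeness matters: a biased sample skews
# an estimate even when the full population is, in principle, knowable.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: incomes of 1,000,000 people (arbitrary log-normal parameters).
population = rng.lognormal(mean=10.0, sigma=0.8, size=1_000_000)

# Representative sample: every member equally likely to be drawn.
representative = rng.choice(population, size=1_000)

# Unrepresentative sample: selection probability grows with income,
# e.g. a survey that mostly reaches the well-off.
weights = population / population.sum()
biased = rng.choice(population, size=1_000, p=weights)

print(f"Population mean:            {population.mean():,.0f}")
print(f"Representative sample mean: {representative.mean():,.0f}")
print(f"Biased sample mean:         {biased.mean():,.0f}")  # noticeably higher
```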
As data scientists sought to generate big data and unearth insights from it, they were forced to rely on automated systems, which over time have led to the emergence and growth of artificial intelligence, data mining, deep learning and so on. The context was ripe for the rise of data mining and data warehousing firms, as it was for artificial intelligence and deep learning firms. Demand created its supply. Of course, it could equally be argued that the rise of these firms and their offerings created supply in the market, the corollary of which was rising demand among data scientists seeking to improve their findings; thus supply created its demand. However, as in the proverbial chicken-and-egg story, it is less important to know the origins than to understand the changes that followed.
The firms sought to leverage the increasing demand. This necessitated the development of algorithms whose complexity increased exponentially with each iteration. The algorithms were their key differentiators, compelling the firms to look at intellectual property protection. This led not just to copyrighting the expression of the code but to patenting the algorithms themselves. A product of this was the increasing opacity of the models underlying the algorithms. The foundations of these algorithms, from sentiment analysis to demographic estimation to network construction and analysis, all trace to an opaque black box, or more precisely a set of black boxes. Analysts and inferential scientists have no access to these black boxes. The implication is a forced dependence on these invisible hands, which govern the insights unearthed from the data and thus key inputs into decision making. As the author points out, it is not unusual to find different results at different times for similar queries, which should cast doubt on the efficacy of the underlying algorithms and software tools. There is hardly any insight into the models being used, the samples collected, the quantum of missing data, the quantum of data searched, or the computational mechanisms behind these enrichment estimates, among other things. Far from it: there is hardly any significant worry about efficacy or the lack of it. This should in fact be an important consideration, since vendor lock-in is quite significant in these analytics industries, with high switching costs once firm-specific or sector-specific algorithms have been designed and implemented. The time-tested paradigm of trust but verify has all but disappeared.
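What "trust but verify" might look like in practice can be sketched briefly. The Python example below is purely illustrative: black_box_score is a hypothetical stand-in for an opaque vendor model, not a real API, and the lexicon baseline is deliberately crude. The point is only that even without access to the black box, an analyst can check its outputs for consistency and compare them against something fully auditable.

```python
# Hypothetical sketch of "trust but verify" checks around an opaque model.
import random
import statistics

def black_box_score(text: str) -> float:
    """Stand-in for an opaque vendor sentiment score in [-1, 1] (hypothetical)."""
    # Deliberately non-deterministic, mimicking the drift described in the article.
    return random.uniform(-1, 1)

def transparent_baseline(text: str) -> int:
    """A trivially inspectable lexicon count: crude, but fully auditable."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "terrible"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

query = "the product was great but the support was terrible"

# Check 1: does the black box give consistent answers to the same query?
runs = [black_box_score(query) for _ in range(5)]
print("black-box runs:", [round(r, 2) for r in runs],
      "| spread:", round(statistics.pstdev(runs), 2))

# Check 2: does it at least agree in direction with an auditable baseline?
print("auditable baseline score:", transparent_baseline(query))
```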
The author traces this to what he terms the encroachment of non-traditional disciplines into data science. To be an expert in data science, one needs an understanding of mathematics and statistics, combined with strong programming skills. What would be indispensable is a thorough knowledge of the subject being explored, or the subject for which the algorithm is being designed. In a universe of modularization and specialization, each task is assigned to a different group, which may or may not have cross-functional expertise. Algorithmic results are a function of implementation details, yet data scientists often lack understanding of the programming dimensions, and vice versa. Those with programming backgrounds often find themselves short of understanding the intricacies of numerical methods and algorithm design, creating asymmetries, and those with deep programming skills lack statistical knowledge in depth. There is a failure to understand how small data, once aggregated, transforms at scale into big data.
Furthermore, the scientists who have to decode the results come from disciplines with little grounding in either statistics or programming. Their core competence lies in understanding the questions for which they want answers. They know the questions they need to ask, they know the questions they need to frame, and they know the data that is perhaps required, both in quality and in scale. But they are heavily dependent on the results the algorithms generate. Most have no access to the parent dataset and rely instead on derived child datasets, potentially creating problems in identifying and validating the results. Yet they have little clue whether the answers generated are relevant or potentially misleading. They are perhaps too trusting of the results, given that they lack the tools for verification. The growing commercialisation, accompanied by increasingly streamlined, turnkey offerings, is perhaps the cause of these challenges.
The resolution of these issues would lie in integrating the disciplines of numerical science, data science and programming. Furthermore, there needs to be an emphasis on reducing the black boxes. Users must have a right to access the models, and to validate and replicate them, so that they can make informed decisions. Open-source models have not taken off, for multiple reasons. Even in their absence, there needs to be transparency in algorithms. The debate on transparency has just begun.