UNDERSTANDING BIG DATA ANALYTICS
(Professor at the Indian Statistical Institute, Kolkata)
In the era of Big Data, we are like the blind men of the parable standing before an elephant, trying to conceptualize the big animal from partial impressions only. Unless we understand the full nature of Big Data, we will feel only one part of the elephant’s body and conclude that the elephant is like a winnowing basket (ear), a plowshare (tusk), a plow (trunk), a pillar (foot), and so on.
It’s not easy to see why the analytics tools we use for small datasets cannot simply be reused when our data grows. For example, to find the simple average of ‘n’ numbers, we just add them and divide the sum by ‘n’. Apparently, the approach is the same whether ‘n’ is 100 or 100 billion. However, if all the numbers are large and positive, the sum of 100 billion such numbers might overflow, that is, exceed the largest value the computer can represent. So we need to adjust the algorithm appropriately to find the average. That’s the extra bit of cosmetic surgery needed for handling Big Data.
Consider the example of simple multiplication. We learnt the multiplication tables in our elementary classes, and they are stored in our memory. However, we need some additional techniques for multiplying two big numbers, say with hundreds of digits. We use our memory to multiply one number by every digit of the other, one by one, starting from the units place. For each digit of the second number, the result is written in a row: we multiply each digit of the first number in turn, starting from the units digit, and use the ‘carrying method’ to add the residual, if any. For the next digit of the second number, we move to the next row and start one place to the left. Finally, we add all the rows. This algorithm for multiplication is derived from the knowledge of tables, combined with some special techniques. It can be interpreted as a Big Data problem: special techniques on top of the standard single-digit multiplication tables are needed to solve it.
Consider another simple mathematical problem, that of sorting. Suppose we are to sort 5 numbers in increasing order. In our elementary classes, we could easily sort them just by looking at the numbers; certainly some algorithm runs within our brain to sort them. A smarter student might sort 10 such numbers at once, maybe 20. But we certainly cannot sort 100 numbers, or say 100,000 numbers, just by looking at them. We need some algorithm on top of our unexplainable internal algorithm, which we learn in higher classes. This is another example of a Big Data problem that we have been tackling for years.
Data analytics mostly comprises statistical methodologies such as regression analysis, classification and clustering techniques, some standard estimation and testing procedures, etc.
While most of these theories are neatly developed in the statistical literature and easily applied to data of small to moderate size, one might need to manipulate the data intelligently and devise novel techniques for unusual data formats. But the real challenge, even for standard ready-to-use techniques, lies in the limitations of using loads of data with a huge number of variables. One reason is the presence of ‘spurious’ or ‘nonsense’ correlations among different variables: the more variables we handle, the more such correlations we face. Unless we can identify and carefully separate out the unimportant variables, and learn to handle lots of spurious correlations, we cannot aspire to meaningful analyses of data. That is a daunting task, and a theoretically challenging one too. In addition, even a standard regression analysis with loads of data and, say, 10,000 variables needs additional computational techniques. For example, we might need to invert a 10000×10000 matrix, which is computationally expensive and becomes intractable as the number of variables grows.
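The point about spurious correlations can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are pure Gaussian noise, the sample size of 50 is arbitrary, and the function names `pearson` and `max_abs_correlation` are hypothetical helpers introduced here, not part of any library. Every candidate variable is generated independently of the target, so any correlation found is spurious by construction.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_correlation(n_obs, n_vars, seed=0):
    """Largest |correlation| between a noise target and n_vars unrelated variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        # Each candidate is pure noise, generated independently of the target.
        candidate = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # Same seed, so each larger set of variables contains the smaller one.
    for n_vars in (10, 100, 1000):
        print(n_vars, round(max_abs_correlation(n_obs=50, n_vars=n_vars), 3))
```

Because the seed is fixed, the larger sets of variables contain the smaller ones, so the printed maximum can only stay the same or grow as more unrelated variables are added. None of these variables carries any information about the target, yet the strongest one may look like a meaningful predictor.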
Then, what about the ocean of data? With the advent of computers and the internet, and with virtually everything brought under the ‘Internet of Things’, a gigantic amount of data is generated continuously, and the ever-expanding horizon of ‘Data’ is now growing faster than ever. An IBM report of mid-2017 stated that 2.5 quintillion bytes of data are created per day, and according to a Forbes article (2015), by 2020 about 1.7 megabytes of new information per second was expected to be created for every human being on the planet. Such Big Data is a boon and a curse at the same time. Are we really capable of leveraging it? With the present expertise, the answer is emphatically ‘NO’. A computer cannot hold in memory, at one time, data larger than its RAM, be it 4 GB or 64 GB. Thus, we need to devise statistical techniques that accommodate this huge data in a stage-by-stage way, exactly as the additional technique of ‘carrying’ and writing line by line is deployed for multiplying two big numbers, on top of the a priori knowledge of multiplying two single-digit numbers. And only top statisticians collaborating with excellent computational experts might produce such techniques, and even then only case by case. The quest for understanding the whole elephant will continue.
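As a small illustration of the stage-by-stage idea, the averaging example from earlier can be reworked so that the data are read one chunk at a time and the total sum is never formed. This is a minimal sketch, not a definitive method: the function names `naive_mean` and `streaming_mean` and the sample values are hypothetical, the numbers are assumed to be large and positive as in the text, and a running mean in floating point still accumulates some rounding error.

```python
import math


def naive_mean(values):
    """Add everything first, then divide: the running total can overflow."""
    total = 0.0
    count = 0
    for x in values:
        total += x
        count += 1
    return total / count


def streaming_mean(chunks):
    """Running mean over an iterable of chunks, one chunk in memory at a time."""
    mean = 0.0
    count = 0
    for chunk in chunks:
        for x in chunk:
            count += 1
            # Incremental update: the running value stays on the scale of the
            # data instead of growing like the total sum.
            mean += (x - mean) / count
    if count == 0:
        raise ValueError("no data")
    return mean


if __name__ == "__main__":
    # Hypothetical data: values close to the largest representable float.
    chunks = [[1e308, 1e308, 1e308], [1e308, 1e308]]
    flat = [x for chunk in chunks for x in chunk]
    print(math.isinf(naive_mean(flat)))  # the total overflowed along the way
    print(streaming_mean(chunks))
```

Here `chunks` could just as well be a generator that reads blocks of a file from disk, so that only one block needs to fit in RAM at any moment. The update rule plays the role of the ‘carrying’ step: a small extra technique layered on top of the familiar definition of the average.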