The answer to data inundation may be to throw most of it out


Many corporations around the world today are pursuing “Big Data analytics” in the hope of a Eureka moment that will allow them to learn something valuable about their existing customers—and help them acquire new ones. Anyone who has seen the “People You May Know” tiles pop up on Facebook and other social networking sites, or the “people who bought this also bought” suggestions on e-commerce portals, has been at the receiving end of this phenomenon. I have always been impressed by how Facebook knows that I’m connected to these people, some of whom I haven’t been in touch with for many years. I am sure buying WhatsApp, the widely used phone messaging application that has access to everyone’s full contact list on the phone, had something to do with it.

Firms are also awash with data on their own operations and plough through it incessantly in order to find information that will allow them to “transform” their operations—in a never-ending attempt to get more by spending less. Watching some senior executives at this game has often reminded me of dogs trying to catch their own tails. While it’s hugely entertaining, I feel sorry for them sometimes. They go around and around in circles, but seldom manage a nip at the coccyx.

I am not gainsaying the need for Big Data analytics. But the explosion of digital data in the wake of the Internet boom of the last decade has caused a justifiable fear of “data inundation”. And a series of Noahs have sprung up to build arks to keep companies afloat during the data deluge, peddling ever larger database storage software and machines to handle the flood. For years now, firms like Teradata Corp., Oracle Corp. and International Business Machines Corp. have sold large database management software and the computing machines to go with it. Data is now referred to in zettabytes (one zettabyte is equivalent to one trillion gigabytes).

Speaking of Noah’s arks, the serendipitously named investment advisory outfit ARK Investment Management LLC says that in 2006, the Internet produced about 0.16 zettabytes of data while the available storage capacity was only 0.09 zettabytes. This data then grew at a compound annual growth rate (CAGR) of 25% for the next decade, leaving a storage shortfall of about 500%. The prediction is that data will now grow at a CAGR of almost 40% for the next half-decade, going from 8.5 zettabytes at the end of last year to 44 zettabytes by 2020. More arks, anyone?
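That “almost 40%” figure is easy to sanity-check against the quoted endpoints. A minimal sketch (the figures are ARK’s as cited above; the function name is mine):

```python
# Sanity check of ARK's projection: 8.5 zettabytes growing to 44 zettabytes
# over the five years to 2020 implies a particular compound annual growth rate.
def implied_cagr(start, end, years):
    """Return the compound annual growth rate that takes `start` to `end`."""
    return (end / start) ** (1 / years) - 1

cagr = implied_cagr(8.5, 44, 5)
print(f"Implied CAGR: {cagr:.1%}")  # about 39%, i.e. "almost 40%"
```

The arithmetic holds up: the quoted endpoints do imply a growth rate just under 40% a year.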

Companies then add an “analytics” team to try to make sense of all this data. These teams usually have one or two expert statisticians or “data scientists” and a bunch of young, eager, offshore number crunchers who are willing to pore over reams of data like gaunt prospectors panning for gold. This data surplus has spawned an entire genre of Big Data analytics firms. The problem they blithely ignore is that they are often sifting through a load of junk in the first place, and even more junk falls upon this load every day at the Internet’s warp speed.

Data scientists are quick to point out that they first cleanse the data before working on manipulating it so that they can make sense of it. But as ARK Investment says, there’s just too much of this data floating around, and what’s more, some of it has been kept alive on company servers for many years, without ever being looked at. The problem here isn’t that the data is dirty—it’s just old. And data ages alarmingly quickly in the Internet world.

My mother, who was a successful doctor, managed the house with the same precision with which she managed her surgeries and obstetric procedures. When she got down to spring cleaning the house, she simply threw out everything that hadn’t been used for a year. It didn’t matter if it was still unused and in its original packaging; if it hadn’t been used for a year—out it went, despite howls of protest from the rest of us. It was as if rivers had been diverted through our house to wash it clean, reminiscent of Hercules cleaning the Augean stables.

The real need of the hour for many firms is a data purge, not more data science. I am not suggesting that the rules be as stringent as my mother’s or the efforts Herculean. But what I am suggesting does need corporate courage: instead of spending huge sums on buying or renting more disk space to store data, and then more money on statisticians to find ever more ways to analyse fast-decaying data, some of the money may be better spent on training a smart set of young people offshore to look exclusively for anything that is too old to be used, or was “dead on arrival”, and having that data purged.

This might limit the growth of useless computing and storage capacity, and instead encourage the use of data that is more relevant and closer to real time, from which the conclusions reached by data analytics can be readily acted upon.

And no, I don’t want to be Facebook friends with a business acquaintance I last spoke to five years ago.