What is Big Data?

“Big data” was added to the Oxford Dictionary as a new term in 2013, and it is just that: big. It has so many rows of observations that an Excel sheet would crash. It can have even more rows of observations than statistical software like STATA or SAS can comfortably handle (trouble can start at about 500,000 rows of observations).

Big data is defined not only by volume but also by velocity and variety. Velocity means it is real-time. If you built a prediction on historical data, you had better find a way to include real-time data to make it even more accurate. Large shopping platforms know which other products you were just browsing and can recommend similar products for purchase. Netflix uses the movies you just watched to build a taste profile so it can recommend what you should watch next.

Lastly, big data comes in a variety of forms. It includes tweets, emails, browsing history, personal characteristics such as demographics or personal shopping history, and hard raw data like housing prices over the last 100 years. Some data is not ‘big’ in size, but many CIOs and CMOs include it in the definition because it has value: product transaction information, financial records, and recorded data from interaction channels such as the call center and point-of-sale touch points.

Big data has volume, velocity, and variety, and its growth does not seem to be slowing down. In fact, 90% of all data has been generated in the past few years alone, and with the advent of the internet of things, that trend is expected to continue. Currently more than 10^18 bytes of data (that’s a quintillion!) are generated daily. We have more data than we know what to do with. Others (Forbes) have added two more V’s to the evaluation of big data: veracity, for the integrity of the data, and value, for its contribution to the core business goal.

This is where analytics comes in. Predictive analytics is a young field that uses algorithms to analyze, model, or visualize data so that we can make sense of it. It consists of using historical (or real-time) big data to make predictions. We have already been using it for more than a decade in the digital era, for example to predict a customer’s lifetime value (CLV) to a firm and to recommend items that the customer is more likely to buy. Digital marketing is a dominant adopter of big data and analytics, and we will introduce that landscape in the next blog post.

The most important foundation for good predictive analytics is good data. As the saying goes, “garbage in, garbage out”: if the data is not cleaned, it is of no value. Examples of cleaning the data include removing outliers, reducing dimensions using principal component analysis, filling in missing values with the average, and rescaling features that vary widely in scale to a uniform scale such as 0 to 1. The data is then identified as structured (with predefined labels or categories, such as revenue as the dependent variable and price, volume, customer loyalty, frequency of retail visits, etc. as the independent variables) or unstructured (without predefined labels).
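
To make the cleaning steps mentioned above more concrete, here is a minimal sketch in plain Python of three of them: removing outliers, filling in missing values with the average, and rescaling to a range of 0 to 1. The data, the function names, and the outlier rule (dropping values more than two standard deviations from the mean) are illustrative assumptions, not a prescribed method. Principal component analysis is left out because it needs a linear algebra library.

```python
from statistics import mean, stdev

def remove_outliers(values, z_threshold=2.0):
    """Drop observed values more than z_threshold standard deviations
    from the mean. Missing entries (None) are kept for later filling.
    The threshold of 2.0 is an assumption chosen for this small sample."""
    observed = [v for v in values if v is not None]
    avg, sd = mean(observed), stdev(observed)
    if sd == 0:
        return list(values)
    return [v for v in values if v is None or abs(v - avg) / sd <= z_threshold]

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    avg = mean(v for v in values if v is not None)
    return [avg if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical monthly retail visits per customer; None marks a missing value
# and 60 is a suspiciously large entry.
visits = [4, 5, None, 6, 5, 4, 60, 5]

trimmed = remove_outliers(visits)          # the 60 is dropped
filled = fill_missing_with_mean(trimmed)   # None becomes the average of the rest
scaled = min_max_scale(filled)             # every value now lies between 0 and 1
print(scaled)
```

Outliers are removed before the missing value is filled in so that the extreme entry does not distort the average used for filling. In practice, libraries such as pandas or scikit-learn offer these operations for much larger data sets.
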
The most common statistical algorithm applied to structured data is regression, but there are many more algorithms in the data scientist’s toolbox that can fit the data more closely and improve the accuracy of predictions. With big data, statisticians moved their analysis from Excel and conventional statistical software packages like MATLAB and STATA to open source programming languages like R and Python, which can handle many more observations. Along the way they promoted their title to Data Scientist, which in the domain of big data and analytics is the fastest-growing profession and is in demand at any company that stores big data.
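
As an illustration of regression on structured data, the following sketch fits a straight line by ordinary least squares in plain Python, with revenue as the dependent variable and frequency of retail visits as the single independent variable. The numbers and the function name are made up for the example. A real analysis would usually include several independent variables and use a library such as scikit-learn or statsmodels.

```python
from statistics import mean

def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical structured data: retail visits per month and revenue per customer.
visits = [1, 2, 3, 4, 5]
revenue = [12.0, 19.0, 31.0, 42.0, 48.0]

slope, intercept = fit_simple_linear_regression(visits, revenue)

# Use the fitted line to predict revenue for a customer with 6 visits.
predicted_revenue = intercept + slope * 6
print(slope, intercept, predicted_revenue)
```

The slope estimates how much revenue changes for each additional visit, which is the kind of prediction the paragraph above describes.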


McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60-68.

Arthur, L. (2013). What is big data? Forbes. <http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/#299fb2a53487>. Accessed 11/06/2016.

Davenport, T. (2014). A predictive analytics primer. Harvard Business Review. <https://hbr.org/2014/09/a-predictive-analytics-primer>. Accessed 11/06/2016.