If world is ready: print ("Hello World!")
Welcome! My name is Alex Antoniou and I am writing to you from Baltimore, Maryland. I am a physician trained in Nuclear Medicine and Clinical Informatics, passionate about all things digital health, and I have recently moved into the Big Data & Analytics industry. This is a fascinating and wonderful world that seemed to spring up out of nowhere for those of us who spent our 20s inside hospital walls. So it’s amazing that I get to learn all this content and let my imagination run wild with the future possibilities. I’m excited by the speed of knowledge discovery and adoption, and I ask you to join me in understanding the what, and a little bit of the how, of big data & analytics.
The tentative blog topics are as follows:
The Harvard Business Review published a widely cited article calling the Data Scientist the sexiest job of the 21st century. How could a statistical discipline entrenched in linear algebra and calculus become so popular? The answer is the popularity of, and the value added by, Big Data. Statisticians moved from modeling data in Excel and commercial software like MATLAB and SAS to open source programming languages like R and Python that could handle millions of observations in a single dataset. The insights derived from analyzing such large datasets were profound. But where would we find these large datasets? Everywhere, but when the field was first finding its footing in the business world, the place to start was startups and companies using network effects to connect millions of people across the globe. One such example is Jonathan Goldman, who joined LinkedIn in 2006 and, using information from millions of users, was able to predict which members were most likely to belong to a user’s network. The predictions, based on the background in a user’s profile, were so accurate that the click-through rate was 30% higher than for other prompts to visit new members’ pages. They generated millions of new page views, which significantly affected LinkedIn’s subsequent growth. Many other examples exist, such as the creation of Netflix’s movie recommendation system, Zynga’s game modifications to increase engagement, and most targeted ‘customer-centric’ digital advertising by Google, Facebook, Twitter and others.
The title of Data Scientist was coined in 2008, and the number of data scientists has since grown exponentially, in large part due to advancements in frameworks (such as Hadoop for distributed storage and processing), cloud computing, and data visualization. The looming shortage of data scientists, on the other hand, stems from the fact that there are no university programs to train and consequently provide a constant supply of them. Today’s Data Scientists are self-taught. It follows that they are a curious bunch with a scientific approach to analyzing data. Thus, it comes as no surprise that many of today’s data scientists come from fields strong in methodology with a computational focus, such as physics PhDs. A common thread in those fields is a foundational knowledge of math, statistics, and probability, plus a technical skill related to computer science. This foundational knowledge must be paired with domain expertise. This is an important point. Structured data is already labeled, and an industry like Wall Street investment banks and hedge funds has a singular objective: to buy low and sell high. In that regard, quantitative analysts working on Wall Street are not the type of candidate that data scientist recruiters look for. Instead, they look for people who can take unstructured data and make sense of it. If the big data is in healthcare, it pays to have some domain expertise in order to make sense of all the different types of information stored in petabytes’ worth of data.
Big data, added as a new term to the Oxford Dictionary in 2013, is just that: big. It is so many rows of observations that an Excel sheet would crash. It can have even more rows of observations than statistical software like Stata or SAS can handle before crashing (typically at about 500,000 rows of observations). Big data is defined not only by volume but also by velocity and by variety. It is real-time. If you built a prediction on historical data, you had better find a way to include real-time data to make it even more accurate. Large shopping platforms know what other products you were just looking at in your browser and can recommend similar products for purchase. Netflix includes the latest movies you just watched to build a taste profile so it can recommend what you should watch next. Lastly, big data comes in a variety of forms. It includes tweets, emails, browsing history, personal characteristics such as demographics or personal shopping history, and hard raw data like housing prices over the last 100 years. Some data is not ‘big’ in size, but many CIOs and CMOs include it in the definition because it has value, such as product transaction information, financial records, and recorded data from interaction channels such as the call center and point-of-sale touch points.
Big data has volume, velocity, and variety, and it does not seem to be slowing down. In fact, 90% of data has been generated in the past few years alone, and with the advent of the internet of things, that trend is expected to continue. Currently more than 10^18 bytes of data (that’s a quintillion!) are generated daily. We have more data than we know what to do with. Others (Forbes) have added two more V’s to the evaluation of Big Data: veracity, for the integrity of the data, and value, for its contribution to the core business goal. This is where analytics comes in. Predictive analytics is a young field that uses algorithms to analyze, model, or visualize the data so that we can make sense of it. It consists of using historical (or real-time) big data to make predictions. We’ve already been using it for more than a decade in the digital era. We’ve used it to predict a customer’s lifetime value (CLV) to a firm as well as to recommend an item the customer is more likely to buy. Digital marketing is a dominant adopter of big data & analytics, and we will introduce that landscape in the next blog. The most important foundation for good predictive analytics is having good data. As the saying goes, “garbage in, garbage out”: if the data is not cleaned, then it is of no value. Some examples of cleaning the data include removing outliers, reducing dimensions using principal component analysis, filling in missing values with the average, and standardizing features with large variations in scale to a uniform scale such as 0 to 1. The data is then identified as structured (with predefined labels or categories, such as revenue as the dependent variable and price, volume, customer loyalty, frequency of retail visits, etc. as the independent variables) or unstructured (without predefined labels).
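To make the cleaning steps above a little more concrete, here is a minimal sketch in plain Python covering three of them: removing outliers, filling in missing values with the average, and rescaling to the 0 to 1 range. The function names, the sample prices, and the choice of the 1.5 x IQR rule for spotting outliers are my own assumptions for illustration, and principal component analysis is left out to keep the example short. A real project would more likely lean on a library such as pandas or scikit-learn.

```python
from statistics import mean, quantiles


def drop_outliers(values, k=1.5):
    """Drop observed values outside [Q1 - k*IQR, Q3 + k*IQR]; keep missing (None) entries."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)  # quartiles of the observed values
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v is None or low <= v <= high]


def fill_missing_with_mean(values):
    """Replace None entries with the average of the observed values."""
    observed = [v for v in values if v is not None]
    avg = mean(observed)
    return [avg if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def clean_column(values):
    """Apply the three steps in order: outliers, then missing values, then scaling."""
    return min_max_scale(fill_missing_with_mean(drop_outliers(values)))


# Hypothetical column of prices with one missing value and one obvious outlier.
prices = [10.0, 12.0, None, 11.0, 13.0, 500.0, 14.0]
cleaned = clean_column(prices)
print(cleaned)
```

Outliers are removed before the missing value is filled so that the extreme price of 500 does not distort the average used as the fill value.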
The most common statistical algorithm applied is regression on structured data, but there are many more algorithms in the data scientist’s toolbox that can be used to fit the data more closely and improve the accuracy of predictions. With big data, the statistician moved his or her analysis from Excel and conventional statistical software packages like MATLAB and Stata to open source programming languages like R and Python that can handle far more observations, and was promoted to the title of Data Scientist, which in the domain of Big Data and analytics is the fastest-growing profession, in demand at any company that stores big data.
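As a rough illustration of regression on structured data, the sketch below fits a straight line by ordinary least squares, with price as the independent variable and revenue as the dependent one. The numbers are made up for the example (revenue is set to exactly 2 x price + 10 so the result is easy to check), and the helper names fit_line and predict are hypothetical. In practice a data scientist would usually reach for a library such as scikit-learn or statsmodels instead of writing this by hand.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the dependent variable for a new value of the independent variable."""
    return slope * x + intercept


# Hypothetical structured data: price (independent) and revenue (dependent).
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(prices, revenue)
forecast = predict(6.0, slope, intercept)
print(slope, intercept, forecast)
```

With more than one independent variable (volume, customer loyalty, and so on), the same idea extends to multiple regression, which is where a numerical library becomes the practical choice.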