In 2012, the Harvard Business Review published a widely cited article calling the Data Scientist the sexiest job of the 21st century. How could a statistical discipline rooted in linear algebra and calculus become so popular? The answer is the rise of Big Data and the value it adds. Statisticians moved from modeling data in Excel and commercial software like MATLAB and SAS to open source programming languages like R and Python, which could handle millions of observations in a single dataset. The insights derived from analyzing such large datasets were profound. But where would we find these large datasets? Everywhere, although when the field was first finding its footing in the business world, they came mainly from startups and companies that used network effects to connect millions of people across the globe. One example is Jonathan Goldman, who joined LinkedIn in 2006 and used information from millions of users to predict which members were most likely to belong to a user’s network. The predictions, based on the background information in a user’s profile, were so accurate that their click-through rate was 30% higher than that of other prompts to visit members’ pages. They generated millions of new page views, which contributed significantly to LinkedIn’s subsequent growth. Many other examples exist, such as the creation of Netflix’s movie recommendation system, Zynga’s game modifications to increase engagement, and highly targeted, ‘customer-centric’ digital advertising by Google, Facebook, Twitter, and others.
The title of Data Scientist was coined in 2008, and the number of data scientists has since grown rapidly, largely due to advancements in frameworks (such as Hadoop for distributed storage and processing), cloud computing, and data visualization. The looming shortage, on the other hand, stems from the lack of university programs to train and provide a constant supply of Data Scientists. Today’s Data Scientists are largely self-taught, which suggests that they are a curious bunch with a scientific approach to analyzing data. Thus, it comes as no surprise that many of today’s data scientists come from methodologically rigorous fields with a computational focus, such as physics PhD programs. A common thread in those fields is a foundational knowledge of math, statistics, and probability, together with technical skills in computer science. This foundational knowledge must be paired with domain expertise. This is an important point. Structured data is already labeled, and an industry like Wall Street investment banking and hedge funds has a singular objective: to buy low and sell high. In that regard, quantitative analysts working on Wall Street are not the type of candidate that data science recruiters look for. Instead, recruiters look for people who can take unstructured data and make sense of it. If the big data is in healthcare, for example, it pays to have some healthcare domain expertise in order to make sense of all the different types of information stored in petabytes of data.
Because demand for data scientists is high and companies continue to accumulate data, salaries for data scientists continue to rise (typically past the six-figure mark). This has created an opportunity for consulting firms such as Accenture, Deloitte, and IBM Global Services to create divisions dedicated to analytics consulting for companies that can’t afford or don’t need a full-time analytics team.
So what does it take to become a data scientist?
- Databases: MySQL, MongoDB, PostgreSQL
- Distributed Computing: Apache Hadoop/Hive/Spark, MapReduce
- Programming: R or Python, plus Java (for writing production code)
- Statistics, linear algebra, and calculus
- Machine Learning (e.g., the scikit-learn library in Python)
- Data Munging (cleaning and formatting the data so it can be used in R or Python)
- Data Visualization: Tableau, a visualization and analytics tool used for business intelligence that non-technical analysts and marketers can also use
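To make the data munging and modeling items above a little more concrete, here is a minimal sketch using only the Python standard library. The dataset, the column names, and the helper functions `munge` and `fit_line` are hypothetical and invented for illustration. In practice, libraries such as pandas and scikit-learn would typically handle these steps.

```python
import csv
import io
from statistics import mean

# Hypothetical raw export: stray whitespace, inconsistent header casing,
# and one record with a missing value.
RAW_CSV = """user_id, Connections ,page_views
1, 10 ,25
2, 20 ,45
3, ,30
4, 40 ,85
"""


def munge(raw_text):
    """Parse CSV text into clean numeric rows, dropping incomplete records."""
    reader = csv.DictReader(io.StringIO(raw_text))
    rows = []
    for record in reader:
        # Normalize header names and strip whitespace from every value.
        clean = {key.strip().lower(): value.strip() for key, value in record.items()}
        if not all(clean.values()):
            continue  # skip rows that have an empty field
        rows.append({key: float(value) for key, value in clean.items()})
    return rows


def fit_line(xs, ys):
    """Ordinary least squares with a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


rows = munge(RAW_CSV)
slope, intercept = fit_line(
    [row["connections"] for row in rows],
    [row["page_views"] for row in rows],
)
print(f"page_views = {slope:.2f} * connections + {intercept:.2f}")
```

The munging step accounts for most of the code here, which reflects how much of the work tends to go into cleaning data before any model is fit.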
Finally, watch what life is like for Josh Wills, a Data Scientist at Airbnb:
And for Hilary Mason, Chief Scientist at Bitly:
Big Data Resources to Practice with:
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.html
Freely available government data: www.data.gov
AWS Public Datasets: https://aws.amazon.com/datasets/
Kaggle Datasets: https://www.kaggle.com/datasets
Davenport, Thomas, and D.J. Patil. (2012). Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review. <https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century>. Accessed on 11/08/2016.
Udacity. (2014). 8 Skills You Need to Be a Data Scientist. <http://blog.udacity.com/2014/11/data-science-job-skills.html>. Accessed on 11/08/2016.