Data Analysis: Exploration, Patterns, Prediction and Causality

Textbook for Business, Economics and Policy

Data Analysis: Exploration, Patterns, Prediction and Causality is a textbook aimed primarily at business, applied economics and public policy students. It may be taught in MBA, MA in Economics (non-PhD track), MSc in Business Economics/Management, MA in Public Policy, PhD in Management and comparable programs. It is also a natural fit for Business Analytics graduate programs.

Data Analysis: Exploration, Patterns, Prediction and Causality by Gábor Békés and Gábor Kézdi is forthcoming with Cambridge University Press (Spring, 2019). Cambridge University Press is the not-for-profit publishing business of the University of Cambridge and the oldest publishing house in the world; its aim is to disseminate knowledge in the pursuit of education, learning and research at the highest international levels of excellence.

The textbook material may be fully covered in a year-long course (for example, in the first year of a two-year Master's program or a PhD program). It covers material for a series of courses or modules, and chapters may be used to assemble programs of various lengths.

Our textbook covers integrated knowledge of methods and tools traditionally scattered across various fields such as econometrics, machine learning and practical business statistics. The book has four major sections:

  1. DATA EXPLORATION
  2. PATTERNS: REGRESSION ANALYSIS
  3. PREDICTION
  4. CAUSALITY: THE EFFECTS OF INTERVENTIONS

Each of the four major sections includes six chapters. For details, see the list of contents.

State-of-the-art knowledge in data analysis includes traditional regression analysis, causal analysis of the effects of interventions, predictive analytics using regression and machine learning tools, and practical skills for working with real-life data and collecting data.

We cover relatively few methods but help students gain a deep, intuitive understanding of them. The upside is that visualization and interpretation of results can become the focus of the analysis.

Applied knowledge can be acquired only by working through many applications. Students will use real-life data; we provide data and code as part of an online ancillary platform. The textbook supports R and Stata: we provide code in both languages for all case studies and all exhibits in the textbook.

After completing courses based on this textbook, students will know how to manage analytical projects, starting with unstructured questions and messy data and arriving at readily interpretable results and effective presentation.

The illustration studies and examples presented in the textbook are all connected to real life problems in business and public policy. We are cooperating with major companies, NGOs and policy institutions to jointly develop interesting case studies.

Our textbook is complemented by extensive online material, including data, code, additional case studies, practice questions, sample exams and data exercises. This helps instructors choose case studies that better fit their student audience and distribute content and exercises across lectures, practice sessions and assignments.

DATA EXPLORATION

  1. Origins of data (data table, data quality, scraping, surveys, sampling, ethics)
  2. Preparing data for analysis (tidy data, source of variation, variable types, relational data, missing data, data cleaning)
  3. Describing variables (frequency and probability, distribution, extreme values, summary statistics, theoretical distributions)
  4. Comparison and correlation (conditioning, source of variation, conditional probability, conditional distribution, conditional expectation, correlation, visualization)
  5. Generalizing from a dataset (repeated samples, confidence interval, standard error estimation via bootstrap and formula, external validity)
  6. Testing hypotheses (null and alternative hypotheses, t-test, false positives / false negatives, p-value, testing multiple hypotheses)

PATTERNS: REGRESSION ANALYSIS

  1. Simple regression analysis (non-parametric regression, linear regression, OLS, predicted values and residuals, regression and correlation, regression and causality)
  2. Complicated patterns and messy data (taking log and other transformations of variables, piecewise linear splines and polynomials, influential observations, measurement error in variables, using weights)
  3. Generalizing results of regression analysis (standard error of regression coefficients, confidence interval, prediction interval, testing, external validity)
  4. Multiple linear regression (multiple linear regression mechanics, binary and other qualitative right-hand-side variables, interactions, multiple regression and causal analysis)
  5. Probability models (linear probability, logit and probit, marginal differences, goodness of fit)
  6. Time series regressions (special features of time series data, trends, seasonality, lags, serial correlation)

PREDICTION

  1. Framework for prediction (prediction error, loss function, RMSE, prediction with regression, overfitting, external validity, machine learning)
  2. Model selection (adjusted measures of in-sample fit, cross-validation, shrinkage/LASSO)
  3. Regression trees (CART, stopping rules, pruning, search algorithms, regression vs CART)
  4. Random forest (boosting, decorrelating trees, regression vs random forest)
  5. Classification (calibration and Brier score, ROC/AUC, classification with logit vs. random forest)
  6. Forecasting from time series data (cross-validation in time series, ARIMA, vector autoregression, CART/random forest in time series)

CAUSALITY: THE EFFECTS OF INTERVENTIONS

  1. A framework for causal analysis (potential outcomes, average treatment effect, confounders, selection, reverse causality, causal analysis with regression, overview of methods)
  2. Experiments (field experiments, A/B testing, randomization and balance, attrition, power, internal and external validity)
  3. Methods for observational cross-section data (matching vs. regression, common support, instrumental variables, regression discontinuity)
  4. Methods for observational time series (policy interventions, anticipation effects, event study, impulse response analysis)
  5. Difference in differences (parallel trends, panel vs. repeated cross-section, synthetic controls)
  6. Methods for observational panel data (fixed effects, first differences, clustered standard errors, unbalanced panels)

The textbook is written by Gábor Békés and Gábor Kézdi.

Yes, two Gabors. What are the odds?

We have been teaching related courses for quite a few years in many places.

How is this book different?

Economic Statistics, Applied Econometrics and Data Science together - mixing the best of all worlds

The most important classic and new tools - from data visualization to regression analysis to machine learning

Curated and focused content - selecting the most useful methods, emphasizing intuitive understanding and precise interpretation of results

Data and lab - data and code in R and Stata provided to practice coding and enhance learning

No Mickey Mouse data - real-life examples and challenges from business, economics and policy problems, with an emphasis on data wrangling and modeling choices

Practice questions and data exercises - checking understanding and inviting hands-on practice

We cover integrated knowledge of methods and tools traditionally scattered across various fields such as econometrics, data science and practical business statistics. State-of-the-art knowledge in data analysis includes traditional regression analysis, causal analysis of the effects of interventions, predictive analytics using regression and machine learning tools, and practical skills for working with real-life data and collecting data. We focus on (i) data exploration, (ii) pattern discovery through regression analysis, (iii) uncovering the effects of interventions, and (iv) building prediction models.

Data collection is often an integral part of data analysis. Understanding how the data was born is also necessary to assess its quality. Whoever collected the data in the first place, analysts need to restructure and clean it before it can be analyzed. This process, often called data wrangling, takes a lot of data analysts' time and effort, and how it is done affects the results of the subsequent analysis. Even after extensive cleaning, the data used in the analysis is typically different from the ideal dataset that would serve the analysis best. Analysts need to use appropriate tools to collect, clean and restructure data. Moreover, they need a thorough understanding of the quality of their data, including how it differs from the ideal data, so that they can interpret their results in appropriate ways.
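
For instance, a typical cleaning step might look like the following in R (a minimal sketch with hypothetical toy data, not code from the book): converting a price stored as text into a number, flagging missing values, and dropping duplicate records.

    # Hypothetical toy data: a price column stored as text, with a duplicate row
    raw <- data.frame(
      id    = c(1, 2, 2, 3),
      price = c("120,000", "95,500", "95,500", "n.a."),
      stringsAsFactors = FALSE
    )
    raw$price_num  <- as.numeric(gsub(",", "", raw$price))  # "n.a." becomes NA (with a warning)
    raw$price_miss <- is.na(raw$price_num)                  # keep track of missingness
    clean <- raw[!duplicated(raw$id), ]                     # keep one row per observation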

Describing data, or exploratory data analysis, is an important part of analysis, yet it is often overlooked in textbooks. This textbook gives an introduction to the most important statistical and visual tools of data description. It also incorporates the most important statistical knowledge needed to make inference - in other words, to generalize from a dataset. Besides providing an intuitive working knowledge of how to produce and interpret confidence intervals and hypothesis tests, it emphasizes the role of external validity and the fact that statistical tools complement but do not substitute for thinking.
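
As an illustration of generalizing from a dataset, a bootstrap confidence interval for a mean can be computed in a few lines of R (a minimal sketch with simulated placeholder data, not code from the book):

    set.seed(42)
    x <- rnorm(200, mean = 10, sd = 3)                              # placeholder data
    boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))  # resample with replacement
    se_boot <- sd(boot_means)                                       # bootstrap standard error
    ci_95 <- mean(x) + c(-1.96, 1.96) * se_boot                     # approximate 95% confidence interval
    ci_95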

Uncovering patterns of association in the data is an important goal in itself. It is also the starting point for carrying out predictions and uncovering the effect of interventions.

The textbook starts with simple regression analysis, the method that compares expected y for different values of x to learn the patterns of association between the two variables. It discusses non-parametric regressions and focuses on linear regression. It builds on simple linear regression and goes on to enrich it: nonlinear functional forms, generalizing from a particular dataset to other data it represents, adding more explanatory variables, and so on.
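
In R, a simple linear regression of this kind can be estimated with lm() (a minimal sketch with hypothetical variable and data frame names, not code from the book):

    # Placeholder data frame `hotels` with hotel price and distance to the city center
    fit <- lm(price ~ distance, data = hotels)   # OLS: expected price as a linear function of distance
    summary(fit)                                 # coefficients, standard errors, R-squared
    hotels$yhat  <- predict(fit)                 # predicted values
    hotels$resid <- residuals(fit)               # residuals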

The textbook also covers regression analysis for binary dependent variables and time series data, as well as nonlinear models such as logit and probit. Understanding the intuition behind the methods, their applicability in various situations, and the correct interpretation of their results are the constant focus of the textbook.
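
For binary dependent variables, a logit model can be estimated in R with glm() (a minimal sketch with hypothetical variable names, not code from the book):

    # Placeholder data frame `customers` with a binary (0/1) outcome `bought`
    logit_fit <- glm(bought ~ price + income, data = customers,
                     family = binomial(link = "logit"))
    summary(logit_fit)
    customers$p_hat <- predict(logit_fit, type = "response")   # predicted probabilities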

Data analysis in business and policy applications is often aimed at prediction. The textbook introduces tools to evaluate predictions, such as loss functions or the Brier score. It emphasizes the importance of out-of-sample prediction, the role of stationarity, the dangers of overfitting and the use of training and testing samples and cross-validation.
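
The core idea of out-of-sample evaluation can be sketched in a few lines of R (a minimal sketch with a placeholder data frame df and outcome y, not code from the book): fit the model on a training set and measure RMSE on a holdout set.

    set.seed(1)
    train_idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))  # 80% training set
    train <- df[train_idx, ]
    test  <- df[-train_idx, ]
    fit   <- lm(y ~ ., data = train)                    # fit on training data only
    pred  <- predict(fit, newdata = test)               # predict on the holdout set
    rmse  <- sqrt(mean((test$y - pred)^2))              # out-of-sample root mean squared error
    rmse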

The book presents and compares the most important predictive models that may be useful in various situations, such as time series regressions, classification tools and tree-based machine learning methods.
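
For example, a linear regression and a random forest can be compared on the same holdout set (a minimal sketch, reusing the train/test split from the sketch above and assuming the randomForest package is installed; not code from the book):

    library(randomForest)
    ols_fit <- lm(y ~ ., data = train)                         # benchmark: linear regression
    rf_fit  <- randomForest(y ~ ., data = train, ntree = 500)  # tree-based ensemble
    rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
    c(ols = rmse(test$y, predict(ols_fit, newdata = test)),
      rf  = rmse(test$y, predict(rf_fit,  newdata = test)))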

Decisions in business and policy are often centered on specific interventions, such as changing the price or other attributes of products, changing the media mix in marketing, changing interest rates or other instruments of monetary policy, or modifying health care financing. Learning the effects of such interventions is an important purpose of data analysis.

The textbook incorporates the basic concepts and methods used in causal analysis, including the framework of potential outcomes and the benefits of randomized assignment. It covers various methods, including field experiments and A/B testing, as well as methods for observational data, such as matching, instrumental variables, difference-in-differences and fixed-effects panel regressions.
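
To give a flavor, a basic difference-in-differences estimate can be obtained from a single regression in R (a minimal sketch with hypothetical variable names, not code from the book):

    # Placeholder data frame `panel` with a treatment-group dummy and a post-intervention dummy
    did_fit <- lm(outcome ~ treated + post + treated:post, data = panel)
    summary(did_fit)   # the coefficient on treated:post is the difference-in-differences estimate,
                       # interpreted as the average effect under the parallel-trends assumption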

Big Data – a term describing very large, often unstructured datasets – presents opportunities to better answer old questions and to ask new ones. It offers great advantages when applying many traditional statistical methods and allows for developing new methods. At the same time, analyzing Big Data presents new challenges, too.

The need to integrate practical data work with statistical analysis and machine learning has increased with the advance of Big Data. The textbook discusses what Big Data implies for data structuring and cleaning, exploratory analysis and inference, as well as for discovering and describing patterns, carrying out predictions and uncovering the effects of interventions. The chapters in the textbook include sections that describe the extra challenges posed by Big Data and how data analysts can meet those challenges with the skill set covered in the textbook.

Each chapter is accompanied by the data used in the illustration studies, with a full but concise description, and by a description of how to implement the data management, cleaning, analysis and visualization in Excel, R and Stata.

We think code support is essential: applied knowledge can be acquired only by working through many applications. The textbook fosters hands-on work through numerous examples, both within the main text and as supplementary material. Illustration studies in the textbook explain how methods work; fully developed case studies included in the ancillary material answer real-life questions using real-life data; additional exercises invite students to replicate the analysis and carry out further projects with appropriate guidance.

There is so much good (sometimes, awesome) data around. We’ll borrow from the best work to build case studies for students.

We also provide the R and Stata code that produces all results shown in the textbook, starting from the raw data. Students can learn coding by first understanding and then tinkering with code that works. We plan to store these “Data and lab” sections online, with some elements possibly turned into videos or interactive exercises similar to those on datacamp.com.

Super, it seems you are interested in our project. Please send us an email, and we’ll send you draft chapters to review.

DATA EXPLORATION

  1. Origins of data
  2. Preparing data for analysis
  3. Discovering and describing data
  4. Comparison and correlation
  5. Generalizing from a dataset
  6. Testing hypotheses

PATTERNS: REGRESSION ANALYSIS

  1. Simple regression analysis
  2. Complicated patterns and messy data
  3. Generalizing results of a regression
  4. Multiple linear regression
  5. Probability models
  6. Time series regressions

PREDICTION

  1. Framework for prediction
  2. Model selection
  3. Regression trees
  4. Random forest
  5. Classification
  6. Forecasting from time series data

CAUSALITY: THE EFFECTS OF INTERVENTIONS

  1. A framework for causal analysis
  2. Experiments
  3. Methods for observational cross-section data
  4. Methods for observational time series
  5. Difference in differences
  6. Methods for observational panel data

We will provide additional case studies that allow for studying the entire process of data analysis from the substantive business or policy question through collecting or accessing data, managing and cleaning data, carrying out the analysis, presenting and interpreting its results, and addressing the original substantive questions. Case studies aim at answering a question rather than simply illustrating a method.

In addition to showing tools in real-life use, case studies will also present decision points (e.g. which model or functional form to pick, what data source to rely on, how to treat missing observations). To get to interesting topics, we are collaborating with academic and industry partners.

Here is an example. A real estate company wants to price apartments in a bunch of new projects. It has ample data on apartment ads from the past. The case study discusses exploratory data analysis, feature engineering and prediction of continuous target variables. The emphasis is on how to code various types of variables.
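
A few typical coding steps might look like this in R (a minimal sketch with hypothetical variable names, not the actual case study code):

    # Placeholder data frame `ads` with apartment listings
    ads$log_price   <- log(ads$price)                    # continuous target, log-transformed
    ads$district    <- factor(ads$district)              # qualitative variable stored as a factor
    ads$has_balcony <- as.integer(ads$balcony == "yes")  # binary feature coded as 0/1
    X <- model.matrix(~ size + district + has_balcony, data = ads)  # dummies for regression / prediction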

Here is another. In any country, exporting firms tend to be larger and more productive. Why? Well, firms may benefit from export experience as they learn new skills from foreign partners. This notion is the basis for a great many export promotion policies. However, it is equally possible that firms that are already larger are the ones that will be successful abroad. The case study compares experimental and observational evidence. The focus is on internal and external validity.