Data Analysis: Exploration, Patterns, Prediction and Causality

Textbook for Business, Economics and Policy


Data Analysis: Exploration, Patterns, Prediction and Causality is a textbook aimed primarily at business, applied economics and public policy students. It may be taught in MBA, MA in Economics (non-PhD track), MSc in Business Economics/Management, MA in Public Policy, PhD in Management and comparable programs. It is also a natural fit for Business Analytics graduate programs.

Data Analysis: Exploration, Patterns, Prediction and Causality by Gábor Békés and Gábor Kézdi is forthcoming with Cambridge University Press (Spring 2019). Cambridge University Press is the not-for-profit publishing business of the University of Cambridge. It is the oldest publishing house in the world, and its aim is to disseminate knowledge in the pursuit of education, learning and research at the highest international levels of excellence.

The textbook material may be fully covered in a year-long course (for example, in the first year of a two-year master's program or a PhD program). It provides material for a series of courses or modules, and chapters may be combined to assemble programs of various lengths.

Our textbook integrates knowledge of methods and tools traditionally scattered across fields such as econometrics, machine learning and practical business statistics.

The book is organized into four major sections of six chapters each. For details, see the list of contents.

State-of-the-art knowledge in data analysis includes traditional regression analysis, causal analysis of the effects of interventions, predictive analytics using regression and machine learning tools, and practical skills for working with real-life data and collecting data.

We cover relatively few methods but help students gain a deep, intuitive understanding of each. The upside is that visualization and interpretation of results can become the focus of the analysis.

Applied knowledge can be acquired only by working through many applications. Students will use real-life data and learn how to manage analytical projects from scratch; we provide data and code as part of an online ancillary platform. The textbook supports Microsoft Excel, R and Stata, with emphasis on the latter two.

The illustration studies and examples presented in the textbook are all connected to real life problems in business and public policy. We are cooperating with major companies, NGOs and policy institutions to jointly develop interesting case studies.

Our textbook is complemented with extensive online material including data, code, additional case studies, practice questions, sample exams and data exercises.

The textbook is written by Gábor Békés and Gábor Kézdi.

Yes, two Gabors. What are the odds?

We have been teaching related courses for quite a few years in many places.

Why is this book different?

Economic Statistics meets Data Science - mixing the best of both worlds

Wide range of classic and new tools - from regression analysis to machine learning

Curated and focused content - selecting the most useful methods, emphasizing precise interpretation

Data and lab - data and code in R and Stata provided, to practice coding and enhance learning

No Mickey Mouse data - real-life examples and challenges from business, economics and policy problems, with emphasis on data wrangling

Practice questions on the essentials - enhancing deeper understanding

We cover integrated knowledge of methods and tools traditionally scattered across fields such as econometrics, data science and practical business statistics. State-of-the-art knowledge in data analysis includes traditional regression analysis, causal analysis of the effects of interventions, predictive analytics using regression and machine learning tools, and practical skills for working with real-life data and collecting data. We focus on (i) pattern discovery, (ii) addressing causality and (iii) building prediction models.

Uncovering patterns in the data can be an important goal in itself, and it is the prerequisite to establishing cause and effect and carrying out predictions.

The textbook starts with simple regression analysis, the method that compares expected y for different values of x to learn the patterns of association between the two variables. It discusses non-parametric regressions and focuses on the linear regression. Building on simple linear regression, it goes on to enrich it with nonlinear functional forms, generalizing from a particular dataset to the broader population it represents, adding more explanatory variables, and so on.
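The two ideas in this paragraph, comparing expected y across values of x and fitting a line by OLS, can be sketched in a few lines. The book's own code is in R and Stata; this is a minimal Python/numpy illustration on simulated data (the variable names and numbers are ours, not from the book):

```python
import numpy as np

# Simulated example (ours, not from the book): y = price, x = size
rng = np.random.default_rng(0)
x = rng.uniform(30, 120, size=200)               # e.g. apartment size
y = 50 + 2.0 * x + rng.normal(0, 20, size=200)   # true slope is 2

# Non-parametric regression idea: expected y within bins of x
edges = np.linspace(30, 120, 7)
bin_means = [y[(x >= lo) & (x < hi)].mean()
             for lo, hi in zip(edges[:-1], edges[1:])]

# Linear regression: OLS slope and intercept minimize squared residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)          # residuals average to ~0
```

Plotting the bin means against the fitted line is the kind of visual comparison of non-parametric and linear regression the book emphasizes.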

The textbook also covers regression analysis for time series data, panel data, binary dependent variables, as well as nonlinear models such as logit and probit. Understanding the intuition behind the methods, their applicability in various situations, and the correct interpretation of their results are the constant focus of the textbook.

Decisions in business and policy are often centered on specific interventions, such as changing monetary policy, modifying health care financing, changing the price or other attributes of products, or changing the media mix in marketing. Learning the effects of such interventions is an important purpose of data analysis.

The textbook incorporates the basic concepts and methods used by program evaluation (the framework of potential outcomes, the benefits of randomized assignment, etc.). It also covers related methods used in business, such as A/B testing.
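The logic of A/B testing with randomized assignment can be illustrated with a simple difference in means. This is a hedged Python/numpy sketch on simulated data (all numbers are ours), not code from the book:

```python
import numpy as np

# Simulated A/B test (our numbers): randomization balances the groups,
# so a simple difference in means estimates the average treatment effect.
rng = np.random.default_rng(1)
n = 5000
treated = rng.integers(0, 2, size=n).astype(bool)          # coin-flip assignment
effect = 0.02                                              # true lift
outcome = 0.10 + effect * treated + rng.normal(0, 0.3, n)  # e.g. spending

diff = outcome[treated].mean() - outcome[~treated].mean()
# Standard error of the difference in means
se = np.sqrt(outcome[treated].var(ddof=1) / treated.sum()
             + outcome[~treated].var(ddof=1) / (~treated).sum())
t_stat = diff / se   # |t| > 2 suggests the lift is unlikely to be noise
```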

Data analysis in business and policy applications is often aimed at prediction. The textbook introduces tools to evaluate predictions, such as loss functions or the Brier score. It emphasizes the importance of out-of-sample prediction, the role of stationarity, the dangers of overfitting and the use of training and testing samples and cross-validation.
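The train/test logic behind these ideas can be sketched briefly. Assuming RMSE as the loss (one of the measures the book covers), this illustrative Python/numpy example on simulated data shows how in-sample fit improves with model flexibility while out-of-sample fit need not:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error, a standard loss for continuous predictions."""
    return np.sqrt(np.mean((y - yhat) ** 2))

# Simulated training and test samples from the same process (our example)
rng = np.random.default_rng(2)
x_tr = rng.uniform(-1, 1, 100); y_tr = np.sin(3 * x_tr) + rng.normal(0, 0.3, 100)
x_te = rng.uniform(-1, 1, 100); y_te = np.sin(3 * x_te) + rng.normal(0, 0.3, 100)

results = {}
for degree in (1, 3, 15):   # increasingly flexible polynomial regressions
    coefs = np.polyfit(x_tr, y_tr, degree)
    results[degree] = (rmse(y_tr, np.polyval(coefs, x_tr)),   # train RMSE
                       rmse(y_te, np.polyval(coefs, x_te)))   # test RMSE
# Train RMSE always falls with flexibility; test RMSE typically stops
# improving and then worsens - the signature of overfitting.
```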

The book presents and compares the most important predictive models that may be useful in various situations such as time series regressions, classification tools and tree-based machine learning methods.

Data collection is often an integral part of data analysis in many situations. Regardless of its source, data in real life needs cleaning and restructuring before it can be analyzed. Even after extensive cleaning the data used in the analysis is typically different from the ideal dataset that would serve the analysis best. Analysts need to use appropriate tools to collect, clean and restructure data, and they need to have a thorough understanding of the differences between ideal data and available data to interpret their results in appropriate ways.

The need to integrate practical data work with statistical analysis and machine learning has increased with the advance of Big Data – large and unstructured datasets are becoming more and more common. The textbook introduces the most important tools of data collection, data management and cleaning, and it discusses the consequences of measurement issues on the results of analysis. These topics are included as separate chapters, and they are emphasized in the case studies as well.

Big Data presents opportunities to better answer old questions and ask new questions. It offers great advantages when applying many traditional statistical methods and allows for developing new methods. At the same time analyzing Big Data presents new challenges, too. We include explicit discussion of these opportunities and challenges in relation to uncovering and generalizing patterns, learning the effects of interventions and carrying out predictions, within each of the sections of the book.

Each chapter is accompanied by the data used in the illustration studies with a full but concise description, and the description of how to implement data management, cleaning, analysis, and visualization in Excel, R and Stata.

We think code support is essential: applied knowledge can be acquired only by working through many applications. The textbook fosters hands-on work through numerous examples, both within the main text and as supplementary material. Illustration studies in the textbook explain how methods work; fully developed case studies included in the ancillary material answer real-life questions using real-life data; additional exercises invite students to replicate analysis and carry out further projects with appropriate guidance.

There is so much good (sometimes, awesome) data around. We’ll borrow from the best work to build case studies for students.

We also provide the R and Stata codes themselves that produce all results shown in the textbook, starting with raw data. Students can learn coding by first understanding and then tinkering with code that works. We plan to store these “Data and lab” sections online, some elements possibly turned into videos or interactive exercises similar to those on

Super, it seems you are interested in our project. Please send us an email, and we’ll send you draft chapters to review.

Data exploration

  1. Origins of data (data table, data quality, survey, scraping, sampling, ethics)
  2. Preparing data for analysis (tidy data, source of variation, variable types, missing data, data cleaning)
  3. Describing variables (probability, distributions, extreme values, summary stats)
  4. Comparison and correlation (conditional probability, conditional distribution, conditional expectation, visual comparisons, correlation, quick intro to linear regression)
  5. Generalizing from a dataset (repeated samples, confidence interval, standard error estimation via bootstrap and formula, external validity)
  6. Testing hypotheses (null and alternative hypotheses, t-test, false positives / false negatives, p-value, testing multiple hypotheses)

Regression analysis

  1. Simple regression analysis (non-parametric regression, linear regression, OLS, predicted values and residuals, regression and causality)
  2. Complicated patterns and messy data (taking log and other transformations of variables, piecewise linear splines and polynomials, measurement error in variables, influential observations, using weights)
  3. Generalizing results of regression analysis (standard error, confidence interval, prediction interval, testing, external validity)
  4. Multiple linear regression (linear regression mechanics, binary and other qualitative right-hand-side variables, interactions, ceteris paribus vs. conditioning in multiple regression)
  5. Probability models (linear probability, logit and probit, marginal differences, goodness of fit)
  6. Time series regressions (trends, seasonality, lags, serial correlation)

Prediction

  1. Framework for prediction (prediction error, loss function, RMSE, prediction with regression, overfitting, external validity, machine learning)
  2. Model selection (adjusted measures of in-sample fit, cross-validation, shrinkage/LASSO)
  3. Regression trees (CART, stopping rules, pruning, search algorithms)
  4. Random forest (boosting, decorrelating trees, regression vs random forest)
  5. Classification (calibration and Brier score, ROC/AUC, classification with logit vs. CART)
  6. Forecasting from time series data (cross-validation in time series, ARIMA, vector autoregression, CART in time series)

Causal analysis

  1. A framework for causal analysis (potential outcomes, average treatment effect, selection and other confounders, reverse causality, methods)
  2. Experiments (field experiments, A/B testing, randomization and balance, attrition, power, internal and external validity)
  3. Methods for observational cross-section data (matching vs. regression, common support, instrumental variables, regression-discontinuity)
  4. Methods for observational time series (policy interventions, anticipation effects, event study)
  5. Difference in differences (parallel trends, panel vs. repeated cross-section, synthetic controls)
  6. Methods for observational panel data (fixed effects, first differences, cluster standard errors, unbalanced panels)
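As a taste of the hands-on style, the bootstrap standard-error estimation listed under "Generalizing from a dataset" can be sketched in a few lines. The book's code is in R and Stata; this is an illustrative Python/numpy version on simulated, skewed data (our own example):

```python
import numpy as np

# Bootstrap standard error of a sample mean: resample the data with
# replacement many times, recompute the mean, take its standard deviation.
rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=500)   # simulated skewed sample

boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])
se_bootstrap = boot_means.std(ddof=1)
se_formula = data.std(ddof=1) / np.sqrt(data.size)  # the formula-based estimate
# For the sample mean the two estimates should be close; the bootstrap's
# advantage is that it works the same way for statistics with no easy formula.
```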

We will provide additional case studies that allow for studying the entire process of data analysis from the substantive business or policy question through collecting or accessing data, managing and cleaning data, carrying out the analysis, presenting and interpreting its results, and addressing the original substantive questions. Case studies aim at answering a question rather than simply illustrating a method.

In addition to showing tools in real-life use, case studies will also present decision points (e.g. which model or functional form to pick, what data source to rely on, how to treat missing observations). To get to interesting topics, we are collaborating with academic and industry partners.

Here is an example. A real estate company wants to price apartments in a bunch of new projects. It has ample data on apartment ads from the past. The case study discusses exploratory data analysis, feature engineering and prediction of continuous target variables. The emphasis is on how to code various types of variables.

Here is another. In any country, exporting firms tend to be larger and more productive. Why? Firms may benefit from export experience as they learn new skills from foreign partners. This notion is the basis for a great many export promotion policies. However, it is equally possible that firms that are already larger are the ones that succeed abroad. The case study compares experimental and observational evidence. The focus is on internal and external validity.