Data Analysis: Exploration, Patterns, Prediction and Causality is a textbook aimed primarily at business, applied economics and public policy students. It may be taught in MBA, MA Economics (non-PhD track), MSc in Business Economics/Management, MA in Public Policy, PhD in Management and comparable programs. It is also a natural fit for Business Analytics graduate programs.
Data Analysis: Exploration, Patterns, Prediction and Causality by Gábor Békés and Gábor Kézdi is now forthcoming with Cambridge University Press (Spring, 2019). Cambridge University Press is the not-for-profit publishing business of the University of Cambridge. It is the oldest publishing house in the world, and its aim is to disseminate knowledge in the pursuit of education, learning and research at the highest international levels of excellence.
The textbook material may be fully covered in a year-long course (for example, in the first year of a two-year Master's or PhD program). It covers material for a series of courses or modules, and chapters may be used to assemble programs of various lengths.
Our textbook integrates methods and tools traditionally scattered across fields such as econometrics, machine learning and practical business statistics.
The book has four major sections, each of which will include six chapters. For details, see the list of contents.
State-of-the-art knowledge in data analysis includes traditional regression analysis, causal analysis of the effects of interventions, predictive analytics using regression and machine learning tools, and practical skills for working with real-life data and collecting data.
We cover relatively few methods but help students gain a deep intuitive understanding. The upside is that visualization and interpretation of results may become the focus of analysis.
Applied knowledge can be acquired only by working through many applications. Students will use real-life data and learn how to manage analytical projects from scratch, as we provide data and code as part of an online ancillary platform. The textbook supports Microsoft Excel, R and Stata, with emphasis on the latter two.
The illustration studies and examples presented in the textbook are all connected to real-life problems in business and public policy. We are cooperating with major companies, NGOs and policy institutions to jointly develop interesting case studies.
Our textbook is complemented with extensive online material including data, code, additional case studies, practice questions, sample exams and data exercises.
Economic Statistics meets Data Science - mixing the best of both worlds
Wide range of classic and new tools - from regression analysis to machine learning
Curated and focused content - selecting the most useful methods, emphasizing precise interpretation
Data and lab - data and code in R and Stata will be provided to practice coding and enhance learning
No Mickey Mouse data - real-life examples and challenges from business, economics and policy problems, emphasis on data wrangling
Practice questions on the essentials - enhancing deeper understanding
We cover integrated knowledge of methods and tools traditionally scattered around various fields such as econometrics, data science and practical business statistics. State-of-the-art knowledge in data analysis includes traditional regression analysis, causal analysis of the effects of interventions, predictive analytics using regression and machine learning tools, and practical skills for working with real-life data and collecting data. We focus on (i) pattern discovery, (ii) addressing causality and (iii) building prediction models.
Uncovering patterns in the data can be an important goal in itself, and it is the prerequisite to establishing cause and effect and carrying out predictions.
The textbook starts with simple regression analysis, the method that compares expected y for different values of x to learn the patterns of association between the two variables. It discusses nonparametric regressions and focuses on linear regression. It builds on simple linear regression and goes on to enrich it with nonlinear functional forms, generalizing from a particular dataset to other data it represents, adding more explanatory variables, etc.
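To make the idea of simple linear regression concrete, here is a minimal sketch in Python (used only for illustration on this page; the book's own code is in R and Stata). The function name `simple_ols` and the numbers are hypothetical, not taken from the textbook. The slope says how much higher y is, on average, for observations with a one-unit higher x.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: co-movement of x and y divided by the variation in x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, made up for this illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, slope = simple_ols(x, y)
```

With real data the points would not lie exactly on a line, and the slope would describe an average pattern of association rather than a cause-and-effect relationship.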
The textbook also covers regression analysis for time series data, panel data, binary dependent variables, as well as nonlinear models such as logit and probit. Understanding the intuition behind the methods, their applicability in various situations, and the correct interpretation of their results are the constant focus of the textbook.
Decisions in business and policy are often centered on specific interventions, such as changing monetary policy, modifying health care financing, changing the price or other attributes of products, or changing the media mix in marketing. Learning the effects of such interventions is an important purpose of data analysis.
The textbook incorporates the basic concepts and methods used in program evaluation (the framework of potential outcomes, the benefits of randomized assignment, etc.). It also covers related methods used in business, such as A/B testing.
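As a rough illustration of why randomized assignment helps, the sketch below (in Python, chosen for this page only; the book's code is in R and Stata) estimates the effect in an A/B test as the difference in mean outcomes between the two groups, together with a standard error. The function name `ab_difference` and the data are hypothetical. It assumes the two groups were assigned at random and observed independently.

```python
from math import sqrt
from statistics import mean, variance

def ab_difference(control, treatment):
    """Difference in mean outcomes (treatment minus control) and its standard error."""
    diff = mean(treatment) - mean(control)
    # Standard error for two independent samples, allowing unequal variances.
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))
    return diff, se

# Hypothetical outcomes, made up for this illustration.
control = [10.0, 12.0, 11.0, 13.0]
treatment = [14.0, 15.0, 13.0, 16.0]
diff, se = ab_difference(control, treatment)
```

Without random assignment the same calculation would still run, but the difference could reflect pre-existing differences between the groups rather than the effect of the intervention.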
Data analysis in business and policy applications is often aimed at prediction. The textbook introduces tools to evaluate predictions, such as loss functions or the Brier score. It emphasizes the importance of out-of-sample prediction, the role of stationarity, the dangers of overfitting and the use of training and testing samples and cross-validation.
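The following Python sketch shows two of these ideas in their simplest form (Python is used only for illustration here; the book's code is in R and Stata). `brier_score` computes the mean squared difference between predicted probabilities and 0/1 outcomes, and `k_fold_indices` splits observations into training and testing parts for cross-validation. Both function names and all numbers are hypothetical.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n observations."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Hypothetical predictions and outcomes, made up for this illustration.
score = brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0])
folds = list(k_fold_indices(6, 3))
```

A lower Brier score means better probability predictions. The folds here are contiguous and unshuffled to keep the sketch short; with cross-sectional data one would typically shuffle the observations first, and time series data call for splits that respect time order.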
The book presents and compares the most important predictive models that may be useful in various situations such as time series regressions, classification tools and tree-based machine learning methods.
Data collection is often an integral part of data analysis. Regardless of its source, data in real life needs cleaning and restructuring before it can be analyzed. Even after extensive cleaning, the data used in the analysis is typically different from the ideal dataset that would serve the analysis best. Analysts need to use appropriate tools to collect, clean and restructure data, and they need to have a thorough understanding of the differences between ideal data and available data to interpret their results in appropriate ways.
The need to integrate practical data work with statistical analysis and machine learning has increased with the advance of Big Data – large and unstructured datasets are becoming more and more common. The textbook introduces the most important tools of data collection, data management and cleaning, and it discusses the consequences of measurement issues on the results of analysis. These topics are included as separate chapters, and they are emphasized in the case studies as well.
Big Data presents opportunities to better answer old questions and ask new questions. It offers great advantages when applying many traditional statistical methods and allows for developing new methods. At the same time analyzing Big Data presents new challenges, too. We include explicit discussion of these opportunities and challenges in relation to uncovering and generalizing patterns, learning the effects of interventions and carrying out predictions, within each of the sections of the book.
Each chapter is accompanied by the data used in the illustration studies with a full but concise description, and the description of how to implement data management, cleaning, analysis, and visualization in Excel, R and Stata.
We think code support is essential: applied knowledge can be acquired only by working through many applications. The textbook fosters hands-on work through numerous examples, both within the main text and as supplementary material. Illustration studies in the textbook explain how methods work; fully developed case studies included in the ancillary material answer real-life questions using real-life data; additional exercises invite students to replicate analysis and carry out further projects with appropriate guidance.
There is so much good (sometimes, awesome) data around. We’ll borrow from the best work to build case studies for students.
We also provide the R and Stata code that produces all results shown in the textbook, starting with raw data. Students can learn coding by first understanding and then tinkering with code that works. We plan to store these “Data and lab” sections online, with some elements possibly turned into videos or interactive exercises similar to those on datacamp.com.
We will provide additional case studies that allow for studying the entire process of data analysis from the substantive business or policy question through collecting or accessing data, managing and cleaning data, carrying out the analysis, presenting and interpreting its results, and addressing the original substantive questions. Case studies aim at answering a question rather than simply illustrating a method.
In addition to showing tools in real-life use, case studies will also present decision points (e.g. which model or functional form to pick, what data source to rely on, how to treat missing observations). To get to interesting topics we are collaborating with academic and industry partners.
Here is an example. A real estate company wants to price apartments in a bunch of new projects. It has ample data on apartment ads from the past. The case study discusses exploratory data analysis, feature engineering and prediction of continuous target variables. The emphasis is on how to code various types of variables.
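One common way to code a categorical variable for such a prediction model is to turn it into 0/1 indicator columns. The Python sketch below is an illustration for this page only (the book's code is in R and Stata); the function name `one_hot`, the variable `district` and its values are hypothetical and not taken from the case study.

```python
def one_hot(values, drop_first=True):
    """Turn a categorical variable into 0/1 indicator columns, keyed by category."""
    categories = sorted(set(values))
    if drop_first:
        # Leave out one reference category so the indicators are not
        # perfectly collinear with the regression intercept.
        categories = categories[1:]
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical apartment districts, made up for this illustration.
district = ["north", "south", "north", "center"]
dummies = one_hot(district)
```

Here "center" is the omitted reference category, so coefficients on the remaining indicators would be read as differences relative to apartments in the center.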
Here is another. In any country, exporting firms tend to be larger and more productive. Why? Well, firms may benefit from export experience as they learn new skills from foreign partners. This notion is the basis for a great many export promotion policies. However, the reverse is equally possible: firms that are already larger are the ones that will be successful abroad. The case study compares experimental and observational evidence. The focus is on internal and external validity.