Should Data Science Eat Higher Education?
The Book of Why by Judea Pearl and Dana Mackenzie
In a Wall Street Journal column from 2011, the venture capitalist Marc Andreessen wrote about how “software is eating the world.” Together with both General Electric CEO Jeffrey Immelt and Microsoft CEO Satya Nadella, Andreessen is also credited with saying that “Every company now has to be a software company.”
For leaders in higher education, the current analog might be, “Data science is eating the world.” The stream of announcements about new schools, programs, and gifts focusing on data science suggests that “Every university now has to be a data science university.”
Every day we see more students, faculty, administrators, and budgets drawn towards Artificial Intelligence (AI), for example, swamping more traditional priorities and processes in higher education. But what should we make of claims that machine learning techniques will enable companies, governments, and even universities to make “better” decisions about everyone and everything, including matters that intimately affect our lives?
As with most fads, there are abundant reasons for universities to proceed with caution. Not least among these is the argument that universities exist not to duplicate what the government or the private sector can do, but rather to address precisely what neither politics nor the market can accomplish for the benefit of society.
It is not enough to point to the obvious shortcomings of Big Data and AI. New technologies always have downsides, though concerns about the Fairness, Accountability, and Transparency (known as “FAT concerns” among the cognoscenti) of many data science applications do seem especially alarming. Rather than more hand-wringing and worry, we need actual ideas. What are the logical limitations of the present course? What are the precise promises of alternative agendas?
For big ideas like these, there may be no better source than The Book of Why, co-authored by UCLA’s Turing Award winner Judea Pearl and gifted science writer Dana Mackenzie. Pearl has spent a lifetime pondering what data can and cannot tell us about the world. Whereas his early works could be dense and technical, this latest one is clear, approachable, and convincing. Do not let a few diagrams and equations deter you. No mathematics beyond basic arithmetic is needed to appreciate The Book of Why.
Pearl and Mackenzie’s topic is “causal inference”—that is, drawing conclusions of the form, “If you do this, that will happen.” Data scientists, not to mention their commercial employers, often make such determinations about topics like these:
If you make loans to these kinds of people, they are more likely to repay them.
If you prescribe this drug to these patients this way, they are more likely to recover.
If you grant parole to these kinds of convicts, they are more likely to stay straight.
If you eat these kinds of foods instead of those, you are more likely to stay healthy.
If you show these kinds of people certain ads, they are more likely to click through.
Yet the evidence behind such important and potentially life-changing claims is typically not very good at all. Today, data science as practiced at most institutions is based only on calculating correlations; the results are just associations. When people want to be careful, they report that X is “linked with” Y. Especially when big data are involved, however, most listeners casually take this to mean that X causes Y, ignoring the stern warnings found in statistics textbooks about how correlation does not imply causation. Better textbooks might even supply a cautionary tale. But even the best statistics books go on to say almost nothing more about causality.
Traditional methods of data analysis simply have no way to ask whether carrying an umbrella will cause rain, or what happens if one member of a firing squad decides not to shoot—to cite two of Pearl and Mackenzie’s examples. The authors even devote a whole chapter to describing how debates over the health effects of smoking lasted so long not merely because of the self-interest of powerful tobacco companies but also because the statistics community lacked the basic tools for posing, let alone answering, questions about whether cigarette smoke actually causes cancer. Why couldn’t some other factor, say heredity, cause both smoking and cancer?
Pearl’s great contribution is to provide systematic ways of specifying the assumptions, data, and calculations needed to justify or refute such causal claims. His main message is that data sets alone never suffice. Capturing the story about how the data were generated is necessary, too. He introduces a formal way of writing down what is essential about those stories by drawing simple diagrams, where all the variables of interest are represented as dots, with arrows drawn from one dot to another if the first variable could directly influence the second, but no arrow if they are independent. By analyzing such a diagram (called a Directed Acyclic Graph or DAG), Pearl shows how to determine whether and, if so, how any data eventually collected about that situation could ever justify causal inferences about the effects that variations of some variables would have on others.
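For readers who think in code, here is a minimal sketch of how such a diagram might be written down, using the smoking question as the example. The variable names, the `parents` dictionary, and the helper functions are all illustrative assumptions rather than anything taken from the book, and the standard-library `graphlib` module requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical diagram: each key is a variable (a dot), and each value is
# the set of variables with an arrow pointing INTO that variable.
parents = {
    "heredity": set(),
    "advertising": set(),
    "smoking": {"heredity", "advertising"},
    "cancer": {"heredity", "smoking"},
}


def is_acyclic(graph):
    """Return True if the arrows contain no directed cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


def ancestors(graph, node, blocked=frozenset()):
    """Variables with a directed path into `node` that avoids `blocked`."""
    found = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current in blocked or current in found:
            continue
        found.add(current)
        stack.extend(graph[current])
    return found


def common_causes(graph, treatment, outcome):
    """Variables upstream of the treatment that also reach the outcome
    by some route that does not pass through the treatment."""
    upstream_of_treatment = ancestors(graph, treatment)
    reaches_outcome = ancestors(graph, outcome, blocked={treatment})
    return upstream_of_treatment & reaches_outcome


if __name__ == "__main__":
    print(is_acyclic(parents))
    print(common_causes(parents, "smoking", "cancer"))
```

In this toy diagram, heredity is flagged as a common cause of smoking and cancer, while advertising is not, because its only route to cancer runs through smoking. Real causal analysis involves more than this simple check, but the sketch shows that the diagram itself, and not any data set, carries the information.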
One of Pearl’s examples illustrates why one cannot reliably derive causal conclusions from data without such diagrams. Picture a 2×2 table of medical study results. The columns correspond to the treatment and control groups, say those who take a drug or not, as indicated by a binary variable we can call X. The rows correspond to another way of dividing the study participants into two subpopulations, as indicated by a binary variable Y. In each of the four cells are numbers showing how many people recovered out of the corresponding subpopulation. Call that recovery rate Z, and now consider two different stories about it.
In one story, the rows correspond to male and female study participants. Imagine that gender affects whether or not you take the drug and also your recovery rate, so that there are arrows from Y to both X and Z in addition to the one from X to Z. Given this story, it is clear from the numbers in Pearl’s table that a doctor would want every patient to have the treatment corresponding to the first column.
In the other story, suppose the rows correspond to subpopulations with high or low blood pressure after the treatment. Imagine that the drug has some toxicity but lowers Y when it works, so that arrows go from X to Y and from Y to Z in addition to the one from X to Z. According to this story, it is clear that a doctor looking at the same recovery data in Pearl’s table would want every patient to have the treatment corresponding to the second column instead.
Here is the point: even with the very same numbers in the data table, the two different stories lead to opposite conclusions about what to do!
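The reversal can be reproduced with a few lines of arithmetic. The sketch below uses illustrative counts chosen to exhibit the effect; they are an assumption for demonstration and are not the numbers from Pearl's table. It computes the recovery rate for each column in two ways: pooled over both rows, and adjusted by weighting each row's rate by that row's share of all participants, i.e. the sum over y of P(Z | X = x, Y = y) P(Y = y). Under the first story (Y influences both X and Z), the adjusted comparison is the appropriate one; under the second story (Y lies on the path from X to Z), the pooled comparison is.

```python
from fractions import Fraction

# Illustrative counts as (recovered, total); NOT taken from the book.
# Keys are (x, y): x is the column (treatment), y is the row (subpopulation).
counts = {
    ("drug", "row_a"): (81, 87),
    ("drug", "row_b"): (192, 263),
    ("no_drug", "row_a"): (234, 270),
    ("no_drug", "row_b"): (55, 80),
}
COLUMNS = ("drug", "no_drug")
ROWS = ("row_a", "row_b")


def row_rate(x, y):
    """Recovery rate inside a single cell of the table."""
    recovered, total = counts[(x, y)]
    return Fraction(recovered, total)


def pooled_rate(x):
    """Recovery rate for column x, ignoring the rows entirely."""
    recovered = sum(counts[(x, y)][0] for y in ROWS)
    total = sum(counts[(x, y)][1] for y in ROWS)
    return Fraction(recovered, total)


def adjusted_rate(x):
    """Weight each row's recovery rate by that row's share of everyone."""
    everyone = sum(total for _, total in counts.values())
    rate = Fraction(0)
    for y in ROWS:
        row_size = sum(counts[(col, y)][1] for col in COLUMNS)
        rate += row_rate(x, y) * Fraction(row_size, everyone)
    return rate


if __name__ == "__main__":
    for x in COLUMNS:
        print(x, "pooled:", round(float(pooled_rate(x)), 3),
              "adjusted:", round(float(adjusted_rate(x)), 3))
```

With these counts the drug column wins within each row and after adjustment, yet loses in the pooled comparison. Nothing in the table says which comparison to trust; only the story does.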
The FDA does not usually ask about stories like this or their causal diagrams, at least not yet. It avoids such potential paradoxes by requiring pharmaceutical companies to discount test data unless there is a tale to tell about randomly assigning who does and does not receive the drug. The diagram for a Randomized Controlled Trial (RCT) includes a variable for the coin toss, say, that determines who gets in the treatment group rather than the control group. Why does a well-designed experiment of this form produce valid causal inferences? If the control group and the treatment group are statistically indistinguishable before receiving the drug, the statistical differences observed after taking the drug can be accurately attributed to the intervention.
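A small simulation can make the role of the coin toss concrete. Everything in the sketch below is an assumption made for illustration: a hidden "severity" variable, a true treatment effect of ten percentage points, and the particular probabilities. When sicker patients choose the drug more often, the naive comparison is misleading; when a coin decides, the comparison lands near the true effect.

```python
import random

TRUE_EFFECT = 0.10  # assumed benefit of the drug, for this simulation only


def estimated_effect(n, randomized, seed):
    """Difference in recovery rates, treated minus untreated."""
    rng = random.Random(seed)
    tallies = {0: [0, 0], 1: [0, 0]}  # treated flag -> [recovered, total]
    for _ in range(n):
        severe = rng.random() < 0.5  # hypothetical hidden confounder
        if randomized:
            treated = rng.random() < 0.5  # the coin toss
        else:
            # Sicker patients seek out the drug far more often.
            treated = rng.random() < (0.8 if severe else 0.2)
        # Severity lowers the chance of recovery; the drug raises it.
        p_recover = 0.7 + TRUE_EFFECT * treated - 0.4 * severe
        recovered = rng.random() < p_recover
        tallies[int(treated)][0] += int(recovered)
        tallies[int(treated)][1] += 1
    treated_rate = tallies[1][0] / tallies[1][1]
    untreated_rate = tallies[0][0] / tallies[0][1]
    return treated_rate - untreated_rate


if __name__ == "__main__":
    print("observational:", round(estimated_effect(100_000, False, seed=0), 3))
    print("randomized:   ", round(estimated_effect(100_000, True, seed=0), 3))
```

Under these assumptions the observational estimate comes out negative, making a helpful drug look harmful, while the randomized estimate sits close to the assumed effect. Randomization works here because the coin toss severs the arrow from severity to treatment.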
Academic researchers like those affiliated with the Abdul Latif Jameel Poverty Action Lab (J-PAL) based at MIT have therefore championed the use of RCTs to test hypotheses, not just about drugs but about public policy interventions of all sorts. In many cases, however, it is impractical or even unethical to consider such a strategy. Even for the sake of science, no one would suggest forcing people to smoke or not based on the flip of a coin.
This is where Pearl’s diagrams really shine. To go beyond RCTs, econometricians, philosophers, and other scientists have each developed a few of their own tools for drawing causal inferences. But the rigorous ones are all special cases that can be derived from the general framework sketched in The Book of Why and described with more of the technical details in Pearl’s other books and articles.
Here is the upshot. The archetypical problem consists of trying to draw inferences about the causal effect of one variable on another in the presence of potentially confounding variables. Based only on the arrows that do and do not connect all those variables in the diagram that captures the “story” in the situation at hand, Pearl and Mackenzie describe recipes for distinguishing among three cases: when such an inference is possible or not (ignoring the confounders); when an inference is possible by controlling for certain confounders; and when controlling for confounders can actually make matters worse. They also show how to derive and interpret the proper formulae for estimating the size of such causal effects.
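The third case, where controlling for a variable makes matters worse, is the least intuitive, so a tiny worked example may help. The setup below is a hypothetical one chosen for simplicity: X and Z are independent fair coins, and a third variable C is switched on whenever either of them is. Because arrows run from both X and Z into C, controlling for C manufactures an association between two variables that have nothing to do with each other.

```python
from fractions import Fraction
from itertools import product

# Hypothetical setup: four equally likely worlds (x, z, c), where X and Z
# are independent fair coins and C = 1 whenever X or Z is 1.
worlds = [(x, z, int(x or z)) for x, z in product((0, 1), repeat=2)]


def prob_z_given(x, c=None):
    """P(Z = 1 | X = x), optionally also conditioning on C = c."""
    matching = [w for w in worlds if w[0] == x and (c is None or w[2] == c)]
    return Fraction(sum(w[1] for w in matching), len(matching))


def association(c=None):
    """How much more likely Z = 1 is when X = 1 than when X = 0."""
    return prob_z_given(1, c) - prob_z_given(0, c)


if __name__ == "__main__":
    print("ignoring C:       ", association())
    print("controlling C = 1:", association(c=1))
```

Ignoring C, the association is exactly zero, as it should be. Among the cases where C equals 1, learning that X is 0 guarantees that Z is 1, so a spurious negative association of one half appears.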
In addition to dealing so comprehensively and decisively with causal inference, Pearl’s approach has also helped crack some of the other notoriously difficult problems in data science—such as what to do about missing data or about the external validity of a given empirical finding when transferred to a different context.
If you are a leader in higher education, sometime soon a faculty member or committee will undoubtedly ask to eat up even more of your budget for data science. When that happens, ask whether the work is concerned with true causal inference or whether it is more likely to uncover suggestive but potentially dangerous correlations. Watch the reactions, and when your supplicants start mumbling, hand them a copy of this clear, compelling, and important book.
Daniel L. Goroff is Vice President and Program Director of the Alfred P. Sloan Foundation. Opinions in this review are not necessarily those of the Alfred P. Sloan Foundation.