Wednesday, April 03, 2019


For present purposes, I’ll treat “data” as a singular noun, as in “The Data.”  Yes, I know it’s a plural. Singular/plural agreement has emerged as a theme of the week…

I’ve been in some frustrating conversations recently about “The Data.”  The term is invoked as a sort of conversation-stopper: if The Data says that something is true, then true it must be.  And any truth claims derived from any other source are to be taken lightly, unless and until they have been confirmed by data.

Um, no.  And I say that as someone who typically gives the Institutional Research department a pretty decent workout.

Data, properly understood, is largely theory-driven.  Put differently, it’s gathered and analyzed in order to answer questions.  Those questions come from other places. They may be economic, moral, political, logistical, or even whimsical.  But the questions define what you look at in the first place. In the absence of questions, reams of data tell you nothing.

For example, we disaggregate our graduation rates by race and ethnicity, sex, and Pell status.  That’s because we believe that those factors shouldn’t matter, but do, and that we have a moral obligation to address disparities along those lines.  We don’t disaggregate by astrological sign. It isn’t any harder to do, mathematically, but we don’t see any reason to try. If Libras do slightly better than Leos, what, exactly, do we do with that?  

That questions come before data should be obvious, but it has implications.  The idea that new interventions or efforts should be “data-driven” implies that substantially all of the important questions have already been asked.  I don’t know why we’d assume that.

Even if you have numbers, you still need to build an explanatory story around them, and that story will necessarily reflect larger theories of the world.  For instance, if graduation rates increase, is that because a college successfully got obstacles out of students’ way, or because it inflated grades to maintain headcount?  If a professor has an uncommonly low pass rate, is that a sign of superior rigor, inferior teaching, or the luck of the draw?

Then, of course, there’s how the data is collected.  Opinion surveys are notoriously unreliable guides to behavior, and their results can be swayed by the phrasing of questions or the way options are presented.  Some data can’t be gathered directly. How many students chose not to enroll because we didn’t have a program in x, or we didn’t offer it on Saturdays, or we didn’t advertise it in a given place?  There’s literally no way to know that with certainty. We know who signed up for what; we don’t know who wanted to, but couldn’t, because of something we could have fixed.

A fixation on data can also lead to paralysis.  Will something that worked somewhere else also work here?  The data may indicate that it’s likely, but can’t prove it; past performance is no guarantee of future returns, and settings can differ.  But those qualifiers can become open-ended permission to shoot down nearly anything new. With data and innovation, there’s a basic chicken-and-egg problem: you won’t have data on the effectiveness of something until you actually try it.  Hard-headedness is the disguise fear wears in the presence of the new. No amount of data will ever get us past the need for the occasional leap of faith. Nor should it.

In the service of a larger worldview, data can be great for quality control.  It can dispel myths and provide counterintuitive insights. It can provide a reality check.  But it only works when the larger worldview works. And that is as much a moral and political question as it can ever be a statistical one.