Tuesday, June 15, 2010

The Year-Later Evaluation

Several alert readers sent me this piece from the Washington Post. It glosses a study conducted at the Air Force Academy that finds that

Professors rated highly by their students tended to yield better results for students in their own classes, but the same students did worse in subsequent classes. The implication: highly rated professors actually taught students less, on average, than less popular profs.
Meanwhile, professors with higher academic rank, teaching experience and educational experience -- what you might call "input measures" for performance -- showed the reverse trend. Their students tended to do worse in that professor's course, but better in subsequent courses. Presumably, they were learning more.


The piece goes on to suggest that student evaluations are not to be trusted, because they reward entertainment, attractiveness, and/or easy grading. Instead, administrators should shuck their slavish dependence on shallow popularity measures and defer to faculty rank.

The argument collapses upon close examination, of course. If student evaluations had the implied effect, then these tougher and more effective teachers would never have had the opportunity to gain more experience, let alone get promoted to higher rank. The stronger performance of the senior group actually suggests that administrators are doing a pretty good job of separating the wheat from the chaff at promotion time (and of taking student evaluations with the requisite grains of salt). But never mind that.

The worthwhile element of the story is the prospect of basing professors’ evaluations on student performance in subsequent courses.

The appeal is obvious. If Prof. Smith’s English 101 students routinely crash and burn in English 102, while Prof. Jones’ English 101 students do great in 102, then I feel pretty confident saying that Prof. Jones is doing a better job than Prof. Smith. If your method of getting great results isn’t my own, but it works, the fact that it works is the important thing.

But it isn’t as easy as that. Start with ‘routinely.’ That’s a slippery word; it could take years to get raw numbers large enough to support any statistically significant comparison between two professors.
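To put a rough number on that, here’s a back-of-envelope sketch (mine, not the study’s or the Post’s) of how many students per professor a standard two-sample comparison of mean follow-on grades would need. The assumed grade gap and spread are made up purely for illustration.

```python
# Back-of-envelope sample-size calculation for comparing two professors'
# students by mean grade in the follow-on course. The effect size (delta)
# and spread (sigma) below are assumptions, not figures from the study.
from scipy.stats import norm

alpha = 0.05   # two-sided significance level
power = 0.80   # desired chance of detecting a real difference
delta = 0.20   # assumed true gap in mean follow-on GPA between two profs
sigma = 0.90   # assumed standard deviation of follow-on GPA

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~0.84

# Standard two-sample formula for detecting a difference in means
n_per_prof = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
print(round(n_per_prof))   # roughly 318 students per professor

# At 25-30 students per section, that's on the order of a dozen sections'
# worth of data per professor before the comparison clears the usual bar.
```

Under those (debatable) assumptions, you’re waiting on hundreds of students per professor before the numbers mean anything. Hence ‘years.’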

Course sequences aren’t always as clean as that, either. 101 to 102 is easy. But what about electives? What about the history professor who can’t help but notice that some students’ papers are consistently better than others? What about the smallish programs in which the same few professors teach the entire sequence, so they could, if so inclined, skew the sample?

What about the professor who actually takes some risks?

In a relatively high-attrition environment, how do you count the dropouts? Do you control for demographics? Given that students in, say, developmental reading will normally do much worse than students in honors philosophy, what are you actually measuring? And if you go with course-based norming, aren’t you essentially pitting faculty in the same department against each other?
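That last point is worth spelling out. Here’s a minimal sketch of what I take ‘course-based norming’ to mean: standardize each follow-on grade within its course, then average by who taught the prerequisite. The names and numbers are invented for illustration.

```python
import pandas as pd

# Toy data: follow-on course grades tagged with who taught the prerequisite.
# Professors, courses, and grades are invented for illustration.
grades = pd.DataFrame({
    "course":     ["ENG102"] * 6,
    "prior_prof": ["Smith", "Smith", "Smith", "Jones", "Jones", "Jones"],
    "grade":      [2.0, 2.7, 3.0, 3.3, 3.7, 4.0],
})

# Standardize each grade within its course. By construction the z-scores in
# every course sum to zero, so one professor's students can only come out
# "above average" if a colleague's come out "below average."
grades["z"] = grades.groupby("course")["grade"].transform(
    lambda g: (g - g.mean()) / g.std()
)

print(grades.groupby("prior_prof")["z"].mean())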

I bring these up not to discount the possibility or the appeal of outcome-based evaluation, but to suggest that it just isn’t anywhere near as easy as all that. (I haven’t even mentioned the implications for tenure if a senior, tenured professor turns out to be doing a horrible job.) Drive-by analyses like this project make for cute headlines, but actually defeat understanding. If you’re actually serious about presenting an alternative, you need to take a much closer look at the implications of your alternative across the board. You’ll also need some serious grains of salt. Go ahead and take some of mine.