Tuesday, June 14, 2016

Aggregating Course Evaluations

Most industries passed this point some time ago, but it’s new to me.

I just saw a demo of a program that allows students to do course evaluations on mobile devices.  The data are automatically aggregated and put into an easy-to-analyze format.  We ran a pilot this spring; the demo showed what could be done with the data on a large scale.

It got me thinking.

Paper-based course evaluations were misnamed; they were mostly instructor evaluations.  At that level, their merits and demerits are well-rehearsed.  They’re integrated into the promotion and tenure process, for better or worse.  Most of us have a pretty good sense of how to read them.  They also come with compilation sheets showing collegewide averages in various categories.  I’ve written before on how to read them, but the short version is: ignore the squiggle in the middle.  Look for red flags.  And never make them entirely dispositive one way or the other; at most, they’re warning lights.  

But when answers to the same few questions from thousands of students can be sliced and diced quickly and easily, new uses suggest themselves.

For example, with an active dataset, it’s no great challenge to isolate, say, one course from the rest.  In a high-enrollment class with lots of sections taught by many different people -- the English Comps and Intro to Psychs of the world -- you could look at scores across the questions for the entire course to see if there are consistent trouble spots.  If the same red flag pops up in nearly every section of the same class, regardless of who teaches it, then there’s probably a course issue. Administratively, that suggests a couple of things.  First, don’t penalize instructors for a course issue.  Second, target professional development or curricular design resources to those areas.  
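For the tinkerers, here is a minimal sketch of that "same red flag in nearly every section" check.  Everything in it is an assumption made for illustration: the record layout, the course and question names, the 1-to-5 scale, and the cutoffs (a section mean below 3.0, in at least 80% of sections) are invented, not taken from the program in the demo.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (course, section, question, score on a 1-5 scale).
RESPONSES = [
    ("ENG101", "A", "pace", 2), ("ENG101", "A", "pace", 2),
    ("ENG101", "B", "pace", 2), ("ENG101", "B", "pace", 3),
    ("ENG101", "C", "pace", 1), ("ENG101", "C", "pace", 3),
    ("ENG101", "A", "feedback", 5), ("ENG101", "A", "feedback", 4),
    ("ENG101", "B", "feedback", 2), ("ENG101", "B", "feedback", 2),
    ("ENG101", "C", "feedback", 4), ("ENG101", "C", "feedback", 4),
    ("PSY101", "A", "pace", 5),
]


def course_level_flags(responses, course, threshold=3.0, share=0.8):
    """Return the questions whose section mean falls below `threshold`
    in at least `share` of the sections of `course`."""
    # question -> section -> list of scores
    scores_by_question = defaultdict(lambda: defaultdict(list))
    for rec_course, section, question, score in responses:
        if rec_course == course:
            scores_by_question[question][section].append(score)

    flagged = []
    for question, sections in scores_by_question.items():
        low = sum(1 for scores in sections.values() if mean(scores) < threshold)
        if low / len(sections) >= share:
            flagged.append(question)
    return sorted(flagged)


print(course_level_flags(RESPONSES, "ENG101"))
```

In this made-up data, "pace" scores low in all three ENG101 sections, so it gets flagged as a possible course issue.  "feedback" scores low only in section B, so it does not, which is the distinction between a course problem and a single-section one.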

I could imagine a department building a question like “among the following topics covered in this class, which one do you wish got more time?”  Getting answers from dozens of sections, taught by many different people, could be useful.  A consensus may exist, but from the perspective of any one person, it may be hard to distinguish between “I didn’t do that part well” and “the course doesn’t do that part well.”  Rack up a large enough sample, though, and the effects of any one person should come out in the wash.  A department could find real value in a consistent answer.

The social scientist in me would love to run other, less action-oriented queries.  For example, if we broke out the ratings by gender of instructor, what would they show?  I wouldn’t recommend basing hiring or scheduling decisions on that -- discrimination is discrimination, and aggregates don’t map cleanly onto individuals anyway -- but it might reveal something interesting about the local culture.  We could break them out by full-time/adjunct status, with the usual caveats about perverse incentives and limited resources.  At some point, I’d love to (somehow) track the correlation between perceived quality of the intro course and student performance in the next course in a sequence: for example, did students who gave higher ratings to their Comp 1 instructors do better in Comp 2?  Anecdotes abound, but we could get an actual reality check.
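If the "somehow" were ever solved, the Comp 1 / Comp 2 reality check could start as a plain Pearson correlation.  The sketch below assumes each student's Comp 1 rating can be matched to that student's Comp 2 grade, which anonymous evaluations may not allow; the paired numbers are invented, and a correlation by itself says nothing about cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical matched pairs: (rating given to the Comp 1 instructor on a
# 1-5 scale, grade points the same student later earned in Comp 2).
PAIRS = [(5, 3.7), (4, 3.3), (4, 3.0), (3, 2.7), (2, 2.3), (5, 3.0)]

ratings = [rating for rating, _ in PAIRS]
grades = [grade for _, grade in PAIRS]
r = pearson(ratings, grades)
print(round(r, 2))
```

A value near +1 would mean higher ratings tend to go with higher Comp 2 grades, near -1 the reverse, and near 0 no linear relationship.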

As with any data, there would have to be procedural and ethical safeguards, as well as some training for the folks looking at it to understand what they’re seeing.  But that doesn’t strike me as a deal-breaker.  If anything, it’s an argument for making the warning lights more accurate.

Wise and worldly readers, if you could slice and dice the data set of student course evaluations, what questions would you ask of it?  What would you want it to reveal?