Tuesday, June 15, 2010


The Year-Later Evaluation

Several alert readers sent me this piece from the Washington Post. It glosses a study conducted at the Air Force Academy that finds that

Professors rated highly by their students tended to yield better results for students in their own classes, but the same students did worse in subsequent classes. The implication: highly rated professors actually taught students less, on average, than less popular profs.
Meanwhile, professors with higher academic rank, teaching experience and educational experience -- what you might call "input measures" for performance -- showed the reverse trend. Their students tended to do worse in that professor's course, but better in subsequent courses. Presumably, they were learning more.

The piece goes on to suggest that student evaluations are not to be trusted, because they reward entertainment, attractiveness, and/or easy grading. Instead, administrators should shuck their slavish dependence on shallow popularity measures and defer to faculty rank.

The argument collapses upon close examination, of course. If student evaluations had the implied effect, then these tougher and more effective teachers would never have had the opportunity to gain more experience, let alone get promoted to higher rank. The stronger performance of the senior group actually suggests that administrators are doing a pretty good job of separating the wheat from the chaff at promotion time (and of taking student evaluations with the requisite grains of salt). But never mind that.

The worthwhile element of the story is the prospect of basing professors’ evaluations on student performance in subsequent courses.

The appeal is obvious. If Prof. Smith’s English 101 students routinely crash and burn in English 102, while Prof. Jones’ English 101 students do great in 102, then I feel pretty confident saying that Prof. Jones is doing a better job than Prof. Smith. If your method of getting great results isn’t the one I would use, but it works, then the fact that it works is what matters.

But it isn’t as easy as that. Start with ‘routinely,’ which is an ambiguous term. It could take years to accumulate raw numbers large enough for any statistically significant measure.
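To put a rough number on that: here's a minimal back-of-the-envelope power calculation. The effect size, significance level, and power below are illustrative assumptions of mine, not figures from the study, but they show why 'routinely' takes so long to establish.

```python
from math import ceil

# Back-of-the-envelope two-sample power calculation: roughly how many
# students per professor before a difference in follow-on course grades
# becomes statistically detectable. All numbers are illustrative
# assumptions, not figures from the study.
z_alpha = 1.96  # critical z for a two-sided test at alpha = 0.05
z_beta = 0.84   # z corresponding to 80% power
d = 0.2         # assumed standardized effect size (a "small" Cohen's d)

# Standard approximation: n per group = 2 * ((z_alpha + z_beta) / d)^2
n_per_group = ceil(2 * ((z_alpha + z_beta) / d) ** 2)
print(n_per_group)  # 392
```

At 25 or 30 students per section, nearly 400 students per professor means a dozen-plus sections each before the comparison means much, which for most faculty really does take years.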

Course sequences aren’t always as clean as that, either. 101 to 102 is easy. But what about electives? What about the history professor who can’t help but notice that some students’ papers are consistently better than others? What about the smallish programs in which the same few professors teach the entire sequence, so they could, if so inclined, skew the sample?

What about the professor who actually takes some risks?

In a relatively high-attrition environment, how do you count the dropout? Do you control for demographics? Given that students in, say, developmental reading will normally do much worse than students in honors philosophy, what are you actually measuring? And if you go with course-based norming, aren’t you essentially pitting faculty in the same department against each other?

I bring these up not to discount the possibility or the appeal of outcome-based evaluation, but to suggest that it just isn’t anywhere near as easy as all that. (I haven’t even mentioned the implications for tenure if a senior, tenured professor turns out to be doing a horrible job.) Drive-by analyses like this project make for cute headlines, but actually defeat understanding. If you’re actually serious about presenting an alternative, you need to take a much closer look at the implications of your alternative across the board. You’ll also need some serious grains of salt. Go ahead and take some of mine.

I got my degree in engineering.

Our professors' ratings were directly related to how well they spoke English. I'm not kidding about this.
I looked at the original article, which is extremely interesting. The data set is from the US Air Force Academy, and has some very nice properties: all students take a fairly large set of common core courses. These courses are taught in many small sections (average 20 students) with a common syllabus and, in the cases they looked at, common final exams that were graded collectively by all faculty teaching the course. Moreover, students are assigned to sections at random. This removes a lot of the selection biases that plague these things, though it also restricts how different (and thus better/worse) individual faculty can make a class. It should also be noted that the introductory course is in fact Calculus I and the follow-up courses are only in the sciences/engineering. (They also looked at physics and chemistry, which had similar results but there were more potential biases there.)

The results are still very odd. Students in sections with highly rated instructors definitely do better on the common final, but not as well in followup courses. WTF? The effect in the introductory course isn't huge --- one standard deviation improvement in student evaluations of a faculty member translates into 5% of a standard deviation improvement in the course grade --- but maybe that's expected given the extreme homogeneity of the sections.

Very interesting. Maybe the highly rated instructors are highly rated because they actually teach that class well, but as a result their students never have to flounder and figure things out on their own, so they are less equipped to deal with a poorer instructor in the future?
Very interesting questions, and as Dean Dad has suggested, the issue is more complex than it appears at first glance. Evaluating instructional effectiveness is a thorny problem not only in higher education but also at the elementary and high school levels.

While I can't speak to the administrative challenges of evaluating instructor effectiveness via formal university mechanisms, I can speak to the challenges of counseling students about the use of more informal feedback in selecting an instructor (e.g., feedback from other students and the many "rate your instructor" websites that exist).

Ultimately, a student's perception of how good or bad an instructor is depends largely on the preconceived goals they bring into the class. Are they looking for an "easy A"? Does the type of work required by the instructor match up with the type of work they like to do in a course (e.g., writing papers)? Are they entering the course with a pre-existing anxiety about the content? Are they looking for an instructor who doesn't keep them for the entire scheduled class time and lets them out early? Are they looking for an instructor who challenges and pushes them?

These and the answers to many, many other questions can significantly color a student's perception of the educational experience that he/she will receive. As an Advisor, the best thing I can do is counsel students to be wise consumers of the information they receive and to understand that anyone's opinion about an instructor and the educational experience they receive is subject to certain biases. The more aware a student can be of the biases behind the information they are consuming as well as their own personal biases, the better chance the student can interpret the information fairly and make a more useful decision based on it.

I truly feel for faculty members who are under the pressure of the many formal and informal evaluation mechanisms by which they are judged as well as the constant challenges that administrators face in attempting to wisely and fairly evaluate faculty performance.
Is it shocking to any of us that students are not very reliable reporters of a professor's teaching ability? Or that fun and engaging professors get higher ratings than those who are more serious and more rigorous? I would certainly hope that anyone in the business of evaluating faculty uses far more sources of information than surveys of students. In my experience, student surveys commonly support other sources (class visits, personal comments from students, peer reviews, analysis of student performance in future courses, etc.), but are never used as sole indicators of performance.

I just concluded a unique (for me) and painful situation with a sub-standard adjunct instructor. Her first course was typified by unprecedented student complaints, high attrition, and a poor teaching performance review by myself and a peer instructor. Nevertheless, I agreed to re-hire her after informing her that her performance had been unacceptable and detailing how she needed to improve. The second course started out the same way, so I went to her and told her that I was not seeing the desired improvement and that unless the student evals were substantially better than the previous ones, we would not be re-hiring her. At that point, she demanded a numerical threshold that she would need to achieve to be re-hired. I have never done such a thing in the past, but acquiesced and gave her a number. Well, the evals came back last week and she did not even approach the minimum numbers I quoted, so I met with her yesterday to give her the bad news. She then harangued me about being overly dependent on the whims of students for evaluating instructors.
I have to echo topometropolis. This sort of study is very interesting both because the investigators benefited from a natural randomization of students and faculty and because so much of the curriculum was sequenced--it was easy to do follow-ups with follow-on courses.

Neither of those, of course, is easy to replicate elsewhere. Students choose classes using a number of criteria (knowledge of the instructor; time of day/day of week...). Faculty are also not assigned randomly to courses (for which I am thankful; if I had to teach statistics, it'd be a train wreck for several semesters). Having a large pool of faculty in a discipline is common only in fairly large institutions or programs. Strong sequencing of courses is uncommon, particularly in small institutions. Common exams for entire courses are also unusual.

I worry some about comments I have seen about this study, suggesting that it might be an artifact of less-experienced instructors "teaching to the test," rather than digging into the fundamentals of the subject. It seems to me that we want to test on the basis of what we think are the important things for people to have learned in the course. So we want to test-to-what-we-teach, which may look a lot (to people looking at things after the fact) like teaching to the test. If the problem here is some variation of that, then I would contend that the final exam was badly designed; it did not test what students needed to know and to be able to do in follow-on courses...

But that's a much larger issue.
Thanks for pointing this out, because it tells me that several of us in my CC are directing our attention at the right thing (post course outcomes), and that we could probably publish what we learn in the process.

I could say more, but I want to emphasize one thing that seems to be missed in the news story and this discussion: This concerns the Air Force Academy. I'll list just a few differences from most universities and colleges. (1) You never have to worry about attendance or behavior. Student officers deal with that. (2) Everyone who wants to fly those cool planes takes Calculus as their freshman math class, even if they plan to major in history. I've encountered this at my CC as well, with an ROTC student.

These are highly motivated and disciplined students, but their goal is to pass these classes so they can fly the JSF, not so they can be engineers. This could put "learn to the test" ahead of "learn for next year" and produce some of the effects being seen in this study.

DD says "If student evaluations had the implied effect, then these tougher and more effective teachers would never have had the opportunity to gain more experience, let alone get promoted to higher rank." This is logically flawed, as it contains the implicit assumption that the faculty at the Air Force Academy were hired and promoted based on the opinions of plebes. They are not, according to someone I know who taught there. The correct argument would be more along the lines that if student evaluations are relied on rather than peer evaluations, the least effective faculty would never improve their teaching and institutional goals would suffer.
"We find that less-experienced and less-qualified professors produce students who perform significantly better in the contemporaneous course being taught, while more-experienced and highly qualified professors produce students who perform better in the follow-on related curriculum."

To me, this is the key finding of the article and should be the *starting* point for the conversation. The authors conclude, though, that it's student evaluations that are the problem.

The student evaluations are a red herring, especially because the student evaluation instruments themselves were never analyzed for quality. Bad instruments = bad data, and well, the few SE items presented in the paper are not so good. Concluding that SE's are flawed because of the above fact (regarding teacher experience) is a leap of logic that's unwarranted.

*I* want a study that really follows up on *why* the *learning result* (sorry about all the **) was what it was. Because that's worth asking, and the authors are right that they have an amazing dataset.

Also, apropos of nothing, can I just say that the authors' attitude is showing? "Potential 'bleeding heart' professors" -- gaah.
I'm with Jason in emphasizing the importance of considering the student population when interpreting student evaluations. I know this from personal experience. I have a colleague who is, objectively, a much more effective instructor than I am, but who receives some harsher student evaluations. This is because he teaches first year Biology students, most of whom aspire to further professional education in the health sciences. When they are evaluated fairly on their academic performance - that is, when their grades don't match what they've assumed they deserve, and which they need for their aspirations - they take it out on the professor. Meanwhile, I teach first year Nursing students, who just need to get through my class to continue in their program. I get great student evaluations.

Fortunately, our department chair and academic dean know the score, but I fear the situation is rather common, including at institutions where the leadership isn't paying adequate attention.
I would tend to agree with Azulao. That particular statement actually implies something to me that doesn't seem to be addressed: could it be that more experienced professors may have had more breadth in their teaching experience? Could a more experienced professor find ways to bring in complementary topics from other classes, and less experienced professors have focused on a more narrow range of topics in greater depth?

Allow me to throw out an unsubstantiated hypothesis. For many courses, getting through the current course with the best grade possible is most students' goal. (This is especially true in a gen ed course where the content may or may not be perceived as useful later on.) If a professor is more likely to focus on the class and not bring in information from beyond that, a lot of students seem happier. If, however, they choose to bring in other topics, especially from areas which are more advanced, only a small subset of students will enjoy it.

Of course, the point then should perhaps be to put inexperienced teachers into areas where the classroom demographics are diverse in terms of majors and interest. A more experienced teacher should perhaps be placed in a class where the material is quite obviously critical for success in later classes. (I could also see reasons for arguing the opposite placement.)

I'm not sure that is what the data implies, but it seems like a more useful way to deal with it than simply trying to invalidate student evals.
I'll not only give a "yes" answer to Cherish's rhetorical questions, I'll validate the hypothesis. I have developed lots of evidence that a key contributor to future success is exactly the teaching techniques that were mentioned.

I also think they are far more important in "majors" classes (like calculus and physics and first-year composition) than in a terminal gen-ed class, simply because fluency in a subset of skills is essential to success in the next class (in a science sequence or where you have to write essay exams). You can see the difference between students who passed on a cram-and-forget approach and those who were forced to actually learn a key subset of the core material.
DD, your straw men suggest you don't want to see what can be learned about your faculty during the 7 years that lead up to tenure.

Why would you compare a developmental instructor's students to those coming out of an honors class? Wouldn't you, possibly perhaps maybe, look at their retention in comp 1 or college algebra? At our college, that step guarantees a new set of instructors.

And why not find out what kind of statistics you can obtain in the dozen semesters that lead up to your tenure decision, and see if they can be used to improve teaching along the way? It can't hurt to try, particularly because that is what is done with the research component of tenure decisions. Not to mention the relevance of such studies to ASSESSMENT when reaffirmation time rolls around.

And don't worry about random assignment. It doesn't really matter -- as far as your retention concerns go -- whether they fail the next class because of the expectations of the prof or because they sought out a prof with those expectations.
"The stronger performance of the senior group actually suggests that administrators are doing a pretty good job of separating the wheat from the chaff at promotion time."

Alternate explanation: Believe the faculty when they say "I'm not going to [assign honest grades / teach at the college level / demand that the students write coherently] until I have tenure."
Dr. Sparky:

Oh, snap. Gotta say, this is the best explanation I've read of the data so far.
Even if it's not tenure, long term residence implies job security...
Clearly the Academy is "different" than most if not all other 4 yr degree granting institutions.

1. Highly motivated students. (Forget "forced discipline." These really are highly motivated type-A personalities.)

2. Many of the faculty are active duty, assigned there to teach for 4 years, and then "return to force." They may (or more likely, may not) have terminal degrees.

3. Tenure isn't an issue for the military faculty (until they are eligible for "Permanent Professor"). They are military officers.

And the list goes on...

To echo what others have said: the interesting finding is that the group that performs worse LATER actually performs better on the COMMON EXAM. I would concur with those who suggest the "more favorably assessed" junior faculty actually are better instructors, and do not prepare their students for the poor instruction yet to come from the older, most likely civilian, jaded faculty.

Just an observation.
I would concur with those that suggest the "more favorably assessed" junior faculty actually are better instructors, and do not prepare their students for the poor instruction yet to come from the older, most likely civilian, jaded faculty.

Alternate explanation: the junior faculty teach to the exam, and concentrate mostly on that. The senior faculty teach that, plus what will be useful later (i.e., a good foundation) but isn't tested directly right now.

Or, if you want a more cynical explanation, the common exam is set to reflect what everyone has done, so those who cover less material are 'rewarded' by having an exam that covers only what they have repeatedly drilled the students on. I've taught at schools where the common exam covered less than half the material in the curriculum, because someone had gone at the speed of the slowest student in the class…
Heh, it really is a brutal indictment of NCLB, if you look at it, isn't it?

Too bad the authors of the study have their own biases. Fortunately, if you publish your dataset, people who just want good tools, rather than specific axes ground, can use it.