Tuesday, June 15, 2010
The Year-Later Evaluation
Professors rated highly by their students tended to yield better results for students in their own classes, but the same students did worse in subsequent classes. The implication: highly rated professors actually taught students less, on average, than less popular profs.
Meanwhile, professors with higher academic rank, teaching experience and educational experience -- what you might call "input measures" for performance -- showed the reverse trend. Their students tended to do worse in that professor's course, but better in subsequent courses. Presumably, they were learning more.
The piece goes on to suggest that student evaluations are not to be trusted, because they reward entertainment, attractiveness, and/or easy grading. Instead, administrators should shuck their slavish dependence on shallow popularity measures and defer to faculty rank.
The argument collapses upon close examination, of course. If student evaluations had the implied effect, then these tougher and more effective teachers would never have had the opportunity to gain more experience, let alone get promoted to higher rank. The stronger performance of the senior group actually suggests that administrators are doing a pretty good job of separating the wheat from the chaff at promotion time (and of taking student evaluations with the requisite grains of salt). But never mind that.
The worthwhile element of the story is the prospect of basing professors’ evaluations on student performance in subsequent courses.
The appeal is obvious. If Prof. Smith’s English 101 students routinely crash and burn in English 102, while Prof. Jones’ English 101 students do great in 102, then I feel pretty confident saying that Prof. Jones is doing a better job than Prof. Smith. If your method of getting great results isn’t my own, but it works, the fact that it works is the important thing.
But it isn’t as easy as that. Start with ‘routinely.’ It’s an ambiguous term, and it could take years to accumulate raw numbers large enough for any statistically significant measure.
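To put rough numbers on that concern (my own back-of-the-envelope illustration, not anything from the study): a standard two-proportion sample-size calculation shows how many students you'd need before a plausible difference in follow-on pass rates stops being noise. The pass rates, significance level, and power below are assumptions I picked for the sketch.

```python
# Back-of-the-envelope: how many students per instructor before a
# difference in English 102 pass rates is statistically detectable?
# Uses the normal-approximation sample size for comparing two proportions.
from math import ceil

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Students needed per instructor to detect p1 vs. p2
    (two-sided alpha = 0.05, power = 0.80)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical: Smith's students pass 102 at 75%, Jones's at 80%.
n = n_per_group(0.75, 0.80)
print(n)  # 1090 students per instructor

sections = ceil(n / 30)  # at roughly 30 students per section
print(sections)  # 37 sections' worth -- many years of teaching
```

Even a five-point gap in pass rates, which most chairs would consider a big deal, takes on the order of a thousand students per instructor to establish. 'Routinely' is doing a lot of work.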
Course sequences aren’t always as clean as that, either. 101 to 102 is easy. But what about electives? What about the history professor who can’t help but notice that some students’ papers are consistently better than others? What about the smallish programs in which the same few professors teach the entire sequence, so they could, if so inclined, skew the sample?
What about the professor who actually takes some risks?
In a relatively high-attrition environment, how do you count the dropout? Do you control for demographics? Given that students in, say, developmental reading will normally do much worse than students in honors philosophy, what are you actually measuring? And if you go with course-based norming, aren’t you essentially pitting faculty in the same department against each other?
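On that last point, a toy example (entirely made-up numbers) shows why course-based norming is zero-sum: standardizing outcomes within a course forces the instructor effects to cancel out, so one professor can only look "above average" if a colleague looks below it.

```python
# Why course-based norming pits departmental colleagues against each other:
# standardizing section outcomes within one course makes the scores sum to
# zero by construction. The section means below are hypothetical.
from statistics import mean, pstdev

section_means = {"Smith": 2.9, "Jones": 3.1, "Lee": 3.0}  # made-up GPAs

mu = mean(section_means.values())
sigma = pstdev(section_means.values())

normed = {name: (g - mu) / sigma for name, g in section_means.items()}
print(normed)

# The normed scores necessarily cancel: within-course norming is a
# ranking of colleagues, not an absolute measure of teaching quality.
print(round(abs(sum(normed.values())), 10))  # 0.0
```

Under a scheme like this, a department where everyone teaches well still produces "below-average" instructors every single semester.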
I bring these up not to discount the possibility or the appeal of outcome-based evaluation, but to suggest that it just isn’t anywhere near as easy as all that. (I haven’t even mentioned the implications for tenure if a senior, tenured professor turns out to be doing a horrible job.) Drive-by analyses like this project make for cute headlines, but actually defeat understanding. If you’re actually serious about presenting an alternative, you need to take a much closer look at its implications across the board. You’ll also need some serious grains of salt. Go ahead and take some of mine.
Our professors' ratings were directly related to how well they spoke English. I'm not kidding about this.
The results are still very odd. Students in sections with highly rated instructors definitely do better on the common final, but not as well in followup courses. WTF? The effect in the introductory course isn't huge --- one standard deviation improvement in student evaluations of a faculty member translates into 5% of a standard deviation improvement in the course grade --- but maybe that's expected given the extreme homogeneity of the sections.
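For a sense of scale (my own arithmetic, not the paper's), that standardized coefficient can be converted back into grade points. The grade standard deviation below is an assumed value for illustration.

```python
# What "5% of a standard deviation" means in grade points.
beta_standardized = 0.05   # effect size described in the comment above
grade_sd = 0.5             # ASSUMED SD of course grade, in GPA points

# Predicted grade change for a +1 SD bump in instructor evaluations:
delta_grade = beta_standardized * grade_sd
print(delta_grade)  # 0.025 GPA points -- a fortieth of a letter grade
```

So even the "definitely do better" result, while real, is tiny in practical terms, which makes the sign flip in follow-up courses all the stranger.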
Very interesting. Maybe the highly rated instructors are highly rated because they actually teach that class well, but as a result their students never have to flounder and figure things out on their own, so they are less equipped to deal with a poorer instructor in the future?
While I can't speak to the administrative challenges of evaluating instructor effectiveness via formal university mechanisms, I can speak to the challenges of counseling students about the use of more informal feedback in selecting an instructor (e.g., feedback from other students and the many "rate your instructor" websites that exist).
Ultimately a student's perception of how good or bad an instructor is is largely based on their preconceived goals when they enter the class. Are they looking for an "easy A"? Does the type of work required by the instructor match up with the type of work they like to do in a course (e.g., writing papers)? Are they entering the course with an already existing anxiety about the content? Are they looking for an instructor who doesn't keep them for the entire scheduled class time and lets them out early? Are they looking for an instructor who challenges and pushes them?
These and the answers to many, many other questions can significantly color a student's perception of the educational experience that he/she will receive. As an Advisor, the best thing I can do is counsel students to be wise consumers of the information they receive and to understand that anyone's opinion about an instructor and the educational experience they receive is subject to certain biases. The more aware a student can be of the biases behind the information they are consuming as well as their own personal biases, the better chance the student can interpret the information fairly and make a more useful decision based on it.
I truly feel for faculty members who are under the pressure of the many formal and informal evaluation mechanisms by which they are judged as well as the constant challenges that administrators face in attempting to wisely and fairly evaluate faculty performance.
I just concluded a unique (for me) and painful situation with a sub-standard adjunct instructor. Her first course was typified by unprecedented student complaints, high attrition, and a poor teaching performance review by myself and a peer instructor. Nevertheless, I agreed to re-hire her after informing her that her performance had been unacceptable and detailing how she needed to improve. The second course started out the same way, so I went to her and told her that I was not seeing the desired improvement and that unless the student evals were substantially better than the previous ones, we would not be re-hiring her. At that point, she demanded a numerical threshold that she would need to achieve to be re-hired. I have never done such a thing in the past, but I acquiesced and gave her a number. Well, the evals came back last week and she did not even approach the minimum numbers I quoted, so I met with her yesterday to give her the bad news. She then harangued me about being overly dependent on the whims of students for evaluating instructors.
Neither of those, of course, is easy to replicate elsewhere. Students choose classes using a number of criteria (knowledge of the instructor; time of day/day of week...). Faculty are also not assigned randomly to courses (for which I am thankful; if I had to teach statistics, it'd be a train wreck for several semesters). Having a large pool of faculty in a discipline is common only in fairly large institutions or programs. Strong sequencing of courses is uncommon, particularly in small institutions. Common exams for entire courses are also unusual.
I worry some about comments I have seen about this study, suggesting that it might be an artifact of less-experienced instructors "teaching to the test," rather than digging into the fundamentals of the subject. It seems to me that we want to test on the basis of what we think are the important things for people to have learned in the course. So we want to test-to-what-we-teach, which may look a lot (to people looking at things after the fact) like teaching to the test. If the problem here is some variation of that, then I would contend that the final exam was badly designed; it did not test what students needed to know and to be able to do in follow-on courses...
But that's a much larger issue.
I could say more, but I want to emphasize one thing that seems to be missed in the news story and this discussion: This concerns the Air Force Academy. I'll list just a few differences from most universities and colleges. (1) You never have to worry about attendance or behavior. Student officers deal with that. (2) Everyone who wants to fly those cool planes takes Calculus as their freshman math class, even if they plan to major in history. I've encountered this at my CC as well, with an ROTC student.
These are highly motivated and disciplined students, but their goal is to pass these classes so they can fly the JSF, not so they can be engineers. This could put "learn to the test" ahead of "learn for next year" and produce some of the effects being seen in this study.
DD says "If student evaluations had the implied effect, then these tougher and more effective teachers would never have had the opportunity to gain more experience, let alone get promoted to higher rank." This is logically flawed, as it contains the implicit assumption that the faculty at the Air Force Academy were hired and promoted based on the opinions of plebes. They are not, according to someone I know who taught there. The correct argument would be more along the lines that if student evaluations are relied on rather than peer evaluations, the least effective faculty would never improve their teaching and institutional goals would suffer.
To me, this is the key finding of the article and should be the *starting* point for the conversation. The authors conclude, though, that it's student evaluations that are the problem.
The student evaluations are a red herring, especially because the student evaluation instruments themselves were never analyzed for quality. Bad instruments = bad data, and well, the few SE items presented in the paper are not so good. Concluding that SE's are flawed because of the above fact (regarding teacher experience) is a leap of logic that's unwarranted.
*I* want a study that really follows up on *why* the *learning result* (sorry about all the **) was what it was. Because that's worth asking, and the authors are right that they have an amazing dataset.
Also, apropos of nothing, can I just say that the authors' attitude is showing? "Potential 'bleeding heart' professors" -- gaah.
Fortunately, our department chair and academic dean know the score, but I fear the situation is rather common, including at institutions where the leadership isn't paying adequate attention.
Allow me to throw out an unsubstantiated hypothesis. For many courses, getting through the current course with the best grade possible is usually most students' goal. (This is especially true in a gen ed course where the content may or may not be perceived as useful later on.) If a professor is more likely to focus on the class and not bring in information from beyond that, a lot of students seem happier. If, however, they choose to bring in other topics, especially from areas which are more advanced, only a small subset of students will enjoy it.
Of course, the point should then perhaps be on putting inexperienced teachers into areas where the classroom demographics are diverse in terms of majors and interest. A more experienced teacher should perhaps be placed in a class where the material is quite obviously critical for success in later classes. (I could also see reasons for arguing the opposite placement..)
I'm not sure that is what the data implies, but it seems like a more useful way to deal with it than simply trying to invalidate student evals.
I also think they are far more important in "majors" classes (like calculus and physics and first-year composition) than in a terminal gen-ed class, simply because fluency in a subset of skills is essential to success in the next class (in a science sequence or where you have to write essay exams). You can see the difference between students who passed on a cram-and-forget approach and those who were forced to actually learn a key subset of the core material.
Why would you compare a developmental instructor's students to those coming out of an honors class? Wouldn't you, possibly perhaps maybe, look at their retention in comp 1 or college algebra? At our college, that step guarantees a new set of instructors.
And why not find out what kind of statistics you can obtain in the dozen semesters that lead up to your tenure decision, and see if they can be used to improve teaching along the way? It can't hurt to try, particularly because that is what is done with the research component of tenure decisions. Not to mention the relevance of such studies to ASSESSMENT when reaffirmation time rolls around.
And don't worry about random assignment. It doesn't really matter -- as far as your retention concerns go -- whether they fail the next class because of the expectations of the prof or because they sought out a prof with those expectations.
Alternate explanation: Believe the faculty when they say "I'm not going to [assign honest grades / teach at the college level / demand that the students write coherently] until I have tenure."
1. Highly motivated students. (Forget "forced discipline." These really are highly motivated type-A personalities.)
2. Many of the faculty are active duty, assigned there to teach for 4 years, and then "return to force." They may (or more likely, may not) have terminal degrees.
3. Tenure isn't an issue for the military faculty (until they are eligible for "Permanent Professor"). They are military officers.
And the list goes on...
To echo what others have said: the interesting finding is that the group that performs worse LATER actually performs better on the COMMON EXAM. I would concur with those who suggest the "more favorably assessed" junior faculty actually are better instructors, and do not prepare their students for the poor instruction yet to come from the older, most likely civilian, jaded faculty.
Just an observation.
Alternate explanation: the junior faculty teach to the exam, and concentrate mostly on that. The senior faculty teach that, plus what will be useful later (i.e., a good foundation) but isn't tested directly right now.
Or, if you want a more cynical explanation, the common exam is set to reflect what everyone has done, so those who cover less material are 'rewarded' by having an exam that covers only what they have repeatedly drilled the students on. I've taught at schools where the common exam covered less than half the material in the curriculum, because someone had gone at the speed of the slowest student in the class…
Too bad the authors of the study have their own biases. Fortunately, if you publish your dataset, people who just want good tools, rather than specific axes ground, can use it.