(1) Students reward receiving high grades with high evaluations.
(2) Professors are induced by this reward to inflate grades by "teaching to the test."
(3) The grade inflation resulting from "teaching to the test" harms long-term student achievement.
The authors base their conclusions on an interesting data set from the U.S. Air Force Academy. In the data set, students are randomly assigned to professors (eliminating bias from good students picking good professors), exams are standardized (allowing for direct comparisons across courses), and students are randomly assigned to follow-up courses (allowing for an assessment of long-term learning).
For establishing whether students' scores improved because of their assignment to professors, this data set is a dream. There is no selection into courses, there is direct comparability across courses, and the academy randomly assigned students to professors in both the intro courses and the follow-up courses. In other words, the data set is an experiment. As any introductory statistics course should have taught you (if you had a good professor), experiments allow researchers to assign causation.
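The logic of random assignment can be made concrete with a toy simulation (my own illustration, not the paper's data or model; the numbers are hypothetical). Because students are randomized across sections, unobserved ability is balanced, and a simple difference in mean scores recovers the professor's true effect:

```python
import random

random.seed(0)

# Hypothetical boost from being assigned the better professor.
TRUE_EFFECT = 5.0

def simulate(n=10000):
    """Randomly assign n students to professor A or B; return the
    difference in mean exam scores between the two sections."""
    scores_a, scores_b = [], []
    for _ in range(n):
        ability = random.gauss(70, 10)    # unobserved student ability
        prof = random.choice(["A", "B"])  # random assignment, no selection
        score = ability + (TRUE_EFFECT if prof == "A" else 0.0)
        (scores_a if prof == "A" else scores_b).append(score)
    return sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b)

print(simulate())  # close to TRUE_EFFECT: randomization balances ability
```

With selection (say, strong students choosing professor A), the same difference in means would mix ability with teaching quality; randomization is what lets the comparison carry a causal interpretation.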
In his post on this topic, Jeff Ely points out that assigning causation is tricky, even when the treatment itself is randomly assigned:
I am not jumping on the bandwagon. I have read through the paper and while I certainly may have overlooked something (and please correct me if I have) I don’t see any way the authors have ruled out the following equally plausible explanation for the statistical findings. First, students are targeting a GPA. If I am an outstanding teacher and they do unusually well in my class they don’t need to spend as much effort in their next class as those who had lousy teachers, did poorly this time around, and have some catching up to do next time. Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.

I think Ely is right. Even though random assignment lets us conclude that being assigned to a professor with low evaluations caused scores to increase in follow-up courses, Ely's explanation points out that we cannot sort out which features of the environment produced that effect. His mechanism seems perfectly plausible to me.
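Ely's story can also be put into a toy model (again my own construction, with made-up parameters, not his or the paper's). Suppose students study only enough to reach a target score. Then a strong intro professor's students coast in the follow-up course while a weak professor's students have catching up to do, and the strong professor's students score lower next term, reproducing the paper's pattern without any "teaching to the test":

```python
import random

random.seed(1)

TARGET = 80.0      # hypothetical score students are satisfied with
MAX_EFFORT = 15.0  # hypothetical cap on extra studying

def follow_up_mean(prof_quality, n=5000):
    """Mean follow-up score for students whose intro professor
    added prof_quality points to their intro-course score."""
    total = 0.0
    for _ in range(n):
        ability = random.gauss(70, 10)           # unobserved ability
        intro = ability + prof_quality           # intro-course score
        gap = TARGET - intro                     # distance below target
        effort = min(max(gap, 0.0), MAX_EFFORT)  # study only to catch up
        total += ability + effort                # follow-up score
    return total / n

strong = follow_up_mean(prof_quality=10.0)  # outstanding intro teacher
weak = follow_up_mean(prof_quality=0.0)     # lousy intro teacher
print(strong < weak)  # better teacher's students do worse next term
```

The point is not that this model is right, only that it fits the same statistical finding, which is why the randomized design alone cannot adjudicate between the two explanations.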
Moreover, there is potentially a more troublesome issue with the conclusions of the study. Let's suppose that the authors were right to conclude that "teaching to the test" harms long-term student performance in their sample. Their data apply only to courses taught at the U.S. Air Force Academy, yet they conclude generally. There are a couple of reasons why this is troublesome:
First, there is significant variation across institutions of higher learning with respect to attributes of students (quality, attitude, etc.), the role that student evaluations play, and the form of the evaluations. I have experience with two universities (Montana State University and University of Chicago), and the students I have encountered at the two universities are considerably different. They would likely give high marks on evaluations for very different reasons.
For example, in my first encounter with a UChicago undergraduate, I told him that the course "is going to be a challenge." He responded, "Great! I would expect nothing less." That was just my first encounter, but based on my experience as his teaching assistant, he and most of his classmates probably would have given bad evaluations if we hadn't challenged them. This obviously isn't true at all universities, but some universities have developed a special culture where students long to be challenged.
Given this personal observation, I expect student evaluations to have a different relationship with long-term performance depending on the institution and student population that generates the data. The U.S. Air Force Academy isn't a representative institution of higher learning, so making general conclusions on the basis of a sample generated by the Air Force isn't valid.
Second, the experiment was conducted in a very peculiar teaching-learning environment. The introductory course was a common course with multiple sections taught by different professors. The professors taught from a common syllabus and gave identical exams during a common testing period. This sounds great for the prospect of learning something deep, but for two reasons, this environment is problematic for drawing general inferences about the relationship between student evaluations and long-term learning:
(1) The standardization of the course encourages "teaching to the test," even among instructors who would not otherwise teach to the test. In this setting, instructors know that their students' performance is directly comparable across sections. This may induce additional instructors to give hints to students merely so they do well on the common exam. Depending on which instructors would switch to giving hints, you would expect to find different results in data generated from standardized classes than in data from classes that were not standardized.
(2) More importantly, the standardization of the course strips away important aspects of teaching. Good professors write comprehensive exams that require intense and comprehensive studying. They also set an ambitious agenda (represented in the syllabus) that holds students to a high standard and encourages deep learning. These are unobservable aspects of teaching in the study's data set because the course has a common syllabus and a common set of exams. Therefore, the only aspect of teaching that the study can purport to describe is lecturing ability.
Because setting the course agenda is the most important instrument instructors have to encourage deep learning, this is a poor data set for uncovering how and why student evaluations correlate with deep learning.
In spite of these flaws, it is an interesting data set and a fascinating study. In fact, I completely agree (on intuitive grounds) with what the authors conclude about how universities should view student evaluations:
Since many U.S. colleges and universities use student evaluations as a measurement of teaching quality for academic promotion and tenure decisions, this latter finding draws into question the value and accuracy of this practice.

I agree that it sometimes seems that students reward low standards, but I am not sure the consequences are so dire for long-term performance. As I always tell students before they fill out their evaluations, it is the comments that matter most.