A new study argues that student evaluations are systematically biased against women — so much so, in fact, that they're better mirrors of gender bias than of what they are supposed to be measuring: teaching quality.
Anne Boring, an economist and the lead author of the paper, was hired by her university in Paris, Sciences Po, to conduct quantitative analysis of gender bias. Through her conversations with instructors and students, she became suspicious of what she calls "double standards" applying to male and female instructors.
Philip Stark, associate dean of the Division of Mathematical and Physical Sciences at the University of California, Berkeley, is a co-author of the paper along with Kellie Ottoboni. Stark has a longstanding research interest in — some might say a vendetta against — course evaluations. We've reported on his work previously.
In this paper, the team ran a series of statistical tests on two data sets, one from French university students and one from U.S. university students.
The French students were, in effect, randomly assigned to either male or female section leaders in a wide range of required courses. In this case, the study authors found, male French students rated male instructors more highly across the board.
Is it bias? Or were the male instructors, maybe, actually, on average, better teachers? (It's science; we have to ask the uncomfortable questions.)
Well, turns out that, at this university, all students across all sections of a course take the same, anonymously graded final exam, regardless of which instructor they have.
This offers the chance to look at one dimension of actual instructor quality: Presumably better section leaders would help students get better grades on the same exam. In fact, the authors found, the students of male instructors on average did slightly worse on the final.
Overall, there was no correlation between students rating their instructors more highly and those students actually learning more.
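For readers who want to see what "no correlation" means in practice, here is a minimal Python sketch of the kind of check involved. The numbers are invented purely for illustration and are not the study's data; the function name `pearson_r` and the variable names are hypothetical, and this is not necessarily the test the authors ran.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Invented, illustrative numbers only (NOT from the paper):
# one entry per course section.
section_mean_rating = [3.1, 3.4, 3.6, 3.9, 4.2, 4.4]    # average evaluation score
section_mean_exam = [12.8, 13.5, 12.1, 13.0, 12.6, 13.2]  # average final exam grade

r = pearson_r(section_mean_rating, section_mean_exam)
print(f"correlation between ratings and exam results: {r:.2f}")
```

A value near 0 would mean that sections rating their instructor highly did not tend to score higher on the shared exam; a value near 1 would mean they did.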
The American case was a little bit different. Here, the authors performed a new analysis of a clever experiment published in 2014. Students were taking a single online class with either a male or female instructor. In half the cases, the instructors agreed to dress in virtual drag: The men used the women's names and vice versa.
Here, it was the female students, not the males, who rated the instructors they believed to be male more highly across the board. That's right: The same instructor, with all the same comments, all the same interactions with the class, received higher ratings if he was called Paul than if she was called Paula.
And that higher rating even applied to a seemingly objective question: Did this teacher return assignments on time? (The online system made it possible to ensure that promptness was identical in every case.)
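One common way to ask whether a gap like this could be a fluke is a permutation test: shuffle the "male name" and "female name" labels many times and see how often chance alone produces a gap as large as the one observed. The sketch below assumes that approach for illustration; it is not presented as the authors' exact procedure. The ratings are invented, and the function name `permutation_test` and its parameters are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_shuffles=10000, seed=0):
    """Two-sided permutation test for a difference in mean ratings.

    Returns (observed difference, estimated p-value).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # randomly reassign the name labels
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Small tolerance guards against floating-point rounding.
        if abs(diff) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_shuffles + 1)
    return observed, p_value


# Invented, illustrative ratings only (NOT from the experiment):
rated_as_male_name = [4, 5, 4, 4, 5, 3, 4, 5]
rated_as_female_name = [3, 4, 3, 4, 3, 3, 4, 4]

gap, p = permutation_test(rated_as_male_name, rated_as_female_name)
print(f"observed gap: {gap:.2f}, estimated p-value: {p:.3f}")
```

A small p-value would suggest the gap is unlikely to come from random label assignment alone. Because the instructor's behavior was the same under either name, the name is the only thing left to explain it.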
What to make of the fact that the bias was wielded primarily by men in France and by women in the U.S.?
"That the situation is Really Complicated," Philip Stark writes in an email to NPR Ed, and, he adds, it won't be easy to correct for it. In fact, the authors titled their paper "Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness."
These results seem pretty damning, but not everyone is convinced.
Michael Grant is the vice provost and associate vice chancellor for undergraduate education at the University of Colorado, Boulder. He says there's a lot of research supporting the effectiveness and usefulness of student evaluations.
"There are multiple, well-designed, thoughtfully conducted studies that clearly contradict this very weakly designed study," he says, citing a study from 2000 and another conducted at his own university. His personal review of student ratings from one department at CU Boulder over nine years did find a bias in favor of men, he says, but it was very small, averaging 0.13 points on a 6-point scale.
At Sciences Po, educators are taking steps to try to reduce gender bias in end-of-course evaluations, beginning with informing students about that bias. However, both Boring and Stark seem ready to write off Student Evaluations of Teaching, or SET, as pretty much useless.
"Trying to adjust for the bias to make SET 'fair' is hopeless," says Stark, "(even if they measured effectiveness, and there's lots of evidence that they don't)."
Boring acknowledges that "SETs can contain some information that can be valuable." But, she adds, they are too biased to be used in a high-stakes way as a measure of teacher effectiveness.