It's as true now as it was five years ago: the ability to measure teachers' relative effectiveness is the backbone of any talent management strategy. However, a new study adds to evidence that teacher observations, although a relatively reliable measure, may be capturing more than we would like.
Using data collected originally for the 2009-2013 MET Project, Shanyce Campbell (University of California, Irvine) and Matthew Ronfeldt (University of Michigan) found that a teacher's observation score is significantly influenced by both the type of students taught and the teacher's gender. All other factors being equal, male teachers, as well as teachers with more low-achieving and minority students, tend to receive lower observation scores. These results appear robust, as previous research has also found that teachers assigned to lower-performing students are more likely to earn lower observation scores.
What does this mean? Two possible explanations emerge. The first is that these groups of teachers are in fact weaker on average—not implausible given that numerous past studies have found that disadvantaged schools and students have lower-quality teachers. The second explanation is that evaluators are biased against male teachers or teachers who work with lower-income or minority students. Based on further analysis, the study's authors leaned toward the latter explanation.
To mitigate potential bias and build confidence in the accuracy of ratings, districts should use multiple observers and regularly check the validity of their data. Putting in place checks on validity (e.g., by comparing ratings between two observers) and reliability (e.g., by looking for drift in ratings over time) is good practice for any evaluation. Furthermore, evaluation systems that combine multiple measures consistently produce more stable results.
Identifying flaws in evaluation systems is not a reason to abandon them completely, but it is a reason for states to mitigate such flaws.