
The Impact and Control of the Rater Effects in the Context of School Evaluation

Journal of Education & Psychology, Volume 48, Issue 1
Date 2026-03-31

Original title (Chinese): 校務評鑑情境中的評分者效果之影響及控制

Author(s):

Yi-Hung Lin (Department of Education, National Kaohsiung Normal University)

Abstract:

Background

School evaluation provides schools with improvement suggestions through internal self-assessment and external experts’ site visits. Schools pay close attention to school evaluation because its results are sometimes used to judge school performance. Experts’ rating behaviors involve subjective human judgment and are therefore prone to rater effects. To investigate the magnitude of rater effects, the current study applies the many-facet Rasch model (MFRM) to an empirical dataset consisting of 12 experts’ ratings of 46 schools’ creativity curriculum plans on a 25-item, 5-point rating scale. This situation is analogous to school evaluation and is used as an example to demonstrate the impact of rater effects on school evaluation.

Literature

In school evaluation, schools’ teaching and administrative performance is rated by a group of internal and external experts, and the rating results serve as the basis for policy decisions such as budget allocation; schools therefore pay close attention to school evaluation and strive for better results. However, human judgments are not always stable and are prone to bias from specific factors; that is, rater effects may exist in school evaluation results. Many types of rater effects have been identified and studied (Myford & Wolfe, 2003): 1. severity/leniency; 2. halo effect; 3. central tendency; 4. restriction of range; 5. inaccuracy; 6. logical error; 7. contrast error; 8. influences of rater biases, beliefs, attitudes, and personality characteristics; 9. influences of rater/ratee background characteristics; 10. proximity error; 11. recency/primacy error; and 12. order effects. Different statistical methods have been proposed to control rater effects; among them, the item response theory (IRT) approach, the MFRM, is promising because it provides (Gordon et al., 2021; Wang & Long, 2022): 1. reliability indicators for raters, rating scales, items, and ratees; 2. quantified rater severity estimates that can serve as the basis for adjusting rating results; 3. systematic variances among raters for detecting inconsistent rating behaviors; and 4. the capability to fit planned-missing-data designs that reduce raters’ time cost. The present study fits the MFRM to an empirical dataset to demonstrate the impact of rater effects.

Methods

1. The empirical dataset analyzed in this study was collected by National Sun Yat-sen University in 2006 for the Taiwan Ministry of Education’s project “Enhancing senior high schools’ creativity teaching.” The project involved 46 schools and 12 experts, and each expert rated each school’s curriculum plan on a Likert-type, 25-item, 5-point rating scale.

2. Five competing MFRM models were fitted to the empirical dataset introduced above:

(1) An MFRM with school performance, item difficulty, and threshold difficulty parameters; that is, a model assuming no significant rater effects;

(2) An MFRM with school performance, item difficulty, threshold difficulty, and rater severity parameters; that is, a model assuming significant rater effects;

(3) An MFRM with school performance, item difficulty, threshold difficulty, and rater severity parameters plus a rater-by-school interaction parameter; that is, a model assuming significant rater effects and a significant interaction between raters and schools;

(4) An MFRM with school performance, item difficulty, threshold difficulty, and rater severity parameters plus a rater-by-item interaction parameter; that is, a model assuming significant rater effects and a significant interaction between raters and items; and

(5) An MFRM with school performance, item difficulty, threshold difficulty, and rater severity parameters plus both rater-by-school and rater-by-item interaction parameters; that is, a model assuming significant rater effects and significant interactions between raters and schools and between raters and items.
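In the usual MFRM log-odds form, the competing specifications above differ only in which facet terms they include; a sketch of models (2) and (5) using our own symbols (the notation is standard MFRM notation, not taken from the study):

```latex
% Model (2): school performance \theta_n, item difficulty \beta_i,
% rater severity \lambda_j, threshold difficulty \tau_k for category k
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \beta_i - \lambda_j - \tau_k

% Model (5) adds rater-by-school (\gamma_{jn}) and
% rater-by-item (\delta_{ji}) interaction terms
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \beta_i - \lambda_j - \gamma_{jn} - \delta_{ji} - \tau_k
```

Model (1) drops \(\lambda_j\), and models (3) and (4) keep only one of the two interaction terms; because each simpler model is obtained by deleting parameters from a richer one, the models are nested and a likelihood ratio test applies.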

3. The model-data fit indicators include the likelihood ratio test for nested models and, for all models, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the consistent Akaike information criterion (CAIC); and
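As a sketch of how these fit indicators compare nested models, the standard formulas can be computed directly from each model’s log-likelihood; the log-likelihood and parameter-count values below are hypothetical illustrations, not figures from the study:

```python
import math

def aic(loglik, k):
    """Akaike information criterion: -2*logL + 2k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2*logL + k*ln(n)."""
    return -2.0 * loglik + k * math.log(n)

def caic(loglik, k, n):
    """Consistent AIC: -2*logL + k*(ln(n) + 1)."""
    return -2.0 * loglik + k * (math.log(n) + 1.0)

def lrt(loglik_restricted, loglik_full):
    """Likelihood ratio statistic for nested models; compare against
    a chi-square with df = difference in parameter counts."""
    return -2.0 * (loglik_restricted - loglik_full)

# Hypothetical values for two nested MFRM specifications:
# model 1 (no rater parameters) vs. model 2 (adds rater severity).
ll_m1, k_m1 = -5210.4, 74
ll_m2, k_m2 = -5102.7, 85
n = 46 * 12 * 25  # schools x raters x items = number of ratings

print(lrt(ll_m1, ll_m2))       # LR statistic, df = 85 - 74 = 11
print(aic(ll_m2, k_m2))
print(bic(ll_m2, k_m2, n))
```

Lower AIC/BIC/CAIC values indicate better fit after penalizing model complexity; BIC and CAIC penalize extra parameters more heavily than AIC as the number of ratings grows.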

4. The IRT modeling software ConQuest (Wu et al., 2007), developed by the Australian Council for Educational Research, was used to fit the empirical dataset.

Results

1. The model that considers rater effects and the related interaction terms (model 5) is the best-fitting model, and the Wright map also shows individual differences in raters’ severity, in rater-by-school interactions, and in rater-by-item interactions;

2. The infit and outfit indexes indicate that most raters’ severity conforms to model expectations, whereas most items’ difficulty does not, which is thought to reflect raters’ disagreement on item difficulty; and

3. The correlations between schools’ expected and observed sum scores and between their expected and observed rankings are rather high, but the score differences exceed 1 point for 26% of the schools and the ranking differences are nonzero for 50% of the schools; both phenomena may affect the fairness of the evaluation.
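Score- and ranking-difference proportions of this kind can be computed straightforwardly once the model-expected and observed sum scores are in hand; a minimal sketch with made-up scores for four schools (the data below are illustrative, not the study’s):

```python
def to_ranks(scores):
    """Rank scores in descending order (rank 1 = highest);
    ties are broken by position in the list."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

# Illustrative sum scores for four schools.
observed = [92.0, 88.5, 90.2, 75.0]   # raw rating sums
expected = [91.2, 89.6, 89.0, 74.1]   # MFRM model-expected sums

# Share of schools whose score shifts by more than 1 point,
# and share of schools whose ranking changes at all.
score_shift = sum(abs(o - e) > 1.0
                  for o, e in zip(observed, expected)) / len(observed)
rank_shift = sum(o != e
                 for o, e in zip(to_ranks(observed),
                                 to_ranks(expected))) / len(observed)

print(score_shift, rank_shift)  # → 0.5 0.5
```

High score and rank correlations can coexist with nontrivial shift proportions like these, which is exactly the pattern the study reports.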

Conclusions

1. The model that considers rater effects and the related interaction terms is the best-fitting model, which means rater effects cannot be ignored in the exemplar dataset;

2. Significant individual differences exist in raters’ severity, in the interaction effects between raters and schools, and in the interaction effects between raters and rating criteria;

3. Most raters’ severity conforms to the MFRM’s expectations, but most criteria’s difficulties deviate from them, which implies that raters understand the rating criteria differently; and

4. The candidate schools’ expected sum scores were highly correlated with their observed sum scores, but the sum-score differences of 12 (26%) candidate schools were larger than 1 point, and the expected rankings of 23 (50%) candidate schools differed from the observed rankings; both score and ranking differences occurred in high- and low-performing schools and may affect schools’ evaluation results in practice.

Suggestions

1. Choose an appropriate data-analysis model, such as the MFRM, to control rater effects and enhance evaluation objectivity; 2. Reduce the halo effect between raters and schools to maintain evaluation validity; 3. Enhance experts’ consensus on the evaluation targets to lower rater subjectivity; and 4. Apply the adjusted scores and rankings in the final evaluation results to raise the fairness of school evaluation.

Keywords:

many-facet Rasch model, school evaluation, rater effects, item response theory

Journal of Education & Psychology