College of Education, NCCU-Does Slipping Matter? Applicability of the Item Response Theory Four-Parameter Logistic Model in Taiwan Student Achievement Tests

Journal of Education & Psychology, Volume 49, Issue 1

Date 2026-06-12

Original title (Chinese): 粗心有差嗎？試題反應理論四參數模式於我國學生成就測驗之適用性

Author(s):

Yi-Hung Lin (Department of Education, National Kaohsiung Normal University)

Abstract:

The multiple-choice item format is widely used in student achievement tests at different educational levels. Three factors may determine whether an examinee answers a multiple-choice item correctly: true ability, guessing, and unexpected errors. Guessing behavior on multiple-choice items has long been an important research topic in educational measurement, and “unexpected errors,” namely “carelessness,” have also received increasing attention. When examinees are careless, testing agencies may obtain biased estimates of examinee ability and item parameters, which may in turn affect test fairness and related educational decision making. In item response theory (IRT), the prevailing approach to carelessness is to add a slipping parameter to the three-parameter logistic model (3PLM), resulting in the four-parameter logistic model (4PLM), so as to estimate the probability that high-ability examinees unexpectedly answer easy items incorrectly. Because the 4PLM involves a complex parameter estimation procedure and appropriate estimation software has been limited, its application has been restricted. In recent years, advances in parameter estimation techniques and in computer hardware and software have brought the 4PLM growing scholarly attention. However, published studies applying the 4PLM in Taiwan remain uncommon. Can student achievement tests in Taiwan be fitted with the 4PLM? This study argues that carelessness is not rare in Taiwan student achievement tests and that applying the 4PLM is reasonable; thus, related research should not be absent. To provide suggestions for practitioners, this study fits empirical datasets from Taiwan with the 4PLM and compares the results of the 4PLM and 3PLM analyses to demonstrate the properties of the 4PLM.
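To make the relationship between the two models concrete, here is a minimal Python sketch of the item response functions. It assumes the common parameterization in which the fourth parameter is an upper asymptote d (so d = 1 reduces the 4PLM to the 3PLM and 1 - d is the slipping probability), which matches the abstract's remark that slipping estimates "should be close to 1.0." The abstract does not give the exact form used in the study (for example, whether a scaling constant such as 1.7 is applied), and the item parameter values below are hypothetical.

```python
import math


def irf_4pl(theta, a, b, c, d):
    """Probability of a correct response under an assumed 4PLM form.

    theta: examinee ability
    a: discrimination, b: difficulty
    c: lower asymptote (guessing)
    d: upper asymptote (1 - d is the slipping probability)
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def irf_3pl(theta, a, b, c):
    """The 3PLM is the special case of the 4PLM with d fixed at 1."""
    return irf_4pl(theta, a, b, c, 1.0)


if __name__ == "__main__":
    # Hypothetical item: a = 1.5, b = 0.0, c = 0.2, d = 0.96.
    for theta in (-3.0, 0.0, 3.0):
        p3 = irf_3pl(theta, 1.5, 0.0, 0.2)
        p4 = irf_4pl(theta, 1.5, 0.0, 0.2, 0.96)
        print(f"theta={theta:+.1f}  3PLM={p3:.3f}  4PLM={p4:.3f}")
```

Under this form, the 4PLM curve levels off at d rather than at 1 for high-ability examinees, which is how the model leaves room for careless errors on easy items.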

This study includes empirical analysis and simulation evaluation. For empirical analysis, one high-stakes and one low-stakes dataset were analyzed to compare model fit between the 4PLM and 3PLM. The high-stakes dataset was a random sample of 5,000 examinees from the 2021 Comprehensive Assessment Program for junior high school students (CAP), including 41 multiple-choice items from the English reading test. The low-stakes dataset was from the 2014 Taiwan Assessment of Student Achievement (TASA), including 7,405 eleventh-grade examinees and 28 multiple-choice items from the Chinese test. Model fit indices included the likelihood ratio test (LRT), Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and bias-corrected AIC (AICc). All parameters of the 4PLM and 3PLM were estimated using the IRTBEMM package in R. For simulation evaluation, this study generated multiple datasets using the 4PLM parameter estimates from empirical analysis as true values, and then analyzed the simulated data with both the 4PLM and 3PLM. Bias and root mean square error (RMSE) were used to evaluate the differences between average parameter estimates and the true values.
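The fit indices named above can be sketched from each model's maximized log-likelihood as follows. This is an illustration using the standard textbook definitions, not the output of IRTBEMM; the log-likelihood values are hypothetical, and counting only item parameters (3 or 4 per item for the 41 CAP items) is an assumption about how the study tallied them.

```python
import math


def fit_indices(loglik, n_params, n_obs):
    """Return deviance, AIC, BIC, and AICc for one fitted model."""
    deviance = -2.0 * loglik
    aic = deviance + 2.0 * n_params
    bic = deviance + n_params * math.log(n_obs)
    # Small-sample correction added to AIC.
    aicc = aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return {"deviance": deviance, "AIC": aic, "BIC": bic, "AICc": aicc}


def lrt_statistic(loglik_restricted, loglik_full, k_restricted, k_full):
    """Likelihood ratio statistic and degrees of freedom for nested models."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    df = k_full - k_restricted
    return stat, df


if __name__ == "__main__":
    n_items, n_examinees = 41, 5000          # CAP English reading sample
    k_3pl, k_4pl = 3 * n_items, 4 * n_items  # item parameters only (assumed)
    ll_3pl, ll_4pl = -100000.0, -99900.0     # hypothetical log-likelihoods

    print(fit_indices(ll_3pl, k_3pl, n_examinees))
    print(fit_indices(ll_4pl, k_4pl, n_examinees))
    print(lrt_statistic(ll_3pl, ll_4pl, k_3pl, k_4pl))
```

Lower AIC, BIC, and AICc values favor a model; the LRT statistic is referred to a chi-square distribution with the stated degrees of freedom.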

The main results, with the 4PLM compared against the 3PLM, are as follows:

(1) Reliability: The reliability coefficients of both empirical datasets were above .70; the CAP English test tended to have higher reliability than the TASA Chinese test, possibly due to the number of items;

(2) Model fit: The LRT results indicated that the 4PLM, with more parameters, significantly reduced the chi-square values (p < .001 for both datasets); AIC, BIC, and AICc were also lower for the 4PLM, and the differences from the 3PLM were all larger than 10. Overall, the 4PLM fits the empirical datasets better than the 3PLM, indicating that the 4PLM more appropriately described the empirical data in this study;

(3) Empirical parameter comparison: The item fit indices (infit and outfit mean-square) were similar between the 3PLM and 4PLM and all fell within the reasonable range of 0.5 to 1.5; however, the average item fit indices of the 4PLM tended to be closer to the expected value of 1.0. Regarding discrimination, the average estimates and standard errors (SE) under the 4PLM were significantly larger than those under the 3PLM (p < .05). Regarding difficulty, the estimates and SE of difficulty parameters from the two models were significantly correlated (p < .05), but the 4PLM tended to yield smaller estimates and SE than the 3PLM. Regarding guessing, the guessing estimates and SE were often significantly correlated across models, but the 4PLM tended to yield larger estimates and SE than the 3PLM. Regarding slipping, the average slipping estimate in the CAP English test (0.96) was closer to the expectation that the slipping parameter “should be close to 1.0” and the suggestion that it “should not be smaller than 0.9,” whereas the average slipping estimate in the TASA Chinese test (0.86) was closer to previously reported averages in related studies. Regarding ability, the ability distribution under the 4PLM tended to be closer to 0 than that under the 3PLM, and the 3PLM tended to underestimate high-ability examinees compared to the 4PLM;

(4) Simulation evaluation: For discrimination, the average bias and RMSE of the 4PLM were significantly smaller than those of the 3PLM (p < .05). For difficulty, when the test data contained non-negligible slipping effects, the average bias and RMSE of difficulty estimates under the 4PLM tended to be smaller than those under the 3PLM. For guessing, the absolute mean bias and mean RMSE under the 4PLM tended to be smaller than those under the 3PLM. For slipping, the average RMSE of the slipping parameter was close to that of the guessing parameter and was comparable to the magnitude reported in previous research. For ability, the mean RMSE under the 4PLM was significantly smaller than that under the 3PLM (p < .05).
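The bias and RMSE used in the simulation evaluation can be sketched for a single parameter across replications as below. This uses the usual definitions (mean signed error and root mean squared error against the generating value); the replicate estimates shown are hypothetical and not taken from the study.

```python
import math


def bias_and_rmse(estimates, true_value):
    """Bias and RMSE of replicate estimates of one parameter."""
    errors = [est - true_value for est in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return bias, rmse


if __name__ == "__main__":
    # Hypothetical discrimination estimates from four simulated datasets,
    # with a generating (true) value of 1.0.
    replicate_estimates = [0.9, 1.1, 1.0, 1.2]
    bias, rmse = bias_and_rmse(replicate_estimates, 1.0)
    print(f"bias={bias:.3f}  RMSE={rmse:.3f}")
```

Averaging these quantities over all items (or examinees) gives the kind of mean bias and mean RMSE that the results above compare between the two models.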

These results indicate that the slipping parameter plays a meaningful role and that the 4PLM is an applicable analysis model for Taiwan student achievement tests. The conclusions of this study are as follows:

(1) The 4PLM is applicable to Taiwan test data: The 4PLM fit the empirical datasets significantly better than the 3PLM because the slipping parameter can explain examinees’ carelessness that the 3PLM cannot explain, thereby improving model fit;

(2) The empirical results of the 4PLM differ from those of the 3PLM: Compared to the 3PLM, the 4PLM produced item fit indices closer to the expected value 1.0; larger discrimination and guessing estimates and SE; smaller difficulty estimates and SE; larger SE of ability estimates but less tendency to underestimate high-ability examinees; and for the slipping parameter, larger estimates were associated with smaller SE;

(3) The simulation results of the 4PLM showed smaller errors: Compared to the 3PLM, the 4PLM yielded smaller mean bias and RMSE for discrimination, difficulty, and guessing; ability estimates showed significantly smaller mean RMSE; and slipping estimates were within a reasonable range and consistent with prior studies.

This study offers the following suggestions:

(1) Choose an appropriate analysis model to obtain more precise results: The empirical datasets analyzed in this study showed non-negligible carelessness effects that the commonly used 3PLM does not capture; thus, the 4PLM was the better-fitting model. In practice, the choice of analysis model should be based on examinees’ response processes to obtain more precise results;

(2) Investigate the influence and effect of the slipping parameter: The 4PLM tended to be less likely to underestimate high-ability examinees, and larger slipping estimates were associated with smaller SE; further investigating the instructional meaning of the slipping parameter and its relations with other parameters is a feasible direction for future research;

(3) Compare the efficiency and properties of different 4PLM parameter estimation algorithms: Future studies may hold other conditions constant and compare different estimation methods for the 4PLM to provide evidence for researchers when selecting estimation tools.

Keywords:

4PLM, achievement test, item response theory, slipping
