College of Education, NCCU

Applying the R Packages to Integrate Estimation of Plausible Values and Implementation of Secondary Data Analysis with a National Large-Scale Assessment Data from the Perspective of Builders

Journal of Education & Psychology, Volume 48, Issue 3

Date 2026-03-31

從建置者角度應用開放軟體R建立大型教育評量學生表現可能值與次級資料分析

Author(s):

Jin-Chang Hsieh (Research Center for Testing and Assessment, National Academy for Educational Research)

Abstract:

Research Motivation and Objective

International large-scale assessments (ILSA) have become core evidence infrastructures for policy and scholarship. Inspired by NAEP (USA), the National Education Association of Korea (NEAK), and the National Educational Policy Survey (Germany), Taiwan has launched TASA and TASAL (including its i-Generation extension) to measure proficiency and context at scale. Two domestic bottlenecks persist: (1) National databases rarely release plausible values (PVs), the lingua franca of ILSA-based inference; and (2) Guidance and tooling for secondary analyses under complex sampling and PV uncertainty are fragmented. Although international programmes disseminate PVs and utilities (e.g., IDB Analyzer), their latent-regression choices and missing-data strategies may not align with local analytic needs.

Positioning ourselves as database architects, we propose a coherent, end-to-end framework that integrates PV estimation with downstream secondary analysis. Using a Taiwanese large-scale English assessment, we demonstrate how open-source R packages implement the complete workflow: item response modelling and latent regression (with principled handling of missing background data), PV extraction, and design-based routines for descriptive statistics, regression, and structural equation modelling (SEM). Our aims are: (1) To articulate a transparent, modular PV-and-analysis pipeline adaptable to local requirements; and (2) To supply concrete micro-procedures and code patterns that raise the quality and consistency of national secondary analyses.

Literature Review

PVs are generated by latent-regression item response models that couple an item response model with a population model for proficiency conditional on background covariates. Their rationale is strongest under matrix-sampled designs, where students see only a subset of items; multiple posterior draws propagate measurement and imputation uncertainty, avoiding the biases of single-point scores. Design-based work on weights, replication (e.g., jackknife; modified BRR), and design effects establishes that valid inference in complex samples requires both weighting and proper variance estimation. A current tension concerns missing background data at the PV stage: International practice often uses missing-indicator codings, yet downstream analysts frequently impute. Divergent strategies can induce incoherence between PV construction and subsequent modelling, motivating a unified pipeline that treats missingness consistently across stages.

Research Methodology

Data and design. The demonstration dataset comprises 6,815 students selected via stratified two-stage cluster sampling: Systematic PPS selection of schools followed by one class per school. Released variables include student/school IDs, stratification and jackknife replication variables, total and normalized weights, background items, and English item scores. Each student took only two of five domains; English employed a partial balanced incomplete block design across listening and reading items.
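
The first sampling stage described above (systematic PPS selection of schools) can be illustrated with a minimal base-R sketch. The school frame, enrolment sizes, and sample size below are hypothetical, and the real design additionally stratifies schools and samples one class per selected school, which this sketch does not reproduce.

```r
# Minimal sketch of systematic PPS selection of schools (hypothetical frame).
set.seed(5)
schools <- data.frame(id = 1:40,
                      size = sample(50:300, 40, replace = TRUE))  # measure of size
n_sample <- 5

cum_size <- cumsum(schools$size)
interval <- sum(schools$size) / n_sample           # sampling interval
start    <- runif(1, 0, interval)                  # random start in (0, interval)
points   <- start + interval * (0:(n_sample - 1))  # equally spaced selection points

# A school is selected when a selection point falls in its cumulative-size range.
picked <- findInterval(points, c(0, cum_size), left.open = TRUE)

# Inclusion probability is proportional to size; the base weight is its inverse.
prob     <- n_sample * schools$size / sum(schools$size)
base_wgt <- 1 / prob[picked]
```

Because the interval here always exceeds the largest school size, no school can be selected twice; with very large schools a real frame would need certainty selections handled separately.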

Stage 1: PV construction. We compare five IRT specifications in TAM and mirt, ranging from unidimensional Rasch/PCM to multidimensional 2PL/GPCM and a bifactor model (general English plus listening/reading specifics). Parameters are estimated by marginal maximum likelihood with EM; higher-dimensional models use quasi–Monte Carlo integration. Item quality is reviewed with Infit/Outfit statistics (values of roughly 0.5–1.5 treated as productive), and gender DIF is assessed via likelihood-ratio and Wald tests (with multiplicity control). Missing background data are multiply imputed using mice with predictive mean matching to align PV construction with likely downstream strategies. Background variables are contrast-coded; principal components explaining ~80%–90% of variance form the conditioning set for the latent regression. From the posterior, M = 10 PVs are drawn for English (EAP reliability ≈ .66) and transformed to a 500/100 scale.
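
The dimension-reduction step for the conditioning set can be sketched in base R as follows. The matrix X is simulated here as a hypothetical stand-in for the imputed, contrast-coded background variables, and the 0.80 threshold is the lower end of the range quoted above; neither is taken from the released data.

```r
# Minimal sketch: keep the leading principal components that reach a
# cumulative-variance threshold (X is simulated, not the real background data).
set.seed(1)
n <- 200
latent   <- matrix(rnorm(n * 3), n, 3)
loadings <- matrix(runif(3 * 12, -1, 1), 3, 12)
X <- latent %*% loadings + matrix(rnorm(n * 12, sd = 0.5), n, 12)

pca       <- prcomp(X, center = TRUE, scale. = TRUE)
var_share <- pca$sdev^2 / sum(pca$sdev^2)   # variance share per component
cum_share <- cumsum(var_share)

threshold <- 0.80
k <- which(cum_share >= threshold)[1]       # smallest k reaching the threshold

# Component scores that would enter the latent regression as covariates.
conditioning <- pca$x[, seq_len(k), drop = FALSE]
```

Recording the threshold and the resulting k alongside the released PVs is one way to make the conditioning set reproducible.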

Stage 2: Secondary analysis. Two routes are provided. A convenience interface (itasa, adapted from intsvy) supports weighted descriptives, group comparisons, PV pooling via Rubin’s rules, and graphics. A lower-level route uses survey and mitools to define the replicate-weighted design (JKn with strata and jackknife zones) and to pool PV-specific estimates. For SEM, lavaan.survey fits a model in which SES predicts motivational constructs (interest, self-concept, social motivation) and English proficiency; the motivational constructs in turn predict proficiency.
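
The arithmetic behind the lower-level route can be sketched in base R: a delete-one-PSU jackknife (JKn) sampling variance is computed for each PV, and the PV-specific results are then pooled with Rubin's rules. This is an illustrative sketch only. The toy data, the helper names wmean and jkn_var, and the choice to centre replicates on the full-sample estimate are assumptions; in practice the survey and mitools packages would carry out these steps on the released replication variables.

```r
# Minimal sketch: JKn variance per PV, then Rubin's rules (toy data).
set.seed(3)
dat <- expand.grid(student = 1:15, psu = 1:2, stratum = 1:6)  # 6 strata x 2 PSUs
n <- nrow(dat)
dat$w <- runif(n, 0.8, 1.5)                                   # hypothetical weights
theta <- rnorm(n, 500, 90)
M  <- 5
pv <- sapply(seq_len(M), function(m) theta + rnorm(n, sd = 40))  # n x M toy PVs

wmean <- function(y, w) sum(w * y) / sum(w)

# Delete one PSU at a time and reweight the remaining PSUs in its stratum.
jkn_var <- function(y, w, stratum, psu) {
  full <- wmean(y, w)
  v <- 0
  for (h in unique(stratum)) {
    psus <- unique(psu[stratum == h])
    nh <- length(psus)
    for (j in psus) {
      is_dropped <- stratum == h & psu == j
      is_kept    <- stratum == h & psu != j
      w_rep <- w
      w_rep[is_dropped] <- 0
      w_rep[is_kept]    <- w[is_kept] * nh / (nh - 1)
      v <- v + (nh - 1) / nh * (wmean(y, w_rep) - full)^2
    }
  }
  v
}

q <- apply(pv, 2, wmean, w = dat$w)                    # estimate per PV
u <- apply(pv, 2, jkn_var, w = dat$w,
           stratum = dat$stratum, psu = dat$psu)       # sampling variance per PV

q_bar     <- mean(q)                    # pooled point estimate
u_bar     <- mean(u)                    # within-PV (sampling) variance
b         <- var(q)                     # between-PV (measurement) variance
total_var <- u_bar + (1 + 1 / M) * b    # Rubin's total variance
se_pooled <- sqrt(total_var)
```

The pooled standard error exceeds the average replication-only standard error whenever the PVs disagree, which is how PV uncertainty enters the final inference.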

Research Results

Model selection and diagnostics. Information criteria favoured 2PL/GPCM under uni- and two-dimensional structures; the bifactor improved fit marginally at substantial computational cost. A small set of items showed elevated Outfit, but Infit remained acceptable. Gender DIF was limited (significant for a few items by LRT, not by Wald), supporting approximate invariance.

PV construction. Ten PVs were extracted using the imputed, dimension-reduced background set (~129 principal-component covariates) and standardized to 500/100, facilitating interpretability and comparability with international conventions.
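
A 500/100 standardization of this kind can be sketched in base R. The sketch assumes a single linear transformation shared by all PVs, based on the weighted mean and SD pooled across PVs; the abstract does not state which reference moments were used, and the PVs and weights below are simulated.

```r
# Minimal sketch: rescale logit-metric PVs to mean 500, SD 100 (simulated data).
set.seed(2)
n <- 500; M <- 10
theta    <- rnorm(n)
pv_logit <- sapply(seq_len(M), function(m) theta + rnorm(n, sd = 0.6))  # n x M
w <- runif(n, 0.5, 2)                                # hypothetical student weights

W     <- matrix(w, n, M)                             # weights repeated per PV column
mu    <- sum(W * pv_logit) / sum(W)                  # pooled weighted mean
sigma <- sqrt(sum(W * (pv_logit - mu)^2) / sum(W))   # pooled weighted SD

# One common transformation keeps the PVs comparable with one another.
pv_scaled <- 500 + 100 * (pv_logit - mu) / sigma
```

Applying one common transformation, rather than standardizing each PV separately, preserves the between-PV variation that later feeds Rubin's rules.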

Design-based descriptives. Weighted means by gender and community development level exhibited a robust gradient: students in advantaged areas outperformed peers in disadvantaged areas, and females outperformed males. Highest means occurred for females in advantaged areas; lowest for males in disadvantaged areas. Confidence intervals did not overlap. Estimates from itasa matched survey/mitools, validating the convenience interface.

Regression analyses. Using normalized “household” weights, multiple regression of English PVs on SES and motivational constructs showed SES and academic self-concept as positive, statistically significant predictors. Interest and social motivation became non-significant once SES and self-concept were included, aligning with collinearity and mediational accounts in the achievement literature. Coefficients and standard errors were essentially identical across the two routes; Rubin-pooled standard errors reflected both replication and PV uncertainty.

SEM under complex sampling. Comparing traditional ML (Satorra–Bentler corrections) with design-based pseudo-ML (jackknife replication) revealed higher standard errors and adjusted fit when accounting for the survey design (average generalized design effect rising from ≈ 1.09 to ≈ 1.45). Parameter directions were stable across estimators; conditional relative efficiency averaged ≈ 1.13. The SEM indicated strong direct and indirect associations between self-concept and English proficiency, with self-concept showing the most robust proximal link.

Discussion and Recommendations

An integrated pipeline. Treat PV estimation and secondary analysis as one workflow to reduce misalignment between how proficiency is generated and later modelled. In practice: (1) Handle missing background data consistently (avoid mixing missing-indicator PVs with imputed downstream models); (2) Document conditioning sets and transformations (e.g., PCA thresholds); and (3) Expose modular code to permit alternative imputation models and latent regressions.

Modelling choices and scalability. Rich multidimensional or bifactor IRT can capture domain structure but may yield limited practical gains at high computational cost. For routine reporting and most secondary analyses, a well-diagnosed 2PL/GPCM with rigorous item screening balances fidelity and feasibility. When many domains/subdomains must be estimated, staged estimation or domain-specific PVs with explicit linking may be preferable to monolithic high-dimensional fits.

Design-based inference is essential. Two-stage stratified cluster designs and replicate weights materially affect standard errors and fit. Analysts should default to replicate-weighted designs (jackknife, BRR variants) and Rubin’s pooling for PVs; single-PV analyses or PV averaging should be avoided because they understate uncertainty.
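
One visible symptom of PV averaging can be shown with a small base-R simulation under a simple normal measurement model. The model is an assumption made for illustration; only the reliability of .66 echoes the EAP reliability reported above. Averaging the PVs pulls each student's score toward the posterior mean, so the averaged score is less variable than the proficiency distribution it is meant to represent.

```r
# Minimal sketch: averaged PVs understate the spread of proficiency
# (simulated normal model; rho borrows the reported EAP reliability).
set.seed(4)
n <- 20000; M <- 10
rho   <- 0.66
theta <- rnorm(n)                                  # true proficiency, variance 1
x     <- theta + rnorm(n, sd = sqrt((1 - rho) / rho))  # noisy observed score
eap     <- rho * x                                 # posterior mean of theta given x
post_sd <- sqrt(1 - rho)                           # posterior SD of theta given x

# Each PV is a draw from the student's posterior distribution.
pv <- eap + matrix(rnorm(n * M, sd = post_sd), n, M)

var_single <- var(pv[, 1])        # close to the true variance of 1
var_avg    <- var(rowMeans(pv))   # close to rho + (1 - rho) / M, well below 1
```

A single PV recovers the population variance but, used alone, gives no estimate of the between-PV component, which is why analysing every PV and pooling the results is the recommended route.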

Substantive insights and limits. Demonstrated patterns—area and female advantages, primacy of SES and self-concept—are consonant with the literature, lending face validity to the pipeline. Because the dataset was intentionally modified and partial, results are methodological demonstrations rather than policy findings.

Prospects. Future work should: (1) Co-design conditioning models with anticipated downstream analyses (e.g., propensity-score weighting, subgroup definitions) to minimize incoherence between PV generation and causal/comparative modelling; and (2) Integrate multi-study synthesis (e.g., IPD meta-analysis) with PV-aware, design-based estimation to scale evidence across cohorts and years while preserving complex-sample guarantees. Practically, agencies should release PVs with documented conditioning and imputation, provide total/normalized weights plus replicate schemes, and supply lightweight, auditable R functions that mirror survey/mitools calls, enabling Taiwan’s national assessments to inform policy and support high-quality secondary research.

Keywords:

large-scale assessment, plausible value, international large-scale assessment in education, design-based secondary data analysis, R