Evaluation of Measurement Invariance in IRT Using Limited Information Fit Statistics/Indices: A Monte Carlo Study
Cui, Mengyao (author)
Yang, Yanyun (professor co-directing dissertation)
Paek, Insu (professor co-directing dissertation)
Huffer, Fred W. (Fred William) (university representative)
Becker, Betsy Jane, 1956- (committee member)
Binici, Salih (committee member)
Florida State University (degree granting institution)
College of Education (degree granting college)
Department of Educational Psychology and Learning Systems (degree granting department)
2016
text
Measurement invariance analysis is important when test scores are used to make group-wise comparisons. Multiple-group IRT modeling is one of the commonly used methods for examining measurement invariance. One essential step in the multiple-group modeling method is the evaluation of overall model-data fit. A family of limited information fit statistics has recently been developed for assessing overall model-data fit in IRT. Previous studies evaluated the performance of limited information fit statistics using single-group data and found that these fit statistics performed better than the traditional full information fit statistics when data were sparse. However, no study has investigated the performance of the limited information fit statistics within the multiple-group modeling framework. This study aims to examine the performance of the limited information fit statistic (M₂) and the corresponding M₂-based descriptive fit indices in conducting measurement invariance analysis within the multiple-group IRT framework. A Monte Carlo study was conducted to examine the sampling distributions of M₂ and the M₂-based descriptive fit indices, and their sensitivity to lack of measurement invariance under various conditions. The manipulated factors included sample sizes, model types, dimensionality, types and numbers of DIF items, and latent trait distributions. Results showed that M₂ approximately followed a chi-square distribution when the model was correctly specified, as expected. The type I error rates of M₂ were reasonable under large sample sizes (1000/2000). When the model was misspecified, the power of M₂ was a function of sample size and the number of DIF items. For example, the power of M₂ for rejecting the U2PL Scalar Model increased from 29.2% to 99.9% when the number of uniform DIF items increased from one to six, given sample sizes of 1000/2000.
With six uniform DIF items (30% of the studied items), the power of M₂ increased from 42.4% to 99.9% when sample sizes changed from 250/500 to 1000/2000. When the difference in M₂ (ΔM₂) was used to compare two correctly specified nested models, the sampling distribution of ΔM₂ departed from the reference chi-square distribution at both tails, especially under small sample sizes. The type I error rates of the ΔM₂ test became closer to the nominal level as sample sizes increased. For example, both the Metric and Configural Models were correctly specified when the test included no DIF items. Given an alpha level of .05, the type I error rates of ΔM₂ for the comparison between the Metric and Configural Models were slightly inflated with n=250/500 (8.72%) and became closer to the alpha level with n=1000/2000 (5.3%). When at least one of the models was misspecified, the power of ΔM₂ increased as the number of DIF items or the sample sizes became larger. For example, the Metric Model was misspecified when nonuniform DIF items existed. Given sample sizes of 1000/2000 and an alpha level of .05, the power of ΔM₂ for the comparison between the Metric and Configural Models increased from 52.55% to 99.39% when the number of nonuniform DIF items changed from one to six. With one nonuniform DIF item in the test, the power of ΔM₂ was only 17.05% given an alpha level of .05 and sample sizes of 250/500, but increased to 52.55% given sample sizes of 1000/2000. The descriptive fit indices and their differences between nested models were also affected by the number of DIF items. When there were no DIF items, all fit indices indicated good model-data fit. The differences in the five fit indices between nested models were all very small (<.008) across different sample sizes. When DIF items existed, the means of the descriptive fit indices and their differences between nested models increased as the number of DIF items increased.
The findings from this study provide suggestions for implementing the limited information fit statistics/indices in measurement invariance analysis within the multiple-group IRT framework.
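The type I error rates and power values reported in the abstract are empirical rejection rates: the proportion of Monte Carlo replications in which a test statistic exceeds the chi-square critical value. The sketch below illustrates only that calculation and is not the dissertation's code. The function names, the choice of df = 1, the number of replications, and the use of simulated chi-square draws in place of fitted ΔM₂ values are all assumptions made for illustration.

```python
import random


def rejection_rate(statistics, critical_value):
    """Proportion of replications whose statistic exceeds the critical value."""
    if not statistics:
        raise ValueError("need at least one replication")
    return sum(s > critical_value for s in statistics) / len(statistics)


def simulate_null_statistics(df, n_reps, seed=0):
    """Draw chi-square(df) values as a stand-in for a statistic under a
    correctly specified model (hypothetical; no IRT model is fitted here)."""
    rng = random.Random(seed)
    # A chi-square variable with df degrees of freedom is Gamma(shape=df/2, scale=2).
    return [rng.gammavariate(df / 2.0, 2.0) for _ in range(n_reps)]


# Upper .05 critical value of the chi-square distribution with 1 degree of freedom.
CRIT_DF1_ALPHA05 = 3.841

# Assumed settings for illustration: df = 1 and 20,000 replications.
null_stats = simulate_null_statistics(df=1, n_reps=20000, seed=1)
rate = rejection_rate(null_stats, CRIT_DF1_ALPHA05)
print(f"Empirical type I error rate: {rate:.4f}")
```

If the statistics came from replications generated under a misspecified model instead, the same `rejection_rate` call would give the empirical power.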
October 31, 2016.
A Dissertation submitted to the Department of Educational Psychology and Learning Systems in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Includes bibliographical references.
Yanyun Yang, Professor Co-Directing Dissertation; Insu Paek, Professor Co-Directing Dissertation; Fred W. Huffer, University Representative; Betsy J. Becker, Committee Member; Salih Binici, Committee Member.
Florida State University
FSU_FA2016_Cui_fsu_0071E_13537
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). The copyright in theses and dissertations completed at Florida State University is held by the students who author them.