What makes a good SEM fit? Thought I’d take a peek at Bentler (2007).
He gives a range of fairly general suggestions of things to report about model fits, including (a) which bits of the model were and weren’t decided a priori; (b) tests of assumptions, e.g., multivariate normality; (c) descriptives and a correlation matrix; and (d) a summary of parameters before and after paths have been added or removed.
Regarding fit, he suggests that
“At least one statistical test of model fit … should be reported for each major SEM model, unless the author verifies that no appropriate test exists for their design and model.”
In addition, he suggests reporting SRMR or the average absolute standardized residual, as well as the largest several residuals in a correlation metric. He also suggests reporting CFI or RMSEA.
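To make those quantities concrete, here is a minimal Python sketch of how they are commonly computed. It assumes you already have the observed and model-implied correlation matrices and the \(\chi^2\) values from your SEM software. The function names and the toy numbers are made up for illustration, and the formulas are the usual textbook ones rather than anything specific to Bentler's paper or EQS (some programs use \(N\) instead of \(N - 1\) in RMSEA, for instance).

```python
import math

def residual_summary(observed, implied, top=3):
    """Summarise residuals between two correlation matrices (lists of lists)."""
    p = len(observed)
    # Lower triangle including the diagonal: p(p+1)/2 unique elements.
    resid = [(observed[i][j] - implied[i][j], i, j)
             for i in range(p) for j in range(i + 1)]
    srmr = math.sqrt(sum(r * r for r, _, _ in resid) / len(resid))
    mean_abs = sum(abs(r) for r, _, _ in resid) / len(resid)
    # Largest residuals by absolute size, each with its (row, column) position.
    largest = sorted(resid, key=lambda t: abs(t[0]), reverse=True)[:top]
    return srmr, mean_abs, largest

def rmsea(chi2, df, n):
    """RMSEA point estimate, using the N - 1 convention."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """CFI relative to a baseline (independence) model."""
    num = max(chi2_model - df_model, 0.0)
    den = max(chi2_model - df_model, chi2_base - df_base, 0.0)
    return 1.0 if den == 0 else 1.0 - num / den

# Hypothetical three-variable example; these numbers are invented.
observed = [[1.0, 0.50, 0.40], [0.50, 1.0, 0.30], [0.40, 0.30, 1.0]]
implied = [[1.0, 0.45, 0.42], [0.45, 1.0, 0.30], [0.42, 0.30, 1.0]]

srmr, mean_abs, largest = residual_summary(observed, implied)
print("SRMR:", round(srmr, 4))
print("Mean absolute residual:", round(mean_abs, 4))
print("Largest residuals (value, row, col):", largest)
print("RMSEA:", round(rmsea(85.0, 40, 200), 4))
print("CFI:", round(cfi(85.0, 40, 900.0, 55), 4))
```

The residuals are taken over the lower triangle including the diagonal, which for correlation matrices contributes zeros; conventions differ on whether to include it, so expect small discrepancies against software output.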
The all-important small sample problem: he says “small” is \(N < 100\) and that in such cases at least one alternative model should be provided which is successfully rejected even with the small sample size.
Now the interesting thing is what he makes of \(\chi^2\). He argues that any test statistic is unlikely to follow an exact \(\chi^2\) distribution, so it would be ill-advised to lean too heavily on an exact p-value, though importantly he adds that
“The toolkit of possible \(\chi^2\) tests has recently vastly expanded … and it does not make sense to talk about “the” \(\chi^2\) test. Including F-tests, EQS 6 provides more than a dozen model tests… I certainly favor the use of a carefully chosen model test, but even the best of these can fail in applications…”
One of the articles he cites to justify suspicion of test distributions is a simulation study by him and a colleague (Yuan & Bentler, 2004). They summarise the problem faced by applied researchers:
“In practice, many reported chi-square statistics are significant even when sample sizes are not large, and in the context of nested models, the chi-square difference test is often not significant.”
In other words, when you don’t want a significant test result (overall model fit), you often get one; when you do want significance (the difference between nested models), you don’t. (NHST is a different issue for another day.) They also summarise another important problem:
“There are many model fit indices in the literature of SEM. For example, SAS CALIS provides about 20 fit indices in its default output. Consequently, there is no unique criterion for judging whether a model fits the data.”
So the gist from their simulations: often you can’t trust “the” \(\chi^2\) tests, nor can you trust the Wald z-tests of the individual parameters. Hurrah.
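For reference, the two statistics in question are usually defined along these lines. These are the generic textbook forms, in generic notation rather than Yuan and Bentler's own:

\[
\Delta\chi^2 = \chi^2_{\text{restricted}} - \chi^2_{\text{base}}, \qquad \Delta df = df_{\text{restricted}} - df_{\text{base}}
\]

\[
z = \frac{\hat{\theta} - \theta_0}{\widehat{SE}(\hat{\theta})}
\]

Here \(\Delta\chi^2\) is referred to a \(\chi^2\) distribution with \(\Delta df\) degrees of freedom, and \(z\) to a standard normal. Both reference distributions lean on the less restricted base model being correctly specified, which is exactly the assumption the title of their paper drops.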
Wald tests have been attacked elsewhere, e.g., by Hauck and Donner (1977) for logistic regression; they demonstrated the problem in an analysis of what predicts the presence of the T. vaginalis organism in college students. The gist: once your estimate gets far enough from the null value (usually zero; recall your null hypothesis is often that the slope is zero), its standard error grows faster than the estimate itself, so the Wald statistic shrinks and the test loses power just when the effect looks biggest.
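The effect is easy to see in a toy case. The sketch below is not Hauck and Donner's analysis: it assumes the simplest possible logit model, a single binomial proportion with the null hypothesis that the log-odds are zero (\(p = 0.5\)), and compares the Wald z with the likelihood-ratio statistic as the observed proportion moves away from one half. The function names and the counts are invented for illustration.

```python
import math

def wald_z(successes, n):
    """Wald z for H0: logit(p) = 0 in a one-sample binomial model."""
    failures = n - successes
    estimate = math.log(successes / failures)          # MLE of the log-odds
    std_err = math.sqrt(1 / successes + 1 / failures)  # SE evaluated at the MLE
    return estimate / std_err

def lr_chi2(successes, n):
    """Likelihood-ratio statistic for the same H0 (p = 0.5)."""
    failures = n - successes
    half = n / 2  # expected count in each cell under H0
    return 2 * (successes * math.log(successes / half)
                + failures * math.log(failures / half))

n = 100
for successes in (60, 75, 90, 95, 99):
    print(successes,
          round(wald_z(successes, n), 2),
          round(lr_chi2(successes, n), 1))
```

With 100 trials the Wald z climbs at first and then falls back as the count of successes approaches 100, while the likelihood-ratio statistic keeps increasing, which is why the likelihood-ratio test is often suggested as the safer choice in this situation.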
Ah, the joys of stats! Give me a nice graph or table any day.
Bentler, P. M. (2007). On tests and indices for evaluating structural models. Personality and Individual Differences, 42, 825-829.
Hauck, W. W., Jr., & Donner, A. (1977). Wald’s test as applied to hypotheses in logit analysis. Journal of the American Statistical Association, 72, 851-853.
Yuan, K.-H., & Bentler, P. M. (2004). On chi-square difference and z-tests in mean and covariance structure analysis when the base model is misspecified. Educational and Psychological Measurement, 64, 737-757.