Simple cluster bargraphs with error bars in R

Here you go. (Edited to add: These days you should use ggplot2.)

bargraph.CI(dose, len, group = supp, data = ToothGrowth,
    xlab = "Dose", ylab = "Growth", x.leg = 1,
    col = c("white", "grey"), legend = TRUE,
    ci.fun = function(x) t.test(x)$conf.int)

lineplot.CI(dose, len, group = supp, data = ToothGrowth,
    xlab = "Dose", ylab = "Growth", x.leg = 1,
    legend = TRUE,
    ci.fun = function(x) t.test(x)$conf.int)

Computing standard errors (SEs) and confidence intervals for BLUPs in lmer/lme4

Useful email exchange over here. Key bits—something like this will work:

rfg <- ranef(my.lmer, postVar = TRUE)
rfg.se <- sqrt(as.numeric(attributes(rfg$group)$postVar))

But note Dimitris Rizopoulos’s remark:

“… maybe one thing that I think needs to be kept in mind is that the posterior variances that ranef(…, postVar = TRUE) returns condition on the MLEs and do not take their variability into account.”

Douglas Bates notes over here:

“… I prefer to call the quantities that are returned by the ranef extractor ‘the conditional modes of the random effects’. If you want to be precise, these are the conditional modes (for a linear mixed model they are also the conditional means) of the random effects B given Y = y, evaluated at the parameter estimates. One can also evaluate the conditional variance-covariance of B given Y = y and hence obtain a prediction interval. These are returned by ranef when the optional argument ‘postVar’ is TRUE…

“BTW, the reason that I say ‘conditional modes’, rather than ‘conditional means’, is so the term can apply to generalized linear mixed models, nonlinear mixed models and generalized nonlinear mixed models. The conditional distribution of B given Y is a continuous multivariate distribution that can be evaluated explicitly for a linear mixed model but is known only up to a constant for the generalized forms of the model. We can still optimize the conditional density of B given Y = y for particular values of the parameters but doing so provides the conditional modes, not the conditional means.

Goldstein (see e.g., his book Multilevel Statistical Models) shows that multiplying the standard errors by 1.39 rather than 1.96 gives intervals whose overlap supports pairwise comparison of these estimates.
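
A rough way to see where the 1.39 comes from (a sketch in Python rather than R, and not Goldstein's own derivation, which also handles unequal and correlated SEs): a two-sided 5% z-test declares two independent estimates with a common standard error different when their gap exceeds \(1.96\sqrt{2}\) SEs, and intervals of half-width \(k\) SEs fail to overlap when the gap exceeds \(2k\) SEs, so \(k = 1.96\sqrt{2}/2 \approx 1.39\).

```python
import math

# Half-width multiplier k such that "the two intervals do not overlap"
# matches a two-sided 5% z-test of the difference between two
# independent estimates with equal standard errors:
#   2 * k * SE = 1.96 * sqrt(2) * SE
k = 1.96 * math.sqrt(2) / 2
print(round(k, 2))  # 1.39
```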

Also the arm package has a function se.ranef. Code for that below:

se.ranef <- function (object) {
    se.bygroup <- ranef(object, postVar = TRUE)
    n.groupings <- length(se.bygroup)
    for (m in 1:n.groupings) {
        vars.m <- attr(se.bygroup[[m]], "postVar")
        K <- dim(vars.m)[1]
        J <- dim(vars.m)[3]
        names.full <- dimnames(se.bygroup[[m]])
        se.bygroup[[m]] <- array(NA, c(J, K))
        for (j in 1:J) {
            se.bygroup[[m]][j, ] <- sqrt(diag(as.matrix(vars.m[, , j])))
        }
        dimnames(se.bygroup[[m]]) <- list(names.full[[1]], names.full[[2]])
    }
    return(se.bygroup)
}

Statistical models as cognitive models…

“Models of data have a deep influence on the kinds of theorising that researchers do. A structural equation model with latent variables named Shifting, Updating, and Inhibition (Miyake et al. 2000) might suggest a view of the mind as inter-connected Gaussian distributed variables. These statistical constructs are driven by correlations between variables, rather than by the underlying cognitive processes […]. Davelaar and Cooper (2010) argued, using a more cognitive-process-based mathematical model of the Stop Signal task and the Stroop task, that the inhibition part of the statistical model does not actually model inhibition, but rather models the strength of the pre-potent response channel. Returning to the older example introduced earlier of g (Spearman 1904), although the scores from a variety of tasks are positively correlated, this need not imply that the correlations are generated by a single cognitive (or social, or genetic, or whatever) process. The dynamical model proposed by van der Maas et al. (2006) shows that correlations can emerge due to mutually beneficial interactions between quite distinct processes.”

Fugard, A. J. B. & Stenning, K. (2013). Statistical models as cognitive models of individual differences in reasoning. Argument & Computation, 4, 89–102.

Linking statistics and qualitative methods

You’ll be aware of the gist. Quantitative statistical models are great for generalizing, and data suitable for statistical analysis tends to be quicker to analyze than qualitative data. More qualitative methods, such as interviewing, tend to provide much richer information, but generalization is tricky and often involves coding the data so it can be fitted using the stats. How else can the two (crudely defined here!) approaches to analysis talk to each other?

I like this a lot:

“In the social sciences we are often criticized by the ethnographers and the anthropologists who say that we do not link in with them sufficiently and that we simply produce a set of statistics which do not represent reality.”

“… by using league tables, we can find examples of places which are perhaps not outliers but where we want to look for the pathways of influence on why they are not outliers. For example, one particular Bangladeshi village would have been expected to have high levels of immunization, whereas it was down in the middle of the table with quite a large confidence interval. This seemed rather strange, but our colleagues were able to attribute this to a fundamentalist imam. […] Another example is a village at the top of the league table, which our colleagues could attribute to a very enthusiastic school-teacher.”

“… by connecting with the qualitative workers, by encouraging the fieldworkers to look further at particular villages and by saying to them that we were surprised that this place was good and that one was bad, we could get people to understand the potential for linking the sophisticated statistical methods with qualitative research.” (Ian Diamond and Fiona Steele, from a comment on a paper by Goldstein and Spiegelhalter, 1996, p. 429)

Also reminds me of a study by Turner and Sobolewska (2009) which split participants on their Systemizing and Empathizing Quotient scores. Participants were asked, “What is inside a mobile phone?” Here’s what someone with high EQ said:

“It flashes the lights, screen flashes, and the buttons lights up, and it vibrates. It comes to life on the inside and it comes to life on the outside, and you talk to the one side and someone is answering on the other side”

And someone with high SQ:

“Many things, circuit boards, chips, transceiver [laughs], battery [pause], a camera in some of them, a media player, buttons, lots of different things. [pause] Well there are lots and lots of different bits and pieces to the phone, there are mainly in … Eh, like inside the chip there are lots of little transistors, which is used, they build up to lots of different types of gates…”

(One possible criticism is that the SQ/EQ just found students of technical versus non-technical subjects… But the general idea is still lovely.)

Would be great to see more quantitative papers with little excerpts of stories. We tried in our paper on spontaneous shifts of interpretation on a probabilistic reasoning task (Fugard, Pfeifer, Mayerhofer & Kleiter, 2011, p. 642), but we only squeezed in a few sentences:

‘Participant 34 (who settled into a conjunction interpretation) said: “I only looked at the shape and the color, and then always out of 6; this was the quickest way.” Participant 37, who shifted from the conjunction to the conditional event, said: “In the beginning [I] always [responded] ‘out of 6,’ but then somewhere in the middle . . . Ah! It clicked and I got it. I was angry with myself that I was so stupid before.” Five participants spontaneously reported when they shifted during the task, for example, saying, “Ah, this is how it works.”’


Fugard, A. J. B., Pfeifer, N., Mayerhofer, B., & Kleiter, G. D. (2011). How people interpret conditionals: Shifts towards the conditional event. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 635–648.

Goldstein, H. & Spiegelhalter, D. J. (1996). League tables and their limitations: statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society. Series A (Statistics in Society) 159, 385–443.

Turner, P. & Sobolewska, E. (2009). Mental models, magical thinking, and individual differences. Human Technology 5, 90–113.

Qual and quant – subjective and objective?

“… tensions between quantitative and qualitative methods can reflect more on academic politics than on epistemology. Qualitative approaches are generally associated with an interpretivist position, and quantitative approaches with a positivist one, but the methods are not uniquely tied to the epistemologies. An interpretivist need not eschew all numbers, and positivists can and do carry out qualitative studies (Lin, 1998). ‘Quantitative’ need not mean ‘objective’. Subjective approaches to statistics, for instance Bayesian approaches, assume that probabilities are mental constructions and do not exist independently of minds (De Finetti, 1989). Statistical models are seen as inhabiting a theoretical world which is separate to the ‘real’ world though related to it in some way (Kass, 2011). Physics, often seen as the shining beacon of quantitative science, has important examples of qualitative demonstrations in its history that were crucial to the development of theory (Kuhn, 1961).”

Fugard and Potts (2015, pp. 671-672)

Some advice on factor analysis from the 60s

“What are the alternatives to factor (or component) analysis if one has a correlation whose analysis one cannot escape? There is only one alternative method of analysing a correlation matrix which needs to be mentioned, and that is to LOOK AT IT.”

“Quite the best alternative to factor analysis is to avoid being saddled with the analysis of a correlation matrix in the first place. (Just to collect a lot of people, to measure them all on a lot of variables, and then to compute a correlation matrix is, after all, not a very advanced way of investigating anything.)”

From Andrew S. C. Ehrenberg (1962). Some Questions About Factor Analysis. The Statistician, 12(3), 191-208

Chi-square in SEM

Playing around again with SEM. Just where does that \(\chi^2\) come from?

You start with the sample covariance matrix (\(S\)) and a model description (quantitative boxology; CFA tied together with regression). The fit machinery iterates over estimates for the various parameters until the difference between \(S\) and the “implied” covariance matrix (i.e., the one predicted by the model, \(C\)) is minimised, and out pops the final set of estimates. Then you multiply that minimised difference by \(N - 1\) to get something with a \(\chi^2\) distribution.


First how do we get \(C\)? Loehlin (2004, p. 41) to the rescue:

\(C = F \cdot (I-A)^{-1} \cdot S \cdot ((I-A)^{-1})’ \cdot F’\)

Here \(A\) and \(S\) have the same dimensions as the sample covariance matrix. (This is a different \(S\) to the one I mentioned above—don’t be confused yet.)

\(A\) contains the (asymmetric) path estimates, \(S\) contains the (symmetric) covariances and residual variances, and \(F\) is the so-called filter matrix which marks which variables are measured variables. (\(I\) is the identity matrix and \(M’\) is the transpose of \(M\).)
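
To see how the pieces slot together, here is the smallest possible worked example (a sketch in Python rather than R; the numbers are made up): a single regression path \(b\) from \(x\) to \(y\). \(A\) holds the path, \(S\) holds \(\mbox{var}(x)\) and the residual variance of \(y\), and \(F\) is the identity since both variables are observed. The implied matrix should come out with \(\mbox{cov}(x,y) = b\,\mbox{var}(x)\) and \(\mbox{var}(y) = b^2 \mbox{var}(x) + \mbox{var}(e)\).

```python
# Implied covariance C = F (I-A)^-1 S ((I-A)^-1)' F' for a tiny model:
# variables ordered (x, y); made-up values b = 2, vx = 4, ve = 9.
b, vx, ve = 2.0, 4.0, 9.0

A = [[0, 0],   # asymmetric paths: A[i][j] = path from variable j to i
     [b, 0]]
S = [[vx, 0],  # symmetric: variances and residual variances
     [0, ve]]

# (I - A) is lower triangular here, so its inverse is easy to write down:
# (I - A)^-1 = [[1, 0], [b, 1]].
IAinv = [[1, 0],
         [b, 1]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

# F is the identity (both variables observed), so C = (I-A)^-1 S ((I-A)^-1)'
C = matmul(matmul(IAinv, S), transpose(IAinv))

print(C)  # var(y) = b^2*vx + ve = 25, cov(x, y) = b*vx = 8
```

So at least for this toy model the formula just chains each variable’s variance through the paths that feed it: the usual path-tracing rules in matrix form.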

I don’t quite get WHY the implied matrix is plugged together this way, but onwards…

So now we have a \(C\). Take \(S\) again—the sample covariance matrix. Loehlin gives a number of different criterion measures which tell you how far off \(C\) is. I’m playing with SEM in R using John Fox’s package which uses this one:

\(\mbox{tr}(SC^{-1}) + \mbox{log}(|C|) - \mbox{log}(|S|) - n\)

where \(\mbox{tr}\) is the trace of a matrix (the sum of its diagonal elements), and \(|M|\) is the determinant of \(M\). Oh, and \(n\) is the number of observed variables.

The R code for this (pulled and edited from the null \(\chi^2\) calculation in the sem fit function) is

sum(diag(S %*% solve(C))) + log(det(C)) - log(det(S)) - n

Here you can see trace is implemented as a sum after a diag. The solve function applied to only one matrix (as here) gives you the inverse of the matrix.
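
A quick numerical check of the criterion’s behaviour (in Python for illustration, with a made-up 2×2 \(S\)): when \(C = S\), the trace term equals \(n\) and the log-determinants cancel, so the criterion is zero; a \(C\) that misses the covariance scores worse. Multiplying the minimised value by \(N - 1\) is what gives the model \(\chi^2\).

```python
import math

def fit_criterion(S, C):
    """tr(S C^-1) + log|C| - log|S| - n, written out for 2x2 matrices."""
    n = 2
    det_C = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    # Inverse of a 2x2 matrix.
    C_inv = [[ C[1][1] / det_C, -C[0][1] / det_C],
             [-C[1][0] / det_C,  C[0][0] / det_C]]
    # Trace of S %*% C^-1: only the diagonal of the product is needed.
    tr = (S[0][0] * C_inv[0][0] + S[0][1] * C_inv[1][0] +
          S[1][0] * C_inv[0][1] + S[1][1] * C_inv[1][1])
    return tr + math.log(det_C) - math.log(det_S) - n

S = [[4.0, 1.0],
     [1.0, 3.0]]

print(fit_criterion(S, S))                          # ~0: perfect fit
print(fit_criterion(S, [[4.0, 0.0], [0.0, 3.0]]))   # > 0: covariance ignored
```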

Let’s have a quick poke around with the sem package using a simple linear regression:



library(sem)

N = 100
x1 = rnorm(N, 20, 20)
x2 = rnorm(N, 50, 10)
x3 = rnorm(N, 100, 15)
e = rnorm(N, 0, 100)

y = 2*x1 - 1.2*x2 + 1.5*x3 + 40 + e

thedata = data.frame(x1,x2,x3,y)

mod1 = specify.model()
y <-> y, e.y, NA
x1 <-> x1, e.x1, NA
x2 <-> x2, e.x2, NA
x3 <-> x3, e.x3, NA
y <- x1, bx1, NA
y <- x2, bx2, NA
y <- x3, bx3, NA

sem1 = sem(mod1, cov(thedata), N=dim(thedata)[1], debug=T)

When I ran this, the model \(\chi^2 = 4.6454\).

The \(S\) and \(C\) matrices can be extracted from the fitted object as sem1$S and sem1$C. Then plugging these into the formula …

N = 100
n = 4

S = sem1$S
C = sem1$C

(N - 1) *
  (sum(diag(S %*% solve(C))) + log(det(C)) - log(det(S)) - n)

… gives… 4.645429.

One other thing: to get the null \(\chi^2\) you just set \(C\) to the diagonal of \(S\).
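
That trick is worth a second look (checked in Python below, with made-up numbers): with \(C = \mbox{diag}(S)\), each diagonal element of \(SC^{-1}\) is \(S_{ii}/S_{ii} = 1\), so the trace term is exactly \(n\) and the criterion collapses to \(\mbox{log}(|\mbox{diag}(S)|) - \mbox{log}(|S|)\). For two standardized variables with correlation \(r\), that is \(-\mbox{log}(1 - r^2)\).

```python
import math

r = 0.6  # made-up correlation between two standardized variables
S = [[1.0, r],
     [r, 1.0]]

# With C = diag(S), tr(S C^-1) = n exactly, so only the log-dets remain.
det_diag = S[0][0] * S[1][1]                   # |diag(S)|
det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]  # |S| = 1 - r^2 here
criterion = math.log(det_diag) - math.log(det_S)

print(criterion)  # equals -log(1 - r^2)
```

Multiply by \(N - 1\) and you have the null \(\chi^2\).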



Loehlin, J. C. (2004). Latent Variable Models (4th ed). LEA, NJ, USA.

SEM fits

What makes a good SEM fit? Thought I’d take a peek at Bentler (2007).

He gives a range of fairly general suggestions of things to report about model fits, including (a) which bits of the model were and weren’t decided a priori; (b) tests of assumptions, e.g., multivariate normality; (c) descriptives and a correlation matrix; and (d) a summary of parameters before and after paths have been added or removed.

Regarding fit, he suggests that

“At least one statistical test of model fit … should be reported for each major SEM model, unless the author verifies that no appropriate test exists for their design and model.”

In addition, he suggests reporting SRMR or the average absolute standardized residual, as well as the largest several residuals in a correlation metric. He also suggests reporting CFI or RMSEA.

The all-important small sample problem: he says “small” is \(N < 100\) and that in such cases at least one alternative model should be provided which is successfully rejected even with the small sample size.

Now the interesting thing is what he makes of \(\chi^2\). He argues that any test is unlikely to have an exact \(\chi^2\) distribution, and thus it would be ill-advised to rely too much on an exact test p-value, though importantly he adds that

“The toolkit of possible \(\chi^2\) tests has recently vastly expanded … and it does not make sense to talk about “the” \(\chi^2\) test. Including F-tests, EQS 6 provides more than a dozen model tests… I certainly favor the use of a carefully chosen model test, but even the best of these can fail in applications…”

One of the articles he cites to justify suspicion of test distributions is a simulation study by a colleague and he (Yuan & Bentler, 2004). They summarise the problem faced by applied researchers:

“In practice, many reported chi-square statistics are significant even when sample sizes are not large, and in the context of nested models, the chi-square difference test is often not significant.”

Elaborating on this, they mean that when you don’t want to have a significant test result, you often do get it; when you do want significance, you don’t. (NHST—different issue for another day.) They also summarise another important problem:

“There are many model fit indices in the literature of SEM. For example, SAS CALIS provides about 20 fit indices in its default output. Consequently, there is no unique criterion for judging whether a model fits the data.”


So the gist from their simulations: often you can’t trust “the” \(\chi^2\) tests; nor can you trust the Wald zs testing the individual parameters. Hurrah.

Wald tests have been attacked elsewhere, e.g., by Hauck and Donner (1977) for logistic regression; they demonstrated the problem in an analysis of what predicts the presence of the T. vaginalis organism in college students. The gist: the further your estimate is away from the null value (usually zero; recall your null hypothesis is often that the slope is zero), the lower the power of the Wald test.
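
The effect is easy to reproduce (a sketch in Python with made-up 2×2 counts, using the closed-form Wald statistic for a log odds ratio rather than a full logistic regression fit): as the treatment group’s success count climbs towards the boundary, the estimated log odds ratio keeps growing, but its standard error grows faster, so the Wald \(z\) eventually shrinks.

```python
import math

def wald_z(a, b, c, d):
    """Wald z for the log odds ratio of a 2x2 table:
    a successes / b failures in one group, c / d in the other."""
    log_or = math.log((a / b) / (c / d))
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return log_or / se

# Control group fixed at 50/50 out of 100; treatment successes
# move towards the boundary of 100.
for s in (90, 95, 99):
    print(s, round(wald_z(s, 100 - s, 50, 50), 2))
# The s = 99 table shows the largest effect but the smallest z of the
# three: the statistic is non-monotone in the effect size.
```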

Ah, the joys of stats! Give me a nice graph or table any day.


Bentler, P. M. (2007). On tests and indices for evaluating structural models. Personality and Individual Differences, 42, 825-829.

Hauck, W. W., Jr. & Donner, A. (1977). Wald’s test as applied to hypotheses in logit analysis. Journal of the American Statistical Association, 72, 851–853.

Yuan, K.-H. & Bentler, P. M. (2004). On chi-square difference and z-tests in mean and covariance structure analysis when the base model is misspecified. Educational and Psychological Measurement, 64, 737–757.

SEM again

As yet I haven’t convinced myself that SEM is a good idea—or at least not in the examples I’ve seen in psychology. Two reasons for starters: (i) fit statistic chaos and (ii) weird analyses driven almost entirely by correlations, or at best by vague theorizing based on trivial analyses of the tasks (a bit of this and a bit of that…).

Anyway, recently Andrew Gelman noted:

“… there’s a research paradigm in which you fit a model—maybe a regression, maybe a structural equations model, maybe a multilevel model, whatever—and then you read off the coefficients, with each coefficient telling you something. You gather these together and those are your conclusions.

“My paradigm is a bit different. I sometimes say that each causal inference requires its own analysis and maybe its own experiment. I find it difficult to causally interpret several different coefficients from the same model.”

He goes on in the discussion to add:

“I have no problem with multiple-equation models, including measurement-error models, multilevel models, and instrumental variables. But I’m skeptical of trying to answer several causal questions by fitting a single model to a dataset.”

Interesting discussion starting over here.

Random effects

Another little simulation.

Set up the number of people and how many data points:

people = 10
obs = 4

Noise, the random intercept per person:

sub_noise = data.frame(id = 1:people, rand_int = rnorm(people,0,60))

Set up a dataframe and a relation such that \(y_{ij} = 23 x_i + 20 + b_j + e_{ij}\):

id = rep(1:people,obs)
thedata = data.frame(id)
thedata = merge(sub_noise, thedata)
thedata$x = rep(1:obs,people)
thedata$y = 23*thedata$x + 20 + thedata$rand_int + rnorm(people*obs, 0, 20)

Here’s a bit of the table:

> thedata
   id rand_int x      y
1   1    -84.9 1 -58.15
2   1    -84.9 2  -3.09
3   1    -84.9 3  14.81
4   1    -84.9 4  16.95
5   2    -69.8 1  -9.75
6   2    -69.8 2   3.73
7   2    -69.8 3  18.86
8   2    -69.8 4  59.62
9   3     72.6 1 101.65
10  3     72.6 2 120.65
11  3     72.6 3 142.71
12  3     72.6 4 208.24

Fit a mixed-effects model with a random intercept for participants.

lmer1 = lmer(y ~ x + (1|id), data=thedata)

And a Gaussian regression where responses are assumed to be independent.

glm1 = glm(y ~ x, data=thedata)

> summary(lmer1)
Linear mixed-effects model fit by REML
Formula: y ~ x + (1 | id)
   Data: thedata
 AIC BIC logLik MLdeviance REMLdeviance
 379 385   -187        385          373
Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 2404     49.0
 Residual              419     20.5
number of obs: 40, groups: id, 10

Fixed effects:
            Estimate Std. Error t value
(Intercept)    10.57      17.41    0.61
x              23.51       2.89    8.12

Correlation of Fixed Effects:
  (Intr)
x -0.416
> summary(glm1)

Call:
glm(formula = y ~ x, data = thedata)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
 -92.23  -44.13   -1.10   47.89  103.61

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    10.57      20.11    0.53   0.6022
x              23.51       7.34    3.20   0.0028

(Dispersion parameter for gaussian family taken to be 2697)

    Null deviance: 130117  on 39  degrees of freedom
Residual deviance: 102472  on 38  degrees of freedom
AIC: 433.5

Number of Fisher Scoring iterations: 2

> sd(glm1$residuals)
[1] 51

Note how the SD of the residuals in the mixed-effects model is 20.5, whereas in the model with only fixed effects it is 51. The fixed-effect estimates are identical in the two models, but the standard errors are much smaller once the between-person variability is modelled separately.
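
The two residual SDs are also consistent with each other, which is a useful sanity check (arithmetic in Python; the inputs are the variance estimates printed by lmer above). A model that ignores the grouping has to absorb the between-person variance into its residual term, so its residual SD should be roughly the square root of the sum of the two components.

```python
import math

var_intercept = 2404.0  # between-person variance (random intercept), from lmer
var_residual = 419.0    # within-person residual variance, from lmer

# The single-level model lumps both components into one residual term.
combined_sd = math.sqrt(var_intercept + var_residual)
print(round(combined_sd, 1))  # ~53, close to the 51 from the glm residuals
```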