Standard errors of marginal means in an RCT

Randomised controlled trials (RCTs) typically use a convenience sample to estimate the mean effect of a treatment for study participants. Participants are randomly assigned to one of (say) two conditions, and an unbiased estimate of the sample mean treatment effect is obtained by taking the difference between the two conditions’ mean outcomes. The estimand in such an RCT is sometimes called the sample average treatment effect (SATE).
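As a minimal sketch of that estimator, the Python below takes hypothetical outcomes and fixed arm labels (standing in for the result of a coin toss; both are made up for illustration) and returns the treatment mean minus the control mean.

```python
import statistics

def difference_in_means(outcomes, arms):
    """Estimate the SATE as mean(treatment outcomes) - mean(control outcomes)."""
    treated = [y for y, a in zip(outcomes, arms) if a == "treatment"]
    control = [y for y, a in zip(outcomes, arms) if a == "control"]
    return statistics.mean(treated) - statistics.mean(control)

# Hypothetical data: six participants, arms already decided by (imagined) coin tosses.
outcomes = [5.0, 7.0, 6.0, 9.0, 4.0, 8.0]
arms = ["treatment", "control", "treatment", "control", "treatment", "control"]
print(difference_in_means(outcomes, arms))  # 5.0 - 8.0 = -3.0
```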

Some papers report a standard error for the marginal mean outcomes in treatment and control groups using the textbook formula

\(\displaystyle \frac{\mathit{SD_g}}{\sqrt{n_g}}\),

where \(\mathit{SD_g}\) is the standard deviation of group \(g\) and \(n_g\) the number of participants assigned to that group.
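A minimal Python sketch of the textbook formula, assuming \(\mathit{SD_g}\) is the sample standard deviation with an \(n_g - 1\) divisor (which is what `statistics.stdev` computes). The function name and outcome values are hypothetical.

```python
import math
import statistics

def textbook_sem(group_outcomes):
    """SD_g / sqrt(n_g), using the sample standard deviation (n_g - 1 divisor)."""
    n_g = len(group_outcomes)
    sd_g = statistics.stdev(group_outcomes)
    return sd_g / math.sqrt(n_g)

treatment_outcomes = [5.0, 6.0, 4.0]  # hypothetical values; SD is exactly 1
print(textbook_sem(treatment_outcomes))  # 1/sqrt(3), about 0.577
```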

This formula assumes a simple random sample with replacement from an infinite population, so it does not apply to a convenience sample (see Stephen Senn, A Standard Error). I am convinced, but I am curious which standard error, if any, would be appropriate for each group’s mean. (You could stop here and argue that the marginal group means mean nothing anyway. The whole point of running a trial is to subtract off non-treatment explanations of change such as regression to the mean.)

Let’s consider a two-arm RCT with no covariates and a coin toss determining who receives treatment or control. What standard error would be appropriate for the mean treatment outcome? Let the total sample size be \(N\), and let quantities for treatment and control carry the subscripts \(t\) and \(c\), respectively.

Treatment outcome mean of those who received treatment

If we focus on the mean for the \(n_t\) participants who were assigned to treatment, we have all observations for that group, so the standard error of the mean is 0. This feels like cheating.

Treatment outcome mean of everyone in the sample

Suppose we want to say something about the treatment outcome mean for all \(N\) participants in the trial, not only the \(n_t\) who were assigned to treatment.

To see how to think about this, consider a service evaluation of \(N\) patients that mimics everything about an RCT, except that everyone receives treatment and a coin toss determines whether each patient is included in the evaluation. This is now a survey of the \(n\) patients selected, rather than a trial, and we want to generalise its results to the finite population of \(N\) patients from which they were sampled.

Since the population is finite and the sampling is done without replacement, the standard error of the mean should be multiplied by a finite population correction,

\(\displaystyle \mathit{FPC} = \sqrt{\frac{N - n}{N - 1}}\).
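A small Python sketch of the correction (the function name `fpc` is just for illustration) shows its two limits: it is close to 1 when only a tiny fraction of the population is sampled, and exactly 0 when the whole population is sampled.

```python
import math

def fpc(N, n):
    """Finite population correction sqrt((N - n) / (N - 1)) for sampling n of N without replacement."""
    if N < 2 or not 1 <= n <= N:
        raise ValueError("need N >= 2 and 1 <= n <= N")
    return math.sqrt((N - n) / (N - 1))

print(fpc(100, 100))       # whole population sampled: 0.0
print(fpc(1_000_000, 50))  # tiny sampling fraction: very close to 1
print(fpc(100, 50))        # half sampled: about 0.71
```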

This survey setup is equivalent to what we observe in the treatment group of an RCT. Randomly assigning participants to treatment gives us a random sample, drawn without replacement, from a finite population whose sampling frame is known by the end of the trial: all treatment and control participants. So we can estimate the standard error of the mean (SEM) treatment outcome as:

\(\displaystyle \mathit{SEM_t} = \frac{\mathit{SD_t}}{\sqrt{n_t}} \sqrt{\frac{N - n_t}{N - 1}}\).
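The formula translates directly into Python. This is a sketch assuming \(\mathit{SD_t}\) is the sample standard deviation (\(n_t - 1\) divisor) of the treatment group; `sem_treatment` is a hypothetical helper name and the outcomes are made up.

```python
import math
import statistics

def sem_treatment(treatment_outcomes, N):
    """SEM_t = SD_t / sqrt(n_t) * sqrt((N - n_t) / (N - 1)).

    treatment_outcomes: outcomes of the n_t participants assigned to treatment.
    N: total number of trial participants (treatment plus control).
    """
    n_t = len(treatment_outcomes)
    if n_t < 2 or n_t > N:
        raise ValueError("need at least 2 treated participants and n_t <= N")
    sd_t = statistics.stdev(treatment_outcomes)  # assumption: sample SD, n_t - 1 divisor
    return sd_t / math.sqrt(n_t) * math.sqrt((N - n_t) / (N - 1))

# Hypothetical trial: 3 of 6 participants treated.
print(sem_treatment([5.0, 6.0, 4.0], N=6))  # sqrt(1/5), about 0.447
```

One design note: survey-sampling texts often pair the sample standard deviation with the correction \(\sqrt{(N - n_t)/N}\) and reserve \(\sqrt{(N - n_t)/(N - 1)}\) for the population standard deviation (divisor \(N\)); the two differ noticeably only when \(N\) is small.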

If, by chance (probability \(1/2^N\)), the coin assigns everyone to treatment, then \(n_t = N\), so the FPC is zero, and so is the standard error.
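To check the survey-twin argument numerically, the Python sketch below fixes hypothetical treatment outcomes for all \(N\) participants (in a real trial only the treated participants' outcomes would be observed), repeatedly re-randomises which \(n_t\) of them are treated, and compares the spread of the resulting treatment means with the formula. Two assumptions to flag: it holds \(n_t\) fixed on each draw (complete randomisation, i.e. coin tosses conditioned on the number treated), and it uses the population standard deviation (divisor \(N\)), for which the \((N - 1)\) form of the FPC is exact.

```python
import math
import random
import statistics

def simulate_treatment_means(population, n_t, reps, seed=0):
    """Repeatedly choose n_t of the N participants without replacement and record each group mean."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n_t)) for _ in range(reps)]

# Hypothetical treatment outcomes for every participant (made up for illustration).
rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(40)]
N, n_t = len(population), 15

means = simulate_treatment_means(population, n_t, reps=20_000)
empirical_se = statistics.pstdev(means)  # spread of the mean across re-randomisations

# Population SD (divisor N), for which sqrt((N - n_t) / (N - 1)) is the exact correction.
sigma = statistics.pstdev(population)
formula_se = sigma / math.sqrt(n_t) * math.sqrt((N - n_t) / (N - 1))
textbook_se = sigma / math.sqrt(n_t)  # same SD, no finite population correction

print(f"empirical SE over re-randomisations: {empirical_se:.3f}")
print(f"SE with FPC:                         {formula_se:.3f}")
print(f"SE without FPC:                      {textbook_se:.3f}")
```

The empirical standard error should land close to the FPC-adjusted value, while the uncorrected \(\sigma/\sqrt{n_t}\) overstates it by roughly a quarter for these settings (\(N = 40\), \(n_t = 15\)).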


If the marginal outcome means mean anything, then there are a couple of standard errors you could use, even with a convenience sample: zero, for the mean of those who received treatment, or the FPC-corrected SEM, for the treatment mean of everyone in the trial. But the marginal means seem irrelevant when the main reason for running an RCT is to subtract off non-treatment explanations of change following treatment.

If you enjoyed this, you may now be wondering what standard error to use when estimating a sample average treatment effect. Try Efficacy RCTs as survey twins.