Estimating causal effects with optimization-based methods

Cousineau et al. (2023) conducted a comparative analysis of seven optimisation-based methods for estimating causal effects, using 7700 datasets from the 2016 Atlantic Causal Inference competition. These datasets use real covariates with simulated treatment assignment and response functions, so it’s real-world data (kinda), but with the advantage that the true effect (here, sample average treatment effect; SATT) is known. See the supplementary material of Dorie et al.’s (2019) paper for more info on how the sims were setup.

The methods they compared were:

Method R package Function used
Approximate residual balancing (ARB) balanceHD 1.0 residualBalance.ate
Covariate balancing propensity score (CBPS) CBPS 0.21 CBPS
Entropy balancing (EBal) ebal 0.1–6 ebalance
Genetic matching (GenMatch) Matching 4.9–9 GenMatch
Kernel balancing (KBal) kbal 0.1 kbal
Stable balancing weights (SBW) sbw 1.1.1 sbw

I’m hearing entropy balancing discussed a lot, so had my eye on this.

Findings (N gives the number of datasets out of 7700 where SATT could be estimated):

    Bias   Time
Method N Mean SD RMSE Mean (sec)
kbal 7700 0.036 0.083 0.091 2521.3
balancehd 7700 0.041 0.099 0.107 2.0
sbw 4513 0.041 0.102 0.110 254.9
cbps_exact 7700 0.041 0.105 0.112 6.4
ebal 4513 0.041 0.110 0.117 0.2
cbps_over 7700 0.044 0.117 0.125 17.3
genmatch 7700 0.052 0.141 0.151 8282.4

Bias was the estimated SATT minus true SATT (i.e., the sign was kept; I’m not sure what to make of that when averaging across findings from multiple datasets, though the SD is safe). The root-mean-square error (RMSE) squares the bias from each estimate first, removing the sign, before averaging and square rooting, which seems easier to interpret.

Entropy balancing failed to find a solution for about 40% of them! Note, however:

“All these optimization-based methods are executed using their default parameters on R 4.0.2 to demonstrate their usefulness when directly used by an applied researcher” (emphasis added).

Maybe tweaking the settings would have improved the success rate. And #NotAllAppliedResearchers 🙂

Below is a comparison with a bunch of other methods from the competition where findings were already available (see Dorie et al., 2019, Table 2 and 3, for more info on each method).

    Bias   95% CI
Method N Mean SD RMSE coverage (%)
bart_on_pscore 7700 0.001 0.014 0.014 88.4
bart_tmle 7700 0.000 0.016 0.016 93.5
mbart_symint 7700 0.002 0.017 0.017 90.3
bart_mchains 7700 0.002 0.017 0.017 85.7
bart_xval 7700 0.002 0.017 0.017 81.2
bart 7700 0.002 0.018 0.018 81.1
sl_bart_tmle 7689 0.003 0.029 0.029 91.5
h2o_ensemble 6683 0.007 0.029 0.030 100.0
bart_iptw 7700 0.002 0.032 0.032 83.1
sl_tmle 7689 0.007 0.032 0.032 87.6
superlearner 7689 0.006 0.038 0.039 81.6
calcause 7694 0.003 0.043 0.043 81.7
tree_strat 7700 0.022 0.047 0.052 87.4
balanceboost 7700 0.020 0.050 0.054 80.5
adj_tree_strat 7700 0.027 0.068 0.074 60.0
lasso_cbps 7108 0.027 0.077 0.082 30.5
sl_tmle_joint 7698 0.010 0.101 0.102 58.9
cbps 7344 0.041 0.099 0.107 99.7
teffects_psmatch 7506 0.043 0.099 0.108 47.0
linear_model 7700 0.045 0.127 0.135 22.3
mhe_algorithm 7700 0.045 0.127 0.135 22.8
teffects_ra 7685 0.043 0.133 0.140 37.5
teffects_ipwra 7634 0.044 0.161 0.166 35.3
teffects_ipw 7665 0.042 0.298 0.301 39.0

I’ll leave you to read the original for commentary on this, but check out the RMSE and CI coverage. Linear model is summarised as “Linear model/ordinary least squares”. I assume covariates were just entered as main effects, which is a little unfair – the simulations included non-linearity and diagnostic checks on models, such as partial residual plots, would spot this. Still doesn’t do too badly – better than genetic matching! Interestingly the RMSE was a tiny bit worse for entropy balancing than for teffects_psmatch – a particular application of psmatch using propensity scores estimated using logistic regression on first order terms and matching by nearest neighbour.

No info is provided on CI coverage for the seven optimised-based methods they tested. This is why (Cousineau et al., 2023, p. 377):

“While some of these methods did provide some functions to estimate the confidence intervals (i.e., balancehd, sbw), these did not work due to the collinearity of the covariates. While it could be possible to obtain confidence intervals with bootstrapping for all methods, we did not pursue this avenue due to the computational resources that would be needed for some methods (e.g., kbal) and to the inferior results in Table 5 that did not warrant such resources.”


Cousineau, M., Verter, V., Murphy, S. A., & Pineau, J. (2023). Estimating causal effects with optimization-based methods: A review and empirical comparison. European Journal of Operational Research, 304(2), 367–380.

Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statistical Science, 34(1). 

Evaluating the arts

I love the arts, particularly bouncy, moderately cheesy dance music experienced in a club. Programme evaluation has been poking at the arts for a while now and attempting to quantify their impact on wellbeing. And I really wish evaluators would stop and think what they’re doing before rushing in with WEMWBS. The experience I have dancing in a sweaty club is very different to the experience wandering around Tate or crying in a cinema. Programme effects are contrasts: the actual outcome of the programme versus an estimate of what the outcome would have been in the programme’s absence, e.g., following some genre of “business as usual”. Before we can measure the average causal benefits of the arts we need a sensible theory of change, and some ideas about what the counterfactual is. Maybe you can find out how I feel dancing to Free Yourself, but what’s the contrast? Dancing to something else? Staying in with a cup of tea and chocolate, reading a book about standard errors for matching with replacement? What exactly is the programme being evaluated…? (End of random thoughts.)

Dealing with confounding in observational studies

Excellent review of simulation-based evaluations of quasi-experimental methods, by Varga et al. (2022). Also lovely annexes summarising the methods’ assumptions.

Methods for measured confounding the authors cover (Varga et al., 2022, Table A1):

Method Description of the method
PS matching (N = 47) Treated and untreated individuals are matched based on their propensity score-similarity. After creating comparable groups of treated and untreated individuals the effect of the treatment can be estimated.
IPTW (N = 30) With the help of re-weighting by the inverse probability of receiving the treatment, a synthetic sample is created which is representative of the population and in which treatment assignment is independent of the observed baseline covariates. Over-represented groups are downweighted and underrepresented groups are upweighted.
Overlap weights (N = 4) Overlap weights were developed to overcome the limitations of truncation and trimming for IPTW, when some individual PSs approach 0 or 1.
Matching weights (N = 2) Matching weights is an analogue weighting method for IPTW, when some individual PSs approach 0 or 1.
Covariate adjustment using PS (N = 13) The estimated PS is included as covariate in a regression model of the treatment.
PS stratification (N = 26) First the subjects are grouped into strata based upon their PS. Then, the treatment effect is estimated within each PS stratum, and the ATE is computed as a weighted mean of the stratum specific estimates.
GAM (N = 1) GAMs provide an alternative for traditional PS estimation by replacing the linear component of a logistic regression with a flexible additive function.
GBM (N = 3) GBM trees provide an alternative for traditional PS estimation by estimating the function of covariates in a more flexible manner than logistic regression by averaging the PSs of small regression trees.
Genetic matching (N = 7) This matching method algorithmically optimizes covariate balance and avoids the process of iteratively modifying the PS model.
Covariate-balancing PS (N = 5) Models treatment assignment while optimizing the covariate balance. The method exploits the dual characteristics of the PS as a covariate balancing score and the conditional probability of treatment assignment.
DR estimation (N = 13) Combines outcome regression with with a model for the treatment (eg, weighting by the PS) such that the effect estimator is robust to misspecification of one (but not both) of these models.
AIPTW (N = 8) This estimator achieves the doubly-robust property by combining outcome regression with weighting by the PS.
Stratified DR estimator (N = 1) Hybrid DR method of outcome regression with PS weighting and stratification.
TMLE (N = 2) Semi-parametric double-robust method that allows for flexible estimation using (nonparametric) machine-learning methods.
Collaborative TMLE (N = 1) Data-adaptive estimation method for TMLE.
One step joint Bayesian PS (N = 3) Jointly estimates quantities in the PS and outcome stages.
Two-step Bayesian approach (N = 2) A two-step modeling method is using the Bayesian PS model in the first step, followed by a Bayesian outcome model in the second step.
Bayesian model averaging (N = 1) Fully Bayesian model averaging approach.
An’s intermediate approach (N = 2) Not fully Bayesian insofar as the outcome equation in An’s approach is frequentist.
G-computation (N = 4) The method interprets counterfactual outcomes as missing data and uses a prediction model to obtain potential outcomes under different treatment scenarios. The entire set of predicted outcomes is then regressed on the treatment to obtain the coefficient of the effect estimate.
Prognostic scores (N = 7) Prognostic scores are considered to be the prognostic analog of the PS methods. the prognostic score includes covariates based on their predictive power of the response, the PS includes covariates that predict treatment assignment.

Methods for unmeasured confounding (Varga et al., 2022, Table A2):

Method Description of the method
IV approach (N = 17) Post-randomization can be achieved using a sufficiently strong instrument. IV is correlated with the treatment and only affects the outcome through the treatment.
2SLS (N = 11) Linear estimator of the IV method. Uses linear probability for binary outcome and linear regression for continuous outcome.
2SPS (N = 5) Non-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of 2SPS procedure. The predicted or residual values from the first stage logistic regression of treatment on the IV are used as covariates in the second stage logistic regression: the predicted value of treatment replaces the observed treatment for 2SPS.
2SRI (N = 8) Semi-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of the 2SRI procedure. The predicted or residual values from the first stage logistic regression of treatment on the IV are used as covariates in the second stage logistic regression.
IV based on generalized structural mean model (GSMM) (N = 1) Semi-parametric models that use instrumental variables to identify causal parameters. IV approach
Instrumental PS (Matching enhanced IV) (N = 2) Reduces the dimensionality of the measured confounders, but it also deals with unmeasured confounders by the use of an IV.
DiD (N = 7) DiD method uses the assumption that without the treatment the average outcomes for the treated and control groups would have followed parallel trends over time. The design measures the effect of a treatment as the relative change in the outcomes between individuals in the treatment and control groups over time.
Matching combined with DiD (N = 6) Alternative approach to DiD. (2) Uses matching to balance the treatment and control groups according to pre-treatment outcomes and covariates
SCM (N = 7) This method constructs a comparator, the synthetic control, as a weighted average of the available control individuals. The weights are chosen to ensure that, prior to the treatment, levels of covariates and outcomes are similar over time to those of the treated unit.
Imperfect SCM (N = 1) Extension of SCM method with relaxed assumptions that allow outcomes to be functions of transitory shocks.
Generalized SCM (N = 2) Combines SC with fixed effects.
Synthetic DiD (N = 1) Both unit and time fixed effects, which can be interpreted as the time-weighted version of DiD.
LDV regression approach (N = 1) Adjusts for pre-treatment outcomes and covariates with a parametric regression model. Alternative approach to DiD.
Trend-in-trend (N = 1) The trend-in-trend design examines time trends in outcome as a function of time trends in treatment across strata with different time trends in treatment.
PERR (N = 3) PERR adjustment is a type of self-controlled design in which the treatment effect is estimated by the ratio of two rate ratios (RRs): RR after initiation of treatment and the RR prior to initiation of treatment.
PS calibration (N = 1) Combines PS and regression calibration to address confounding by variables unobserved in the main study by using variables observed in a validation study.
RD (N = 4) Method used for policy analysis. People slightly below and above the threshold for being exposed to a treatment are compared.


Varga, A. N., Guevara Morel, A. E., Lokkerbol, J., van Dongen, J. M., van Tulder, M. W., & Bosmans, J. E. (2022). Dealing with confounding in observational studies: A scoping review of methods evaluated in simulation studies with single‐point exposure. Statistics in Medicine.

Different ways to attain the same average treatment effect

Fun draft paper by Andrew Gelman, looking at different patterns of causal effects holding the average treatment effect (ATE) at 0.1 – part-inspired by Anscombe’s (1973) correlation quartet. Each graph shows a correlation between a hypothetical covariate, such as baseline symptom severity, and treatment effect. All four patterns are compatible with the ATE of 0.1.

Understanding causal estimands like ATE and ATT

Photo by Susanne Jutzeler

Social policy and programme evaluations often report findings in terms of casual estimands such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT or ATET). An estimand is a quantity we are trying to estimate – but what exactly does that mean? This post explains through simple examples.

Suppose a study has two conditions, treat (=1) and control (=0). Causal estimands are defined in terms of potential outcomes: the outcome if someone had been assigned to treatment, \(Y(1)\), and outcome if someone had been assigned to control, \(Y(0)\).

We only get to see one of those two realised, depending on which condition someone was actually assigned to. The other is a counterfactual outcome. Assume, for a moment, that you are omniscient and can observe both potential outcomes. The treatment effect (TE) for an individual is \(Y(1)-Y(0)\) and, since you are omniscient, you can see it for everyone.

Here is a table of potential outcomes and treatment effects for 10 fictional study participants. A higher score represents a better outcome.

Person Condition Y(0) Y(1) TE
1 1 0 7 7
2 0 3 0 -3
3 1 2 9 7
4 1 1 8 7
5 0 4 1 -3
6 1 3 10 7
7 0 4 1 -3
8 0 8 5 -3
9 0 7 4 -3
10 1 3 10 7

Note the pattern in the table. People who were assigned to treatment have a treatment effect of \(7\) and people who were assigned to control have a treatment effect of \(-3\), i.e., if they had been assigned to treatment, their outcome would have been worse. So everyone in this fictional study was lucky: they were assigned to the condition that led to the best outcome they could have had.

The average treatment effect (ATE) is simply the average of treatment effects: 

\(\displaystyle \frac{7 + -3 + 7 + 7 + -3 + 7 + -3 + -3 + -3 + 7}{10}=2\)

The average treatment effect on the treated (ATT or ATET) is the average of treatment effects for people who were assigned to the treatment:

\(\displaystyle \frac{7 + 7 + 7 + 7 + 7}{5}=7\)

The average treatment effect on control (ATC) is the average of treatment effects for people who were assigned to control:

\(\displaystyle \frac{-3 + -3 + -3 + -3 + -3}{5}=-3\)

Alas we aren’t really omniscient, so in reality see a table like this:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 9 ?
4 1 ? 8 ?
5 0 4 ? ?
6 1 ? 10 ?
7 0 4 ? ?
8 0 8 ? ?
9 0 7 ? ?
10 1 ? 10 ?

This table highlights the fundamental problem of causal inference and why it is sometimes seen as a missing data problem.

Don’t confuse estimands and methods for estimation

One of the barriers to understanding these estimands is that we are used to taking a between-participant difference in group means to estimate the average effect of a treatment. But the estmands are defined in terms of a within-participant difference between two potential outcomes, only one of which is observed.

The causal effect is a theoretical quantity defined for individual people and it cannot be directly measured.

Here is another example where the causal effect is zero for everyone, so ATT, ATE, and ATC are all zero too:

Person Condition Y(0) Y(1) TE
1 1 7 7 0
2 0 3 3 0
3 1 7 7 0
4 1 7 7 0
5 0 3 3 0
6 1 7 7 0
7 0 3 3 0
8 0 3 3 0
9 0 3 3 0
10 1 7 7 0

However, people have been assigned to treatment and control in such a way that, given the outcomes realised, it appears that treatment is better than control. Here is the table again, this time with observations we couldn’t observe removed:

Person Condition Y(0) Y(1) CE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 7 ?
4 1 ? 7 ?
5 0 3 ? ?
6 1 ? 7 ?
7 0 3 ? ?
8 0 3 ? ?
9 0 3 ? ?
10 1 ? 7 ?

So, if we take the average of realised treatment outcomes we get 7 and the average of realised control outcomes we get 3. The mean difference is then 4. This estimate is biased. The correct answer is zero, but we couldn’t tell from the available data.

The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. For other estimators that don’t require random treatment assignment and for other estimands, try Scott Cunningham’s Causal Inference: The Mixtape.

How do you choose between ATE, ATT, and ATC?

Firstly, if you are running a randomised controlled trial, you don’t choose: ATE, ATT, and ATC will be the same. This is because, on average across trials, the characteristics of those who were assigned to treatment or control will be the same.

So the distinction between these three estimands only matters for quasi-experimental studies, for example where treatment assignment is not under the control of the researcher.

Noah Greifer and Elizabeth Stuart offer a neat set of example research questions to help decide (here lightly edited to make them less medical):

  • ATT: should an intervention currently being offered continue to be offered or should it be withheld?
  • ATC: should an intervention be extended to people who don’t currently receive it?
  • ATE: should an intervention be offered to everyone who is eligible?

How does intention to treat fit in?

The distinction between ATE and ATT is unrelated to the distinction between intention to treat and per-protocol analyses. Intention to treat analysis means we analyse people according to the group they were assigned to, even if they didn’t comply, e.g., by not engaging with the treatment. Per-protocol analysis is a biased analysis that only analyses data from participants who did comply and is generally not recommended.

For instance, it is possible to conduct a quasi-experimental study that uses intention to treat and estimates the average treatment effect on the treated. In this case, ATT might be better called something like average treatment effect for those we intended to treat (ATETWITT). Sadly this term hasn’t yet been used in the literature.


Causal effects are defined in terms of potential outcomes following treatment and following control. Only one potential outcome is observed, depending on whether someone was assigned to treatment or control, so causal effects cannot be directly observed. The fields of statistics and causal inference find ways to estimate these estimands using observable data. The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. Quasi-experimental designs allow the estimation of additional estimands: ATT and ATC.

Kharkiv, statistics, and causal inference

As news comes in (14 May 2022) that Ukraine has won the battle of Kharkiv* and Russian troops are withdrawing, it may be of interest to know that a major figure in statistics and causal inference, Jerzy Neyman (1894-1981), trained as a mathematician there 1912-16. If you have ever used a confidence interval or conceptualised causal inference in terms of potential outcomes, then you owe him a debt of gratitude.

“[Neyman] was educated as a mathematician at the University of Kharkov*, 1912-16. After this he became a Lecturer at the Kharkov Institute of Technology with the title of Candidate. When speaking of these years he always stressed his debt to Sergei Bernstein, and his friendship with Otto Struve (later to meet him again in Berkeley). His thesis was entitled ‘Integral of Lebesgue’.” (Kendall et al., 1982)

* Харків (transliterated to Kharkiv) in Ukrainian, Харькoв (transliterated to Kharkov) in Russian.

Efficacy RCTs as survey twins

Surveys attempt to estimate a quantity of a finite population using a probability sample from that population. How people ended up in the population is somebody else’s problem – demographers, perhaps.

Survey participants are sampled at random from this finite population without replacement. Part a of the figure below illustrates. Green blocks denote people who are surveyed and from whom we collect data. Grey blocks denote people we have not surveyed; we would like to infer what their responses would have been, if they had they been surveyed too.

RCTs randomly assign participants to treatment or control conditions. This is illustrated in part b of the figure above: green cells denote treatment and purple cells denote control. There are no grey cells since we have gathered information from everyone in the finite population. But in a way, we haven’t really.

An alternative way to view efficacy RCTs that aim to estimate a sample average treatment effect (SATE) is as a kind of survey. This illustrated in part c. Now the grey cells return.

There is a finite population of people who present for a trial, often with little known about how they ended up in that population – not dissimilarly to the situation for a survey. (But who studies how they end up in a trial – trial demographers?)

Randomly assigning people to conditions generates two finite populations of theoretical twins, identical except for treatment assignment and the consequences thereafter. One theoretical twin receives treatment and the other receives control. But we only obtain the response from one of the twins, i.e., either the treatment or the control twin. (You could also think of these theoretical twins’ outcomes as potential outcomes.)

Looking individually at one of the two theoretical populations, the random assignment to conditions has generated a random sample from that population. We really want to know what the outcome would have been for everyone in the treatment condition, if everyone had been assigned treatment. Similarly for control. Alas, we have to make do with a pair of surveys that sample from these two populations.

Viewing the Table 1 fallacy through the survey twin lens

There is a common practice of testing for differences in covariates between treatment and control. This is the Table 1 fallacy (see also Dean Eckles’s take on whether it really is a fallacy). Let’s see how it can be explained using survey twins.

Firstly, we have a census of covariates for the whole finite population at baseline, so we know with perfect precision what the means are. Treatment and control groups are surveys of the same population, so clearly no statistical test is needed. The sample means in both groups are likely to be different from each other and from the finite population mean of both groups combined. No surprises there: we wouldn’t expect a survey mean to be identical to the population mean. That’s why we use confidence intervals or large samples so that the confidence intervals are very narrow.

What’s the correct analysis of an RCT?

It’s common to analyse RCT data using a linear regression model. The outcome variable is the endpoint, predictors are treatment group and covariates. This is also known as an ANCOVA. This analysis is easy to understand if the trial participants are a simple random sample from some infinite population. But this is not what we have in efficacy trials as modelled by survey twins above. If the total number of participants in the trial is 1000, then we have a finite population of 1000 in the treatment group and a finite population of 1000 in the control group – together, 2000. In total we have 1000 observations, though, split in some proportion between treatment and control.

Following through on this reasoning, it sounds like the correct analysis uses a stratified independent sampling design with two strata, coinciding with treatment and control groups. The strata populations are both 1000, and a finite population correction should be applied accordingly.

It’s a little more complicated, as I discovered in a paper by Reichardt and Gollob (1999), who independently derived results found by Neyman (1923/1990). Their results highlight a wrinkle in the argument when conducting a t-test on two groups for finite populations as described above. This has general implications for analyses with covariates too. The wrinkle is, the two theoretical populations are not independent of each other.

The authors derive the standard error of the mean difference between X and Y as

\(\displaystyle \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}-\left[ \frac{(\sigma_X-\sigma_Y)^2}{N} + \frac{2(1-\rho) \sigma_X \sigma_{Y}}{N} \right]}\),

where \(\sigma_X^2\) and \(\sigma_Y^2\) are the variances of the two groups, \(n_X\) and \(n_Y\) are the observed group sample sizes, and \(N\) is the total sample (the finite population) size. Finally, \(\rho\) is the unobservable correlation between treat and control outcomes for each participant – unobservable because we only get either the treatment outcome or control outcome for each participant and not both. The terms in square brackets correct for the finite population.

If the variances are equal (\(\sigma_X = \sigma_Y\)) and the correlation \(\rho = 1\), then the correction vanishes (glance back at numerators in the square brackets to see). This is great news if you are willing to assume that treatments have constant effects on all participants (an assumption known as unit-treatment additivity): the same regression analysis that you would use assuming a simple random sample from an infinite population applies.

If the variances are equal and the correlation is 0, then this is the same standard error as in the stratified independent sampling design with two strata described above. Or at least it was for the few examples I tried.

If the variances can be different and the correlation is one, then this is the same standard error as per Welch’s two-sample t-test.

So, which correlation should we use? Reichardt and Gollob (1999) suggest using the reliability of the outcome measure to calculate an upper bound on the correlation. More recently, Aronow, Green, and Lee (2014) proved a result that puts bounds on the correlation based on the observed marginal distribution of outcomes, and provide R code to copy and paste to calculate it. It’s interesting that a problem highlighted a century ago on something so basic – what standard error we should use for an RCT – is still being investigated now.


Aronow, P. M., Green, D. P., & Lee, D. K. K. (2014). Sharp bounds on the variance in randomized experiments. Annals of Statistics, 42, 850–871.

Neyman, J. (1923/1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5, 465-472.

Reichardt, C. S., & Gollob, H. F. (1999). Justifying the Use and Increasing the Power of a t Test for a Randomized Experiment With a Convenience Sample. Psychological Methods, 4, 117–128.


Standard errors of marginal means in an RCT

Randomised controlled trials (RCTs) typically use a convenience sample to estimate the mean effect of a treatment for study participants. Participants are randomly assigned to one of (say) two conditions, and an unbiased estimate of the sample mean treatment effect is obtained by taking the difference of the two conditions’ mean outcomes. The estimand in such an RCT is sometimes called the sample average treatment effect (SATE).

Some papers report a standard error for the marginal mean outcomes in treatment and control groups using the textbook formula

\(\displaystyle \frac{\mathit{SD_g}}{\sqrt{n_g}}\),

where \(\mathit{SD_g}\) is the standard deviation of outcomes in group \(g\) and \(n_g\) the number of observations in that group.

This formula assumes a simple random sample with replacement from an infinite population, so does not work for a convenience sample (see Stephen Senn, A Standard Error). I am convinced, but curious what standard error for each group’s mean would be appropriate, if any. (You could stop here and argue that the marginal group means mean nothing anyway. The whole point of running a trial is to subtract off non-treatment explanations of change such as regression to the mean.)

Let’s consider a two-arm RCT with no covariates and a coin toss determining who receives treatment or control. What standard error would be appropriate for the mean treatment outcome? Let the total sample size be \(N\) and quantities for treatment and control use subscripts \(t\) and \(c\), respectively.

Treatment outcome mean of those who received treatment

If we focus on the mean for the \(n_t\) participants who were assigned to treatment, we have all observations for that group, so the standard error of the mean is 0. This feels like cheating.

Treatment outcome mean of everyone in the sample

Suppose we want to say something about the treatment outcome mean for all \(N\) participants in the trial, not only the \(n_t\) who were assigned to treatment.

To see how to think about this, consider a service evaluation of \(N\) patients mimicking everything about an RCT except that it assigns everyone to treatment and uses a coin toss to determine whether someone is included in the evaluation. This is now a survey of \(n\) participants, rather than a trial. We want to generalise results to the finite \(N\) from which we sampled.

Since the population is finite and the sampling is done without replacement, the standard error of the mean should be multiplied by a finite population correction,

\(\displaystyle \mathit{FPC} = \sqrt{\frac{N-n}{N-1}}\).

This setup for a survey is equivalent to what we observe in the treatment group of an RCT. Randomly assigning participants to treatment gives us a random sample from a finite population, the sample frame of which we get by the end of the trial: all treatment and control participants. So we can estimate the SEM around the mean treatment outcome as:

\(\displaystyle \mathit{SEM_t} = \frac{\mathit{SD_t}}{\sqrt{n_t}} \sqrt{\frac{N-n_t}{N-1}}\).

If, by chance (probability \(1/2^N\)), the coin delivers everyone to treatment, then \(N = n_t\) and the FPC reduces to zero, as does the standard error.


If the marginal outcome means mean anything, then there are a couple of standard errors you could use, even with a convenience sample. But the marginal means seem irrelevant when the main reason for running an RCT is to subtract off non-treatment explanations of change following treatment.

If you enjoyed this, you may now be wondering what standard error to use when estimating a sample average treatment effect. Try Efficacy RCTs as survey twins.

Sample size determination for propensity score weighting

If you’re using propensity score weighting (e.g., inverse probability weighting), one question that will arise is how big a sample you need.

Solutions have been proposed that rely on a variance inflation factor (VIF). You calculate the sample size for a simple design and then multiply that by the VIF to take account of weighting.

But the problem is that it is difficult to choose a VIF in advance.

Austin (2021) has developed a simple method (R code in the paper) to estimate VIFs from c-statistics (area under the curve; AOC) of the propensity score models. These c-statistics are often published.

A larger c-statistic means a greater separation between treatment and control, which in turn leads to a larger VIF and requirement for a larger sample.

Picture illustrating different c-statistics.

The magnitude of the VIF also depends on the estimand of interest, e.g., whether average treatment effect (ATE), average treatment effect on the treated (ATET/ATT), or average treatment effect where treat and control overlap (ATO).


Austin, P. C. (2021). Informing power and sample size calculations when using inverse probability of treatment weighting using the propensity score. Statistics in Medicine.

ACME: average causal mediation effect

Suppose there are two groups in a study: treatment and control. There are two potential outcomes for an individual, \(i\): outcome under treatment, \(Y_i(1)\), and outcome under control, \(Y_i(0)\). Only one of the two potential outcomes can be realised and observed as \(Y_i\).

The treatment effect for an individual is defined as the difference in potential outcomes for that individual:

\(\mathit{TE}_i = Y_i(1) – Y_i(0)\).

Since we cannot observe both potential outcomes for any individual, we usually we make do with a sample or population average treatment effect (SATE and PATE). Although these are unobservable (they are the averages of unobservable differences in potential outcomes), they can be estimated. For example, with random treatment assignment, the difference in observed sample mean outcomes for the treatment and control is an unbiased estimator of SATE. If we also have a random sample from the population of interest, then this difference in sample means gives us an unbiased estimate of PATE.

Okay, so what happens if we add a mediator? The potential outcome is expanded to depend on both treatment group and mediator value.

Let \(Y_i(t, m)\) denote the potential outcome for \(i\) under treatment \(t\) and with mediator value \(m\).

Let \(M_i(t)\) denote the potential value of the mediator under treatment \(t\).

The (total) treatment effect is now:

\(\mathit{TE}_i = Y_i(1, M_i(1)) – Y_i(0, M_i(0))\).

Informally, the idea here is that we calculate the potential outcome under treatment, with the mediator value as it is under treatment, and subtract from that the potential outcome under control with the mediator value as it is under control.

The causal mediation effect (CME) is what we get when we hold the treatment assignment constant, but work out the difference in potential outcomes when the mediators are set to values they have under treatment and control:

\(\mathit{CME}_i(t) = Y_i(t, M_i(1)) – Y_i(t, M_i(0))\)

The direct effect (DE) holds the mediator constant and varies treatment:

\(\mathit{DE}_i(t) = Y_i(1, M_i(t)) – Y_i(0, M_i(t))\)

Note how both CME and DE depend on the treatment group. If there is no interaction between treatment and mediator, then

\(\mathit{CME}_i(0) = \mathit{CME}_i(1) = \mathit{CME}\)


\(\mathit{DE}_i(0) = \mathit{DE}_i(1) = \mathit{DE}\).

ACME and ADE are the averages of these effects. Again, since they are defined in terms of potential values (of outcome and mediator), they cannot be directly observed, but – given some assumptions – there are estimators.

Baron and Kenny (1986) provide an estimator in terms of regression equations. I’ll focus on two of their steps and assume there is no need to adjust for any covariates. I’ll also assume that there is no interaction between treatment and moderator.

First, regress the mediator (\(m\)) on the binary treatment indicator (\(t\)):

\(m = \alpha_1 + \beta_1 t\).

The slope \(\beta_1\) tells us how much the mediator changes between the two treatment conditions on average.

Second, regress the outcome (\(y\)) on both mediator and treatment indicator:

\(y = \alpha_2 + \beta_2 t + \beta_3 m\).

The slope \(\beta_2\) provides the average direct effect (ADE), since this model holds the mediator constant (note how this mirrors the definition of DE in terms of potential outcomes).

Now to work out the average causal mediation effect (ACME), we need to wiggle the outcome by however much the mediator moves between treat and control, whilst holding the treatment group constant. Slope \(\beta_1\) tells us how much the treatment shifts the mediator. Slope \(\beta_3\) tells us how much the outcome increases for every unit increase in the mediator, holding treatment constant. So \(\beta_1 \beta_3\) is ACME.

For more, especially on complicating the Baron and Kenny approach, see Imai et al. (2010).


Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173–1182.

Imai, K., Keele, L., & Yamamoto, T. (2010). Identification, Inference and Sensitivity Analysis for Causal Mediation Effects. Statistical Science, 25, 51–71.