Estimating causal effects with optimization-based methods

Cousineau et al. (2023) conducted a comparative analysis of seven optimisation-based methods for estimating causal effects, using 7700 datasets from the 2016 Atlantic Causal Inference competition. These datasets use real covariates with simulated treatment assignment and response functions, so it’s real-world data (kinda), but with the advantage that the true effect (here, sample average treatment effect; SATT) is known. See the supplementary material of Dorie et al.’s (2019) paper for more info on how the sims were setup.

The methods they compared were:

Method R package Function used
Approximate residual balancing (ARB) balanceHD 1.0 residualBalance.ate
Covariate balancing propensity score (CBPS) CBPS 0.21 CBPS
Entropy balancing (EBal) ebal 0.1–6 ebalance
Genetic matching (GenMatch) Matching 4.9–9 GenMatch
Kernel balancing (KBal) kbal 0.1 kbal
Stable balancing weights (SBW) sbw 1.1.1 sbw

I’m hearing entropy balancing discussed a lot, so had my eye on this.

Findings (N gives the number of datasets out of 7700 where SATT could be estimated):

    Bias   Time
Method N Mean SD RMSE Mean (sec)
kbal 7700 0.036 0.083 0.091 2521.3
balancehd 7700 0.041 0.099 0.107 2.0
sbw 4513 0.041 0.102 0.110 254.9
cbps_exact 7700 0.041 0.105 0.112 6.4
ebal 4513 0.041 0.110 0.117 0.2
cbps_over 7700 0.044 0.117 0.125 17.3
genmatch 7700 0.052 0.141 0.151 8282.4

Bias was the estimated SATT minus true SATT (i.e., the sign was kept; I’m not sure what to make of that when averaging across findings from multiple datasets, though the SD is safe). The root-mean-square error (RMSE) squares the bias from each estimate first, removing the sign, before averaging and square rooting, which seems easier to interpret.

Entropy balancing failed to find a solution for about 40% of them! Note, however:

“All these optimization-based methods are executed using their default parameters on R 4.0.2 to demonstrate their usefulness when directly used by an applied researcher” (emphasis added).

Maybe tweaking the settings would have improved the success rate. And #NotAllAppliedResearchers 🙂

Below is a comparison with a bunch of other methods from the competition where findings were already available (see Dorie et al., 2019, Table 2 and 3, for more info on each method).

    Bias   95% CI
Method N Mean SD RMSE coverage (%)
bart_on_pscore 7700 0.001 0.014 0.014 88.4
bart_tmle 7700 0.000 0.016 0.016 93.5
mbart_symint 7700 0.002 0.017 0.017 90.3
bart_mchains 7700 0.002 0.017 0.017 85.7
bart_xval 7700 0.002 0.017 0.017 81.2
bart 7700 0.002 0.018 0.018 81.1
sl_bart_tmle 7689 0.003 0.029 0.029 91.5
h2o_ensemble 6683 0.007 0.029 0.030 100.0
bart_iptw 7700 0.002 0.032 0.032 83.1
sl_tmle 7689 0.007 0.032 0.032 87.6
superlearner 7689 0.006 0.038 0.039 81.6
calcause 7694 0.003 0.043 0.043 81.7
tree_strat 7700 0.022 0.047 0.052 87.4
balanceboost 7700 0.020 0.050 0.054 80.5
adj_tree_strat 7700 0.027 0.068 0.074 60.0
lasso_cbps 7108 0.027 0.077 0.082 30.5
sl_tmle_joint 7698 0.010 0.101 0.102 58.9
cbps 7344 0.041 0.099 0.107 99.7
teffects_psmatch 7506 0.043 0.099 0.108 47.0
linear_model 7700 0.045 0.127 0.135 22.3
mhe_algorithm 7700 0.045 0.127 0.135 22.8
teffects_ra 7685 0.043 0.133 0.140 37.5
teffects_ipwra 7634 0.044 0.161 0.166 35.3
teffects_ipw 7665 0.042 0.298 0.301 39.0

I’ll leave you to read the original for commentary on this, but check out the RMSE and CI coverage. Linear model is summarised as “Linear model/ordinary least squares”. I assume covariates were just entered as main effects, which is a little unfair – the simulations included non-linearity and diagnostic checks on models, such as partial residual plots, would spot this. Still doesn’t do too badly – better than genetic matching! Interestingly the RMSE was a tiny bit worse for entropy balancing than for teffects_psmatch – a particular application of psmatch using propensity scores estimated using logistic regression on first order terms and matching by nearest neighbour.

No info is provided on CI coverage for the seven optimised-based methods they tested. This is why (Cousineau et al., 2023, p. 377):

“While some of these methods did provide some functions to estimate the confidence intervals (i.e., balancehd, sbw), these did not work due to the collinearity of the covariates. While it could be possible to obtain confidence intervals with bootstrapping for all methods, we did not pursue this avenue due to the computational resources that would be needed for some methods (e.g., kbal) and to the inferior results in Table 5 that did not warrant such resources.”


Cousineau, M., Verter, V., Murphy, S. A., & Pineau, J. (2023). Estimating causal effects with optimization-based methods: A review and empirical comparison. European Journal of Operational Research, 304(2), 367–380.

Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statistical Science, 34(1). 

A cynical view of SEMs

It is all too common for a box and arrow diagram to be cobbled together in an afternoon and christened a “theory of change”. One formalised version of such a diagram is a structural equation model (SEM), the arrows of which are annotated with coefficients estimated using data. Here is John Fox (2002) on SEM and informal boxology:

“A cynical view of SEMs is that their popularity in the social sciences reflects the legitimacy that the models appear to lend to causal interpretation of observational data, when in fact such interpretation is no less problematic than for other kinds of regression models applied to observational data. A more charitable interpretation is that SEMs are close to the kind of informal thinking about causal relationships that is common in social-science theorizing, and that, therefore, these models facilitate translating such theories into data analysis.”


Fox, J. (2002). Structural Equation Models: Appendix to An R and S-PLUS Companion to Applied Regression. Last corrected 2006.

Beautiful friendships have been jeopardised

This is an amusing opening to a paper on face validity, by Mosier (1947):

“Face validity is a term that is bandied about in the field of test construction until it seems about to become a part of accepted terminology. The frequency of its use and the emotional reaction which it arouses-ranging almost from contempt to highest approbation-make it desirable to examine its meaning more closely. When a single term variously conveys high praise or strong condemnation, one suspects either ambiguity of meaning or contradictory postulates among those using the term. The tendency has been, I believe, to assume unaccepted premises rather than ambiguity, and beautiful friendships have been jeopardized when a chance remark about face validity has classed the speaker among the infidels.”

I think dozens of beautiful friendships have been jeopardized by loose talk about randomised controlled trials, theory-based evaluation, realism, and positivism, among many others. I’ve just seen yet another piece arguing that you wouldn’t evaluate a parachute with an RCT and I can’t even.


Mosier, C. I. (1947). A Critical Examination of the Concepts of Face Validity. Educational and Psychological Measurement, 7(2), 191–205.

Do “Growth Mindset” interventions improve students’ academic attainment?

“We conducted a systematic review and multiple meta-analyses of the growth mindset intervention literature. Our goal was to answer two questions: (a) Do growth mindset interventions generally improve students’ academic achievement? and (b) Are growth mindset intervention effects due to instilling growth mindsets in students or are apparent effects due to shortcomings in study designs, analyses, and reporting? To answer these questions, we systematically reviewed the literature and conducted multiple meta-analyses imposing varying degrees of quality control. Our results indicated that apparent effects of growth mindset interventions are possibly due to inadequate study designs, reporting flaws, and bias. In particular, the systematic review yielded several concerning patterns of threats to internal validity.”

Here’s a pic:

Counterfactual evaluation

Consider the following two sentences:

(1) Alex’s train left 2 minutes before they arrived at the platform.

(2) If Alex had arrived at the platform 10 minutes earlier, then they probably would have caught their train.

Is the counterfactual in sentence 2 true or false, or can’t you tell because you didn’t run an RCT?

I reckoned that the counterfactual is true. I reasoned that Alex probably missed the train because they were late, so turning up earlier would have fixed that.

I could think of other possible outcomes, but they became increasingly contrived and far from the (albeit minimal) evidence provided. For instance, it is conceivable that if Alex arrived earlier, they would have believed they had time to pop to Pret for a coffee – and missed the train again.

The nature and significance of social ontology

This looks fun in Synthese, by Francesco Guala and Frank Hindriks:

“We have proposed an ecumenical methodology by distinguishing three ways in which [social ontology] can and should interact with other disciplines and perspectives. First, social ontology can forge connections between social scientific theories by unifying them. Second, it can preserve part of the manifest image by scrutinizing it from the perspective of the scientific image, integrating the two and thereby achieving consilience. Third, it can fruitfully interact with a range of philosophical disciplines that engage in normative theorizing. We hope that our bridge-builder conception of social ontology will contribute to an even more fecund way of practicing social ontology.”

Sheffield Elicitation Framework (SHELF) tools

There’s a long history of trying to extract experts’ beliefs about probability distributions when there is no data to estimate the distributions directly. The MATCH Uncertainty Elicitation Tool (Morris et al., 2014) offers five methods.

The roulette method is the most intuitive to me. You are provided with a blank histogram, with range and number of cells of your choosing. As you click, MATCH guesses the distribution using a least-squares procedure, choosing between normal, Student’s t, scaled beta, gamma, log normal, and log Student’s t. You can also override its guess. This then means you can look up the quantiles and use those to influence your clicks, e.g., if the median or extreme quantiles are off what you believe, you can add or remove cells to drag the quantiles where you think they should be.

Here’s an example for the range 0 to 10 and with grid height 10 and 20 bins:

There are a couple of methods that ask for quantiles (either quartile or tertile). Another that asks for three probabilities, where you can choose the parameters. The default probabilities requested are \(P(0 < X < 0.25)\), \(P(0.75 < X < 1)\) and \(P(0 < X < 0.5)\), when \(X \in [0,1]\). Finally, there’s a hybrid option which requests the median and two probabilities. This latter option also feels intuitive to work with, particularly again when looking at the fitted distribution and peeking at quantiles.

There’s also the SHELF R Package, which includes a bunch of Shiny apps that are available directly on the web too.


Morris, D. E., Oakley, J. E., & Crowe, J. A. (2014). A web-based tool for eliciting probability distributions from experts. Environmental Modelling & Software, 52, 1–4.

Applying process tracing to RCTs

Process tracing is an application of Bayes’ theorem to test hypotheses using qualitative evidence.¹ Application areas tend to be complex, e.g., evaluating the outcomes of international aid or determining the causes of a war by interpreting testimony and documents. This post explores what happens if we apply process tracing to a simple hypothetical quantitative study: an RCT that includes a mediation analysis.

Process tracing is often conducted without probabilities, using heuristics such as the “hoop test” or “smoking gun test” that make its Bayesian foundations digestible. Alternatively, probabilities may be made somewhat easier to digest by viewing them through verbal descriptors such as those provided by the PHIA Probability Yardstick. Given the simple example we will tackle, I will apply Bayes’ rule directly to point probabilities.

I will assume that there are three mutually exclusive hypotheses:

Null: the intervention has no effect.

Out: the intervention improves outcomes; however, not through the hypothesised mediator (it works but we have no idea how).

Med: the intervention improves the outcome and it does so through the hypothesised mediator.

Other hypotheses I might have included are that the intervention causes harm or that the mediator operates in the opposite direction to that hypothesised. We might also be interested in whether the intervention pushes the mediator in the desired direction without shifting the outcome. But let’s not overcomplicate things.

There are two sources of evidence, estimates of:

Average treatment effect (ATE): I will treat this evidence source as binary: whether there is a statistically significant difference between treat and control or not (alternative versus null hypothesis). Let’s suppose that the Type I error rate is 5% and power is 80%. This  means that if either Out or Med holds, then there is an 80% chance of obtaining a statistically significant effect. If neither holds, then there is a 5% chance of obtaining a statistically significant effect (in error).

Average causal mediation effect (ACME): I will again treat this as binary: is ACME statistically significantly different to zero or not (alternative versus null hypothesis). I will assume that if ATE is significant and Med holds, then there is a 70% chance that ACME will be significant. Otherwise, I will assume a 5% chance (by Type I error).

Note where I obtained the probabilities above. I got the 5% and 80% for free, following conventions for Type I error and power in the social sciences. I arrived at the 70% using finger-in-the-wind: it should be possible to choose a decent mediator based on the prior literature, I reasoned; however, I have seen examples where a reasonable choice of mediator still fails to operate as expected in a highly powered study.

Finally, I need to choose prior probabilities for Null, Out, and Med. Under clinical equipoise, I feel that there should be a 50-50 chance of the intervention having an effect or not (findings from prior studies of the same intervention notwithstanding). Now suppose it does have an effect. I am going to assume there is a 50% chance of that effect operating through the mediator.

This means that

P(Null) = 50%
P(Out) = 25%
P(Med) = 25%

So, P(Out or Med) = 50%, i.e., the prior probabilities are setup to reflect my belief that there is a 50% chance the intervention works somehow.

I’m going to use a Bayesian network to do the sums for me (I used GeNIe Modeler). Here’s the setup:

The lefthand node shows the prior probabilities, as chosen. The righthand nodes show the inferred probabilities of observing the different patterns of evidence.

Let’s now pretend we have concluded the study and observed evidence. Firstly, we are delighted to discover that there is a statistically significant effect of the intervention on outcomes. Let’s update our Bayesian network (note how the Alternative outcome on ATE has been underlined and emboldened):

P(Null) has now dropped to 6% and P(ACME > 0) has risen to 36%. We do not yet have sufficient evidence to distinguish between Out or Med: their probabilities are both 47%.²

Next, let’s run the mediation analysis. Amazingly, it is also statistically significant:

So, given our initial probability assignments and the pretend evidence observed, we can be 93% sure that the intervention works and does so through the mediator.

If the mediation test had not been significant, then P(Out) would have risen to 69% and P(Med) would have dropped to 22%. If the ATE had been indistinguishable from zero, then P(Null) would have been 83%.

Is this process tracing or simply putting Bayes’ rule to work as usual? Does this example show that RCTs can be theory-based evaluations, since process tracing is a theory-based method, or does the inclusion of a control group rule out that possibility, as Figure 3.1 of the Magenta Book would suggest? I will leave the reader to assign probabilities to each possible conclusion. Let me know what you think.

¹ Okay, I accept that it is controversial to say that process tracing is necessarily an application of Bayes, particularly when no sums are involved. However, to me Bayes’ rule explains in the simplest possible terms why the four tests attributed to Van Evera (1997) [Guide to Methods for Students of Political Science. New York, NY: Cornell University Press.] work. It’s clear why there are so many references to Bayes in the process tracing literature.

² These are all actually conditional probabilities. I have made this implicit in the notation for ease of reading. Hopefully it’s clear given the prose.

For example, P(Med | ATE = Alternative) =  47%; in other words, the probability of Med given a statistically significant ATE estimate is 47%.

Evaluating the arts

I love the arts, particularly bouncy, moderately cheesy dance music experienced in a club. Programme evaluation has been poking at the arts for a while now and attempting to quantify their impact on wellbeing. And I really wish evaluators would stop and think what they’re doing before rushing in with WEMWBS. The experience I have dancing in a sweaty club is very different to the experience wandering around Tate or crying in a cinema. Programme effects are contrasts: the actual outcome of the programme versus an estimate of what the outcome would have been in the programme’s absence, e.g., following some genre of “business as usual”. Before we can measure the average causal benefits of the arts we need a sensible theory of change, and some ideas about what the counterfactual is. Maybe you can find out how I feel dancing to Free Yourself, but what’s the contrast? Dancing to something else? Staying in with a cup of tea and chocolate, reading a book about standard errors for matching with replacement? What exactly is the programme being evaluated…? (End of random thoughts.)