Estimating causal effects with optimization-based methods

Cousineau et al. (2023) compared seven optimisation-based methods for estimating causal effects, using 7700 datasets from the 2016 Atlantic Causal Inference competition. These datasets combine real covariates with simulated treatment assignment and response functions, so the data are real-world-inspired, with the advantage that the true effect (here, the sample average treatment effect on the treated; SATT) is known. See the supplementary material of Dorie et al.’s (2019) paper for more info on how the simulations were set up.

The methods they compared were:

| Method | R package | Function used |
|---|---|---|
| Approximate residual balancing (ARB) | balanceHD 1.0 | residualBalance.ate |
| Covariate balancing propensity score (CBPS) | CBPS 0.21 | CBPS |
| Entropy balancing (EBal) | ebal 0.1-6 | ebalance |
| Genetic matching (GenMatch) | Matching 4.9-9 | GenMatch |
| Kernel balancing (KBal) | kbal 0.1 | kbal |
| Stable balancing weights (SBW) | sbw 1.1.1 | sbw |

I’ve been hearing entropy balancing discussed a lot, so I had my eye on that one in particular.

Bias was the estimated SATT minus true SATT (i.e., the +/- sign was kept; I’m not sure what to make of that when averaging biases from analyses of multiple datasets). The root-mean-square error (RMSE) squares the bias from each estimate first, removing the sign, before averaging and square rooting, which seems easier to interpret.
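To make the bias/RMSE distinction concrete, here is a toy calculation with made-up bias values: a signed mean can cancel to zero while the RMSE keeps the typical error size.

```python
# Signed biases from analyses of five hypothetical datasets: large errors in
# both directions nearly cancel in the mean, but RMSE keeps their magnitude.
import math

biases = [0.5, -0.5, 0.4, -0.4, 0.0]

mean_bias = sum(biases) / len(biases)                       # signed average
rmse = math.sqrt(sum(b * b for b in biases) / len(biases))  # sign removed first

print(mean_bias)  # 0.0 -- looks unbiased
print(rmse)       # ~0.405 -- the typical error size
```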

Findings below. N gives the number of datasets, out of 7700, for which the SATT could be estimated; a few cells raised my eyebrows, particularly entropy balancing’s N and RMSE:

| Method | N | Bias mean | Bias SD | RMSE | Mean time (sec) |
|---|---|---|---|---|---|
| kbal | 7700 | 0.036 | 0.083 | 0.091 | 2521.3 |
| balancehd | 7700 | 0.041 | 0.099 | 0.107 | 2.0 |
| sbw | 4513 | 0.041 | 0.102 | 0.110 | 254.9 |
| cbps_exact | 7700 | 0.041 | 0.105 | 0.112 | 6.4 |
| ebal | 4513 | 0.041 | 0.110 | 0.117 | 0.2 |
| cbps_over | 7700 | 0.044 | 0.117 | 0.125 | 17.3 |
| genmatch | 7700 | 0.052 | 0.141 | 0.151 | 8282.4 |

This particular implementation of entropy balancing failed to find a solution for about 40% of the datasets! Note, however:

“All these optimization-based methods are executed using their default parameters on R 4.0.2 to demonstrate their usefulness when directly used by an applied researcher” (emphasis added).

Maybe tweaking the settings would have improved the success rate. And #NotAllAppliedResearchers 🙂

Below is a comparison with a bunch of other methods from the competition, for which findings were already available on a GitHub repo (see Dorie et al., 2019, Tables 2 and 3, for more info on each method).

| Method | N | Bias mean | Bias SD | RMSE | 95% CI coverage (%) |
|---|---|---|---|---|---|
| bart on pscore | 7700 | 0.001 | 0.014 | 0.014 | 88.4 |
| bart tmle | 7700 | 0.000 | 0.016 | 0.016 | 93.5 |
| mbart symint | 7700 | 0.002 | 0.017 | 0.017 | 90.3 |
| bart mchains | 7700 | 0.002 | 0.017 | 0.017 | 85.7 |
| bart xval | 7700 | 0.002 | 0.017 | 0.017 | 81.2 |
| bart | 7700 | 0.002 | 0.018 | 0.018 | 81.1 |
| sl bart tmle | 7689 | 0.003 | 0.029 | 0.029 | 91.5 |
| h2o ensemble | 6683 | 0.007 | 0.029 | 0.030 | 100.0 |
| bart iptw | 7700 | 0.002 | 0.032 | 0.032 | 83.1 |
| sl tmle | 7689 | 0.007 | 0.032 | 0.032 | 87.6 |
| superlearner | 7689 | 0.006 | 0.038 | 0.039 | 81.6 |
| calcause | 7694 | 0.003 | 0.043 | 0.043 | 81.7 |
| tree strat | 7700 | 0.022 | 0.047 | 0.052 | 87.4 |
| balanceboost | 7700 | 0.020 | 0.050 | 0.054 | 80.5 |
| adj tree strat | 7700 | 0.027 | 0.068 | 0.074 | 60.0 |
| lasso cbps | 7108 | 0.027 | 0.077 | 0.082 | 30.5 |
| sl tmle joint | 7698 | 0.010 | 0.101 | 0.102 | 58.9 |
| cbps | 7344 | 0.041 | 0.099 | 0.107 | 99.7 |
| teffects psmatch | 7506 | 0.043 | 0.099 | 0.108 | 47.0 |
| linear model | 7700 | 0.045 | 0.127 | 0.135 | 22.3 |
| mhe algorithm | 7700 | 0.045 | 0.127 | 0.135 | 22.8 |
| teffects ra | 7685 | 0.043 | 0.133 | 0.140 | 37.5 |
| teffects ipwra | 7634 | 0.044 | 0.161 | 0.166 | 35.3 |
| teffects ipw | 7665 | 0.042 | 0.298 | 0.301 | 39.0 |

I’ll leave you to read the original for commentary on this, but check out the RMSE and CI coverage. “Linear model” is summarised as “Linear model/ordinary least squares”. I assume covariates were entered as main effects only, which is a little unfair: the simulations included non-linearities, and diagnostic checks on models, such as partial residual plots, would have spotted them. Still, it doesn’t do too badly – better than genetic matching!
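To illustrate the point about diagnostics (with made-up data, not the competition’s simulations): when the true relationship is quadratic but the fit is a straight line, the leftover curvature sits in the residuals – exactly what a partial residual plot would display.

```python
# Hypothetical data: y depends on x and x^2, but we fit a straight line.
# The residuals then correlate strongly with x^2, revealing the missed curvature.
import random

random.seed(1)
x = [i / 50 - 1 for i in range(101)]                         # covariate grid on [-1, 1]
y = [xi + 0.5 * xi**2 + random.gauss(0, 0.05) for xi in x]   # true curve is quadratic

def ols_fit(xs, ys):
    """Least-squares slope and intercept of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = ols_fit(x, y)
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu = sum(u) / len(u); mv = sum(v) / len(v)
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((a - mv) ** 2 for a in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

# Residuals from the straight-line fit still carry the quadratic term.
curvature_corr = corr(resid, [xi**2 for xi in x])
print(curvature_corr)  # close to 1: the misfit is easy to see
```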

Interestingly, the RMSE was a tiny bit worse for entropy balancing than for Stata’s teffects psmatch, which in these simulations was set up to use nearest-neighbour matching on propensity scores estimated by logistic regression (the defaults, I presume – I’m an R user).
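For anyone unfamiliar with it, nearest-neighbour propensity score matching is simple to sketch. This toy uses invented scores and outcomes (it is not Stata’s teffects implementation): each treated unit is paired with the control whose estimated propensity score is closest, and the SATT estimate is the mean matched outcome difference.

```python
# Toy nearest-neighbour matching on the propensity score, with replacement.
# Each tuple is (estimated propensity score, observed outcome) -- invented values.
treated  = [(0.61, 5.0), (0.42, 3.5), (0.70, 6.1)]
controls = [(0.60, 4.1), (0.40, 3.0), (0.55, 3.9), (0.72, 5.0)]

def nearest_control(ps, pool):
    """Control unit whose propensity score is closest to ps."""
    return min(pool, key=lambda c: abs(c[0] - ps))

# SATT estimate: mean difference between each treated outcome and its match.
diffs = [y - nearest_control(ps, controls)[1] for ps, y in treated]
satt = sum(diffs) / len(diffs)
print(round(satt, 2))
```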

The winners were all regression-based or what the authors called “mixed methods” – in this context meaning some genre of doubly robust method that combines matching/weighting with regression adjustment. Bayesian additive regression trees (BART) feature towards the best end of the table. These sorts of regression-based methods don’t allow the design phase to be clearly separated from the estimation phase. For matching approaches where this separation is possible, the outcome data can be held back from analysts until matches are found or weights estimated based only on covariates. Where the analysis also demands access to outcomes, a robust approach is needed, such as a detailed, published statistical analysis plan and, for example, holding back some data for a training and validation phase before fitting the final model.
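The doubly robust idea can be sketched in a few lines. Below is a minimal AIPW (augmented inverse probability weighting) simulation on made-up data: the outcome model is deliberately wrong, but because the propensity model is right, the estimate still lands near the true effect.

```python
# Minimal sketch of double robustness via AIPW, on hypothetical simulated data.
# The outcome models m1, m0 are deliberately WRONG (constant zero); the
# propensity score e(x) is correct, so the estimate remains consistent.
import math
import random

random.seed(42)
tau = 2.0      # true treatment effect
n = 20000

terms = []
for _ in range(n):
    x = random.uniform(-1, 1)
    e = 1 / (1 + math.exp(-x))              # true propensity score, assumed known
    t = 1 if random.random() < e else 0
    y = tau * t + x + random.gauss(0, 1)

    m1 = m0 = 0.0                            # misspecified outcome model
    # AIPW: outcome-model prediction plus inverse-probability-weighted residual
    terms.append(m1 - m0
                 + t * (y - m1) / e
                 - (1 - t) * (y - m0) / (1 - e))

aipw = sum(terms) / n
print(round(aipw, 2))  # close to the true tau = 2
```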

No info is provided on CI coverage for the seven optimisation-based methods they tested. This is why (Cousineau et al., 2023, p. 377):

“While some of these methods did provide some functions to estimate the confidence intervals (i.e., balancehd, sbw), these did not work due to the collinearity of the covariates. While it could be possible to obtain confidence intervals with bootstrapping for all methods, we did not pursue this avenue due to the computational resources that would be needed for some methods (e.g., kbal) and to the inferior results in Table 5 that did not warrant such resources.”

It would be interesting to zoom in on a smaller set of options and datasets and perhaps allow some more researcher input on how analyses are carried out.


Cousineau, M., Verter, V., Murphy, S. A., & Pineau, J. (2023). Estimating causal effects with optimization-based methods: A review and empirical comparison. European Journal of Operational Research, 304(2), 367–380.

Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statistical Science, 34(1). 

Mermin’s (1981) variant of Bell’s theorem – in R

Entanglement is the weirdest feature of quantum mechanics. David Mermin (1981) provides an accessible introduction to experiments showing that local determinism doesn’t hold in the quantum world, simplifying Bell’s theorem and tests thereof. This knitted Markdown file shows the sums in R. It’s probably only going to make sense if you have been here before, but hadn’t got around to doing the sums yourself (that was me, before writing this today!).
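For a flavour of those sums (sketched here in Python rather than the post’s R), this enumerates every local “instruction set” for Mermin’s three-setting device and compares the resulting lower bound on same-colour flashes with the quantum prediction:

```python
# Mermin's gedanken experiment: any local "instruction set" makes the two
# lights flash the same colour at least 5/9 of the time; quantum mechanics
# predicts exactly 1/2.
import math
from itertools import product

# An instruction set assigns a colour (R or G) to each of the three detector
# settings; perfect agreement at equal settings means both sides carry the same set.
instruction_sets = list(product("RG", repeat=3))

def same_colour_fraction(instr):
    """Fraction of the 9 equally likely setting pairs giving the same colour."""
    agree = sum(instr[a] == instr[b] for a in range(3) for b in range(3))
    return agree / 9

lhv_min = min(same_colour_fraction(s) for s in instruction_sets)

# Quantum prediction: settings 120 degrees apart; same setting -> always agree,
# different settings -> agree with probability cos^2(60 deg) = 1/4.
p_diff = math.cos(math.radians(60)) ** 2
quantum = (1 / 3) * 1 + (2 / 3) * p_diff

print(lhv_min)   # 5/9: the local hidden-variable bound
print(quantum)   # 1/2: what quantum mechanics predicts and experiments find
```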


Bell (1981, C2-57):

“… it may be that it is not permissible to regard the experimental settings a and b in the analyzers as independent variables, as we did. We supposed them in particular to be independent of the supplementary [a.k.a. hidden] variables λ, in that a and b could be changed without changing the probability distribution ρ(λ). Now even if we have arranged that a and b are generated by apparently random radioactive devices, housed in separate boxes and thickly shielded, or by Swiss national lottery machines, or by elaborate computer programmes, or by apparently free willed experimental physicists, or by some combination of all of these, we cannot be sure that a and b are not significantly influenced by the same factors λ that influence A and B [measurement outcomes]. But this way of arranging quantum mechanical correlations would be even more mind boggling than one in which causal chains go faster than light. Apparently separate parts of the world would be deeply and conspiratorially entangled, and our apparent free will would be entangled with them.”

Hance and Hossenfelder (2022, p. 1382) on the assumption of statistical independence of supplementary/hidden variables and experimental settings:

“Types of hidden variables theories which violate statistical independence include those which are superdeterministic, retrocausal, and supermeasured. Some have dismissed them on metaphysical grounds, by associating a violation of statistical independence with the existence of ‘free will’ or ‘free choice’ and then arguing that these are not assumptions we should give up.

“It is, in hindsight, difficult to understand how this association came about. We believe it originated in the idea that a correlation between the hidden variables and the measurement setting would somehow prevent the experimentalist from choosing the setting to their liking. However, this is mistaking a correlation with a causation. And any serious philosophical discussion of free will acknowledges that human agency is of course constrained by the laws of nature anyway.”


Bell, J. S. (1981). Bertlmann’s socks and the nature of reality. Le Journal de Physique Colloques, 42(C2), C2-41–C2-62. Reprinted in Bell (2004).

Bell, J. S. (2004). Speakable and unspeakable in quantum mechanics: Collected papers on quantum philosophy (2nd ed.). Cambridge University Press.

Hance, J. R., & Hossenfelder, S. (2022). Bell’s theorem allows local theories of quantum mechanics. Nature Physics, 18(12), 1382.  [Preprint]


A cynical view of SEMs

It is all too common for a box and arrow diagram to be cobbled together in an afternoon and christened a “theory of change”. One formalised version of such a diagram is a structural equation model (SEM), the arrows of which are annotated with coefficients estimated using data. Here is John Fox (2002) on SEM and informal boxology:

“A cynical view of SEMs is that their popularity in the social sciences reflects the legitimacy that the models appear to lend to causal interpretation of observational data, when in fact such interpretation is no less problematic than for other kinds of regression models applied to observational data. A more charitable interpretation is that SEMs are close to the kind of informal thinking about causal relationships that is common in social-science theorizing, and that, therefore, these models facilitate translating such theories into data analysis.”
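To see what the arrows-with-coefficients amount to in the simplest case, here is a toy path model x → m → y fitted as two regressions on simulated data. This is only a sketch (real analyses would use a dedicated SEM package, e.g. lavaan in R), and the path values are invented:

```python
# Toy structural equation model: x -> m -> y, with no direct x -> y path.
# Each arrow is estimated by a simple least-squares regression.
import random

random.seed(7)
n = 10000
x = [random.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + random.gauss(0, 1) for xi in x]   # true path a = 0.5
y = [0.7 * mi + random.gauss(0, 1) for mi in m]   # true path b = 0.7

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

a_hat = slope(x, m)        # arrow x -> m
b_hat = slope(m, y)        # arrow m -> y
indirect = a_hat * b_hat   # the "effect of x on y via m" the diagram claims

print(a_hat, b_hat, indirect)
```

Whether those numbers deserve a causal reading is, of course, exactly Fox’s point.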


Fox, J. (2002). Structural Equation Models: Appendix to An R and S-PLUS Companion to Applied Regression. Last corrected 2006.

A simple circuit

Here is a simple quantum computing circuit:

There are two qubits (quantum bits), q[0] and q[1], and two classical bits, c[0] and c[1]. The latter will be used to store results of measuring the former.

Read the circuit left to right.

∣0⟩ is a qubit that will always have a measurement outcome of 0 (in the computational basis).

H is a Hadamard gate that puts that ∣0⟩ into a “superposition” (a sum) of both the “basis states” ∣0⟩ and ∣1⟩. The resulting superposition will collapse to either ∣0⟩ or ∣1⟩ with equal probability when measured (again, assuming the computational basis is used).

The next items on the circuit that look like little dials with cables attached denote measurement. Qubit q[0] is measured first and the result saved into c[0], then q[1] is measured and the result is saved into c[1]. The two qubits are unentangled, which means that measuring one has no effect on the other. (See this post for an example with entanglement.)

So basically this circuit is a fancy way to flip two coins, using quantum objects in superposition rather than metal discs. You can run it on a real quantum computer for free at IBM Quantum. I used such a circuit to decide what to do at the weekend, choosing randomly from four options. With \(n\) qubits you can do this for \(2^n\) options. It took about an hour to get the answer. There may be better things to do with quantum computers…
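A classical simulation of the circuit (a sketch only; the option names are made up) makes the coin-flip interpretation explicit:

```python
# Classical simulation of the two-qubit circuit: H on |0> gives an equal
# superposition, so each measurement yields a fair bit; two bits select one
# of 2^2 = 4 options.
import math
import random

random.seed(123)  # real quantum hardware would supply the randomness

def hadamard_on_zero():
    """Return the state H|0> = (|0> + |1>)/sqrt(2) as a list of amplitudes."""
    return [1 / math.sqrt(2), 1 / math.sqrt(2)]

def measure(amplitudes):
    """Collapse to 0 or 1 with probabilities |amplitude|^2 (computational basis)."""
    p0 = abs(amplitudes[0]) ** 2
    return 0 if random.random() < p0 else 1

# The circuit: each qubit gets H then measurement; results go to c[0] and c[1].
c = [measure(hadamard_on_zero()), measure(hadamard_on_zero())]

options = ["walk", "cinema", "museum", "bake"]  # hypothetical weekend options
index = 2 * c[1] + c[0]                         # two classical bits -> 0..3
print(options[index])
```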

Appeal to consequences fallacy in understanding Bell’s theorem

Joan Vaccaro (2018, p. 11) on arguments against superdeterminism:

“An argument that has been advocated by leading physicists is that humans are necessarily independent of the universe that surrounds them because the practice of science requires the independence of the experimenter from the subject of study. For example, Bell et al. state that unless the experimenter and subject are independent, we would need to abandon ‘…the whole enterprise of discovering the laws of nature by experimentation’, and Zeilinger claims that if the experimenter and subject were not independent ‘…such a position would completely pull the rug out from underneath science.’ However, this argument contains a logical fallacy called an appeal to consequences. Specifically, arguing for experimenter–subject independence on the basis that the alternative has undesirable consequences does not prove that experimenters are independent of their subjects. Rather, the alternative may well be true, in which case we would need to deal with the consequences.”


Vaccaro, J. A. (2018). The quantum theory of time, the block universe, and human experience. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2123).

Beautiful friendships have been jeopardised

This is an amusing opening to a paper on face validity, by Mosier (1947):

“Face validity is a term that is bandied about in the field of test construction until it seems about to become a part of accepted terminology. The frequency of its use and the emotional reaction which it arouses – ranging almost from contempt to highest approbation – make it desirable to examine its meaning more closely. When a single term variously conveys high praise or strong condemnation, one suspects either ambiguity of meaning or contradictory postulates among those using the term. The tendency has been, I believe, to assume unaccepted premises rather than ambiguity, and beautiful friendships have been jeopardized when a chance remark about face validity has classed the speaker among the infidels.”

I think dozens of beautiful friendships have been jeopardized by loose talk about randomised controlled trials, theory-based evaluation, realism, and positivism, among many others. I’ve just seen yet another piece arguing that you wouldn’t evaluate a parachute with an RCT and I can’t even.


Mosier, C. I. (1947). A Critical Examination of the Concepts of Face Validity. Educational and Psychological Measurement, 7(2), 191–205.