Estimating causal effects with optimization-based methods

Cousineau et al. (2023) compared seven optimisation-based methods for estimating causal effects, using 7700 datasets from the 2016 Atlantic Causal Inference competition. These datasets use real covariates with simulated treatment assignment and response functions, so it’s real-world-inspired data, with the advantage that the true effect (here, sample average treatment effect; SATT) is known. See the supplementary material of Dorie et al.’s (2019) paper for more info on how the sims were setup.

The methods they compared were:

Method R package Function used
Approximate residual balancing (ARB) balanceHD 1.0 residualBalance.ate
Covariate balancing propensity score (CBPS) CBPS 0.21 CBPS
Entropy balancing (EBal) ebal 0.1–6 ebalance
Genetic matching (GenMatch) Matching 4.9–9 GenMatch
Kernel balancing (KBal) kbal 0.1 kbal
Stable balancing weights (SBW) sbw 1.1.1 sbw

I’m hearing entropy balancing discussed a lot, so had my eye on this.

Bias was the estimated SATT minus true SATT (i.e., the +/- sign was kept; I’m not sure what to make of that when averaging biases from analyses of multiple datasets). The root-mean-square error (RMSE) squares the bias from each estimate first, removing the sign, before averaging and square rooting, which seems easier to interpret.

Findings below. N gives the number of datasets out of 7700 where SATT could be estimated; red where my eyebrows were raised and pink for entropy balancing and its RMSE:

Β  Β  Bias Β  Time
Method N Mean SD RMSE Mean (sec)
kbal 7700 0.036 0.083 0.091 2521.3
balancehd 7700 0.041 0.099 0.107 2.0
sbw 4513 0.041 0.102 0.110 254.9
cbps_exact 7700 0.041 0.105 0.112 6.4
ebal 4513 0.041 0.110 0.117 0.2
cbps_over 7700 0.044 0.117 0.125 17.3
genmatch 7700 0.052 0.141 0.151 8282.4

This particular implementation of entropy balancing failed to find a solution for about 40% of the datasets! Note, however:

“All these optimization-based methods are executed using their default parameters on R 4.0.2 to demonstrate their usefulness when directly used by an applied researcher” (emphasis added).

Maybe tweaking the settings would have improved the success rate. And #NotAllAppliedResearchers πŸ™‚

Below is a comparison with a bunch of other methods from the competition, for which findings were already available on a GitHub repo (see Dorie et al., 2019, Table 2 and 3, for more info on each method).

Β  Β  Bias Β  95% CI
Method N Mean SD RMSE coverage (%)
bart on pscore 7700 0.001 0.014 0.014 88.4
bart tmle 7700 0.000 0.016 0.016 93.5
mbart symint 7700 0.002 0.017 0.017 90.3
bart mchains 7700 0.002 0.017 0.017 85.7
bart xval 7700 0.002 0.017 0.017 81.2
bart 7700 0.002 0.018 0.018 81.1
sl bart tmle 7689 0.003 0.029 0.029 91.5
h2o ensemble 6683 0.007 0.029 0.030 100.0
bart iptw 7700 0.002 0.032 0.032 83.1
sl tmle 7689 0.007 0.032 0.032 87.6
superlearner 7689 0.006 0.038 0.039 81.6
calcause 7694 0.003 0.043 0.043 81.7
tree strat 7700 0.022 0.047 0.052 87.4
balanceboost 7700 0.020 0.050 0.054 80.5
adj tree strat 7700 0.027 0.068 0.074 60.0
lasso cbps 7108 0.027 0.077 0.082 30.5
sl tmle joint 7698 0.010 0.101 0.102 58.9
cbps 7344 0.041 0.099 0.107 99.7
teffects psmatch 7506 0.043 0.099 0.108 47.0
linear model 7700 0.045 0.127 0.135 22.3
mhe algorithm 7700 0.045 0.127 0.135 22.8
teffects ra 7685 0.043 0.133 0.140 37.5
teffects ipwra 7634 0.044 0.161 0.166 35.3
teffects ipw 7665 0.042 0.298 0.301 39.0

I’ll leave you to read the original for commentary on this, but check out the RMSE and CI coverage. Linear model is summarised as “Linear model/ordinary least squares”. I assume covariates were just entered as main effects, which is a little unfair. The simulations included non-linearity and diagnostic checks on models, such as partial residual plots, would spot this. Still doesn’t do too badly – better than genetic matching!

Interestingly the RMSE was a tiny bit worse for entropy balancing than for Stata’s teffects psmatch, which in simulations was setup to use nearest-neighbour matching on propensity scores estimated using logistic regression (I presume the defaults – I’m an R user).

The winners were all either regression-based or what the authors called “mixed methods” – in this context meaning some genre of doubly-robust method that combined matching/weighting with regression adjustment. Bayesian additive regression trees (BART) feature towards the best end of the table. These sorts of regression-based methods don’t allow the design phase to be clearly separated from the estimation phase. For matching approaches where this separation is possible, the outcomes data can be held back from analysts until matches are found or weights estimated based only on covariates. Where the analysis also demands access to outcomes, a robust approach is needed, including a highly-specified and published statistical analysis plan and e.g., holding back some data in a training and validation phase before fitting the final model.

No info is provided on CI coverage for the seven optimisation-based methods they tested. This is why (Cousineau et al., 2023, p. 377):

“While some of these methods did provide some functions to estimate the confidence intervals (i.e., balancehd, sbw), these did not work due to the collinearity of the covariates. While it could be possible to obtain confidence intervals with bootstrapping for all methods, we did not pursue this avenue due to the computational resources that would be needed for some methods (e.g., kbal) and to the inferior results in Table 5 that did not warrant such resources.”

It would be interesting to zoom in on a smaller set of options and datasets and perhaps allow some more researcher input on how analyses are carried out.


Cousineau, M., Verter, V., Murphy, S. A., & Pineau, J. (2023). Estimating causal effects with optimization-based methods: A review and empirical comparison. European Journal of Operational Research, 304(2), 367–380.

Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statistical Science, 34(1).Β