Cousineau et al. (2023) conducted a comparative analysis of seven optimisation-based methods for estimating causal effects, using 7700 datasets from the 2016 Atlantic Causal Inference competition. These datasets use real covariates with simulated treatment assignment and response functions, so it’s real-world data (kinda), but with the advantage that the true effect (here, the sample average treatment effect on the treated; SATT) is known. See the supplementary material of Dorie et al.’s (2019) paper for more info on how the simulations were set up.
The methods they compared were:
| Method | R package | Function used |
|---|---|---|
| Approximate residual balancing (ARB) | balanceHD 1.0 | residualBalance.ate |
| Covariate balancing propensity score (CBPS) | CBPS 0.21 | CBPS |
| Entropy balancing (EBal) | ebal 0.1-6 | ebalance |
| Genetic matching (GenMatch) | Matching 4.9-9 | GenMatch |
| Kernel balancing (KBal) | kbal 0.1 | kbal |
| Stable balancing weights (SBW) | sbw 1.1.1 | sbw |

(That’s six packages but seven methods: CBPS was run in both its exact and over-identified versions, which appear as cbps_exact and cbps_over in the results below.)
I’m hearing entropy balancing discussed a lot, so I had my eye on this one.
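For a flavour of what running it involves, here’s a minimal entropy balancing sketch on simulated data (my own toy example, not the paper’s code; per ebal’s documentation, the returned object’s `w` element holds the control weights):

```r
library(ebal)

# Toy data: 500 units, 3 covariates, treatment confounded by the first one
set.seed(1)
n <- 500
X <- matrix(rnorm(n * 3), n, 3)
treat <- rbinom(n, 1, plogis(X[, 1]))
y <- X[, 1] + 0.5 * X[, 2] + treat + rnorm(n)

# Reweight controls so their covariate moments match the treated group
eb <- ebalance(Treatment = treat, X = X)

# SATT estimate: treated mean minus weighted control mean
mean(y[treat == 1]) - weighted.mean(y[treat == 0], eb$w)
```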
Findings (N gives the number of datasets out of 7700 where SATT could be estimated):
| Method | N | Mean bias | SD bias | RMSE | Mean time (sec) |
|---|---|---|---|---|---|
| kbal | 7700 | 0.036 | 0.083 | 0.091 | 2521.3 |
| balancehd | 7700 | 0.041 | 0.099 | 0.107 | 2.0 |
| sbw | 4513 | 0.041 | 0.102 | 0.110 | 254.9 |
| cbps_exact | 7700 | 0.041 | 0.105 | 0.112 | 6.4 |
| ebal | 4513 | 0.041 | 0.110 | 0.117 | 0.2 |
| cbps_over | 7700 | 0.044 | 0.117 | 0.125 | 17.3 |
| genmatch | 7700 | 0.052 | 0.141 | 0.151 | 8282.4 |
Bias was the estimated SATT minus the true SATT (i.e., the sign was kept; I’m not sure what to make of that when averaging across findings from multiple datasets, since positive and negative errors can cancel out, though the SD is safe). The root-mean-square error (RMSE) squares each dataset’s bias first, removing the sign, before averaging and taking the square root, which seems easier to interpret.
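In R terms, given hypothetical vectors `est_satt` and `true_satt` holding the per-dataset estimates and true values, the three summaries would be:

```r
err <- est_satt - true_satt  # signed error, one entry per dataset

mean(err)          # mean bias: positive and negative errors can cancel
sd(err)            # spread of the signed errors
sqrt(mean(err^2))  # RMSE: square (dropping the sign), average, then root
```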
Entropy balancing failed to find a solution for about 40% of the datasets (it returned an estimate for only 4513 of the 7700)! Note, however:
“All these optimization-based methods are executed using their *default parameters* on R 4.0.2 to demonstrate their usefulness when directly used by an applied researcher” (emphasis added).
Maybe tweaking the settings would have improved the success rate. And #NotAllAppliedResearchers 🙂
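For instance, here’s roughly what tweaking might look like, using the `max.iterations` and `constraint.tolerance` arguments from ebal’s `ebalance()` documentation (whether these particular values would have rescued the failing datasets, I don’t know):

```r
library(ebal)

# treat and X as in the sketch above; loosen the convergence settings
# that were left at their defaults (max.iterations = 200,
# constraint.tolerance = 1)
eb <- ebalance(Treatment = treat, X = X,
               max.iterations = 2000,
               constraint.tolerance = 2)
```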
Below is a comparison with a bunch of other methods from the competition for which findings were already available (see Dorie et al., 2019, Tables 2 and 3, for more info on each method).
| Method | N | Mean bias | SD bias | RMSE | 95% CI coverage (%) |
|---|---|---|---|---|---|
| bart_on_pscore | 7700 | 0.001 | 0.014 | 0.014 | 88.4 |
| bart_tmle | 7700 | 0.000 | 0.016 | 0.016 | 93.5 |
| mbart_symint | 7700 | 0.002 | 0.017 | 0.017 | 90.3 |
| bart_mchains | 7700 | 0.002 | 0.017 | 0.017 | 85.7 |
| bart_xval | 7700 | 0.002 | 0.017 | 0.017 | 81.2 |
| bart | 7700 | 0.002 | 0.018 | 0.018 | 81.1 |
| sl_bart_tmle | 7689 | 0.003 | 0.029 | 0.029 | 91.5 |
| h2o_ensemble | 6683 | 0.007 | 0.029 | 0.030 | 100.0 |
| bart_iptw | 7700 | 0.002 | 0.032 | 0.032 | 83.1 |
| sl_tmle | 7689 | 0.007 | 0.032 | 0.032 | 87.6 |
| superlearner | 7689 | 0.006 | 0.038 | 0.039 | 81.6 |
| calcause | 7694 | 0.003 | 0.043 | 0.043 | 81.7 |
| tree_strat | 7700 | 0.022 | 0.047 | 0.052 | 87.4 |
| balanceboost | 7700 | 0.020 | 0.050 | 0.054 | 80.5 |
| adj_tree_strat | 7700 | 0.027 | 0.068 | 0.074 | 60.0 |
| lasso_cbps | 7108 | 0.027 | 0.077 | 0.082 | 30.5 |
| sl_tmle_joint | 7698 | 0.010 | 0.101 | 0.102 | 58.9 |
| cbps | 7344 | 0.041 | 0.099 | 0.107 | 99.7 |
| teffects_psmatch | 7506 | 0.043 | 0.099 | 0.108 | 47.0 |
| linear_model | 7700 | 0.045 | 0.127 | 0.135 | 22.3 |
| mhe_algorithm | 7700 | 0.045 | 0.127 | 0.135 | 22.8 |
| teffects_ra | 7685 | 0.043 | 0.133 | 0.140 | 37.5 |
| teffects_ipwra | 7634 | 0.044 | 0.161 | 0.166 | 35.3 |
| teffects_ipw | 7665 | 0.042 | 0.298 | 0.301 | 39.0 |
I’ll leave you to read the original for commentary on this, but check out the RMSE and CI coverage. The linear model is summarised as “Linear model/ordinary least squares”. I assume covariates were just entered as main effects, which is a little unfair: the simulations included non-linearity, and diagnostic checks on the model, such as partial residual plots, would spot this (see the sketch below). Still, it doesn’t do too badly, better than genetic matching! Interestingly, the RMSE was a tiny bit worse for entropy balancing (0.117) than for teffects_psmatch (0.108), a particular application of propensity score matching with the scores estimated by logistic regression on first-order terms and nearest-neighbour matching.
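If you’ve not seen partial residual plots, base R’s `termplot()` will draw them for an `lm` fit. A toy sketch (my own made-up data, nothing to do with the competition setup):

```r
# A response that is non-linear in x1, fitted with main effects only
set.seed(2)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- x1^2 + x2 + rnorm(200)
fit <- lm(y ~ x1 + x2)

# The clear curve in the x1 panel flags the missed non-linearity
termplot(fit, partial.resid = TRUE, smooth = panel.smooth)
```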
No info is provided on CI coverage for the seven optimisation-based methods they tested. This is why (Cousineau et al., 2023, p. 377):
“While some of these methods did provide some functions to estimate the confidence intervals (i.e., balancehd, sbw), these did not work due to the collinearity of the covariates. While it could be possible to obtain confidence intervals with bootstrapping for all methods, we did not pursue this avenue due to the computational resources that would be needed for some methods (e.g., kbal) and to the inferior results in Table 5 that did not warrant such resources.”
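For completeness, here’s roughly what that bootstrap would look like for a single dataset, reusing the toy `treat`, `X`, and `y` from the entropy balancing sketch above (999 resamples; multiply by 7700 datasets and the run times in the first table to see their point about computational resources):

```r
# Percentile bootstrap CI for the entropy balancing SATT estimate
boot_satt <- replicate(999, {
  i  <- sample(length(treat), replace = TRUE)  # resample units with replacement
  eb <- tryCatch(ebalance(treat[i], X[i, ]),   # refit weights; may fail to converge
                 error = function(e) NULL)
  if (is.null(eb)) return(NA_real_)
  yb <- y[i]; tb <- treat[i]
  mean(yb[tb == 1]) - weighted.mean(yb[tb == 0], eb$w)
})
quantile(boot_satt, c(0.025, 0.975), na.rm = TRUE)  # 95% percentile CI
```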
References
Cousineau, M., Verter, V., Murphy, S. A., & Pineau, J. (2023). Estimating causal effects with optimization-based methods: A review and empirical comparison. European Journal of Operational Research, 304(2), 367–380.
Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1), 43–68.