Cousineau et al. (2023) compared seven optimisation-based methods for estimating causal effects, using 7700 datasets from the 2016 Atlantic Causal Inference Competition. These datasets use real covariates with simulated treatment assignment and response functions, so it’s real-world-inspired data, with the advantage that the true effect (here, the sample average treatment effect on the treated; SATT) is known. See the supplementary material of Dorie et al.’s (2019) paper for more info on how the simulations were set up.
The methods they compared were:
Method | R package | Function used |
---|---|---|
Approximate residual balancing (ARB) | balanceHD 1.0 | residualBalance.ate |
Covariate balancing propensity score (CBPS) | CBPS 0.21 | CBPS |
Entropy balancing (EBal) | ebal 0.1-6 | ebalance
Genetic matching (GenMatch) | Matching 4.9-9 | GenMatch
Kernel balancing (KBal) | kbal 0.1 | kbal |
Stable balancing weights (SBW) | sbw 1.1.1 | sbw |
I’m hearing entropy balancing discussed a lot, so I had my eye on this one.
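For context, here’s a minimal sketch of what running entropy balancing with the package defaults looks like. The data frame `dat`, its treatment indicator `treat`, outcome `y`, and the covariate names are all hypothetical:

```r
# Minimal sketch of entropy balancing via ebal, using default settings.
# `dat`, `treat`, `y`, and the covariate names are hypothetical.
library(ebal)

X  <- as.matrix(dat[, c("age", "income", "education")])  # covariates only
eb <- ebalance(Treatment = dat$treat, X = X)              # weights for the controls

# SATT estimate: treated mean minus entropy-weighted control mean
satt_hat <- mean(dat$y[dat$treat == 1]) -
  weighted.mean(dat$y[dat$treat == 0], w = eb$w)
```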
Bias was the estimated SATT minus true SATT (i.e., the +/- sign was kept; I’m not sure what to make of that when averaging biases from analyses of multiple datasets). The root-mean-square error (RMSE) squares the bias from each estimate first, removing the sign, before averaging and square rooting, which seems easier to interpret.
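In code terms, with `est` and `truth` as (hypothetical) vectors of estimated and true SATTs, one per dataset:

```r
# Signed errors can cancel when averaged; squaring first removes the sign
bias      <- est - truth
mean_bias <- mean(bias)
rmse      <- sqrt(mean(bias^2))
```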
Findings below. N gives the number of datasets out of 7700 where SATT could be estimated; red where my eyebrows were raised and pink for entropy balancing and its RMSE:
Method | N | Bias mean | Bias SD | RMSE | Mean time (sec)
---|---|---|---|---|---
kbal | 7700 | 0.036 | 0.083 | 0.091 | 2521.3 |
balancehd | 7700 | 0.041 | 0.099 | 0.107 | 2.0 |
sbw | 4513 | 0.041 | 0.102 | 0.110 | 254.9 |
cbps_exact | 7700 | 0.041 | 0.105 | 0.112 | 6.4 |
ebal | 4513 | 0.041 | 0.110 | 0.117 | 0.2 |
cbps_over | 7700 | 0.044 | 0.117 | 0.125 | 17.3 |
genmatch | 7700 | 0.052 | 0.141 | 0.151 | 8282.4 |
This particular implementation of entropy balancing failed to find a solution for about 40% of the datasets! Note, however:
“All these optimization-based methods are executed using their default parameters on R 4.0.2 to demonstrate their usefulness when directly used by an applied researcher” (emphasis added).
Maybe tweaking the settings would have improved the success rate. And #NotAllAppliedResearchers
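For what it’s worth, `ebalance` does expose a few settings that the defaults pin down, such as the maximum number of iterations and the constraint tolerance. Something like the following might rescue some of the failed datasets, though the values here are arbitrary guesses rather than anything the paper tested:

```r
# Continuing the hypothetical `dat` from the earlier sketch.
# Looser settings than the defaults (max.iterations = 200, constraint.tolerance = 1);
# wrap in tryCatch so a failed optimisation is recorded rather than halting a loop.
library(ebal)

X <- as.matrix(dat[, c("age", "income", "education")])
eb_tweaked <- tryCatch(
  ebalance(Treatment = dat$treat, X = X,
           max.iterations = 1000,
           constraint.tolerance = 1.5),
  error = function(e) NULL
)
```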
Below is a comparison with a bunch of other methods from the competition, for which findings were already available on a GitHub repo (see Dorie et al., 2019, Tables 2 and 3, for more info on each method).
Method | N | Bias mean | Bias SD | RMSE | 95% CI coverage (%)
---|---|---|---|---|---
bart on pscore | 7700 | 0.001 | 0.014 | 0.014 | 88.4 |
bart tmle | 7700 | 0.000 | 0.016 | 0.016 | 93.5 |
mbart symint | 7700 | 0.002 | 0.017 | 0.017 | 90.3 |
bart mchains | 7700 | 0.002 | 0.017 | 0.017 | 85.7 |
bart xval | 7700 | 0.002 | 0.017 | 0.017 | 81.2 |
bart | 7700 | 0.002 | 0.018 | 0.018 | 81.1 |
sl bart tmle | 7689 | 0.003 | 0.029 | 0.029 | 91.5 |
h2o ensemble | 6683 | 0.007 | 0.029 | 0.030 | 100.0 |
bart iptw | 7700 | 0.002 | 0.032 | 0.032 | 83.1 |
sl tmle | 7689 | 0.007 | 0.032 | 0.032 | 87.6 |
superlearner | 7689 | 0.006 | 0.038 | 0.039 | 81.6 |
calcause | 7694 | 0.003 | 0.043 | 0.043 | 81.7 |
tree strat | 7700 | 0.022 | 0.047 | 0.052 | 87.4 |
balanceboost | 7700 | 0.020 | 0.050 | 0.054 | 80.5 |
adj tree strat | 7700 | 0.027 | 0.068 | 0.074 | 60.0 |
lasso cbps | 7108 | 0.027 | 0.077 | 0.082 | 30.5 |
sl tmle joint | 7698 | 0.010 | 0.101 | 0.102 | 58.9 |
cbps | 7344 | 0.041 | 0.099 | 0.107 | 99.7 |
teffects psmatch | 7506 | 0.043 | 0.099 | 0.108 | 47.0 |
linear model | 7700 | 0.045 | 0.127 | 0.135 | 22.3 |
mhe algorithm | 7700 | 0.045 | 0.127 | 0.135 | 22.8 |
teffects ra | 7685 | 0.043 | 0.133 | 0.140 | 37.5 |
teffects ipwra | 7634 | 0.044 | 0.161 | 0.166 | 35.3 |
teffects ipw | 7665 | 0.042 | 0.298 | 0.301 | 39.0 |
I’ll leave you to read the original for commentary on this, but check out the RMSE and CI coverage. Linear model is summarised as “Linear model/ordinary least squares”. I assume covariates were just entered as main effects, which is a little unfair: the simulations included non-linearity, and diagnostic checks on the model, such as partial residual plots, would have spotted this. Still, it doesn’t do too badly – better than genetic matching!
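As a rough illustration (variable names hypothetical again), this is the kind of main-effects-only fit I have in mind, together with the partial residual plots that would flag the non-linearity:

```r
# Main-effects-only outcome regression, roughly what I assume the entry used
fit <- lm(y ~ treat + age + income + education, data = dat)

# Component + residual (partial residual) plots from base R;
# curvature in a panel suggests that covariate needs a transformation or spline
termplot(fit, partial.resid = TRUE, smooth = panel.smooth)
```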
Interestingly, the RMSE was a tiny bit worse for entropy balancing than for Stata’s teffects psmatch, which in the simulations was set up to use nearest-neighbour matching on propensity scores estimated using logistic regression (I presume the defaults – I’m an R user).
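For fellow R users, the nearest analogue I know of is MatchIt with its defaults, which also does nearest-neighbour matching on a logistic-regression propensity score (I haven’t checked whether the finer details, e.g. matching with or without replacement, line up with Stata’s defaults):

```r
# Roughly analogous propensity score matching in R (variable names hypothetical)
library(MatchIt)

m <- matchit(treat ~ age + income + education, data = dat,
             method   = "nearest",  # nearest-neighbour matching
             distance = "glm")      # propensity score from logistic regression
summary(m)
```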
The winners were all either regression-based or what the authors called “mixed methods” – in this context meaning some genre of doubly-robust method that combines matching or weighting with regression adjustment. Bayesian additive regression trees (BART) feature towards the best end of the table. These sorts of regression-based methods don’t allow the design phase to be clearly separated from the estimation phase. For matching approaches where this separation is possible, the outcome data can be held back from analysts until matches are found or weights estimated based only on covariates. Where the analysis also demands access to outcomes, a more robust approach is needed: for instance, a highly specified and published statistical analysis plan, and holding back some data for a training and validation phase before fitting the final model.
No info is provided on CI coverage for the seven optimisation-based methods they tested. This is why (Cousineau et al., 2023, p. 377):
“While some of these methods did provide some functions to estimate the confidence intervals (i.e., balancehd, sbw), these did not work due to the collinearity of the covariates. While it could be possible to obtain confidence intervals with bootstrapping for all methods, we did not pursue this avenue due to the computational resources that would be needed for some methods (e.g., kbal) and to the inferior results in Table 5 that did not warrant such resources.”
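A bootstrap of the kind they mention could look roughly like this for a single dataset and entropy balancing (hypothetical variable names again, with failed solutions simply skipped). Multiply this across 7700 datasets and the slower methods, and the computational cost becomes clear:

```r
# Naive percentile bootstrap for the SATT under entropy balancing
library(ebal)

boot_satt <- replicate(999, {
  d  <- dat[sample(nrow(dat), replace = TRUE), ]
  eb <- tryCatch(ebalance(d$treat, as.matrix(d[, c("age", "income", "education")])),
                 error = function(e) NULL)
  if (is.null(eb)) NA_real_ else
    mean(d$y[d$treat == 1]) - weighted.mean(d$y[d$treat == 0], eb$w)
})
quantile(boot_satt, c(0.025, 0.975), na.rm = TRUE)  # 95% percentile interval
```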
It would be interesting to zoom in on a smaller set of options and datasets and perhaps allow some more researcher input on how analyses are carried out.
References
Cousineau, M., Verter, V., Murphy, S. A., & Pineau, J. (2023). Estimating causal effects with optimization-based methods: A review and empirical comparison. European Journal of Operational Research, 304(2), 367–380.
Dorie, V., Hill, J., Shalit, U., Scott, M., & Cervone, D. (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1).