Dealing with confounding in observational studies

Excellent review of simulation-based evaluations of quasi-experimental methods, by Varga et al. (2022). Also lovely annexes summarising the methods’ assumptions.

Methods for measured confounding the authors cover (Varga et al., 2022, Table A1):

Method Description of the method
PS matching (N = 47) Treated and untreated individuals are matched based on their propensity score-similarity. After creating comparable groups of treated and untreated individuals the effect of the treatment can be estimated.
IPTW (N = 30) With the help of re-weighting by the inverse probability of receiving the treatment, a synthetic sample is created which is representative of the population and in which treatment assignment is independent of the observed baseline covariates. Over-represented groups are downweighted and underrepresented groups are upweighted.
Overlap weights (N = 4) Overlap weights were developed to overcome the limitations of truncation and trimming for IPTW, when some individual PSs approach 0 or 1.
Matching weights (N = 2) Matching weights is an analogue weighting method for IPTW, when some individual PSs approach 0 or 1.
Covariate adjustment using PS (N = 13) The estimated PS is included as covariate in a regression model of the treatment.
PS stratification (N = 26) First the subjects are grouped into strata based upon their PS. Then, the treatment effect is estimated within each PS stratum, and the ATE is computed as a weighted mean of the stratum specific estimates.
GAM (N = 1) GAMs provide an alternative for traditional PS estimation by replacing the linear component of a logistic regression with a flexible additive function.
GBM (N = 3) GBM trees provide an alternative for traditional PS estimation by estimating the function of covariates in a more flexible manner than logistic regression by averaging the PSs of small regression trees.
Genetic matching (N = 7) This matching method algorithmically optimizes covariate balance and avoids the process of iteratively modifying the PS model.
Covariate-balancing PS (N = 5) Models treatment assignment while optimizing the covariate balance. The method exploits the dual characteristics of the PS as a covariate balancing score and the conditional probability of treatment assignment.
DR estimation (N = 13) Combines outcome regression with with a model for the treatment (eg, weighting by the PS) such that the effect estimator is robust to misspecification of one (but not both) of these models.
AIPTW (N = 8) This estimator achieves the doubly-robust property by combining outcome regression with weighting by the PS.
Stratified DR estimator (N = 1) Hybrid DR method of outcome regression with PS weighting and stratification.
TMLE (N = 2) Semi-parametric double-robust method that allows for flexible estimation using (nonparametric) machine-learning methods.
Collaborative TMLE (N = 1) Data-adaptive estimation method for TMLE.
One step joint Bayesian PS (N = 3) Jointly estimates quantities in the PS and outcome stages.
Two-step Bayesian approach (N = 2) A two-step modeling method is using the Bayesian PS model in the first step, followed by a Bayesian outcome model in the second step.
Bayesian model averaging (N = 1) Fully Bayesian model averaging approach.
An’s intermediate approach (N = 2) Not fully Bayesian insofar as the outcome equation in An’s approach is frequentist.
G-computation (N = 4) The method interprets counterfactual outcomes as missing data and uses a prediction model to obtain potential outcomes under different treatment scenarios. The entire set of predicted outcomes is then regressed on the treatment to obtain the coefficient of the effect estimate.
Prognostic scores (N = 7) Prognostic scores are considered to be the prognostic analog of the PS methods. the prognostic score includes covariates based on their predictive power of the response, the PS includes covariates that predict treatment assignment.

Methods for unmeasured confounding (Varga et al., 2022, Table A2):

Method Description of the method
IV approach (N = 17) Post-randomization can be achieved using a sufficiently strong instrument. IV is correlated with the treatment and only affects the outcome through the treatment.
2SLS (N = 11) Linear estimator of the IV method. Uses linear probability for binary outcome and linear regression for continuous outcome.
2SPS (N = 5) Non-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of 2SPS procedure. The predicted or residual values from the first stage logistic regression of treatment on the IV are used as covariates in the second stage logistic regression: the predicted value of treatment replaces the observed treatment for 2SPS.
2SRI (N = 8) Semi-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of the 2SRI procedure. The predicted or residual values from the first stage logistic regression of treatment on the IV are used as covariates in the second stage logistic regression.
IV based on generalized structural mean model (GSMM) (N = 1) Semi-parametric models that use instrumental variables to identify causal parameters. IV approach
Instrumental PS (Matching enhanced IV) (N = 2) Reduces the dimensionality of the measured confounders, but it also deals with unmeasured confounders by the use of an IV.
DiD (N = 7) DiD method uses the assumption that without the treatment the average outcomes for the treated and control groups would have followed parallel trends over time. The design measures the effect of a treatment as the relative change in the outcomes between individuals in the treatment and control groups over time.
Matching combined with DiD (N = 6) Alternative approach to DiD. (2) Uses matching to balance the treatment and control groups according to pre-treatment outcomes and covariates
SCM (N = 7) This method constructs a comparator, the synthetic control, as a weighted average of the available control individuals. The weights are chosen to ensure that, prior to the treatment, levels of covariates and outcomes are similar over time to those of the treated unit.
Imperfect SCM (N = 1) Extension of SCM method with relaxed assumptions that allow outcomes to be functions of transitory shocks.
Generalized SCM (N = 2) Combines SC with fixed effects.
Synthetic DiD (N = 1) Both unit and time fixed effects, which can be interpreted as the time-weighted version of DiD.
LDV regression approach (N = 1) Adjusts for pre-treatment outcomes and covariates with a parametric regression model. Alternative approach to DiD.
Trend-in-trend (N = 1) The trend-in-trend design examines time trends in outcome as a function of time trends in treatment across strata with different time trends in treatment.
PERR (N = 3) PERR adjustment is a type of self-controlled design in which the treatment effect is estimated by the ratio of two rate ratios (RRs): RR after initiation of treatment and the RR prior to initiation of treatment.
PS calibration (N = 1) Combines PS and regression calibration to address confounding by variables unobserved in the main study by using variables observed in a validation study.
RD (N = 4) Method used for policy analysis. People slightly below and above the threshold for being exposed to a treatment are compared.


Varga, A. N., Guevara Morel, A. E., Lokkerbol, J., van Dongen, J. M., van Tulder, M. W., & Bosmans, J. E. (2022). Dealing with confounding in observational studies: A scoping review of methods evaluated in simulation studies with single‐point exposure. Statistics in Medicine.

Sample size determination for propensity score weighting

If you’re using propensity score weighting (e.g., inverse probability weighting), one question that will arise is how big a sample you need.

Solutions have been proposed that rely on a variance inflation factor (VIF). You calculate the sample size for a simple design and then multiply that by the VIF to take account of weighting.

But the problem is that it is difficult to choose a VIF in advance.

Austin (2021) has developed a simple method (R code in the paper) to estimate VIFs from c-statistics (area under the curve; AOC) of the propensity score models. These c-statistics are often published.

A larger c-statistic means a greater separation between treatment and control, which in turn leads to a larger VIF and requirement for a larger sample.

Picture illustrating different c-statistics.

The magnitude of the VIF also depends on the estimand of interest, e.g., whether average treatment effect (ATE), average treatment effect on the treated (ATET/ATT), or average treatment effect where treat and control overlap (ATO).


Austin, P. C. (2021). Informing power and sample size calculations when using inverse probability of treatment weighting using the propensity score. Statistics in Medicine.