Dealing with confounding in observational studies

Excellent review of simulation-based evaluations of quasi-experimental methods, by Varga et al. (2022). Also lovely annexes summarising the methods’ assumptions.

Methods for measured confounding covered by the authors (Varga et al., 2022, Table A1):

Method Description of the method
PS matching (N = 47) Treated and untreated individuals are matched based on their propensity score-similarity. After creating comparable groups of treated and untreated individuals the effect of the treatment can be estimated.
IPTW (N = 30) With the help of re-weighting by the inverse probability of receiving the treatment, a synthetic sample is created which is representative of the population and in which treatment assignment is independent of the observed baseline covariates. Over-represented groups are downweighted and underrepresented groups are upweighted.
Overlap weights (N = 4) Overlap weights were developed to overcome the limitations of truncation and trimming for IPTW, when some individual PSs approach 0 or 1.
Matching weights (N = 2) Matching weights are an analogous weighting method to IPTW for when some individual PSs approach 0 or 1.
Covariate adjustment using PS (N = 13) The estimated PS is included as a covariate in a regression model of the outcome on the treatment.
PS stratification (N = 26) First, subjects are grouped into strata based on their PS. The treatment effect is then estimated within each PS stratum, and the ATE is computed as a weighted mean of the stratum-specific estimates.
GAM (N = 1) GAMs provide an alternative to traditional PS estimation by replacing the linear component of a logistic regression with a flexible additive function.
GBM (N = 3) GBM trees provide an alternative to traditional PS estimation, estimating the function of covariates more flexibly than logistic regression by averaging the PSs of small regression trees.
Genetic matching (N = 7) This matching method algorithmically optimizes covariate balance and avoids the process of iteratively modifying the PS model.
Covariate-balancing PS (N = 5) Models treatment assignment while optimizing the covariate balance. The method exploits the dual characteristics of the PS as a covariate balancing score and the conditional probability of treatment assignment.
DR estimation (N = 13) Combines outcome regression with a model for the treatment (eg, weighting by the PS) such that the effect estimator is robust to misspecification of one (but not both) of these models.
AIPTW (N = 8) This estimator achieves the doubly-robust property by combining outcome regression with weighting by the PS.
Stratified DR estimator (N = 1) Hybrid DR method of outcome regression with PS weighting and stratification.
TMLE (N = 2) Semi-parametric double-robust method that allows for flexible estimation using (nonparametric) machine-learning methods.
Collaborative TMLE (N = 1) Data-adaptive estimation method for TMLE.
One step joint Bayesian PS (N = 3) Jointly estimates quantities in the PS and outcome stages.
Two-step Bayesian approach (N = 2) A two-step modelling method that uses a Bayesian PS model in the first step, followed by a Bayesian outcome model in the second step.
Bayesian model averaging (N = 1) Fully Bayesian model averaging approach.
An’s intermediate approach (N = 2) Not fully Bayesian insofar as the outcome equation in An’s approach is frequentist.
G-computation (N = 4) The method interprets counterfactual outcomes as missing data and uses a prediction model to obtain potential outcomes under different treatment scenarios. The entire set of predicted outcomes is then regressed on the treatment to obtain the coefficient of the effect estimate.
Prognostic scores (N = 7) Prognostic scores are considered the prognostic analogue of PS methods. Whereas the PS includes covariates that predict treatment assignment, the prognostic score includes covariates based on their power to predict the response.
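
The IPTW idea summarised above can be sketched in a few lines. This is an illustrative sketch only, not the review authors' code, and the propensity scores used below are invented numbers; in practice they would be estimated, e.g., by logistic regression.

```python
# Minimal IPTW sketch: weight = 1/PS for treated, 1/(1 - PS) for controls,
# so under-represented groups are up-weighted and over-represented groups
# are down-weighted. The PS values here are made up for illustration.

def iptw_weight(treated: bool, ps: float) -> float:
    """Inverse probability of treatment weight for one individual."""
    return 1 / ps if treated else 1 / (1 - ps)

# A treated person with a low PS (unlikely to be treated) gets a large weight:
print(iptw_weight(True, 0.2))   # 5.0
# A control with the same PS is more typical, so gets a modest weight:
print(iptw_weight(False, 0.2))  # 1.25
```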

Methods for unmeasured confounding (Varga et al., 2022, Table A2):

Method Description of the method
IV approach (N = 17) Pseudo-randomization can be achieved using a sufficiently strong instrument. The IV is correlated with the treatment and affects the outcome only through the treatment.
2SLS (N = 11) Linear estimator of the IV method. Uses a linear probability model for binary outcomes and linear regression for continuous outcomes.
2SPS (N = 5) Non-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of the 2SPS procedure: the predicted values of treatment from the first-stage logistic regression of treatment on the IV replace the observed treatment in the second-stage logistic regression.
2SRI (N = 8) Semi-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of the 2SRI procedure: the residuals from the first-stage logistic regression of treatment on the IV are included as covariates in the second-stage logistic regression.
IV based on generalized structural mean model (GSMM) (N = 1) Semi-parametric models that use instrumental variables to identify causal parameters.
Instrumental PS (Matching enhanced IV) (N = 2) Reduces the dimensionality of the measured confounders, but it also deals with unmeasured confounders by the use of an IV.
DiD (N = 7) DiD method uses the assumption that without the treatment the average outcomes for the treated and control groups would have followed parallel trends over time. The design measures the effect of a treatment as the relative change in the outcomes between individuals in the treatment and control groups over time.
Matching combined with DiD (N = 6) Alternative approach to DiD that uses matching to balance the treatment and control groups on pre-treatment outcomes and covariates.
SCM (N = 7) This method constructs a comparator, the synthetic control, as a weighted average of the available control individuals. The weights are chosen to ensure that, prior to the treatment, levels of covariates and outcomes are similar over time to those of the treated unit.
Imperfect SCM (N = 1) Extension of SCM method with relaxed assumptions that allow outcomes to be functions of transitory shocks.
Generalized SCM (N = 2) Combines SC with fixed effects.
Synthetic DiD (N = 1) Combines both unit and time fixed effects; it can be interpreted as a time-weighted version of DiD.
LDV regression approach (N = 1) Adjusts for pre-treatment outcomes and covariates with a parametric regression model. Alternative approach to DiD.
Trend-in-trend (N = 1) The trend-in-trend design examines time trends in outcome as a function of time trends in treatment across strata with different time trends in treatment.
PERR (N = 3) PERR adjustment is a type of self-controlled design in which the treatment effect is estimated by the ratio of two rate ratios (RRs): RR after initiation of treatment and the RR prior to initiation of treatment.
PS calibration (N = 1) Combines PS and regression calibration to address confounding by variables unobserved in the main study by using variables observed in a validation study.
RD (N = 4) Method used for policy analysis. People slightly below and above the threshold for being exposed to a treatment are compared.
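
The parallel-trends logic of DiD, summarised in the table above, amounts to a simple calculation. A minimal sketch, with invented group means for illustration:

```python
# Difference-in-differences sketch with made-up group means.
# Under parallel trends, the control group's change over time estimates
# what the treated group's change would have been without treatment.

def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """DiD effect: treated group's change minus control group's change."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Treated units rise from 10 to 18; controls rise from 8 to 12 over the
# same period. The common trend (+4) is netted out, leaving +4.
print(did_estimate(10, 18, 8, 12))  # 4
```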


Varga, A. N., Guevara Morel, A. E., Lokkerbol, J., van Dongen, J. M., van Tulder, M. W., & Bosmans, J. E. (2022). Dealing with confounding in observational studies: A scoping review of methods evaluated in simulation studies with single‐point exposure. Statistics in Medicine.

Sexual orientation and gender identity: Census 2021 in England and Wales

Hot off the press: Data and supporting information about sexual orientation and gender identity from Census 2021 in England and Wales.

Gender, where different to AGAB:

  • 48,000 (0.10%) identified as a trans man
  • 48,000 (0.10%) identified as a trans woman
  • 30,000 (0.06%) identified as non-binary
  • 18,000 (0.04%) wrote in a different gender identity

Sexuality, where non-het:

  • 748,000 (1.5%), described themselves as gay or lesbian
  • 624,000 (1.3%) described themselves as bisexual
  • 165,000 (0.3%) selected “Other sexual orientation”; the most common write-in responses were:
    • pansexual (112,000, 0.23%)
    • asexual (28,000, 0.06%)
    • queer (15,000, 0.03%)

Loads of tables by geographical region, e.g., LA.

Migration and the Value of Social Networks

I haven’t read this working paper yet – just struck by this dataset:

“We leverage a rich new source of ‘digital trace’ data to provide a detailed empirical perspective on how social networks influence the decision to migrate. These data capture the entire universe of mobile phone activity in Rwanda over a five-year period. Each of roughly one million individuals is uniquely identified throughout the dataset, and every time they make or receive a phone call, we observe their approximate location, as well as the identity of the person they are talking to. From these data, we can reconstruct each subscriber’s 5-year migration trajectory, as well as a detailed picture of their social network before and after migration.”

A Dilemma for the Russo–Williamson Thesis

The Russo–Williamson thesis states that

“in order to establish a causal claim in medicine, one normally needs to establish both that the putative cause and putative effect are appropriately correlated and that there is some underlying mechanism that can account for this correlation.”

Wilde (2022) explores counterexamples to this where a causal claim was accepted before a mechanism was confirmed, e.g.,

  • Deep brain stimulation as a treatment for Parkinson’s disease.
  • Soot as a cause of scrotal cancer before the mechanisms involving benzo[a]pyrene had been established.

Lots to ponder therein, e.g., whether it works to weaken the mechanistic condition to require a plausible mechanism that need not necessarily be established. There are worries that this manoeuvre leads to a thesis that is too weak since, particularly in social science, it is often easy to come up with some kind of plausible mechanism for just about any phenomenon. Read the paper for a proposed solution!


Wilde, M. (2022). A Dilemma for the Russo–Williamson Thesis. Erkenntnis, in press.

Reminiscing about BSSM

I used to run a social science methodology discussion group. Dumping the event list here, since the direction of travel of discussions tends to repeat, e.g., mixing methods, role of theory, sample size, limits of introspection, …

When Info / Readings
Thurs 6 Feb BISR BSSM research network social

This is a joint event with BSSM and the Birkbeck Institute for Social Research. Lunch will be included.

Tues 21 Jan Hilda Weiss, sociologist: A critical theorist of a lesser kind?

Detlef Garz, Hanse-Wissenschaftskolleg (HWK) Institute for Advanced Study

Hilda Weiss (29 August 1900 – 29 May 1981) was a sociologist and one of the first doctoral students at the Institute for Social Research in Frankfurt (joining 1924) which is famous for its role in developing critical theory. She played a central role in designing and running a large study of political views and employment conditions in Germany, 1930, working with Erich Fromm. Given her life, contributions to sociology, and methodological innovations, it seems odd that she has been mostly relegated to the occasional footnote in papers on people like Fromm. This talk will explore her life and contributions to sociology and critical theory.

Please sign up on Eventbrite

Thurs 7 Nov Analytic philosophy as critical theory: what can it do for empirical studies of gender?

Katharine Jenkins, University of Nottingham

Although the distinction between ‘analytic’ and ‘continental’ philosophy is difficult to pinpoint and easy to critique, there is nevertheless a fairly distinct literature that can be thought of as ‘analytic philosophy of social science’. Moreover, critical theory – theory understood as part of an emancipatory social movement – is often seen as part of continental philosophy and not as part of analytic philosophy. Crucially, critical theory involves being in contact with, and responsive to, one or more social justice movements, and developing theoretical tools that are useful for advancing the aims of these movements. In this talk, I explore the possibility for undertaking analytic philosophy of social science as a form of critical theory, with the intention of supplying tools to empirical social science that can aid emancipatory work. Using gender as a case study, I argue that it is possible to use the methods of analytic philosophy to fulfil the aims of critical theory, and that the clarity and precision that analytic philosophy brings can be useful for empirical research. I offer an analytic framework for thinking about social categories or kinds that is suited to projects in critical theory, and I apply this framework to gender in a way that is responsive to transfeminist movements.

Please sign up on Eventbrite

Wed 16 Oct Work in progress: Prediction versus history in political science

Robert Northcott, Philosophy

Robert will introduce a draft of a chapter he is writing on the philosophy of political science. The draft chapter argues that, usually, retrospective testing of wide-scope theories or models will not be appropriate for political science and that forward-looking prediction is required instead. But given the difficulty of the latter, in turn the main actual focus should be on contextual historical work. It then illustrates via a case study what role such a contextual approach leaves for wider-scope theory. It concludes by assessing the scope for political science to offer policy advice.

Weds 24 July Free association

Claudia Lapping, UCL

This session will be a brief introduction to the use of free association as a social research method. You will be invited to try out a couple of exercises: individual free writing and (in pairs) how to encourage free associations in interviews.

Weds 29 May Generalising from case studies

Ylikoski, P. (2018). Mechanism-based theorizing and generalization from case studies. Studies in History and Philosophy of Science Part A. In press, corrected proof.

Fri 12 April The constant comparative method

Quinn, K. G., Murphy, M. K., Nigogosyan, Z., & Petroll, A. E. (2019). Stigma, isolation and depression among older adults living with HIV in rural areas. Ageing and Society, 1–19.

Boeije, H. (2002). A Purposeful Approach to the Constant Comparative Method in the Analysis of Qualitative Interviews. Quality and Quantity, 36, 391–409.

Thurs 7 March Mixing qualitative methods

Cassell & Bishop (2018). Qualitative data analysis: Exploring themes, metaphors and stories. European Management Review.

Clarke, Willis, Barnes, Caddick, Cromby, McDermott & Wiltshire (2015). Analytical pluralism in qualitative research: A meta-study. Qualitative Research in Psychology, 12(2), 182-201.

Wed 23 Jan Telling more than we can know?

Petitmengin, C., Remillieux, A., Cahour, B., & Carter-Thomas, S. (2013). A gap in Nisbett and Wilson’s findings? A first-person access to our cognitive processes. Consciousness and Cognition, 22, 654–669.

Petitmengin, C. (2006). Describing one’s subjective experience in the second person: An interview method for the science of consciousness. Phenomenology and the Cognitive Sciences, 5, 229–269.

Nisbett, R.E. & Wilson, T.D. (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84, 231–259.

Tues 4 Dec What happens when mixed method findings conflict?

Johnson, R.B., Russo, F. & Schoonenboom, J., 2017. Causation in Mixed Methods Research: The Meeting of Philosophy, Science, and Practice. Journal of Mixed Methods Research.

Moffatt, S. et al., 2006. Using quantitative and qualitative data in health services research – what happens when mixed method findings conflict? BMC Health Services Research, 6, p.28.

Mon 12 Nov Launch!

Understanding causal estimands like ATE and ATT


Social policy and programme evaluations often report findings in terms of causal estimands such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT or ATET). An estimand is a quantity we are trying to estimate – but what exactly does that mean? This post explains through simple examples.

Suppose a study has two conditions, treat (=1) and control (=0). Causal estimands are defined in terms of potential outcomes: the outcome if someone had been assigned to treatment, \(Y(1)\), and outcome if someone had been assigned to control, \(Y(0)\).

We only get to see one of those two realised, depending on which condition someone was actually assigned to. The other is a counterfactual outcome. Assume, for a moment, that you are omniscient and can observe both potential outcomes. The treatment effect (TE) for an individual is \(Y(1)-Y(0)\) and, since you are omniscient, you can see it for everyone.

Here is a table of potential outcomes and treatment effects for 10 fictional study participants. A higher score represents a better outcome.

Person Condition Y(0) Y(1) TE
1 1 0 7 7
2 0 3 0 -3
3 1 2 9 7
4 1 1 8 7
5 0 4 1 -3
6 1 3 10 7
7 0 4 1 -3
8 0 8 5 -3
9 0 7 4 -3
10 1 3 10 7

Note the pattern in the table. People who were assigned to treatment have a treatment effect of \(7\) and people who were assigned to control have a treatment effect of \(-3\), i.e., if they had been assigned to treatment, their outcome would have been worse. So everyone in this fictional study was lucky: they were assigned to the condition that led to the best outcome they could have had.

The average treatment effect (ATE) is simply the average of treatment effects: 

\(\displaystyle \frac{7 + -3 + 7 + 7 + -3 + 7 + -3 + -3 + -3 + 7}{10}=2\)

The average treatment effect on the treated (ATT or ATET) is the average of treatment effects for people who were assigned to the treatment:

\(\displaystyle \frac{7 + 7 + 7 + 7 + 7}{5}=7\)

The average treatment effect on control (ATC) is the average of treatment effects for people who were assigned to control:

\(\displaystyle \frac{-3 + -3 + -3 + -3 + -3}{5}=-3\)
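
Under the omniscience assumption, all three averages can be computed directly from the table above. A minimal sketch in Python:

```python
# Potential outcomes for the 10 fictional participants in the table above.
condition = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
y0 = [0, 3, 2, 1, 4, 3, 4, 8, 7, 3]   # Y(0): outcome under control
y1 = [7, 0, 9, 8, 1, 10, 1, 5, 4, 10]  # Y(1): outcome under treatment

# Individual treatment effects: Y(1) - Y(0).
te = [a - b for a, b in zip(y1, y0)]

mean = lambda xs: sum(xs) / len(xs)

ate = mean(te)                                             # everyone
att = mean([t for t, c in zip(te, condition) if c == 1])   # treated only
atc = mean([t for t, c in zip(te, condition) if c == 0])   # controls only

print(ate, att, atc)  # 2.0 7.0 -3.0
```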

Alas we aren’t really omniscient, so in reality we see a table like this:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 9 ?
4 1 ? 8 ?
5 0 4 ? ?
6 1 ? 10 ?
7 0 4 ? ?
8 0 8 ? ?
9 0 7 ? ?
10 1 ? 10 ?

This table highlights the fundamental problem of causal inference and why it is sometimes seen as a missing data problem.

Don’t confuse estimands and methods for estimation

One of the barriers to understanding these estimands is that we are used to taking a between-participant difference in group means to estimate the average effect of a treatment. But the estimands are defined in terms of a within-participant difference between two potential outcomes, only one of which is observed.

The causal effect is a theoretical quantity defined for individual people and it cannot be directly measured.

Here is another example where the causal effect is zero for everyone, so ATT, ATE, and ATC are all zero too:

Person Condition Y(0) Y(1) TE
1 1 7 7 0
2 0 3 3 0
3 1 7 7 0
4 1 7 7 0
5 0 3 3 0
6 1 7 7 0
7 0 3 3 0
8 0 3 3 0
9 0 3 3 0
10 1 7 7 0

However, people have been assigned to treatment and control in such a way that, given the outcomes realised, it appears that treatment is better than control. Here is the table again, this time with the potential outcomes we couldn’t observe removed:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 7 ?
4 1 ? 7 ?
5 0 3 ? ?
6 1 ? 7 ?
7 0 3 ? ?
8 0 3 ? ?
9 0 3 ? ?
10 1 ? 7 ?

So the average of the realised treatment outcomes is 7 and the average of the realised control outcomes is 3. The mean difference is then 4. This estimate is biased: the correct answer is zero, but we couldn’t tell that from the available data.
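
The biased naive estimate from this second table can be reproduced in a few lines. A sketch:

```python
# Observed data from the second example: only the realised outcome is seen,
# i.e., Y(1) for the treated and Y(0) for the controls.
condition = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
observed  = [7, 3, 7, 7, 3, 7, 3, 3, 3, 7]

mean = lambda xs: sum(xs) / len(xs)

treated_mean = mean([y for y, c in zip(observed, condition) if c == 1])
control_mean = mean([y for y, c in zip(observed, condition) if c == 0])

# Naive difference in means is 4, even though the true individual
# treatment effect is 0 for everyone in this example.
print(treated_mean - control_mean)  # 4.0
```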

The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. For other estimators that don’t require random treatment assignment and for other estimands, try Scott Cunningham’s Causal Inference: The Mixtape.

How do you choose between ATE, ATT, and ATC?

Firstly, if you are running a randomised controlled trial, you don’t choose: ATE, ATT, and ATC will be the same. This is because, on average across trials, the characteristics of those who were assigned to treatment or control will be the same.
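
One way to see this is with a small simulation: repeatedly randomise the first table’s 10 people into treatment and control and average the ATT across draws. It converges on the ATE (2), not on the 7 produced by the original lucky assignment. A sketch, under the assumption that exactly 5 people are treated in each draw:

```python
import random

random.seed(1)

# Individual treatment effects from the first table above.
te = [7, -3, 7, 7, -3, 7, -3, -3, -3, 7]

draws = 20000
att_sum = 0.0
for _ in range(draws):
    treated = random.sample(range(10), 5)  # randomly treat 5 of the 10
    att_sum += sum(te[i] for i in treated) / 5

# Across random assignments, the average ATT is close to the ATE of 2.
print(att_sum / draws)
```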

So the distinction between these three estimands only matters for quasi-experimental studies, for example where treatment assignment is not under the control of the researcher.

Noah Greifer and Elizabeth Stuart offer a neat set of example research questions to help decide (here lightly edited to make them less medical):

  • ATT: should an intervention currently being offered continue to be offered or should it be withheld?
  • ATC: should an intervention be extended to people who don’t currently receive it?
  • ATE: should an intervention be offered to everyone who is eligible?

How does intention to treat fit in?

The distinction between ATE and ATT is unrelated to the distinction between intention to treat and per-protocol analyses. Intention to treat analysis means we analyse people according to the group they were assigned to, even if they didn’t comply, e.g., by not engaging with the treatment. Per-protocol analysis is a biased analysis that only analyses data from participants who did comply and is generally not recommended.

For instance, it is possible to conduct a quasi-experimental study that uses intention to treat and estimates the average treatment effect on the treated. In this case, ATT might be better called something like average treatment effect for those we intended to treat (ATETWITT). Sadly this term hasn’t yet been used in the literature.


Causal effects are defined in terms of potential outcomes following treatment and following control. Only one potential outcome is observed, depending on whether someone was assigned to treatment or control, so causal effects cannot be directly observed. The fields of statistics and causal inference find ways to estimate these estimands using observable data. The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. Quasi-experimental designs allow the estimation of additional estimands: ATT and ATC.

>1 million z-values

The distribution of more than one million z-values from Medline (1976–2019).

You need \(|z| > 1.96\) for “statistical significance” at the usual 5% level. This picture suggests a significant problem of papers not being published if that threshold isn’t crossed.
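
The 1.96 threshold is just the 97.5th percentile of the standard normal distribution, recoverable from the Python standard library:

```python
from statistics import NormalDist

# Two-sided test at the 5% level: the critical value is the
# 97.5th percentile of the standard normal distribution N(0, 1).
z_crit = NormalDist().inv_cdf(0.975)
print(round(z_crit, 2))  # 1.96
```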

Source: van Zwet, E. W., & Cator, E. A. (2021). The significance filter, the winner’s curse and the need to shrink. Statistica Neerlandica, 75(4), 437–452.

The value of high quality qualitative research

Here’s an interesting paper (Greenland & Moore, 2021) that used our (Fugard & Potts, 2015) quantitative model for choosing a sample size for a thematic analysis. The authors also had a probability sample – very rare to see in published qualitative research.

Key ingredients: they had a sample frame (students who dropped out of open online university courses and their phone numbers); they wanted a comprehensive typology of reasons for drop out and suggestions for retaining students; and they could complete each interview within an average of 15 minutes (emphasis on average: some must have been longer).

Here are the authors’ conclusions:

“This study’s research design demonstrates the value of using a larger qualitative probability-based sample, in conjunction with in-depth interviewer probing and thematic analysis to investigate non-traditional student dropouts. While prior qualitative research has often used smaller samples (Creswell, 2007), recent studies have highlighted the need for more rigorous sample design to enable subthemes within themes, which is the key purpose of thematic analysis (eg, Nowell et al., 2017). This study’s sample moved beyond simple thematic saturation rationale, with consideration of the level of granularity required (Vasileiou et al., 2018). That is, 226 participants had a 99% probability of capturing all relevant dropout reason subthemes, down to a 5% incidence level or frequency of occurrence (Fugard & Potts, 2015). This study therefore presents a definitive typology of non-traditional student dropout in open online education.”

It’s exciting to see a rigorous and yet pragmatic qualitative study.
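
The simplest version of the binomial logic behind such sample-size claims can be sketched as follows. This is an illustrative single-theme simplification, not the exact calculation in Fugard and Potts (2015) or Greenland and Moore (2021), which handles multiple subthemes and required instance counts; the function name is invented.

```python
import math

def n_for_coverage(incidence: float, confidence: float) -> int:
    """Smallest n such that P(theme appears at least once) >= confidence,
    assuming each participant independently expresses the theme with
    probability `incidence`. Uses 1 - (1 - incidence)**n >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - incidence))

# For a single theme at 5% incidence and 99% confidence:
print(n_for_coverage(0.05, 0.99))  # 90
```

Requiring every one of several subthemes, or more than one instance of each, pushes the required n well above this single-theme figure, which is consistent with the study’s 226 participants.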


Fugard, A. J. B. & Potts, H. W. W. (2015). Supporting thinking on sample sizes for thematic analyses: A quantitative tool. International Journal of Social Research Methodology, 18, 669–684. (There’s an app for that.)

Greenland, S. J., & Moore, C. (2021). Large qualitative sample and thematic analysis to redefine student dropout and retention strategy in open online education. British Journal of Educational Technology.

Intersectionality, in under 200 words

If we try to eliminate pay gaps by monitoring only single characteristics like gender or ethnicity, we can still end up with pay gaps between combinations of characteristics. For example, an organisation could appoint white women and Black men to senior management positions without appointing any Black women.

The idea of an intersection comes from set theory and describes where two sets overlap. For instance, the intersection of the set of Black people and the set of women is the set of Black women.
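
The set-theoretic idea can be written down directly; the names below are invented purely for illustration:

```python
# Set intersection: members belonging to both sets.
women = {"Ada", "Grace", "Nia"}
black_people = {"Nia", "Kwame"}

black_women = women & black_people  # the intersection of the two sets
print(black_women)  # {'Nia'}
```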

Venn diagram illustrating the intersection between women and Black people

Intersectionality is a broad framework that promotes the study and elimination of oppression and exploitation of people in terms of combinations of characteristics.

Is intersectionality a theory, explaining why this form of discrimination occurs? Here’s Patricia Hill Collins (2019, p.51), a leading scholar in this area:

“Every time I encounter an article that identifies intersectionality as a social theory, I wonder what conception of social theory the author has in mind. I don’t assume that intersectionality is already a social theory. Instead, I think a case can be made that intersectionality is a social theory in the making.”


Collins, P. H. (2019). Intersectionality As Critical Social Theory. Duke University Press.