Quantitative social research – the worst kind, except for all the others

Breznau et al. (2022) asked a group of 161 researchers in 73 teams to analyse the same dataset and test the same hypothesis: greater immigration reduces public support for the welfare state. As we now expect in this genre of the literature, results varied. See the study’s figure below:

So roughly 60% of analyses found a non-statistically significant result. Of the 40% that were statistically significant, 60% found a negative association and 40% found a positive association.

Social scientists are well-versed in the replication crisis and, e.g., the importance of preregistering analyses and not relying too heavily on the findings from any one study.

Mathur et al. (2022) offer a glimmer of hope, though. The variation looks fairly wild when focussing on whether a hypothesis test was statistically significant or not. However, 90% of analyses found that a one-unit increase in immigration was associated with an increase or decrease in public support of less than 4% of a standard deviation – tiny effects!

I also find hope in all the meta-analyses transparently showing biases. It seems that quantitative social science is the most unreliable and difficult to replicate form of social science, except for all the others.


Breznau, N., et al. (2022). Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. PNAS, 119(44), e2203150119.

Mathur, M. B., Covington, C., & VanderWeele, T. (2022, November 22). Variation across analysts in statistical significance, yet consistently small effect sizes. Preprint.

Understanding causal estimands like ATE and ATT


Social policy and programme evaluations often report findings in terms of causal estimands such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT or ATET). An estimand is a quantity we are trying to estimate – but what exactly does that mean? This post explains through simple examples.

Suppose a study has two conditions, treat (=1) and control (=0). Causal estimands are defined in terms of potential outcomes: the outcome if someone had been assigned to treatment, \(Y(1)\), and outcome if someone had been assigned to control, \(Y(0)\).

We only get to see one of those two realised, depending on which condition someone was actually assigned to. The other is a counterfactual outcome. Assume, for a moment, that you are omniscient and can observe both potential outcomes. The treatment effect (TE) for an individual is \(Y(1)-Y(0)\) and, since you are omniscient, you can see it for everyone.

Here is a table of potential outcomes and treatment effects for 10 fictional study participants. A higher score represents a better outcome.

Person Condition Y(0) Y(1) TE
1 1 0 7 7
2 0 3 0 -3
3 1 2 9 7
4 1 1 8 7
5 0 4 1 -3
6 1 3 10 7
7 0 4 1 -3
8 0 8 5 -3
9 0 7 4 -3
10 1 3 10 7

Note the pattern in the table. People who were assigned to treatment have a treatment effect of \(7\) and people who were assigned to control have a treatment effect of \(-3\), i.e., if they had been assigned to treatment, their outcome would have been worse. So everyone in this fictional study was lucky: they were assigned to the condition that led to the best outcome they could have had.

The average treatment effect (ATE) is simply the average of treatment effects: 

\(\displaystyle \frac{7 + (-3) + 7 + 7 + (-3) + 7 + (-3) + (-3) + (-3) + 7}{10}=2\)

The average treatment effect on the treated (ATT or ATET) is the average of treatment effects for people who were assigned to the treatment:

\(\displaystyle \frac{7 + 7 + 7 + 7 + 7}{5}=7\)

The average treatment effect on control (ATC) is the average of treatment effects for people who were assigned to control:

\(\displaystyle \frac{(-3) + (-3) + (-3) + (-3) + (-3)}{5}=-3\)
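The three averages above can be checked with a few lines of Python. This is only a sketch of the omniscient calculation: the tuples are copied from the table, and the names `participants` and `mean` are mine.

```python
# Potential outcomes copied from the table above: (condition, Y(0), Y(1)).
participants = [
    (1, 0, 7), (0, 3, 0), (1, 2, 9), (1, 1, 8), (0, 4, 1),
    (1, 3, 10), (0, 4, 1), (0, 8, 5), (0, 7, 4), (1, 3, 10),
]


def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Individual treatment effects, Y(1) - Y(0), which only the omniscient can see.
effects = [(condition, y1 - y0) for condition, y0, y1 in participants]

ate = mean(te for _, te in effects)                           # everyone
att = mean(te for condition, te in effects if condition == 1)  # treated only
atc = mean(te for condition, te in effects if condition == 0)  # control only

print(ate, att, atc)  # 2.0 7.0 -3.0
```

The only difference between the three estimands is which rows of the table enter the average.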

Alas, we aren’t really omniscient, so in reality we see a table like this:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 9 ?
4 1 ? 8 ?
5 0 4 ? ?
6 1 ? 10 ?
7 0 4 ? ?
8 0 8 ? ?
9 0 7 ? ?
10 1 ? 10 ?

This table highlights the fundamental problem of causal inference and why it is sometimes seen as a missing data problem.

Don’t confuse estimands and methods for estimation

One of the barriers to understanding these estimands is that we are used to taking a between-participant difference in group means to estimate the average effect of a treatment. But the estimands are defined in terms of a within-participant difference between two potential outcomes, only one of which is observed.

The causal effect is a theoretical quantity defined for individual people and it cannot be directly measured.

Here is another example where the causal effect is zero for everyone, so ATT, ATE, and ATC are all zero too:

Person Condition Y(0) Y(1) TE
1 1 7 7 0
2 0 3 3 0
3 1 7 7 0
4 1 7 7 0
5 0 3 3 0
6 1 7 7 0
7 0 3 3 0
8 0 3 3 0
9 0 3 3 0
10 1 7 7 0

However, people have been assigned to treatment and control in such a way that, given the outcomes realised, it appears that treatment is better than control. Here is the table again, this time with the potential outcomes we couldn’t observe removed:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 7 ?
4 1 ? 7 ?
5 0 3 ? ?
6 1 ? 7 ?
7 0 3 ? ?
8 0 3 ? ?
9 0 3 ? ?
10 1 ? 7 ?

So, the average of realised treatment outcomes is 7 and the average of realised control outcomes is 3. The mean difference is then 4. This estimate is biased. The correct answer is zero, but we couldn’t tell from the available data.
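Here is the same comparison as a small Python sketch, with the rows copied from the second example table. The variable names are mine; the point is only to put the naive estimate next to the true ATE that an omniscient analyst would compute.

```python
# Second example table: (condition, Y(0), Y(1)); the effect is zero for everyone.
rows = [
    (1, 7, 7), (0, 3, 3), (1, 7, 7), (1, 7, 7), (0, 3, 3),
    (1, 7, 7), (0, 3, 3), (0, 3, 3), (0, 3, 3), (1, 7, 7),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# What we actually observe: Y(1) for treated people, Y(0) for controls.
treated_outcomes = [y1 for condition, y0, y1 in rows if condition == 1]
control_outcomes = [y0 for condition, y0, y1 in rows if condition == 0]

naive_difference = mean(treated_outcomes) - mean(control_outcomes)
true_ate = mean(y1 - y0 for _, y0, y1 in rows)

print(naive_difference, true_ate)  # 4.0 0.0
```

The gap between the two numbers arises entirely from who ended up in which condition, not from anything the treatment did.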

The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. For other estimators that don’t require random treatment assignment and for other estimands, try Scott Cunningham’s Causal Inference: The Mixtape.

How do you choose between ATE, ATT, and ATC?

Firstly, if you are running a randomised controlled trial, you don’t choose: ATE, ATT, and ATC will be the same. This is because, on average across trials, the characteristics of those who were assigned to treatment or control will be the same.
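To see what "on average across trials" means, here is a sketch that takes the potential outcomes from the first example table and enumerates every way of assigning exactly 5 of the 10 people to treatment. Fixing the treated group at 5 is my assumption (a trial could equally use coin tosses), and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction
from itertools import combinations

# Potential outcomes from the first example table.
y0 = [0, 3, 2, 1, 4, 3, 4, 8, 7, 3]
y1 = [7, 0, 9, 8, 1, 10, 1, 5, 4, 10]
units = range(len(y0))


def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))


ate = mean(y1[i] - y0[i] for i in units)

atts, atcs, mean_differences = [], [], []
for treated in combinations(units, 5):  # assumption: 5 treated, 5 control
    control = [i for i in units if i not in treated]
    atts.append(mean(y1[i] - y0[i] for i in treated))
    atcs.append(mean(y1[i] - y0[i] for i in control))
    # The only quantity computable from observed data in a real trial:
    mean_differences.append(
        mean(y1[i] for i in treated) - mean(y0[i] for i in control)
    )

# Averaged over all 252 equally likely assignments, everything equals the ATE.
print(ate, mean(atts), mean(atcs), mean(mean_differences))  # 2 2 2 2
```

Any single assignment can still give an ATT, ATC, or mean difference that is some way from 2; the equality holds for the averages over assignments.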

So the distinction between these three estimands only matters for quasi-experimental studies, for example where treatment assignment is not under the control of the researcher.

Noah Greifer and Elizabeth Stuart offer a neat set of example research questions to help decide (here lightly edited to make them less medical):

  • ATT: should an intervention currently being offered continue to be offered or should it be withheld?
  • ATC: should an intervention be extended to people who don’t currently receive it?
  • ATE: should an intervention be offered to everyone who is eligible?

How does intention to treat fit in?

The distinction between ATE and ATT is unrelated to the distinction between intention to treat and per-protocol analyses. Intention to treat analysis means we analyse people according to the group they were assigned to, even if they didn’t comply, e.g., by not engaging with the treatment. Per-protocol analysis is a biased analysis that only analyses data from participants who did comply and is generally not recommended.

For instance, it is possible to conduct a quasi-experimental study that uses intention to treat and estimates the average treatment effect on the treated. In this case, ATT might be better called something like average treatment effect for those we intended to treat (ATETWITT). Sadly this term hasn’t yet been used in the literature.


Causal effects are defined in terms of potential outcomes following treatment and following control. Only one potential outcome is observed, depending on whether someone was assigned to treatment or control, so causal effects cannot be directly observed. The fields of statistics and causal inference find ways to estimate these estimands using observable data. The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. Quasi-experimental designs allow the estimation of additional estimands: ATT and ATC.

Privilege hazard

“The problems of gender and racial bias in our information systems are complex, but some of their key causes are plain as day […]. When data teams are primarily composed of people from dominant groups, those perspectives come to exert outsized influence on the decisions being made—to the exclusion of other identities and perspectives. This is not usually intentional; it comes from the ignorance of being on top. We describe this deficiency as a privilege hazard.”

– Catherine D’Ignazio and Lauren F. Klein (2020). Data feminism. MIT Press.


Kharkiv, statistics, and causal inference

As news comes in (14 May 2022) that Ukraine has won the battle of Kharkiv* and Russian troops are withdrawing, it may be of interest to know that a major figure in statistics and causal inference, Jerzy Neyman (1894-1981), trained as a mathematician there 1912-16. If you have ever used a confidence interval or conceptualised causal inference in terms of potential outcomes, then you owe him a debt of gratitude.

“[Neyman] was educated as a mathematician at the University of Kharkov*, 1912-16. After this he became a Lecturer at the Kharkov Institute of Technology with the title of Candidate. When speaking of these years he always stressed his debt to Sergei Bernstein, and his friendship with Otto Struve (later to meet him again in Berkeley). His thesis was entitled ‘Integral of Lebesgue’.” (Kendall et al., 1982)

* Харків (transliterated to Kharkiv) in Ukrainian, Харькoв (transliterated to Kharkov) in Russian.

Census data on trans and non-binary people in Canada

Canada published census data on trans and non-binary people on 27 April 2022. Here’s a table of the values they presented in a pie chart (why a pie chart, Canada?). Individuals in the census were aged 15 or above and living in a private household in May 2021.

Gender N %
Cis man 14,814,230 48.83
Cis woman 15,421,085 50.83
Trans man 27,905 0.09
Trans woman 31,555 0.10
Non binary 41,355 0.14
Total 30,336,130 100.00
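As a quick consistency check, the percentages can be recomputed from the counts. This sketch only uses the numbers in the table; the dictionary and variable names are mine.

```python
# Counts copied from the table above.
counts = {
    "Cis man": 14_814_230,
    "Cis woman": 15_421_085,
    "Trans man": 27_905,
    "Trans woman": 31_555,
    "Non binary": 41_355,
}

total = sum(counts.values())
percentages = {
    gender: round(100 * n / total, 2) for gender, n in counts.items()
}

# Trans and non-binary people combined, as a count and a share of the total.
trans_and_non_binary = (
    counts["Trans man"] + counts["Trans woman"] + counts["Non binary"]
)
combined_share = round(100 * trans_and_non_binary / total, 2)

print(total)
print(percentages)
print(trans_and_non_binary, combined_share)
```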


Efficacy RCTs as survey twins

Surveys attempt to estimate a quantity of a finite population using a probability sample from that population. How people ended up in the population is somebody else’s problem – demographers, perhaps.

Survey participants are sampled at random from this finite population without replacement. Part a of the figure below illustrates. Green blocks denote people who are surveyed and from whom we collect data. Grey blocks denote people we have not surveyed; we would like to infer what their responses would have been, if they had been surveyed too.

RCTs randomly assign participants to treatment or control conditions. This is illustrated in part b of the figure above: green cells denote treatment and purple cells denote control. There are no grey cells since we have gathered information from everyone in the finite population. But in a way, we haven’t really.

An alternative way to view efficacy RCTs that aim to estimate a sample average treatment effect (SATE) is as a kind of survey. This is illustrated in part c. Now the grey cells return.

There is a finite population of people who present for a trial, often with little known about how they ended up in that population – not dissimilarly to the situation for a survey. (But who studies how they end up in a trial – trial demographers?)

Randomly assigning people to conditions generates two finite populations of theoretical twins, identical except for treatment assignment and the consequences thereafter. One theoretical twin receives treatment and the other receives control. But we only obtain the response from one of the twins, i.e., either the treatment or the control twin. (You could also think of these theoretical twins’ outcomes as potential outcomes.)

Looking individually at one of the two theoretical populations, the random assignment to conditions has generated a random sample from that population. We really want to know what the outcome would have been for everyone in the treatment condition, if everyone had been assigned treatment. Similarly for control. Alas, we have to make do with a pair of surveys that sample from these two populations.

Viewing the Table 1 fallacy through the survey twin lens

There is a common practice of testing for differences in covariates between treatment and control. This is the Table 1 fallacy (see also Dean Eckles’s take on whether it really is a fallacy). Let’s see how it can be explained using survey twins.

Firstly, we have a census of covariates for the whole finite population at baseline, so we know with perfect precision what the means are. Treatment and control groups are surveys of the same population, so clearly no statistical test is needed. The sample means in both groups are likely to be different from each other and from the finite population mean of both groups combined. No surprises there: we wouldn’t expect a survey mean to be identical to the population mean. That’s why we use confidence intervals or large samples so that the confidence intervals are very narrow.
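A toy enumeration makes the point concrete. The ages below are made up for illustration, and I assume half the participants go to each arm; the sketch lists every possible split and compares the arms' mean ages.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical baseline covariate (age) for a finite population of 8 people.
ages = [23, 31, 35, 40, 44, 52, 58, 67]
units = range(len(ages))

differences = []
for treated in combinations(units, 4):  # assumption: 4 per arm
    control = [i for i in units if i not in treated]
    treated_mean = Fraction(sum(ages[i] for i in treated), 4)
    control_mean = Fraction(sum(ages[i] for i in control), 4)
    differences.append(treated_mean - control_mean)

# Over all possible randomisations the arms are balanced on average...
average_difference = sum(differences) / len(differences)
# ...but individual randomisations generally are not.
unbalanced_splits = sum(1 for d in differences if d != 0)

print(average_difference)                   # 0
print(unbalanced_splits, len(differences))
```

Both arms are samples from the same eight people, so a difference in a given split tells us about that split, not about the population.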

What’s the correct analysis of an RCT?

It’s common to analyse RCT data using a linear regression model. The outcome variable is the endpoint, predictors are treatment group and covariates. This is also known as an ANCOVA. This analysis is easy to understand if the trial participants are a simple random sample from some infinite population. But this is not what we have in efficacy trials as modelled by survey twins above. If the total number of participants in the trial is 1000, then we have a finite population of 1000 in the treatment group and a finite population of 1000 in the control group – together, 2000. In total we have 1000 observations, though, split in some proportion between treatment and control.

Following through on this reasoning, it sounds like the correct analysis uses a stratified independent sampling design with two strata, coinciding with treatment and control groups. The strata populations are both 1000, and a finite population correction should be applied accordingly.

It’s a little more complicated, as I discovered in a paper by Reichardt and Gollob (1999), who independently derived results found by Neyman (1923/1990). Their results highlight a wrinkle in the argument when conducting a t-test on two groups for finite populations as described above. This has general implications for analyses with covariates too. The wrinkle is, the two theoretical populations are not independent of each other.

The authors derive the standard error of the mean difference between X and Y as

\(\displaystyle \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}-\left[ \frac{(\sigma_X-\sigma_Y)^2}{N} + \frac{2(1-\rho) \sigma_X \sigma_{Y}}{N} \right]}\),

where \(\sigma_X^2\) and \(\sigma_Y^2\) are the variances of the two groups, \(n_X\) and \(n_Y\) are the observed group sample sizes, and \(N\) is the total sample (the finite population) size. Finally, \(\rho\) is the unobservable correlation between treatment and control outcomes for each participant – unobservable because we only get either the treatment outcome or control outcome for each participant and not both. The terms in square brackets correct for the finite population.

If the variances are equal (\(\sigma_X = \sigma_Y\)) and the correlation \(\rho = 1\), then the correction vanishes (glance back at numerators in the square brackets to see). This is great news if you are willing to assume that treatments have constant effects on all participants (an assumption known as unit-treatment additivity): the same regression analysis that you would use assuming a simple random sample from an infinite population applies.

If the variances are equal and the correlation is 0, then this is the same standard error as in the stratified independent sampling design with two strata described above. Or at least it was for the few examples I tried.

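If the variances can be different and the correlation is one, then only the \((\sigma_X-\sigma_Y)^2/N\) part of the correction remains. The standard error is then slightly smaller than the one used in Welch’s two-sample t-test, which omits the correction altogether, and the two coincide when the variances are equal.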
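The special cases are easy to explore numerically. Below is a minimal sketch of the formula as written above; the function name, argument names, and example values (standard deviations of 10 and 20, 500 per arm) are mine, and in practice \(\rho\) is unknown, so this is for building intuition rather than analysis.

```python
import math


def se_mean_difference(sd_x, sd_y, n_x, n_y, n_total, rho):
    """Standard error of the mean difference, per the formula above."""
    sampling = sd_x**2 / n_x + sd_y**2 / n_y
    correction = (
        (sd_x - sd_y) ** 2 + 2 * (1 - rho) * sd_x * sd_y
    ) / n_total
    return math.sqrt(sampling - correction)


# Equal variances, rho = 1: the correction vanishes.
se_constant_effect = se_mean_difference(10, 10, 500, 500, 1000, rho=1)
se_infinite_population = math.sqrt(10**2 / 500 + 10**2 / 500)

# Equal variances, rho = 0: matches two independent strata, each sampled
# from a population of 1000, using the (1 - n/N) form of the correction.
se_uncorrelated = se_mean_difference(10, 10, 500, 500, 1000, rho=0)
se_stratified = math.sqrt(
    10**2 / 500 * (1 - 500 / 1000) + 10**2 / 500 * (1 - 500 / 1000)
)

# Unequal variances, rho = 1: only the (sd_x - sd_y)^2 / N term remains.
se_unequal = se_mean_difference(10, 20, 500, 500, 1000, rho=1)
se_no_correction = math.sqrt(10**2 / 500 + 20**2 / 500)

print(se_constant_effect, se_infinite_population)
print(se_uncorrelated, se_stratified)
print(se_unequal, se_no_correction)
```

Because the correction is subtracted, the uncorrected standard error is never smaller than the corrected one for any \(\rho\) between -1 and 1.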

So, which correlation should we use? Reichardt and Gollob (1999) suggest using the reliability of the outcome measure to calculate an upper bound on the correlation. More recently, Aronow, Green, and Lee (2014) proved a result that puts bounds on the correlation based on the observed marginal distribution of outcomes, and provide R code to copy and paste to calculate it. It’s interesting that a problem highlighted a century ago on something so basic – what standard error we should use for an RCT – is still being investigated now.


Aronow, P. M., Green, D. P., & Lee, D. K. K. (2014). Sharp bounds on the variance in randomized experiments. Annals of Statistics, 42, 850–871.

Neyman, J. (1923/1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5, 465-472.

Reichardt, C. S., & Gollob, H. F. (1999). Justifying the Use and Increasing the Power of a t Test for a Randomized Experiment With a Convenience Sample. Psychological Methods, 4, 117–128.


Standard errors of marginal means in an RCT

Randomised controlled trials (RCTs) typically use a convenience sample to estimate the mean effect of a treatment for study participants. Participants are randomly assigned to one of (say) two conditions, and an unbiased estimate of the sample mean treatment effect is obtained by taking the difference of the two conditions’ mean outcomes. The estimand in such an RCT is sometimes called the sample average treatment effect (SATE).

Some papers report a standard error for the marginal mean outcomes in treatment and control groups using the textbook formula

\(\displaystyle \frac{\mathit{SD_g}}{\sqrt{n_g}}\),

where \(\mathit{SD_g}\) is the standard deviation of outcomes in group \(g\) and \(n_g\) the number of observations in that group.

This formula assumes a simple random sample with replacement from an infinite population, so does not work for a convenience sample (see Stephen Senn, A Standard Error). I am convinced, but curious what standard error for each group’s mean would be appropriate, if any. (You could stop here and argue that the marginal group means mean nothing anyway. The whole point of running a trial is to subtract off non-treatment explanations of change such as regression to the mean.)

Let’s consider a two-arm RCT with no covariates and a coin toss determining who receives treatment or control. What standard error would be appropriate for the mean treatment outcome? Let the total sample size be \(N\) and quantities for treatment and control use subscripts \(t\) and \(c\), respectively.

Treatment outcome mean of those who received treatment

If we focus on the mean for the \(n_t\) participants who were assigned to treatment, we have all observations for that group, so the standard error of the mean is 0. This feels like cheating.

Treatment outcome mean of everyone in the sample

Suppose we want to say something about the treatment outcome mean for all \(N\) participants in the trial, not only the \(n_t\) who were assigned to treatment.

To see how to think about this, consider a service evaluation of \(N\) patients mimicking everything about an RCT except that it assigns everyone to treatment and uses a coin toss to determine whether someone is included in the evaluation. This is now a survey of \(n\) participants, rather than a trial. We want to generalise results to the finite \(N\) from which we sampled.

Since the population is finite and the sampling is done without replacement, the standard error of the mean should be multiplied by a finite population correction,

\(\displaystyle \mathit{FPC} = \sqrt{\frac{N-n}{N-1}}\).

This setup for a survey is equivalent to what we observe in the treatment group of an RCT. Randomly assigning participants to treatment gives us a random sample from a finite population, the sample frame of which we get by the end of the trial: all treatment and control participants. So we can estimate the SEM around the mean treatment outcome as:

\(\displaystyle \mathit{SEM_t} = \frac{\mathit{SD_t}}{\sqrt{n_t}} \sqrt{\frac{N-n_t}{N-1}}\).

If, by chance (probability \(1/2^N\)), the coin delivers everyone to treatment, then \(N = n_t\) and the FPC reduces to zero, as does the standard error.
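As a sketch, here is the formula in Python. The function name and the example numbers (a standard deviation of 10 and a trial of 100 people) are made up for illustration.

```python
import math


def sem_with_fpc(sd, n, population_size):
    """Textbook SEM multiplied by the finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sd / math.sqrt(n) * fpc


# Hypothetical trial: N = 100 participants, 50 assigned to treatment.
sem_half_treated = sem_with_fpc(sd=10, n=50, population_size=100)
sem_textbook = 10 / math.sqrt(50)

# If everyone is assigned to treatment, nothing is left to infer.
sem_all_treated = sem_with_fpc(sd=10, n=100, population_size=100)

print(sem_textbook, sem_half_treated, sem_all_treated)
```

The corrected value sits below the textbook one because half of the finite population has already been observed.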


If the marginal outcome means mean anything, then there are a couple of standard errors you could use, even with a convenience sample. But the marginal means seem irrelevant when the main reason for running an RCT is to subtract off non-treatment explanations of change following treatment.

If you enjoyed this, you may now be wondering what standard error to use when estimating a sample average treatment effect. Try Efficacy RCTs as survey twins.

Sample size determination for propensity score weighting

If you’re using propensity score weighting (e.g., inverse probability weighting), one question that will arise is how big a sample you need.

Solutions have been proposed that rely on a variance inflation factor (VIF). You calculate the sample size for a simple design and then multiply that by the VIF to take account of weighting.

But the problem is that it is difficult to choose a VIF in advance.
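The inflation step itself is simple arithmetic. Here is a sketch with made-up numbers: the simple-design sample size of 200 and the VIF of 1.5 are hypothetical, and the function name is mine.

```python
import math


def inflated_sample_size(n_simple, vif):
    """Scale a simple-design sample size by a variance inflation factor."""
    return math.ceil(n_simple * vif)


# Hypothetical inputs: 200 participants for the simple design, VIF of 1.5.
n_required = inflated_sample_size(200, 1.5)
print(n_required)  # 300
```

Everything hinges on the VIF, which is the part that is hard to pin down before the data are in.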

Austin (2021) has developed a simple method (R code in the paper) to estimate VIFs from c-statistics (area under the curve; AUC) of the propensity score models. These c-statistics are often published.

A larger c-statistic means a greater separation between treatment and control, which in turn leads to a larger VIF and requirement for a larger sample.

Picture illustrating different c-statistics.

The magnitude of the VIF also depends on the estimand of interest, e.g., whether average treatment effect (ATE), average treatment effect on the treated (ATET/ATT), or average treatment effect where treatment and control overlap (ATO).


Austin, P. C. (2021). Informing power and sample size calculations when using inverse probability of treatment weighting using the propensity score. Statistics in Medicine.

Two incontrovertible facts about RCTs

“… the following are two incontrovertible facts about a randomized clinical trial:

1. over all randomizations the groups are balanced;

2. for a particular randomization they are unbalanced.

Now, no [statistically] ‘significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized…”

– Senn (1994, p. 1716)

Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13, 1715–1726.