Inside every matching study

A potentially useful one-sentence(!) intervention for making a case to run a statistical matching evaluation rather than a randomised controlled trial:

“Matching can be thought of as a technique for finding approximately ideal experimental data hidden within an observational data set.”

– King, G., & Nielsen, R. (2019, p. 442) [Why Propensity Scores Should Not Be Used for Matching. Political Analysis, 27(4), 435–454]


Carol Fitz-Gibbon (1938 – 2017), author of first description of theory-based evaluation, on importance of RCTs

“[…] I produced the first description of theory based evaluation […]. The point of theory based evaluation is to see, firstly, to what extent the theory is being implemented and, secondly, if the predicted outcomes then follow. It is particularly useful as an interim measure of implementation when the outcomes cannot be measured until much later. But most (if not all) theories in social science are only sets of persuasively stated hypotheses that provide a temporary source of guidance. In order to see if the hypotheses can become theories one must measure the extent to which the predicted outcomes are achieved. This requires randomised controlled trials. Even then the important point is to establish the direction and magnitude of the causal relation, not the theory. Many theories can often fit the same data.”

Fitz-Gibbon, C. T. (2002). Researching outcomes of educational interventions. BMJ, 324(7346), 1155.

Beautiful friendships have been jeopardised

This is an amusing opening to a paper on face validity, by Mosier (1947):

“Face validity is a term that is bandied about in the field of test construction until it seems about to become a part of accepted terminology. The frequency of its use and the emotional reaction which it arouses-ranging almost from contempt to highest approbation-make it desirable to examine its meaning more closely. When a single term variously conveys high praise or strong condemnation, one suspects either ambiguity of meaning or contradictory postulates among those using the term. The tendency has been, I believe, to assume unaccepted premises rather than ambiguity, and beautiful friendships have been jeopardized when a chance remark about face validity has classed the speaker among the infidels.”

I think dozens of beautiful friendships have been jeopardized by loose talk about randomised controlled trials, theory-based evaluation, realism, and positivism, among many others. I’ve just seen yet another piece arguing that you wouldn’t evaluate a parachute with an RCT and I can’t even.


Mosier, C. I. (1947). A Critical Examination of the Concepts of Face Validity. Educational and Psychological Measurement, 7(2), 191–205.

Applying process tracing to RCTs

Process tracing is an application of Bayes’ theorem to test hypotheses using qualitative evidence.¹ Application areas tend to be complex, e.g., evaluating the outcomes of international aid or determining the causes of a war by interpreting testimony and documents. This post explores what happens if we apply process tracing to a simple hypothetical quantitative study: an RCT that includes a mediation analysis.

Process tracing is often conducted without probabilities, using heuristics such as the “hoop test” or “smoking gun test” that make its Bayesian foundations digestible. Alternatively, probabilities may be made easier to digest by viewing them through verbal descriptors such as those provided by the PHIA Probability Yardstick. Given the simple example we will tackle, I will apply Bayes’ rule directly to point probabilities.

I will assume that there are three mutually exclusive hypotheses:

Null: the intervention has no effect.

Out: the intervention improves outcomes; however, not through the hypothesised mediator (it works but we have no idea how).

Med: the intervention improves the outcome and it does so through the hypothesised mediator.

Other hypotheses I might have included are that the intervention causes harm or that the mediator operates in the opposite direction to that hypothesised. We might also be interested in whether the intervention pushes the mediator in the desired direction without shifting the outcome. But let’s not overcomplicate things.

There are two sources of evidence, estimates of:

Average treatment effect (ATE): I will treat this evidence source as binary: whether there is a statistically significant difference between treatment and control or not (alternative versus null hypothesis). Let’s suppose that the Type I error rate is 5% and power is 80%. This means that if either Out or Med holds, then there is an 80% chance of obtaining a statistically significant effect. If neither holds, then there is a 5% chance of obtaining a statistically significant effect (in error).

Average causal mediation effect (ACME): I will again treat this as binary: is ACME statistically significantly different to zero or not (alternative versus null hypothesis). I will assume that if ATE is significant and Med holds, then there is a 70% chance that ACME will be significant. Otherwise, I will assume a 5% chance (by Type I error).

Note where I obtained the probabilities above. I got the 5% and 80% for free, following conventions for Type I error and power in the social sciences. I arrived at the 70% using finger-in-the-wind: it should be possible to choose a decent mediator based on the prior literature, I reasoned; however, I have seen examples where a reasonable choice of mediator still fails to operate as expected in a highly powered study.

Finally, I need to choose prior probabilities for Null, Out, and Med. Under clinical equipoise, I feel that there should be a 50-50 chance of the intervention having an effect or not (findings from prior studies of the same intervention notwithstanding). Now suppose it does have an effect. I am going to assume there is a 50% chance of that effect operating through the mediator.

This means that

P(Null) = 50%
P(Out) = 25%
P(Med) = 25%

So, P(Out or Med) = 50%, i.e., the prior probabilities are set up to reflect my belief that there is a 50% chance the intervention works somehow.

I’m going to use a Bayesian network to do the sums for me (I used GeNIe Modeler). Here’s the setup:

The lefthand node shows the prior probabilities, as chosen. The righthand nodes show the inferred probabilities of observing the different patterns of evidence.

Let’s now pretend we have concluded the study and observed evidence. Firstly, we are delighted to discover that there is a statistically significant effect of the intervention on outcomes. Let’s update our Bayesian network (note how the Alternative outcome on ATE has been underlined and emboldened):

P(Null) has now dropped to 6% and P(ACME > 0) has risen to 36%. We do not yet have sufficient evidence to distinguish between Out and Med: their probabilities are both 47%.²
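For readers who want to check the network by hand, here is the first update worked through with Bayes’ rule. It uses only the priors and the 5% and 80% figures assigned above; “ATE sig.” is my shorthand for a statistically significant ATE estimate.

\(\displaystyle P(\text{ATE sig.}) = 0.5 \times 0.05 + 0.25 \times 0.8 + 0.25 \times 0.8 = 0.425\),

\(\displaystyle P(\text{Null} \mid \text{ATE sig.}) = \frac{0.5 \times 0.05}{0.425} \approx 0.06\),

\(\displaystyle P(\text{Out} \mid \text{ATE sig.}) = P(\text{Med} \mid \text{ATE sig.}) = \frac{0.25 \times 0.8}{0.425} \approx 0.47\).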

Next, let’s run the mediation analysis. It is also statistically significant:

So, given our initial probability assignments and the pretend evidence observed, we can be 93% sure that the intervention works and does so through the mediator.

If the mediation test had not been statistically significant, then P(Out) would have risen to 69% and P(Med) would have dropped to 22%. If the ATE had been indistinguishable from zero, then P(Null) would have been 83%.
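The same sums can be done without a Bayesian network package. Below is a minimal sketch in plain Python (not GeNIe), assuming only the priors and conditional probabilities stated above. The function names are mine, chosen for illustration.

```python
# Prior probabilities for the three mutually exclusive hypotheses.
PRIOR = {"Null": 0.50, "Out": 0.25, "Med": 0.25}


def p_ate_sig(hyp):
    """P(ATE significant | hypothesis): 5% Type I error, 80% power."""
    return 0.05 if hyp == "Null" else 0.80


def p_acme_sig(hyp, ate_sig):
    """P(ACME significant | hypothesis, ATE result): 70% if Med holds and
    the ATE is significant, otherwise 5% (Type I error)."""
    return 0.70 if (hyp == "Med" and ate_sig) else 0.05


def posterior(ate_sig, acme_sig=None):
    """Posterior over hypotheses given the ATE result and, optionally,
    the ACME result (None means the mediation test is not yet observed)."""
    joint = {}
    for hyp, prior in PRIOR.items():
        p_ate = p_ate_sig(hyp)
        likelihood = p_ate if ate_sig else 1 - p_ate
        if acme_sig is not None:
            p_acme = p_acme_sig(hyp, ate_sig)
            likelihood *= p_acme if acme_sig else 1 - p_acme
        joint[hyp] = prior * likelihood
    total = sum(joint.values())
    return {hyp: p / total for hyp, p in joint.items()}


scenarios = {
    "ATE significant": posterior(True),
    "ATE and ACME significant": posterior(True, True),
    "ATE significant, ACME not": posterior(True, False),
    "ATE not significant": posterior(False),
}
for label, post in scenarios.items():
    print(label, {hyp: f"{p:.0%}" for hyp, p in post.items()})
```

Changing the entries in `PRIOR`, or the 70% figure, is a quick way to see how sensitive the conclusions are to the finger-in-the-wind choices.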

Is this process tracing or simply putting Bayes’ rule to work as usual? Does this example show that RCTs can be theory-based evaluations, since process tracing is a theory-based method, or does the inclusion of a control group rule out that possibility, as Figure 3.1 of the Magenta Book would suggest? I will leave the reader to assign probabilities to each possible conclusion. Let me know what you think.

¹ Okay, I accept that it is controversial to say that process tracing is necessarily an application of Bayes, particularly when no sums are involved. However, to me Bayes’ rule explains in the simplest possible terms why the four tests attributed to Van Evera (1997) [Guide to Methods for Students of Political Science. New York, NY: Cornell University Press.] work. It’s clear why there are so many references to Bayes in the process tracing literature.

² These are all actually conditional probabilities. I have made this implicit in the notation for ease of reading. Hopefully all is clear given the prose.

For example, P(Hyp = Med | ATE = Alternative) = 47%; in other words, the probability of Med given a statistically significant ATE estimate is 47%.

Baseline balance in experiments and quasi-experiments

Baseline balance is important for both experiments and quasi-experiments, just not in the way researchers sometimes believe. Here are excerpts from three of my favourite discussions of the topic.

Don’t test for baseline imbalance in RCTs. Senn (1994, p. 1716):

“… the following are two incontrovertible facts about a randomized clinical trial:

1. over all randomizations the groups are balanced;

2. for a particular randomization they are unbalanced.

Now, no ‘[statistically] significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized…”

Do examine baseline imbalance in quasi-experiments; however, not by using statistical tests. Sample descriptives, such as a difference in means, suffice. Imai et al. (2008, p. 497):

“… from a theoretical perspective, balance is a characteristic of the sample, not some hypothetical population, and so, strictly speaking, hypothesis tests are irrelevant…”

Using p-values from t-tests and similar can lead to erroneous decisions of balance. As you prune a dataset to improve balance, power to detect effects decreases. Imai et al. (2008, p. 497 again):

“Since the values of […] hypothesis tests are affected by factors other than balance, they cannot even be counted on to be monotone functions of balance. The t-test can indicate that balance is becoming better whereas the actual balance is growing worse, staying the same or improving. Although we choose the most commonly used t-test for illustration, the same problem applies to many other test statistics…”
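A minimal sketch of one way this can happen, under a simplifying assumption of my own (not taken from Imai et al.): pruning leaves the difference in means and the standard deviation untouched and only shrinks the groups. The numbers are hypothetical.

```python
import math


def t_statistic(mean_diff, sd, n_per_group):
    """Equal-variance two-sample t statistic with equal group sizes."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return mean_diff / standard_error


# Same imbalance (0.3 SD difference in a covariate) at three sample sizes.
for n in (400, 100, 25):
    print(n, round(t_statistic(0.3, 1.0, n), 2))
```

The imbalance is identical in all three rows, yet the t statistic falls as observations are discarded, so a test-based rule would eventually declare the groups “balanced”.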

If your matching has led to baseline balance, then you’re good, even if the matching model is misspecified. (Though not if you’re missing key covariates, of course.) Rosenbaum (2023, p. 29):

“So far as matching and stratification are concerned, the propensity score and other methods are a means to an end, not an end in themselves. If matching for a misspecified and misestimated propensity score balances x, then that is fine. If by bad luck, the true propensity score failed to balance x, then the match is inadequate and should be improved.”


Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2), 481–502.

Rosenbaum, P. R. (2023). Propensity score. In J. R. Zubizarreta, E. A. Stuart, D. S. Small, & P. R. Rosenbaum, Handbook of Matching and Weighting Adjustments for Causal Inference (pp. 21–38). Chapman and Hall/CRC.

Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13, 1715–1726.

Evaluating What Works, by Dorothy Bishop and Paul Thompson

“Those who work in allied health professions aim to make people’s lives better. Often, however, it is hard to know how effective we have been: would change have occurred if we hadn’t intervened? Is it possible we are doing more harm than good? To answer these questions and develop a body of knowledge about what works, we need to evaluate interventions.

“As we shall see, demonstrating that an intervention has an impact is much harder than it appears at first sight. There are all kinds of issues that can arise to mislead us into thinking that we have an effective treatment when this is not the case. On the other hand, if a study is poorly designed, we may end up thinking an intervention is ineffective when in fact it is beneficial. Much of the attention of methodologists has focused on how to recognize and control for unwanted factors that can affect outcomes of interest. But psychology is also important: it tells us that own human biases can be just as important in leading us astray. Good, objective intervention research is vital if we are to improve the outcomes of those we work with, but it is really difficult to do it well, and to do so we have to overcome our natural impulses to interpret evidence in biased ways.”

(Over here.)


“Randomista mania”, by Thomas Aston

Thomas Aston provides a helpful summary of RCT critiques, particularly in international evaluations.

Waddington, Villar, and Valentine (2022), cited therein, provide a handy review of comparisons between RCT and quasi-experimental estimates of programme effect.

Aston also cites examples of unethical RCTs. One vivid example is an RCT in Nairobi with an arm that involved threatening to disconnect water and sanitation services if landlords didn’t settle debts.

Kharkiv, statistics, and causal inference

As news comes in (14 May 2022) that Ukraine has won the battle of Kharkiv* and Russian troops are withdrawing, it may be of interest to know that a major figure in statistics and causal inference, Jerzy Neyman (1894-1981), trained as a mathematician there 1912-16. If you have ever used a confidence interval or conceptualised causal inference in terms of potential outcomes, then you owe him a debt of gratitude.

“[Neyman] was educated as a mathematician at the University of Kharkov*, 1912-16. After this he became a Lecturer at the Kharkov Institute of Technology with the title of Candidate. When speaking of these years he always stressed his debt to Sergei Bernstein, and his friendship with Otto Struve (later to meet him again in Berkeley). His thesis was entitled ‘Integral of Lebesgue’.” (Kendall et al., 1982)

* Харків (transliterated to Kharkiv) in Ukrainian, Харькoв (transliterated to Kharkov) in Russian.

Efficacy RCTs as survey twins

Surveys attempt to estimate a quantity of a finite population using a probability sample from that population. How people ended up in the population is somebody else’s problem – demographers, perhaps.

Survey participants are sampled at random from this finite population without replacement. Part a of the figure below illustrates. Green blocks denote people who are surveyed and from whom we collect data. Grey blocks denote people we have not surveyed; we would like to infer what their responses would have been, if they had been surveyed too.

RCTs randomly assign participants to treatment or control conditions. This is illustrated in part b of the figure above: green cells denote treatment and purple cells denote control. There are no grey cells since we have gathered information from everyone in the finite population. But in a way, we haven’t really.

An alternative way to view efficacy RCTs that aim to estimate a sample average treatment effect (SATE) is as a kind of survey. This is illustrated in part c. Now the grey cells return.

There is a finite population of people who present for a trial, often with little known about how they ended up in that population – not dissimilarly to the situation for a survey. (But who studies how they end up in a trial – trial demographers?)

Randomly assigning people to conditions generates two finite populations of theoretical twins, identical except for treatment assignment and the consequences thereafter. One theoretical twin receives treatment and the other receives control. But we only obtain the response from one of the twins, i.e., either the treatment or the control twin. (You could also think of these theoretical twins’ outcomes as potential outcomes.)

Looking individually at one of the two theoretical populations, the random assignment to conditions has generated a random sample from that population. We really want to know what the outcome would have been for everyone in the treatment condition, if everyone had been assigned treatment. Similarly for control. Alas, we have to make do with a pair of surveys that sample from these two populations.

Viewing the Table 1 fallacy through the survey twin lens

There is a common practice of testing for differences in covariates between treatment and control. This is the Table 1 fallacy (see also Dean Eckles’s take on whether it really is a fallacy). Let’s see how it can be explained using survey twins.

Firstly, we have a census of covariates for the whole finite population at baseline, so we know with perfect precision what the means are. Treatment and control groups are surveys of the same population, so clearly no statistical test is needed. The sample means in both groups are likely to be different from each other and from the finite population mean of both groups combined. No surprises there: we wouldn’t expect a survey mean to be identical to the population mean. That’s why we report confidence intervals, or use samples large enough that those intervals are very narrow.
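A small simulation makes the point concrete. It is only a sketch: the covariate (`age`), its distribution, the seed, and the equal split into two arms are all hypothetical choices of mine.

```python
import random
import statistics

rng = random.Random(2022)  # arbitrary seed, for reproducibility only
N = 1000
age = [rng.gauss(40, 10) for _ in range(N)]  # hypothetical baseline covariate
population_mean = statistics.fmean(age)      # known exactly: we have a census


def arm_means(covariate, rng):
    """Randomly assign half to treatment; return (treatment mean, control mean)."""
    idx = list(range(len(covariate)))
    rng.shuffle(idx)
    half = len(idx) // 2
    treat = [covariate[i] for i in idx[:half]]
    control = [covariate[i] for i in idx[half:]]
    return statistics.fmean(treat), statistics.fmean(control)


draws = [arm_means(age, rng) for _ in range(2000)]
one_treat, one_control = draws[0]
avg_treat = statistics.fmean([t for t, _ in draws])

print("population mean:", round(population_mean, 2))
print("one randomisation:", round(one_treat, 2), round(one_control, 2))
print("treatment-arm mean averaged over randomisations:", round(avg_treat, 2))
```

In any single randomisation the two arm means differ from each other and from the census mean. Averaged over many randomisations, the treatment-arm mean settles on the census mean. No test is needed to establish either fact.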

What’s the correct analysis of an RCT?

It’s common to analyse RCT data using a linear regression model. The outcome variable is the endpoint, predictors are treatment group and covariates. This is also known as an ANCOVA. This analysis is easy to understand if the trial participants are a simple random sample from some infinite population. But this is not what we have in efficacy trials as modelled by survey twins above. If the total number of participants in the trial is 1000, then we have a finite population of 1000 in the treatment group and a finite population of 1000 in the control group – together, 2000. In total we have 1000 observations, though, split in some proportion between treatment and control.

Following through on this reasoning, it sounds like the correct analysis uses a stratified independent sampling design with two strata, coinciding with treatment and control groups. The strata populations are both 1000, and a finite population correction should be applied accordingly.

It’s a little more complicated, as I discovered in a paper by Reichardt and Gollob (1999), who independently derived results found by Neyman (1923/1990). Their results highlight a wrinkle in the argument when conducting a t-test on two groups for finite populations as described above. This has general implications for analyses with covariates too. The wrinkle is that the two theoretical populations are not independent of each other.

The authors derive the standard error of the mean difference between X and Y as

\(\displaystyle \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}-\left[ \frac{(\sigma_X-\sigma_Y)^2}{N} + \frac{2(1-\rho) \sigma_X \sigma_{Y}}{N} \right]}\),

where \(\sigma_X^2\) and \(\sigma_Y^2\) are the variances of the two groups, \(n_X\) and \(n_Y\) are the observed group sample sizes, and \(N\) is the total sample (the finite population) size. Finally, \(\rho\) is the unobservable correlation between treatment and control outcomes for each participant – unobservable because we only get either the treatment outcome or control outcome for each participant and not both. The terms in square brackets correct for the finite population.

If the variances are equal (\(\sigma_X = \sigma_Y\)) and the correlation \(\rho = 1\), then the correction vanishes (glance back at numerators in the square brackets to see). This is great news if you are willing to assume that treatments have constant effects on all participants (an assumption known as unit-treatment additivity): the same regression analysis that you would use assuming a simple random sample from an infinite population applies.

If the variances are equal and the correlation is 0, then this is the same standard error as in the stratified independent sampling design with two strata described above. Or at least it was for the few examples I tried.

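If the variances differ and the correlation is one, then the second term of the correction vanishes but \((\sigma_X-\sigma_Y)^2/N\) remains. Dropping that remaining term as well gives the standard error used in Welch’s two-sample t-test, which is therefore slightly conservative in this case.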
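The equal-variance cases can be checked numerically. The sketch below codes the formula exactly as written above. The two comparison functions are my own: the stratified one applies a finite population correction of the form \(1 - n/N\) to each stratum, which is an assumption about which convention is in play.

```python
import math


def se_mean_difference(sd_x, sd_y, n_x, n_y, rho):
    """Standard error of the mean difference for a finite population of
    N = n_x + n_y participants, with correlation rho between each
    participant's treatment and control outcomes."""
    N = n_x + n_y
    correction = ((sd_x - sd_y) ** 2 + 2 * (1 - rho) * sd_x * sd_y) / N
    return math.sqrt(sd_x**2 / n_x + sd_y**2 / n_y - correction)


def se_infinite_population(sd_x, sd_y, n_x, n_y):
    """Usual standard error with no finite population correction."""
    return math.sqrt(sd_x**2 / n_x + sd_y**2 / n_y)


def se_two_strata(sd_x, sd_y, n_x, n_y):
    """Two independent strata, each of size N, with a (1 - n/N) correction."""
    N = n_x + n_y
    var_x = sd_x**2 / n_x * (1 - n_x / N)
    var_y = sd_y**2 / n_y * (1 - n_y / N)
    return math.sqrt(var_x + var_y)


# Hypothetical trial: 500 per arm, SD of 10 in both arms.
for rho in (0.0, 0.5, 1.0):
    print(rho, round(se_mean_difference(10, 10, 500, 500, rho), 4))
print("no correction:", round(se_infinite_population(10, 10, 500, 500), 4))
print("two strata:", round(se_two_strata(10, 10, 500, 500), 4))
```

With equal variances, \(\rho = 1\) reproduces the uncorrected standard error and \(\rho = 0\) reproduces the two-strata one. Values in between sit between the two.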

So, which correlation should we use? Reichardt and Gollob (1999) suggest using the reliability of the outcome measure to calculate an upper bound on the correlation. More recently, Aronow, Green, and Lee (2014) proved a result that puts bounds on the correlation based on the observed marginal distribution of outcomes, and provide R code to copy and paste to calculate it. It’s interesting that a problem highlighted a century ago on something so basic – what standard error we should use for an RCT – is still being investigated now.


Aronow, P. M., Green, D. P., & Lee, D. K. K. (2014). Sharp bounds on the variance in randomized experiments. Annals of Statistics, 42, 850–871.

Neyman, J. (1923/1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5, 465-472.

Reichardt, C. S., & Gollob, H. F. (1999). Justifying the Use and Increasing the Power of a t Test for a Randomized Experiment With a Convenience Sample. Psychological Methods, 4, 117–128.


Standard errors of marginal means in an RCT

Randomised controlled trials (RCTs) typically use a convenience sample to estimate the mean effect of a treatment for study participants. Participants are randomly assigned to one of (say) two conditions, and an unbiased estimate of the sample mean treatment effect is obtained by taking the difference of the two conditions’ mean outcomes. The estimand in such an RCT is sometimes called the sample average treatment effect (SATE).

Some papers report a standard error for the marginal mean outcomes in treatment and control groups using the textbook formula

\(\displaystyle \frac{\mathit{SD_g}}{\sqrt{n_g}}\),

where \(\mathit{SD_g}\) is the standard deviation of outcomes in group \(g\) and \(n_g\) the number of observations in that group.

This formula assumes a simple random sample with replacement from an infinite population, so does not work for a convenience sample (see Stephen Senn, A Standard Error). I am convinced, but curious what standard error for each group’s mean would be appropriate, if any. (You could stop here and argue that the marginal group means mean nothing anyway. The whole point of running a trial is to subtract off non-treatment explanations of change such as regression to the mean.)

Let’s consider a two-arm RCT with no covariates and a coin toss determining who receives treatment or control. What standard error would be appropriate for the mean treatment outcome? Let the total sample size be \(N\) and quantities for treatment and control use subscripts \(t\) and \(c\), respectively.

Treatment outcome mean of those who received treatment

If we focus on the mean for the \(n_t\) participants who were assigned to treatment, we have all observations for that group, so the standard error of the mean is 0. This feels like cheating.

Treatment outcome mean of everyone in the sample

Suppose we want to say something about the treatment outcome mean for all \(N\) participants in the trial, not only the \(n_t\) who were assigned to treatment.

To see how to think about this, consider a service evaluation of \(N\) patients mimicking everything about an RCT except that it assigns everyone to treatment and uses a coin toss to determine whether someone is included in the evaluation. This is now a survey of \(n\) participants, rather than a trial. We want to generalise results to the finite population of \(N\) patients from which we sampled.

Since the population is finite and the sampling is done without replacement, the standard error of the mean should be multiplied by a finite population correction,

\(\displaystyle \mathit{FPC} = \sqrt{\frac{N-n}{N-1}}\).

This setup for a survey is equivalent to what we observe in the treatment group of an RCT. Randomly assigning participants to treatment gives us a random sample from a finite population whose sampling frame we have by the end of the trial: all treatment and control participants. So we can estimate the SEM around the mean treatment outcome as:

\(\displaystyle \mathit{SEM_t} = \frac{\mathit{SD_t}}{\sqrt{n_t}} \sqrt{\frac{N-n_t}{N-1}}\).

If, by chance (probability \(1/2^N\)), the coin delivers everyone to treatment, then \(N = n_t\) and the FPC reduces to zero, as does the standard error.
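As a sketch of the calculation, the function below multiplies the textbook SEM by the FPC. The outcome values are made up, and I am assuming \(\mathit{SD_t}\) is the usual sample standard deviation (with \(n_t - 1\) in the denominator).

```python
import math
import statistics


def sem_with_fpc(outcomes, N):
    """SEM of the observed outcomes, treated as a sample drawn without
    replacement from a finite population of size N."""
    n = len(outcomes)
    sd = statistics.stdev(outcomes)      # sample SD (n - 1 denominator)
    fpc = math.sqrt((N - n) / (N - 1))   # finite population correction
    return sd / math.sqrt(n) * fpc


treated = [12, 15, 11, 14, 13]           # hypothetical treatment-arm outcomes
textbook = statistics.stdev(treated) / math.sqrt(len(treated))

print("textbook SEM:", round(textbook, 3))
print("with FPC, N = 10:", round(sem_with_fpc(treated, N=10), 3))
print("with FPC, N = 5:", sem_with_fpc(treated, N=5))
```

The corrected SEM is smaller than the textbook one whenever \(n_t > 1\), and it falls to zero when everyone in the trial is in the treatment arm.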


If the marginal outcome means mean anything, then there are a couple of standard errors you could use, even with a convenience sample. But the marginal means seem irrelevant when the main reason for running an RCT is to subtract off non-treatment explanations of change following treatment.

If you enjoyed this, you may now be wondering what standard error to use when estimating a sample average treatment effect. Try Efficacy RCTs as survey twins.