As news comes in (14 May 2022) that Ukraine has won the battle of Kharkiv* and Russian troops are withdrawing, it may be of interest to know that a major figure in statistics and causal inference, Jerzy Neyman (1894-1981), trained as a mathematician there 1912-16. If you have ever used a confidence interval or conceptualised causal inference in terms of potential outcomes, then you owe him a debt of gratitude.

“[Neyman] was educated as a mathematician at the University of Kharkov*, 1912-16. After this he became a Lecturer at the Kharkov Institute of Technology with the title of Candidate. When speaking of these years he always stressed his debt to Sergei Bernstein, and his friendship with Otto Struve (later to meet him again in Berkeley). His thesis was entitled ‘Integral of Lebesgue’.” (Kendall et al., 1982)

* Харків (transliterated to Kharkiv) in Ukrainian, Харькoв (transliterated to Kharkov) in Russian.

If you’re using propensity score weighting (e.g., inverse probability weighting), one question that will arise is how big a sample you need.

Solutions have been proposed that rely on a variance inflation factor (VIF). You calculate the sample size for a simple design and then multiply that by the VIF to take account of weighting.
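Here is a minimal sketch of that two-step calculation. It assumes a two-sided comparison of two means using the usual normal approximation; the effect size, power, and the VIF of 1.5 are made-up values for illustration only, and `n_per_group` is a hypothetical helper rather than anything from a published method.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means
    (normal approximation; effect_size is a standardised mean difference)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

vif = 1.5  # hypothetical variance inflation factor, chosen only for illustration

n_simple = n_per_group(effect_size=0.5)  # simple design, no weighting
n_weighted = n_simple * vif              # inflate to allow for weighting

print(ceil(n_simple), ceil(n_weighted))
```

The inflation is applied before rounding up, so the weighted figure is not simply the rounded simple figure times the VIF.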

But the problem is that it is difficult to choose a VIF in advance.

Austin (2021) has developed a simple method (R code in the paper) to estimate VIFs from c-statistics (area under the curve; AUC) of the propensity score models. These c-statistics are often published.

A larger c-statistic means a greater separation between treatment and control, which in turn leads to a larger VIF and requirement for a larger sample.
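To make "separation" concrete, here is a small sketch that computes a c-statistic directly from its pairwise definition: the proportion of treated-control pairs in which the treated unit has the higher propensity score, counting ties as half. The propensity scores are invented for illustration and `c_statistic` is a hypothetical helper name.

```python
def c_statistic(treated_scores, control_scores):
    """Proportion of (treated, control) pairs where the treated unit has the
    higher propensity score; ties count as half a concordant pair."""
    concordant = 0.0
    pairs = 0
    for t in treated_scores:
        for c in control_scores:
            pairs += 1
            if t > c:
                concordant += 1.0
            elif t == c:
                concordant += 0.5
    return concordant / pairs

# Made-up propensity scores: complete separation versus heavy overlap.
separated = c_statistic([0.7, 0.8, 0.9], [0.1, 0.2, 0.3])
overlapping = c_statistic([0.4, 0.5, 0.6], [0.3, 0.5, 0.7])

print(separated, overlapping)
```

A value near 0.5 means the propensity score model barely distinguishes the groups; a value near 1 means the groups hardly overlap.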

Picture illustrating different c-statistics.

The magnitude of the VIF also depends on the estimand of interest, e.g., whether average treatment effect (ATE), average treatment effect on the treated (ATET/ATT), or average treatment effect where the treatment and control groups overlap (ATO).

Suppose there are two groups in a study: treatment and control. There are two potential outcomes for an individual, \(i\): outcome under treatment, \(Y_i(1)\), and outcome under control, \(Y_i(0)\). Only one of the two potential outcomes can be realised and observed as \(Y_i\).

The treatment effect for an individual is defined as the difference in potential outcomes for that individual:

\(\mathit{TE}_i = Y_i(1) - Y_i(0)\).

Since we cannot observe both potential outcomes for any individual, we usually make do with a sample or population average treatment effect (SATE and PATE). Although these are unobservable (they are the averages of unobservable differences in potential outcomes), they can be estimated. For example, with random treatment assignment, the difference in observed sample mean outcomes for the treatment and control is an unbiased estimator of SATE. If we also have a random sample from the population of interest, then this difference in sample means gives us an unbiased estimate of PATE.
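A simulation makes this easier to see, since in a simulation (and only there) we can generate both potential outcomes for everyone and compare the true SATE with the estimate. The sketch below assumes normally distributed outcomes and an average effect of 2; all numbers and variable names are invented for illustration.

```python
import random

rng = random.Random(2022)  # arbitrary seed so the run is reproducible
n = 10_000

# Both potential outcomes for every unit (never available in real data).
y0 = [rng.gauss(0, 1) for _ in range(n)]
te = [rng.gauss(2, 1) for _ in range(n)]   # individual treatment effects
y1 = [a + b for a, b in zip(y0, te)]
sate = sum(te) / n                         # true sample average treatment effect

# Random assignment reveals exactly one potential outcome per unit.
treated = [rng.random() < 0.5 for _ in range(n)]
obs_treated = [y1[i] for i in range(n) if treated[i]]
obs_control = [y0[i] for i in range(n) if not treated[i]]

# Difference in observed means estimates the SATE.
estimate = sum(obs_treated) / len(obs_treated) - sum(obs_control) / len(obs_control)

print(round(sate, 2), round(estimate, 2))
```

The two printed numbers should be close but not identical: the estimate varies from one random assignment to the next, while the SATE is fixed for the sample.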

Okay, so what happens if we add a mediator? The potential outcome is expanded to depend on both treatment group and mediator value.

Let \(Y_i(t, m)\) denote the potential outcome for \(i\) under treatment \(t\) and with mediator value \(m\).

Let \(M_i(t)\) denote the potential value of the mediator under treatment \(t\).

The total treatment effect for an individual is then:

\(\mathit{TE}_i = Y_i(1, M_i(1)) - Y_i(0, M_i(0))\).

Informally, the idea here is that we calculate the potential outcome under treatment, with the mediator value as it is under treatment, and subtract from that the potential outcome under control with the mediator value as it is under control.

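The causal mediation effect (CME) is what we get when we hold the treatment assignment constant, but work out the difference in potential outcomes when the mediators are set to values they have under treatment and control:

\(\mathit{CME}_i(t) = Y_i(t, M_i(1)) - Y_i(t, M_i(0))\).

The direct effect (DE) is what we get when we instead hold the mediator constant at the value it takes under treatment \(t\), and work out the difference in potential outcomes between treatment and control:

\(\mathit{DE}_i(t) = Y_i(1, M_i(t)) - Y_i(0, M_i(t))\).

ACME and ADE are the averages of these effects. Again, since they are defined in terms of potential values (of outcome and mediator), they cannot be directly observed, but – given some assumptions – there are estimators.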

Baron and Kenny (1986) provide an estimator in terms of regression equations. I’ll focus on two of their steps and assume there is no need to adjust for any covariates. I’ll also assume that there is no interaction between treatment and mediator.

First, regress the mediator (\(m\)) on the binary treatment indicator (\(t\)):

\(m = \alpha_1 + \beta_1 t\).

The slope \(\beta_1\) tells us how much the mediator changes between the two treatment conditions on average.

Second, regress the outcome (\(y\)) on both mediator and treatment indicator:

\(y = \alpha_2 + \beta_2 t + \beta_3 m\).

The slope \(\beta_2\) provides the average direct effect (ADE), since this model holds the mediator constant (note how this mirrors the definition of DE in terms of potential outcomes).

Now to work out the average causal mediation effect (ACME), we need to wiggle the outcome by however much the mediator moves between treatment and control, whilst holding the treatment group constant. Slope \(\beta_1\) tells us how much the treatment shifts the mediator. Slope \(\beta_3\) tells us how much the outcome increases for every unit increase in the mediator, holding treatment constant. So \(\beta_1 \beta_3\) is ACME.
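The two regressions can be tried out on simulated data where the true effects are known. This is only a sketch: the data-generating values (a treatment-to-mediator slope of 0.8, a mediator-to-outcome slope of 1.5, and a direct effect of 0.5) are invented, and `ols` is a small hypothetical helper that solves the normal equations so that nothing beyond the Python standard library is needed.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(1986)  # arbitrary seed
n = 10_000
t = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
m = [1.0 + 0.8 * ti + rng.gauss(0, 1) for ti in t]
y = [2.0 + 0.5 * ti + 1.5 * mi + rng.gauss(0, 1) for ti, mi in zip(t, m)]

# Step 1: mediator on treatment. Step 2: outcome on treatment and mediator.
alpha1, beta1 = ols([[1.0, ti] for ti in t], m)
alpha2, beta2, beta3 = ols([[1.0, ti, mi] for ti, mi in zip(t, m)], y)

ade = beta2           # average direct effect
acme = beta1 * beta3  # average causal mediation effect

print(f"ADE ~ {ade:.2f} (simulated truth 0.5), ACME ~ {acme:.2f} (simulated truth 1.2)")
```

The simulation builds in the same assumptions as the text: no covariates, no treatment-mediator interaction, and linear effects. The product-of-slopes estimate should not be expected to recover ACME when those fail.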

For more, especially on complicating the Baron and Kenny approach, see Imai et al. (2010).

References

Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173–1182.

Imai, K., Keele, L., & Yamamoto, T. (2010). Identification, Inference and Sensitivity Analysis for Causal Mediation Effects. Statistical Science, 25, 51–71.

Elizabeth Stuart et al. (2021) reviewed 206 articles using mediation analysis “in top academic psychiatry or psychology journals” from 2013-2018, to determine how many satisfied assumptions of mediation analysis.

Here are the headline results (% of papers):

(The figure for the assumption of no interaction between exposure and mediator is a percentage of the 97% of studies that used the Baron and Kenny approach.)

Although 42% of studies discussed mediation assumptions, “in most cases this discussion was simply an acknowledgement that the data were cross sectional and thus results should be interpreted with caution.”

“The tendency of empiricism, unchecked, is always anti-realist; it has a strong tendency to degenerate into some form of verificationism: to treat the question of what there is (and even the question of what we can – intelligibly – talk about) as the same question as the question of what we can find out, or know for certain; to reduce questions of metaphysics and ontology to questions of epistemology.”
—Strawson, G. (1987, p. 267)

Strawson, G. (1987). Realism and causation. The Philosophical Quarterly, 37, 253–277.

“The positivist picture of the structure of scientific theories is now widely rejected. But the underlying idea that scientific theories are primarily designed to predict and explain claims about what we observe remains enormously influential, even among the sharpest critics of positivism.” (p. 304)

“Phenomena are detected through the use of data, but in most cases are not observable in any interesting sense of that term. Examples of data include bubble chamber photographs, patterns of discharge in electronic particle detectors and records of reaction times and error rates in various psychological experiments. Examples of phenomena, for which the above data might provide evidence, include weak neutral currents, the decay of the proton, and chunking and recency effects in human memory.” (p. 306)

“Our general thesis, then, is that we need to distinguish what theories explain (phenomena or facts about phenomena) from what is uncontroversially observable (data).” (p. 314)

Bogen, J., & Woodward, J. (1988). Saving the phenomena. The Philosophical Review, XCVII(3), 303–352.

‘A mechanism is one of the processes in a concrete system that makes it what it is—for example, metabolism in cells, interneuronal connections in brains, work in factories and offices, research in laboratories, and litigation in courts of law. Because mechanisms are largely or totally imperceptible, they must be conjectured. Once hypothesized they help explain, because a deep scientific explanation is an answer to a question of the form, “How does it work, that is, what makes it tick—what are its mechanisms?”’ (p. 182; abstract)

‘Consider the well-known law-statement, “Taking ‘Ecstasy’ causes euphoria,” which makes no reference to any mechanisms. This statement can be analyzed as the conjunction of the following two well-corroborated mechanistic hypotheses: “Taking ‘Ecstasy’ causes serotonin excess,” and “Serotonin excess causes euphoria.” These two together explain the initial statement. (Why serotonin causes euphoria is of course a separate question that cries for a different mechanism.)’ (p. 198)

‘How do we go about conjecturing mechanisms? The same way as in framing any other hypotheses: with imagination both stimulated and constrained by data, well-weathered hypotheses, and mathematical concepts such as those of number, function, and equation. […] There is no method, let alone a logic, for conjecturing mechanisms. […] One reason is that, typically, mechanisms are unobservable, and therefore their description is bound to contain concepts that do not occur in empirical data.’ (p. 200)

‘Even the operations of a corner store are only partly overt. For instance, the grocer does not know, and does not ordinarily care to find out, why a customer buys breakfast cereal of one kind rather than another. However, if he cares he can make inquiries or guesses—for instance, that children are likely to be sold on packaging. That is, the grocer may make up what is called a “theory of mind,” a hypothesis concerning the mental processes that end up at the cash register.’ (p. 201)

Bunge, M. (2004). How Does It Work?: The Search for Explanatory Mechanisms. Philosophy of the Social Sciences, 34(2), 182–210.

It is a cliché that randomised controlled trials (RCTs) are the gold standard if you want to evaluate a social policy or intervention and quasi-experimental designs (QEDs) are presumably the silver standard. But often it is not possible to use either, especially for complex policies. Theory-Based Evaluation is an alternative that has been around for a few decades, but what exactly is it?

In this post I will sketch out what some key texts say about Theory-Based Evaluation; explore one approach, contribution analysis; and conclude with discussion of an approach to assessing evidence in contribution analyses (and a range of other approaches) using Bayes’ rule.

theory (lowercase)

Let’s get the obvious out of the way. All research, evaluation included, is “theory-based” by necessity, even if an RCT is involved. Outcome measures and interviews alone cannot tell us what is going on; some sort of theory (or story, account, narrative, …) – however flimsy or implicit – is needed to design an evaluation and interpret what the data means.

If you are evaluating a psychological therapy, then you probably assume that attending sessions exposes therapy clients to something that is likely to be helpful. You might make assumptions about the importance of the therapeutic relationship to clients’ openness, of any homework activities carried out between sessions, etc. RCTs can include statistical mediation tests to determine whether the various things that happen in therapy actually explain any difference in outcome between a therapy and comparison group (e.g., Freeman et al., 2015).

It is great if a theory makes accurate predictions, but theories are underdetermined by evidence, so this cannot be the only criterion for preferring one theory’s explanation over another (Stanford, 2017) – again, even if you have an effect size from an RCT. Lots of theories will be compatible with any RCT’s results. To see this, try a particular social science RCT and think hard about what might be going on in the intervention group beyond what the intervention developers have explicitly intended.

In addition to accuracy, Kuhn (1977) suggests that a good theory should be consistent with itself and other relevant theories; have broad scope; bring “order to phenomena that in its absence would be individually isolated”; and it should produce novel predictions beyond current observations. There are no obvious formal tests for these properties, especially where theories are expressed in ordinary language and box-and-arrow diagrams.

Theory-Based Evaluation (title case)

Theory-Based Evaluation is a particular genre of evaluation that includes realist evaluation and contribution analysis. According to the UK government’s Magenta Book (HM Treasury, 2020, p. 43), Theory-Based methods of evaluation

“can be used to investigate net impacts by exploring the causal chains thought to bring about change by an intervention. However, they do not provide precise estimates of effect sizes.”

The Magenta Book acknowledges (p. 43) that “All evaluation methods can be considered and used as part of a [Theory-Based] approach”; however, Figure 3.1 (p. 47) is clear. If you can “compare groups affected and not affected by the intervention”, you should go for experiments or quasi-experiments; otherwise, Theory-Based methods are required.

Theory-Based Evaluation attempts to draw causal conclusions about a programme’s effectiveness in the absence of any comparison group. If a quasi-experimental design (QED) or randomised controlled trial (RCT) were added to an evaluation, it would cease to be Theory-Based Evaluation, as the title case term is used.

Example: Contribution analysis

Contribution analysis is an approach to Theory-Based Evaluation developed by John Mayne (28 November 1943 – 18 December 2020). Mayne was originally concerned with how to use monitoring data to decide whether social programmes actually worked when quasi-experimental approaches were not feasible (Mayne, 2001), but the approach evolved to have broader scope.

According to a recent summary (Mayne, 2019), contribution analysis consists of six steps (and an optional loop):

Step 1: Set out the specific cause-effect questions to be addressed.

Step 2: Develop robust theories of change for the intervention and its pathways.

Step 3: Gather the existing evidence on the components of the theory of change model of causality: (i) the results achieved and (ii) the causal link assumptions realized.

Step 4: Assemble and assess the resulting contribution claim, and the challenges to it.

Step 5: Seek out additional evidence to strengthen the contribution claim.

Step 6: Revise and strengthen the contribution claim.

Step 7: Return to Step 4 if necessary.

Here is a diagrammatic depiction of the kind of theory of change that could be plugged in at Step 2 (Mayne, 2015, p. 132), which illustrates the cause-effect links an evaluation would aim to evaluate.

In this example, mothers are thought to learn from training sessions and materials, which then persuades them to adopt new feeding practices. This leads to children having more nutritious diets. The theory is surrounded by various contextual factors such as food prices. (See also Mayne, 2017, for a version of this that includes ideas from the COM-B model of behaviour.)

Step 4 is key. It requires evaluators to “Assemble and assess the resulting contribution claim”. How are we to carry out that assessment? Mayne (2001, p. 14) suggests some questions to ask:

“How credible is the story? Do reasonable people agree with the story? Does the pattern of results observed validate the results chain? Where are the main weaknesses in the story?”

For me, the most credible stories would include experimental or quasi-experimental tests, with mediation analysis of key hypothesised mechanisms, and qualitative detective work to get a sense of what’s going on beyond the statistical associations. But the quant part of that would lift us out of the Theory-Based Evaluation wing of the Magenta Book flowchart. In general, plausibility will be determined outside contribution analysis in, e.g., quality criteria for whatever methods for data collection and analysis were used. Contribution analysis says remarkably little on this key step.

Although contribution analysis is intended to fill a gap where no comparison group is available, Mayne (2001, p. 18) suggests that further data might be collected to help rule out alternative explanations of outcomes, e.g., from surveys, field visits, or focus groups. He also suggests reviewing relevant meta-analyses, which could (I presume) include QED and RCT evidence.

It is not clear to me what the underlying theory of causation is in contribution analysis. It is clear what it is not (Mayne, 2019, pp. 173–4):

“In many situations a counterfactual perspective on causality—which is the traditional evaluation perspective—is unlikely to be useful; experimental designs are often neither feasible nor practical…”

“[Contribution analysis] uses a stepwise (generative) not a counterfactual approach to causality.”

(We will explore counterfactuals below.) I can guess what this generative approach could be, but Mayne does not provide precise definitions. It clearly isn’t the idea from generative social science in which causation is defined in terms of computer simulations (Epstein, 1999).

One way to think about it might be in terms of mechanisms: “entities and activities organized in such a way that they are responsible for the phenomenon” (Illari & Williamson, 2011, p. 120). We could make this precise by modelling the mechanisms using causal Bayesian networks such that variables (nodes in a network) represent the probability of activities occurring, conditional on temporally earlier activities having occurred – basically, a chain of probabilistic if-thens.

Why do people get vaccinated for Covid-19? Here is the beginning of a (generative?) if-then theory:

If you learned about vaccines in school and believed what you learned and are exposed to an advert for Covid-19 jab and are invited by text message to book an appointment for one, then (with a certain probability) you use your phone to book an appointment.

If you have booked an appointment, then (with a certain probability) you travel to the vaccine centre in time to attend the appointment.

If you attend the appointment, then (with a certain probability) you are asked to join a queue.

… and so on …

In a picture:

This does not explain how or why the various entities (people, phones, etc.) and activities (doing stuff like getting the bus as a result of beliefs and desires) are organised as they are, just the temporal order in which they are organised and dependencies between them. Maybe this suffices.
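As a toy version of the chain of probabilistic if-thens, the sketch below multiplies conditional probabilities along the vaccination story. The probabilities are invented, and the calculation assumes each step can only be reached through the previous one and depends only on that previous step.

```python
# Hypothetical conditional probabilities for each link in the chain.
chain = [
    ("books appointment | learned about vaccines, saw advert, got text", 0.6),
    ("attends appointment | booked appointment", 0.9),
    ("joins queue | attended appointment", 0.95),
]

p = 1.0
for step, prob in chain:
    p *= prob  # probability of getting this far along the chain
    event = step.split(" | ")[0]
    print(f"P({event}) = {p:.3f}")

p_queue = p  # probability of reaching the final step, given the starting conditions
```

Even with fairly high probabilities at each link, the probability of reaching the end of a long chain shrinks quickly, which is one reason to look hard at the weakest links in a theory of change.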

What are counterfactual approaches?

Counterfactual impact evaluation usually refers to quantitative approaches to estimate average differences as understood in a potential outcomes framework (or generalisations thereof). The key counterfactual is something like:

“If the beneficiaries had not taken part in programme activities, then they would not have had the outcomes they realised.”

Logicians have long worried how to determine the truth of counterfactuals, “if A had been true, B.” One approach, due to Stalnaker (1968), proposes that you:

Start with a model representing your beliefs about the factual situation where A is false. This model must have enough structure so that tweaking it could lead to different conclusions (causal Bayesian networks have been proposed; Pearl, 2013).

Add A to your belief model.

Modify the belief model in a minimal way to remove contradictions introduced by adding A.

Determine the truth of B in that revised belief model.

This broader conception of counterfactual seems compatible with any kind of evaluation, contribution analysis included. White (2010, p. 157) offered a helpful intervention, using the example of a pre-post design where the same outcome measure is used before and after an intervention:

“… having no comparison group is not the same as having no counterfactual. There is a very simple counterfactual: what would [the outcomes] have been in the absence of the intervention? The counterfactual is that it would have remained […] the same as before the intervention.”

The counterfactual is untested and could be false – regression to the mean would scupper it in many cases. But it can be stated and used in an evaluation. I think Stalnaker’s approach is a handy mental trick for thinking through the implications of evidence and producing alternative explanations.

Cook (2000) offers seven reasons why Theory-Based Evaluation cannot “provide the valid conclusions about a program’s causal effects that have been promised.” I think from those seven, two are key: (i) it is usually too difficult to produce a theory of change that is comprehensive enough for the task and (ii) the counterfactual remains theoretical – in the arm-chair, untested sense of theoretical – so it is too difficult to judge what would have happened in the absence of the programme being evaluated. Instead, Cook proposes including more theory in comparison group evaluations.

Bayesian contribution tracing

Contribution analysis has been supplemented with a Bayesian variant of process tracing (Befani & Mayne, 2014; Befani & Stedman-Bryce, 2017; see also Fairfield & Charman, 2017, for a clear introduction to Bayesian process tracing more generally).

The idea is that you produce (often subjective) probabilities of observing particular (usually qualitative) evidence under your hypothesised causal mechanism and under one or more alternative hypotheses. These probabilities and prior probabilities for your competing hypotheses can then be plugged into Bayes’ rule when evidence is observed.

Suppose you have two competing hypotheses: a particular programme led to change versus pre-existing systems. You may begin by assigning them equal probability, 0.5 and 0.5. If relevant evidence is observed, then Bayes’ rule will shift the probabilities so that one becomes more probable than the other.

Process tracers often cite Van Evera’s (1997) tests such as the hoop test and smoking gun. I find definitions of these challenging to remember so one thing I like about the Bayesian approach is that you can think instead of specificity and sensitivity of evidence, by analogy with (e.g., medical) diagnostic tests. A good test of a causal mechanism is sensitive, in the sense that there is a high probability of observing the relevant evidence if your causal theory is accurate. A good test is also specific, meaning that the evidence is unlikely to be observed if any alternative theory is true. See below for a table (lightly edited from Befani & Mayne, 2014, p. 24) showing the conditional probabilities of evidence for each of Van Evera’s tests given a hypothesis and alternative explanation.

Van Evera test (if Eᵢ is observed)    P(Eᵢ | Hyp)    P(Eᵢ | Alt)
Fails hoop test                       Low            —
Passes smoking gun                    —              Low
Doubly-decisive test                  High           Low
Straw-in-the-wind test                High           High

Let’s take the hoop test. This applies to evidence which is unlikely if your preferred hypothesis were true. So if you observe that evidence, the hoop test fails. The test is agnostic about the probability under the alternative hypothesis. Straw-in-the-wind is hopeless for distinguishing between your two hypotheses, but could suggest that neither holds if the test fails. The doubly-decisive test has high sensitivity and high specificity, so provides strong evidence for your hypothesis if it passes.
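Here is a minimal sketch of the update for two competing hypotheses, starting from equal priors of 0.5. The conditional probabilities standing in for “High” and “Low” (0.9, 0.1, 0.05) are invented, as is the value of 0.5 used for the alternative in the hoop test row, where the table leaves it unspecified.

```python
def posterior(prior_hyp, p_e_given_hyp, p_e_given_alt):
    """P(Hyp | E) by Bayes' rule, for two exclusive and exhaustive hypotheses."""
    joint_hyp = prior_hyp * p_e_given_hyp
    joint_alt = (1 - prior_hyp) * p_e_given_alt
    return joint_hyp / (joint_hyp + joint_alt)

prior = 0.5  # equal prior probability for hypothesis and alternative

# Doubly decisive: evidence likely under Hyp (sensitive), unlikely under Alt (specific).
doubly_decisive = posterior(prior, p_e_given_hyp=0.9, p_e_given_alt=0.1)

# Straw in the wind: evidence equally likely under both, so nothing shifts.
straw_in_wind = posterior(prior, p_e_given_hyp=0.9, p_e_given_alt=0.9)

# Failed hoop test: evidence observed that is unlikely under Hyp
# (the 0.5 under Alt is an assumption made only for this example).
failed_hoop = posterior(prior, p_e_given_hyp=0.05, p_e_given_alt=0.5)

print(round(doubly_decisive, 3), round(straw_in_wind, 3), round(failed_hoop, 3))
```

What matters is the ratio of the two conditional probabilities: the further it is from 1, the more a single piece of evidence moves the posterior away from the prior.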

The arithmetic is straightforward if you stick to discrete multinomial variables and use software for conditional independence networks. Eliciting the subjective probabilities for each source of evidence, conditional on each hypothesis, may be less straightforward.

Conclusions

I am with Cook (2000) and others who favour a broader conception of “theory-based” and suggest that better theories should be tested in quantitative comparison studies. However, it is clear that it is not always possible to find a comparison group – colleagues and I have had to make do without (e.g., Fugard et al., 2015). Using Theory-Based Evaluation in practice reminds me of jury service: a team are guided through thick folders of evidence, revisiting several key sections that are particularly relevant, and work hard to reach the best conclusion they can with what they know. There is no convenient effect size to consult, just a shared (to some extent) and informal idea of what intuitively feels more or less plausible (and lengthy discussion where there is disagreement). To my mind, when quantitative comparison approaches are not possible, Bayesian approaches to assessing qualitative evidence are the most compelling way to synthesise qualitative evidence of causal impact and make transparent how this synthesis was done.

Finally, it seems to me that the Theory-Based Evaluation category is poorly named. A better name might be Assumption-Based Counterfactual approaches. Then RCTs and QEDs are Comparison-Group Counterfactual approaches. Both are types of theory-based evaluation and both use counterfactuals; it’s just that approaches using comparison groups gather quantitative evidence to test the counterfactual. However, the term doesn’t quite work since RCTs and QEDs rely on assumptions too… Further theorising needed.

Edited to add: Reichardt’s (2022), The Counterfactual Definition of a Program Effect, is a very promising addition to the literature and, I think, offers a clear way out of the theory-based versus non-theory-based and counterfactual versus not-counterfactual false dichotomies. I’ve blogged about it here.

Kuhn, T. S. (1977). Objectivity, Value Judgment, and Theory Choice. In The Essential Tension: Selected Studies in Scientific Tradition and Change (pp. 320–339). The University of Chicago Press.

“It may seem strange that we are trying to understand causality using causal models, which clearly already encode causal relationships. Our reasoning is not circular. Our aim is not to reduce causation to noncausal concepts but to interpret questions about causes of specific events in fully specified scenarios in terms of generic causal knowledge…” (Halpern & Pearl, 2005).

“It may seem circular to use causal models, which clearly already encode causal information, to define actual causation. Nevertheless, there is no circularity. The models do not directly represent relations of actual causation. Rather, they encode information about what would happen under various possible interventions” (Halpern & Hitchcock, 2015).

References

Halpern, J. Y., & Pearl, J. (2005). Causes and Explanations: A Structural-Model Approach. Part I: Causes. The British Journal for the Philosophy of Science, 56(4), 843–887.

Halpern, J. Y., & Hitchcock, C. (2015). Graded Causation and Defaults. The British Journal for the Philosophy of Science, 66(2), 413–457.

The Neyman–Rubin causal model (see, e.g., Rubin, 2008) has the following elements:

Units, physical entities somewhere/somewhen in spacetime such as someone in Camden Town, London, on a Thursday eve.

Two or more interventions, where one is often considered a “control”, e.g., cognitive behavioural therapy (CBT) as usual for anxiety, and another is considered a “treatment”, e.g., a new chat bot app to alleviate anxiety. The “control” does not have to be (and almost certainly cannot be) “nothing”.

Potential outcomes, which represent outcomes following each intervention (e.g., following treatment and following control) for every unit. Alas, only one potential outcome is realised and observed for a unit, depending on which intervention they actually received. This is what makes causal inference such a challenge.

Zero or more pre-intervention covariates, which are measured for all units.

The causal effect is the difference in potential outcomes between two interventions for a unit, e.g., in levels of anxiety for someone following CBT and following the app intervention. It is impossible to obtain the causal effect for an individual unit since only one potential outcome can be realised.

The assignment mechanism is the conditional probability distribution of being in an intervention group, given covariates and potential outcomes. For randomised experiments, the potential outcomes have no influence on the assignment probability. This assignment mechanism also explains which potential outcomes are realised and which are missing data.

Although the causal effect cannot be obtained for individual units, various average causal effects can be estimated if particular assumptions hold, e.g.,

Sample average treatment effect on the treated (SATT or SATET), which is the mean difference in a pair of potential outcomes (e.g., anxiety following the app minus anxiety following CBT) for those who were exposed to the “treatment” (e.g., the app) in a sample.

Sample average treatment effect (SATE), which is the mean difference between a pair of potential outcomes for everyone in a sample.

How does this work?

Suppose we run a randomised trial where people are assigned to either CBT or app based on the outcome of a coin toss. From each participant’s two potential outcomes, we only observe one depending on which group they were assigned to. But since we randomised, we know the missing data mechanism. It turns out that under a coin toss randomised trial, a good estimate of the average treatment effect is simply the difference between the means in observed outcomes for those assigned to the app and those assigned to CBT.

We can also calculate p-values in a variety of ways. One is to assume a null hypothesis of no difference in potential outcomes in the treatment and control conditions, i.e., the potential outcomes are identical for each participant but may vary between participants. Under this particular “sharp” null, we do not have a missing data problem since we can just use whatever outcome was observed for each participant to fill in the blank for the unobserved potential outcome. Since we know the assignment mechanism, it is possible to work out the distribution of possible mean differences under the null by enumerating all possible random assignments to groups and calculating the mean difference between treatment and control for each (in practice there may be too many, but we can approximate by taking a random subset). Now calculate a p-value by working out the probability of obtaining the actually observed mean difference or larger against this distribution of differences under the null.
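The enumeration can be sketched in a few lines for a tiny trial. The outcomes below are made up, and for simplicity the sketch assumes a design that assigns exactly four of eight participants to the app (rather than independent coin tosses), so that there are only 70 possible assignments to enumerate.

```python
from itertools import combinations

# Made-up outcomes for 8 participants; the first four were assigned to the app.
outcomes = [12, 14, 15, 17, 8, 9, 10, 11]
observed_treated = (0, 1, 2, 3)

def mean_difference(treated_idx):
    """Mean outcome of the treated minus mean outcome of the controls."""
    treated = [outcomes[i] for i in treated_idx]
    control = [y for i, y in enumerate(outcomes) if i not in treated_idx]
    return sum(treated) / len(treated) - sum(control) / len(control)

observed = mean_difference(observed_treated)

# Under the sharp null each participant's outcome is the same whichever group
# they are in, so the statistic can be recomputed for every possible assignment.
null_diffs = [mean_difference(idx)
              for idx in combinations(range(len(outcomes)), 4)]

# One-sided p-value: share of assignments with a difference at least as large.
p_value = sum(d >= observed for d in null_diffs) / len(null_diffs)

print(observed, len(null_diffs), round(p_value, 4))
```

With larger samples the list of assignments becomes too long to enumerate, and a random sample of assignments (e.g., by shuffling the group labels many times) gives an approximation.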

What’s lovely about this potential outcomes approach is that it’s a simple starting point for thinking about a variety of ways for evaluating the impact of interventions. Though working out the consequences, e.g., standard errors for estimators, may be non-trivial.