Special issue dedicated to John Mayne

‘I am honoured to introduce this special issue dedicated to John Mayne, a “thought leader,” “practical thinker,” “bridge builder,” and “scholar practitioner” in the field of evaluation. Guest editors Steffen Bohni Nielsen, Sebastian Lemire, and Steve Montague bring together 14 colleagues whose articles document, analyze, and expand on John’s contributions to evaluation in the Canadian public service as well as his contributions to evaluation theory.’ –Jill A. Chouinard

Canadian Journal of Program Evaluation, Volume 37 Issue 3, March 2023

Regression to the mean

Suppose we were to run an uncontrolled pre-post evaluation of an intervention to alleviate psychological distress. We screen participants for distress and invite those with scores 1.5 SDs or more above the mean to take part. Then, following the intervention, we collect data on distress again to see if it has reduced. The measure we have chosen has a test-retest reliability of 0.8.

Here is a picture of simulated findings (scores have been scaled so that they have a mean of 0 and SD of 1). Red points denote data from people who have been included in the study.

I have set up the simulation so that the intervention had no effect, in the sense that outcomes would have been identical in the absence of the intervention. However, looking at the right hand side, it appears that there has been a reduction in distress of 1.1 SDs – a huge effect. This is highly “statistically significant”, p < .001. What happened?!
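A minimal sketch of this kind of simulation, using only the Python standard library. It is not the code behind the picture: it assumes each score is a stable "true" level plus independent day-to-day noise, scaled so that the pre-post correlation equals the test-retest reliability. The function names are made up for illustration. Dividing the mean change by the SD of baseline scores in the screened group is also an assumption about how the change might be standardised.

```python
import random
import statistics


def simulate_pre_post(n, reliability, cutoff=None, seed=1):
    """Simulate (pre, post) score pairs with no intervention effect.

    Scores have mean 0 and SD 1 in the whole population, and the
    pre-post correlation equals the test-retest reliability.
    Only people with pre >= cutoff are kept (everyone if cutoff is None).
    """
    rng = random.Random(seed)
    true_sd = reliability ** 0.5
    noise_sd = (1 - reliability) ** 0.5
    pairs = []
    for _ in range(n):
        true_score = rng.gauss(0, true_sd)
        pre = true_score + rng.gauss(0, noise_sd)
        post = true_score + rng.gauss(0, noise_sd)  # no treatment effect added
        if cutoff is None or pre >= cutoff:
            pairs.append((pre, post))
    return pairs


def mean_change(pairs):
    """Mean of post minus pre; negative values mean distress went down."""
    return statistics.fmean([post - pre for pre, post in pairs])


def standardised_change(pairs):
    """Mean change divided by the baseline SD of the selected group (an assumption)."""
    baseline_sd = statistics.stdev([pre for pre, _ in pairs])
    return mean_change(pairs) / baseline_sd


everyone = simulate_pre_post(100_000, reliability=0.8)
screened = simulate_pre_post(100_000, reliability=0.8, cutoff=1.5)
print(f"no screening:       mean change = {mean_change(everyone):+.2f}")
print(f"screened at 1.5 SD: mean change = {mean_change(screened):+.2f}")
print(f"screened at 1.5 SD: standardised change = {standardised_change(screened):+.2f}")
```

Under these assumptions the unscreened group shows essentially no change, while the screened group shows a clear drop even though no treatment effect was added to anyone's score.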

Tweaking the simulation

Let’s try a different simulation. This time, without any screening, so everyone is included in the intervention regardless of their levels of distress (so all the data points are red):

Looking at the right hand side, the pre-post change is 0 and p is close to 1. There is no change.

Next, select participants whose scores are at the mean or above:

The pre-post change is now statistically significant again, with improvement of 0.27 SDs.

Select participants with more extreme scores, 1.5 SDs or above at baseline, and we see the magnitude of change has increased again:

What happens if we increase the test-retest reliability of the measure to 0.9?

Firstly, the scatterplot on the left is a little less fuzzy. The magnitude of change has reduced to 0.48 SDs.

Finally, let’s make the measure perfectly reliable so that the scatterplot on the left is a fuzz-free straight line:

Now there is no change.

What’s going on?

I have simulated the data so that the intervention had zero impact on outcomes, and yet for many of the analyses above it does appear to have alleviated distress.

The extent to which the effect illustrated above, called regression to the mean, occurs partly depends on how selective we are in inviting participants to join the study. At one extreme, if there is no selection, then the mean change is still zero. At the other extreme, when we are highly selective, then change is over 1 SD.

This is because by selecting people with particularly high scores at baseline, there’s an increased chance that we include people who had, for them, a statistically rare score. Perhaps they had a particularly bad day, which wasn’t indicative of their general levels of distress. Since we selected them when they happened to have a bad day, on measuring again after the intervention, there was a good chance they had a much less extreme score. But this reduction was entirely unrelated to the intervention. We know this because the simulation was set up so that the intervention had zero effect.

Making test-retest reliability perfect also eliminates regression to the mean. However, this is unlikely to be possible for most of the characteristics of people that are of interest for interventions.
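The roles of selection and reliability can also be worked out without simulating. As a sketch, assume standardised pre and post scores are bivariate normal with correlation \(r\) equal to the test-retest reliability, and that there is no treatment effect. Then \(E[\text{post} \mid \text{pre}] = r \cdot \text{pre}\), so for people selected with baseline at or above a cut-off \(c\), the expected change is \(-(1-r)\,\phi(c)/(1-\Phi(c))\), where \(\phi\) and \(\Phi\) are the standard normal density and distribution function. The function name below is made up. The results are in units of the whole-population SD, so they need not match figures that have been standardised differently.

```python
from statistics import NormalDist


def expected_change(cutoff, reliability):
    """Expected post-minus-pre change for people with baseline >= cutoff.

    Assumes standardised, bivariate-normal pre and post scores whose
    correlation is the test-retest reliability, and no treatment effect.
    """
    z = NormalDist()
    # Mean baseline score among those at or above the cut-off
    mean_selected_baseline = z.pdf(cutoff) / (1 - z.cdf(cutoff))
    return -(1 - reliability) * mean_selected_baseline


for cutoff in (0.0, 1.5):
    for reliability in (0.8, 0.9, 1.0):
        change = expected_change(cutoff, reliability)
        print(f"cutoff {cutoff}, reliability {reliability}: {change:+.2f}")
```

The expression makes both levers visible: a higher cut-off raises the mean baseline score of those selected, and lower reliability raises the factor \(1-r\). With perfect reliability the factor is zero, so the expected change is zero however selective we are.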

You can play around with the app I developed to simulate the data over here.

Regression to the mean is just one reason why interventions can spuriously appear to have an effect. Carefully chosen control groups, where possible with random assignment to intervention or control, can take account of alternative explanations of change.

Evaluating What Works, by Dorothy Bishop and Paul Thompson

“Those who work in allied health professions aim to make people’s lives better. Often, however, it is hard to know how effective we have been: would change have occurred if we hadn’t intervened? Is it possible we are doing more harm than good? To answer these questions and develop a body of knowledge about what works, we need to evaluate interventions.

“As we shall see, demonstrating that an intervention has an impact is much harder than it appears at first sight. There are all kinds of issues that can arise to mislead us into thinking that we have an effective treatment when this is not the case. On the other hand, if a study is poorly designed, we may end up thinking an intervention is ineffective when in fact it is beneficial. Much of the attention of methodologists has focused on how to recognize and control for unwanted factors that can affect outcomes of interest. But psychology is also important: it tells us that own human biases can be just as important in leading us astray. Good, objective intervention research is vital if we are to improve the outcomes of those we work with, but it is really difficult to do it well, and to do so we have to overcome our natural impulses to interpret evidence in biased ways.”

(Over here.)


History repeating in psychedelics research

Interesting draft paper by Michiel van Elk and Eiko Fried on flawed evaluations of psychedelics to treat mental health conditions and how to do better. Neat 1966 quotation at the end:

‘… we urge caution repeating the history of so many hyped treatments in clinical psychology and psychiatry in the last century. For psychedelic research in particular, we are not the first ones to raise concerns and can only echo the warning expressed more than half a century ago:

“To be hopeful and optimistic about psychedelic drugs and their potential is one thing; to be messianic is another. Both the present and the future of psychedelic research already have been grievously injured by a messianism that is as unwarranted as it has proved undesirable”. (Masters & Houston, 1966)’

Tin openers versus dials

Neil Carter (1989, p. 134) on the limits of data dashboards and mindless use of KPIs:

“… the majority of indicators are tin-openers rather than dials: by opening up a ‘can of worms’ they do not give answers but prompt interrogation and inquiry, and by themselves provide an incomplete and inaccurate picture.”

Carter, N. (1989). Performance indicators: ‘Backseat driving’ or ‘hands off’ control? Policy & Politics, 17(2), 131-138.

“Randomista mania”, by Thomas Aston

Thomas Aston provides a helpful summary of RCT critiques, particularly in international evaluations.

Waddington, Villar, and Valentine (2022), cited therein, provide a handy review of comparisons between RCT and quasi-experimental estimates of programme effect.

Aston also cites examples of unethical RCTs. One vivid example is an RCT in Nairobi with an arm that involved threatening to disconnect water and sanitation services if landlords didn’t settle debts.

Understanding causal estimands like ATE and ATT


Social policy and programme evaluations often report findings in terms of causal estimands such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT or ATET). An estimand is a quantity we are trying to estimate – but what exactly does that mean? This post explains through simple examples.

Suppose a study has two conditions, treat (=1) and control (=0). Causal estimands are defined in terms of potential outcomes: the outcome if someone had been assigned to treatment, \(Y(1)\), and outcome if someone had been assigned to control, \(Y(0)\).

We only get to see one of those two realised, depending on which condition someone was actually assigned to. The other is a counterfactual outcome. Assume, for a moment, that you are omniscient and can observe both potential outcomes. The treatment effect (TE) for an individual is \(Y(1)-Y(0)\) and, since you are omniscient, you can see it for everyone.

Here is a table of potential outcomes and treatment effects for 10 fictional study participants. A higher score represents a better outcome.

Person Condition Y(0) Y(1) TE
1 1 0 7 7
2 0 3 0 -3
3 1 2 9 7
4 1 1 8 7
5 0 4 1 -3
6 1 3 10 7
7 0 4 1 -3
8 0 8 5 -3
9 0 7 4 -3
10 1 3 10 7

Note the pattern in the table. People who were assigned to treatment have a treatment effect of \(7\) and people who were assigned to control have a treatment effect of \(-3\), i.e., if they had been assigned to treatment, their outcome would have been worse. So everyone in this fictional study was lucky: they were assigned to the condition that led to the best outcome they could have had.

The average treatment effect (ATE) is simply the average of treatment effects: 

\(\displaystyle \frac{7 + -3 + 7 + 7 + -3 + 7 + -3 + -3 + -3 + 7}{10}=2\)

The average treatment effect on the treated (ATT or ATET) is the average of treatment effects for people who were assigned to the treatment:

\(\displaystyle \frac{7 + 7 + 7 + 7 + 7}{5}=7\)

The average treatment effect on control (ATC) is the average of treatment effects for people who were assigned to control:

\(\displaystyle \frac{-3 + -3 + -3 + -3 + -3}{5}=-3\)
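The three averages can be checked with a few lines of Python. This is only a sketch that hard-codes the ten fictional participants from the omniscient table; the variable and function names are made up for illustration.

```python
from statistics import mean

# (person, condition, Y(0), Y(1)) copied from the omniscient table
people = [
    (1, 1, 0, 7), (2, 0, 3, 0), (3, 1, 2, 9), (4, 1, 1, 8), (5, 0, 4, 1),
    (6, 1, 3, 10), (7, 0, 4, 1), (8, 0, 8, 5), (9, 0, 7, 4), (10, 1, 3, 10),
]


def average_effect(rows):
    """Average of the individual treatment effects Y(1) - Y(0)."""
    return mean(y1 - y0 for _, _, y0, y1 in rows)


ate = average_effect(people)                              # everyone
att = average_effect([r for r in people if r[1] == 1])    # assigned to treatment
atc = average_effect([r for r in people if r[1] == 0])    # assigned to control
print(ate, att, atc)
```

The only difference between the three estimands is which rows are averaged over. The quantity being averaged, \(Y(1)-Y(0)\), is the same in each case.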

Alas, we aren’t really omniscient, so in reality we see a table like this:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 9 ?
4 1 ? 8 ?
5 0 4 ? ?
6 1 ? 10 ?
7 0 4 ? ?
8 0 8 ? ?
9 0 7 ? ?
10 1 ? 10 ?

This table highlights the fundamental problem of causal inference and why it is sometimes seen as a missing data problem.

Don’t confuse estimands and methods for estimation

One of the barriers to understanding these estimands is that we are used to taking a between-participant difference in group means to estimate the average effect of a treatment. But the estimands are defined in terms of a within-participant difference between two potential outcomes, only one of which is observed.

The causal effect is a theoretical quantity defined for individual people and it cannot be directly measured.

Here is another example where the causal effect is zero for everyone, so ATT, ATE, and ATC are all zero too:

Person Condition Y(0) Y(1) TE
1 1 7 7 0
2 0 3 3 0
3 1 7 7 0
4 1 7 7 0
5 0 3 3 0
6 1 7 7 0
7 0 3 3 0
8 0 3 3 0
9 0 3 3 0
10 1 7 7 0

However, people have been assigned to treatment and control in such a way that, given the outcomes realised, it appears that treatment is better than control. Here is the table again, this time with the potential outcomes we couldn’t observe removed:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 7 ?
4 1 ? 7 ?
5 0 3 ? ?
6 1 ? 7 ?
7 0 3 ? ?
8 0 3 ? ?
9 0 3 ? ?
10 1 ? 7 ?

So, if we take the average of the realised treatment outcomes we get 7, and if we take the average of the realised control outcomes we get 3. The mean difference is then 4. This estimate is biased. The correct answer is zero, but we couldn’t tell from the available data.
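Here is the same arithmetic as a short Python sketch. It hard-codes the ten fictional participants from the zero-effect example and contrasts the true ATE, which needs both potential outcomes, with the naive difference in observed group means. The variable names are made up for illustration.

```python
from statistics import mean

# (condition, Y(0), Y(1)) for the ten people in the zero-effect example
people = [
    (1, 7, 7), (0, 3, 3), (1, 7, 7), (1, 7, 7), (0, 3, 3),
    (1, 7, 7), (0, 3, 3), (0, 3, 3), (0, 3, 3), (1, 7, 7),
]

# Only an omniscient observer can compute this
true_ate = mean(y1 - y0 for _, y0, y1 in people)

# In reality only one potential outcome is realised for each person
observed = [(cond, y1 if cond == 1 else y0) for cond, y0, y1 in people]
treated_mean = mean(y for cond, y in observed if cond == 1)
control_mean = mean(y for cond, y in observed if cond == 0)
naive_estimate = treated_mean - control_mean

print(true_ate, naive_estimate)
```

The gap between the two numbers comes entirely from who ended up in which group: people whose outcome would have been 7 either way were all assigned to treatment.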

The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. For other estimators that don’t require random treatment assignment and for other estimands, try Scott Cunningham’s Causal Inference: The Mixtape.

How do you choose between ATE, ATT, and ATC?

Firstly, if you are running a randomised controlled trial, you don’t choose: ATE, ATT, and ATC will be the same. This is because, on average across trials, the characteristics of those who were assigned to treatment or control will be the same.
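To make "on average across trials" concrete, here is a sketch that takes the potential outcomes of the ten fictional participants from the first table and enumerates every way of assigning exactly five of them to treatment. Fixing the split at five and five is an assumption made for illustration, and the names are made up. For each assignment it records the difference in observed group means, plus the ATT and ATC an omniscient observer would compute for that split.

```python
from fractions import Fraction
from itertools import combinations

# Potential outcomes (Y(0), Y(1)) for the ten people in the first table
potential = [(0, 7), (3, 0), (2, 9), (1, 8), (4, 1),
             (3, 10), (4, 1), (8, 5), (7, 4), (3, 10)]
ids = range(len(potential))


def avg(values):
    """Exact average as a fraction, to avoid rounding error."""
    values = list(values)
    return Fraction(sum(values), len(values))


true_ate = avg(y1 - y0 for y0, y1 in potential)

estimates, atts, atcs = [], [], []
for treated in combinations(ids, 5):
    control = [i for i in ids if i not in treated]
    # What a researcher would see: a difference in observed group means
    estimates.append(avg(potential[i][1] for i in treated)
                     - avg(potential[i][0] for i in control))
    # What an omniscient observer would compute for this particular split
    atts.append(avg(potential[i][1] - potential[i][0] for i in treated))
    atcs.append(avg(potential[i][1] - potential[i][0] for i in control))

print(true_ate, avg(estimates), avg(atts), avg(atcs))
print(min(estimates), max(estimates))
```

Averaged over all possible assignments, the difference in means, the ATT, and the ATC all equal the ATE. Any single assignment can still land above or below it, which is what the minimum and maximum show.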

So the distinction between these three estimands only matters for quasi-experimental studies, for example where treatment assignment is not under the control of the researcher.

Noah Greifer and Elizabeth Stuart offer a neat set of example research questions to help decide (here lightly edited to make them less medical):

  • ATT: should an intervention currently being offered continue to be offered or should it be withheld?
  • ATC: should an intervention be extended to people who don’t currently receive it?
  • ATE: should an intervention be offered to everyone who is eligible?

How does intention to treat fit in?

The distinction between ATE and ATT is unrelated to the distinction between intention to treat and per-protocol analyses. Intention to treat analysis means we analyse people according to the group they were assigned to, even if they didn’t comply, e.g., by not engaging with the treatment. Per-protocol analysis is a biased analysis that only analyses data from participants who did comply and is generally not recommended.
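A toy illustration of why per-protocol analysis can mislead. The eight participants below are entirely hypothetical: the treatment changes nobody's outcome, the two arms are balanced, and the people with poorer outcomes in the treatment arm are the ones who did not engage. Everyone in the control arm counts as having complied, since there is nothing for them to drop out of.

```python
from statistics import mean

# Hypothetical randomised trial in which treatment has zero effect.
# Each row: (assigned_to_treatment, complied, observed_outcome)
participants = [
    (1, True, 8), (1, True, 8), (1, False, 2), (1, False, 2),
    (0, True, 8), (0, True, 8), (0, True, 2), (0, True, 2),
]


def mean_outcome(rows):
    return mean(outcome for _, _, outcome in rows)


treat = [p for p in participants if p[0] == 1]
control = [p for p in participants if p[0] == 0]

# Intention to treat: compare everyone according to the arm they were assigned to
itt_estimate = mean_outcome(treat) - mean_outcome(control)

# Per-protocol: drop people who did not comply, then compare
compliant_treat = [p for p in treat if p[1]]
compliant_control = [p for p in control if p[1]]
per_protocol_estimate = mean_outcome(compliant_treat) - mean_outcome(compliant_control)

print(itt_estimate, per_protocol_estimate)
```

Dropping the non-compliers breaks the balance that randomisation created, so in this made-up example the per-protocol comparison shows a benefit that does not exist.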

For instance, it is possible to conduct a quasi-experimental study that uses intention to treat and estimates the average treatment effect on the treated. In this case, ATT might be better called something like average treatment effect for those we intended to treat (ATETWITT). Sadly this term hasn’t yet been used in the literature.


Causal effects are defined in terms of potential outcomes following treatment and following control. Only one potential outcome is observed, depending on whether someone was assigned to treatment or control, so causal effects cannot be directly observed. The fields of statistics and causal inference find ways to estimate these estimands using observable data. The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. Quasi-experimental designs allow the estimation of additional estimands: ATT and ATC.

Theory-based vs. theory-driven evaluation

“Donaldson and Lipsey (2006), Leeuw and Donaldson (2015), and Weiss (1997) noted that there is a great deal of confusion today about what is meant by theory-based or theory-driven evaluation, and the differences between using program theory and social science theory to guide evaluation efforts. For example, the newcomer to evaluation typically has a very difficult time sorting through a number of closely related or sometimes interchangeable terms such as theory-oriented evaluation, theory-based evaluation, theory-driven evaluation, program theory evaluation, intervening mechanism evaluation, theoretically relevant evaluation research, program theory, program logic, logic modeling, logframes, systems maps, and the like. Rather than trying to sort out this confusion, or attempt to define all of these terms and develop a new nomenclature, a rather broad definition is offered in this book in an attempt to be inclusive.

“Program Theory–Driven Evaluation Science is the systematic use of substantive knowledge about the phenomena under investigation and scientific methods to improve, to produce knowledge and feedback about, and to determine the merit, worth, and significance of evaluands such as social, educational, health, community, and organizational programs.”

– Donaldson, S. I. (2022, p. 9). Introduction to Theory-Driven Program Evaluation (2nd ed.). Routledge.

Costs and benefits in policy making

“Costs and benefits should be calculated over the lifetime of the proposal. Proposals involving infrastructure such as roads, railways and new buildings are appraised over a 60 year period. Refurbishment of existing buildings is considered over 30 years. For proposals involving administrative changes a ten year period is used as a standard measure. For interventions likely to have significant costs or benefits beyond 60 years, such as vaccination programmes, or nuclear waste storage, a suitable appraisal period should be discussed with and formally agreed by the Treasury at the start of work on the proposal.”

The Green Book (2022, p. 9)

Emergence and complexity in social programme evaluation

There’s lots of talk of complexity in the world of social programme evaluation with little clarity about what the term means. I thought I’d step back from that and explore ideas of complexity where the definitions are clearer.

One is Kolmogorov complexity:

“the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produces the object as output. It is a measure of the computational resources needed to specify the object.”

For example (mildly edited from the Wikipedia article) compare the following two strings:

abababababababababababababababab
4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

The first string has a short description: “ab 16 times” (11 characters). The second has no description shorter than the text itself (32 characters). So the first string is less complex than the second. (The description of the text or other object would usually be written in a programming language.)
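Kolmogorov complexity itself cannot be computed in general, but the length of a compressed version of a string gives a rough upper-bound proxy. The sketch below uses `zlib` from the Python standard library for this; that is a swapped-in technique, not the definition. The regular string is "ab" repeated 16 times, the irregular one is the 32-character example from the Wikipedia article, and the function name is made up.

```python
import zlib

regular = "ab" * 16
irregular = "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"  # example string from the Wikipedia article


def compressed_length(text):
    """Number of bytes zlib needs at its highest compression level.

    A crude stand-in for Kolmogorov complexity: it can only overestimate,
    and for short strings the fixed zlib overhead is a large share of the total.
    """
    return len(zlib.compress(text.encode("ascii"), 9))


for label, text in (("regular", regular), ("irregular", irregular)):
    print(label, len(text), compressed_length(text))
```

Both strings are 32 characters long, but the repetitive one compresses to fewer bytes. The gap grows with longer inputs, where the compressor's fixed overhead matters less.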

One of the fun things we can do with Kolmogorov complexity is use it to help make sense of emergence – how complex phenomena can emerge at a macro-level from some micro level phenomena in a way that seems difficult to predict from the micro-level.

A prototypical example is how complex patterns emerge from simple rules in Conway’s Game of Life. Game of Life consists of an infinite 2D array of cells. Each cell is either alive (‘on’) or dead (‘off’). The rules are:

    1. Any ‘on’ cell (at time t-1) with fewer than two ‘on’ neighbours (at t-1) transitions to an ‘off’ state at time t.
    2. Any ‘on’ cell (t-1) with two or three ‘on’ neighbours (t-1) remains ‘on’ at time t.
    3. Any ‘on’ cell (t-1) with more than three ‘on’ neighbours (t-1) transitions to an ‘off’ state at time t.
    4. Any ‘off’ cell (t-1) with exactly three ‘on’ neighbours (t-1) transitions to an ‘on’ state at time t.
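The four rules fit in a few lines of code. Here is a minimal Python sketch that represents the infinite array as a set of coordinates of ‘on’ cells, so nothing needs to be stored for the ‘off’ ones. The function name `step` and the coordinate convention (x to the right, y downwards) are choices made for illustration.

```python
from collections import Counter


def step(live_cells):
    """Apply the rules once to a set of (x, y) coordinates of 'on' cells."""
    # For every cell next to at least one 'on' cell, count its 'on' neighbours
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for x, y in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # 'On' at time t: exactly three neighbours (rules 2 and 4), or two
    # neighbours and already 'on' (rule 2). Everything else is 'off' (rules 1, 3).
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }


glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = step(state)
print(sorted(state))
```

After four steps the five-cell glider reappears in the same shape, shifted one cell diagonally. Nothing in the rules mentions movement, which gives a small taste of the macro-level behaviour discussed next.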

Here’s an example of the complexity that can emerge (from the Wikipedia article on Game of Life):

Looking at the animation above, there’s still an array of cells switching on and off, but simultaneously it looks like there’s some sort of factory of (what are known in the genre as) gliders. The challenge is, how do we define the way this macro-level pattern emerges from the micro-level cells?

Start with Mark Bedau’s (1997, p. 378) definition of a particular kind of emergence known as weak emergence:

Macrostate P of S with microdynamic D is weakly emergent iff P can be derived from D and S’s external conditions but only by simulation.

This captures the idea that it’s difficult to tell just by inspecting the rules (the microdynamic) that the complex pattern will emerge – you have to set up the rules and run them (whether by computer or using pen and paper) to see. However, Nora Berenstain (2020) points out that this kind of emergence is satisfied by random patternlessness at the macro-level which is generated from but can’t be predicted from the micro-level without simulation. Patternlessness doesn’t seem to be the kind of thing we think of as emerging, argues Berenstain.

Berenstain (2020) adds a condition of algorithmic compressibility – in other words, the Kolmogorov complexity of the macro-level pattern must be smaller than the length of the pattern itself for it to count as emergence. Here’s Berenstain’s combined definition:

“Where system S is composed of micro-level entities having associated micro-states, and where microdynamic D governs the time evolution of S’s microstates, macrostate P of S with microdynamic D is weakly emergent iff P is algorithmically compressible and can be derived from D and S’s external conditions only by simulation.”

Now I wonder what happens if a macrostate is very simple – so simple it cannot be compressed. This is different to incompressibility due to randomness. Also, how should we define simulation outside the world of models, in reality: does that literally mean observing a complex social system to see what happens? This would lead to interesting consequences for evaluating complex social programmes, e.g., how can data dredging be prevented? What should be in a study plan?


Bedau, M. (1997). Weak emergence. Philosophical Perspectives, 11, 375–399.

Berenstain, N. (2020). Strengthening weak emergence. Erkenntnis. Online first.

A lovely video about Game of Life, featuring John Conway

Once you’ve watched that, have a play over here.