## Understanding causal estimands like ATE and ATT

Social policy and programme evaluations often report findings in terms of causal estimands such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT or ATET). An estimand is a quantity we are trying to estimate – but what exactly does that mean? This post explains through simple examples.

Suppose a study has two conditions, treat (=1) and control (=0). Causal estimands are defined in terms of potential outcomes: the outcome if someone had been assigned to treatment, $$Y(1)$$, and the outcome if someone had been assigned to control, $$Y(0)$$.

We only get to see one of those two realised, depending on which condition someone was actually assigned to. The other is a counterfactual outcome. Assume, for a moment, that you are omniscient and can observe both potential outcomes. The treatment effect (TE) for an individual is $$Y(1)-Y(0)$$ and, since you are omniscient, you can see it for everyone.

Here is a table of potential outcomes and treatment effects for 10 fictional study participants. A higher score represents a better outcome.

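| Person | Condition | Y(0) | Y(1) | TE |
|---|---|---|---|---|
| 1 | 1 | 0 | 7 | 7 |
| 2 | 0 | 3 | 0 | -3 |
| 3 | 1 | 2 | 9 | 7 |
| 4 | 1 | 1 | 8 | 7 |
| 5 | 0 | 4 | 1 | -3 |
| 6 | 1 | 3 | 10 | 7 |
| 7 | 0 | 4 | 1 | -3 |
| 8 | 0 | 8 | 5 | -3 |
| 9 | 0 | 7 | 4 | -3 |
| 10 | 1 | 3 | 10 | 7 |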

Note the pattern in the table. People who were assigned to treatment have a treatment effect of $$7$$ and people who were assigned to control have a treatment effect of $$-3$$, i.e., if they had been assigned to treatment, their outcome would have been worse. So everyone in this fictional study was lucky: they were assigned to the condition that led to the best outcome they could have had.

The average treatment effect (ATE) is simply the average of treatment effects:

$$\displaystyle \frac{7 + -3 + 7 + 7 + -3 + 7 + -3 + -3 + -3 + 7}{10}=2$$

The average treatment effect on the treated (ATT or ATET) is the average of treatment effects for people who were assigned to the treatment:

$$\displaystyle \frac{7 + 7 + 7 + 7 + 7}{5}=7$$

The average treatment effect on control (ATC) is the average of treatment effects for people who were assigned to control:

$$\displaystyle \frac{-3 + -3 + -3 + -3 + -3}{5}=-3$$
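The three averages can be checked with a few lines of Python. This is only a sketch of the arithmetic using the omniscient table of potential outcomes; the variable names (`condition`, `y0`, `y1`, `te`) are my own, chosen for illustration.

```python
from statistics import mean

# Potential outcomes for the 10 fictional participants (the omniscient view).
condition = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]  # 1 = treat, 0 = control
y0 = [0, 3, 2, 1, 4, 3, 4, 8, 7, 3]         # outcome under control, Y(0)
y1 = [7, 0, 9, 8, 1, 10, 1, 5, 4, 10]       # outcome under treatment, Y(1)

# Individual treatment effects: a within-person difference, Y(1) - Y(0).
te = [a - b for a, b in zip(y1, y0)]

ate = mean(te)                                           # everyone
att = mean(t for t, c in zip(te, condition) if c == 1)   # assigned to treatment
atc = mean(t for t, c in zip(te, condition) if c == 0)   # assigned to control

print(ate, att, atc)  # prints: 2 7 -3
```

Note that all three estimands use the same list of individual effects; they differ only in which people are averaged over.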

Alas, we aren’t really omniscient, so in reality we see a table like this:

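| Person | Condition | Y(0) | Y(1) | TE |
|---|---|---|---|---|
| 1 | 1 | ? | 7 | ? |
| 2 | 0 | 3 | ? | ? |
| 3 | 1 | ? | 9 | ? |
| 4 | 1 | ? | 8 | ? |
| 5 | 0 | 4 | ? | ? |
| 6 | 1 | ? | 10 | ? |
| 7 | 0 | 4 | ? | ? |
| 8 | 0 | 8 | ? | ? |
| 9 | 0 | 7 | ? | ? |
| 10 | 1 | ? | 10 | ? |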

This table highlights the fundamental problem of causal inference and why it is sometimes seen as a missing data problem.

### Don’t confuse estimands and methods for estimation

One of the barriers to understanding these estimands is that we are used to taking a between-participant difference in group means to estimate the average effect of a treatment. But the estimands are defined in terms of a within-participant difference between two potential outcomes, only one of which is observed.

The causal effect is a theoretical quantity defined for individual people and it cannot be directly measured.

Here is another example where the causal effect is zero for everyone, so ATT, ATE, and ATC are all zero too:

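| Person | Condition | Y(0) | Y(1) | TE |
|---|---|---|---|---|
| 1 | 1 | 7 | 7 | 0 |
| 2 | 0 | 3 | 3 | 0 |
| 3 | 1 | 7 | 7 | 0 |
| 4 | 1 | 7 | 7 | 0 |
| 5 | 0 | 3 | 3 | 0 |
| 6 | 1 | 7 | 7 | 0 |
| 7 | 0 | 3 | 3 | 0 |
| 8 | 0 | 3 | 3 | 0 |
| 9 | 0 | 3 | 3 | 0 |
| 10 | 1 | 7 | 7 | 0 |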

However, people have been assigned to treatment and control in such a way that, given the outcomes realised, it appears that treatment is better than control. Here is the table again, this time with the outcomes we couldn’t observe removed:

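| Person | Condition | Y(0) | Y(1) | TE |
|---|---|---|---|---|
| 1 | 1 | ? | 7 | ? |
| 2 | 0 | 3 | ? | ? |
| 3 | 1 | ? | 7 | ? |
| 4 | 1 | ? | 7 | ? |
| 5 | 0 | 3 | ? | ? |
| 6 | 1 | ? | 7 | ? |
| 7 | 0 | 3 | ? | ? |
| 8 | 0 | 3 | ? | ? |
| 9 | 0 | 3 | ? | ? |
| 10 | 1 | ? | 7 | ? |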

So, if we take the average of realised treatment outcomes we get 7 and the average of realised control outcomes we get 3. The mean difference is then 4. This estimate is biased. The correct answer is zero, but we couldn’t tell from the available data.
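Here is the same comparison as a small Python sketch (variable names are mine). It builds the observed outcomes from the zero-effect table and contrasts the naive between-group difference with the true ATE, which only the omniscient view can compute.

```python
from statistics import mean

condition = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]  # 1 = treat, 0 = control
y0 = [7, 3, 7, 7, 3, 7, 3, 3, 3, 7]         # Y(0)
y1 = [7, 3, 7, 7, 3, 7, 3, 3, 3, 7]         # Y(1), identical: no effect for anyone

# What we actually get to see: Y(1) for the treated, Y(0) for controls.
observed = [a if c == 1 else b for c, a, b in zip(condition, y1, y0)]

# Naive estimate: between-participant difference in observed group means.
treated_mean = mean(o for o, c in zip(observed, condition) if c == 1)
control_mean = mean(o for o, c in zip(observed, condition) if c == 0)
naive = treated_mean - control_mean

# True ATE: within-participant difference, available only to the omniscient.
true_ate = mean(a - b for a, b in zip(y1, y0))

print(naive, true_ate)  # prints: 4 0
```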

The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. For other estimators that don’t require random treatment assignment and for other estimands, try Scott Cunningham’s Causal Inference: The Mixtape.
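To see what “unbiased” means here, we can enumerate every way of assigning exactly 5 of the 10 people from the first table to treatment and average the resulting mean differences. This is a sketch under an assumption I am adding: complete randomisation with a fixed group size of 5, each assignment equally likely. `Fraction` is used so the averages are exact.

```python
from fractions import Fraction
from itertools import combinations

# Potential outcomes from the first (omniscient) table.
y0 = [0, 3, 2, 1, 4, 3, 4, 8, 7, 3]
y1 = [7, 0, 9, 8, 1, 10, 1, 5, 4, 10]
n = len(y0)


def avg(values):
    """Exact arithmetic mean as a Fraction."""
    values = list(values)
    return Fraction(sum(values), len(values))


ate = avg(a - b for a, b in zip(y1, y0))

diffs, atts, atcs = [], [], []
for treated in combinations(range(n), 5):  # all 252 equally likely assignments
    control = [i for i in range(n) if i not in treated]
    # The estimate we could compute from observed data under this assignment.
    diffs.append(avg(y1[i] for i in treated) - avg(y0[i] for i in control))
    # The (unobservable) ATT and ATC under this assignment.
    atts.append(avg(y1[i] - y0[i] for i in treated))
    atcs.append(avg(y1[i] - y0[i] for i in control))

print(len(diffs), ate, avg(diffs), avg(atts), avg(atcs))
```

Any single assignment can give a mean difference some way from the ATE, but the average across all assignments equals it exactly. The ATT and ATC also average out to the ATE across assignments.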

### How do you choose between ATE, ATT, and ATC?

Firstly, if you are running a randomised controlled trial, you don’t choose: ATE, ATT, and ATC will be the same. This is because, on average across trials, the characteristics of those who were assigned to treatment or control will be the same.

So the distinction between these three estimands only matters for quasi-experimental studies, for example where treatment assignment is not under the control of the researcher.

Noah Greifer and Elizabeth Stuart offer a neat set of example research questions to help decide (here lightly edited to make them less medical):

• ATT: should an intervention currently being offered continue to be offered or should it be withheld?
• ATC: should an intervention be extended to people who don’t currently receive it?
• ATE: should an intervention be offered to everyone who is eligible?

### How does intention to treat fit in?

The distinction between ATE and ATT is unrelated to the distinction between intention to treat and per-protocol analyses. Intention to treat analysis means we analyse people according to the group they were assigned to, even if they didn’t comply, e.g., by not engaging with the treatment. Per-protocol analysis is a biased analysis that only analyses data from participants who did comply and is generally not recommended.

For instance, it is possible to conduct a quasi-experimental study that uses intention to treat and estimates the average treatment effect on the treated. In this case, ATT might be better called something like average treatment effect for those we intended to treat (ATETWITT). Sadly this term hasn’t yet been used in the literature.
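The mechanics of the two analyses can be sketched in a few lines. The records below are entirely made up for illustration (they are not from any study), and the function name `mean_difference` is mine. The point is only which rows each analysis uses: intention to treat keeps everyone in the group they were assigned to, while per-protocol drops non-compliers first.

```python
from statistics import mean

# Hypothetical records: (assigned condition, complied?, outcome).
# Invented numbers, for illustration only.
records = [
    (1, True, 8), (1, True, 9), (1, False, 3), (1, False, 4),
    (0, True, 5), (0, True, 4), (0, True, 6), (0, True, 5),
]


def mean_difference(rows):
    """Difference in mean outcome between assigned-treatment and assigned-control rows."""
    treated = [y for assigned, _, y in rows if assigned == 1]
    control = [y for assigned, _, y in rows if assigned == 0]
    return mean(treated) - mean(control)


# Intention to treat: analyse everyone as assigned, compliant or not.
itt = mean_difference(records)

# Per-protocol: discard anyone who did not comply, then compare.
per_protocol = mean_difference([r for r in records if r[1]])

print(itt, per_protocol)  # prints: 1 3.5
```

In these invented data, the people who dropped out of treatment had lower outcomes, so discarding them makes the treatment group look better than the group that was actually assigned.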

### Summary

Causal effects are defined in terms of potential outcomes following treatment and following control. Only one potential outcome is observed, depending on whether someone was assigned to treatment or control, so causal effects cannot be directly observed. The fields of statistics and causal inference find ways to estimate these estimands using observable data. The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. Quasi-experimental designs allow the estimation of additional estimands: ATT and ATC.

## Theory-based vs. theory-driven evaluation

“Donaldson and Lipsey (2006), Leeuw and Donaldson (2015), and Weiss (1997) noted that there is a great deal of confusion today about what is meant by theory-based or theory-driven evaluation, and the differences between using program theory and social science theory to guide evaluation efforts. For example, the newcomer to evaluation typically has a very difficult time sorting through a number of closely related or sometimes interchangeable terms such as theory-oriented evaluation, theory-based evaluation, theory-driven evaluation, program theory evaluation, intervening mechanism evaluation, theoretically relevant evaluation research, program theory, program logic, logic modeling, logframes, systems maps, and the like. Rather than trying to sort out this confusion, or attempt to define all of these terms and develop a new nomenclature, a rather broad definition is offered in this book in an attempt to be inclusive.

“Program Theory–Driven Evaluation Science is the systematic use of substantive knowledge about the phenomena under investigation and scientific methods to improve, to produce knowledge and feedback about, and to determine the merit, worth, and significance of evaluands such as social, educational, health, community, and organizational programs.”

– Donaldson, S. I. (2022, p. 9). Introduction to Theory-Driven Program Evaluation (2nd ed.). Routledge.

## Costs and benefits in policy making

“Costs and benefits should be calculated over the lifetime of the proposal. Proposals involving infrastructure such as roads, railways and new buildings are appraised over a 60 year period. Refurbishment of existing buildings is considered over 30 years. For proposals involving administrative changes a ten year period is used as a standard measure. For interventions likely to have significant costs or benefits beyond 60 years, such as vaccination programmes, or nuclear waste storage, a suitable appraisal period should be discussed with and formally agreed by the Treasury at the start of work on the proposal.”

The Green Book (2022, p. 9)

## Emergence and complexity in social programme evaluation

There’s lots of talk of complexity in the world of social programme evaluation with little clarity about what the term means. I thought I’d step back from that and explore ideas of complexity where the definitions are clearer.

One is Kolmogorov complexity:

“the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produces the object as output. It is a measure of the computational resources needed to specify the object.”

For example (mildly edited from the Wikipedia article) compare the following two strings:

abababababababababababababababab
4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

The first string has a short description: “ab 16 times” (11 characters). The second has no description shorter than the text itself (32 characters). So the first string is less complex than the second. (The description of the text or other object would usually be written in a programming language.)
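Kolmogorov complexity itself cannot be computed in general, but the length of a compressed version of a string gives a crude, computable upper-bound proxy. The sketch below uses Python’s `zlib` for that purpose. This is my substitution for illustration, not part of the definition, and for strings this short the compressor’s fixed overhead is a large share of the output.

```python
import zlib

regular = b"ab" * 16                              # the first string, 32 bytes
irregular = b"4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"   # the second string, 32 bytes

for label, s in (("regular", regular), ("irregular", irregular)):
    compressed = zlib.compress(s, 9)  # level 9 = maximum compression effort
    # Compressed length is only a rough stand-in for description length.
    print(label, len(s), len(compressed))
```

The regular string compresses to fewer bytes than the irregular one, mirroring the “ab 16 times” description. Note too that the Python expression `b"ab" * 16` is itself a short program that produces the first string.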

One of the fun things we can do with Kolmogorov complexity is use it to help make sense of emergence – how complex phenomena can emerge at a macro-level from some micro-level phenomena in a way that seems difficult to predict from the micro-level.

A prototypical example is how complex patterns emerge from simple rules in Conway’s Game of Life. Game of Life consists of an infinite 2D array of cells. Each cell is either alive (‘on’) or dead (‘off’). The rules are:

1. Any ‘on’ cell (at time t-1) with fewer than two ‘on’ neighbours (at t-1) transitions to an ‘off’ state at time t.
2. Any ‘on’ cell (t-1) with two or three ‘on’ neighbours (t-1) remains ‘on’ at time t.
3. Any ‘on’ cell (t-1) with more than three ‘on’ neighbours (t-1) transitions to an ‘off’ state at time t.
4. Any ‘off’ cell (t-1) with exactly three ‘on’ neighbours (t-1) transitions to an ‘on’ state at time t.
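The four rules fit in a few lines of Python. This is a minimal sketch, not a reference implementation: it stores only the coordinates of ‘on’ cells in a set, which is one convenient way to stand in for an infinite grid. The names `step`, `run`, and `glider` are mine.

```python
from collections import Counter


def step(live):
    """Apply the rules once. `live` is the set of (x, y) cells that are 'on' at t-1."""
    # For every cell next to at least one 'on' cell, count its 'on' neighbours.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Rule 4: exactly three 'on' neighbours switches a cell on (or keeps it on).
    # Rule 2: an 'on' cell with two 'on' neighbours stays on.
    # Rules 1 and 3: every other cell is 'off' at time t.
    return {cell for cell, k in counts.items() if k == 3 or (k == 2 and cell in live)}


def run(live, generations):
    """Apply `step` repeatedly and return the final set of 'on' cells."""
    for _ in range(generations):
        live = step(live)
    return live


# A glider: five cells whose shape recurs, shifted diagonally, every four steps.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
print(sorted(run(glider, 4)))
```

Nothing in `step` mentions gliders or movement. The diagonal drift only shows up when the rules are run.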

Here’s an example of the complexity that can emerge (from the Wikipedia article on Game of Life):

Looking at the animation above, there’s still an array of cells switching on and off, but simultaneously it looks like there’s some sort of factory of (what are known in the genre as) gliders. The challenge is, how do we define the way this macro-level pattern emerges from the micro-level cells?

Start with Mark Bedau’s (1997, p. 378) definition of a particular kind of emergence known as weak emergence:

Macrostate P of S with microdynamic D is weakly emergent iff P can be derived from D and S’s external conditions but only by simulation.

This captures the idea that it’s difficult to tell just by inspecting the rules (the microdynamic) that the complex pattern will emerge – you have to set up the rules and run them (whether by computer or using pen and paper) to see. However, Nora Berenstain (2020) points out that this kind of emergence is satisfied by random patternlessness at the macro-level which is generated from but can’t be predicted from the micro-level without simulation. Patternlessness doesn’t seem to be the kind of thing we think of as emerging, argues Berenstain.

Berenstain (2020) adds a condition of algorithmic compressibility – in other words, the Kolmogorov complexity of the macro-level pattern must be smaller than the length of the pattern itself for it to count as emergence. Here’s Berenstain’s combined definition:

“Where system S is composed of micro-level entities having associated micro-states, and where microdynamic D governs the time evolution of S’s microstates, macrostate P of S with microdynamic D is weakly emergent iff P is algorithmically compressible and can be derived from D and S’s external conditions only by simulation.”

Now I wonder what happens if a macrostate is very simple – so simple it cannot be compressed. This is different to incompressibility due to randomness. Also how should we define simulation outside the world of models in reality: does that literally mean observing a complex social system to see what happens? This would lead to interesting consequences for evaluating complex social programmes, e.g., how can data dredging be prevented? What should be in a study plan?

### References

Bedau, M. (1997). Weak emergence. Philosophical Perspectives, 11, 375–399.

Berenstain, N. (2020). Strengthening weak emergence. Erkenntnis. Online first.

### A lovely video about Game of Life, featuring John Conway

Once you’ve watched that, have a play over here.

## Seven ways to estimate a counterfactual

Experimental and quasi-experimental evaluations usually define a programme effect as the difference between (a) the actual outcome following a social programme and (b) an estimate of what the outcome would have been without the programme – the counterfactual outcome. (The latter might be a competing programme or some genre of “business as usual”.)

It is also usually argued that qualitative or so-called “theory-based” approaches to evaluation are not counterfactual evaluations. Reichardt (2022) adds to a slowly accumulating body of work that challenges this and argues that any approach to evaluation can be understood in counterfactual terms.

Reichardt provides seven examples of evaluation approaches, quantitative and qualitative, and explains how a counterfactual analysis is relevant:

1. Comparisons Across Participants. RCTs and friends. The comparison group is used to estimate the counterfactual. (Note: the comparison group is not the counterfactual. A comparison group is factual.)
2. Before-After Comparisons. The baseline score is often treated as the counterfactual outcome (though it’s probably not, thanks, e.g., to regression to the mean).
3. What-If Assessments. Asking participants to reflect on a counterfactual like, “How would you have felt without the programme?” Participants provide the estimate of the counterfactual, the evaluators use it to estimate the effect.
4. Just-Tell-Me Assessments. Cites Copestake (2014): “If we are interested in finding out whether particular men, women or children are less hungry as a result of some action it seems common-sense just to ask them.” In this case participants may be construed as carrying out the “What-If” assessment of the previous point and using this to work out the programme effect themselves.
5. Direct Observation. Simply seeing the causal effect rather than inferring. An example given is of tapping a car brake and seeing the effect. Not sure I buy this one and neither does Reichardt. Whatever it is, I agree a counterfactual of some sort is needed (and inferred): you need to have a theory to explain what would have happened had you not tapped the brake.
6. Theories-of-Change Assessments. Contribution analysis and realist evaluation are offered as examples. The gist is, despite what proponents of these approaches claim, to use a theory of change to work out whether the programme is responsible for or “contributes to” outcomes, you need to use the theory of change to think about the counterfactual. I’ve blogged about realist evaluation and contribution analysis elsewhere and their definitions of a causal effect.
7. The Modus Operandi (MO) Method. The evaluator looks for evidence of traces or tell-tales that the programme worked. Not sure I quite get how this differs from theory-of-change assessments. Maybe it doesn’t. It sounds like potentially another way to evidence the causal chains in a theory of change.

The conclusion:

“I suspect there is no viable alternative to the counterfactual definition of an effect and that when the counterfactual definition is not given explicitly, it is being used implicitly. […] Of course, evaluators are free to use an alternative to the counterfactual definition of a program effect, if an adequate alternative can be found. But if an alternative definition is used, evaluators should explicitly describe that alternative definition and forthrightly demonstrate how their definition undergirds their methodology […].”

I like four of the seven, as kinds of evidence used to infer the counterfactual outcome. I also propose a fifth: evaluator opinion.

1. Comparisons Across Participants.
2. Before-After Comparisons.
3. What-If Assessments.
4. Just-Tell-Me Assessments.
5. Evaluator opinion.

The What-If and Just-Tell-Me assessments could involve subject experts rather than only beneficiaries of a programme, which would have an impact on how those assessments are interpreted, particularly if the experts have a vested interest. To me, the Theory of Change Assessment in Reichardt’s original could be carried out with the help of one or more of these five. They are all ways to justify causal links (mediating variables or intermediate variables), not just evaluate outcomes, and help assess the validity of a theory of change. Though readers may not find them all equally compelling, particularly the last.

### References

Copestake, J. (2014). Credible impact evaluation in complex contexts: Confirmatory and exploratory approaches. Evaluation, 20(4), 412–427.

Reichardt, C. S. (2022). The Counterfactual Definition of a Program Effect. American Journal of Evaluation, 43(2), 158–174.

## Why is evaluation so white?

Useful resources to explore (work in progress):

“I have been in too many meetings where a racialised person has felt they’ve had to speak about their lived experience, at great personal cost […]. Sometimes, the individual’s point is directly challenged or downplayed. In a head-spinning moment of gaslighting, they are left isolated and disbelieved, despite (or, perhaps, because) they are the racialised person specifically invited to the meeting to explain why the racist thing is racist.”

‘I’m sometimes asked, “Why are there so few people of color in evaluation?” I flip the question: “Why is evaluation so white?” And answer: “Because our labor is actively erased.”

‘Of the 35 recipients of the Paul F. Lazarsfeld Evaluation Theory Award since 1977, 28 evaluators listed in the sacred Evaluation Theory Tree published in 2004, 22 evaluators featured in the related Evaluation Roots book published in 2004, and 16 evaluators featured by AEA’s Oral History Project since 2003, not one has been a woman of color or indigenous woman.

‘This omission can lead us to conclude that for the last 40 years, no women of color or indigenous women have been academically trained as evaluators, conducting formal evaluations, and engaged in scholarship on evaluation—let alone engaged in evaluative thinking and critical inquiry that are considered outside the boundaries of evaluation. Or that their work is fringe.

‘The evaluation work of several women of color and indigenous women allows us to do the work that we do every day. This post aims to repair the miseducation of evaluators of all ages and experience levels…’

Further resources, e.g.,

It’s striking how issues discussed in the 70s are still relevant now, e.g., concerning the impossibility of using IQ tests (and covert proxies thereof) to improve outcomes rather than simply to blame a child and excuse education systems for poor outcomes.

‘At its core, evaluation is value laden and embued with and responsive to a larger social political order, and evaluators are situated within contexts of study and within interactions of the setting that shape the evaluation study’s logic, structure, and practices (Hopson, Greene, Bledsoe, Villegas, & Brown, 2007). The question of “Who evaluates and why?” highlights the contexts, agendas, and intentions of the evaluation and the evaluator and so raises questions about practices—sometimes commonly accepted ones—and the structures of power and the uses of those power structures for or against hegemony.’ [p. 418]

This cites The oral history of evaluation, part 3: The professional evolution of Michael Scriven, which provides a clue – hiding in plain sight – to why the official history of evaluation is as it is (bold emphasis added):

“Now, there was the May 12th group, which was ahead of the game. The May 12th group was so called after the first date on which they met [in “about 1968”, says Gene Glass–AF]—but the general feeling was if we call it the May 12th group, that will have absolutely zero cachet, and so no one will be able to argue that they were entitled to join the May 12th group because it’s called something generic. And so the idea was you got invited to the May 12th group, and if you weren’t invited, then you weren’t in, and so there was no official stuff. So, they would meet in somebody’s house once a year. […] But some of us felt that we needed to do something that was slightly more official, and we’d got to start making this more than the intellectual elite group.”

The May 13 Group formed on that date in 2020 to challenge this.

‘Evaluation is political. At its simplest, evaluation is the systematic “process of determining the merit, worth and value of things” (Scriven, 1991, p. 1). Who gets to decide, the questions, the process, and the criteria for determining merit, worth, value, or significance—all of these matter.’ [p. 534]

‘As professionals and practitioners, we can no longer sit on the sidelines wearing the cape of objectivity and neutrality, a cape that shields beliefs and assumptions about knowledge, rigor, and evidence and which elevate a Western White worldview. [..] Everyday narratives that continue to marginalize, minimize, and disrespect people of color and those with less privilege could be replaced with ones that do not demonize and place blame on the individual. They could instead lift up the historical, contextual, and powerful dynamics that create and sustain oppression and shed light on the strategies and solutions which can shift the “rules of the game,” so that equity is achievable.’ [p. 538]

“Advisors of evaluation graduate students of colour should create spaces for students to express their feelings and, if they choose, be vulnerable and open about the stressors of simply being a person of colour in a world with white supremacy woven into its very fabric.”

“Whenever a prospective student emails me, I put them in touch with current students in my department. I find this is especially important for international students; I am unable to speak to how the culture in North Carolina and in our department differs from their home culture. I also aim to introduce students to faculty across campus who have similar cultures and backgrounds”

“Advisors of evaluation graduate students of colour can research or have conversations about the norms and dates associated with the holidays and events that their students observe. […] While I can’t know all the traditions observed by my students, I encourage them to inform me about their cultural and religious traditions as appropriate.”

“… advisors and mentors should also practice giving microvalidations […], small acts and words that validate who graduate students believe they can be. My post-doctoral advisor always praised me in public and raised concerns in private. I regularly let my advisees know that I am proud of them, see their potential, and believe in them. I learn every student’s name and work to pronounce their names correctly. And I make a concerted effort to refer to my advisees as my colleagues.”

“… evaluators of color noted that the burden of addressing DEI and calling out racism is often placed on them as they are assumed to be experts…”

“… evaluators of color cited examples of being tapped to join an evaluation project when philanthropic clients asked for demographics of staff in their RFPs, yet not feeling meaningfully included in the subsequent work…”

“When organizations have difficulty retaining staff of color, they often perceive the person of color as the problem, not the ecosystem that reinforces inequities. Persistent challenges with retention should signal a need for the organization to self-reflect on its culture and make changes…”

## Being realistic about “realist” evaluation

Realist evaluation (formerly known as realistic evaluation; Pawson & Tilley, 2004, p. 3) is an approach to theory-based evaluation that treats, e.g., burglars and prisons as real as opposed to narrative constructs (that seems uncontroversial); follows “a realist methodology” that aims for scientific “detachment” and “objectivity”; and also strives to be realistic about the scope of evaluation (Pawson & Tilley, 1997, pp. xii-xiv).

“Realist(ic)” evaluation proposes something apparently new and distinctive. But how does it look in reality? What’s new about it? Let’s have a read of Pawson and Tilley’s (1997) classic to try to find out.

### Déjà vu

Open any text on social science methodology, and it will say something like the following about the process of carrying out research:

1. Review what is known about your topic area, including theories which attempt to explain and bring order to the various disparate findings.
2. Use prior theory, supplemented with your own thinking, to formulate research questions or hypotheses.
3. Choose methods that will enable you to answer those questions or test the hypotheses.
4. Gather and analyse data.
5. Interpret the analysis in relation to the theories introduced at the outset. What have you learned? Do the theories need to be tweaked? For qualitative research, this interpretation and analysis are often interwoven.
6. Acknowledge limitations of your study. This will likely include reflection about whether your method or the theory are to blame for any mismatch between theory and findings.
7. Add your findings to the pool of knowledge (after a gauntlet of peer review).
8. Loop back to 1.

Realist evaluation has something similar:

It is scientific method as usual with constraints on what the various stages should include for a study to be certified genuinely “realist”. For instance, the theories should be framed in terms of contexts, mechanisms, and outcomes (more on which in a moment); hypotheses emphasise the “for whom” and circumstances of an evaluation; and instead of “empirical generalisation” there is a “program specification”.

The method of data collection and analysis can be anything that satisfies this broad research loop (p. 85):

“… we cast ourselves as solid members of the modern, vociferous majority […], for we are whole-heartedly pluralists when it comes to the choice of method. Thus, as we shall attempt to illustrate in the examples to follow, it is quite possible to carry out realistic evaluation using: strategies, quantitative and qualitative; timescales, contemporaneous or historical; viewpoints, cross-sectional or longitudinal; samples, large or small; goals, action-oriented or audit-centred; and so on and so forth. [… T]he choice of method has to be carefully tailored to the exact form of hypotheses developed earlier in the cycle.”

This is reassuringly similar to the standard textbook story. However, like the standard story, in practice there are ethical and financial constraints on method meaning that the ideal approach to answer a question may not be feasible, and yet an evaluation of some description is deemed necessary nonetheless. Indeed the UK government’s evaluation bible, the Magenta Book (HM Treasury, 2020), recommends using what it calls “theory-based” approaches like “realist” evaluation when experimental and quasi-experimental approaches are not feasible. (See also, What is Theory-Based Evaluation, really?)

### More than a moment’s thought about theory

Pawson and Tilley (1997) emphasise the importance of thinking about why social interventions may lead to change and not only looking at outcomes, which they illustrate with the example of CCTV:

“CCTV certainly does not create a physical barrier making cars impenetrable. A moment’s thought has us realize, therefore, that the cameras must work by instigating a chain of reasoning and reaction. Realist evaluation is all about turning this moment’s thought into a comprehensive theory of the mechanisms through which CCTV may enter the potential criminal’s mind, and the contexts needed if these powers are to be realized.” (p. 78)

They then list a range of potential mechanisms. CCTV might make it more likely that thieves are caught in the act. Or maybe the presence of CCTV makes car parks feel safer, which means they are used by more people whose presence and watchful eyes prevent theft. So other people provide the surveillance rather than the camera bolted to the wall.

Nothing new here – social science is awash with theory (Pawson and Tilley cite Durkheim’s work on suicide, first published in 1897, as an example). Psychological therapies are some of the most evaluated of social interventions and the field is particularly productive when it comes to theory; see, e.g., Whittle (1999, p. 240) on psychoanalysis, a predecessor of modern therapies:

“Psychoanalysis is full of theory. It has to be, because it is so distrustful of the surface. It could still choose to use the minimum necessary, but it does the opposite. It effervesces with theory…”

To take a more contemporary example, Power (2010) argues that effects in modern therapies involve at least one of the following three activities: exploring and using how the relationship between therapist and client mirrors relationships outside therapy (transference); graded exposure to situations which provoke anxiety; and challenging dysfunctional assumptions about how the social world works. For each of these activities there are detailed theories of change.

However, perhaps evaluations of social programmes – therapies included – have concentrated too much on tracking outcomes and neglected getting to grips with testing potential mechanisms of change, so “realist” evaluation is potentially a helpful intervention. The specific example of CCTV is a joy to read and is a great way to bring the sometimes abstract notion of a social mechanism alive.

### The structure of explanations in “realist” evaluation

The context-mechanism-outcome triad is a salient feature of the approach. Rather than define each of these (see the original text), here are four examples from Pawson and Tilley (1997) to illustrate what they are. The middle column (New mechanism) describes the putative mechanism that may be “triggered” by a social programme that has been introduced.

| Context | New mechanism | Outcome |
|---|---|---|
| Poor-quality, hard-to-let housing; traditional housing department; lack of tenant involvement in estate management | Improved housing and increased involvement in management create increased commitment to the estate, more stability, and opportunities and motivation for social control and collective responsibility | Reduced burglary prevalence |
| Three tower blocks, occupied mainly by the elderly; traditional housing department; lack of tenant involvement in estate management | Concentration of elderly tenants into smaller blocks and natural wastage creates vacancies taken up by young, formerly homeless single people inexperienced in independent living. They become the dominant group. They have little capacity or inclination for informal social control, and are attracted to a hospitable estate subterranean subculture | Increased burglary prevalence concentrated amongst the more vulnerable; high levels of vandalism and incivility |
| Prisoners with little or no previous education with a growing string of convictions – representing a ‘disadvantaged’ background | Modest levels of engagement and success with the program trigger ‘habilitation’ process in which the inmate experiences self-realization and social acceptability (for the first time) | Lowest levels of reconviction as compared with statistical norm for such inmates |
| High numbers of prepayment meters, with a high proportion of burglaries involving cash from meters | Removal of cash meters reduces incentive to burgle by decreasing actual or perceived rewards | Reduction in percentage of burglaries involving meter breakage; reduced risk of burglary at dwellings where meters are removed; reduced burglary rate overall |

This seems a helpful way to organise thinking about the context-mechanism-outcome triad, irrespective of whether the approach is labelled “realist”. Those who are into logframe matrices (logframes) might want to add a column for the “outputs” of a programme.

The authors emphasise that the underlying causal model is “generative” in the sense that causation is seen as

“acting internally as well as externally. Cause describes the transformative potential of phenomena. One happening may well trigger another but only if it is in the right condition in the right circumstances. Unless explanation penetrates to these real underlying levels, it is deemed to be incomplete.” (p. 34)

The “internal” here appears to refer to looking inside the “black box” of a social programme to see how it operates, rather than merely treating it as something that is present in some places and absent in others. Later, there is further elaboration of what “generative” might mean:

“To ‘generate’ is to ‘make up’, to ‘manufacture’, to ‘produce’, to ‘form’, to ‘constitute’. Thus when we explain a regularity generatively, we are not coming up with variables or correlates which associate one with the other; rather we are trying to explain how the association itself comes about. The generative mechanisms thus actually constitute the regularity; they are the regularity.” (p. 67)

We also learn that an action is causal only if its outcome is triggered by a mechanism in a context (p. 58). Okay, but how do we find out if an action’s outcome is triggered in this manner? “Realist” evaluation does not, in my view, provide an adequate analysis of what a causal effect is. Understandable, perhaps, given its pluralist approach to method. So, understandings of causation must come from elsewhere.

Mechanisms can be seen as “entities and activities organized in such a way that they are responsible for the phenomenon” (Illari & Williamson, 2011, p. 120). In “realist” evaluation, entities and their activities in the context would be included in this organisation too – the context supplies the mechanism on which a programme intervenes. So, let’s take one of the example mechanisms from the table above:

“Improved housing and increased involvement in management create increased commitment to the estate, more stability, and opportunities and motivation for social control and collective responsibility.”

To make sense of this, we need a theory of what improved housing looks like, what involvement in management and commitment to the estate, etc., means. To “create commitment” seems like a psychological, motivational process. The entities are the housing, management structures, people living in the estate, etc. To evidence the mechanism, I think it does help to think of variables to operationalise what might be going on and to use comparison groups to avoid mistaking, e.g., regression to the mean or friendlier neighbours for change due to improved housing. And indeed, Pawson and Tilley use quantitative data in one of the “realist” evaluations they discuss (next section). Such operationalisation does not reduce a mechanism to a set of variables; it is merely a way to analyse a mechanism.

### Kinds of evidence

Chapter 4 gives a range of examples of the evidence that has been used in early “realist” evaluations. In summary, and confirming the pluralist stance mentioned above, it seems that all methods are relevant to realist evaluation. Two examples:

1. Interviews with practitioners to try to understand what it is about a programme that might effect change: “These inquiries released a flood of anecdotes, and the tales from the classroom are remarkable not only for their insight but in terms of the explanatory form which is employed. These ‘folk’ theories turn out to be ‘realist’ theories and invariably identify those contexts and mechanisms which are conducive to the outcome of rehabilitation.” (pp. 107-108)
2. Identifying variables in an information management system to “operationalize these hunches and hypotheses in order to identify, with more precision, those combinations of types of offender and types of course involvement which mark the best chances of rehabilitation. Over 50 variables were created…” (p. 108)

Some researchers have made a case for and carried out what they term realist randomised controlled trials (Bonell et al., 2012), which seems eminently sensible to me. The literature subsequently exploded in response. Here’s an illustrative excerpt of the criticisms (Marchal et al., 2013, p. 125):

“Experimental designs, especially RCTs, consider human desires, motives and behaviour as things that need to be controlled for (Fulop et al., 2001, Pawson, 2006). Furthermore, its analytical techniques, like linear regression, typically attempt to isolate the effect of each variable on the outcome. To do this, linear regression holds all other variables constant “instead of showing how the variables combine to create outcomes” (Fiss, 2007, p. 1182). Such designs “purport to control an infinite number of rival hypotheses without specifying what any of them are” by rendering them implausible through statistics (Campbell, 2009), and do not provide a means to examine causal mechanisms (Mingers, 2000).”

Well. What to make of this. Yes, RCTs control for stuff that’s not measured and maybe even unmeasurable. But you can also measure stuff you know about and see if that moderates or mediates the outcome (see, e.g., Windgassen et al., 2016). You might also use the numbers to select people for qualitative interview to try to learn more about what is going on. The comment on linear regression reveals surprising ignorance of how non-linear transformations of and interactions between predictors can be added to models. It is also trivial to calculate marginal outcome predictions for combinations of predictors together, rather than merely identifying which predictors are likely non-zero when holding others fixed. See Bonell et al. (2016) for a very patient reply.
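To make the point about interactions concrete, here is a minimal sketch in Python. The data are invented purely for illustration, and `context` is a hypothetical binary moderator (say, high versus low tenant involvement). Because both predictors are binary and the model includes their interaction, the model is saturated, so the least-squares coefficients can be read straight off the four cell means without any regression library.

```python
from statistics import mean

# Made-up outcome scores for illustration only: (treated, context, outcome).
# "context" is a hypothetical binary moderator.
data = [
    (0, 0, 4.0), (0, 0, 5.0), (0, 0, 6.0),
    (1, 0, 5.0), (1, 0, 6.0), (1, 0, 7.0),
    (0, 1, 5.0), (0, 1, 6.0), (0, 1, 7.0),
    (1, 1, 9.0), (1, 1, 10.0), (1, 1, 11.0),
]

def cell_mean(treated, context):
    """Mean outcome for one combination of the two predictors."""
    return mean(y for t, c, y in data if t == treated and c == context)

# With two binary predictors and their interaction, the model
#   y = b0 + b1*treated + b2*context + b3*treated*context
# is saturated, so least squares reproduces the four cell means exactly.
b0 = cell_mean(0, 0)
b1 = cell_mean(1, 0) - b0
b2 = cell_mean(0, 1) - b0
b3 = cell_mean(1, 1) - b0 - b1 - b2

def predict(treated, context):
    """Predicted outcome for a combination of predictors."""
    return b0 + b1 * treated + b2 * context + b3 * treated * context

for t in (0, 1):
    for c in (0, 1):
        print(f"treated={t}, context={c}: predicted outcome {predict(t, c):.1f}")

# The treatment effect differs by context; the interaction term captures this.
effect_low = predict(1, 0) - predict(0, 0)
effect_high = predict(1, 1) - predict(0, 1)
print(f"effect in context 0: {effect_low:.1f}; effect in context 1: {effect_high:.1f}")
```

In this toy example the predicted effect of treatment is larger in one context than the other, which is exactly the kind of “how the variables combine” question the critics suggest regression cannot address.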

### Conclusions

The plea for evaluators to spend more time developing theory is welcome – especially in policy areas where “key performance indicators” and little else are the norm (see also Carter, 1989, on KPIs as dials versus tin openers opening a can of worms). It is a laudable aim to help “develop the theories of practitioners, participants and policy makers” of why a programme might work (Pawson & Tilley, 1997, p. 214). The separation of context, mechanism, and outcome also helps structure thinking about social programmes (though there is widespread confusion about what a mechanism is in the “realist” literature; Lemire et al., 2020). But “realist” evaluation is arguably better seen as an exposition of a particular reading of traditional scientific method applied to evaluation, with a call for pluralist methods. I am unconvinced that it is a novel form of evaluation.

### References

Bonell, C., Fletcher, A., Morton, M., Lorenc, T., & Moore, L. (2012). Realist randomised controlled trials: a new approach to evaluating complex public health interventions. Social Science & Medicine, 75(12), 2299–2306.

Bonell, C., Warren, E., Fletcher, A., & Viner, R. (2016). Realist trials and the testing of context-mechanism-outcome configurations: A response to Van Belle et al. Trials, 17(1), 478.

Carter, N. (1989). Performance indicators: “backseat driving” or “hands off” control? Policy & Politics, 17, 131–138.

HM Treasury (2020). Magenta Book.

Illari, P. M., & Williamson, J. (2011). What is a mechanism? Thinking about mechanisms across the sciences. European Journal for Philosophy of Science, 2(1), 119–135.

Lemire, S., Kwako, A., Nielsen, S. B., Christie, C. A., Donaldson, S. I., & Leeuw, F. L. (2020). What Is This Thing Called a Mechanism? Findings From a Review of Realist Evaluations. New Directions for Evaluation, 167, 73–86.

Marchal, B., Westhorp, G., Wong, G., Van Belle, S., Greenhalgh, T., Kegels, G., & Pawson, R. (2013). Realist RCTs of complex interventions – an oxymoron. Social Science & Medicine, 94, 124–128.

Pawson, R., & Tilley, N. (1997). Realistic Evaluation. SAGE Publications Ltd.

Pawson, R., & Tilley, N. (2004). Realist evaluation. Unpublished.

Power, M. (2010). Emotion-focused cognitive therapy. London: Wiley.

Whittle, P. (1999). Experimental Psychology and Psychoanalysis: What We Can Learn from a Century of Misunderstanding. Neuropsychoanalysis, 1, 233–245.

Windgassen, S., Goldsmith, K., Moss-Morris, R., & Chalder, T. (2016). Establishing how psychological therapies work: the importance of mediation analysis. Journal of Mental Health, 25, 93–99.

## What is Theory-Based Evaluation, really?

It is a cliché that randomised controlled trials (RCTs) are the gold standard if you want to evaluate a social policy or intervention and quasi-experimental designs (QEDs) are presumably the silver standard. But often it is not possible to use either, especially for complex policies. Theory-Based Evaluation is an alternative that has been around for a few decades, but what exactly is it?

In this post I will sketch out what some key texts say about Theory-Based Evaluation; explore one approach, contribution analysis; and conclude with discussion of an approach to assessing evidence in contribution analyses (and a range of other approaches) using Bayes’ rule.

### theory (lowercase)

Let’s get the obvious out of the way. All research, evaluation included, is “theory-based” by necessity, even if an RCT is involved. Outcome measures and interviews alone cannot tell us what is going on; some sort of theory (or story, account, narrative, …) – however flimsy or implicit – is needed to design an evaluation and interpret what the data means.

If you are evaluating a psychological therapy, then you probably assume that attending sessions exposes therapy clients to something that is likely to be helpful. You might make assumptions about the importance of the therapeutic relationship to clients’ openness, of any homework activities carried out between sessions, etc. RCTs can include statistical mediation tests to determine whether the various things that happen in therapy actually explain any difference in outcome between a therapy and comparison group (e.g., Freeman et al., 2015).

It is great if a theory makes accurate predictions, but theories are underdetermined by evidence, so this cannot be the only criterion for preferring one theory’s explanation over another (Stanford, 2017) – again, even if you have an effect size from an RCT. Lots of theories will be compatible with any RCT’s results. To see this, take a particular social science RCT and think hard about what might be going on in the intervention group beyond what the intervention developers have explicitly intended.

In addition to accuracy, Kuhn (1977) suggests that a good theory should be consistent with itself and other relevant theories; have broad scope; bring “order to phenomena that in its absence would be individually isolated”; and it should produce novel predictions beyond current observations. There are no obvious formal tests for these properties, especially where theories are expressed in ordinary language and box-and-arrow diagrams.

### Theory-Based Evaluation (title case)

Theory-Based Evaluation is a particular genre of evaluation that includes realist evaluation and contribution analysis. According to the UK government’s Magenta Book (HM Treasury, 2020, p. 43), Theory-Based methods of evaluation

“can be used to investigate net impacts by exploring the causal chains thought to bring about change by an intervention. However, they do not provide precise estimates of effect sizes.”

The Magenta Book acknowledges (p. 43) that “All evaluation methods can be considered and used as part of a [Theory-Based] approach”; however, Figure 3.1 (p. 47) is clear. If you can “compare groups affected and not affected by the intervention”, you should go for experiments or quasi-experiments; otherwise, Theory-Based methods are required.

Theory-Based Evaluation attempts to draw causal conclusions about a programme’s effectiveness in the absence of any comparison group. If a quasi-experimental design (QED) or randomised controlled trial (RCT) were added to an evaluation, it would cease to be Theory-Based Evaluation, as the title case term is used.

### Example: Contribution analysis

Contribution analysis is an approach to Theory-Based Evaluation developed by John Mayne (28 November 1943 – 18 December 2020). Mayne was originally concerned with how to use monitoring data to decide whether social programmes actually worked when quasi-experimental approaches were not feasible (Mayne, 2001), but the approach evolved to have broader scope.

According to a recent summary (Mayne, 2019), contribution analysis consists of six steps (and an optional loop):

Step 1: Set out the specific cause-effect questions to be addressed.

Step 2: Develop robust theories of change for the intervention and its pathways.

Step 3: Gather the existing evidence on the components of the theory of change model of causality: (i) the results achieved and (ii) the causal link assumptions realized.

Step 4: Assemble and assess the resulting contribution claim, and the challenges to it.

Step 5: Seek out additional evidence to strengthen the contribution claim.

Step 6: Revise and strengthen the contribution claim.

Here is a diagrammatic depiction of the kind of theory of change that could be plugged in at Step 2 (Mayne, 2015, p. 132), which illustrates the cause-effect links an evaluation would aim to evaluate.

In this example, mothers are thought to learn from training sessions and materials, which then persuades them to adopt new feeding practices. This leads to children having more nutritious diets. The theory is surrounded by various contextual factors such as food prices. (See also Mayne, 2017, for a version of this that includes ideas from the COM-B model of behaviour.)

Step 4 is key. It requires evaluators to “Assemble and assess the resulting contribution claim”. How are we to carry out that assessment? Mayne (2001, p. 14) suggests some questions to ask:

“How credible is the story? Do reasonable people agree with the story? Does the pattern of results observed validate the results chain? Where are the main weaknesses in the story?”

For me, the most credible stories would include experimental or quasi-experimental tests, with mediation analysis of key hypothesised mechanisms, and qualitative detective work to get a sense of what’s going on beyond the statistical associations. But the quant part of that would lift us out of the Theory-Based Evaluation wing of the Magenta Book flowchart. In general, plausibility will be determined outside contribution analysis in, e.g., quality criteria for whatever methods for data collection and analysis were used. Contribution analysis says remarkably little on this key step.

Although contribution analysis is intended to fill a gap where no comparison group is available, Mayne (2001, p. 18) suggests that further data might be collected to help rule out alternative explanations of outcomes, e.g., from surveys, field visits, or focus groups. He also suggests reviewing relevant meta-analyses, which could (I presume) include QED and RCT evidence.

It is not clear to me what the underlying theory of causation is in contribution analysis. It is clear what it is not (Mayne, 2019, pp. 173–4):

“In many situations a counterfactual perspective on causality—which is the traditional evaluation perspective—is unlikely to be useful; experimental designs are often neither feasible nor practical…”

“[Contribution analysis] uses a stepwise (generative) not a counterfactual approach to causality.”

(We will explore counterfactuals below.) I can guess what this generative approach could be, but Mayne does not provide precise definitions. It clearly isn’t the idea from generative social science in which causation is defined in terms of computer simulations (Epstein, 1999).

One way to think about it might be in terms of mechanisms: “entities and activities organized in such a way that they are responsible for the phenomenon” (Illari & Williamson, 2011, p. 120). We could make this precise by modelling the mechanisms using causal Bayesian networks such that variables (nodes in a network) represent the probability of activities occurring, conditional on temporally earlier activities having occurred – basically, a chain of probabilistic if-thens.

Why do people get vaccinated for Covid-19? Here is the beginning of a (generative?) if-then theory:

1. If you learned about vaccines in school and believed what you learned and are exposed to an advert for Covid-19 jab and are invited by text message to book an appointment for one, then (with a certain probability) you use your phone to book an appointment.
2. If you have booked an appointment, then (with a certain probability) you travel to the vaccine centre in time to attend the appointment.
3. If you attend the appointment, then (with a certain probability) you are asked to join a queue.

… and so on …

In a picture:

This does not explain how or why the various entities (people, phones, etc.) and activities (doing stuff like getting the bus as a result of beliefs and desires) are organised as they are, just the temporal order in which they are organised and dependencies between them. Maybe this suffices.
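As a rough sketch of what a chain of probabilistic if-thens looks like once numbers are attached, here is a small Python example. All of the probabilities are invented, and I am assuming that each step can only occur if the previous one did, which the if-then statements above do not strictly say.

```python
# Hypothetical conditional probabilities for each link in the chain:
# P(step occurs | previous step occurred). All numbers are invented.
chain = [
    ("books appointment", 0.7),   # given school learning, advert, and text invite
    ("travels to centre", 0.9),   # given an appointment was booked
    ("joins queue", 0.95),        # given attendance at the appointment
]

def reach_probabilities(links, p_start=1.0):
    """Probability of reaching each step, assuming a step can only occur
    if the previous one did (probability zero otherwise)."""
    results = []
    p = p_start
    for name, p_given_previous in links:
        p *= p_given_previous
        results.append((name, p))
    return results

for name, p in reach_probabilities(chain):
    print(f"P({name}) = {p:.3f}")
```

The probability of getting to the end of the chain shrinks with every link, which is one reason long theories of change are demanding: each step is an opportunity for the process to stall.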

### What are counterfactual approaches?

Counterfactual impact evaluation usually refers to quantitative approaches to estimate average differences as understood in a potential outcomes framework (or generalisations thereof). The key counterfactual is something like:

“If the beneficiaries had not taken part in programme activities, then they would not have had the outcomes they realised.”

Logicians have long worried how to determine the truth of counterfactuals, “if A had been true, B.” One approach, due to Stalnaker (1968), proposes that you:

1. Start with a model representing your beliefs about the factual situation where A is false. This model must have enough structure so that tweaking it could lead to different conclusions (causal Bayesian networks have been proposed; Pearl, 2013).
2. Add the proposition that A is true to your belief model.
3. Modify the belief model in a minimal way to remove contradictions introduced by adding A.
4. Determine the truth of B in that revised belief model.

This broader conception of counterfactual seems compatible with any kind of evaluation, contribution analysis included. White (2010, p. 157) offered a helpful intervention, using the example of a pre-post design where the same outcome measure is used before and after an intervention:

“… having no comparison group is not the same as having no counterfactual. There is a very simple counterfactual: what would [the outcomes] have been in the absence of the intervention? The counterfactual is that it would have remained […] the same as before the intervention.”

The counterfactual is untested and could be false – regression to the mean would scupper it in many cases. But it can be stated and used in an evaluation. I think Stalnaker’s approach is a handy mental trick for thinking through the implications of evidence and producing alternative explanations.
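To see how regression to the mean can undermine the “same as before” counterfactual, here is a small simulation sketch. It rests on assumptions of my own choosing: scores are a stable true level plus independent noise at each time point, higher scores are better, people enter the programme because their baseline score is low, and the programme has no effect whatsoever.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

def simulate(n=20000, threshold=-1.0):
    """Average pre-post change when the intervention has NO effect at all.

    Each person has a stable true level plus independent measurement
    noise at each time point. People are selected into the (ineffective)
    programme because their baseline score is low.
    """
    changes = []
    for _ in range(n):
        true_level = random.gauss(0, 1)
        pre = true_level + random.gauss(0, 1)
        post = true_level + random.gauss(0, 1)
        if pre < threshold:          # selected on an extreme baseline score
            changes.append(post - pre)
    return mean(changes)

avg_change = simulate()
print(f"Average pre-post change with no true effect: {avg_change:.2f}")
```

Under these assumptions the selected group improves on average even though nothing was done to them, so a pre-post comparison that assumes scores would have stayed put credits the programme with change it did not cause.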

Cook (2000) offers seven reasons why Theory-Based Evaluation cannot “provide the valid conclusions about a program’s causal effects that have been promised.” I think from those seven, two are key: (i) it is usually too difficult to produce a theory of change that is comprehensive enough for the task and (ii) the counterfactual remains theoretical – in the arm-chair, untested sense of theoretical – so it is too difficult to judge what would have happened in the absence of the programme being evaluated. Instead, Cook proposes including more theory in comparison group evaluations.

### Bayesian contribution tracing

Contribution analysis has been supplemented with a Bayesian variant of process tracing (Befani & Mayne, 2014; Befani & Stedman-Bryce, 2017; see also Fairfield & Charman, 2017, for a clear introduction to Bayesian process tracing more generally).

The idea is that you produce (often subjective) probabilities of observing particular (usually qualitative) evidence under your hypothesised causal mechanism and under one or more alternative hypotheses. These probabilities and prior probabilities for your competing hypotheses can then be plugged into Bayes’ rule when evidence is observed.

Suppose you have two competing hypotheses: a particular programme led to change versus pre-existing systems. You may begin by assigning them equal probability, 0.5 and 0.5. If relevant evidence is observed, then Bayes’ rule will shift the probabilities so that one becomes more probable than the other.
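As a worked illustration, write $$H_{\text{prog}}$$ for the programme hypothesis, $$H_{\text{sys}}$$ for the pre-existing systems hypothesis, and $$E$$ for the observed evidence. The likelihoods $$P(E \mid H_{\text{prog}}) = 0.8$$ and $$P(E \mid H_{\text{sys}}) = 0.2$$ are numbers I have made up for the example; only the equal priors come from the text above.

$$\displaystyle P(H_{\text{prog}} \mid E) = \frac{P(E \mid H_{\text{prog}})\,P(H_{\text{prog}})}{P(E \mid H_{\text{prog}})\,P(H_{\text{prog}}) + P(E \mid H_{\text{sys}})\,P(H_{\text{sys}})} = \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.2 \times 0.5} = 0.8$$

So evidence that is four times more likely under the programme hypothesis than under the alternative moves the probability of the programme hypothesis from 0.5 to 0.8.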

Process tracers often cite Van Evera’s (1997) tests such as the hoop test and smoking gun. I find definitions of these challenging to remember so one thing I like about the Bayesian approach is that you can think instead of specificity and sensitivity of evidence, by analogy with (e.g., medical) diagnostic tests. A good test of a causal mechanism is sensitive, in the sense that there is a high probability of observing the relevant evidence if your causal theory is accurate. A good test is also specific, meaning that the evidence is unlikely to be observed if any alternative theory is true. See below for a table (lightly edited from Befani & Mayne, 2014, p. 24) showing the conditional probabilities of evidence for each of Van Evera’s tests given a hypothesis and alternative explanation.

| Van Evera test (if Eᵢ is observed) | P(Eᵢ \| Hyp) | P(Eᵢ \| Alt) |
| --- | --- | --- |
| Fails hoop test | Low | – |
| Passes smoking gun | – | Low |
| Doubly-decisive test | High | Low |
| Straw-in-the-wind test | High | High |

Let’s take the hoop test. This applies to evidence which is unlikely if your preferred hypothesis were true. So if you observe that evidence, the hoop test fails. The test is agnostic about the probability under the alternative hypothesis. Straw-in-the-wind is hopeless for distinguishing between your two hypotheses, but could suggest that neither holds if the test fails. The doubly-decisive test has high sensitivity and high specificity, so provides strong evidence for your hypothesis if it passes.
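Here is a short Python sketch of how observing the evidence would shift belief under each of the four tests, starting from a 0.5 prior. The numeric stand-ins are my own assumptions: 0.9 for “High”, 0.1 for “Low”, and 0.5 wherever a test says nothing about a conditional probability.

```python
def posterior(prior, p_e_given_hyp, p_e_given_alt):
    """Posterior probability of the hypothesis after observing evidence E,
    for two exhaustive, mutually exclusive hypotheses (Bayes' rule)."""
    numerator = p_e_given_hyp * prior
    return numerator / (numerator + p_e_given_alt * (1 - prior))

# Assumed numeric stand-ins: "High" = 0.9, "Low" = 0.1, and 0.5 where the
# test says nothing about that conditional probability.
HIGH, LOW, UNSPECIFIED = 0.9, 0.1, 0.5

# Each entry is (P(E | Hyp), P(E | Alt)).
tests = {
    "Fails hoop test":        (LOW, UNSPECIFIED),
    "Passes smoking gun":     (UNSPECIFIED, LOW),
    "Doubly-decisive test":   (HIGH, LOW),
    "Straw-in-the-wind test": (HIGH, HIGH),
}

for name, (p_hyp, p_alt) in tests.items():
    print(f"{name}: P(Hyp | E) = {posterior(0.5, p_hyp, p_alt):.2f}")
```

With these stand-in values, a failed hoop test pulls the hypothesis well below 0.5, a passed smoking gun or doubly-decisive test pushes it well above, and straw-in-the-wind leaves it where it started. Different stand-in values would change the sizes of the shifts.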

The arithmetic is straightforward if you stick to discrete multinomial variables and use software for conditional independence networks. Eliciting the subjective probabilities for each source of evidence, conditional on each hypothesis, may be less straightforward.
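For a sense of how simple the arithmetic is, here is a sketch that handles more than two hypotheses and more than one piece of evidence without any specialist software. The hypotheses, priors, and likelihoods are all invented, and updating item by item assumes the pieces of evidence are conditionally independent given each hypothesis, which a full network model would let you relax.

```python
def update(priors, likelihoods):
    """One application of Bayes' rule over any number of competing hypotheses.

    priors: dict mapping hypothesis -> prior probability (summing to 1)
    likelihoods: dict mapping hypothesis -> P(observed evidence | hypothesis)
    """
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}

# Three hypothetical explanations with equal priors.
beliefs = {"programme": 1 / 3, "pre-existing systems": 1 / 3, "other": 1 / 3}

# Invented likelihoods for two observed pieces of evidence.
evidence = [
    {"programme": 0.8, "pre-existing systems": 0.4, "other": 0.4},
    {"programme": 0.6, "pre-existing systems": 0.1, "other": 0.3},
]

# Updating item by item is only valid if the pieces of evidence are
# conditionally independent given each hypothesis (an assumption here).
for item in evidence:
    beliefs = update(beliefs, item)

for hypothesis, p in beliefs.items():
    print(f"P({hypothesis} | evidence) = {p:.3f}")
```

The hard part is not this calculation but defending the numbers that go into `evidence`.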

### Conclusions

I am with Cook (2000) and others who favour a broader conception of “theory-based” and suggest that better theories should be tested in quantitative comparison studies. However, it is clear that it is not always possible to find a comparison group – colleagues and I have had to make do without (e.g., Fugard et al., 2015). Using Theory-Based Evaluation in practice reminds me of jury service: a team are guided through thick folders of evidence, revisiting several key sections that are particularly relevant, and work hard to reach the best conclusion they can with what they know. There is no convenient effect size to consult, just a shared (to some extent) and informal idea of what intuitively feels more or less plausible (and lengthy discussion where there is disagreement). To my mind, when quantitative comparison approaches are not possible, Bayesian approaches to assessing qualitative evidence are the most compelling way to synthesise qualitative evidence of causal impact and make transparent how this synthesis was done.

Finally, it seems to me that the Theory-Based Evaluation category is poorly named. A better name might be Assumption-Based Counterfactual approaches. Then RCTs and QEDs are Comparison-Group Counterfactual approaches. Both are types of theory-based evaluation and both use counterfactuals; it’s just that approaches using comparison groups gather quantitative evidence to test the counterfactual. However, the term doesn’t quite work since RCTs and QEDs rely on assumptions too… Further theorising needed.

Edited to add: Reichardt’s (2022), The Counterfactual Definition of a Program Effect, is a very promising addition to the literature and, I think, offers a clear way out of the theory-based versus non-theory-based and counterfactual versus not-counterfactual false dichotomies. I’ve blogged about it here.

### References

Befani, B., & Mayne, J. (2014). Process Tracing and Contribution Analysis: A Combined Approach to Generative Causal Inference for Impact Evaluation. IDS Bulletin, 45(6), 17–36.

Befani, B., & Stedman-Bryce, G. (2017). Process Tracing and Bayesian Updating for impact evaluation. Evaluation, 23(1), 42–60.

Cook, T. D. (2000). The false choice between theory-based evaluation and experimentation. In L. A. Fierro & T. M. Franke (Eds.), New Directions for Evaluation (pp. 27–34).

Epstein, J. M. (1999). Agent-based computational models and generative social science. Complexity, 4(5), 41–60.

Fairfield, T., & Charman, A. E. (2017). Explicit Bayesian analysis for process tracing: Guidelines, opportunities, and caveats. Political Analysis, 25(3), 363–380.

Freeman, D., Dunn, G., Startup, H., Pugh, K., Cordwell, J., Mander, H., Černis, E., Wingham, G., Shirvell, K., & Kingdon, D. (2015). Effects of cognitive behaviour therapy for worry on persecutory delusions in patients with psychosis (WIT): a parallel, single-blind, randomised controlled trial with a mediation analysis. The Lancet Psychiatry, 2(4), 305–313.

Fugard, A. J. B., Stapley, E., Ford, T., Law, D., Wolpert, M. & York, A. (2015). Analysing and reporting UK CAMHS outcomes: an application of funnel plots. Child and Adolescent Mental Health, 20, 155–162.

HM Treasury. (2020). Magenta Book.

Illari, P. M., & Williamson, J. (2011). What is a mechanism? Thinking about mechanisms across the sciences. European Journal for Philosophy of Science, 2(1), 119–135.

Kuhn, T. S. (1977). Objectivity, Value Judgment, and Theory Choice. In The Essential Tension: Selected Studies in Scientific Tradition and Change (pp. 320–339). The University of Chicago Press.

Mayne, J. (2001). Addressing attribution through contribution analysis: using performance measures sensibly. The Canadian Journal of Program Evaluation, 16(1), 1–24.

Mayne, J. (2015). Useful theory of change models. Canadian Journal of Program Evaluation, 30(2), 119–142.

Mayne, J. (2017). Theory of change analysis: Building robust theories of change. Canadian Journal of Program Evaluation, 32(2), 155–173.

Mayne, J. (2019). Revisiting contribution analysis. Canadian Journal of Program Evaluation, 34(2), 171–191.

Pearl, J. (2013). Structural counterfactuals: A brief introduction. Cognitive Science, 37(6), 977–985.

Stalnaker, R. C. (1968). A Theory of Conditionals. In Ifs (pp. 41–55). Basil Blackwell Publisher.

Stanford, K. (2017). Underdetermination of Scientific Theory. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy.

Van Evera, S. (1997). Guide to Methods for Students of Political Science. New York, NY: Cornell University Press.

White, H. (2010). A contribution to current debates in impact evaluation. Evaluation, 16(2), 153–164.