“The positivist picture of the structure of scientific theories is now widely rejected. But the underlying idea that scientific theories are primarily designed to predict and explain claims about what we observe remains enormously influential, even among the sharpest critics of positivism.” (p. 304)
“Phenomena are detected through the use of data, but in most cases are not observable in any interesting sense of that term. Examples of data include bubble chamber photographs, patterns of discharge in electronic particle detectors and records of reaction times and error rates in various psychological experiments. Examples of phenomena, for which the above data might provide evidence, include weak neutral currents, the decay of the proton, and chunking and recency effects in human memory.” (p. 306)
“Our general thesis, then, is that we need to distinguish what theories explain (phenomena or facts about phenomena) from what is uncontroversially observable (data).” (p. 314)
Bogen, J., & Woodward, J. (1988). Saving the phenomena. The Philosophical Review, XCVII(3), 303–352.
‘A mechanism is one of the processes in a concrete system that makes it what it is—for example, metabolism in cells, interneuronal connections in brains, work in factories and offices, research in laboratories, and litigation in courts of law. Because mechanisms are largely or totally imperceptible, they must be conjectured. Once hypothesized they help explain, because a deep scientific explanation is an answer to a question of the form, “How does it work, that is, what makes it tick—what are its mechanisms?”’ (p. 182; abstract)
‘Consider the well-known law-statement, “Taking ‘Ecstasy’ causes euphoria,” which makes no reference to any mechanisms. This statement can be analyzed as the conjunction of the following two well-corroborated mechanistic hypotheses: “Taking ‘Ecstasy’ causes serotonin excess,” and “Serotonin excess causes euphoria.” These two together explain the initial statement. (Why serotonin causes euphoria is of course a separate question that cries for a different mechanism.)’ (p. 198)
‘How do we go about conjecturing mechanisms? The same way as in framing any other hypotheses: with imagination both stimulated and constrained by data, well-weathered hypotheses, and mathematical concepts such as those of number, function, and equation. […] There is no method, let alone a logic, for conjecturing mechanisms. […] One reason is that, typically, mechanisms are unobservable, and therefore their description is bound to contain concepts that do not occur in empirical data.’ (p. 200)
‘Even the operations of a corner store are only partly overt. For instance, the grocer does not know, and does not ordinarily care to find out, why a customer buys breakfast cereal of one kind rather than another. However, if he cares he can make inquiries or guesses—for instance, that children are likely to be sold on packaging. That is, the grocer may make up what is called a “theory of mind,” a hypothesis concerning the mental processes that end up at the cash register.’ (p. 201)
Bunge, M. (2004). How Does It Work?: The Search for Explanatory Mechanisms. Philosophy of the Social Sciences, 34(2), 182–210.
Realist evaluation (formerly known as realistic evaluation; Pawson & Tilley, 2004, p. 3) is an approach to Theory-Based Evaluation that treats, e.g., burglars and prisons as real as opposed to narrative constructs; follows “a realist methodology” that aims for scientific “detachment” and “objectivity”; and also strives to be realistic about the scope of evaluation (Pawson & Tilley, 1997, pp. xii-xiv).
“Realist(ic)” evaluation proposes something apparently new and distinctive. How does it look in practice? What’s new about it? Let’s have a read of Pawson and Tilley’s (1997) classic to try to find out.
Open any text on social science methodology, and it will say something like the following about the process of carrying out research:
1. Review what is known about your topic area, including theories which attempt to explain and bring order to the various disparate findings.
2. Use prior theory, supplemented with your own thinking, to formulate research questions or hypotheses.
3. Choose methods that will enable you to answer those questions or test the hypotheses.
4. Gather and analyse data.
5. Interpret the analysis in relation to the theories introduced at the outset. What have you learned? Do the theories need to be tweaked? For qualitative research, this interpretation and analysis are often interwoven.
6. Acknowledge limitations of your study. This will likely include reflection about whether your method or the theory are to blame for any mismatch between theory and findings.
7. Add your findings to the pool of knowledge (after a gauntlet of peer review).
8. Loop back to 1.
Realist evaluation has something similar:
It is scientific method as usual with constraints on what the various stages should include for a study to be certified genuinely “realist”. For instance, the theories should be framed in terms of contexts, mechanisms, and outcomes (more on which in a moment); hypotheses emphasise the “for whom” and circumstances of an evaluation; and instead of “empirical generalisation” there is a “program specification”.
The method of data collection and analysis can be anything that satisfies this broad research loop (p. 85):
“… we cast ourselves as solid members of the modern, vociferous majority […], for we are whole-heartedly pluralists when it comes to the choice of method. Thus, as we shall attempt to illustrate in the examples to follow, it is quite possible to carry out realistic evaluation using: strategies, quantitative and qualitative; timescales, contemporaneous or historical; viewpoints, cross-sectional or longitudinal; samples, large or small; goals, action-oriented or audit-centred; and so on and so forth. [… T]he choice of method has to be carefully tailored to the exact form of hypotheses developed earlier in the cycle.”
This is reassuringly similar to the standard textbook story. However, like the standard story, in practice there are ethical and financial constraints on method. Indeed the UK government’s evaluation bible, the Magenta Book (HM Treasury, 2020), recommends using Theory-Based approaches like “realist” evaluation when experimental and quasi-experimental approaches are not feasible. (See also, What is Theory-Based Evaluation, really?)
More than a moment’s thought about theory
Pawson and Tilley (1997) emphasise the importance of thinking about why social interventions may lead to change and not only looking at outcomes, which they illustrate with the example of CCTV:
“CCTV certainly does not create a physical barrier making cars impenetrable. A moment’s thought has us realize, therefore, that the cameras must work by instigating a chain of reasoning and reaction. Realist evaluation is all about turning this moment’s thought into a comprehensive theory of the mechanisms through which CCTV may enter the potential criminal’s mind, and the contexts needed if these powers are to be realized.” (p. 78)
They then list a range of potential mechanisms. CCTV might make it more likely that thieves are caught in the act. Or maybe the presence of CCTV makes car parks feel safer, which means they are used by more people whose presence and watchful eyes prevent theft. So other people provide the surveillance rather than the camera bolted to the wall.
Nothing new here – social science is awash with theory (Pawson and Tilley cite Durkheim’s 1897 work on suicide as an example). Psychological therapies are some of the most evaluated of social interventions and the field is particularly productive when it comes to theory; see, e.g., Whittle (1999, p. 240) on psychoanalysis, a predecessor of modern therapies:
“Psychoanalysis is full of theory. It has to be, because it is so distrustful of the surface. It could still choose to use the minimum necessary, but it does the opposite. It effervesces with theory…”
Power (2010) argues that most effects in modern therapies can be explained by transference (exploring and using how the relationship between therapist and client mirrors relationships outside therapy), graded exposure to situations which provoke anxiety, and challenging dysfunctional assumptions – for each of which there are detailed theories of change.
However, perhaps evaluations of social programmes – therapies included – have concentrated too much on tracking outcomes and neglected getting to grips with potential mechanisms of change, so “realist” evaluation is potentially a helpful intervention. The specific example of CCTV is a joy to read and is a great way to bring the sometimes abstract notion of social mechanism alive.
The structure of explanations in “realist” evaluation
The context-mechanism-outcome triad is a salient feature of the approach. Rather than define each of these (see the original text), here are four examples from Pawson and Tilley (1997) to illustrate what they are. The middle column (New mechanism) describes the putative mechanism that may be “triggered” by a social programme that has been introduced.
| Context | New mechanism | Outcome |
| --- | --- | --- |
| Poor-quality, hard-to-let housing; traditional housing department; lack of tenant involvement in estate management | Improved housing and increased involvement in management create increased commitment to the estate, more stability, and opportunities and motivation for social control and collective responsibility | |
| Three tower blocks, occupied mainly by the elderly; traditional housing department; lack of tenant involvement in estate management | Concentration of elderly tenants into smaller blocks and natural wastage creates vacancies taken up by young, formerly homeless single people inexperienced in independent living. They become the dominant group. They have little capacity or inclination for informal social control, and are attracted to a hospitable estate subterranean subculture | Increased burglary prevalence concentrated amongst the more vulnerable; high levels of vandalism and incivility |
| Prisoners with little or no previous education with a growing string of convictions – representing a ‘disadvantaged’ background | Modest levels of engagement and success with the program trigger ‘habilitation’ process in which the inmate experiences self-realization and social acceptability (for the first time) | Lowest levels of reconviction as compared with statistical norm for such inmates |
| High numbers of prepayment meters, with a high proportion of burglaries involving cash from meters | Removal of cash meters reduces incentive to burgle by decreasing actual or perceived rewards | Reduction in percentage of burglaries involving meter breakage; reduced risk of burglary at dwellings where meters are removed; reduced burglary rate overall |
This seems a helpful way to organise thinking about the context-mechanism-outcome triad, irrespective of whether the approach is labelled “realist”. Those who are into logframe matrices (logframes) might want to add a column for the “outputs” of a programme.
The authors emphasise that the underlying causal model is “generative” in the sense that causation is seen as
“acting internally as well as externally. Cause describes the transformative potential of phenomena. One happening may well trigger another but only if it is in the right condition in the right circumstances. Unless explanation penetrates to these real underlying levels, it is deemed to be incomplete.” (p. 34)
The “internal” here appears to refer to looking inside the “black box” of a social programme to see how it operates, rather than merely treating it as something that is present in some places and absent in others. Later, there is further elaboration of what “generative” might mean:
“To ‘generate’ is to ‘make up’, to ‘manufacture’, to ‘produce’, to ‘form’, to ‘constitute’. Thus when we explain a regularity generatively, we are not coming up with variables or correlates which associate one with the other; rather we are trying to explain how the association itself comes about. The generative mechanisms thus actually constitute the regularity; they are the regularity.” (p. 67)
We also learn that an action is causal only if its outcome is triggered by a mechanism in a context (p. 58). Okay, but how do we find out if an action’s outcome is triggered in this manner? “Realist” evaluation does not, in my view, provide an adequate analysis of what a causal effect is. Understandable, perhaps, given its pluralist approach to method. So, understandings of causation must come from elsewhere.
Mechanisms can be seen as “entities and activities organized in such a way that they are responsible for the phenomenon” (Illari & Williamson, 2011, p. 120). In “realist” evaluation, entities and their activities in the context would be included in this organisation too – the context supplies the mechanism on which a programme intervenes. So, let’s take one of the example mechanisms from the table above:
“Improved housing and increased involvement in management create increased commitment to the estate, more stability, and opportunities and motivation for social control and collective responsibility.”
To make sense of this, we need a theory of what improved housing looks like, what involvement in management and commitment to the estate, etc., means. To “create commitment” seems like a psychological, motivational process. The entities are the housing, management structures, people living in the estate, etc. To evidence the mechanism, I think it does help to think of variables to operationalise what might be going on and to use comparison groups to avoid mistaking, e.g., regression to the mean or friendlier neighbours for change due to improved housing. And indeed, Pawson and Tilley use quantitative data in one of the “realist” evaluations they discuss (next section). Such operationalisation does not reduce a mechanism to a set of variables; it is merely a way to analyse a mechanism.
Kinds of evidence
Chapter 4 gives a range of examples of the evidence that has been used in early “realist” evaluations. In summary, and confirming the pluralist stance mentioned above, it seems that all methods are relevant to realist evaluation. Two examples:
Interviews with practitioners to try to understand what it is about a programme that might effect change: “These inquiries released a flood of anecdotes, and the tales from the classroom are remarkable not only for their insight but in terms of the explanatory form which is employed. These ‘folk’ theories turn out to be ‘realist’ theories and invariably identify those contexts and mechanisms which are conducive to the outcome of rehabilitation.” (pp. 107-108)
Identifying variables in an information management system to “operationalize these hunches and hypotheses in order to identify, with more precision, those combinations of types of offender and types of course involvement which mark the best chances of rehabilitation. Over 50 variables were created…” (p. 108)
Some researchers have made a case for and carried out what they term realist randomised controlled trials (Bonell et al., 2012; which seems eminently sensible to me). The literature subsequently exploded in response. Here’s an illustrative excerpt of the criticisms (Marchal et al., 2013, p. 125):
“Experimental designs, especially RCTs, consider human desires, motives and behaviour as things that need to be controlled for (Fulop et al., 2001, Pawson, 2006). Furthermore, its analytical techniques, like linear regression, typically attempt to isolate the effect of each variable on the outcome. To do this, linear regression holds all other variables constant “instead of showing how the variables combine to create outcomes” (Fiss, 2007, p. 1182). Such designs “purport to control an infinite number of rival hypotheses without specifying what any of them are” by rendering them implausible through statistics (Campbell, 2009), and do not provide a means to examine causal mechanisms (Mingers, 2000).”
Well. What to make of this. Yes, RCTs control for stuff that’s not measured and maybe even unmeasurable. But you can also measure stuff you know about and see if that moderates or mediates the outcome (see, e.g., Windgassen et al., 2016). You might also use the numbers to select people for qualitative interview to try to learn more about what is going on. The comment on linear regression reveals surprising ignorance of how non-linear transformations of and interactions between predictors can be added to models. It is also trivial to calculate marginal outcome predictions for combinations of predictors together, rather than merely identifying which predictors are likely non-zero when holding others fixed. See Bonell et al. (2016) for a very patient reply.
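To make the point about regression concrete, here is a minimal sketch using simulated data (everything in it is an assumption for illustration: the variable names, the coefficients, and the choice of one binary treatment and one continuous moderator). It fits a linear model that includes an interaction and a squared term, then predicts outcomes for combinations of predictors rather than reading off one coefficient at a time.

```python
import numpy as np

def design(treat, x):
    """Design matrix: intercept, treatment, moderator, interaction, squared moderator."""
    treat = np.asarray(treat, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), treat, x, treat * x, x ** 2])

# Simulated trial (hypothetical numbers): the treatment effect depends on x.
rng = np.random.default_rng(0)
n = 2000
treat = rng.integers(0, 2, size=n)
x = rng.normal(size=n)
true_beta = np.array([1.0, 2.0, 0.5, 1.5, 0.3])
y = design(treat, x) @ true_beta + rng.normal(scale=0.5, size=n)

# Ordinary least squares; the model is still linear in its coefficients.
beta_hat, *_ = np.linalg.lstsq(design(treat, x), y, rcond=None)

def predict(treat, x):
    """Predicted outcome for given combinations of treatment and moderator."""
    return design(treat, x) @ beta_hat

# Predicted treatment effect at three levels of the moderator.
levels = np.array([-1.0, 0.0, 1.0])
effects = predict(np.ones(3), levels) - predict(np.zeros(3), levels)
for level, effect in zip(levels, effects):
    print(f"moderator = {level:+.0f}: predicted treatment effect = {effect:.2f}")
```

The predicted effect changes with the moderator, which is one way a regression can show how variables combine to create outcomes.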
The plea for evaluators to spend more time developing theory is welcome – especially in policy areas where “key performance indicators” and little else are the norm (see also Carter, 1989, on KPIs as dials versus tin openers opening a can of worms). It is a laudable aim to help “develop the theories of practitioners, participants and policy makers” of why a programme might work (Pawson & Tilley, 1997, p. 214). The separation of context, mechanism, and outcome also helps structure thinking about social programmes (though there is widespread confusion about what a mechanism is in the “realist” literature; Lemire et al., 2020). But “realist” evaluation is arguably better seen as an exposition of a particular reading of ye olde scientific method applied to evaluation, with a call for pluralist methods. I am unconvinced that it is a novel form of evaluation.
Bhaskar’s critical realism emphasises a distinction between intransitive and transitive objects. I think the easiest way to see how the distinction works in social science (as opposed to say, geology) is as follows. Find all the social theorists and make them and their books and journal articles vanish. The things that are left are intransitive objects, e.g., people and social institutions like banks and governments, and all the things they do even though no theorists are around to observe. The things that vanish with the theorists are all the transitive objects – the fallible accounts of how the various intransitive objects “work”.
It should be recognised that the theorists and their theories are intransitive objects too and theories influence social life, e.g., through the pop psychology jargon people use when they talk to each other. Also everyone theorises, not just professionals. But let’s not get tied up in knots.
Ontology is about the kinds of things that exist, including material and abstract “things” like numbers. Cruickshank (2004) argues that ontology is defined in two different ways by critical realists. Sometimes it refers to all the things, knowable and not, in the intransitive sense. Other times ontology refers to critical realists’ theories of what there is – these theories are transitive objects. But reducing what there is to what is known (philosophically) about what there is commits what Bhaskar called the epistemic fallacy – one of the key fallacies critical realists are trying to help us avoid.
Cruickshank concludes that Bhaskar shoots himself in the foot by making critical realist theories of ontology inevitably commit the epistemic fallacy (Cruickshank, 2004, p. 572):
“The problem though is that in defining the epistemic fallacy as the transposing of questions about being [ontology] into questions about knowing, Bhaskar has defined the said fallacy so broadly that any reference to what we know of reality (which may well be knowledge claims with a high degree of veracity) must commit this putative fallacy. Indeed the only way to avoid this fallacy would be to step outside knowledge to ‘see’ reality in itself.”
It’s a challenging debate, aiming for precise understandings of concepts like ontology and exploring the possibilities and limits of philosophical reasoning, but it seems unhelpful for the day-to-day work of doing social science.
Perhaps more helpful – and bigger than critical realism – is to emphasise the role of creativity in doing science. We can’t just go out and rigorously observe reality (whether social life or the cosmos) and somehow perceive theories directly. Although rigorous observation is important, science involves speculating about what might be out there and then working out what evidence we would expect to see if we were correct or if plausible alternative theories were correct.
My favourite analogy comes from cryptanalysis. We can systematically analyse letter and word frequencies in ciphertexts to try to spot patterns. But it helps to guess what people might be trying to say to each other based on something beyond the ciphertext, and to use those guesses to reduce the search space of possible encryptions.
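Here is a toy sketch of that analogy. It assumes a simple Caesar shift cipher, which is my choice for illustration only, and a made-up message. One function guesses the key from letter frequencies alone; the other uses a guessed word (a "crib") to rule out keys.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase

def shift_text(text, shift):
    """Shift each lowercase letter by `shift` places; leave other characters alone."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)

def frequency_guess(ciphertext):
    """Guess the shift by assuming the most common ciphertext letter stands for 'e'."""
    counts = Counter(ch for ch in ciphertext if ch in ALPHABET)
    most_common = counts.most_common(1)[0][0]
    return (ALPHABET.index(most_common) - ALPHABET.index("e")) % 26

def shifts_consistent_with_crib(ciphertext, crib):
    """Keep only the shifts whose decryption contains the guessed word."""
    return [s for s in range(26) if crib in shift_text(ciphertext, -s)]

# Hypothetical message and key, purely for illustration.
plaintext = "meet me at the bridge at seven"
ciphertext = shift_text(plaintext, 7)

print(ciphertext)
print(frequency_guess(ciphertext))
print(shifts_consistent_with_crib(ciphertext, "bridge"))
```

The frequency guess relies only on patterns in the ciphertext. The crib brings in a speculation about what the message might say and checks every candidate key against it, which is closer to how theorising and evidence work together.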
Cruickshank, J. (2004). A tale of two ontologies: An immanent critique of critical realism. Sociological Review, 52, 567–585.
The core idea of Jenkins’ norm-relevancy account of gender is that someone’s gender is defined in terms of the gender norms they experience as relevant – whether or not they comply with those norms (there’s a lot more to the account than this – forgive me and please read the original). I’m not sure if this is enough to define gender; however, I think it’s an interesting idea for how people might decode their gender. Jenkins uses a crisp classical binary logic approach. This blog post is an attempt to explore what happens if we add probabilities.
Note that I’m simply using Bayesian networks because they do the sums for me. The direction of the arrows below is not meant to imply causation. Rather, the idea is that, from the assumption that someone is a particular gender, it is straightforward to guess the probability that a particular gender norm would be relevant. The Bayes trick then is to go in reverse from experiencing the relevance of particular norms to decoding one’s gender.
Let’s get started with some pictures.
The network below shows the setup in the absence of evidence.
The goal is to infer gender and at present the probabilities are 49-49% for man/woman and 1% for non-binary. That’s probably too high for the latter. Also I’m assuming there are only three gender identities, which is false. But onwards.
Each node with an arrow leading into it represents a conditional probability. The table below shows a conditional probability distribution defined for one of the norm-relevancy nodes.
So, in this case if someone is a man then this norm is 80% likely to be irrelevant; if someone is a woman then it is 80% likely to be relevant; and if someone is non-binary there is a 50-50 split. I’ve set up all the nodes in this pattern, just flipping the 80% to 20% and vice versa depending on whether a norm is for men or for women.
The idea then is to use the Bayesian network to calculate how likely it is that someone is a man, woman, or non-binary based on the relevance or irrelevance of the norms.
I have not yet mentioned the Spaces node top left. This is a convenient way to change the prior probabilities of each gender; so in LGBT spaces the prior probability for non-binary rises from 1% to 20% since there are likely to be more non-binary people around. This also captures the intuition that it’s easier to work out whether a particular identity applies to you if you meet more people who hold that identity. See the picture below. Note how LGBT is now bold and underlined at top left. That means we are conditioning on that, i.e., assuming that it is true.
But let’s go back to cisgendered spaces.
Suppose most (but not necessarily all) of the male norms are experienced as irrelevant and most (but not necessarily all) of the female norms are perceived as relevant. As you can see below, the probability that someone is a woman increases to over 90%:
Similarly, for the converse, where most male norms are relevant and most female norms are irrelevant, the probability that someone is a man rises to over 90%:
Now what if all the norms are relevant? Let’s also reset the evidence on whether someone is in a cis or LGBT space.
The probability of being non-binary has gone up a little to 4%, but in this state there is most likely confusion about whether the gender is male or female since they both have the highest probability and that probability is the same.
Similarly, if all the norms are irrelevant, then the probability of non-binary is 4%. Again, it is unlikely that you would infer that you are non-binary.
But increasing the prior probability of non-binary gender, for instance through meeting more non-binary people in LGBTQ+ spaces, now makes non-binary the most likely gender.
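For readers without the network software, the same sums can be sketched by hand with Bayes’ rule. This is a rough reconstruction under stated assumptions, not the network itself: the binary priors in cis spaces are set to 49.5% each so that the three priors sum to one, the 40/40 split for men and women in LGBT spaces is my guess, and the norms are treated as independent given gender.

```python
GENDERS = ("man", "woman", "non-binary")

# Assumed priors for each kind of space (see the caveats above).
PRIORS = {
    "cis": {"man": 0.495, "woman": 0.495, "non-binary": 0.01},
    "lgbt": {"man": 0.40, "woman": 0.40, "non-binary": 0.20},
}

def p_relevant(norm_for, gender):
    """P(norm is relevant | gender), following the 80/20/50 pattern in the post."""
    if gender == "non-binary":
        return 0.5
    return 0.8 if gender == norm_for else 0.2

def posterior(evidence, space="cis"):
    """Bayes' rule over genders; evidence is a list of (norm_for, relevant) pairs."""
    weights = {}
    for gender in GENDERS:
        weight = PRIORS[space][gender]
        for norm_for, relevant in evidence:
            p = p_relevant(norm_for, gender)
            weight *= p if relevant else 1 - p
        weights[gender] = weight
    total = sum(weights.values())
    return {gender: weight / total for gender, weight in weights.items()}

# Three male norms and three female norms, as in the pictures.
mostly_female = [("man", False), ("man", False), ("man", True),
                 ("woman", True), ("woman", True), ("woman", False)]
all_relevant = [("man", True)] * 3 + [("woman", True)] * 3

for label, evidence, space in [
    ("mostly female norms, cis space", mostly_female, "cis"),
    ("all norms relevant, cis space", all_relevant, "cis"),
    ("all norms relevant, LGBT space", all_relevant, "lgbt"),
]:
    result = posterior(evidence, space)
    print(label, {gender: round(p, 3) for gender, p in result.items()})
```

Under these assumptions the pattern matches the pictures: woman comes out above 90% in the first scenario, non-binary is around 4% in the second, and non-binary is the most probable gender in the third.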
To emphasise again, there are many more varieties of gender identity than the three modelled here, and an obvious thought might be that gender nonconforming but still cis man or woman could apply – especially if someone views gender as closely coupled to chromosomes/genitals. I think it’s also interesting how the underdetermination of scientific theories can apply to people’s ruminations about identity given how they feel and what other evidence they have.
The situation can also be fuzzier, e.g., where the difference between one of the binary genders and non-binary is closer:
We don’t have conscious access to mental probabilities to two decimal places, so scenarios like these may feel equiprobable.
So far we have explored the simple situation where people are only aware of three male norms and three female norms. What happens if we had more, but kept the probability distributions on each the same…? Now we’re tip-toeing towards a more realistic scenario:
Everything works as before for men and women; however, something different now happens for non-binary people. Suppose all the norms are experienced as irrelevant (it works the same for relevant):
Now the most probable gender is non-binary (though man and woman are still far from zero: 24%).
This is true even in cis spaces:
Finally, there’s another way to bump up the probability of non-binary. Let’s go back to two gender norms, one male and one female. However, set the probabilities so that if you’re a woman, it’s 99.99% probable that the female norm will apply (and similar for men and male norms). Set it to 50-50 for non-binary. Now we get a strong inference towards non-binary if neither or both norms are relevant, even in cis spaces.
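Both of the last two moves (more norms, or sharper norms) can be sketched with one small function. It is an illustration under assumptions of mine: norms are independent given gender, there are equal numbers of male and female norms, the prior for non-binary is 1% with the rest split evenly, and all norms are experienced as relevant. By symmetry the answer is the same if all are irrelevant.

```python
def p_nonbinary(n_per_gender, p_match, prior_nb=0.01):
    """Posterior probability of non-binary when every norm is relevant.

    n_per_gender: number of male norms and, separately, of female norms.
    p_match: P(norm relevant | the binary gender the norm is aimed at);
             the other binary gender gets 1 - p_match, non-binary gets 0.5.
    """
    prior_binary = (1 - prior_nb) / 2
    # A man finds each male norm relevant with p_match and each female norm
    # relevant with 1 - p_match; the same product applies to a woman.
    binary_likelihood = (p_match * (1 - p_match)) ** n_per_gender
    nb_likelihood = 0.5 ** (2 * n_per_gender)
    nb = prior_nb * nb_likelihood
    return nb / (nb + 2 * prior_binary * binary_likelihood)

# More norms at 80%: the posterior for non-binary climbs with the number of norms.
for n in (3, 6, 9, 12):
    print(n, round(p_nonbinary(n, 0.8), 3))

# One male and one female norm, each 99.99% likely to apply to its own gender.
print(round(p_nonbinary(1, 0.9999), 3))
```

The number of norms in the larger network is not stated above, so `n_per_gender` is left as a parameter rather than fixed to reproduce the 24% figure.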
That’s as far as I have got for now. Interim conclusion then:
It is possible to view norm-relevancy through probabilities and as a sort of Bayesian self-identity decoding process.
When there is a small number of norms and (say) 80% chance of a norm being relevant for a particular binary gender, the prior probability of non-binary has a big impact on whether someone decodes their gender that way.
As the number of norms increases, it is easier to infer non-binary as a possibility.
Additionally, if there are only a few norms, but the probability that they apply for men and women is very high, then seeing them as all relevant or irrelevant is strong evidence for non-binary.
So, you have pledged allegiance to the big four critical realist axioms (Archer et al., 2016) – what next?
Here are some ideas.
1. Ontological realism
What is it? There is a social and material world existing independently of people’s speech acts. “Reality is real.” One way to think about this slogan in relation to social kinds like laws and identities is they have a causal impact on our lives (Dembroff, 2018). Saying that reality is real does not mean that reality is fixed. For example, we can eat chocolate (which changes it and us) and change laws.
What to do? Throw radical social constructionism in the bin. Start with a theory that applies to your particular topic and provides ideas for entities and activities to use and possibly challenge in your own theorising.
Those “entities” (what a cold word) may be people with desires, beliefs, and opportunities (or lack thereof) who do things in the world like going for walks, shopping, cleaning, working, and talking to each other (Hedström, 2005). The entities may be psychological “constructs” like kinds of memory and cognitive control and activities like updating and inhibiting prepotent responses. The entities might be laws and activities carried out by the criminal justice system and campaigners. However you decide to theorise reality, you need something.
2. Epistemic relativity
What is it? The underdetermination of theories means that two theorists can make a compelling case for two different accounts of the same evidence. Their (e.g., political, moral) standpoint and various biases will influence what they can theorise. Quantitative researchers are appealing to epistemic relativity when they cite George Box’s “All models are wrong” and note the variety of models that can be fit to a dataset.
What to do? Throw radical positivism in the bin – even if you are running RCTs. Ensure that you foreground your values whether through statements of conflicts of interest or more reflexive articulations of likely bias and prejudice. Preregistering study plans also seems relevant here.
3. Judgemental/judgmental rationality
What is it? Even though theories are underdetermined by evidence, there often are reasons to prefer one theory over another.
What to do? If predictive accuracy does not help choose a theory, you could also compare them in terms of how consistent they are with themselves and other relevant theories; how broad in scope they are; whether they actually bring some semblance of order to the phenomena being theorised; and whether they make novel predictions beyond current observations (Kuhn, 1977).
You might consider the aims of critical theory which proposes judging theories in terms of how well they help eliminate injustice in the world (Fraser, 1985). But you would have to take a political stance.
4. Ethical naturalism
What is it? Although is does not imply ought, prior ought plus is does imply posterior ought.
What to do? Back to articulating your values. In medical research the following argument form is common (if often implicit): We should prevent people from dying; a systematic review has shown that this treatment prevents people from dying; therefore we should roll out this treatment. We could say something similar for social research that is anti-racist, feminist, LGBTQI+, intersections thereof, and other research. But if your research makes a recommendation for political change, it must also foreground the prior values that enabled that recommendation to be inferred.
The big four critical realist axioms provide a handy but broad metaphysical and moral framework for getting out of bed in the morning and continuing to do social research. Now we are presented with further challenges that depend on grappling with substantive theory and specific political and moral values. Good luck.
Archer, M., Decoteau, C., Gorski, P. S., Little, D., Porpora, D., Rutzou, T., Smith, C., Steinmetz, G., & Vandenberghe, F. (2016). What is Critical Realism? Perspectives: Newsletter of the American Sociological Association Theory Section, 38(2), 4–9.
It is a cliché that randomised controlled trials (RCTs) are the gold standard if you want to evaluate a social policy or intervention and quasi-experimental designs (QEDs) are presumably the silver standard. But often it is not possible to use either, especially for complex policies. Theory-Based Evaluation is an alternative that has been around for a few decades, but what exactly is it?
In this post I will sketch out what some key texts say about Theory-Based Evaluation; explore one approach, contribution analysis; and conclude with discussion of an approach to assessing evidence in contribution analyses (and a range of other approaches) using Bayes’ rule.
For what it’s worth, I also propose dropping the category of “Theory-Based Evaluation”, but that’s a longer-term project…
Let’s get the obvious out of the way. All research, evaluation included, is “theory-based” by necessity, even if an RCT is involved. Outcome measures and interviews alone cannot tell us what is going on; some sort of theory (or story, account, narrative, …) – however flimsy or implicit – is needed to design an evaluation and interpret what the data means.
If you are evaluating a psychological therapy, then you probably assume that attending sessions exposes therapy clients to something that is likely to be helpful. You might make assumptions about the importance of the therapeutic relationship to clients’ openness, of any homework activities carried out between sessions, etc. RCTs can include statistical mediation tests to determine whether the various things that happen in therapy actually explain any difference in outcome between a therapy and comparison group (e.g., Freeman et al., 2015).
It is great if a theory makes accurate predictions, but theories are underdetermined by evidence, so this cannot be the only criterion for preferring one theory’s explanation over another (Stanford, 2017) – again, even if you have an effect size from an RCT. Lots of theories will be compatible with any RCT’s results. To see this, try a particular social science RCT and think hard about what might be going on in the intervention group beyond what the intervention developers have explicitly intended.
In addition to accuracy, Kuhn (1977) suggests that a good theory should be consistent with itself and other relevant theories; have broad scope; bring “order to phenomena that in its absence would be individually isolated”; and it should produce novel predictions beyond current observations. There are no obvious formal tests for these properties, especially where theories are expressed in ordinary language and box-and-arrow diagrams.
Theory-Based Evaluation (title case)
Theory-Based Evaluation is a particular genre of evaluation that includes realist evaluation and contribution analysis. According to the UK government’s Magenta Book (HM Treasury, 2020, p. 43), Theory-Based methods of evaluation
“can be used to investigate net impacts by exploring the causal chains thought to bring about change by an intervention. However, they do not provide precise estimates of effect sizes.”
The Magenta Book acknowledges (p. 43) that “All evaluation methods can be considered and used as part of a [Theory-Based] approach”; however, Figure 3.1 (p. 47) is clear. If you can “compare groups affected and not affected by the intervention”, you should go for experiments or quasi-experiments; otherwise, Theory-Based methods are required.
Theory-Based Evaluation attempts to draw causal conclusions about a programme’s effectiveness in the absence of any comparison group. If a quasi-experimental design (QED) or randomised controlled trial (RCT) were added to an evaluation, it would cease to be Theory-Based Evaluation, as the title case term is used.
Example: Contribution analysis
Contribution analysis is an approach to Theory-Based Evaluation developed by John Mayne (28 November 1943 – 18 December 2020). Mayne was originally concerned with how to use monitoring data to decide whether social programmes actually worked when quasi-experimental approaches were not feasible (Mayne, 2001), but the approach evolved to have broader scope.
According to a recent summary (Mayne, 2019), contribution analysis consists of six steps (and an optional loop):
Step 1: Set out the specific cause-effect questions to be addressed.
Step 2: Develop robust theories of change for the intervention and its pathways.
Step 3: Gather the existing evidence on the components of the theory of change model of causality: (i) the results achieved and (ii) the causal link assumptions realized.
Step 4: Assemble and assess the resulting contribution claim, and the challenges to it.
Step 5: Seek out additional evidence to strengthen the contribution claim.
Step 6: Revise and strengthen the contribution claim.
Step 7: Return to Step 4 if necessary.
Here is a diagrammatic depiction of the kind of theory of change (or Theory of Change?) that could be plugged in at Step 2 (Mayne, 2015, p. 132), which illustrates the cause-effect links an evaluation would aim to evaluate. (Note the heteronormative and marital assumptions.)
In this example, mothers are thought to learn from training sessions and materials, which then persuades them to adopt new feeding practices. This leads to children having more nutritious diets. The theory is surrounded by various contextual factors such as food prices. (See also Mayne, 2017, for a version of this that includes ideas from the COM-B model of behaviour.)
Step 4 requires analysts to “Assemble and assess the resulting contribution claim”. How are we to carry out that assessment? Mayne (2001, p. 14) suggests some questions to ask:
“How credible is the story? Do reasonable people agree with the story? Does the pattern of results observed validate the results chain? Where are the main weaknesses in the story?”
For me, the most credible stories would include experimental or quasi-experimental tests, with mediation analysis of key hypothesised mechanisms, and qualitative detective work to get a sense of what’s going on beyond the statistical associations. But the quant part of that would lift us out of the Theory-Based Evaluation wing of the Magenta Book flowchart. In general, plausibility will be determined outside contribution analysis in, e.g., quality criteria for whatever methods for data collection and analysis were used.
Although contribution analysis is intended to fill a gap where no comparison group is available, Mayne (2001, p. 18) suggests that further data might be collected to help rule out alternative explanations of outcomes, e.g., from surveys, field visits, or focus groups. He also suggests reviewing relevant meta-analyses, which could (I presume) include QED and RCT evidence.
It is not clear to me what the underlying theory of causation is in contribution analysis. It is clear what it is not (Mayne, 2019, pp. 173–4):
“In many situations a counterfactual perspective on causality—which is the traditional evaluation perspective—is unlikely to be useful; experimental designs are often neither feasible nor practical…”
“[Contribution analysis] uses a stepwise (generative) not a counterfactual approach to causality.”
(We will explore counterfactuals below.) I can guess what this generative approach could be, but Mayne does not provide precise definitions. It clearly isn’t the idea from generative social science in which collections of computational “agents”, representing individual people, are simulated to model how (macro-level) social phenomena emerge from (micro-level) interactions between people (Epstein, 1999).
One way to think about it might be in terms of mechanisms: “entities and activities organized in such a way that they are responsible for the phenomenon” (Illari & Williamson, 2011, p. 120). We could make this precise by modelling the mechanisms using causal Bayesian networks such that variables (nodes in a network) represent the probability of activities occurring, conditional on temporally earlier activities having occurred – basically, a chain of probabilistic if-thens.
Why do people get vaccinated for Covid-19? Here is the beginning of a (generative?) if-then theory:
If you learned about vaccines in school and believed what you learned and are exposed to an advert for Covid-19 jab and are invited by text message to book an appointment for one, then (with a certain probability) you use your phone to book an appointment.
If you have booked an appointment, then (with a certain probability) you travel to the vaccine centre in time to attend the appointment.
If you attend the appointment, then (with a certain probability) you are asked to join a queue.
… and so on …
In a picture:
This does not explain how or why the various entities (people, phones, etc.) and activities (doing stuff like getting the bus as a result of beliefs and desires) are organised as they are, just the temporal order in which they are organised and dependencies between them. Maybe this suffices. “Explanations come to an end somewhere…”
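The if-then chain above can be sketched in a few lines of code. This is only an illustration: the conditional probabilities are made up rather than estimated from data, and I assume that each step depends only on the previous step having occurred, which is the simplest kind of causal Bayesian network (a chain). The step names and the function `chain_probabilities` are my own hypothetical labels.

```python
# Hypothetical steps and conditional probabilities, for illustration only.
# Each value is P(step occurs | previous step occurred).
steps = [
    ("books appointment", 0.6),
    ("travels to vaccine centre in time", 0.9),
    ("is asked to join a queue", 0.95),
]

def chain_probabilities(steps):
    """Return P(reaching each step) for a simple chain of if-thens."""
    reached = []
    p = 1.0  # assumes the starting conditions (advert seen, text received, ...) hold
    for name, p_given_previous in steps:
        p *= p_given_previous  # chain rule along the sequence
        reached.append((name, p))
    return reached

for name, p in chain_probabilities(steps):
    print(f"P({name}) = {p:.3f}")
```

Because the probabilities multiply, even a chain of individually likely steps can leave a fairly modest probability of reaching the end, which is one reason to be explicit about each link.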
What are counterfactual approaches?
Counterfactual impact evaluation usually refers to quantitative approaches to estimate average differences as understood in a potential outcomes framework (or generalisations thereof). The key counterfactual is something like:
“If the beneficiaries had not taken part in programme activities, then they would not have had the outcomes they realised.”
Logicians have long worried how to determine the truth of counterfactuals, “if A had been true, B.” One approach, due to Stalnaker (1968), proposes that you:
Start with a model representing your beliefs about the factual situation where A is false. This model must have enough structure so that tweaking it could lead to different conclusions (causal Bayesian networks have been proposed; Pearl, 2013).
Add A to your belief model.
Modify the belief model in a minimal way to remove contradictions introduced by adding A.
Determine the truth of B in that revised belief model.
This broader conception of counterfactual seems compatible with any kind of evaluation, contribution analysis included. White (2010, p. 157) offered a helpful intervention, using the example of a pre-post design where the same outcome measure is used before and after an intervention:
“… having no comparison group is not the same as having no counterfactual. There is a very simple counterfactual: what would [the outcomes] have been in the absence of the intervention? The counterfactual is that it would have remained […] the same as before the intervention.”
The counterfactual is untested and could be false – regression to the mean would scupper it in many cases. But it can be stated and used in an evaluation. I think Stalnaker’s approach is a handy mental trick for thinking through the implications of evidence and producing alternative explanations.
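To make the pre-post counterfactual concrete, here is a minimal sketch of how the implied effect depends entirely on what you assume would have happened without the intervention. The scores, the function name `implied_effect`, and the size of the assumed drift are all hypothetical.

```python
def implied_effect(before, after, counterfactual_change=0.0):
    """Effect implied by a pre-post design under an assumed counterfactual.

    counterfactual_change is the change we assume would have happened
    without the intervention (0.0 means "remained the same as before").
    """
    counterfactual_after = before + counterfactual_change
    return after - counterfactual_after

# Hypothetical scores on some outcome measure, for illustration only.
before, after = 20.0, 14.0

# Assumption 1: without the intervention, nothing would have changed.
print(implied_effect(before, after))        # -6.0

# Assumption 2: scores would have dropped by 4 points anyway
# (e.g., through regression to the mean).
print(implied_effect(before, after, -4.0))  # -2.0
```

The data are identical in both cases; only the untested counterfactual assumption differs, and it changes the conclusion substantially.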
Cook (2000) offers seven reasons why Theory-Based Evaluation cannot “provide the valid conclusions about a program’s causal effects that have been promised.” I think from those seven, two are key: (i) it is usually too difficult to produce a theory of change that is comprehensive enough for the task and (ii) the counterfactual remains theoretical – in the arm-chair, untested sense of theoretical – so it is too difficult to judge what would have happened in the absence of the programme being evaluated. Instead, Cook proposes including more theory in comparison group evaluations.
Bayesian contribution tracing
Contribution analysis has been supplemented with a Bayesian variant of process tracing (Befani & Mayne, 2014; Befani & Stedman-Bryce, 2017; see also Fairfield & Charman, 2017, for a clear introduction to Bayesian process tracing more generally).
The idea is that you produce (often subjective) probabilities of observing particular (usually qualitative) evidence under your hypothesised causal mechanism and under one or more alternative hypotheses. These probabilities and prior probabilities for your competing hypotheses can then be plugged into Bayes’ rule when evidence is observed.
Suppose you have two competing hypotheses: a particular programme led to change versus pre-existing systems. You may begin by assigning them equal probability, 0.5 and 0.5. If relevant evidence is observed, then Bayes’ rule will shift the probabilities so that one becomes more probable than the other.
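Here is a minimal sketch of that update. The priors of 0.5 come from the example above; the probabilities of observing the evidence under each hypothesis (0.8 and 0.2) are made-up values standing in for whatever would be elicited in a real evaluation, and `bayes_update` is my own illustrative function name.

```python
def bayes_update(priors, likelihoods):
    """Posterior P(H | E) for each competing hypothesis H.

    priors: dict mapping hypothesis name -> P(H)
    likelihoods: dict mapping hypothesis name -> P(E | H)
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())  # P(E), summing over the hypotheses
    if total == 0:
        raise ValueError("Evidence has zero probability under every hypothesis.")
    return {h: joint[h] / total for h in joint}

priors = {"programme": 0.5, "pre-existing systems": 0.5}
# Hypothetical elicited probabilities of observing the evidence.
likelihoods = {"programme": 0.8, "pre-existing systems": 0.2}

posterior = bayes_update(priors, likelihoods)
for hypothesis, p in posterior.items():
    print(f"P({hypothesis} | evidence) = {p:.2f}")
```

With these illustrative numbers the programme hypothesis moves from 0.5 to about 0.8, since the evidence is four times more probable under it than under the alternative.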
Process tracers often cite Van Evera’s (1997) tests such as the hoop test and smoking gun. I find definitions of these challenging to remember so one thing I like about the Bayesian approach is that you can think instead of specificity and sensitivity of evidence, by analogy with (e.g., medical) diagnostic tests. A good test of a causal mechanism is sensitive, in the sense that there is a high probability of observing the relevant evidence if your causal theory is accurate. A good test is also specific, meaning that the evidence is unlikely to be observed if any alternative theory is true. See below for a table (lightly edited from Befani & Mayne, 2014, p. 24) showing the conditional probabilities of evidence for each of Van Evera’s tests given a hypothesis and alternative explanation.
Van Evera test if Eᵢ is observed    P(Eᵢ | Hyp)    P(Eᵢ | Alt)
Fails hoop test                     Low            –
Passes smoking gun                  –              Low
Let’s take the hoop test. This applies to evidence which would be unlikely if your preferred hypothesis were true. So if you observe that evidence, the hoop test fails. The test is agnostic about the probability under the alternative hypothesis. Straw-in-the-wind is hopeless for distinguishing between your two hypotheses, but could suggest that neither holds if the test fails. The doubly decisive test has high sensitivity and high specificity, so provides strong evidence for your hypothesis if it passes.
The arithmetic is straightforward if you stick to discrete multinomial variables and use software for conditional independence networks. Eliciting the subjective probabilities for each source of evidence, conditional on each hypothesis, may be less straightforward.
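For the simplest case of one hypothesis versus one alternative, the arithmetic can be sketched by multiplying prior odds by a likelihood ratio for each piece of evidence. This assumes the pieces of evidence are conditionally independent given each hypothesis, which is a strong assumption that network software lets you relax. The probabilities below are made up, and `posterior_probability` is a hypothetical helper rather than part of any package.

```python
def posterior_probability(prior, evidence):
    """P(Hyp | all evidence) for a hypothesis versus a single alternative.

    prior: P(Hyp) before seeing any evidence (must be below 1)
    evidence: list of (P(E | Hyp), P(E | Alt)) pairs, one per observed item,
              assumed conditionally independent given each hypothesis;
              each P(E | Alt) must be above 0
    """
    odds = prior / (1 - prior)
    for p_e_hyp, p_e_alt in evidence:
        odds *= p_e_hyp / p_e_alt  # likelihood ratio for this item
    return odds / (1 + odds)

# Hypothetical elicited probabilities, for illustration only.
observed = [
    (0.9, 0.3),   # sensitive but not very specific
    (0.4, 0.05),  # not very sensitive but highly specific
]
print(round(posterior_probability(0.5, observed), 3))
```

Note how the second item, despite being observed less than half the time when the hypothesis is true, shifts the odds more than the first because it is so rarely seen under the alternative.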
I am with Cook (2000) and others who favour a broader conception of “theory-based” and suggest that better theories should be tested in quantitative comparison studies. However, it is clear that it is not always possible to find a comparison group – colleagues and I have had to make do without (e.g., Fugard et al., 2015). Using Theory-Based Evaluation in practice reminds me of jury service: a team are guided through thick folders of evidence, revisiting several key sections that are particularly relevant, and work hard to reach the best conclusion they can with what they know. There is no convenient effect size to consult, just a shared (to some extent) and informal idea of what intuitively feels more or less plausible (and lengthy discussion where there is disagreement). To my mind, when quantitative comparison approaches are not possible, Bayesian approaches to assessing qualitative evidence are the most compelling way to synthesise qualitative evidence of causal impact and make transparent how this synthesis was done.
Finally, it seems to me that the Theory-Based Evaluation category is poorly named. A better name might be Assumption-Based Counterfactual approaches. Then RCTs and QEDs are Comparison-Group Counterfactual approaches. Both are types of theory-based evaluation and both use counterfactuals; it’s just that approaches using comparison groups gather quantitative evidence to test the counterfactual. However, the term doesn’t quite work since RCTs and QEDs rely on assumptions too… Further theorising needed.
“There’s something incredibly powerful – revolutionary, even – about challenging someone’s understanding of gender with your very existence.”
According to dominant ideas in “the West”, your gender ultimately reduces to whether you have XX or XY chromosomes, as inferred by inspecting your genitals at birth, and there are only two possibilities: woman or man. Yes, you will occasionally hear how sex is biological and gender is social, but under the dominant norms, (specifically chromosomal) sex and gender categories are defined to align.
The existence of transgender (trans) people challenges this chromosomal definition, since their gender differs from the male/female sex category assigned at birth. People whose gender is under the non-binary umbrella challenge the man/woman binary since they are neither, both, or fluctuate between the two.
It is tempting for researchers to ignore these complexities since most people are cisgender (cis for short), that is, their gender aligns with their sex category at birth, and they are either a woman or a man. As the male/female demographic tickboxes illustrate, many do ignore the complexity.
A few years ago, analytic philosophers, having for centuries pondered questions such as “what can be known?” and “is reality real?”, discovered that theorising gender offered intellectual challenges too and could be used to support human rights activism. Although plenty of writers have pondered gender, this corner of philosophy offers clear definitions, so is perhaps easier to understand and critique than other approaches. I think it is also more compatible with applied social research.
One of the politically-aware analytical philosophers who caught my eye, Robin Dembroff, recently published a paper analysing what it means to be genderqueer. Let’s sketch out how the analysis goes.
“… the gendeRevolution has begun, and we’re going to win.”
Genderqueer originally referred to all gender outliers – whether cis, trans, or other. Its meaning has shifted to overlap with non-binary gender and trans identities as per the Venn flags below.
Both genderqueer and non-binary have become umbrella terms with similar meaning; however, genderqueer carries a more radical connotation – especially since it includes the reclaimed slur “queer” – whereas non-binary is more neutral and descriptive, even appearing in HR departments’ IT systems.
The data on how many people are genderqueer thus far is poor – hopefully the 2021 census in England and Wales will improve matters. In the meantime, a 2015 UK convenience sample survey of non-binary people (broadly defined) found that 63% identified as non-binary, 45% as genderqueer, and 65% considered themselves to be trans. The frequency of combinations was not reported.
This year’s international (and also convenience sample) survey of people who are neither men nor women “always, solely and completely” found a small age effect: people over 30 were eight percentage points more likely to identify as genderqueer than younger people.
Externalist versus internalist
Dembroff opens with a critique of two broad categories of theories of what gender is: externalist (or social position) theories and internalist (or psychological identity) theories.
Externalist theories define gender in terms of how someone is perceived by others and advantaged or disadvantaged as a result. So, someone would be genderqueer if they are perceived and treated as neither a man nor a woman. However, this doesn’t work for genderqueer people, Dembroff argues, since they tend to reject the idea that particular gender expressions are necessary to be genderqueer; “we don’t owe you androgyny” is a well-known slogan. Also, many cis people do not present neatly as male or female – that does not mean they are genderqueer.
One of the internalist accounts Dembroff considers, by Katherine Jenkins, defines gender in terms of what gender norms someone feels are relevant to them – e.g., how they should dress, behave, what toilets they may use – regardless of whether they actually comply with (or actively resist) those norms. Norm relevancy requires that genderqueer people feel that neither male nor female norms are relevant. This is easiest to see with binary gendered toilets – neither the trouser nor skirt-logoed room is safe for a genderqueer person. However, it is unlikely that none of the norms would be felt as relevant. So the norm-relevancy account, Dembroff argues, would exclude many genderqueer people too.
Critical gender kinds
Dembroff’s proposed solution combines social and psychological understandings of gender. They introduce the idea of a critical gender kind and offer genderqueer as an example. A kind, in this sense, is roughly a collection of phenomena defined by one or more properties. (For a longer answer, try this on social kinds by Ásta.) Not to be confused with gender-critical feminism.
A gender is a critical gender kind, relative to a given society, if and only if people who are that gender “collectively destabilize one or more core elements of the dominant gender ideology in that society”. The genderqueer kind destabilises the binary assumption that there are only two genders. Dembroff emphasises the collective nature of genderqueer; as a kind it is not reducible to any individual’s characteristics and not every genderqueer person need successfully destabilise the binary norm. An uncritical gender kind is then one which perpetuates dominant norms such as the chromosomal and genital idea of gender outlined above.
Another key ingredient is the distinction between principled and existential destabilising – roughly, whether you are personally oppressed in a society with particular enforced norms. Someone who is happy to support and use all-gender toilets through (principled) solidarity with genderqueer people has a different experience to someone who is genderqueer and feels unsafe in a binary gendered toilet.
In summary, genderqueer people collectively and existentially destabilise the binary norm. Some of the many ways they do this include: using they/them or neopronouns, through gender expression that challenges dominant norms, asserting that they are genderqueer, challenging gender roles in sexual relationships, and switching between male and female coded spaces.
Although Dembroff challenges Jenkins’ norm-relevancy account, to me the general idea of tuning into gender norms is helpful for decoding your gender, and neatly complements Dembroff’s account. Maybe a trick is to add, and view as irrelevant, norms like “your genitals determine your gender” rather than only male and female norms. Additionally, adding probabilities rather than using binary true/false classical logic seems helpful to revise the account too. The externalist accounts are also relevant since they map out some ways that genderqueer people resist binary norms and dominant ways that (especially cis) people perceive and treat others.