Why is evaluation so white?

Useful resources to explore (work in progress):

‘I’m sometimes asked, “Why are there so few people of color in evaluation?” I flip the question: “Why is evaluation so white?” And answer: “Because our labor is actively erased.”’

Further resources, e.g.,

It’s striking how issues discussed in the 70s are still relevant now, e.g., concerning the impossibility of using IQ tests (and covert proxies thereof) to improve outcomes rather than simply to blame a child and excuse education systems for poor outcomes.

‘At its core, evaluation is value laden and embued with and responsive to a larger social political order, and evaluators are situated within contexts of study and within interactions of the setting that shape the evaluation study’s logic, structure, and practices (Hopson, Greene, Bledsoe, Villegas, & Brown, 2007). The question of “Who evaluates and why?” highlights the contexts, agendas, and intentions of the evaluation and the evaluator and so raises questions about practices—sometimes commonly accepted ones—and the structures of power and the uses of those power structures for or against hegemony.’ [p. 418]

This cites The oral history of evaluation, part 3: The professional evolution of Michael Scriven, which provides a clue – hiding in plain sight – to why the official history of evaluation is as it is:

“Now, there was the May 12th group, which was ahead of the game. The May 12th group was so called after the first date on which they met [1968, says Glass –AF]—but the general feeling was if we call it the May 12th group, that will have absolutely zero cachet, and so no one will be able to argue that they were entitled to join the May 12th group because it’s called something generic. And so the idea was you got invited to the May 12th group, and if you weren’t invited, then you weren’t in, and so there was no official stuff. So, they would meet in somebody’s house once a year. […] But some of us felt that we needed to do something that was slightly more official, and we’d got to start making this more than the intellectual elite group.”

The May 13 Group formed on that date in 2020 to challenge this.

‘Evaluation is political. At its simplest, evaluation is the systematic “process of determining the merit, worth and value of things” (Scriven, 1991, p. 1). Who gets to decide, the questions, the process, and the criteria for determining merit, worth, value, or significance—all of these matter.’ [p. 534]

‘As professionals and practitioners, we can no longer sit on the sidelines wearing the cape of objectivity and neutrality, a cape that shields beliefs and assumptions about knowledge, rigor, and evidence and which elevate a Western White worldview. [..] Everyday narratives that continue to marginalize, minimize, and disrespect people of color and those with less privilege could be replaced with ones that do not demonize and place blame on the individual. They could instead lift up the historical, contextual, and powerful dynamics that create and sustain oppression and shed light on the strategies and solutions which can shift the “rules of the game,” so that equity is achievable.’ [p. 538]

“Advisors of evaluation graduate students of colour should create spaces for students to express their feelings and, if they choose, be vulnerable and open about the stressors of simply being a person of colour in a world with white supremacy woven into its very fabric.”

“Whenever a prospective student emails me, I put them in touch with current students in my department. I find this is especially important for international students; I am unable to speak to how the culture in North Carolina and in our department differs from their home culture. I also aim to introduce students to faculty across campus who have similar cultures and backgrounds”

“Advisors of evaluation graduate students of colour can research or have conversations about the norms and dates associated with the holidays and events that their students observe. […] While I can’t know all the traditions observed by my students, I encourage them to inform me about their cultural and religious traditions as appropriate.”

“… advisors and mentors should also practice giving microvalidations […], small acts and words that validate who graduate students believe they can be. My post-doctoral advisor always praised me in public and raised concerns in private. I regularly let my advisees know that I am proud of them, see their potential, and believe in them. I learn every student’s name and work to pronounce their names correctly. And I make a concerted effort to refer to my advisees as my colleagues.”

Here’s a summary table of examples:

| Tired Narrative | Potential New Narrative | The Difference |
|---|---|---|
| We should have more people of color on staff. | The evaluation field needs to connect to and invite in talent coming from a broader range of lived experiences and expertise types to be relevant and useful. | A singular focus on ethnic diversity promotes tokenism and an “unstated standard of whiteness.” New narratives should explicitly acknowledge how diversity of experience and expertise strengthens rigor and contributes to better evaluation. |
| Diverse applicants don’t meet our standard qualifications. | Implicit bias and white-dominant norms constrain our ability to recognize valuable expertise and support talented people. | Perceptions of knowledge, experience, and credibility are culturally based and tied to the establishment of cultural hegemony. It is important to recognize – and actively mitigate – how this plays out in workplaces and the evaluation profession overall. |
| If we supply individuals with knowledge and networks, they will be successful in our field. | We need to remove structures and norms that prevent the flourishing of valuable expertise and talented people across our field. | Individuals cannot be successful if the ecosystem of organizations in which they practice evaluation are inequitable and non-inclusive. We can do a better job breaking down white-dominant norms and creating the conditions that allow talented professionals to thrive. |
| We need experts to guide us in diversifying evaluation talent. | We need to do a better job listening to people whose insights and talent have been historically marginalized within evaluation and philanthropy. | Narrowly defined ideas about expertise and who is considered an expert aren’t getting us where we need to go. We must expand our conceptions of expertise and do a much better job of listening to and learning from people whose voices have been left out of conversations about talent in our field. |
| If individual organizations improve their hiring and management practices, the evaluation field will make progress. | Making progress will only happen if we prioritize equity and inclusion across the evaluation ecosystem and work collectively on solutions. | Focusing only on the organizational level prevents us from seeing and addressing larger narratives and systems at play. We need to prioritize this as a field and invest in collective solutions that help shift our outdated narratives. |

“… evaluators of color noted that the burden of addressing DEI and calling out racism is often placed on them as they are assumed to be experts…”

“… evaluators of color cited examples of being tapped to join an evaluation project when philanthropic clients asked for demographics of staff in their RFPs, yet not feeling meaningfully included in the subsequent work…”

“When organizations have difficulty retaining staff of color, they often perceive the person of color as the problem, not the ecosystem that reinforces inequities. Persistent challenges with retention should signal a need for the organization to self-reflect on its culture and make changes…”

“I have been in too many meetings where a racialised person has felt they’ve had to speak about their lived experience, at great personal cost […]. Sometimes, the individual’s point is directly challenged or downplayed. In a head-spinning moment of gaslighting, they are left isolated and disbelieved, despite (or, perhaps, because) they are the racialised person specifically invited to the meeting to explain why the racist thing is racist.”

Kharkiv, statistics, and causal inference

As news comes in (14 May 2022) that Ukraine has won the battle of Kharkiv* and Russian troops are withdrawing, it may be of interest to know that a major figure in statistics and causal inference, Jerzy Neyman (1894-1981), trained as a mathematician there 1912-16. If you have ever used a confidence interval or conceptualised causal inference in terms of potential outcomes, then you owe him a debt of gratitude.

“[Neyman] was educated as a mathematician at the University of Kharkov*, 1912-16. After this he became a Lecturer at the Kharkov Institute of Technology with the title of Candidate. When speaking of these years he always stressed his debt to Sergei Bernstein, and his friendship with Otto Struve (later to meet him again in Berkeley). His thesis was entitled ‘Integral of Lebesgue’.” (Kendall et al., 1982)

* Харків (transliterated to Kharkiv) in Ukrainian, Харькoв (transliterated to Kharkov) in Russian.

On a relevance criterion

Logicians study logics – plural. There are different logics for different reasoning tasks. Classical logic, the flavour taught to undergraduate students of all persuasions, falls apart when confronted with the kinds of reasoning that people do effortlessly every day. My favourite way to break classical logic involves an innocent “if” and “or”.

Ponder the following sentence (based on an example by Alf Ross, 1944):

If Alex posted the letter [P], then Alex posted the letter [P] or Alex set fire to the letter [F].

If you think this sentence is true, then your interpretation and reasoning are compatible with translating it into classical logic using the material conditional (\(\Rightarrow\)) for the “if” and inclusive disjunction (\(\lor\)) for the “or”. You could write it like this and it’s trivially true: \(P \Rightarrow (P \lor F)\).

Some people are perfectly content with this interpretation, but many think the sentence is fishy and false. There are a number of ways to explain what has happened.

One is to assume that the issue is language pragmatics rather than logic. Pragmatics studies the ways in which context and social conventions for communication affect people’s interpretation of language. According to one theory of communication (see Liza Verhoeven’s 2007 explanation), asserting that you posted the letter or burned it under the assumption that you posted it violates principles of cooperativeness. These principles affect the meaning of a sentence and its truth, so in this case the sentence is false.

Another way to make sense of what has gone wrong is using a relevance criterion devised by Gerhard Schurz (1991). The first step we need to take is to transform the “if” into an argument with a single premise and conclusion.

Premise: Alex posted the letter [P].
Conclusion: Alex posted the letter [P] or Alex set fire to the letter [F].

This is an uncontroversial step in classical logic, e.g., application of a rule for introducing an “if” in natural deduction.

Schurz introduces a criterion for conclusion relevance that roughly goes as follows. The starting point is an argument that is valid according to classical logic. That’s the case for the argument above. If there are any terms in the conclusion that can be substituted with arbitrary alternatives without affecting the argument’s validity, then the conclusion is irrelevant. Otherwise the conclusion is relevant.

For our letter example, we can replace “Alex set fire to the letter” with anything and it has no effect on the validity of the argument. Alex opened the letter. Alex scribbled on the letter. Alex swallowed the letter. The letter was a surrealist painting. The letter was the size of a house. And so on. No substitution in the second half of the conclusion can affect the validity of the argument, so the conclusion is irrelevant.

How about an argument where the conclusion is relevant? The trick is to ensure that everything in the conclusion is… relevant. That’s what I like about the criterion: it formalises (and the details are fiddly) an intuitive property of arguments. Here’s an easy example:

Premise: It’s raining and I left my umbrella at home
Conclusion: I left my umbrella at home and it’s raining

This is an example of the conjunction, “and”, being commutative in classical logic: the order of the conjuncts in the sentence (the parts on either side of “and”) doesn’t affect its truth. There are many ways to edit the conclusion so that the argument is no longer valid. For instance replace one or both of the conjuncts with “I posted a letter”. Then the conclusion doesn’t follow from the premise since the premise doesn’t tell us anything about a letter.
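The two examples can be checked mechanically. What follows is a minimal sketch of a simplified, propositional version of the criterion, not Schurz’s full definition: formulas are nested tuples, validity is tested by brute-force truth table, and a conclusion counts as irrelevant if some single variable occurrence in it can be swapped for a fresh variable without breaking validity (a fresh variable stands in for “anything”). The tuple encoding, the function names, and the reserved name `_fresh` are all assumptions made for illustration.

```python
from itertools import product

# Formulas are nested tuples:
#   ("var", name), ("not", f), ("and", f, g), ("or", f, g)

def var(name):
    return ("var", name)

def variables(f):
    """Set of variable names occurring in formula f."""
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(sub) for sub in f[1:]))

def evaluate(f, valuation):
    """Classical truth value of f under a dict mapping names to booleans."""
    op = f[0]
    if op == "var":
        return valuation[f[1]]
    if op == "not":
        return not evaluate(f[1], valuation)
    if op == "and":
        return evaluate(f[1], valuation) and evaluate(f[2], valuation)
    if op == "or":
        return evaluate(f[1], valuation) or evaluate(f[2], valuation)
    raise ValueError(f"unknown connective: {op}")

def is_valid(premise, conclusion):
    """True if no valuation makes the premise true and the conclusion false."""
    names = sorted(variables(premise) | variables(conclusion))
    for values in product([False, True], repeat=len(names)):
        valuation = dict(zip(names, values))
        if evaluate(premise, valuation) and not evaluate(conclusion, valuation):
            return False
    return True

def single_replacements(f, fresh):
    """Yield copies of f with exactly one variable occurrence replaced by fresh."""
    if f[0] == "var":
        yield fresh
        return
    for i in range(1, len(f)):
        for replaced in single_replacements(f[i], fresh):
            yield f[:i] + (replaced,) + f[i + 1:]

def conclusion_is_relevant(premise, conclusion):
    """Simplified relevance check for a classically valid argument."""
    if not is_valid(premise, conclusion):
        raise ValueError("the criterion applies only to classically valid arguments")
    fresh = var("_fresh")  # assumed not to occur in the argument itself
    return not any(
        is_valid(premise, candidate)
        for candidate in single_replacements(conclusion, fresh)
    )

P, F = var("P"), var("F")          # posted the letter, set fire to the letter
R, U = var("R"), var("U")          # it's raining, left my umbrella at home

print(conclusion_is_relevant(P, ("or", P, F)))               # False
print(conclusion_is_relevant(("and", R, U), ("and", U, R)))  # True
```

In the letter argument the occurrence of F can be replaced by anything and the argument stays valid, so the check reports an irrelevant conclusion; in the umbrella argument every replacement breaks validity.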

Colleagues and I explored people’s interpretations of these kinds of sentence about a decade ago in the context of an alleged paradigm shift in the psychology of reasoning. Read all about it. I was reminded of this again as Google Scholar dutifully notified me that Michał Sikorski recently cited it (thank you kindly Michał!).

The misuses of “biological sex”

‘It is long overdue that we understand sex not as an essential property of individuals but as a set of biological traits and social factors that become important only in specific contexts, such as medicine, and even then complexity persists. If we are concerned with certain cancers, for example, knowing whether someone has a prostate or ovaries is what’s important, not their “sex” per se. If reproduction is the interest, what matters is whether one produces sperm or eggs, whether one has a uterus, a vaginal opening, and so on.’

Karkazis, K. (2019, p. 1899). The misuses of “biological sex.” The Lancet, 394, 1898–1899.

Census data on trans and non-binary people in Canada

Canada published census data on trans and non-binary people on 27 April 2022. Here’s a table of the values they presented in a pie chart (why a pie chart, Canada?). Individuals in the census were aged 15 or above and living in a private household in May 2021.

| Gender | N | % |
|---|---:|---:|
| Cis man | 14,814,230 | 48.83 |
| Cis woman | 15,421,085 | 50.83 |
| Trans man | 27,905 | 0.09 |
| Trans woman | 31,555 | 0.10 |
| Non-binary | 41,355 | 0.14 |
| Total | 30,336,130 | 100.00 |


Playing with RCTs and probabilities

Suppose we run an RCT with two groups, treatment and control, and a binary outcome of whether participants recover or not.

There are two potential outcomes: recovery following treatment (\(R_t\)) and recovery following control (\(R_c\)), \(1\) if recovered and \(0\) if not recovered. Only one of these two potential outcomes is realised, depending on what group someone is assigned to. Let \(W = t\) if a participant was assigned to treatment and \(W = c\) if they were assigned to control.

Suppose, following an RCT, we learn the following (somehow with perfect precision):

\(P(R_t = 1 | W = t) = 0.8\);

\(P(R_c = 1 | W = c) = 0.3\).

Given the two probabilities above, it turns out the best we can say is that \(P(R_t = 1) \in [0, 1]\) and \(P(R_c = 1) \in [0, 1]\). So, it seems that we aren’t yet able to infer anything about the potential outcomes beyond those that were realised.

Add to our premises that participants were assigned to treatment or control by coin flip:

\(P(W = t) = P(W = c) = 0.5\).

Now \(P(R_t = 1) \in [0.4 , 0.9]\) and \(P(R_c = 1) \in [0.15 , 0.65]\). These intervals are clearly better than \([0,1]\); however, can we do better?
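
One way to see where these intervals come from is the law of total probability. As a sketch, write \(q_t = P(R_t = 1 | W = c)\) and \(q_c = P(R_c = 1 | W = t)\) for the two probabilities the trial never observes (these two symbols are introduced here for illustration; they are not part of the setup above):

\[
P(R_t = 1) = P(R_t = 1 | W = t) P(W = t) + P(R_t = 1 | W = c) P(W = c) = 0.8 \times 0.5 + q_t \times 0.5 = 0.4 + 0.5 q_t
\]

\[
P(R_c = 1) = P(R_c = 1 | W = c) P(W = c) + P(R_c = 1 | W = t) P(W = t) = 0.3 \times 0.5 + q_c \times 0.5 = 0.15 + 0.5 q_c
\]

Since nothing so far constrains \(q_t\) or \(q_c\), each can lie anywhere in \([0, 1]\), which gives \([0.4, 0.9]\) and \([0.15, 0.65]\).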

The key ingredient we need to add is that treatment assignment is independent of the potential outcomes; that is

\(P(W | R_t, R_c) = P(W)\).

Now, given all this information, we obtain point probabilities: \(P(R_t = 1) = 0.8\) and \(P(R_c = 1) = 0.3\). These are equal to the probabilities that were conditional on what group a participant was assigned to.
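
A short derivation of why this works, sketched for \(R_t\) (the case of \(R_c\) is analogous). Averaging the independence premise over \(R_c\) shows that assignment is also independent of \(R_t\) alone:

\[
P(W = t | R_t = 1) = \sum_{r \in \{0, 1\}} P(W = t | R_t = 1, R_c = r) P(R_c = r | R_t = 1) = P(W = t) \sum_{r \in \{0, 1\}} P(R_c = r | R_t = 1) = P(W = t)
\]

Bayes’ theorem then gives, assuming \(P(R_t = 1) > 0\) so that the conditioning is defined:

\[
P(R_t = 1 | W = t) = \frac{P(W = t | R_t = 1) P(R_t = 1)}{P(W = t)} = \frac{P(W = t) P(R_t = 1)}{P(W = t)} = P(R_t = 1)
\]

So the observed \(0.8\) is the marginal probability of recovery under treatment.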

Another curiosity is what we can infer about the joint distribution, \(P(R_t, R_c)\). The results are probability intervals:

| | \(R_t = 0\) | \(R_t = 1\) | Marginal |
|---|---|---|---|
| \(R_c = 0\) | \([0, 0.2]\) | \([0.5, 0.7]\) | \(0.7\) |
| \(R_c = 1\) | \([0, 0.2]\) | \([0.1, 0.3]\) | \(0.3\) |
| Marginal | \(0.2\) | \(0.8\) | |

This illustrates, in a toy example, the more general problem that the joint distribution of potential outcomes typically cannot be obtained from an RCT. However, the joint probabilities are constrained by the marginals.
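
The intervals for the joint probabilities can be reproduced with the standard Fréchet bounds for two events with known marginals. The following is a minimal sketch under that assumption; the function name `joint_bounds` and the `(R_t, R_c)` key convention are made up for illustration.

```python
def joint_bounds(p_t, p_c):
    """Bounds on P(R_t = a, R_c = b) given P(R_t = 1) = p_t and P(R_c = 1) = p_c.

    Returns a dict keyed by (a, b) with (lower, upper) tuples.
    """
    # Fréchet bounds on the probability that both potential outcomes are 1.
    lo11 = max(0.0, p_t + p_c - 1.0)
    hi11 = min(p_t, p_c)
    # The other three cells are fixed by the marginals once the (1, 1) cell
    # is chosen, so their bounds follow by subtraction.
    return {
        (1, 1): (lo11, hi11),
        (1, 0): (p_t - hi11, p_t - lo11),
        (0, 1): (p_c - hi11, p_c - lo11),
        (0, 0): (1.0 - p_t - p_c + lo11, 1.0 - p_t - p_c + hi11),
    }

bounds = joint_bounds(0.8, 0.3)
for cell, (lower, upper) in sorted(bounds.items()):
    # Round for display only; the raw values carry floating-point noise.
    print(cell, round(lower, 2), round(upper, 2))
```

Every cell is a linear function of the single free quantity \(P(R_t = 1, R_c = 1)\), which is why fixing the marginals narrows the intervals without pinning down the joint distribution.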

Wisdom(?) from the 1918 Dadaist manifesto by Tristan Tzara

The 1st and 2nd DADA Art Manifestos are online over there.

  • “Psychoanalysis is a dangerous disease, it deadens man’s anti-real inclinations and systematises the bourgeoisie.”
  • “Dialectics is an amusing machine that leads us (in banal fashion) to the opinions which we would have held in any case.”
  • “People observe, they look at things from one or several points of view, they choose them from amongst the millions that exist. Experience too is the result of chance and of individual abilities.”
  • “Logic is a complication. Logic is always false. It draws the superficial threads of concepts and words towards illusory conclusions and centres.”
  • “What we need are strong straightforward, precise works which will be forever misunderstood.”