Baseline balance in experiments and quasi-experiments

Baseline balance is important for both experiments and quasi-experiments, just not in the way researchers sometimes believe. Here are excerpts from three of my favourite discussions of the topic.

Don’t test for baseline imbalance in RCTs. Senn (1994, p. 1716):

“… the following are two incontrovertible facts about a randomized clinical trial:

1. over all randomizations the groups are balanced;

2. for a particular randomization they are unbalanced.

Now, no ‘[statistically] significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized…”

Do examine baseline imbalance in quasi-experiments; however, not by using statistical tests. Sample descriptives, such as a difference in means, suffice. Imai et al. (2008, p. 497):

“… from a theoretical perspective, balance is a characteristic of the sample, not some hypothetical population, and so, strictly speaking, hypothesis tests are irrelevant…”

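Using p-values from t-tests and similar can lead to erroneous decisions about balance. As you prune a dataset to improve balance, the sample shrinks and the tests lose power to detect any remaining imbalance, so p-values can rise even when balance has not improved. Imai et al. (2008, p. 497 again):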

“Since the values of […] hypothesis tests are affected by factors other than balance, they cannot even be counted on to be monotone functions of balance. The t-test can indicate that balance is becoming better whereas the actual balance is growing worse, staying the same or improving. Although we choose the most commonly used t-test for illustration, the same problem applies to many other test statistics…”
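To make the point concrete, here is a minimal sketch in Python (standard library only). The ages are made up, and the "pruning" is deliberately contrived: the full sample contains four copies of each value and the pruned sample keeps one, so the difference in means is identical before and after. The function names are my own, not from any of the cited papers.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled sample standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


def welch_t_statistic(treated, control):
    """Welch's t statistic for the difference in means."""
    standard_error = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return (statistics.mean(treated) - statistics.mean(control)) / standard_error


# Made-up baseline ages; each value appears four times in the full sample.
base_treated = [52, 55, 58, 61, 64, 67, 70, 73]
base_control = [50, 53, 56, 59, 62, 65, 68, 71]
full_treated, full_control = base_treated * 4, base_control * 4

# "Pruning" keeps one copy of each value, so the mean difference is unchanged.
pruned_treated, pruned_control = base_treated, base_control

smd_full = standardized_mean_difference(full_treated, full_control)
smd_pruned = standardized_mean_difference(pruned_treated, pruned_control)
t_full = welch_t_statistic(full_treated, full_control)
t_pruned = welch_t_statistic(pruned_treated, pruned_control)

print(f"SMD: full={smd_full:.2f}, pruned={smd_pruned:.2f}")
print(f"t:   full={t_full:.2f}, pruned={t_pruned:.2f}")
```

The standardized mean difference barely moves, because it describes the sample. The t statistic shrinks towards zero simply because there are fewer observations, which would push the p-value up and could be misread as improved balance.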

If your matching has led to baseline balance, then you’re good, even if the matching model is misspecified. (Though not if you’re missing key covariates, of course.) Rosenbaum (2023, p. 29):

“So far as matching and stratification are concerned, the propensity score and other methods are a means to an end, not an end in themselves. If matching for a misspecified and misestimated propensity score balances x, then that is fine. If by bad luck, the true propensity score failed to balance x, then the match is inadequate and should be improved.”


Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2), 481–502.

Rosenbaum, P. R. (2023). Propensity score. In J. R. Zubizarreta, E. A. Stuart, D. S. Small, & P. R. Rosenbaum (Eds.), Handbook of Matching and Weighting Adjustments for Causal Inference (pp. 21–38). Chapman and Hall/CRC.

Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13, 1715–1726.


I was 17 when the Good Friday Agreement was signed and I remember being shocked that there was a chance The Troubles™ might end. I hadn’t thought that was a possibility, hadn’t considered it.

Growing up “near Belfast” (as everyone seems to say), my lingering memories of the Troubles are of endless news of shootings and bombings – so many, they didn’t register until a bomb went off in my town and I heard it for myself. Just how pervasively military everything was.

An army helicopter landing in a field behind my parents’ house. Looking out the window to see a pair of armed soldiers navigating the streets by paper map, one kneeling with the map while the other stood guard, the camouflage weirdly incongruent with the quiet residential street. Army Land Rovers periodically appearing, soldiers peeping out from the roof with a gun. Checkpoints at the edge of town, with a soldier lying at the roadside pointing a gun at the car, RUC (police) officer asking questions in – what I can only describe on this website as – an unnecessarily rude way. Checkpoints maintaining a control zone around the airport, cars slowed along a speed-bumped road so number plates could be checked, questions asked. More guns. One day accidentally driving through an army operation, thinking it was a checkpoint so slowing down but being enthusiastically encouraged to accelerate. Another day, missing my usual exit from the motorway as I was enjoying singing along to a Placebo CD, driving past as balaclavad men stopped a bus at a roundabout and hijacked it.

And this was all normal and normalised, survived through brutally dark humour captured perfectly in the film Divorcing Jack.

I left Ireland in 2002 and have lived in London since 2011. It’s striking how far away the north of Ireland is here, though a flight can take you to Belfast in an hour. How unimportant it is in discussions of Brexit and its consequences. How reckless politicians have been.

Communication is probably more than 7% verbal

“Have you ever heard the adage that communication is only 7 percent verbal and 93 percent non-verbal, i.e. body language and vocal variety? You probably have, and if you have any sense at all, you have ignored it.” Philip Yaffe wades into this one. The first two pages provide a concise summary of the 1967 studies that produced this 7% figure:

Subjects were asked to listen to a recording of a woman’s voice saying the word “maybe” three different ways to convey liking, neutrality, and disliking. They were also shown photos of the woman’s face conveying the same three emotions. They were then asked to guess the emotions heard in the recorded voice, seen in the photos, and both together. The result? The subjects correctly identified the emotions 50 percent more often from the photos than from the voice.

In the second study, subjects were asked to listen to nine recorded words, three meant to convey liking (honey, dear, thanks), three to convey neutrality (maybe, really, oh), and three to convey disliking (don’t, brute, terrible). Each word was pronounced three different ways. When asked to guess the emotions being conveyed, it turned out that the subjects were more influenced by the tone of voice than by the words themselves.

The original studies behind the figure look interesting for what they actually tried to do rather than the bullshit claims that resulted.


Mehrabian, A., & Wiener, M. (1967). Decoding of inconsistent communications. Journal of Personality and Social Psychology, 6(1), 109–114.

Mehrabian, A., & Ferris, S. R. (1967). Inference of attitudes from nonverbal communication in two channels. Journal of Consulting Psychology, 31(3), 248–252.

Can you bullshit a bullshitter?

You can bullshit a bullshitter, except if they also have high cognitive ability, according to Littrell et al. (2021).

Littrell, S., Risko, E. F., & Fugelsang, J. A. (2021). ‘You can’t bullshit a bullshitter’ (or can you?): Bullshitting frequency predicts receptivity to various types of misleading information. British Journal of Social Psychology, 60(4), 1484–1505.

Special issue dedicated to John Mayne

‘I am honoured to introduce this special issue dedicated to John Mayne, a “thought leader,” “practical thinker,” “bridge builder,” and “scholar practitioner” in the field of evaluation. Guest editors Steffen Bohni Nielsen, Sebastian Lemire, and Steve Montague bring together 14 colleagues whose articles document, analyze, and expand on John’s contributions to evaluation in the Canadian public service as well as his contributions to evaluation theory.’ –Jill A. Chouinard

Canadian Journal of Program Evaluation, Volume 37 Issue 3, March 2023

A Word on Critical Realism, by Kieran Healy (2013)

I’ve just remembered this:

“[Critical realism] presents itself in a way that some social scientists—with next to no real background in philosophy—feel gives them just what they need to shore up their empirical research and metaphysical intuitions. You want to be realist in your philosophy of social science? Sure! You want your preferred level of analysis to also be an ontologically emergent level of reality? No problem! You want to talk about social structures as irreducible in some serious-but-not-really-analyzed fashion? You got it. You want your theory to be critical? I mean, who doesn’t, right? Just call yourself a Critical Realist and cite some Bhaskar. After all, he has repeatedly asserted that his work is a “Copernican Revolution” in the philosophy of science […]”


“The diffusion of CR was slightly hampered by the transformation of Bhaskar from fringe philosopher of science to full-blown guru. Having recruited followers in sociology on the basis of his realism, he began to pull the rug out from under them in the late 1990s, first with the merely absurd Plato, Etc and then with the frankly embarrassing From East to West: Odyssey of a Soul (which closed with a final chapter titled “The Dance of Shiva in the Age of Aquarius”). This work, in retrospect, seems like the culmination of the unpleasant cult of personality that grew up around Bhaskar in CR circles in the 1980s and which he seems to have done little to discourage.”

One section is constructive, providing alternatives to critical realism:

“Sociologists interested in emergence or macro-level explanations have no need to run together that interest with the specific CR position or view. (This was the point of my old article on Archer, Mouzelis, and CR.) There are large and immediately accessible literatures on all of these topics in philosophy, leading naturally to more technical or specialist work. Consider the SEP article on Emergent Properties, for instance, and the one on Supervenience, or the one on Scientific Explanation, or Scientific Realism. Go from there to, say, Jonathan Schaffer’s Is there a Fundamental Level? (for the metaphysics) or Michael Strevens’ Depth for one take on the philosophy of science, or—for something in parts similar to Bhaskar but far more creative and central—read Nancy Cartwright on laws of nature. I feel confident in asserting that different sociologists could ally themselves with quite incompatible positions in these debates and much of our work would go on as before.”

Regression to the mean

Suppose we were to run an uncontrolled pre-post evaluation of an intervention to alleviate psychological distress. We screen participants for distress and invite those with scores 1.5 SDs or more above the mean to take part. Then, following the intervention, we collect data on distress again to see if it has reduced. The measure we have chosen has a test-retest reliability of 0.8.

Here is a picture of simulated findings (scores have been scaled so that they have a mean of 0 and SD of 1). Red points denote data from people who have been included in the study.

I have set up the simulation so that the intervention had no effect, in the sense that outcomes would have been identical in the absence of the intervention. However, looking at the right-hand side, it appears that there has been a reduction in distress of 1.1 SDs – a huge effect. This is highly “statistically significant”, p < .001. What happened?!

Tweaking the simulation

Let’s try a different simulation. This time, without any screening, so everyone is included in the intervention regardless of their levels of distress (so all the data points are red):

Looking at the right-hand side, the pre-post change is 0 and p is close to 1. There is no change.

Next, select participants whose scores are at the mean or above:

The pre-post change is now statistically significant again, with an improvement of 0.27 SDs.

Select participants with more extreme scores, 1.5 SDs or above at baseline, and we see the magnitude of change has increased again:

What happens if we increase the test-retest reliability of the measure to 0.9?

Firstly, the scatterplot on the left is a little less fuzzy. The magnitude of change has reduced to 0.48 SDs.

Finally, let’s make the measure perfectly reliable so that the scatterplot on the left is a fuzz-free straight line:

Now there is no change.

What’s going on?

I have simulated the data so that the intervention had zero impact on outcomes, and yet for many of the analyses above it does appear to have alleviated distress.

The extent to which the effect illustrated above, called regression to the mean, occurs partly depends on how selective we are in inviting participants to join the study. At one extreme, if there is no selection, then the mean change is still zero. At the other extreme, when we are highly selective, then change is over 1 SD.

This is because by selecting people with particularly high scores at baseline, there’s an increased chance that we include people who had, for them, a statistically rare score. Perhaps they had a particularly bad day, which wasn’t indicative of their general levels of distress. Since we selected them when they happened to have a bad day, on measuring again after the intervention, there was a good chance they had a much less extreme score. But this reduction was entirely unrelated to the intervention. We know this because the simulation was set up so that the intervention had zero effect.

Making test-retest reliability perfect also eliminates regression to the mean. However, this is unlikely to be possible for most of the characteristics of people that are of interest for interventions.
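Here is a minimal sketch of this kind of simulation in Python (standard library only). It is not the code behind the app. It assumes pre and post scores are both standard normal and correlated at the test-retest reliability, with no intervention effect at all, so the size of the changes it produces depends on that assumption and need not match the figures quoted above. The function name and its parameters are my own.

```python
import math
import random
import statistics


def simulate_change(reliability, cutoff=None, n=50_000, seed=1):
    """Mean post-minus-pre change among people screened in at baseline.

    Pre and post scores are standard normal with correlation `reliability`.
    There is no intervention effect. `cutoff=None` means no screening.
    """
    rng = random.Random(seed)
    noise_sd = math.sqrt(1 - reliability**2)
    changes = []
    for _ in range(n):
        pre = rng.gauss(0, 1)
        post = reliability * pre + noise_sd * rng.gauss(0, 1)
        if cutoff is None or pre >= cutoff:
            changes.append(post - pre)
    return statistics.mean(changes)


scenarios = [
    ("no screening, reliability 0.8", 0.8, None),
    ("at or above the mean, reliability 0.8", 0.8, 0.0),
    ("1.5 SDs or above, reliability 0.8", 0.8, 1.5),
    ("1.5 SDs or above, reliability 0.9", 0.9, 1.5),
    ("1.5 SDs or above, reliability 1.0", 1.0, 1.5),
]
for label, reliability, cutoff in scenarios:
    change = simulate_change(reliability, cutoff)
    print(f"{label}: mean change = {change:.2f} SDs")
```

The pattern is the same as in the plots: no apparent change without screening, a larger apparent reduction the more extreme the cutoff, a smaller one as reliability increases, and none when reliability is perfect.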

You can play around with the app I developed to simulate the data over here.

Regression to the mean is just one reason why interventions can spuriously appear to have an effect. Carefully chosen control groups, where possible with random assignment to intervention or control, can take account of alternative explanations of change.

Evaluating What Works, by Dorothy Bishop and Paul Thompson

“Those who work in allied health professions aim to make people’s lives better. Often, however, it is hard to know how effective we have been: would change have occurred if we hadn’t intervened? Is it possible we are doing more harm than good? To answer these questions and develop a body of knowledge about what works, we need to evaluate interventions.

“As we shall see, demonstrating that an intervention has an impact is much harder than it appears at first sight. There are all kinds of issues that can arise to mislead us into thinking that we have an effective treatment when this is not the case. On the other hand, if a study is poorly designed, we may end up thinking an intervention is ineffective when in fact it is beneficial. Much of the attention of methodologists has focused on how to recognize and control for unwanted factors that can affect outcomes of interest. But psychology is also important: it tells us that [our] own human biases can be just as important in leading us astray. Good, objective intervention research is vital if we are to improve the outcomes of those we work with, but it is really difficult to do it well, and to do so we have to overcome our natural impulses to interpret evidence in biased ways.”

(Over here.)


No reasoning task is too irritating to be completed

Consider the sentence: “No head injury is too trivial to be ignored.” What does it mean?

Now consider: “No missile is too small to be banned.”

Go back to the first sentence – are you sure it means what you thought it did?

See Wason, P. C., & Reich, S. S. (1979). A Verbal Illusion. Quarterly Journal of Experimental Psychology, 31(4), 591–597.

Harvey at PwC

LLM-driven text analysis is becoming a norm, allowing people to process huge volumes of text that they wouldn’t otherwise have the capacity to handle. Although outputs can be checked, the large volume of inputs processed means there are fundamental limits on how comprehensively analyses can be checked.

PwC announced yesterday that it is trialling the use of Harvey, built on ChatGPT, to “help generate insights and recommendations based on large volumes of data, delivering richer information that will enable PwC professionals to identify solutions faster.”

They say that “All outputs will be overseen and reviewed by PwC professionals.” But what about how the data was processed in the first place…?