A cynical view of SEMs

It is all too common for a box and arrow diagram to be cobbled together in an afternoon and christened a “theory of change”. One formalised version of such a diagram is a structural equation model (SEM), the arrows of which are annotated with coefficients estimated using data. Here is John Fox (2002) on SEM and informal boxology:

“A cynical view of SEMs is that their popularity in the social sciences reflects the legitimacy that the models appear to lend to causal interpretation of observational data, when in fact such interpretation is no less problematic than for other kinds of regression models applied to observational data. A more charitable interpretation is that SEMs are close to the kind of informal thinking about causal relationships that is common in social-science theorizing, and that, therefore, these models facilitate translating such theories into data analysis.”


Fox, J. (2002). Structural Equation Models: Appendix to An R and S-PLUS Companion to Applied Regression. Last corrected 2006.

Do “Growth Mindset” interventions improve students’ academic attainment?

“We conducted a systematic review and multiple meta-analyses of the growth mindset intervention literature. Our goal was to answer two questions: (a) Do growth mindset interventions generally improve students’ academic achievement? and (b) Are growth mindset intervention effects due to instilling growth mindsets in students or are apparent effects due to shortcomings in study designs, analyses, and reporting? To answer these questions, we systematically reviewed the literature and conducted multiple meta-analyses imposing varying degrees of quality control. Our results indicated that apparent effects of growth mindset interventions are possibly due to inadequate study designs, reporting flaws, and bias. In particular, the systematic review yielded several concerning patterns of threats to internal validity.”


Privacy implications of hashing data, by John Cook

Cryptographic hash functions are sometimes used to create pseudo IDs from identifiable info like NHS numbers. An advantage of this approach is that two sites that have no way to communicate with each other will generate the same pseudo ID, allowing data to be linked. A disadvantage is that, although hashes are in general infeasible to invert, a rainbow table lookup attack can be used to invert the hash when the space of inputs is relatively small, as is the case for ID numbers. John Cook explores.
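The attack is easy to sketch. Here is a minimal illustration (not Cook's own code, and using made-up 5-digit IDs rather than real 10-digit NHS numbers) of how an attacker who knows the input space can precompute a lookup table and recover the identifier behind any pseudo ID:

```python
import hashlib

def pseudo_id(id_number: str) -> str:
    """Derive a pseudo ID by hashing an identifier (unsalted, unkeyed)."""
    return hashlib.sha256(id_number.encode()).hexdigest()

# The attacker enumerates the whole input space in advance.
# For illustration: 5-digit IDs, so only 100,000 hashes to compute.
lookup = {pseudo_id(f"{n:05d}"): f"{n:05d}" for n in range(100_000)}

# Given only a pseudo ID, the attacker recovers the original number.
leaked = pseudo_id("04217")
recovered = lookup[leaked]  # -> "04217"
```

Even the full 10-digit NHS number space is only 10 billion values, well within reach of modern hardware. Hashing with a secret key (e.g., HMAC) blocks the attack, but then the sites must share that key, losing the no-communication advantage.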

Baseline balance in experiments and quasi-experiments

Baseline balance is important for both experiments and quasi-experiments, just not in the way researchers sometimes believe. Here are excerpts from three of my favourite discussions of the topic.

Don’t test for baseline imbalance in RCTs. Senn (1994, p. 1716):

“… the following are two incontrovertible facts about a randomized clinical trial:

1. over all randomizations the groups are balanced;

2. for a particular randomization they are unbalanced.

Now, no ‘[statistically] significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized…”

Do examine baseline imbalance in quasi-experiments; however, not by using statistical tests. Sample descriptives, such as a difference in means, suffice. Imai et al. (2008, p. 497):

“… from a theoretical perspective, balance is a characteristic of the sample, not some hypothetical population, and so, strictly speaking, hypothesis tests are irrelevant…”

Using p-values from t-tests and similar tests can lead to erroneous conclusions about balance. As you prune a dataset to improve balance, power to detect imbalance decreases too. Imai et al. (2008, p. 497 again):

“Since the values of […] hypothesis tests are affected by factors other than balance, they cannot even be counted on to be monotone functions of balance. The t-test can indicate that balance is becoming better whereas the actual balance is growing worse, staying the same or improving. Although we choose the most commonly used t-test for illustration, the same problem applies to many other test statistics…”
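The problem is easy to see from the formula for the two-sample t statistic: holding the standardized mean difference fixed, the statistic scales with the square root of the sample size. A small sketch (my illustration, not from Imai et al.; equal group sizes and unit variance assumed):

```python
import math

def t_statistic(std_diff: float, n_per_group: int) -> float:
    """Two-sample t statistic for standardized mean difference d,
    assuming equal group sizes and unit variance: t = d * sqrt(n/2)."""
    return std_diff * math.sqrt(n_per_group / 2)

d = 0.2  # the actual balance (standardized difference), held fixed

print(t_statistic(d, 1000))  # ~4.47: the test cries "imbalance!"
print(t_statistic(d, 50))    # 1.0: the test says "balanced"
```

Prune the sample from 1,000 per group to 50 and the test's verdict flips from significant imbalance to apparent balance, while the actual balance (d = 0.2) has not changed at all. This is why sample descriptives, not hypothesis tests, are the right tool.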

If your matching has led to baseline balance, then you’re good, even if the matching model is misspecified. (Though not if you’re missing key covariates, of course.) Rosenbaum (2023, p. 29):

“So far as matching and stratification are concerned, the propensity score and other methods are a means to an end, not an end in themselves. If matching for a misspecified and misestimated propensity score balances x, then that is fine. If by bad luck, the true propensity score failed to balance x, then the match is inadequate and should be improved.”


Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society), 171(2), 481–502.

Rosenbaum, P. R. (2023). Propensity score. In J. R. Zubizarreta, E. A. Stuart, D. S. Small, & P. R. Rosenbaum, Handbook of Matching and Weighting Adjustments for Causal Inference (pp. 21–38). Chapman and Hall/CRC.

Senn, S. (1994). Testing for baseline balance in clinical trials. Statistics in Medicine, 13, 1715–1726.

Communication is probably more than 7% verbal

“Have you ever heard the adage that communication is only 7 percent verbal and 93 percent non-verbal, i.e. body language and vocal variety? You probably have, and if you have any sense at all, you have ignored it.” Philip Yaffe wades into this one. The first two pages provide a concise summary of the 1967 studies that produced this 7% figure:

Subjects were asked to listen to a recording of a woman’s voice saying the word “maybe” three different ways to convey liking, neutrality, and disliking. They were also shown photos of the woman’s face conveying the same three emotions. They were then asked to guess the emotions heard in the recorded voice, seen in the photos, and both together. The result? The subjects correctly identified the emotions 50 percent more often from the photos than from the voice.

In the second study, subjects were asked to listen to nine recorded words, three meant to convey liking (honey, dear, thanks), three to convey neutrality (maybe, really, oh), and three to convey disliking (don’t, brute, terrible). Each word was pronounced three different ways. When asked to guess the emotions being conveyed, it turned out that the subjects were more influenced by the tone of voice than by the words themselves.

The original studies behind the figure look interesting for what they actually tried to do rather than the bullshit claims that resulted.


Mehrabian, A., & Wiener, M. (1967). Decoding of inconsistent communications. Journal of Personality and Social Psychology, 6(1), 109–114.

Mehrabian, A., & Ferris, S. R. (1967). Inference of attitudes from nonverbal communication in two channels. Journal of Consulting Psychology, 31(3), 248–252.

Can you bullshit a bullshitter?

You can bullshit a bullshitter, except if they also have high cognitive ability, according to Littrell et al. (2021).

Littrell, S., Risko, E. F., & Fugelsang, J. A. (2021). ‘You can’t bullshit a bullshitter’ (or can you?): Bullshitting frequency predicts receptivity to various types of misleading information. British Journal of Social Psychology, 60(4), 1484–1505.

History repeating in psychedelics research

Interesting draft paper by Michiel van Elk and Eiko Fried on flawed evaluations of psychedelics to treat mental health conditions and how to do better. Neat 1966 quotation at the end:

‘… we urge caution repeating the history of so many hyped treatments in clinical psychology and psychiatry in the last century. For psychedelic research in particular, we are not the first ones to raise concerns and can only echo the warning expressed more than half a century ago:

“To be hopeful and optimistic about psychedelic drugs and their potential is one thing; to be messianic is another. Both the present and the future of psychedelic research already have been grievously injured by a messianism that is as unwarranted as it has proved undesirable”. (Masters & Houston, 1966)’

Five questions to ask of social research

  1. Why should I care about this sample? Is the sample itself of interest, whether 1 person (e.g., a biography-like case study) or 1,000?
  2. If generalisation to a broader population is intended or implied,
    (a) How is the case made that the findings in the sample transfer to other people?
    (b) Why should I care about the target population?
  3. To what extent do findings depend on participants being able to articulate the reasons why they acted the way they did?
  4. Do the researchers state or imply that X caused Y or contributed to Y? If so, what evidence is provided that if X hadn’t been the case, then Y would have been different?
  5. What political agendas do (a) the researchers and (b) their institutions have? Related, what constraints are they under, e.g., due to who funds them?