People are rightly critical of the Myers–Briggs Type Indicator (MBTI). But some of the types are moderately correlated with the Big Five dimensions, which are seen as more credible in differential psychology. MBTI extraversion correlates with… wait for it… Big Five extraversion (50% shared variance). MBTI intuition correlates with openness to new experiences (40% shared variance). The opposite poles correlate as you’d expect.

Here are the key correlations (Furnham et al., 2003, p. 580; gender and linear effects of age have been partialed out):

“Neuroticism was most highly correlated with MBTI Extraversion (r = -.30, p = .001) and Introversion (r = .31, p < .001). Costa and McCrae’s Extraversion was most highly correlated with Myers-Briggs Extraversion (r = .71, p < .001) and Introversion (r=-.72, p < .001). Openness was most highly correlated with Sensing (r = -.66, p < .001) and Intuition (r = .64, p < .001). Agreeableness was most highly correlated with Thinking (r=-.41, p < .001) and Feeling (r = .28, p < .001). Conscientiousness was most highly correlated with Judgment (r = .46, p<.001) and Perception (r=-.46, p < .001).”
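
The shared-variance percentages mentioned at the top of the post are just squared correlations. A minimal sketch, using three of the correlations quoted above (the dictionary labels and the function name are mine, not the paper's):

```python
# Correlations taken from the Furnham et al. (2003) quote above.
correlations = {
    "Big Five Extraversion ~ MBTI Extraversion": 0.71,
    "Big Five Openness ~ MBTI Intuition": 0.64,
    "Big Five Conscientiousness ~ MBTI Judgment": 0.46,
}


def shared_variance(r: float) -> float:
    """Proportion of variance two variables share, given their correlation r."""
    return r ** 2


for label, r in correlations.items():
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.0%}")
```

Squaring is also why a correlation that sounds respectable, like .46, leaves almost 80% of the variance unshared.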

Dichotomising is still silly, particularly for scores close to thresholds, where a light breeze might flip someone’s type from, say, I to E or vice versa. But the same can be said of any discretisation taken too seriously. Consider also clinical bands on mental health questionnaires and attachment styles on the Experiences in Close Relationships Scale.

Also silly are tautologous non-explanations of the form: they behave that way because they’re E. Someone is E because they ticked a bunch of boxes saying they consider themselves extraverted! The types are defined transparently in terms of thoughts, feelings, and behaviour. They help structure self-report, but don’t explain why people are the way they are. Explanations require mechanisms.


Furnham, A., Moutafi, J., & Crump, J. (2003). The relationship between the revised NEO-Personality Inventory and the Myers-Briggs Type Indicator. Social Behavior and Personality, 31, 577–584.

Dealing with confounding in observational studies

Excellent review of simulation-based evaluations of quasi-experimental methods, by Varga et al. (2022). Also lovely annexes summarising the methods’ assumptions.

Methods for measured confounding the authors cover (Varga et al., 2022, Table A1):

| Method | Description of the method |
| --- | --- |
| PS matching (N = 47) | Treated and untreated individuals are matched based on their propensity score-similarity. After creating comparable groups of treated and untreated individuals the effect of the treatment can be estimated. |
| IPTW (N = 30) | With the help of re-weighting by the inverse probability of receiving the treatment, a synthetic sample is created which is representative of the population and in which treatment assignment is independent of the observed baseline covariates. Over-represented groups are downweighted and underrepresented groups are upweighted. |
| Overlap weights (N = 4) | Overlap weights were developed to overcome the limitations of truncation and trimming for IPTW, when some individual PSs approach 0 or 1. |
| Matching weights (N = 2) | Matching weights is an analogue weighting method for IPTW, when some individual PSs approach 0 or 1. |
| Covariate adjustment using PS (N = 13) | The estimated PS is included as a covariate in a regression model of the treatment. |
| PS stratification (N = 26) | First the subjects are grouped into strata based upon their PS. Then, the treatment effect is estimated within each PS stratum, and the ATE is computed as a weighted mean of the stratum specific estimates. |
| GAM (N = 1) | GAMs provide an alternative for traditional PS estimation by replacing the linear component of a logistic regression with a flexible additive function. |
| GBM (N = 3) | GBM trees provide an alternative for traditional PS estimation by estimating the function of covariates in a more flexible manner than logistic regression by averaging the PSs of small regression trees. |
| Genetic matching (N = 7) | This matching method algorithmically optimizes covariate balance and avoids the process of iteratively modifying the PS model. |
| Covariate-balancing PS (N = 5) | Models treatment assignment while optimizing the covariate balance. The method exploits the dual characteristics of the PS as a covariate balancing score and the conditional probability of treatment assignment. |
| DR estimation (N = 13) | Combines outcome regression with a model for the treatment (e.g., weighting by the PS) such that the effect estimator is robust to misspecification of one (but not both) of these models. |
| AIPTW (N = 8) | This estimator achieves the doubly-robust property by combining outcome regression with weighting by the PS. |
| Stratified DR estimator (N = 1) | Hybrid DR method of outcome regression with PS weighting and stratification. |
| TMLE (N = 2) | Semi-parametric double-robust method that allows for flexible estimation using (nonparametric) machine-learning methods. |
| Collaborative TMLE (N = 1) | Data-adaptive estimation method for TMLE. |
| One step joint Bayesian PS (N = 3) | Jointly estimates quantities in the PS and outcome stages. |
| Two-step Bayesian approach (N = 2) | A two-step modeling method using the Bayesian PS model in the first step, followed by a Bayesian outcome model in the second step. |
| Bayesian model averaging (N = 1) | Fully Bayesian model averaging approach. |
| An’s intermediate approach (N = 2) | Not fully Bayesian insofar as the outcome equation in An’s approach is frequentist. |
| G-computation (N = 4) | The method interprets counterfactual outcomes as missing data and uses a prediction model to obtain potential outcomes under different treatment scenarios. The entire set of predicted outcomes is then regressed on the treatment to obtain the coefficient of the effect estimate. |
| Prognostic scores (N = 7) | Prognostic scores are considered to be the prognostic analog of the PS methods. The prognostic score includes covariates based on their predictive power of the response; the PS includes covariates that predict treatment assignment. |
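
To make the IPTW row concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the eight records, the single binary confounder, and estimating the propensity score as the proportion treated within each confounder stratum (real analyses usually fit a model such as logistic regression). The weighted means are normalised by the sum of the weights.

```python
from collections import defaultdict

# Hypothetical toy data: (x, t, y) = (binary confounder, treatment, outcome).
# Outcomes follow y = 2*t + 3*x, so the true treatment effect is 2 for everyone,
# but people with x = 1 are more likely to be treated.
data = [
    (0, 1, 2), (0, 0, 0), (0, 0, 0), (0, 0, 0),
    (1, 1, 5), (1, 1, 5), (1, 1, 5), (1, 0, 3),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_difference(rows):
    """Unadjusted difference in mean outcomes, treated minus control."""
    treated = mean(y for _, t, y in rows if t == 1)
    control = mean(y for _, t, y in rows if t == 0)
    return treated - control


def propensity_by_stratum(rows):
    """Estimate P(treated | x) as the proportion treated in each stratum of x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number treated, number total]
    for x, t, _ in rows:
        counts[x][0] += t
        counts[x][1] += 1
    return {x: n_treated / n_total for x, (n_treated, n_total) in counts.items()}


def iptw_ate(rows):
    """Difference in inverse-probability-weighted mean outcomes."""
    ps = propensity_by_stratum(rows)
    weighted_sum = {0: 0.0, 1: 0.0}
    weight_total = {0: 0.0, 1: 0.0}
    for x, t, y in rows:
        weight = 1 / ps[x] if t == 1 else 1 / (1 - ps[x])
        weighted_sum[t] += weight * y
        weight_total[t] += weight
    return weighted_sum[1] / weight_total[1] - weighted_sum[0] / weight_total[0]


print("Naive difference:", naive_difference(data))
print("IPTW estimate:", iptw_ate(data))
```

In this toy example the naive difference overstates the effect because treated people tend to have x = 1, which raises the outcome regardless of treatment; reweighting recovers the effect of 2 that was built into the data.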

Methods for unmeasured confounding (Varga et al., 2022, Table A2):

| Method | Description of the method |
| --- | --- |
| IV approach (N = 17) | Post-randomization can be achieved using a sufficiently strong instrument. IV is correlated with the treatment and only affects the outcome through the treatment. |
| 2SLS (N = 11) | Linear estimator of the IV method. Uses linear probability for binary outcome and linear regression for continuous outcome. |
| 2SPS (N = 5) | Non-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of 2SPS procedure. The predicted or residual values from the first stage logistic regression of treatment on the IV are used as covariates in the second stage logistic regression: the predicted value of treatment replaces the observed treatment for 2SPS. |
| 2SRI (N = 8) | Semi-parametric estimator of the IV method. Logistic regression is used for both the first and second stages of the 2SRI procedure. The predicted or residual values from the first stage logistic regression of treatment on the IV are used as covariates in the second stage logistic regression. |
| IV based on generalized structural mean model (GSMM) (N = 1) | Semi-parametric models that use instrumental variables to identify causal parameters. |
| Instrumental PS (Matching enhanced IV) (N = 2) | Reduces the dimensionality of the measured confounders, but it also deals with unmeasured confounders by the use of an IV. |
| DiD (N = 7) | DiD method uses the assumption that without the treatment the average outcomes for the treated and control groups would have followed parallel trends over time. The design measures the effect of a treatment as the relative change in the outcomes between individuals in the treatment and control groups over time. |
| Matching combined with DiD (N = 6) | Alternative approach to DiD. Uses matching to balance the treatment and control groups according to pre-treatment outcomes and covariates. |
| SCM (N = 7) | This method constructs a comparator, the synthetic control, as a weighted average of the available control individuals. The weights are chosen to ensure that, prior to the treatment, levels of covariates and outcomes are similar over time to those of the treated unit. |
| Imperfect SCM (N = 1) | Extension of SCM method with relaxed assumptions that allow outcomes to be functions of transitory shocks. |
| Generalized SCM (N = 2) | Combines SC with fixed effects. |
| Synthetic DiD (N = 1) | Both unit and time fixed effects, which can be interpreted as the time-weighted version of DiD. |
| LDV regression approach (N = 1) | Adjusts for pre-treatment outcomes and covariates with a parametric regression model. Alternative approach to DiD. |
| Trend-in-trend (N = 1) | The trend-in-trend design examines time trends in outcome as a function of time trends in treatment across strata with different time trends in treatment. |
| PERR (N = 3) | PERR adjustment is a type of self-controlled design in which the treatment effect is estimated by the ratio of two rate ratios (RRs): RR after initiation of treatment and the RR prior to initiation of treatment. |
| PS calibration (N = 1) | Combines PS and regression calibration to address confounding by variables unobserved in the main study by using variables observed in a validation study. |
| RD (N = 4) | Method used for policy analysis. People slightly below and above the threshold for being exposed to a treatment are compared. |
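
The DiD row boils down to a difference of two differences. A minimal sketch with made-up group means (the numbers and names are hypothetical, and the estimate is only meaningful if the parallel-trends assumption described in the table holds):

```python
# Hypothetical mean outcomes by group and period.
group_means = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 15.0,
    ("control", "pre"): 8.0,
    ("control", "post"): 10.0,
}


def difference_in_differences(means):
    """Change in the treated group minus change in the control group."""
    treated_change = means[("treated", "post")] - means[("treated", "pre")]
    control_change = means[("control", "post")] - means[("control", "pre")]
    return treated_change - control_change


print("DiD estimate:", difference_in_differences(group_means))
```

Here the treated group improved by 5 and the control group by 2, so the estimate is 3: the control group's change stands in for what would have happened to the treated group without treatment.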


Varga, A. N., Guevara Morel, A. E., Lokkerbol, J., van Dongen, J. M., van Tulder, M. W., & Bosmans, J. E. (2022). Dealing with confounding in observational studies: A scoping review of methods evaluated in simulation studies with single‐point exposure. Statistics in Medicine.

Different ways to attain the same average treatment effect

Fun draft paper by Andrew Gelman, looking at different patterns of causal effects holding the average treatment effect (ATE) at 0.1 – part-inspired by Anscombe’s (1973) correlation quartet. Each graph shows a correlation between a hypothetical covariate, such as baseline symptom severity, and treatment effect. All four patterns are compatible with the ATE of 0.1.
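
To see how easy it is to build such patterns, here is a minimal sketch of four hypothetical effect functions of a covariate that all average to 0.1. These are my own illustrative patterns, not the four in the paper, and the covariate is assumed to be evenly spread between 0 and 1.

```python
# 100 evenly spaced covariate values in (0, 1), symmetric around 0.5.
xs = [(i + 0.5) / 100 for i in range(100)]

# Hypothetical treatment-effect patterns as functions of the covariate.
patterns = {
    "constant": lambda x: 0.1,                           # same effect for everyone
    "linear": lambda x: 0.2 * x,                         # effect grows with the covariate
    "step": lambda x: 0.2 if x > 0.5 else 0.0,           # only the upper half benefits
    "sign flip": lambda x: 0.4 if x > 0.5 else -0.2,     # lower half is harmed
}


def average_effect(effect, covariate_values):
    """Average treatment effect over the given covariate values."""
    return sum(effect(x) for x in covariate_values) / len(covariate_values)


for name, effect in patterns.items():
    print(f"{name}: ATE = {average_effect(effect, xs):.3f}")
```

The last pattern is the uncomfortable one: an intervention with a positive average effect that makes half of the people worse off.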

A Dilemma for the Russo–Williamson Thesis

The Russo–Williamson thesis states that

“in order to establish a causal claim in medicine, one normally needs to establish both that the putative cause and putative effect are appropriately correlated and that there is some underlying mechanism that can account for this correlation.”

Wilde (2022) explores counterexamples to this where a causal claim was accepted before a mechanism was confirmed, e.g.,

  • Deep brain stimulation as a treatment for Parkinson’s disease.
  • Soot as a cause of scrotal cancer before the mechanisms involving benzo[a]pyrene had been established.

Lots to ponder therein, e.g., whether it works to weaken the causal condition to require a plausible mechanism that need not necessarily be established. There are worries that this manoeuvre leads to a thesis that is too weak since, particularly in social science, it is often easy to come up with some kind of plausible mechanism for just about any phenomenon. Read the paper for a proposed solution!


Wilde, M. (2022). A Dilemma for the Russo–Williamson Thesis. Erkenntnis, in press.

The object-subject relation in science

“One can only help oneself through something like the following emergency decree: Quantum mechanics forbids statements about what really exists—statements about the object. Its statements deal only with the object-subject relation. Although this holds, after all, for any description of nature, it evidently holds in a much more radical and far reaching sense in quantum mechanics.”

– Erwin Schrödinger (1931) letter to Arnold Sommerfeld, spotted in
Fuchs, C. A., Mermin, N. D., & Schack, R. (2014). An introduction to QBism with an application to the locality of quantum mechanics. American Journal of Physics, 82(8), 749–754.


Quantitative social research – the worst kind, except for all the others

Breznau et al. (2022) asked a group of 161 researchers in 73 teams to analyse the same dataset and test the same hypothesis: greater immigration reduces public support for the welfare state. As we now expect in this genre of the literature, results varied. See the study’s figure below:

So roughly 60% of analyses found a non-statistically significant result. Of the 40% that were statistically significant, 60% found a negative association and 40% found a positive association.
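
Spelling out the arithmetic, taking the rough percentages above at face value (variable names are mine):

```python
# Rough shares as stated in the text above.
share_not_significant = 0.60
share_significant = 1 - share_not_significant

# Among significant results: 60% negative, 40% positive.
share_significant_negative = share_significant * 0.60
share_significant_positive = share_significant * 0.40

print(f"Not significant:       {share_not_significant:.0%}")
print(f"Significant, negative: {share_significant_negative:.0%}")
print(f"Significant, positive: {share_significant_positive:.0%}")
```

So only about a quarter of all analyses supported the hypothesis with a statistically significant negative association, and roughly one in six pointed significantly in the opposite direction.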

Social scientists are well-versed in the replication crisis and, e.g., the importance of preregistering analyses and not relying too heavily on the findings from any one study.

Mathur et al. (2022) offer a glimmer of hope, though. The variation looks fairly wild when focussing on whether a hypothesis test was statistically significant or not. However, 90% of analyses found that a one-unit increase in immigration was associated with an increase or decrease in public support of less than 4% of a standard deviation – tiny effects!

I also find hope in all the meta-analyses transparently showing biases. It seems that quantitative social science is the most unreliable and difficult to replicate form of social science, except for all the others.


Breznau, N., et al. (2022). Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. PNAS, 119(44), e2203150119.

Mathur, M. B., Covington, C., & VanderWeele, T. (2022, November 22). Variation across analysts in statistical significance, yet consistently small effect sizes. Preprint.

Reminiscing about BSSM

I used to run a social science methodology discussion group. Dumping the event list here, since the direction of travel of discussions tends to repeat, e.g., mixing methods, role of theory, sample size, limits of introspection, …

When Info / Readings
Thurs 6 Feb BISR BSSM research network social

This is a joint event with BSSM and the Birkbeck Institute for Social Research. Lunch will be included.

Tues 21 Jan Hilda Weiss, sociologist: A critical theorist of a lesser kind?

Detlef Garz, Hanse-Wissenschaftskolleg (HWK) Institute for Advanced Study

Hilda Weiss (29 August 1900 – 29 May 1981) was a sociologist and one of the first doctoral students at the Institute for Social Research in Frankfurt (joining 1924) which is famous for its role in developing critical theory. She played a central role in designing and running a large study of political views and employment conditions in Germany, 1930, working with Erich Fromm. Given her life, contributions to sociology, and methodological innovations, it seems odd that she has been mostly relegated to the occasional footnote in papers on people like Fromm. This talk will explore her life and contributions to sociology and critical theory.

Please sign up on Eventbrite

Thurs 7 Nov Analytic philosophy as critical theory: what can it do for empirical studies of gender?

Katharine Jenkins, University of Nottingham

Although the distinction between ‘analytic’ and ‘continental’ philosophy is difficult to pinpoint and easy to critique, there is nevertheless a fairly distinct literature that can be thought of as ‘analytic philosophy of social science’. Moreover, critical theory – theory understood as part of an emancipatory social movement – is often seen as part of continental philosophy and not as part of analytic philosophy. Crucially, critical theory involves being in contact with, and responsive to, one or more social justice movements, and developing theoretical tools that are useful for advancing the aims of these movements. In this talk, I explore the possibility for undertaking analytic philosophy of social science as a form of critical theory, with the intention of supplying tools to empirical social science that can aid emancipatory work. Using gender as a case study, I argue that it is possible to use the methods of analytic philosophy to fulfil the aims of critical theory, and that the clarity and precision that analytic philosophy brings can be useful for empirical research. I offer an analytic framework for thinking about social categories or kinds that is suited to projects in critical theory, and I apply this framework to gender in a way that is responsive to transfeminist movements.

Please sign up on Eventbrite

Wed 16 Oct Work in progress: Prediction versus history in political science

Robert Northcott, Philosophy

Robert will introduce a draft of a chapter he is writing on the philosophy of political science. The draft chapter argues that, usually, retrospective testing of wide-scope theories or models will not be appropriate for political science and that forward-looking prediction is required instead. But given the difficulty of the latter, in turn the main actual focus should be on contextual historical work. It then illustrates via a case study what role such a contextual approach leaves for wider-scope theory. It concludes by assessing the scope for political science to offer policy advice.

Weds 24 July Free association

Claudia Lapping, UCL

This session will be a brief introduction to the use of free association as a social research method. You will be invited to try out a couple of exercises: individual free writing and (in pairs) how to encourage free associations in interviews.

Weds 29 May Generalising from case studies

Ylikoski, P. (2018). Mechanism-based theorizing and generalization from case studies. Studies in History and Philosophy of Science Part A. In press, corrected proof.

Fri 12 April The constant comparative method

Quinn, K. G., Murphy, M. K., Nigogosyan, Z., & Petroll, A. E. (2019). Stigma, isolation and depression among older adults living with HIV in rural areas. Ageing and Society, 1–19.

Boeije, H. (2002). A Purposeful Approach to the Constant Comparative Method in the Analysis of Qualitative Interviews. Quality and Quantity, 36, 391–409.

Thurs 7 March Mixing qualitative methods

Cassell & Bishop (2018). Qualitative data analysis: Exploring themes, metaphors and stories. European Management Review.

Clarke, Willis, Barnes, Caddick, Cromby, McDermott & Wiltshire (2015). Analytical pluralism in qualitative research: A meta-study. Qualitative Research in Psychology, 12(2), 182-201.

Wed 23 Jan Telling more than we can know?

Petitmengin, C., Remillieux, A., Cahour, B., & Carter-Thomas, S. (2013). A gap in Nisbett and Wilson’s findings? A first-person access to our cognitive processes. Consciousness and Cognition, 22, 654–669.

Petitmengin, C. (2006). Describing one’s subjective experience in the second person: An interview method for the science of consciousness. Phenomenology and the Cognitive Sciences, 5, 229–269.

Nisbett, R.E. & Wilson, T.D. (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84, 231–259.

Tues 4 Dec What happens when mixed method findings conflict?

Johnson, R.B., Russo, F. & Schoonenboom, J., 2017. Causation in Mixed Methods Research: The Meeting of Philosophy, Science, and Practice. Journal of Mixed Methods Research.

Moffatt, S. et al., 2006. Using quantitative and qualitative data in health services research – what happens when mixed method findings conflict? BMC Health Services Research, 6, p.28.

Mon 12 Nov Launch!

Understanding causal estimands like ATE and ATT


Social policy and programme evaluations often report findings in terms of causal estimands such as the average treatment effect (ATE) or the average treatment effect on the treated (ATT or ATET). An estimand is a quantity we are trying to estimate – but what exactly does that mean? This post explains through simple examples.

Suppose a study has two conditions, treat (=1) and control (=0). Causal estimands are defined in terms of potential outcomes: the outcome if someone had been assigned to treatment, \(Y(1)\), and outcome if someone had been assigned to control, \(Y(0)\).

We only get to see one of those two realised, depending on which condition someone was actually assigned to. The other is a counterfactual outcome. Assume, for a moment, that you are omniscient and can observe both potential outcomes. The treatment effect (TE) for an individual is \(Y(1)-Y(0)\) and, since you are omniscient, you can see it for everyone.

Here is a table of potential outcomes and treatment effects for 10 fictional study participants. A higher score represents a better outcome.

Person Condition Y(0) Y(1) TE
1 1 0 7 7
2 0 3 0 -3
3 1 2 9 7
4 1 1 8 7
5 0 4 1 -3
6 1 3 10 7
7 0 4 1 -3
8 0 8 5 -3
9 0 7 4 -3
10 1 3 10 7

Note the pattern in the table. People who were assigned to treatment have a treatment effect of \(7\) and people who were assigned to control have a treatment effect of \(-3\), i.e., if they had been assigned to treatment, their outcome would have been worse. So everyone in this fictional study was lucky: they were assigned to the condition that led to the best outcome they could have had.

The average treatment effect (ATE) is simply the average of treatment effects: 

\(\displaystyle \frac{7 + -3 + 7 + 7 + -3 + 7 + -3 + -3 + -3 + 7}{10}=2\)

The average treatment effect on the treated (ATT or ATET) is the average of treatment effects for people who were assigned to the treatment:

\(\displaystyle \frac{7 + 7 + 7 + 7 + 7}{5}=7\)

The average treatment effect on control (ATC) is the average of treatment effects for people who were assigned to control:

\(\displaystyle \frac{-3 + -3 + -3 + -3 + -3}{5}=-3\)
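
The same three averages can be computed directly from the table of potential outcomes. A minimal sketch, pretending to be omniscient (the variable and function names are mine):

```python
# (person, condition, Y(0), Y(1)) for the 10 fictional participants above.
participants = [
    (1, 1, 0, 7), (2, 0, 3, 0), (3, 1, 2, 9), (4, 1, 1, 8), (5, 0, 4, 1),
    (6, 1, 3, 10), (7, 0, 4, 1), (8, 0, 8, 5), (9, 0, 7, 4), (10, 1, 3, 10),
]


def average_effect(rows):
    """Mean of individual treatment effects Y(1) - Y(0) over the given rows."""
    effects = [y1 - y0 for _, _, y0, y1 in rows]
    return sum(effects) / len(effects)


treated = [row for row in participants if row[1] == 1]
control = [row for row in participants if row[1] == 0]

ate = average_effect(participants)  # everyone
att = average_effect(treated)       # only those assigned to treatment
atc = average_effect(control)       # only those assigned to control

print(f"ATE = {ate}, ATT = {att}, ATC = {atc}")
```

All three estimands use the same within-person difference; they differ only in which people are averaged over.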

Alas, we aren’t really omniscient, so in reality we see a table like this:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 9 ?
4 1 ? 8 ?
5 0 4 ? ?
6 1 ? 10 ?
7 0 4 ? ?
8 0 8 ? ?
9 0 7 ? ?
10 1 ? 10 ?

This table highlights the fundamental problem of causal inference and why it is sometimes seen as a missing data problem.

Don’t confuse estimands and methods for estimation

One of the barriers to understanding these estimands is that we are used to taking a between-participant difference in group means to estimate the average effect of a treatment. But the estimands are defined in terms of a within-participant difference between two potential outcomes, only one of which is observed.

The causal effect is a theoretical quantity defined for individual people and it cannot be directly measured.

Here is another example where the causal effect is zero for everyone, so ATT, ATE, and ATC are all zero too:

Person Condition Y(0) Y(1) TE
1 1 7 7 0
2 0 3 3 0
3 1 7 7 0
4 1 7 7 0
5 0 3 3 0
6 1 7 7 0
7 0 3 3 0
8 0 3 3 0
9 0 3 3 0
10 1 7 7 0

However, people have been assigned to treatment and control in such a way that, given the outcomes realised, it appears that treatment is better than control. Here is the table again, this time with the outcomes we couldn’t observe removed:

Person Condition Y(0) Y(1) TE
1 1 ? 7 ?
2 0 3 ? ?
3 1 ? 7 ?
4 1 ? 7 ?
5 0 3 ? ?
6 1 ? 7 ?
7 0 3 ? ?
8 0 3 ? ?
9 0 3 ? ?
10 1 ? 7 ?

So, if we take the average of realised treatment outcomes we get 7 and the average of realised control outcomes we get 3. The mean difference is then 4. This estimate is biased. The correct answer is zero, but we couldn’t tell from the available data.
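
A minimal sketch of that comparison, using the zero-effect example above (names are mine): the true ATE needs both potential outcomes, while the naive estimate uses only what was actually observed.

```python
# (condition, Y(0), Y(1)) for the 10 participants in the zero-effect example.
participants = [
    (1, 7, 7), (0, 3, 3), (1, 7, 7), (1, 7, 7), (0, 3, 3),
    (1, 7, 7), (0, 3, 3), (0, 3, 3), (0, 3, 3), (1, 7, 7),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Omniscient view: average of within-person differences.
true_ate = mean(y1 - y0 for _, y0, y1 in participants)

# Realistic view: we only see Y(1) for the treated and Y(0) for controls.
observed_treated = mean(y1 for condition, _, y1 in participants if condition == 1)
observed_control = mean(y0 for condition, y0, _ in participants if condition == 0)
naive_estimate = observed_treated - observed_control

print(f"True ATE = {true_ate}, naive difference in means = {naive_estimate}")
```

The gap arises because the people assigned to treatment would have scored 7 whatever happened, and the people assigned to control would have scored 3 whatever happened.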

The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. For other estimators that don’t require random treatment assignment and for other estimands, try Scott Cunningham’s Causal Inference: The Mixtape.
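
Unbiased here means that the estimate is right on average across the assignments that randomisation could have produced, not that any single trial hits the target. A minimal sketch, reusing the potential outcomes from the first table and assuming (my assumption) a design that randomly assigns exactly 5 of the 10 people to treatment; it enumerates every possible assignment rather than simulating:

```python
from fractions import Fraction
from itertools import combinations

# Potential outcomes for the 10 participants in the first table.
y0 = [0, 3, 2, 1, 4, 3, 4, 8, 7, 3]
y1 = [7, 0, 9, 8, 1, 10, 1, 5, 4, 10]
n = len(y0)

true_ate = Fraction(sum(a - b for a, b in zip(y1, y0)), n)


def difference_in_means(treated_indices):
    """Observed mean difference if exactly these people were treated."""
    treated = set(treated_indices)
    control = [i for i in range(n) if i not in treated]
    mean_treated = Fraction(sum(y1[i] for i in treated), len(treated))
    mean_control = Fraction(sum(y0[i] for i in control), len(control))
    return mean_treated - mean_control


# Every way of assigning 5 of the 10 participants to treatment.
estimates = [difference_in_means(s) for s in combinations(range(n), 5)]
expected_estimate = sum(estimates) / len(estimates)

print(f"Number of possible assignments: {len(estimates)}")
print(f"True ATE: {true_ate}")
print(f"Average estimate across assignments: {expected_estimate}")
print(f"Smallest and largest single estimate: {min(estimates)}, {max(estimates)}")
```

Individual assignments give estimates that miss the ATE, sometimes by a lot in a sample this small, but the misses cancel out across assignments.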

How do you choose between ATE, ATT, and ATC?

Firstly, if you are running a randomised controlled trial, you don’t choose: ATE, ATT, and ATC will be the same. This is because, on average across trials, the characteristics of those who were assigned to treatment or control will be the same.

So the distinction between these three estimands only matters for quasi-experimental studies, for example where treatment assignment is not under the control of the researcher.

Noah Greifer and Elizabeth Stuart offer a neat set of example research questions to help decide (here lightly edited to make them less medical):

  • ATT: should an intervention currently being offered continue to be offered or should it be withheld?
  • ATC: should an intervention be extended to people who don’t currently receive it?
  • ATE: should an intervention be offered to everyone who is eligible?

How does intention to treat fit in?

The distinction between ATE and ATT is unrelated to the distinction between intention to treat and per-protocol analyses. Intention to treat analysis means we analyse people according to the group they were assigned to, even if they didn’t comply, e.g., by not engaging with the treatment. Per-protocol analysis is a biased analysis that only analyses data from participants who did comply and is generally not recommended.

For instance, it is possible to conduct a quasi-experimental study that uses intention to treat and estimates the average treatment effect on the treated. In this case, ATT might be better called something like average treatment effect for those we intended to treat (ATETWITT). Sadly this term hasn’t yet been used in the literature.


Causal effects are defined in terms of potential outcomes following treatment and following control. Only one potential outcome is observed, depending on whether someone was assigned to treatment or control, so causal effects cannot be directly observed. The fields of statistics and causal inference find ways to estimate these estimands using observable data. The easiest way to estimate ATE is through a randomised controlled trial. In this kind of study, the mean difference in observed outcomes is an unbiased estimate of ATE. Quasi-experimental designs allow the estimation of additional estimands: ATT and ATC.