PHQ-9 “over-diagnosis” paper shows that arithmetic works

A recent paper by Levis et al. (2020) systematically reviews studies looking at depression prevalence in two ways: one using a structured assessment completed by a professional (SCID) and the other using a questionnaire completed by study participants (PHQ-9). The authors conclude that “PHQ-9 ≥10 substantially overestimates depression prevalence.” But this was entirely predictable.

Mean SCID-prevalence was 12.1%.

Mean PHQ-9 prevalence (using a score of 10 or above to decide that someone has depression) was 24.6%.

This is almost exactly what arithmetic predicts; my back-of-envelope estimate of what PHQ-9 would say (see below) gives 23.8%, using estimates of PHQ-9’s sensitivity and specificity from a meta-analysis (88% and 85%, respectively) and the SCID-prevalence found in the review (12.1%).

So the paper’s results are unsurprising.

A positive result on PHQ-9 (or any other screening questionnaire) is more likely to be correct in groups with higher rates of depression, such as people who have asked for a GP appointment because they are worried about their mental health.

No clinical decisions – such as whether to accept someone for treatment – should be made on the basis of nine tick-box answers alone. Questionnaires can also miss people who need treatment.

Screening questionnaires are often designed to over-diagnose rather than risk missing people who need treatment, under the assumption that a proper follow-up assessment will be carried out.

When reporting condition prevalence, the psychometric properties of measures should be provided, including what “gold standard” they have been validated against, and the chosen clinical threshold.

Explore Positive/Negative Predictive Values (PPV and NPV) using this app.
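
For readers who would rather see the numbers than use an app, here is a minimal Python sketch of how PPV and NPV shift with prevalence. It assumes the sensitivity (88%) and specificity (85%) quoted above. The function name predictive_values is my own, and the 50% prevalence is a made-up figure standing in for a higher-risk group, not a number from the review.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) for a test applied to a group with the given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)  # P(condition | positive result)
    npv = true_neg / (true_neg + false_neg)  # P(no condition | negative result)
    return ppv, npv


# 0.121 is the SCID prevalence from the review; 0.5 is a hypothetical
# higher-prevalence group used only for illustration.
for prevalence in (0.121, 0.5):
    ppv, npv = predictive_values(prevalence, sensitivity=0.88, specificity=0.85)
    print(f"prevalence {prevalence:.1%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At 12.1% prevalence, fewer than half of the people scoring 10 or above would be expected to meet SCID criteria; in the hypothetical 50% group the figure is roughly 85%.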


Back of envelope

P(SCID) = .121
P(PHQ | SCID) = .88
P(not-PHQ | not-SCID) = .85
P(PHQ | not-SCID) = 1 – P(not-PHQ | not-SCID) = .15

P(PHQ & SCID) = P(PHQ | SCID) * P(SCID)
= .88 * .121
= .10648

P(PHQ & not-SCID) = P(PHQ | not-SCID) * P(not-SCID)
= (1 – .85) * (1 – .121)
= .13185

P(PHQ) = P(PHQ & SCID) + P(PHQ & not-SCID)
= .10648 + .13185
= .23833
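
The same calculation as a short Python sketch, in case you want to plug in other values. The function name expected_screen_positive_rate is my own label; the inputs are the prevalence, sensitivity and specificity used above.

```python
def expected_screen_positive_rate(prevalence, sensitivity, specificity):
    """Proportion of a group expected to score at or above the cut-off."""
    true_positives = sensitivity * prevalence                # P(PHQ & SCID)
    false_positives = (1 - specificity) * (1 - prevalence)   # P(PHQ & not-SCID)
    return true_positives + false_positives                  # P(PHQ)


phq_rate = expected_screen_positive_rate(
    prevalence=0.121, sensitivity=0.88, specificity=0.85
)
print(f"{phq_rate:.5f}")  # prints 0.23833
```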


Thanks Chris, for pointing out the typo!

Mental testing

“The unfortunate habit in the mental testing field of devising a new test, administering it to some arbitrarily chosen group of subjects, calling these ‘the standardization population’, and then leaving it at that, does not seem to call for comment.” (Ehrenberg, 1955, p. 26, footnote 1)

Ehrenberg, A. S. C. (1955). Measurement and mathematics in psychology. British Journal of Psychology, 46(1), 20–9. Retrieved from

On reliability and validity

From The Last Psychiatrist:

“As anyone who has ever dated a girl who was too much into the occult will tell you, astrology is difficult. It has a highly structured set of rules—math, really—so precise and complex that, theoretically, any two astrologers should independently arrive at the same result, which is correct enough times to keep people from breaking out into hysterical laughter all the time. However, astrology is crap, right? Some other factors explain the few successes. Is the fact that so many schizophrenics are born in the spring related to Mars rising in Orion, or to a virus women contract in the winter? Etc. In other words, just because a system is reliable, doesn’t mean it’s valid.”