Different notions of “effect size”

Tired of people equating “effect size” with “standardised measure of effect size”? Here’s an antidote, thanks to Shinichi Nakagawa and Innes C. Cuthill (2007). [Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol. Rev. (2007), 82, pp. 591–605.]

They review the different meanings of “effect size”:

  • “Firstly, effect size can mean a statistic which estimates the magnitude of an effect (e.g. mean difference, regression coefficient, Cohen’s d, correlation coefficient). We refer to this as an ‘effect statistic’ (it is sometimes called an effect size measurement or index).
  • “Secondly, it also means the actual values calculated from certain effect statistics (e.g. mean difference = 30 or r = 0.7; in most cases, ‘effect size’ means this, or is written as ‘effect size value’).
  • “The third meaning is a relevant interpretation of an estimated magnitude of an effect from the effect statistics. This is sometimes referred to as the biological importance of the effect, or the practical and clinical importance in social and medical sciences.”

They argue in favour of confidence intervals, as these “are not simply a tool for NHST [significance testing], but show a range of probable effect size estimates with a given confidence.”

They also cite Wilkinson, L. & the Task Force on Statistical Inference (1999) [Statistical methods in psychology journals. American Psychologist 54, 594–604]:

“our focus on these two standardised effect statistics does not mean priority of standardised effect statistics (r or d) over unstandardised effect statistics (regression coefficient or mean difference) and other effect statistics (e.g. odds ratio, relative risk and risk difference). If the original units of measurement are meaningful, the presentation of unstandardised effect statistics is preferable over that of standardised effect statistics (Wilkinson & the Task Force on Statistical Inference, 1999).”
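To make the distinction concrete, here is a minimal sketch that reports an unstandardised effect statistic (a mean difference with an approximate 95% interval) next to a standardised one (Cohen's d). The two samples are invented for illustration, the function names are my own, and the interval uses a normal approximation (z = 1.96); for samples this small a t quantile would be the more careful choice.

```python
import math
import statistics


def mean_difference_ci(a, b, z=1.96):
    """Difference in means (original units) with an approximate 95% CI.

    Uses a normal approximation with unpooled variances; this is a
    simplifying assumption, not a recommendation from the paper.
    """
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return diff, (diff - z * se, diff + z * se)


def cohens_d(a, b):
    """Standardised mean difference using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Hypothetical measurements in some meaningful unit (say, grams).
treated = [12.1, 14.3, 13.8, 15.2, 13.0, 14.9]
control = [10.4, 11.9, 12.3, 11.1, 12.8, 10.9]

diff, (ci_low, ci_high) = mean_difference_ci(treated, control)
d = cohens_d(treated, control)
print(f"mean difference = {diff:.2f} (approx. 95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"Cohen's d = {d:.2f}")
```

The first line of output stays in the original units, which is the presentation the quote prefers when those units mean something; d only tells you the difference in pooled-SD units.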

Good stuff, this.

Group versus individual statistical predictions

Lovely article in the Guardian by Christine Evans-Pughe on how making statistical predictions about individuals is exceedingly tricky. It reports work by Hart, Michie, and Cooke (2007). Abstract of the latter:

BACKGROUND: Actuarial risk assessment instruments (ARAIs) estimate the probability that individuals will engage in future violence. AIMS: To evaluate the ‘margins of error’ at the group and individual level for risk estimates made using ARAIs. METHOD: An established statistical method was used to construct 95% CI for group and individual risk estimates made using two popular ARAIs. RESULTS: The 95% CI were large for risk estimates at the group level; at the individual level, they were so high as to render risk estimates virtually meaningless. CONCLUSIONS: The ARAIs cannot be used to estimate an individual’s risk for future violence with any reasonable degree of certainty and should be used with great caution or not at all. In theory, reasonably precise group estimates could be made using ARAIs if developers used very large construction samples and if the tests included few score categories with extreme risk estimates.

Hart, S.D., Michie, C., & Cooke, D.J. (2007). Precision of actuarial risk assessment instruments: Evaluating the ‘margins of error’ of group v. individual predictions of violence. British Journal of Psychiatry, 190, s60–s65.
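A rough way to get a feel for why interval width matters here: the sketch below computes a Wilson score interval for a proportion at several sample sizes. That the Wilson interval is a sensible stand-in for the “established statistical method” in the abstract is my assumption, as are the observed rate of 50% and the sample sizes; none of these numbers come from the paper.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed in n cases."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical: the same estimated risk (50%) backed by different amounts of data.
widths = {}
for n in (1, 10, 100, 1000):
    low, high = wilson_interval(0.5, n)
    widths[n] = high - low
    print(f"n = {n:4d}: {low:.2f} to {high:.2f}")
```

With n = 1 the interval covers most of the range from 0 to 1, and it only narrows to something usable once n is large, which is the flavour of the abstract's conclusion about very large construction samples.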

What’s the difference between fixed and random effects?

Gelman (2005, p. 21) to the rescue.

We prefer to sidestep the overloaded terms “fixed” and “random” with a cleaner distinction […]. We define effects (or coefficients) in a multilevel model as constant if they are identical for all groups in a population and varying if they are allowed to differ from group to group.

Gelman, A. (2005). Analysis of variance—why it is more important than ever. The Annals of Statistics, 33(1), 1–53.
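As an illustration of the constant/varying distinction (my notation, not copied from the paper), consider a simple multilevel regression in which observation i belongs to group j[i]:

```latex
y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i,
\qquad
\alpha_j \sim \mathrm{N}(\mu_\alpha, \sigma_\alpha^2),
\qquad
\epsilon_i \sim \mathrm{N}(0, \sigma_y^2)
```

Here the intercept \alpha_j is varying, since it is allowed to differ from group to group, while the slope \beta is constant, since it is identical for all groups. Writing \beta_{j[i]} instead would make the slope varying too.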

How to get someone’s g

“Intelligence”, “IQ”, and “g” (due to Spearman) are terms that get bandied about.

The following may be helpful: the gist of how to calculate someone’s g score, which is often used as the measure of someone’s “intelligence”.

For example, that’s the “IQ”/“intelligence” referred to in the recentish BBC article on research linking childhood intelligence and adult vegetarianism (clever children grow into clever vegetarian adults).

  1. Give hundreds or thousands of people a dozen tests of ability.
  2. Zap everyone’s scores with PCA or factor analysis.
  3. g is the first component and usually explains around half the variance. Here’s an example of the genre: an analysis of g alongside other facets of psychometric intelligence.
  4. Use the component to calculate a score. For factor analysis there are many ways to do this, e.g. Thomson’s regression scores or Bartlett’s weighted least-squares scores. The gist is that for each person you compute a weighted sum of their scores, where the weights are a function of how strongly each test loads on g.
  5. To get something resembling an IQ score, scale it so it has a mean of 100 and an SD of 15.
  6. Talk about it as if it were a substantive psychological construct, rather than a statistical artefact 😉
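The recipe above can be sketched in a few lines of NumPy. Everything about the data is an assumption for illustration: the scores are simulated from a single latent ability plus noise, with made-up loadings, and the PCA route is used (eigendecomposition of the correlation matrix) rather than any particular factor-scoring method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tests = 1000, 12

# Step 1 (simulated, hypothetical data): one shared latent ability plus
# test-specific noise, scaled so each test has roughly unit variance.
latent = rng.normal(size=(n_people, 1))
loadings = rng.uniform(0.5, 0.9, size=(1, n_tests))
noise = rng.normal(size=(n_people, n_tests)) * np.sqrt(1 - loadings**2)
scores = latent * loadings + noise

# Step 2: PCA on standardised scores, via the correlation matrix.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
corr = np.corrcoef(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order

# Step 3: the first principal component is the one with the largest eigenvalue.
first = eigvecs[:, -1]
if first.sum() < 0:  # the sign of an eigenvector is arbitrary; make high = able
    first = -first
explained = eigvals[-1] / eigvals.sum()

# Step 4: each person's score is a weighted sum of their standardised test scores.
g_raw = z @ first

# Step 5: rescale to mean 100 and SD 15 to get something IQ-like.
iq_like = 100 + 15 * (g_raw - g_raw.mean()) / g_raw.std()

print(f"share of variance on first component: {explained:.2f}")
print(f"IQ-like scores: mean {iq_like.mean():.1f}, SD {iq_like.std():.1f}")
```

Because the simulation builds in a single common factor, the first component will of course recover it; with real test batteries the same arithmetic runs just as happily whether or not there is anything substantive behind the component, which is rather the point of step 6.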

What is this mysterious g thing?

Florence Nightingale

A bio of Florence Nightingale, statistician and nurse. Excerpt:

“Nightingale helped to promote what was then a revolutionary idea (and a religious one for her) that social phenomena could be objectively measured and subjected to mathematical analysis. Her work with medical statistics was so impressive that she was elected (in 1858) to membership in the Statistical Society of England. One of the pioneers in the graphic method of presentation of data, she invented colorful polar-area diagrams to dramatize medical data. Although other methods of persuasion had failed, her statistical approach convinced military authorities, Parliament, and Queen Victoria to carry out her proposed hospital reforms.”

[Photo of her Polar Area Diagram (“coxcomb”) from over here.]