Texts I like for learning statistics and using R

I’ll be updating this, but first thoughts:

Fitting regression models, GLMs, etc.

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). London: SAGE Publications Ltd.

See also online material, including free appendices and R code.

Data transformation and visualisation

Healy, K. (2019). Data Visualization: A Practical Introduction. Princeton University Press. (Free online version.)

Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly. (Free online version.)

Chang, W. (2020). R Graphics Cookbook (2nd ed.). Sebastopol, CA: O’Reilly. (Free online version.)

Lüdecke, D. (2018). ggeffects: Tidy Data Frames of Marginal Effects from Regression Models. Journal of Open Source Software, 3(26), 772. doi: 10.21105/joss.00772

This is very handy for getting predictions from models, focusing on the effect of predictors of interest whilst holding covariates at some fixed values like a mean or (for factors) mode.
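To make the idea concrete, here is a rough base R sketch of what “holding covariates at fixed values” means. It is not ggeffects itself (the package’s ggpredict() function automates this sort of thing); the model, the built-in mtcars data, and the helper modal_level() are all my own illustrative assumptions.

```r
# Illustrative model on the built-in mtcars data (an assumption, not from the paper):
# fuel economy predicted by weight, horsepower and number of cylinders.
dat <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + hp + cyl, data = dat)

# Hypothetical helper: the most common level of a factor.
modal_level <- function(f) names(which.max(table(f)))

# Vary the predictor of interest (wt); hold hp at its mean and cyl at its mode.
grid <- data.frame(
  wt  = seq(2, 5, by = 1),
  hp  = mean(dat$hp),
  cyl = factor(modal_level(dat$cyl), levels = levels(dat$cyl))
)

# Predicted mpg with 95% confidence intervals for each value of wt.
pred <- predict(fit, newdata = grid, interval = "confidence")
grid$predicted <- pred[, "fit"]
grid$conf_low  <- pred[, "lwr"]
grid$conf_high <- pred[, "upr"]
grid
```

The resulting data frame has one row per value of wt, which is the kind of tidy output that is easy to pass straight to a plot.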

See also the package website for illustrative examples.

Gelman, A. (2011). Tables as graphs: The Ramanujan principle. Significance, 8, 183.

Missing data imputation

Van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC. (Free online version.)

See also the package website.


Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31, 337–350.

“… correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists.”

Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1, 140216. doi: 10.1098/rsos.140216

This generated lots of debate. I like how it attempts to use Bayes’ rule to turn p-values into something useful, and how it explains the result in terms of diagnostic test properties. See also this on PPV and NPV.
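Here is a minimal sketch of the diagnostic-test view, treating power as sensitivity and 1 − alpha as specificity. The function name and the input values (10% of tested hypotheses being real effects, 80% power, alpha = 0.05) are illustrative assumptions, and this covers only the simple “p ≤ alpha” version of the argument.

```r
# Hypothetical helper: treat a significance test like a diagnostic test.
#   prior = proportion of tested hypotheses where there is a real effect
#   power = P(significant | real effect), i.e. sensitivity
#   alpha = P(significant | no effect),   i.e. 1 - specificity
test_as_diagnostic <- function(prior, power, alpha) {
  true_pos  <- prior * power
  false_pos <- (1 - prior) * alpha
  true_neg  <- (1 - prior) * (1 - alpha)
  false_neg <- prior * (1 - power)
  list(
    ppv = true_pos / (true_pos + false_pos),  # P(real effect | significant)
    npv = true_neg / (true_neg + false_neg),  # P(no effect | not significant)
    fdr = false_pos / (true_pos + false_pos)  # false discovery rate = 1 - ppv
  )
}

res <- test_as_diagnostic(prior = 0.1, power = 0.8, alpha = 0.05)
res$fdr  # 0.36: over a third of "significant" results are false discoveries
```

Changing prior shows how strongly the answer depends on how plausible the tested hypotheses were to begin with.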

Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20(1), 244. doi: 10.1186/s12874-020-01105-9

Interesting proposal to use s-values, calculated from p-values as −log₂(p). It’s a simple transformation: p is the probability of getting all heads from −log₂(p) fair coin tosses. For example, if p = 0.5 then s = 1: toss a coin once and the probability of a head is 0.5. If p = 0.03125 then s = 5: toss a coin 5 times and the probability of all heads is 0.03125. The s-value is supposedly easier to think about. I’m not sure if it really is, but I like the idea!
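The transformation is a one-liner in R. A quick sketch (the function name s_value is my own, not from the paper):

```r
# s-value: bits of "surprise" in a p-value, s = -log2(p).
s_value <- function(p) {
  stopifnot(all(p > 0 & p <= 1))
  -log2(p)
}

s_value(c(0.5, 0.03125))  # 1 and 5, matching the coin-toss examples
s_value(0.05)             # about 4.32, a bit more than four heads in a row

# Going back the other way: probability of all heads in s fair tosses.
0.5 ^ s_value(0.03125)    # 0.03125
```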

Exploring the Russia Report using R

The UK’s Intelligence and Security Committee’s report into Russian activity in the UK was finally released a few days ago.

Here’s my exploration of redactions in the report, using R. Some highlights below.

One of the best predictors of whether a sentence will have a redaction is what organisations are mentioned in the sentence:


According to a sentiment analysis, the angriest sentences are on page 11 (PDF page 18):


Here’s a word cloud of sentences with a redaction, against the organisation(s) mentioned…


Choropleths in R – example using the 2020 Russian constitutional referendum

Choropleth maps use shading to represent quantities and are common in the press. I gave them a go in R, using the rvest package to scrape the results of the 2020 Russian constitutional referendum and the raster package piped through tidyverse tools to map them.

The code is on my GitHub repo.

Some of the fun I encountered along the way (details in the repo):

  • The CRAN version of raster didn’t work, but the latest on GitHub was fine and it’s easy to install this directly from R.
  • The Russian region names in the raster map of Russia didn’t always match those in the Wikipedia article. I tried fuzzy matching by edit distance, which did a pretty good job, but I still had to match some manually (e.g., “Sakha” and “Yakutia” are different names for the same place and a long edit distance from each other). I suspect it would have been easier just to sort both lists alphabetically and match manually from the start!
  • This warning is a worry: “support for gpclib will be withdrawn from maptools at the next major release” – I hope something comes along to replace it.
  • Lots of the examples of maps online are for the US and one basic problem is what projection to use. The mapproj package is fab for this.
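The fuzzy-matching step mentioned in the list above can be sketched with base R’s adist(), which computes edit distances between strings. This is not the code from the repo: the two short name lists and the distance cut-off of 2 are made up for illustration.

```r
# Made-up example lists: names as they might appear in the map data
# versus names as they might appear in the scraped results table.
map_names  <- c("Tyva", "Mariy-El", "Sakha")
wiki_names <- c("Tuva", "Mari El", "Yakutia")

# Matrix of edit distances: rows are map names, columns are wiki names.
d <- adist(map_names, wiki_names)
dimnames(d) <- list(map_names, wiki_names)

# For each map name, pick the wiki name with the smallest edit distance.
best <- unname(apply(d, 1, which.min))
matches <- data.frame(
  map_name  = map_names,
  wiki_name = wiki_names[best],
  distance  = d[cbind(seq_along(map_names), best)]
)

# Flag anything with a large distance for manual checking
# (the threshold of 2 is an arbitrary assumption).
matches$check_manually <- matches$distance > 2
matches
```

Spelling variants such as “Tyva” and “Tuva” match cleanly, whereas “Sakha” is far from every candidate, including “Yakutia”, so it gets flagged for a manual match.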