Associations between gender and sexuality in the England and Wales 2021 Census

ONS recently released data about sexual orientation and gender identity from Census 2021 in England and Wales.

I’d like to know whether your gender predicts your sexuality. ONS hasn’t released the relevant crosstabs yet, so here’s an approximation using variation in population counts across (lower-tier) local authorities.

Beware the ecological fallacy, e.g., this might show that areas with more people of a particular gender also have more people of a particular sexuality, but not necessarily that they are the same people.

Knitted R markdown over there. If you improve it, let me know please.

A picture:

Green denotes a positive association and red a negative association. The width of the line denotes the association strength.

  • Het = Straight or Heterosexual
  • GL = Gay or Lesbian
  • Bi = Bisexual
  • Pan = Pansexual
  • Ace = Asexual
  • Q = Queer
  • Cis = Gender identity the same as sex registered at birth
  • TM = Trans man
  • TW = Trans woman
  • NBi = Non-binary

This Be The Worse

Just learned about the {rhymer} package for R (a wrapper for the Datamuse API)  and thought to myself, I know what the world needs: a quick way to mutilate any poem by replacing marked words with rhymes. Here’s an example output:

This Be The Worse

They duck you up, your bum and dyad.
They may not mean to, but they do.
They fill you with the schmaltz they had
And add some extra, just for two.

But they were construct up in their turn
By fools in old-style cats and anecdotes,
Who half the time were soppy sunburn
And half at one another’s quotes.

Man hands on misery to man.
It deepens like a coastal elf.
Get out as early as you can,
And don’t have any eyelids yourself.

(The metre isn’t ideal.)

Code over there.

Mastodon and R

Encoding binary data in hashtags

Problem: you want to encode arbitrary binary data as a hashtag on Mastodon or another social media platform.

A solution: transform to a hex string and shift the \(0, 1, 2 \ldots, f\) up to \(a, b, c \ldots, p\) so all the bytes are lowercase Latin characters.

Example: #gihehehahddkcpcphhhhhhcohjgphfhehfgcgfcogdgpgncphhgbhegdgidphgdngefbhhdehhdjfhghfigdfb

Repo here

Accessing the Mastodon API via {rtoots}

I was curious to know if there’s a positive correlation between the total number of users on a server and how many followers I have from that server. {rtoots} to the rescue…

Repo here

 

Debiasing a biased coin

GIF from Tenor

Suppose you have a coin that you suspect might be biased. Here’s how you can debias it so that there’s a 50-50 chance of heads or tails, thanks to a neat idea often attributed to von Neumann (1951/1963, p. 768):

“… in tossing a coin it is probably easier to make two consecutive tosses independent than to toss heads with probability exactly one-half. If independence of successive tosses is assumed, we can reconstruct a 50-50 chance out of even a badly biased coin by tossing twice. If we get heads-heads or tails-tails, we reject the tosses and try again. If we get heads-tails (or tails-heads), we accept the result as heads (or tails). The resulting process is rigorously unbiased, although the amended process is at most 25 percent as efficient as ordinary coin-tossing.”

Why does this work? First, the probability of heads followed by tails (\(HT\)) is the following product, since coin flips are independent:

\(P(HT) = P(H) [1-P(H)]\)

We get the same answer for tails followed by heads (\(TH\)):

\(P(TH) = [1-P(H)] P(H) = P(H) [1-P(H)]\)

So, \(P(HT) = P(TH)\), which already hints at why this works.

For example, if a coin is so ridiculously biased that it only has a 10% chance of a heads outcome, then

\(\displaystyle P(HT) = P(TH) = \frac{1}{10} \frac{9}{10} = \frac{9}{100}\)

We really want a debiased coin to have a 50-50 chance of a heads or tails outcome. That’s where ignoring \(HH\) and \(TT\) outcomes helps; we condition on \(HT\) or \(TH\).

Assuming \(0 < P(H) < 1\), then

\(\displaystyle P(HT|HT \lor TH) = \frac{P(HT)}{P(HT) + P(TH)}\)

\(\displaystyle = \frac{P(HT)}{P(HT) + P(HT)}\)

\(\displaystyle = \frac{1}{2}\)

Same sums for \(P(TH|HT \lor TH)\). Et voila, a fair coin from two or more tosses of a biased one!

Let’s simulate it using rbinom in R.

Here’s a biased coin, with 1/10 chance of heads, flipped 20,000 times (\(H = 1\) and \(T = 0\)):

test <- rbinom(20000, 1, .1)
table(test)

This gives:

test
    0     1 
17982  2018

Around 10% of outcomes were heads.

Now the debiaser:

debias_iid <- function(x) {
  stopifnot(length(x) >= 2)
  stopifnot(length(x) %% 2 == 0)
  
  res <- rep(NA, length(x)/2)
  j <- 1
  
  for (i in seq(1,length(x), 2)) {
    res[j] <- case_when(
      x[i] == 1 && x[i+1] == 0 ~ 1,
      x[i] == 0 && x[i+1] == 1 ~ 0
    )
    j <- j+1
  }
  res
}

Try it:

debias_iid(test) |> table()

That looks better:

  0   1 
900 924

50.7% of outcomes were heads, which is the sort of value we would expect, given sampling error, for a fair coin.

References

von Neumann, J. (1951) Various Techniques Used in Connection with Random Digits, Notes by G. E. Forsythe, National Bureau of Standards Applied Math Series, 12, 36-38. Reprinted in von Neumann’s Collected Works (1963), Pergamon Press, pp. 768-770.

True random numbers in R

Random numbers are important for lots of things including statistical analysis (e.g., using bootstrapping) and cryptography (e.g., producing a one-time pad). But computers can’t produce random numbers. They produce pseudorandom numbers, which are deterministic and hence predictable given the random seed.

Pseudorandom numbers more than suffice for most purposes (so long as everyone isn’t using the same seeds like 1234 or 42); however, if you want or need the real thing there are various ways to pipe true randomness into computers. Two are available via R packages.

The best known is {random}, which exports pure randomness from Ireland. The basic idea is that a radio receiver is tuned to a frequency where no station broadcasts. This is continuously recorded and sampled to generate random numbers.

Another is {qrandom}, provided by the Australian National University. This works by measuring the quantum fluctuations of the vacuum.

01001110 01101111 00101100 00100000 01110100 01101000 01101001 01110011 00100000 01101001 01110011 01101110 11100010 10000000 10011001 01110100 00100000 01110010 01100001 01101110 01100100 01101111 01101101

Enumerating permutations – an example of recursion

One way to solve a problem of size \(n\) is to pretend you can solve it for \(n-1\), and use that imaginary solution to solve it for \(n\). The approach is called recursion. Here’s an example showing how it works: enumerating all the ways to rearrange the elements of a list.

Given a list with \(n\) elements, \(\langle x_1, x_2, x_3, \ldots, x_n \rangle\), where the order of the elements matters, there are

\(n! = n \times (n-1) \times (n-2) \times  \ldots \times  1\)

(“\(n\) factorial”) different ways to rearrange them (permutations). We can also write this out more formally as follows:

\(0! = 1\);
\(n! = n \times (n-1)!\), for \(n > 0\).

For example, consider the list \(\langle 1, 2,3 \rangle\), which I’ll just write 123 to save keystrokes. There are 3 elements and \(n! = 3 \times 2 \times 1 = 6\) permutations:

  1. 123
  2. 213
  3. 231
  4. 132
  5. 312
  6. 321

Using the more formal definition, the arithmetic goes:

\(3!\)
\(= 3 \times 2!\)
\(= 3 \times (2 \times 1!)\)
\(= 3 \times (2 \times (1 \times 0!))\)
\(= 3 \times (2 \times (1 \times 1))\)
\(= 6\)

To see how to enumerate all these, first start with the simplest list, \(\langle \ \rangle\). This is the empty list, the zero of the list world. There is exactly one way to arrange emptiness (\(0! = 1\)):

  1. \(\langle \ \rangle\)

That’s the problem solved for one list. There are infinitely many more, so we have a bit of work to do.

Now let’s return to the list 123, a list of length 3. Suppose (i.e., pretend) we knew all the permutations of 23 (a list of length 2). How could we use those to work out the permutations of 123.

It’s easy to see that there are two permutations of 23:

  1. 23
  2. 32

We go from those solutions to the solution for 123 by inserting 1 in all positions (beginning, middle, and end) of 23 and 32 in the following way:

  1. 123
  2. 213
  3. 231
  4. 132
  5. 312
  6. 321

I cheated a bit here, though. We haven’t developed a way to enumerate the permutations of 23. In a moment or two I want to write an R program to enumerate permutations of any list and R will need more explicit instructions.

Let’s go back a step and suppose we don’t know the permutations of 23 and want to enumerate them. Assume we know what the permutations are of a list with a single item, 3.

Now go back yet another step. How do we get the permutations of 3, assuming we know what they are for the next smallest list, the empty list, \(\langle \ \rangle\)?

We finally have an answer. As we saw above, there is one way to arrange emptiness: \(\langle \ \rangle\).

Following the same logic that allowed us to go from permutations of 23 to permutations of 123, we can go from from permutations of \(\langle \ \rangle\) to permutations of 3. We just have to insert 3 in all positions of the empty list. There is only one place to put it, giving a list with a single element, 3.

Now how do we go from the permutations of 3 to the permutations of 23? Simply place 2 in all positions of the one item list 3:

  1. 23
  2. 32

And so now our imagined solution actually exists.

We needed two steps:

  1. The base step, which solved the problem for the simplest list, the empty list \(\langle \ \rangle\).
  2. The inductive step, which uses permutations of the list of length \(n-1\), \(\langle x_2, x_3, \ldots, x_n \rangle\) and inserts \(x_1\) in all positions of those permutations to give us the permutations of a list of length \(n\), \(\langle x_1, x_2, x_3, \ldots, x_n \rangle\).

R

This translates straightforwardly to R as follows.

First, a function to insert an element \(x\) into all positions of a vector \(\mathit{xs}\):

insert_all <- function(x, xs) {
  res <- list()
  
  for (i in 0:length(xs))
    res <- append(res, list(append(xs, x, i)))

  res  
}

For example:

> insert_all(1, 2:4)
[[1]]
[1] 1 2 3 4

[[2]]
[1] 2 1 3 4

[[3]]
[1] 2 3 1 4

[[4]]
[1] 2 3 4 1

Now the actual permutation function, which takes a vector \(\mathit{xs}\) and returns a list of all its permutations:

perms <- function(xs) {
  res <- list()
  
  if (length(xs) == 0)
    res <- list(c())
  else  {
    ys <- perms(tail(xs, length(xs) - 1))
    for (y in ys)
      res <- append(res, insert_all(xs[1], y))
  }
    
  res
}

Note how perms is defined in terms of itself. The line

perms(tail(xs, length(xs) - 1))

removes the first element of the list and calls itself again. It keeps doing this until it is called with an empty vector, when it returns a list with an empty vector,

list(c())

Then insert_all pieces everything together.

Let’s give perms a go with the list 1234. There should be \(4! = 24\) permutations:

> perms(1:4)
[[1]]
[1] 1 2 3 4

[[2]]
[1] 2 1 3 4

[[3]]
[1] 2 3 1 4

[[4]]
[1] 2 3 4 1

[[5]]
[1] 1 3 2 4

[[6]]
[1] 3 1 2 4

[[7]]
[1] 3 2 1 4

[[8]]
[1] 3 2 4 1

[[9]]
[1] 1 3 4 2

[[10]]
[1] 3 1 4 2

[[11]]
[1] 3 4 1 2

[[12]]
[1] 3 4 2 1

[[13]]
[1] 1 2 4 3

[[14]]
[1] 2 1 4 3

[[15]]
[1] 2 4 1 3

[[16]]
[1] 2 4 3 1

[[17]]
[1] 1 4 2 3

[[18]]
[1] 4 1 2 3

[[19]]
[1] 4 2 1 3

[[20]]
[1] 4 2 3 1

[[21]]
[1] 1 4 3 2

[[22]]
[1] 4 1 3 2

[[23]]
[1] 4 3 1 2

[[24]]
[1] 4 3 2 1

It works!

Factorial

Understanding how to enumerate permutations also shows us how to count them. In other words, it shows us how to define factorial.

Pretend we don’t know what factorial is. We’re making a new one called “factorial?” (pronounce with rising intonation), such that \(n?\) is the number of permutations of a list of length \(n\). We hope that by the end of process, \(n? = n!\).

Recall again that there’s only one way to arrange emptiness, so

\(0? = 1\).

Suppose we know how to count the number of permutations of a list of length \(n-1\), that is, we know what \((n-1)?\) is. How would we use the answer to that to tell us what \(n?\) is?

If we know \((n-1)?\), that means we know the number of permutations of lists of length \(n-1\).

The length of each list counted by \((n-1)?\) must be \(n-1\). We saw when enumerating permutations that we have to add an extra element to all positions of each element in those lists. That’s \(n\) positions to add the extra element to, and we need to do it \((n-1)?\) times, so we just multiply:

\(n? = n \times (n-1)?\), for \(n > 0\).

It turns out that indeed \(n? = n!\), as defined at the top of this post.

XOR encryption

GCHQ recently posted the following JavaScript code on its Instagram and Twitter accounts:

The code contains two messages. The first is represented as a simple numerical encoding. The second is a secret message that has been encrypted, alongside code for decrypting it. Here are some clues to make sense of how this second message has been encrypted.

The message uses a symmetric-key encryption approach, the XOR cipher, that involves applying the exclusive or (XOR) operator to each letter of the message and the key, recycling the key until all characters have been decoded. The secret message is wrapped up in a Base64 encoding, which is a way of ensuring that all its characters are printable letters and symbols, so it’s possible to include the message within the JavaScript as “gNSkYr+VqyGl1Lhko8fqYq7UpGajiuo67w==”.

Here’s a shorter version of the code in R:

gchq_message <- "gNSkYr+VqyGl1Lhko8fqYq7UpGajiuo67w==" |>
                   base64enc::base64decode()
gchq_key <- c(0xc6, 0xb5, 0xca, 0x01) |> as.raw()
xor(gchq_message, gchq_key) |> rawToChar()

(No spoilers here…)

So, the steps to decrypt are:

  1. Translate the Base64 encoded message to raw bytes
  2. XOR those raw bytes with the key
  3. Translate the bytes to ASCII characters so we can read the message

The nice thing about this form of encryption is that the same algorithm does both encrypting and decrypting. So, if you wanted to reply, “No thanks, I’m good” you just do the same in reverse:

  1. Translate your ASCII text message to raw bytes
  2. XOR those bytes with the key
  3. Translate the result to Base64

In R:

"No thanks, I'm good" |>
  charToRaw() |>
  xor(gchq_key) |>
  base64enc::base64encode()

This gives “iNrqda7UpGq1mepI4djqZqnarg==”.

Fun! Also, I have a tattoo that uses the same approach, except I used Braille ASCII instead of Base64 to ensure that all the characters were tattooable 🙂

If you’re watching The Undeclared War, look out for the shout out to Base64 too:

But why is the key c6b5ca01? It’s not obviously the letters G, C, H, Q. In decimal, it looks like an IP address, but there’s nothing obvious at 198.181.202.1, and any four 8 bit numbers look like an IP address if you stare long enough.