Blog

Census data on trans and non-binary people in Canada

Canada published census data on trans and non-binary people on 27 April 2022. Here’s a table of the values they presented in a pie chart (why a pie chart, Canada?). Individuals in the census were aged 15 or above and living in a private household in May 2021.

Gender        N            %
Cis man       14,814,230   48.83
Cis woman     15,421,085   50.83
Trans man     27,905       0.09
Trans woman   31,555       0.10
Non-binary    41,355       0.14
Total         30,336,130   100.00

 

Playing with RCTs and probabilities

Suppose we run an RCT with two groups, treatment and control, and a binary outcome of whether participants recover or not.

There are two potential outcomes: recovery following treatment (\(R_t\)) and recovery following control (\(R_c\)), \(1\) if recovered and \(0\) if not recovered. Only one of these two potential outcomes is realised, depending on what group someone is assigned to. Let \(W = t\) if a participant was assigned to treatment and \(W = c\) if they were assigned to control.

Suppose, following an RCT, we learn the following (somehow with perfect precision):

\(P(R_t = 1 | W = t) = 0.8\);

\(P(R_c = 1 | W = c) = 0.3\).

Given the two probabilities above, it turns out the best we can say is that \(P(R_t = 1) \in [0, 1]\) and \(P(R_c = 1) \in [0, 1]\). So, it seems that we aren’t yet able to infer anything about the potential outcomes beyond those that were realised.

Add to our premises that participants were assigned to treatment or control by coin flip:

\(P(W = t) = P(W = c) = 0.5\).

Now \(P(R_t = 1) \in [0.4, 0.9]\) and \(P(R_c = 1) \in [0.15, 0.65]\). These intervals are clearly better than \([0,1]\); however, can we do better?
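The interval arithmetic above follows from the law of total probability, with the unobserved arm's conditional probability free to range over \([0, 1]\). Here is a minimal sketch (the function name is mine, not from the post):

```python
# By the law of total probability:
#   P(R_t = 1) = P(R_t = 1 | W = t) P(W = t) + P(R_t = 1 | W = c) P(W = c)
# The second conditional probability is never observed, so we let it
# range over [0, 1] to get an interval.

def interval(p_given_own_arm, p_own_arm):
    """Bounds on P(R = 1) when the other arm's conditional probability is unknown."""
    lo = p_given_own_arm * p_own_arm + 0.0 * (1 - p_own_arm)
    hi = p_given_own_arm * p_own_arm + 1.0 * (1 - p_own_arm)
    return lo, hi

print(interval(0.8, 0.5))  # P(R_t = 1) lies in [0.4, 0.9]
print(interval(0.3, 0.5))  # P(R_c = 1) lies in [0.15, 0.65]
```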

The key ingredient we need to add is that treatment assignment is independent of the potential outcomes; that is

\(P(W | R_t, R_c) = P(W)\).

Now, given all this information, we obtain point probabilities: \(P(R_t = 1) = 0.8\) and \(P(R_c = 1) = 0.3\). These are equal to the probabilities that were conditional on what group a participant was assigned to.

Another curiosity is what we can infer about the joint distribution, \(P(R_t, R_c)\). The results are probability intervals:

              \(R_t = 0\)    \(R_t = 1\)
\(R_c = 0\)   \([0, 0.2]\)   \([0.5, 0.7]\)   \(0.7\)
\(R_c = 1\)   \([0, 0.2]\)   \([0.1, 0.3]\)   \(0.3\)
              \(0.2\)        \(0.8\)

This illustrates, in a toy example, the more general problem that the joint distribution of potential outcomes typically cannot be obtained from an RCT. However, the joint probabilities are constrained by the marginals.
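The probability intervals in the table are the Fréchet bounds implied by the marginals. A minimal sketch (variable and function names are mine):

```python
# Fréchet bounds: given only the marginals P(A) and P(B), the joint
# probability P(A and B) must lie in [max(0, P(A) + P(B) - 1), min(P(A), P(B))].

def frechet(p_a, p_b):
    """Bounds on P(A and B) given only the marginals P(A) and P(B)."""
    return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)

p_rt1, p_rc1 = 0.8, 0.3  # the point marginals obtained above

for rt in (0, 1):
    for rc in (0, 1):
        pa = p_rt1 if rt else 1 - p_rt1
        pb = p_rc1 if rc else 1 - p_rc1
        print(f"P(R_t={rt}, R_c={rc}) in {frechet(pa, pb)}")
```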

Wisdom(?) from the 1918 Dadaist manifesto by Tristan Tzara

The 1st and 2nd DADA Art Manifestos are online over there.

  • “Psychoanalysis is a dangerous disease, it deadens man’s anti-real inclinations and systematises the bourgeoisie.”
  • “Dialectics is an amusing machine that leads us (in banal fashion) to the opinions which we would have held in any case.”
  • “People observe, they look at things from one or several points of view, they choose them from amongst the millions that exist. Experience too is the result of chance and of individual abilities.”
  • “Logic is a complication. Logic is always false. It draws the superficial threads of concepts and words towards illusory conclusions and centres.”
  • “What we need are strong straightforward, precise works which will be forever misunderstood.”

Efficacy RCTs as survey twins

Surveys attempt to estimate a quantity of a finite population using a probability sample from that population. How people ended up in the population is somebody else’s problem – demographers, perhaps.

Survey participants are sampled at random from this finite population without replacement. Part a of the figure below illustrates. Green blocks denote people who are surveyed and from whom we collect data. Grey blocks denote people we have not surveyed; we would like to infer what their responses would have been, if they had been surveyed too.

RCTs randomly assign participants to treatment or control conditions. This is illustrated in part b of the figure above: green cells denote treatment and purple cells denote control. There are no grey cells since we have gathered information from everyone in the finite population. But in a way, we haven’t really.

An alternative way to view efficacy RCTs that aim to estimate a sample average treatment effect (SATE) is as a kind of survey. This is illustrated in part c. Now the grey cells return.

There is a finite population of people who present for a trial, often with little known about how they ended up in that population – not dissimilarly to the situation for a survey. (But who studies how they end up in a trial – trial demographers?)

Randomly assigning people to conditions generates two finite populations of theoretical twins, identical except for treatment assignment and the consequences thereafter. One theoretical twin receives treatment and the other receives control. But we only obtain the response from one of the twins, i.e., either the treatment or the control twin. (You could also think of these theoretical twins’ outcomes as potential outcomes.)

Looking individually at one of the two theoretical populations, the random assignment to conditions has generated a random sample from that population. We would like to know what the outcome would have been for everyone in the treatment condition, if everyone had been assigned treatment. Similarly for control. Instead, we have a pair of surveys that sample from these two populations.

Viewing the Table 1 fallacy through the survey twin lens

There is a common practice of testing for differences in covariates between treatment and control. This is the Table 1 fallacy (see also Dean Eckles’s take on whether it really is a fallacy). Let’s see how it can be explained using survey twins.

Firstly, we have a census of covariates for the whole finite population at baseline, so we know the means with perfect precision. Treatment and control groups are surveys of the same population, so clearly no statistical test is needed. The sample means in the two groups are likely to differ from each other and from the finite population mean of both groups combined. No surprises there: we wouldn’t expect a survey mean to be identical to the population mean. That’s why we use confidence intervals, or large samples so that the confidence intervals are very narrow.

What’s the correct analysis of an RCT?

It’s common to analyse RCT data using a linear regression model. The outcome variable is the endpoint, predictors are treatment group and covariates. This is also known as an ANCOVA. This analysis is easy to understand if the trial participants are a simple random sample from some infinite population. But this is not what we have in efficacy trials as modelled by survey twins above. If the total number of participants in the trial is 1000, then we have a finite population of 1000 in the treatment group and a finite population of 1000 in the control group – together, 2000. In total we have 1000 observations, though, split in some proportion between treatment and control.

Following through on this reasoning, it sounds like the correct analysis uses a stratified independent sampling design with two strata, coinciding with treatment and control groups. The strata populations are both 1000, and a finite population correction should be applied accordingly.

It’s a little more complicated, as I discovered in a paper by Reichardt and Gollob (1999), who independently derived results found by Neyman (1923/1990). Their results highlight a wrinkle in the argument when conducting a t-test on two groups for finite populations as described above. This has general implications for analyses with covariates too. The wrinkle is, the two theoretical populations are not independent of each other.

The authors derive the standard error of the mean difference between X and Y as

\(\displaystyle \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y} - \left[ \frac{(\sigma_X - \sigma_Y)^2}{N} + \frac{2(1-\rho) \sigma_X \sigma_Y}{N} \right]}\),

where \(\sigma_X^2\) and \(\sigma_Y^2\) are the variances of the two groups, \(n_X\) and \(n_Y\) are the observed group sample sizes, and \(N\) is the total sample (the finite population) size. Finally, \(\rho\) is the unobservable correlation between treatment and control outcomes for each participant – unobservable because we only get either the treatment outcome or the control outcome for each participant, never both. The terms in square brackets correct for the finite population.

If the variances are equal (\(\sigma_X = \sigma_Y\)) and the correlation \(\rho = 1\), then the correction vanishes (glance back at the numerators in the square brackets to see why). This is great news if you are willing to assume that treatments have constant effects on all participants (an assumption known as unit-treatment additivity): the same regression analysis that you would use assuming a simple random sample from an infinite population applies.

If the variances are equal and the correlation is 0, then this is the same standard error as in the stratified independent sampling design with two strata described above. Or at least it was for the few examples I tried.

If the variances can be different and the correlation is one, then this is the same standard error as per Welch’s two-sample t-test.
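The corrected standard error can be sketched in code; the function name and example numbers below are mine, not from the paper:

```python
import math

# A sketch of the Reichardt and Gollob (1999) standard error above.
def se_mean_diff(sd_x, sd_y, n_x, n_y, N, rho):
    """SE of the mean difference, with the finite-population correction in brackets."""
    uncorrected = sd_x**2 / n_x + sd_y**2 / n_y
    correction = ((sd_x - sd_y)**2 + 2 * (1 - rho) * sd_x * sd_y) / N
    return math.sqrt(uncorrected - correction)

# Equal variances and rho = 1: the correction vanishes and we recover
# the usual infinite-population formula sqrt(sd^2/n_x + sd^2/n_y).
print(se_mean_diff(1, 1, 500, 500, 1000, rho=1))

# Equal variances and rho = 0: the correction applies and the SE shrinks.
print(se_mean_diff(1, 1, 500, 500, 1000, rho=0))
```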

So, which correlation should we use? Reichardt and Gollob (1999) suggest using the reliability of the outcome measure to calculate an upper bound on the correlation. More recently, Aronow, Green, and Lee (2014) proved a result that puts bounds on the correlation based on the observed marginal distribution of outcomes, and provide R code to copy and paste to calculate it. It’s interesting that a problem highlighted a century ago on something so basic – what standard error we should use for an RCT – is still being investigated now.

References

Aronow, P. M., Green, D. P., & Lee, D. K. K. (2014). Sharp bounds on the variance in randomized experiments. Annals of Statistics, 42, 850–871.

Neyman, J. (1923/1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5, 465–472.

Reichardt, C. S., & Gollob, H. F. (1999). Justifying the Use and Increasing the Power of a t Test for a Randomized Experiment With a Convenience Sample. Psychological Methods, 4, 117–128.

 

Surveys as RCTs: an exercise in analogy

Surveys use probability samples from a finite population to estimate a quantity of that population, e.g., the percentage of people holding a particular view. What happens if we try to understand surveys using the concepts of a randomised controlled trial?

Those assigned to intervention get the survey questions – the questions are the intervention. Those assigned to control get nothing and are ignored entirely. Typically the probability of being assigned to intervention (being surveyed) is much smaller than that of being assigned to control (ignored).

We wish to estimate the percentage of people in the population who hold a particular view following the intervention (the survey questions). This percentage is the outcome measure. We observe the outcome for people randomly assigned to the survey. For the control group, we want to know what percentage would have held that view, if they had been assigned to the survey.

An interesting feature of this “intervention” of a survey is that we hope it does not change the percentage outcome. So, viewing a survey through this RCT lens, the average treatment effect (mean difference between intervention and control, survey and ignore) is assumed to be 0. But this might not hold. There may be a mean difference between people who have been asked to reflect on something and tell a researcher versus those who hold a view but have not told anyone. We cannot tell using this design.

Where did the population come from? In the RCT analogy, it would be the (typically nonprobability) sample of people who consented to take part in the trial. In the survey, it is a collection of people who found themselves living in a particular area, having the demographic profile of interest (satisfying the inclusion criteria), and being reachable via a sample frame. We usually do not care how they got there. Other researchers might, e.g., demographers studying migration or births. People often end up in an area because they found a job nearby or because they were born there – both events with an element of chance.

Researchers often worry whether an RCT’s results transfer to other settings. This is not an issue for surveys. In fact we might assume that people living in different areas hold different views. One aspect we might hope does transfer is how people interpret the questions.

Standard errors of marginal means in an RCT

Randomised controlled trials (RCTs) typically use a convenience sample to estimate the mean effect of a treatment for study participants. Participants are randomly assigned to one of (say) two conditions, and an unbiased estimate of the sample mean treatment effect is obtained by taking the difference of the two conditions’ mean outcomes. The estimand in such an RCT is sometimes called the sample average treatment effect (SATE).

Some papers report a standard error for the marginal mean outcomes in treatment and control groups using the textbook formula

\(\displaystyle \frac{\mathit{SD_g}}{\sqrt{n_g}}\),

where \(\mathit{SD_g}\) is the standard deviation of group \(g\) and \(n_g\) the number of participants assigned to that group.

This formula assumes a simple random sample with replacement from an infinite population, so does not work for a convenience sample (see Stephen Senn, A Standard Error). I am convinced, but curious what standard error for each group’s mean would be appropriate, if any. (You could stop here and argue that the marginal group means mean nothing anyway. The whole point of running a trial is to subtract off non-treatment explanations of change such as regression to the mean.)

Let’s consider a two-arm RCT with no covariates and a coin toss determining who receives treatment or control. What standard error would be appropriate for the mean treatment outcome? Let the total sample size be \(N\) and quantities for treatment and control use subscripts \(t\) and \(c\), respectively.

Treatment outcome mean of those who received treatment

If we focus on the mean for the \(n_t\) participants who were assigned to treatment, we have all observations for that group, so the standard error of the mean is 0. This feels like cheating.

Treatment outcome mean of everyone in the sample

Suppose we want to say something about the treatment outcome mean for all \(N\) participants in the trial, not only the \(n_t\) who were assigned to treatment.

To see how to think about this, consider a service evaluation of \(N\) patients mimicking everything about an RCT except that it assigns everyone to treatment and uses a coin toss to determine whether someone is included in the evaluation. This is now a survey of \(n\) participants, rather than a trial. We want to generalise results to the finite population of \(N\) patients from which we sampled.

Since the population is finite and the sampling is done without replacement, the standard error of the mean should be multiplied by a finite population correction,

\(\displaystyle \mathit{FPC} = \sqrt{\frac{N - n}{N - 1}}\).

This setup for a survey is equivalent to what we observe in the treatment group of an RCT. Randomly assigning participants to treatment gives us a random sample from a finite population, the sample frame of which we get by the end of the trial: all treatment and control participants. So we can estimate the SEM around the mean treatment outcome as:

\(\displaystyle \mathit{SEM_t} = \frac{\mathit{SD_t}}{\sqrt{n_t}} \sqrt{\frac{N - n_t}{N - 1}}\).

If, by chance (probability \(1/2^N\)), the coin delivers everyone to treatment, then \(N = n_t\) and the FPC reduces to zero, as does the standard error.
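As a minimal sketch of this FPC-adjusted standard error (the function name and example numbers are mine):

```python
import math

# SEM for a group of size n_g sampled without replacement
# from a finite population of N trial participants.
def sem_with_fpc(sd, n_g, N):
    """Textbook SEM multiplied by the finite population correction."""
    return (sd / math.sqrt(n_g)) * math.sqrt((N - n_g) / (N - 1))

print(sem_with_fpc(1.0, 500, 1000))   # smaller than the uncorrected 1/sqrt(500)
print(sem_with_fpc(1.0, 1000, 1000))  # 0.0: everyone ended up in treatment
```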

Conclusion

If the marginal outcome means mean anything, then there are a couple of standard errors you could use, even with a convenience sample. But the marginal means seem irrelevant when the main reason for running an RCT is to subtract off non-treatment explanations of change following treatment.

If you enjoyed this, you may now be wondering what standard error to use when estimating a sample average treatment effect. Try Efficacy RCTs as survey twins.

Curiosities: two pairs of ideas that intrigue/trouble me

Troubles with theories

  • Evidence alone can’t determine which scientific theories we should believe since more than one theory will often be consistent with the available evidence.
  • In mathematics, there exist unintended (“nonstandard”) models of formal theories. An example theory where this is the case is Peano Arithmetic. The intended (“standard”) model is the set of (countably infinite) natural numbers (0, 1, 2, 3, …) and operations thereon that we know and love. But there are non-standard models of Peano Arithmetic that are uncountably infinite. That’s weird. In fact, any first-order logic theory which has a countably infinite model also has an uncountably infinite model (upward Löwenheim–Skolem Theorem).

Boundedness of selves in spacetime

  • We often think of our minds as bounded by our skull. This has been challenged by the extended cognition thesis (Andy Clark and David Chalmers). The gist: we’re happy to accept that we can have beliefs that we’re not conscious of at any given time; they lie dormant until called upon for, say, an argument. Selves are more than what we are conscious of. But we also scribble stuff in notebooks and (these days) apps, set reminders, etc. These notes and reminders are similarly beyond consciousness but also thoroughly outside our heads. They seem essential for cognition.
  • There’s a problem with 4D block universe conceptions of spacetime: if all of time – past, present, and future – already exists at points in this 4D geometry then how do we consciously experience the passage of time? Assuming a material conception of conscious experience, each individual experience is scattered across spacetime and individually frozen. No passage. Natalja Deng (2019) points to a solution. Rather than trying to work out how these individual experiences can lead to an experience of passage, “recognize that the fundamental experiential unit is itself temporally extended, and use this to explain how there can be an experience of a temporally extended content.”