Statistics, plain and sample.

No frills. No shenanigans. Just statistics.

Power and Sample Size for Repeated Measures ANOVA with R


One of my colleagues is an academic physical therapist (PT), and he's working on a paper for his colleagues related to power, sample size, and navigating the thicket of trouble that surrounds those two things. We recently got together to walk through some of the issues, and I thought I would share some of the wildlife we observed along the way. If you just want the code and don't care about the harangue, see this gist on GitHub.

The problem

Suppose you are a PT, and you've come up with a brand new exercise method that you think will decrease neck pain, say. How can you demonstrate that your method is effective? Of course, you collect data and show that people using your method have significantly lower neck pain than those from a control group.

The standard approach in the PT literature to analyze said data is repeated measures ANOVA. (Yes, those guys should really be using mixed-effects models, but those haven't quite taken off yet.) There are two groups: the "Treatment" group does your new exercise method, and a "Sham" group does nothing (or just the placebo exercise method). For each Subject, you measure their pain at time 0, 15 minutes, 48 hours, and 96 hours. Pain is measured by an index (there are several); the one we're using is something called NDI, which stands for "Neck Disability Index". The index ranges from 0 to 100 (more on this later). There is some brief information about the index here.

Now comes the question: how many people should you recruit for your study? The answer is: it depends. "On what?" Well, it depends on how good the statistical test is, and how good your method is, but more to the point, it depends on the effect size, that is, how far apart the two groups are, given that the method actually works.

I encounter some variant of this question a lot. I used to go look for research papers where somebody's worked out the F-test and sample sizes required, and pore over tables and tables. Then I resorted to online calculators (the proprietary versions were too expensive for my department!), which are fine, but they all use different notation and it takes a lot of time poring through documentation (which is often poor, pardon the pun) to recall how it works. And I was never really sure whether I'd got it right, or if I had screwed up with a parameter somewhere.

Some of the calculators advertise Cohen's effect sizes, which are usually stated as something like "small", "medium", and "large", with accompanying numerical values. Russell Lenth calls these "T-shirt effect sizes". I agree with him.

Nowadays the fashionable people say, "Just run a simulation and estimate the power," but the materials available online are short on detail. So my buddy and I worked it all out from start to finish for this simple example, in the hopes that by sharing this information people can get a better idea of how to do it the right way, the first time.

How to attack it

The avenue of attack is simple: for a given sample size,

  1. use prior research and practitioner experience to decide what difference would be "meaningful" to detect,
  2. simulate data consistent with the above difference and run the desired statistical test, recording whether or not it rejects, and
  3. repeat step 2 hundreds of times. An estimate of the power (for that sample size) is the proportion of times that the test rejected.

If the power isn't high enough, then increase the given sample size and start over. The value we get is just an estimate of the power, but we can increase the precision of our estimate by increasing the number of repetitions in step 3.

What you find when you start down this path is that there is a lot of information required to be able to answer the question. Of course, this information had been hiding behind the scenes all along, even with those old research papers and online calculators, but the other methods make it easy to gloss over the details, or they're so complicated that researchers will give up and fall back to something like Cohen's T-shirt effect sizes.

Now for the legwork

The details we need include: A) prior knowledge of how average pain decreases for people in the Sham group, B) some idea about the variability of scores, C) how scores would be correlated with one another over time, and D) how much better the Treat group would need to be in order for the new procedure to be considered clinically meaningful.

As a first step, the PT sat down and filled in the following table.

        0 hrs   15 min   48 hrs   96 hrs
Treat     37      32       20       15
Sham      37      32       25       22

All of the entries in the above table represent population mean NDI scores for people in the respective groups at the respective measurement times, and were filled in based on prior research and educated guesses by the PT. It was known from other studies that NDI scores have a standard deviation of around 12 at the first measurement, and that the standard deviations have been observed to decrease over time.

Note: we could have assumed a simpler model for the means. For example, we could have assumed that mean NDI was linear, with possibly different slopes/intercepts for the Treat/Sham groups. Prior info available to the PT said that such an assumption wasn't reasonable for this example.

Repeated measures designs assume sphericity for the exact F tests to hold: the variance of the differences, \(\mathrm{Var}(X_{i} - X_{j})\), must be the same for every pair of time points \(i\) and \(j\), so we need to specify just one such variance. As it turns out, this last choice implicitly determines all of the remaining covariance structure. We set the standard deviation of the differences to \(9\) (a variance of \(81\)).

Finally we do some coding

We are now ready to turn on the computer. We first initialize the parameters we'll need, next we set up the independent-variable data, then we do the simulation, and finally we rinse-and-repeat. Let's go.

nPerGroup <- 10
nTime     <- 4
muTreat   <- c(37, 32, 20, 15)
muSham    <- c(37, 32, 25, 22)
stdevs    <- c(12, 10, 8, 6)
stdiff    <- 9
nSim      <- 500

All of the above should be self-explanatory. Next comes setting up the data - creatively named theData - for the independent variables. Just for the sake of argument I used code to generate the data frame, but we wouldn't have had to. We could have imported an external text file had we wished.

Subject <- factor(1:(nPerGroup*2))
Time <- factor(1:nTime, labels = c("0min", "15min", "48hrs", "96hrs"))

theData <- expand.grid(Time, Subject)
names(theData) <- c("Time", "Subject")

tmp <- rep(c("Treat", "Sham"), each = nPerGroup * nTime)
theData$Method <- factor(tmp)

Again, the above should be self-explanatory for the most part. The data are in "long" form, where each subject appears over multiple rows. In fact, let's take a look at the data frame to make sure it looks right.

   Time Subject Method
1  0min       1  Treat
2 15min       1  Treat
3 48hrs       1  Treat
4 96hrs       1  Treat
5  0min       2  Treat
6 15min       2  Treat

Lookin' good. Now for the fun part. We generate the single remaining column, the NDI scores. The repeated measures model is multivariate normal. The population covariance matrix is a little bit tricky, but it's not too bad, and to make things easy we'll assume both groups have the same covariance matrix. See the original paper by Huynh and Feldt for details.

# to set up variance-covariance matrix
ones <- rep(1, nTime)
A <- stdevs^2 %o% ones
B <- (A + t(A) + (stdiff^2)*(diag(nTime) - ones %o% ones))/2
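As a quick sanity check (an extra step, not in the original script), we can verify that this B really satisfies sphericity: the implied variance of each pairwise difference, B[i,i] + B[j,j] - 2*B[i,j], should equal stdiff^2 = 81 for every pair of time points.

```r
# parameters from the post
nTime  <- 4
stdevs <- c(12, 10, 8, 6)
stdiff <- 9

ones <- rep(1, nTime)
A <- stdevs^2 %o% ones
B <- (A + t(A) + (stdiff^2) * (diag(nTime) - ones %o% ones)) / 2

# Var(X_i - X_j) implied by B, for every pair of time points i, j
diffVars <- outer(diag(B), diag(B), "+") - 2 * B
diffVars[lower.tri(diffVars)]  # every entry is stdiff^2 = 81
```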

We simulate with the mvrnorm function from the MASS package.

library(MASS)  # for mvrnorm

tmp1 <- mvrnorm(nPerGroup, mu = muTreat, Sigma = B)
tmp2 <- mvrnorm(nPerGroup, mu = muSham, Sigma = B)
theData$NDI <- c(as.vector(t(tmp1)), as.vector(t(tmp2)))

Now that we have our data, we can run the test:

aovComp <- aov(NDI ~ Time*Method + Error(Subject/Time), theData)
summary(aovComp)
Error: Subject
          Df Sum Sq Mean Sq F value Pr(>F)
Method     1  157.9   157.9   1.499  0.237
Residuals 18 1896.2   105.3               

Error: Subject:Time
            Df Sum Sq Mean Sq F value   Pr(>F)    
Time         3   5082  1693.9  43.066 2.38e-14 ***
Time:Method  3    119    39.6   1.006    0.397    
Residuals   54   2124    39.3                     
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Terrific! For these data, we observe a highly significant Time effect (this should be obvious given our table above), a non-significant Method main effect, and a non-significant Time:Method interaction. If we think about our model and what we're interested in, it's the interaction that we care about and would like to detect. If our significance level had been \(\alpha = 0.05\), we would not have rejected this time, but who knows what would happen next time.

Now it's time to rinse-and-repeat, which we accomplish with the replicate function. Before we get there, though, let's look at a plot. There are several relevant ones, but in the interest of brevity let's satisfy ourselves with an interaction.plot:

with(theData, interaction.plot(Time, Method, NDI))


Everything is going according to plan. There is definitely a Time effect (the lines both slope downward) but there isn't any evidence of an interaction (the lines have similar slopes).

On to rinse-and-repeat, we first set up the function that runs the test once:

runTest <- function(){
  tmp1 <- mvrnorm(nPerGroup, mu = muTreat, Sigma = B)
  tmp2 <- mvrnorm(nPerGroup, mu = muSham, Sigma = B)
  theData$NDI <- c(as.vector(t(tmp1)), as.vector(t(tmp2)))
  aovComp <- aov(NDI ~ Time*Method + Error(Subject/Time), theData)
  b <- summary(aovComp)$'Error: Subject:Time'[[1]][2,5]
  b < 0.05
}

and finally do the repeating:

mean(replicate(nSim, runTest()))
[1] 0.372

Whoa! The power is 0.372? That's pretty low. We recall that this is just an estimate of power - how precise is the estimate? The standard error of \(\hat{p}\) is approximately \(\sqrt{\hat{p}(1 - \hat{p})/n}\), so in our case, our estimate's standard error is approximately 0.022. That means we are approximately 95% confident that the true power at this particular alternative is covered by the interval \([0.329,0.415]\).
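The arithmetic behind those numbers fits in a few lines (pHat and the plus-or-minus 2 SE interval below are our shorthand for the quantities in the paragraph above):

```r
# standard error and rough 95% interval for the estimated power
pHat <- 0.372
nSim <- 500
se   <- sqrt(pHat * (1 - pHat) / nSim)
ci   <- pHat + c(-1, 1) * 2 * se   # estimate plus/minus 2 standard errors
round(se, 3)  # 0.022
round(ci, 3)  # 0.329 0.415
```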

Standard practice is to shoot for a power of around 0.80 (that is, a Type II error rate of \(\beta = 0.20\)), so our power isn't even close to what we'd need. We can increase power by increasing sample size (the parameter nPerGroup). A larger sample size means a longer time needed to run the simulation. Below are some results of running the above script at assorted sample sizes.

nPerGroup   Power (estimate)   SE (approx)

Now we're talking. It looks like somewhere between 20 and 30 subjects per group would be enough to detect the clinically meaningful difference proposed above with a power of 0.80.
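To produce a table like the one above without hand-editing nPerGroup and re-sourcing the script each time, one option is to wrap the whole pipeline in a function of the per-group sample size. This is a sketch (estimatePower is our name for it, not part of the original script); it restates the parameters so it runs on its own, and the estimates will of course vary from run to run.

```r
library(MASS)  # for mvrnorm

# parameters restated from earlier in the post
nTime   <- 4
muTreat <- c(37, 32, 20, 15)
muSham  <- c(37, 32, 25, 22)
stdevs  <- c(12, 10, 8, 6)
stdiff  <- 9

ones <- rep(1, nTime)
A <- stdevs^2 %o% ones
B <- (A + t(A) + (stdiff^2) * (diag(nTime) - ones %o% ones)) / 2

# estimate power at a given per-group sample size by simulation
estimatePower <- function(nPerGroup, nSim = 500) {
  theData <- expand.grid(
    Time    = factor(1:nTime, labels = c("0min", "15min", "48hrs", "96hrs")),
    Subject = factor(1:(nPerGroup * 2)))
  theData$Method <- factor(rep(c("Treat", "Sham"), each = nPerGroup * nTime))

  oneTest <- function() {
    tmp1 <- mvrnorm(nPerGroup, mu = muTreat, Sigma = B)
    tmp2 <- mvrnorm(nPerGroup, mu = muSham,  Sigma = B)
    theData$NDI <- c(as.vector(t(tmp1)), as.vector(t(tmp2)))
    fit <- aov(NDI ~ Time * Method + Error(Subject / Time), theData)
    summary(fit)$'Error: Subject:Time'[[1]][2, 5] < 0.05
  }
  mean(replicate(nSim, oneTest()))
}

sapply(c(10, 20, 30), estimatePower)  # estimated power at each sample size
```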

Unfortunately, the joke is on us. Because, as it happens, it's a tall order for a lone, practicing PT (around here) to snare 60 humans with neck pain for a research study. A person would need to be in (or travel to) a heavily populated area, and even then there would be dropout: people not showing up for subsequent appointments.

So what can we do?

  1. Modify the research details. If we take a closer look at the table, there isn't an expected difference in the means until 48 hours, so why not measure differently, say, at 0, 48, 96, and 144 hours? Is there something else about the measurement process we could change to decrease the variance?
  2. Use a different test. We are going with boilerplate repeated-measures ANOVA here. Is that really the best choice? What would happen if we tried the mixed-effects approach?
  3. Take a second look at the model. We should not only double-check our parameter choices, but rethink: is the repeated-measures model (multivariate normal) the most appropriate? Is it reasonable for the variance of differences at all time pairs to be identical? What about the covariance structure? There are others we could try, such as an autoregressive model (another arrow in the mixed-effects models' quiver).
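As a small taste of option 2, here is what the mixed-effects route might look like with the nlme package. This is a sketch only: it simulates one data set exactly as before and fits a random intercept per subject, and we are not claiming this particular model is the right one for the problem.

```r
library(MASS)   # for mvrnorm
library(nlme)   # for lme; ships with the base R distribution

# regenerate one data set as in the post
nPerGroup <- 10; nTime <- 4
muTreat <- c(37, 32, 20, 15); muSham <- c(37, 32, 25, 22)
stdevs  <- c(12, 10, 8, 6);   stdiff <- 9
ones <- rep(1, nTime)
A <- stdevs^2 %o% ones
B <- (A + t(A) + (stdiff^2) * (diag(nTime) - ones %o% ones)) / 2

theData <- expand.grid(
  Time    = factor(1:nTime, labels = c("0min", "15min", "48hrs", "96hrs")),
  Subject = factor(1:(nPerGroup * 2)))
theData$Method <- factor(rep(c("Treat", "Sham"), each = nPerGroup * nTime))
theData$NDI <- c(as.vector(t(mvrnorm(nPerGroup, muTreat, B))),
                 as.vector(t(mvrnorm(nPerGroup, muSham,  B))))

# random intercept for each subject; compare the F table with aov's output
fit <- lme(NDI ~ Time * Method, random = ~ 1 | Subject, data = theData)
anova(fit)   # the Time:Method row is the interaction test
```

From here the same rinse-and-repeat idea applies: extract the Time:Method p-value from anova(fit) inside the simulation loop instead of from the aov summary.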

Other things to keep in mind



1 John von Neumann once said, "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."



G. Jay Kerns (contact info)