ANOVA in R

Cheatsheet

Published

September 6, 2024

License

This work was developed using resources that are available under a Creative Commons Attribution 4.0 International License, made available on the SOLES Open Educational Resources repository by the School of Life and Environmental Sciences, The University of Sydney.

Assumed knowledge

You know how to install and load packages in R.
You know how to import data into R.
You recognise data frames and vectors.
You know what an ANOVA test is and when to use it.

Data structure

The data should be in a long format (also known as tidy data), where each row is an observation and each column is a variable (Figure 1). If your data is not already structured this way, reshape it manually in a spreadsheet program or in R using the pivot_longer() function from the tidyr package.

Sex	BW
F	2.15
M	2.55
F	2.95
F	2.70
M	2.20
F	1.85
M	2.55
M	2.60

F	M
2.15	2.55
2.95	2.20
2.70	2.55
1.85	2.60

Figure 1: Data should be in long format (left) where each row is an observation and each column is a variable. This is the preferred format for most statistical software. Wide format (right) is also common, but may require additional steps to analyse or visualise in some instances.

Data

For this cheatsheet we will use data from the penguins dataset from the palmerpenguins package. You may need to install this package:

install.packages("palmerpenguins")

About

ANOVA summaries are often used when the model’s predictors/explanatory variables are categorical. This cheatsheet shows you how to perform ANOVA techniqus in R.

ANOVA answers questions such as:

Are there differences in plant height between these two locations? (Yes, similar to a two-sample t-test).
Is location a significant predictor of species diversity?
Is the mean height of kangaroos different between different species?

Interpretation of the results is not covered here.

R packages used

tidyverse, car, emmeans, gt, gtsummary, rstatix, palmerpenguins

Similarity between ANOVA and linear regression

All of the below are equivalent ways to perform an ANOVA test in R. The first method is the most common but method 4 is most flexible. Note that native R pipes (|>) are used in the examples below.

## ANOVA
aov(body_mass_g ~ species, data = penguins) |> summary() # method 1

## Using GLM (multiple ways)
lm(body_mass_g ~ species, data = penguins) # method 2; same p-values
lm(body_mass_g ~ species, data = penguins) |> anova() # method 3
lm(body_mass_g ~ species, data = penguins) |> car::Anova() # method 4

ANOVA types

All ANOVA tables are created using the car::Anova() function, which are more flexible and can create type II and type III ANOVA tables rapidly (useful for unbalanced designs and/or with interactions).

fit01 <- lm(body_mass_g ~ species, data = penguins)
car::Anova(fit01) # prints the ANOVA table

Anova Table (Type II tests)

Response: body_mass_g
             Sum Sq  Df F value    Pr(>F)    
species   146864214   2  343.63 < 2.2e-16 ***
Residuals  72443483 339                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit02 <- lm(body_mass_g ~ species + sex, data = penguins)
car::Anova(fit02) # prints the ANOVA table

Anova Table (Type II tests)

Response: body_mass_g
             Sum Sq  Df F value    Pr(>F)    
species   143401584   2  715.29 < 2.2e-16 ***
sex        37090262   1  370.01 < 2.2e-16 ***
Residuals  32979185 329                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit03 <- lm(body_mass_g ~ species * sex, data = penguins)
car::Anova(fit03) # prints the ANOVA table

Anova Table (Type II tests)

Response: body_mass_g
               Sum Sq  Df F value    Pr(>F)    
species     143401584   2 749.016 < 2.2e-16 ***
sex          37090262   1 387.460 < 2.2e-16 ***
species:sex   1676557   2   8.757 0.0001973 ***
Residuals    31302628 327                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Assumptions

par(mfrow = c(2, 2)) # creates a 2x2 grid for plots
plot(fit01)

Interpreting effects using estimated marginal means

The simplest way to report pairwise comparisons is to plot the estimated marginal means (EMMs) and their confidence intervals.

emm1 <- emmeans(fit01, ~ species)
plot(emm1, comparisons = TRUE)

Reporting of the plots can be supplemented with a table of the estimated marginal means (perhaps in the appendix or as supplementary material).

emm1

 species   emmean   SE  df lower.CL upper.CL
 Adelie      3701 37.6 339     3627     3775
 Chinstrap   3733 56.1 339     3623     3843
 Gentoo      5076 41.7 339     4994     5158

Confidence level used: 0.95

As there are two main effects, we need to plot the estimated marginal means for each main effect separately.

emm2 <- emmeans(fit02, ~ species)
plot(emm2, comparisons = TRUE)

emm3 <- emmeans(fit02, ~ sex)
plot(emm3, comparisons = TRUE)

If there is a significant interaction, the main effects should not be interpreted without considering the interaction. Two plots are needed to interpret a two-way interaction.

emmip(fit03, sex ~ species)

emmip(fit03, species ~ sex)

Other resources

It might be worthwhile to use the performance package to assess model fit (including assumptions using check_model()).
I use this a lot: the interactions package for visualising interactions in GLM models. However it is very technical and not for beginners – use if you are comfortable with R.
The rstatix package has a useful function called anova_summary() to produce almost publishable ANOVA tables. You will still need to manipulate it to make it look nice.