Sex | BW |
---|---|
F | 2.15 |
M | 2.55 |
F | 2.95 |
F | 2.70 |
M | 2.20 |
F | 1.85 |
M | 2.55 |
M | 2.60 |
ANOVA in R
Cheatsheet
This work was developed using resources that are available under a Creative Commons Attribution 4.0 International License, made available on the SOLES Open Educational Resources repository by the School of Life and Environmental Sciences, The University of Sydney.
- You know how to install and load packages in R.
- You know how to import data into R.
- You recognise data frames and vectors.
- You know what an ANOVA test is and when to use it.
The data should be in a long format (also known as tidy data), where each row is an observation and each column is a variable (Figure 1). If your data is not already structured this way, reshape it manually in a spreadsheet program or in R using the pivot_longer()
function from the tidyr
package.
F | M |
---|---|
2.15 | 2.55 |
2.95 | 2.20 |
2.70 | 2.55 |
1.85 | 2.60 |
For this cheatsheet we will use data from the penguins dataset from the palmerpenguins
package. You may need to install this package:
install.packages("palmerpenguins")
About
ANOVA summaries are often used when the model’s predictors/explanatory variables are categorical. This cheatsheet shows you how to perform ANOVA techniqus in R.
ANOVA answers questions such as:
- Are there differences in plant height between these two locations? (Yes, similar to a two-sample t-test).
- Is location a significant predictor of species diversity?
- Is the mean height of kangaroos different between different species?
Interpretation of the results is not covered here.
R packages used
tidyverse
, car
, emmeans
, gt
, gtsummary
, rstatix
, palmerpenguins
Similarity between ANOVA and linear regression
All of the below are equivalent ways to perform an ANOVA test in R. The first method is the most common but method 4 is most flexible. Note that native R pipes (|>
) are used in the examples below.
## ANOVA
aov(body_mass_g ~ species, data = penguins) |> summary() # method 1
## Using GLM (multiple ways)
lm(body_mass_g ~ species, data = penguins) # method 2; same p-values
lm(body_mass_g ~ species, data = penguins) |> anova() # method 3
lm(body_mass_g ~ species, data = penguins) |> car::Anova() # method 4
ANOVA types
All ANOVA tables are created using the car::Anova()
function, which are more flexible and can create type II and type III ANOVA tables rapidly (useful for unbalanced designs and/or with interactions).
<- lm(body_mass_g ~ species, data = penguins)
fit01 ::Anova(fit01) # prints the ANOVA table car
Anova Table (Type II tests)
Response: body_mass_g
Sum Sq Df F value Pr(>F)
species 146864214 2 343.63 < 2.2e-16 ***
Residuals 72443483 339
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
<- lm(body_mass_g ~ species + sex, data = penguins)
fit02 ::Anova(fit02) # prints the ANOVA table car
Anova Table (Type II tests)
Response: body_mass_g
Sum Sq Df F value Pr(>F)
species 143401584 2 715.29 < 2.2e-16 ***
sex 37090262 1 370.01 < 2.2e-16 ***
Residuals 32979185 329
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
<- lm(body_mass_g ~ species * sex, data = penguins)
fit03 ::Anova(fit03) # prints the ANOVA table car
Anova Table (Type II tests)
Response: body_mass_g
Sum Sq Df F value Pr(>F)
species 143401584 2 749.016 < 2.2e-16 ***
sex 37090262 1 387.460 < 2.2e-16 ***
species:sex 1676557 2 8.757 0.0001973 ***
Residuals 31302628 327
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Assumptions
Interpreting effects using estimated marginal means
The simplest way to report pairwise comparisons is to plot the estimated marginal means (EMMs) and their confidence intervals.
Reporting of the plots can be supplemented with a table of the estimated marginal means (perhaps in the appendix or as supplementary material).
emm1
species emmean SE df lower.CL upper.CL
Adelie 3701 37.6 339 3627 3775
Chinstrap 3733 56.1 339 3623 3843
Gentoo 5076 41.7 339 4994 5158
Confidence level used: 0.95
As there are two main effects, we need to plot the estimated marginal means for each main effect separately.
Other resources
- It might be worthwhile to use the
performance
package to assess model fit (including assumptions usingcheck_model()
). - I use this a lot: the
interactions
package for visualising interactions in GLM models. However it is very technical and not for beginners – use if you are comfortable with R. - The
rstatix
package has a useful function calledanova_summary()
to produce almost publishable ANOVA tables. You will still need to manipulate it to make it look nice.