Sex | BW |
---|---|
F | 2.15 |
M | 2.55 |
F | 2.95 |
F | 2.70 |
M | 2.20 |
F | 1.85 |
M | 2.55 |
M | 2.60 |
Linear regression in R
Cheatsheet
This work was developed using resources that are available under a Creative Commons Attribution 4.0 International License, made available on the SOLES Open Educational Resources repository by the School of Life and Environmental Sciences, The University of Sydney.
Assumed knowledge
- You know how to install and load packages in R.
- You know how to import data into R.
- You recognise data frames and vectors.
The data should be in a long format (also known as tidy data), where each row is an observation and each column is a variable (Figure 1). If your data is not already structured this way, reshape it manually in a spreadsheet program or in R using the pivot_longer() function from the tidyr package.
F | M |
---|---|
2.15 | 2.55 |
2.95 | 2.20 |
2.70 | 2.55 |
1.85 | 2.60 |
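The reshaping step described above can be sketched as follows. This is a minimal example, assuming the tidyr package is installed; the object names wide and long are ours, and the values are copied from the wide table above.

```r
library(tidyr)

# Wide layout: one column per sex (values copied from the wide table)
wide <- data.frame(
  F = c(2.15, 2.95, 2.70, 1.85),
  M = c(2.55, 2.20, 2.55, 2.60)
)

# Stack both columns: old column names go to "Sex", values go to "BW"
long <- pivot_longer(
  wide,
  cols = everything(),
  names_to = "Sex",
  values_to = "BW"
)

long
```

Each row of long is now a single observation, which is the layout that lm() expects.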
For this cheatsheet we will use the penguins dataset from the palmerpenguins package. You may need to install this package before loading it:
install.packages("palmerpenguins") # run once, only if the package is not installed
library(palmerpenguins)
data(penguins)
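Before fitting any model it can help to confirm that the data loaded as expected. A minimal sketch, assuming palmerpenguins is already installed; which checks to run is our choice, not a requirement of the package.

```r
library(palmerpenguins)
data(penguins)

# Variable names and types (numeric vs factor)
str(penguins)

# Missing values per column; by default lm() drops rows that have
# missing values in any of the model variables
colSums(is.na(penguins))
```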
About
Regression analysis is one of the most commonly used statistical techniques for modelling the relationship between a response variable and one or more predictors, which can be continuous, categorical or a mix of both. In fact, other techniques such as the t-test, ANOVA and ANCOVA are special cases of regression analysis, and many common non-parametric tests can be viewed the same way. In this cheatsheet, we will focus on linear regression.
R packages used
Implementing linear models
Simple linear regression
fit01 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
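Once a model is fitted, the usual next step is to inspect it. The sketch below refits the simple regression so that it runs on its own, assuming palmerpenguins is installed, and uses only base R functions.

```r
library(palmerpenguins)

fit01 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Coefficient table, residual standard error and R-squared
summary(fit01)

# Intercept and slope only
coef(fit01)

# 95% confidence intervals for the coefficients
confint(fit01)
```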
Multiple linear regression
fit02 <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm,
  data = penguins
)
Interactions
fit03 <- lm(body_mass_g ~ flipper_length_mm * bill_length_mm,
  data = penguins
)
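In an R formula, a * b is shorthand for a + b + a:b, that is, both main effects plus their interaction. The sketch below fits the model both ways to show that they are the same; the names fit_star and fit_long are ours, and palmerpenguins is assumed to be installed.

```r
library(palmerpenguins)

# Shorthand: main effects and interaction in one term
fit_star <- lm(body_mass_g ~ flipper_length_mm * bill_length_mm,
  data = penguins
)

# Written out: main effects plus an explicit interaction term
fit_long <- lm(
  body_mass_g ~ flipper_length_mm + bill_length_mm +
    flipper_length_mm:bill_length_mm,
  data = penguins
)

# Both formulas give the same coefficients
all.equal(coef(fit_star), coef(fit_long))
```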
Regression involving categorical variables
fit04 <- lm(body_mass_g ~ species + sex, data = penguins)
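When a predictor is a factor, lm() codes it as indicator (dummy) variables under the default treatment contrasts, using the first factor level as the reference. The sketch below shows where to look; it assumes palmerpenguins is installed and refits the model so that it runs on its own.

```r
library(palmerpenguins)

fit04 <- lm(body_mass_g ~ species + sex, data = penguins)

# The first level of each factor is the reference category
levels(penguins$species)
levels(penguins$sex)

# One coefficient per non-reference level; each estimates the difference
# from the reference level, holding the other predictor fixed
coef(fit04)
```

To use a different reference category, relevel() can be applied to the factor before fitting.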
Regression involving a mix of continuous and categorical variables
fit05 <- lm(body_mass_g ~ species + flipper_length_mm,
  data = penguins
)
Assumptions
Use the plot() function on the linear model object to check the assumptions of the linear regression model.
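A minimal sketch of the diagnostic plots, assuming palmerpenguins is installed. By default, plot() on an lm object draws four plots: residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage. The 2 x 2 layout is our choice for viewing them together.

```r
library(palmerpenguins)

fit01 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Arrange the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit01)
par(old_par) # restore the previous plotting layout
```

Roughly speaking, the first plot is used to judge linearity, the second normality of the residuals, the third equal variance, and the fourth influential observations.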
Viewing interactions
Use the emmeans() function from the emmeans package to interpret interactions in a linear model. For continuous covariates, set the cov.reduce argument to range so that estimates are computed at the minimum and maximum of each covariate, instead of at its mean (the default).
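A minimal sketch of this approach for the interaction model, assuming the emmeans and palmerpenguins packages are installed. The model is refitted so that the example runs on its own, and the object name emm is ours.

```r
library(palmerpenguins)
library(emmeans)

fit03 <- lm(body_mass_g ~ flipper_length_mm * bill_length_mm,
  data = penguins
)

# Estimated marginal means at the minimum and maximum of each covariate,
# giving one estimate per combination of the two ranges
emm <- emmeans(fit03, ~ flipper_length_mm * bill_length_mm,
  cov.reduce = range
)
emm
```

Comparing the effect of flipper length at the shortest and longest bill lengths shows how the slope changes across the other covariate, which is what the interaction term describes.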
Other resources
- It might be worthwhile to use the performance package to assess model fit (including assumptions using check_model()).
- I use this a lot: the interactions package for visualising interactions in GLM models. However, it is very technical and not for beginners; use it if you are comfortable with R.
- The gtsummary package is great for summarising regression models using tbl_regression(), but you may need to tweak it further to get the output you want. Another package that can do something similar is the sjPlot package, using tab_model(). Alternatively, you can manually create the table (sometimes it can be easier to copy numbers depending on your level of expertise).