Scatterplots
Getting Started
First, be sure you have installed ggformula. Remember, you only need to install the package once on your machine.
Then, be sure to load the package ggformula. Remember, you need to do this with each new Quarto/RMarkdown document or R session.
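A minimal sketch of both steps, assuming the package is installed from CRAN. The install line is commented out so that it only runs when you actually need it.

```r
# Install once per machine (uncomment to run; assumes access to CRAN)
# install.packages("ggformula")

# Load at the top of every new Quarto/RMarkdown document or R session
library(ggformula)
```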
Data for Examples
As a reminder (see Overview of Data Visualization), we will be using the penguins data from the palmerpenguins package:
library(palmerpenguins)
Here is a snippet of the data:
Palmer Penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.6 | 17.2 | 196 | 3550 | female | 2008 |
Adelie | Biscoe | 41.3 | 21.1 | 195 | 4400 | male | 2008 |
Adelie | Biscoe | 37.9 | 18.6 | 172 | 3150 | female | 2007 |
Adelie | Dream | 39.2 | 21.1 | 196 | 4150 | male | 2007 |
Adelie | Dream | 36.5 | 18.0 | 182 | 3150 | female | 2007 |
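To look at the data yourself, something like the following should work. Note that head() returns the first rows of the data, which are not necessarily the rows shown in the snippet above.

```r
library(palmerpenguins)

# First five rows of the penguins data
head(penguins, 5)

# Variable names and types
str(penguins)
```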
Scatterplots with Two Quantitative Variables
Basic Code
To visualize the relationship between two quantitative variables x and y, here is the general structure for a scatterplot.
gf_point(y ~ x,
         data = mydata)
The variable to the left of the ~ (the y position) will map to the y-axis, and the variable to the right (the x position) will map to the x-axis.
Run the code below to see an example using the quantitative variables bill_length_mm and bill_depth_mm from the penguins data. Then flip the placement of the variables in the y and x positions. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g. flipper_length_mm).
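A sketch of the example described here. Placing bill_depth_mm on the y-axis is an assumption, since the text does not say which variable goes where, and the object name p_basic is only for illustration.

```r
library(ggformula)
library(palmerpenguins)

# Assumption: bill depth on the y-axis, bill length on the x-axis
p_basic <- gf_point(bill_depth_mm ~ bill_length_mm,
                    data = penguins)
p_basic

# Same variables with the y and x positions flipped
gf_point(bill_length_mm ~ bill_depth_mm,
         data = penguins)

# Swapping in another quantitative variable
gf_point(bill_depth_mm ~ flipper_length_mm,
         data = penguins)
```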
Notice the warning produced from running the code. This is just a warning that there were rows (penguins) ignored due to missing data for the variables visualized.
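If you want to check how many rows the warning refers to, one option is to count the rows with a missing value in either plotted variable. This is a base R sketch, and the names plot_vars and n_dropped are made up for illustration.

```r
library(palmerpenguins)

# Keep only the two variables being plotted
plot_vars <- penguins[, c("bill_length_mm", "bill_depth_mm")]

# Count rows where at least one of them is missing
n_dropped <- sum(!complete.cases(plot_vars))
n_dropped
```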
Adding Labels
Descriptive labels are important for any visualization. We can add them to any ggformula plot by passing xlab =, ylab =, and title = to the plotting function.
gf_point(y ~ x,
         data = mydata,
         xlab = "X Axis Label",
         ylab = "Y Axis Label",
         title = "Descriptive Title")
Add labels and a title to the scatterplot of bill_length_mm and bill_depth_mm.
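One possible version of a labeled scatterplot is sketched below. The label text and the choice of y and x variables are assumptions; any clear, descriptive wording works.

```r
library(ggformula)
library(palmerpenguins)

# Label wording is illustrative only
p_labeled <- gf_point(bill_depth_mm ~ bill_length_mm,
                      data = penguins,
                      xlab = "Bill Length (mm)",
                      ylab = "Bill Depth (mm)",
                      title = "Bill Depth vs. Bill Length of Palmer Penguins")
p_labeled
```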
Other Modifications
We can add a few other modifications that are purely aesthetic, just to make our graphs look nicer or easier to read.
Changing the Color of the Points
We can color the points by setting color = to either one of R's built-in color names or a hex code.
gf_point(y ~ x,
         data = mydata,
         xlab = "X Axis Label",
         ylab = "Y Axis Label",
         title = "Descriptive Title",
         color = "darkcyan")
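As a sketch with the penguins data, the two calls below should draw the same plot: one uses the built-in name "darkcyan" and the other its hex code. The variable placement is an assumption, and the object names are only for illustration.

```r
library(ggformula)
library(palmerpenguins)

# Built-in color name
p_named <- gf_point(bill_depth_mm ~ bill_length_mm,
                    data = penguins,
                    color = "darkcyan")
p_named

# Hex code for the same color
p_hex <- gf_point(bill_depth_mm ~ bill_length_mm,
                  data = penguins,
                  color = "#008B8B")
p_hex
```

Because color is given as a quoted constant rather than with a ~, every point gets the same color.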
Changing the Theme
The package ggformula is built on top of another package called ggplot2, so any ggplot2 function can be added to a ggformula-generated graphic. For example, we can change the theme to a built-in theme. To connect ggplot2 functions, we will use a + at the end of the line preceding the function.
Try changing the theme of the following graph:
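A sketch of how a theme can be attached, using theme_minimal() as one arbitrary choice among ggplot2's built-in themes. ggplot2 is loaded explicitly here to be safe, and the object name p_theme is only for illustration.

```r
library(ggformula)
library(ggplot2)
library(palmerpenguins)

# The + goes at the end of the line before the ggplot2 function
p_theme <- gf_point(bill_depth_mm ~ bill_length_mm,
                    data = penguins,
                    color = "darkcyan") +
  theme_minimal()
p_theme
```

Other built-in themes such as theme_bw() or theme_classic() can be swapped in the same way.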
Scatterplots for Comparisons Across Groups
When two quantitative variables have been measured across multiple groups, we may be interested in comparing their relationship across the values/groups of a categorical variable. We can do this by varying the color and/or shape of our points based on the values/groups of the categorical variable, z.
gf_point(y ~ x,
         color = ~z, # color is the best way
         shape = ~z, # shape is another way, you can also do both
         data = mydata)
Here is the scatterplot of bill_length_mm and bill_depth_mm with the color varied by species, a categorical variable with values of Adelie, Chinstrap, and Gentoo. Modify the code below to change the point color to another categorical variable such as island or sex and see what happens. Try adding another categorical variable as the shape.
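A sketch of the plot described above, with bill depth on the y-axis as an assumption. The second call adds island as the shape; the object names are only for illustration.

```r
library(ggformula)
library(palmerpenguins)

# Color varies by species (note the ~ before the variable name)
p_species <- gf_point(bill_depth_mm ~ bill_length_mm,
                      color = ~species,
                      data = penguins)
p_species

# Color by species and shape by island
p_both <- gf_point(bill_depth_mm ~ bill_length_mm,
                   color = ~species,
                   shape = ~island,
                   data = penguins)
p_both
```

The ~ tells ggformula to map the aesthetic to a variable in the data rather than treat it as a fixed value.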
Overlaying a Linear Model
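We can overlay a fitted linear model of the form
\[\hat{y} = b_0 + b_1x\]
by chaining gf_lm() onto the scatterplot with the pipe |>.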
gf_point(y ~ x,
         data = mydata) |>
  gf_lm()
Run the code below to see an example using the quantitative variables bill_length_mm and bill_depth_mm from the penguins data with a linear model overlaid. How does the model change when you replace bill_length_mm with another quantitative variable from the penguins data (e.g. flipper_length_mm)? Add the appropriate labels to the graph (note you only need to do this within the gf_point() function).
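A sketch of both versions: one line for all penguins, and one line per species. The variable placement and label text are assumptions. In the second plot, color is repeated inside gf_lm() so that a separate line is fit for each species regardless of how aesthetics are inherited between layers.

```r
library(ggformula)
library(palmerpenguins)

# One linear model for all penguins; labels go in gf_point() only
p_overall <- gf_point(bill_depth_mm ~ bill_length_mm,
                      data = penguins,
                      xlab = "Bill Length (mm)",
                      ylab = "Bill Depth (mm)") |>
  gf_lm()
p_overall

# One linear model per species
p_by_species <- gf_point(bill_depth_mm ~ bill_length_mm,
                         color = ~species,
                         data = penguins,
                         xlab = "Bill Length (mm)",
                         ylab = "Bill Depth (mm)") |>
  gf_lm(color = ~species)
p_by_species
```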
How did the linear model change when we accounted for species?
Simpson’s Paradox is a type of “Amalgamation Paradox” where the behavior of the whole is different from the behavior of the subgroups. We see this in the field of Statistics when the trend of a linear model for a whole group changes when you look at the same relationship within subgroups. In our example above, for all penguins, the relationship between bill length and bill depth looks negative, but when you split by species you see a positive relationship within each species (which makes more sense). This is why it is so important to consider all the possible variables that might influence a relationship!
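The sign change can also be checked numerically by comparing fitted slopes. This is a base R sketch using lm() directly rather than a ggformula function; the names overall_slope and species_slopes are made up for illustration, and bill depth is assumed to be the response.

```r
library(palmerpenguins)

# Slope of bill depth on bill length for all penguins together
overall_fit <- lm(bill_depth_mm ~ bill_length_mm, data = penguins)
overall_slope <- coef(overall_fit)[["bill_length_mm"]]
overall_slope

# Slope of the same model fit separately within each species
species_slopes <- sapply(split(penguins, penguins$species), function(d) {
  coef(lm(bill_depth_mm ~ bill_length_mm, data = d))[["bill_length_mm"]]
})
species_slopes
```

A negative overall slope alongside positive within-species slopes is the pattern described above.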