surveys <- read_csv("data/surveys.csv")
# glimpse(surveys)
Lab 3: Exploring Rodents with ggplot2
1 Part 1: Setup
1.1 GitHub Workflow
Set up your GitHub workflow, either by creating a repository first and then using Version Control to set up your project, or vice versa using the usethis package commands we have learned.
Use appropriate naming conventions for your project (see Code Style Guide), e.g. lab-3-ggplot2.
Your project folder should contain the following:
- .Rproj
- lab-3-student.qmd
- “data” folder
- rendered document
You will submit a link to your GitHub repository with all content.
1.2 Seeking Help
Part of learning to program is learning from a variety of resources. Thus, I expect you will use resources that you find on the internet. There is, however, an important balance between copying someone else’s code and using their code to learn. Therefore, if you use external resources, I want to know about it.
If you used Google, you are expected to “inform” me of any resources you used by pasting the link to the resource in a code comment next to where you used that resource.
If you used ChatGPT, you are expected to “inform” me of the assistance you received by (1) indicating somewhere in the problem that you used ChatGPT (e.g., below the question prompt or as a code comment), and (2) downloading and attaching the .txt file containing your entire conversation with ChatGPT. ChatGPT can be used as a “search engine”, but you should not copy and paste prompts from the lab or the code into your lab.
Additionally, you are permitted and encouraged to work with your peers as you complete lab assignments, but you are expected to do your own work. Copying from each other is cheating, and letting people copy from you is also cheating. Please don’t do either of those things.
1.3 Lab Instructions
The questions in this lab are noted with numbers and boldface. Each question will require you to produce code, whether it is one line or multiple lines.
This document is quite plain, meaning it does not have any special formatting. As part of your demonstration of creating professional looking Quarto documents, I would encourage you to spice your documents up (e.g., declaring execution options, specifying how your figures should be output, formatting your code output, etc.).
1.4 Setup
In the code chunk below, load in the packages necessary for your analysis. You should only need the tidyverse package for this analysis.
2 Part 2: Data Context
The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been used to study the interactions among rodents, ants, and plants, as well as their respective responses to climate. To study the interactions among organisms, researchers experimentally manipulated access to 24 study plots. This study has produced over 100 scientific papers and is one of the longest running ecological studies in the U.S.
We will be investigating the animal species diversity and weights found within plots at the Portal study site. The data are stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique ID for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxon | e.g. Rodent, Reptile, Bird, Rabbit |
plot_type | type of plot |
2.1 Reading the Data into R
We are going to use the read_csv() function to load in the surveys.csv dataset (stored in the data folder). For simplicity, name the data surveys. We will learn more about this function next week.
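Here is a minimal sketch of reading a CSV file and inspecting it. It deliberately uses a tiny made-up file rather than surveys.csv; the names toy_path and toy and the three columns are placeholders for illustration only.

```r
library(tidyverse)

# Hypothetical stand-in data written to a temporary file,
# so the example runs without the lab's surveys.csv.
toy_path <- tempfile(fileext = ".csv")
write_csv(
  tibble(record_id  = 1:3,
         species_id = c("DM", "PF", "DM"),
         weight     = c(40, 8, NA)),
  toy_path
)

# Read the file back in; show_col_types = FALSE silences the column message.
toy <- read_csv(toy_path, show_col_types = FALSE)

dim(toy)      # number of rows, then number of columns
glimpse(toy)  # column names, data types, and the first few values
```

The same two inspection functions work on any data frame, including the one you create from surveys.csv.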
1. What are the dimensions (# of rows and columns) of these data?
2. What are the data types of the variables in this dataset?
3 Part 3: Exploratory Data Analysis with ggplot2
ggplot() graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a ggplot(), we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>,
       mapping = aes(<VARIABLE MAPPINGS>)) +
  <GEOM_FUNCTION>()
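As one possible way to fill in the template above, the sketch below uses the mpg dataset that ships with ggplot2 as a stand-in for the lab data. The object name p_template is a placeholder.

```r
library(tidyverse)

# <DATA> becomes mpg, the variable mappings go inside aes(),
# and <GEOM_FUNCTION> becomes geom_point() for a scatterplot.
p_template <- ggplot(data = mpg,
                     mapping = aes(x = displ, y = hwy)) +
  geom_point()

p_template
```

Swapping geom_point() for another geom changes the plot type while the rest of the template stays the same.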
Let’s get started!
3.1 Scatterplot
3. First, let’s create a scatterplot of the relationship between weight (on the \(x\)-axis) and hindfoot_length (on the \(y\)-axis).
We can see there are a lot of points plotted on top of each other. Let’s try and modify this plot to extract more information from it.
4. Let’s add transparency (alpha) to the points, to make the points more transparent and (possibly) easier to see.
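A short sketch of how transparency is usually set: alpha goes inside the geom, outside of aes(), because it is a constant rather than a variable mapping. The mpg data and the value 0.3 are assumptions chosen for illustration.

```r
library(tidyverse)

# mpg ships with ggplot2; it stands in for the lab data here.
p_alpha <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3)  # 0 is invisible, 1 is fully opaque

p_alpha
```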
Despite our best efforts, there is still a substantial amount of overplotting in our scatterplot. Let’s try splitting the dataset into smaller subsets and see if that allows us to see the trends a bit better.
6. Facet your scatterplot by species.
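Faceting can be sketched as follows, again with mpg standing in for the lab data. Here facet_wrap() draws one panel per level of the class variable; the variable choice is only an example.

```r
library(tidyverse)

# One small scatterplot per vehicle class, all sharing the same axes.
p_facet <- ggplot(data = mpg,
                  mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ class)

p_facet
```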
7. No plot is complete without axis labels and a title. Add reader-friendly axis labels and a title to your plot.
It takes a larger cognitive load to read text that is rotated. It is common practice in many journals and media outlets to move the \(y\)-axis label to the top of the graph under the title.
8. Specify your \(y\)-axis label to be empty and move the \(y\)-axis label into the subtitle.
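One way to move the y-axis description into the subtitle is sketched below with the mpg stand-in data. The label text is made up for this example.

```r
library(tidyverse)

p_labs <- ggplot(data = mpg,
                 mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  labs(x = "Engine displacement (L)",
       y = "",                                   # empty y-axis label
       title = "Larger engines use more fuel",
       subtitle = "Highway fuel economy (mpg)")  # y-axis description goes here

p_labs
```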
3.2 Boxplots
9. Create side-by-side boxplots to visualize the distribution of weight within each species.
A fundamental complaint about boxplots is that they do not plot the raw data. However, with ggplot we can add the raw points on top of the boxplots!
10. Add another layer to your previous plot that plots each observation using geom_point().
Alright, this should look less than optimal. Your points should appear rather stacked on top of each other. To make them less stacked, we need to jitter them a bit, using geom_jitter().
11. Remove the previous layer and include a geom_jitter() layer instead.
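The layering idea can be sketched with the mpg stand-in data: each + adds a layer that is drawn on top of the previous one. The width and height values are assumptions; height = 0 keeps the points at their true y values.

```r
library(tidyverse)

p_jitter <- ggplot(data = mpg,
                   mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +                       # first layer: the boxplots
  geom_jitter(width = 0.2, height = 0)   # second layer: raw points, spread sideways

p_jitter
```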
That should look a bit better! But it’s really hard to see the points when everything is black.
12. Set the color aesthetic in geom_jitter() to change the color of the points and set the alpha aesthetic to add transparency. You are welcome to use whatever color you wish! Some of my favorites are “springgreen4” and “steelblue4”. Check them out on R Charts.
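A sketch of setting a fixed color and transparency for the jittered points, using the mpg stand-in data. The alpha value is an arbitrary choice; any named R color or hex code works in place of "steelblue4".

```r
library(tidyverse)

p_color <- ggplot(data = mpg,
                  mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  geom_jitter(color = "steelblue4",  # constant color, so it sits outside aes()
              alpha = 0.4,
              width = 0.2,
              height = 0)

p_color
```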
Great! Now that you can see the points, you should notice something odd: there are two colors of points still being plotted. Some of the observations are being plotted twice, once from geom_boxplot() as outliers and again from geom_jitter()!
13. Inspect the help file for geom_boxplot() and see how you can remove the outliers from being plotted by geom_boxplot(). Make this change in your code!
Some small changes can make big differences to plots. One of these changes is better labels for a plot’s axes and legend.
14. Modify the \(x\)-axis and \(y\)-axis labels to describe what is being plotted. Be sure to include any necessary units! You might also be getting overlap in the species names – use theme(axis.text.x = ____) or theme(axis.text.y = ____) to turn the species axis labels 45 degrees.
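Rotating axis text might look like the sketch below, which uses the mpg stand-in data with manufacturer names on the x-axis. element_text() is the ggplot2 helper for styling text; hjust = 1 is an assumption that right-aligns the rotated labels under their tick marks.

```r
library(tidyverse)

p_rotate <- ggplot(data = mpg,
                   mapping = aes(x = manufacturer, y = hwy)) +
  geom_boxplot() +
  labs(x = "Manufacturer",
       y = "Highway fuel economy (mpg)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_rotate
```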
Some people (and journals) prefer for boxplots to be stacked with a specific orientation! Let’s practice changing the orientation of our boxplots.
15. Now copy and paste the boxplot code you have been building above. Flip the orientation of your boxplots. If you created horizontally stacked boxplots, your boxplots should now be stacked vertically. If you had vertically stacked boxplots, you should now stack your boxplots horizontally!
Notice how vertically stacked boxplots make the species labels more readable than horizontally stacked boxplots (even when the axis labels are rotated). This is good practice!
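One simple way to change orientation is to swap which variable is mapped to x and which to y, as sketched here with the mpg stand-in data. Recent versions of ggplot2 (3.3.0 and later) detect the orientation from the mapping; the object names are placeholders.

```r
library(tidyverse)

# Groups along the x-axis: boxes sit next to each other.
p_groups_on_x <- ggplot(data = mpg,
                        mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

# Groups along the y-axis: boxes are stacked on top of each other,
# and the group labels read horizontally.
p_groups_on_y <- ggplot(data = mpg,
                        mapping = aes(x = hwy, y = class)) +
  geom_boxplot()

p_groups_on_x
p_groups_on_y
```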
4 Lab 3 Submission
For Lab 3 you will submit the link to your GitHub repository. Your rendered file is required to have the following specifications in the YAML options (at the top of your document):
- have the plots embedded (embed-resources: true)
- include your source code (code-tools: true)
- include all your code and output (echo: true)
If any of the options are not included, your Lab 3 or Challenge 3 assignment will receive an “Incomplete” and you will be required to submit a revision.
In addition, your document should not have any warnings or messages output in your HTML document. If your HTML contains warnings or messages, you will receive an “Incomplete” for document formatting and you will be required to submit a revision.
5 Challenge Problems
For this week’s Challenge, you will have three different options to explore. I’ve arranged these options in terms of their “spiciness,” or the difficulty of completing the task. You only need to complete one task to complete the challenge, but if you are interested in exploring more than one, feel free!
This is a great place to let your creativity show! Make sure to indicate what additional touches you added, and provide any online references you used.
5.1 🌶 Medium: Ridge Plots
In Lab 3, you used side-by-side boxplots to visualize the distribution of weight within each species of rodent. Boxplots have substantial flaws, namely that they disguise distributions with multiple modes.
A “superior” alternative is the density plot. However, ggplot2 does not allow for side-by-side density plots using geom_density(). Instead, we will need to make use of the ggridges package to create side-by-side density (ridge) plots.
For this challenge you are to change your boxplots to ridge plots. You will need to install the ggridges package and explore the geom_density_ridges() function.
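To give a feel for the function, here is a minimal ridge plot sketch using the mpg stand-in data. It assumes the ggridges package is already installed; the alpha value is an arbitrary choice.

```r
library(tidyverse)
library(ggridges)  # run install.packages("ggridges") once if needed

# The numeric variable goes on x and the grouping variable on y,
# giving one density curve per group.
p_ridge <- ggplot(data = mpg,
                  mapping = aes(x = hwy, y = class)) +
  geom_density_ridges(alpha = 0.6)

p_ridge
```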
5.2 🌶 🌶 Spicy: Exploring Color Themes
The built-in ggplot() color scheme may not include the colors you were looking for. Don’t worry – there are many other color palettes available to use!
You can change the colors used by ggplot() in a few different ways.
Manual Specification
Add the scale_color_manual() or scale_fill_manual() functions to your plot and directly specify the colors you want to use. You can either:
- define a vector of colors within the scale functions (e.g. values = c("blue", "black", "red", "green")), OR
- create a vector of colors using hex numbers and store that vector as a variable. Then, call that vector in the scale_color_manual() function.
Here are some example hex color schemes:
# A vector of a RG color deficient friendly palette with gray:
cdPalette_grey <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# A vector of a RG color deficient friendly palette with black:
cdPalette_blk <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
If you are interested in using specific hex colors, I like the image color picker app to find the colors I want.
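A sketch of manual specification with a stored palette, using the mpg stand-in data. The palette is the gray one listed above; the drv variable has only three levels, so only the first three colors end up being used.

```r
library(tidyverse)

# Store the hex colors as a variable.
cdPalette_grey <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
                    "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Pass the stored vector to the manual color scale.
p_manual <- ggplot(data = mpg,
                   mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_manual(values = cdPalette_grey)

p_manual
```

Use scale_fill_manual() in the same way when the variable is mapped to fill rather than color.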
Package Specification
While manual specification may be necessary for some contexts, it can be a real pain to handpick 5+ colors. This is where color scales built into R packages come in handy! Popular packages for colors include:
- RColorBrewer – change colors by using scale_fill_brewer() or scale_colour_brewer().
- viridis – change colors by using scale_colour_viridis_d() for discrete data, scale_colour_viridis_c() for continuous data.
- ggsci – change colors by using scale_color_<PALNAME>() or scale_fill_<PALNAME>(), where you specify the name of the palette you wish to use (e.g. scale_color_aaas()).
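As a rough illustration of palette scales, the sketch below applies a brewer scale and a viridis scale to the same boxplot of the mpg stand-in data. Both scale functions used here ship with ggplot2 itself; the "Set2" palette name is just one of the available ColorBrewer palettes.

```r
library(tidyverse)

# A base plot with a discrete variable mapped to fill.
p_base <- ggplot(data = mpg,
                 mapping = aes(x = hwy, y = class, fill = drv)) +
  geom_boxplot()

# ColorBrewer palette for the fill colors.
p_brewer <- p_base + scale_fill_brewer(palette = "Set2")

# Discrete viridis palette for the fill colors.
p_viridis <- p_base + scale_fill_viridis_d()

p_brewer
p_viridis
```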
This website provides an exhaustive list of color themes available through various packages.
In this challenge you are expected to use this information to modify the boxplots you created in Lab 3. First, you are to color the boxplots based on the variable genus. Next, you are to change the colors used for genus using either manual color specification or any of the packages listed here (or others!).
5.3 🌶 🌶 🌶 Hot: Exploring ggplot2 Annotation
Some data scientists advocate that we should try to eliminate legends from our plots to make them more clear. Instead of using legends, which cause the reader’s eye to stray from the plot, we should use annotation.
We can add annotation(s) to a ggplot() using the annotate() function:
ggplot(data = surveys,
       mapping = aes(x = weight, y = species, color = genus)) +
  geom_boxplot() +
  scale_color_manual(values = cdPalette_grey) +
  annotate("text", y = 6, x = 200, label = "Sigmodon") +
  annotate("text", y = 4, x = 200, label = "Perognathus") +
  theme(legend.position = "none") +
  labs(x = "Weight (g)",
       y = "",
       subtitle = "by Species and Genera",
       title = "Rodent Weight")
Note that I’ve labeled the “Sigmodon” and “Perognathus” genera, so the reader can tell that these boxplots are associated with their respective genus.
In this challenge you are expected to use this information to modify the boxplots you created in Lab 3. First, you are to color the boxplots based on the variable genus. Next, you are to add annotations for each genus next to the boxplot(s) associated with that genus. Finally, you are expected to use the theme() function to remove the color legend from the plot, since it is no longer needed!