Histograms
Getting Started
First, be sure you have installed ggformula. Remember, you only need to install the package once on your machine.
Then, be sure to load the package ggformula. Remember, you need to do this with each new Quarto/RMarkdown document or R session.
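The two steps above can be sketched as follows. This is a minimal sketch that assumes you have internet access to CRAN; the mirror URL is just one common choice and is not specified by this tutorial.

```r
# One-time setup: install ggformula only if it is not already installed.
# The repos value is an assumed CRAN mirror; any mirror works.
if (!requireNamespace("ggformula", quietly = TRUE)) {
  install.packages("ggformula", repos = "https://cloud.r-project.org")
}

# Every new R session or Quarto/RMarkdown document: load the package.
library(ggformula)
```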
Data for Examples
As a reminder (see Overview of Data Visualization), we will be using the penguins data from the palmerpenguins package:
library(palmerpenguins)
Here is a snippet of the data:
Palmer Penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Chinstrap | Dream | 50.0 | 19.5 | 196 | 3900 | male | 2007 |
Adelie | Biscoe | 42.0 | 19.5 | 200 | 4050 | male | 2008 |
Adelie | Dream | 38.1 | 17.6 | 187 | 3425 | female | 2009 |
Gentoo | Biscoe | 45.5 | 15.0 | 220 | 5000 | male | 2008 |
Adelie | Biscoe | 40.5 | 17.9 | 187 | 3200 | female | 2007 |
Histograms with One Quantitative Variable
Basic Code
For a single quantitative variable, x, here is the general structure for a histogram.
gf_histogram(~x,
             data = mydata)
Run the code below to see an example using the quantitative variable bill_length_mm from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g. bill_depth_mm).
Notice the warning produced from running the code. It only tells you that some rows (penguins) were left out of the plot because they have missing values for the variable being visualized.
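A minimal version of the example described above, assuming ggformula and palmerpenguins are installed. The object name bill_hist is only an illustrative choice.

```r
library(ggformula)
library(palmerpenguins)

# Histogram of bill length using the default binning.
# Penguins with a missing bill_length_mm are dropped, which triggers the warning.
bill_hist <- gf_histogram(~bill_length_mm,
                          data = penguins)
bill_hist
```

To try another variable, swap bill_length_mm for bill_depth_mm in the formula.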
Bin Widths
One of the features of a histogram is the bin width. Most histogram functions choose a bin width automatically, but the default is not always ideal. You can add the binwidth = argument to the R function gf_histogram() to modify the bin width.
gf_histogram(~x,
             data = mydata,
             binwidth = 10) #adjust this number
Run the code below to see an example using the quantitative variable bill_length_mm from the penguins data with a bin width of 5 (mm). Modify the bin width and see how it affects the histogram.
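The bin width example above might look like this, again assuming ggformula and palmerpenguins are installed. The object name is illustrative.

```r
library(ggformula)
library(palmerpenguins)

# Each bar now covers a 5 mm range of bill lengths.
bill_hist_bw5 <- gf_histogram(~bill_length_mm,
                              data = penguins,
                              binwidth = 5) #adjust this number
bill_hist_bw5
```

Smaller values such as binwidth = 1 show more detail but a noisier shape; larger values smooth the shape and can hide features such as multiple peaks.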
Adding Labels
Descriptive labels are important for any visualization. We can add them by including the xlab = and ylab = arguments in the function call, along with title = for a title.
gf_histogram(~x,
             data = mydata,
             xlab = "X Axis Label",
             ylab = "Y Axis Label",
             title = "Descriptive Title")
Add labels and a title to the histogram for bill_length_mm.
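One possible answer, as a sketch: the label text below is my own wording, not prescribed by the tutorial, and the bin width of 5 simply carries over from the earlier example.

```r
library(ggformula)
library(palmerpenguins)

# Axis labels state the variable and its units; the title summarizes the plot.
bill_hist_labeled <- gf_histogram(~bill_length_mm,
                                  data = penguins,
                                  binwidth = 5,
                                  xlab = "Bill length (mm)",
                                  ylab = "Number of penguins",
                                  title = "Distribution of Bill Length in Palmer Penguins")
bill_hist_labeled
```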
Other Modifications
We can add a few other modifications that are purely aesthetic, just to make our graphs look nicer or easier to read.
Outlining the Bars
We can outline the bars by setting the color = argument, for example to black.
gf_histogram(~x,
             data = mydata,
             xlab = "X Axis Label",
             ylab = "Y Axis Label",
             title = "Descriptive Title",
             color = "black")
Filling the Bars with Color
We can fill the bars with a color by setting the fill = argument to either a built-in R color name or a hex color code.
gf_histogram(~x,
             data = mydata,
             xlab = "X Axis Label",
             ylab = "Y Axis Label",
             title = "Descriptive Title",
             fill = "darkcyan")
Changing the Theme
The package ggformula is built on top of another package called ggplot2, so any ggplot2 function can be added to a ggformula-generated graphic. For example, we can change the theme to a built-in theme.
Try changing the theme of the following graph:
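As a sketch of the idea, a ggplot2 theme can be added to the histogram with +. The choice of theme_minimal() is only an example; other built-in ggplot2 themes include theme_bw() and theme_classic(). Loading ggplot2 explicitly is a precaution in case it is not already attached.

```r
library(ggformula)
library(ggplot2)        # provides theme_minimal() and the other built-in themes
library(palmerpenguins)

# Add a ggplot2 theme to a ggformula graphic with the + operator.
bill_hist_themed <- gf_histogram(~bill_length_mm,
                                 data = penguins,
                                 binwidth = 5) +
  theme_minimal()
bill_hist_themed
```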
Try It Out: Modifications
Try adding some modifications for the histogram of bill_length_mm.
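One way to combine the modifications from this section is sketched below. The specific bin width, colors, and label text are illustrative choices, not the only reasonable ones.

```r
library(ggformula)
library(ggplot2)
library(palmerpenguins)

# Outline, fill, labels, and theme applied to the same histogram.
bill_hist_styled <- gf_histogram(~bill_length_mm,
                                 data = penguins,
                                 binwidth = 2,
                                 xlab = "Bill length (mm)",
                                 ylab = "Number of penguins",
                                 title = "Distribution of Bill Length",
                                 color = "black",     # bar outline
                                 fill = "darkcyan") + # bar interior
  theme_bw()
bill_hist_styled
```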
Histograms for Comparisons Across Groups
When we have a quantitative variable that has been measured across multiple groups, we may be interested in comparing histograms across the values/groups of a categorical variable. We can do this using two different features of data visualization:
- Color Differences
- Facets
Adding Color to Groups
Similar to changing the color of bars to a single color, we can use the fill = argument but instead specify our categorical variable z.
gf_histogram(~x,
             data = mydata,
             fill = ~z) #don't forget the ~ before the variable name
Here is the histogram of bill_length_mm with color varied by species, a categorical variable with values of Adelie, Chinstrap, and Gentoo. Modify the code below to change the fill color to another categorical variable such as island or sex and see what happens.
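The histogram described above can be sketched as follows, assuming ggformula and palmerpenguins are installed. The bin width of 2 is an illustrative choice.

```r
library(ggformula)
library(palmerpenguins)

# fill = ~species maps the fill color to the species variable,
# so each species gets its own color and a legend is added.
bill_hist_by_species <- gf_histogram(~bill_length_mm,
                                     data = penguins,
                                     binwidth = 2,
                                     fill = ~species) #don't forget the ~
bill_hist_by_species
```

Replacing ~species with ~island or ~sex changes which variable determines the colors.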
Faceting by Groups
Faceting in visualization is a tool that allows you to easily split up data across multiple panels of the same type. To do this in ggformula, you add | z after the formula, which conditions the graph on the categorical variable z and splits the graph by the groups/values of z.
gf_histogram(~x | z,
             data = mydata)
Here is the histogram of bill_length_mm with facets based on species, a categorical variable with values of Adelie, Chinstrap, and Gentoo. Modify the code below to change the facets to another categorical variable such as island or sex and see what happens. Try adding fill by the facet variable as well.
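A sketch of the faceted histogram described above, including the suggested fill by the facet variable. The bin width is an illustrative choice.

```r
library(ggformula)
library(palmerpenguins)

# "| species" draws one panel per species;
# fill = ~species additionally colors each panel by its species.
bill_hist_faceted <- gf_histogram(~bill_length_mm | species,
                                  data = penguins,
                                  binwidth = 2,
                                  fill = ~species)
bill_hist_faceted
```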
Using ggplot2 to Control Facets
Unfortunately, the ggformula option for facets does not give you much control over how the facets are organized, so it can be useful to instead use a ggplot2 option such as facet_wrap() or facet_grid(). For instance, here is an example using facet_wrap() that allows us to stack our facets into one column (ncol = 1). Look up the options for the different facet functions and try other modifications.
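A minimal sketch of the facet_wrap() approach, assuming ggplot2 is installed alongside ggformula. Stacking the panels in one column puts all three histograms on a shared x-axis, which makes their centers and spreads easier to compare.

```r
library(ggformula)
library(ggplot2)        # provides facet_wrap() and facet_grid()
library(palmerpenguins)

# Facet with ggplot2 instead of the "| z" formula syntax,
# stacking the species panels in a single column.
bill_hist_stacked <- gf_histogram(~bill_length_mm,
                                  data = penguins,
                                  binwidth = 2,
                                  fill = ~species) +
  facet_wrap(~species, ncol = 1)
bill_hist_stacked
```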
Check Your Understanding: Histograms
Question 1: Which of the following characteristics of the distribution of values for a variable can be evaluated using a histogram?
Question 2: Which of the following can we determine from the y-axis of a histogram?