Overview for Data Visualization

Making Visualizations with ggformula

The ggformula package layers a formula-based syntax over the Grammar of Graphics used in the ggplot2 package. As a result, most ggformula graphics take one of the following forms:

One Variable

gf_goal(~ x, data = mydata)

Two Variables - Relationships

gf_goal(y ~ x, data = mydata)

Two or More Variables Split by Groups

gf_goal(~ x | z, data = mydata)
gf_goal(y ~ x | z, data = mydata)

where

  • goal is a placeholder for the graph type, so gf_goal() stands for a specific function (e.g. gf_histogram()),
  • y and x are the variables/columns in the data mapped to the y-axis and x-axis respectively, and | z facets the graph into separate panels, one for each group value of z, and
  • mydata is the object name of your data.
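As a concrete illustration of the three forms, the sketch below fills in the placeholders with R's built-in mtcars data set. The data set and the choice of gf_histogram() and gf_point() are assumptions made for illustration; any other gf_ function and tidy data set would follow the same pattern.

```r
library(ggformula)  # also attaches ggplot2

# One variable: distribution of fuel efficiency (mpg)
p_one <- gf_histogram(~ mpg, data = mtcars, bins = 10)

# Two variables: mpg on the y-axis against weight (wt) on the x-axis
p_two <- gf_point(mpg ~ wt, data = mtcars)

# Split by groups: one panel per number of cylinders (cyl)
p_one_facet <- gf_histogram(~ mpg | cyl, data = mtcars, bins = 10)
p_two_facet <- gf_point(mpg ~ wt | cyl, data = mtcars)

# Printing a plot object draws it
p_two_facet
```

Storing each graph in an object is optional; calling the gf_ function on its own draws the graph immediately.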

Getting Started

First, be sure you have installed ggformula. Remember, you only need to install the package once on your machine.
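A minimal sketch of the one-time install, assuming the package is installed from CRAN with install.packages(). The requireNamespace() check is an addition for convenience: it skips the install when the package is already present.

```r
# Install ggformula only if it is not already available on this machine
if (!requireNamespace("ggformula", quietly = TRUE)) {
  install.packages("ggformula")
}
```

Run this in the Console rather than inside a Quarto/RMarkdown document so the install does not repeat on every render.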

Then, be sure to load the package ggformula. Remember, you need to do this with each new Quarto/RMarkdown document or R Session.

library(ggformula) #for graphs

Common Modifications/Arguments

Every visualization type has some modifications that are specific to that type, but there are some universal modifications that should be added to every graph.

Axis Labels

Every visualization should have its axes labelled and a title added according to the context of the data. Together, these labels should always include:

  • The variables or cases they represent
    • Always include units (e.g. inches, %) for variables
  • The individual/sample/population represented by data
  • A descriptive title - this should not be “X vs Y”, but something like “As X increases we see that Y decreases”

Here is how these labels are added to our gf_goal() function:

gf_goal(formula, #takes many different forms 
        data = mydata,
        xlab = "X Variable, Units, and/or Sample",
        ylab = "Y Variable, Units, and/or Sample",
        title = "Title describing relationship")
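To make the template concrete, here is a sketch using R's built-in mtcars data set (an assumption chosen for illustration). The labels name the variable, its units, and the cases, and the title states the relationship rather than “X vs Y”.

```r
library(ggformula)

# Scatterplot of fuel efficiency against weight for the cars in mtcars
p_labelled <- gf_point(
  mpg ~ wt,
  data  = mtcars,
  xlab  = "Weight of car (1000 lbs)",
  ylab  = "Fuel efficiency of car (miles per gallon)",
  title = "Heavier cars tend to travel fewer miles per gallon"
)

p_labelled
```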

Adding ggplot2 Layers

The ggformula package is built on top of another package called ggplot2, so ggplot2 components such as themes, scales, and labels can be added to a ggformula-generated graphic.

For example, we can switch to a built-in theme or modify other features by placing + after the ggformula function call and then adding the ggplot2 layers.

gf_goal(formula, #takes many different forms 
        data = mydata,
        xlab = "X Variable, Units, and/or Sample",
        ylab = "Y Variable, Units, and/or Sample",
        title = "Title describing relationship") +
  theme_light()
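A runnable version of this pattern, again assuming the built-in mtcars data set for illustration. It chains two ggplot2 additions: a built-in theme and a further theme() tweak that makes the title bold.

```r
library(ggformula)  # attaches ggplot2, so theme_light() and theme() are available

p_themed <- gf_point(
  mpg ~ wt,
  data  = mtcars,
  xlab  = "Weight of car (1000 lbs)",
  ylab  = "Fuel efficiency of car (miles per gallon)",
  title = "Heavier cars tend to travel fewer miles per gallon"
) +
  theme_light() +                                  # built-in ggplot2 theme
  theme(plot.title = element_text(face = "bold"))  # further tweak: bold title

p_themed
```

Each + must sit at the end of a line, not the start of the next one, or R treats the graph as finished before the layer is added.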

Data Structures

In order to make graphs, your data needs to be “tidy”. That means it should have the structure:

  • Every Column is a Variable
  • Every Row is an Individual/Case
  • Every Cell is a Single Value

Here is an example of “tidy” data using data from a package called palmerpenguins (remember to install it!). First, load the package.

library(palmerpenguins)

Here is a snippet of the data:

Palmer Penguins
species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Gentoo     Biscoe     46.5            14.8           217                5200         female  2008
Chinstrap  Dream      49.0            19.5           210                3950         male    2008
Gentoo     Biscoe     48.8            16.2           222                6000         male    2009
Gentoo     Biscoe     42.7            13.7           208                3950         female  2008
Adelie     Dream      41.5            18.5           201                4000         male    2009
Chinstrap  Dream      49.6            18.2           193                3775         male    2009
Adelie     Torgersen  36.2            16.1           187                3550         female  2008
Adelie     Biscoe     37.8            20.0           190                4250         male    2009
Adelie     Dream      41.3            20.3           194                3550         male    2008
Adelie     Torgersen  40.6            19.0           199                4000         male    2009

We notice that each column is a variable. For example:

  • species is the species of penguin
  • island is the location/island on which the penguin is found
  • bill_length_mm is the length of the penguin’s bill in millimeters (mm)

We also notice that each row represents a single penguin and its characteristics. Each cell contains a single value associated with a specific variable measured on a specific penguin.
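Because the data are tidy, column names can be used directly in a ggformula formula. The sketch below inspects the penguins data and then graphs two of its columns, faceted by a third. The choice of variables, labels, and title is an assumption made for illustration.

```r
library(ggformula)
library(palmerpenguins)

# Each row is one penguin, each column is one variable
dim(penguins)   # number of rows (penguins) and columns (variables)
head(penguins)  # first six rows

# Tidy columns can be named directly in the formula and the facet term
p_penguins <- gf_point(
  body_mass_g ~ flipper_length_mm | species,
  data  = penguins,
  xlab  = "Flipper length of penguin (mm)",
  ylab  = "Body mass of penguin (g)",
  title = "Penguins with longer flippers tend to be heavier"
) +
  theme_light()

# Rows with missing measurements are dropped with a warning when drawn
p_penguins
```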