Overview for Data Visualization

Making Visualizations with ggformula

The package ggformula is built to overlay the Grammar of Graphics used in package ggplot2 with a formula based syntax. The formula based syntax means that all ggformula generated graphics will generally take one of the following forms:

One Variable

gf_goal(~ x, data = mydata)

Two Variable - Relationships

gf_goal(y ~ x, data = mydata)

Two or More Variables Split by Groups

gf_goal(~ x | z, data = mydata)
gf_goal(y ~ x | z, data = mydata)

where the

  • goal will be the specific function for the graph type (e.g. gf_histogram()),
  • the y and x will be the specific variables/columns in the data mapped to the y-axis and x-axis respectively, with | z the feature that facets into different graphs for group values of z, and
  • mydata is the object name of your data.

Getting Started

First, be sure you have installed ggformula. Remember, you only need to install the package once on your machine.

Then, be sure to load the package ggformula. Remember, you need to do this with each new Quarto/RMarkdown document or R Session.

library(ggformula) #for graphs

Common Modifications/Arguments

Every visualization type has some modifications that are specific to that type, but there are some universal modifications that should be added to every graph.

Axis Labels

Every visualization should have its axes labelled according to the context of the data. Axis labels should always include

  • The variables or cases they represent
    • Always include units (e.g. inches, %) for variables
  • The individual/sample/population represented by data
  • A descriptive title - this should not be “X vs Y”, but something like “As X increases we see that Y decreases”

Here is how that is added to our gf_goal() function:

gf_goal(formula, #takes many different forms 
        data = mydata,
        xlab = "X Variable, Units, and/or Sample",
        ylab = "Y Variable, Units, and/or Sample",
        title = "Title describing relationship")

Adding ggplot2 Layers

The package ggformula is built on top of another package called ggplot2 and so any ggplot2 function can be added to a ggformula generated graphic.

For example, we can change the theme to a built-in theme or modify other features using + after the ggformula function to add the ggplot2 layers to the graph.

gf_goal(formula, #takes many different forms 
        data = mydata,
        xlab = "X Variable, Units, and/or Sample",
        ylab = "Y Variable, Units, and/or Sample",
        title = "Title describing relationship") +
  theme_light()

Data Structures

In order to make graphs, your data needs to be “tidy”. That means it should have the structure:

  • Every Column is a Variable
  • Every Row is an Individual/Case
  • Every Cell is a Single Value

Here is an example of “tidy” data using data from a package called palmerpenguins (remember to install it!). First, load the package.

library(palmerpenguins)

Here is a snippet of the data:

Palmer Penguins
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 40.9 16.8 191 3700 female 2008
Gentoo Biscoe 50.0 15.3 220 5550 male 2007
Adelie Dream 41.1 19.0 182 3425 male 2007
Adelie Dream 39.2 21.1 196 4150 male 2007
Gentoo Biscoe 49.1 15.0 228 5500 male 2009
Adelie Biscoe 35.7 16.9 185 3150 female 2008
Gentoo Biscoe 43.2 14.5 208 4450 female 2008
Chinstrap Dream 46.4 17.8 191 3700 female 2008
Adelie Torgersen 34.6 17.2 189 3200 female 2008
Chinstrap Dream 50.5 18.4 200 3400 female 2008

We notice that each column is a variable, such as

  • species is the species of penguin
  • island is the location/island on which the penguin is found
  • bill_length_mm is the length of the penguin’s bill in millimeters (mm)

We also notice that each row represents a single penguin and its characteristics. Each cell contains a single value associated with a specific variable measured on a specific penguin.