Overview for Summary Statistics

Calculating Summary Statistics with mosaic

The mosaic package wraps existing base R functions so that they are easier to use through a formula-based syntax. With this syntax, the code for mosaic summary statistics will generally take one of the following forms:

One Variable

goal(~ x, data = mydata)

Two Variables - Relationships

goal(y ~ x, data = mydata)

where the

  • goal will be the specific function for the statistic (e.g. mean()),
  • the y and x will be the specific variables/columns in the data and may be categorical or quantitative depending on the statistic type, and
  • mydata is the object name of your data.
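As a minimal sketch of both forms, the code below uses mean() as the function and a small made-up data frame. The names mydata, score, and group are hypothetical and chosen only for illustration.

```r
library(mosaic)  # provides formula-aware versions of mean(), median(), etc.

# Hypothetical data, made up for illustration only
mydata <- data.frame(
  score = c(10, 20, 30, 40),
  group = c("a", "a", "b", "b")
)

# One variable: overall mean of score
overall_mean <- mean(~ score, data = mydata)

# Two variables: mean of score within each level of group
group_means <- mean(score ~ group, data = mydata)

overall_mean
group_means
```

In the two-variable form, the variable on the left of the ~ is summarized and the variable on the right defines the groups.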

Getting Started

First, be sure you have installed mosaic. Remember, you only need to install the package once on your machine.
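A minimal sketch of the one-time install step, assuming your machine can reach CRAN. The requireNamespace() check is an optional addition that skips the install when the package is already present.

```r
# Install mosaic only if it is not already available on this machine
if (!requireNamespace("mosaic", quietly = TRUE)) {
  install.packages("mosaic")
}
```

This is usually run in the R console rather than saved inside a Quarto/RMarkdown document.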

Then, be sure to load the package mosaic. Remember, you need to do this with each new Quarto/RMarkdown document or R Session.

library(mosaic) #for summary stats

Data Structures

To calculate summary statistics, your data needs to be “tidy”. That means it should have the following structure:

  • Every Column is a Variable
  • Every Row is an Individual/Case
  • Every Cell is a Single Value

Here is an example of “tidy” data using data from a package called palmerpenguins (remember to install it!). First, load the package.

library(palmerpenguins)

Here is a snippet of the data:

Palmer Penguins
species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie     Biscoe     42.0            19.5           200                4050         male    2008
Adelie     Torgersen  42.5            20.7           197                4500         male    2007
Adelie     Torgersen  36.6            17.8           185                3700         female  2007
Chinstrap  Dream      46.4            18.6           190                3450         female  2007
Chinstrap  Dream      45.2            17.8           198                3950         female  2007
Gentoo     Biscoe     47.4            14.6           212                4725         female  2009
Gentoo     Biscoe     49.1            14.8           220                5150         female  2008
Gentoo     Biscoe     45.8            14.6           210                4200         female  2007
Chinstrap  Dream      52.0            20.7           210                4800         male    2008
Adelie     Torgersen  42.0            20.2           190                4250         NA      2007

Notice that each column is a variable, such as:

  • species is the species of penguin
  • island is the location/island on which the penguin is found
  • bill_length_mm is the length of the penguin’s bill in millimeters (mm)

We also notice that each row represents a single penguin and its characteristics. Each cell contains a single value associated with a specific variable measured on a specific penguin.
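One way to check this structure yourself is sketched below, assuming both mosaic and palmerpenguins are installed. The object names n_cases and n_variables are introduced here only for illustration.

```r
library(mosaic)

# Load the penguins data explicitly from the palmerpenguins package
data(penguins, package = "palmerpenguins")

n_cases <- nrow(penguins)      # rows: one per penguin
n_variables <- ncol(penguins)  # columns: one per variable

names(penguins)  # the variable names
head(penguins)   # the first few rows

# Because the data are tidy, column names can go straight into a formula.
# na.rm = TRUE drops missing values (NA) before averaging.
mean(bill_length_mm ~ species, data = penguins, na.rm = TRUE)
```

The last line follows the two-variable form: bill_length_mm is summarized separately for each species.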