Measures of Spread

Getting Started

First, be sure you have installed mosaic. Remember, you only need to install each package once on your machine.

Then, be sure to load the mosaic package. Remember, you need to do this in each new Quarto/RMarkdown document or R session.
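
As a minimal sketch of those two steps (the install line is commented out so that it does not reinstall the package on every run):

```r
# Install once per machine (uncomment if mosaic is not yet installed)
# install.packages("mosaic")

# Load in every new document or R session
library(mosaic)
```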

Data for Examples

As a reminder (see Overview of Summary Statistics), we will be using the penguins data from the palmerpenguins package:

library(palmerpenguins)

Here is a snippet of the data:

Palmer Penguins
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Gentoo Biscoe 43.3 13.4 209 4400 female 2007
Chinstrap Dream 50.8 18.5 201 4450 male 2009
Adelie Dream 43.2 18.5 192 4100 male 2008
Adelie Biscoe 41.0 20.0 203 4725 male 2009
Adelie Dream 33.1 16.1 178 2900 female 2008

Variance

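The variance is, roughly, the average of the squared distances from each data point in the sample to the sample mean \(\bar{x}\). The sum of the squared distances is divided by \(n-1\) rather than \(n\):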

\[s^2=\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}{n-1}\]

The variance is one of the most important metrics in statistics. It measures how spread out the data are around the mean.
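
To make the formula concrete, here is a small worked sketch in base R. The five values in x are a made-up toy sample (not from the penguins data), and the names x_bar, sq_dev, and s2_manual are only illustrative.

```r
# Hypothetical toy sample, chosen so the arithmetic is easy to follow
x <- c(2, 4, 4, 6, 9)

n      <- length(x)            # sample size: 5
x_bar  <- sum(x) / n           # sample mean: 25 / 5 = 5
sq_dev <- (x - x_bar)^2        # squared distances: 9, 1, 1, 1, 16

# Sum of squared distances divided by n - 1: 28 / 4 = 7
s2_manual <- sum(sq_dev) / (n - 1)
s2_manual

# The built-in variance function gives the same value
stats::var(x)
```

Both lines return 7, which shows that the built-in function applies the \(n-1\) divisor.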

Basic Code

For a single quantitative variable, x, here is the general structure for calculating a variance in R using the var() function from the mosaic package.

var(~x, data = mydata)

Run the code below to see an example using the quantitative variable bill_length_mm from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g., bill_depth_mm).
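
A minimal version of that example, assuming the mosaic and palmerpenguins packages are installed (bill_var is just an illustrative name):

```r
library(mosaic)
library(palmerpenguins)

# Variance of bill length; the result is NA because some
# bill lengths are missing in the penguins data
bill_var <- var(~bill_length_mm, data = penguins)
bill_var
```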

Handling Missing Values

Notice the returned value of NA. The function needs an additional argument, na.rm = TRUE, that tells R to ignore missing values (NA) when calculating the variance. Add the argument to the code above:

var(~x, data = mydata, na.rm = TRUE)
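
Applied to the penguins data, the general structure might look like this (a sketch assuming mosaic and palmerpenguins are installed; bill_var is an illustrative name):

```r
library(mosaic)
library(palmerpenguins)

# Drop missing bill lengths before computing the variance
bill_var <- var(~bill_length_mm, data = penguins, na.rm = TRUE)
bill_var   # a single number, in squared millimetres
```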

Multiple Variances Across Groups

When we want to calculate the variance of a quantitative variable (y) within each group of a categorical variable (x), it is a simple modification of the single-variance code.

var(y ~ x, data = mydata)

Run the code below to see an example using the quantitative variable bill_length_mm and species from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g. bill_depth_mm) and/or another categorical variable (e.g., sex).

Remember to add na.rm = TRUE if the result for any group is NA.
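
A sketch of the grouped calculation, assuming mosaic and palmerpenguins are installed (var_by_species is an illustrative name):

```r
library(mosaic)
library(palmerpenguins)

# One variance per species; a species with missing bill lengths returns NA
var(bill_length_mm ~ species, data = penguins)

# Same calculation with missing values removed within each species
var_by_species <- var(bill_length_mm ~ species, data = penguins, na.rm = TRUE)
var_by_species
```

The result contains one value per level of the grouping variable, so swapping species for sex changes how many variances are returned.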

Standard Deviation

The standard deviation is the square root of the variance. It also measures the spread about the mean and is never negative. It is used more often than the variance because it has the same units of measurement as the original observations.

\[s=\sqrt{s^2}=\sqrt{\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}{n-1}}\]
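
The relationship between the two formulas can be checked on a small example in base R. The values in x are a made-up toy sample, and s2 and s_manual are illustrative names.

```r
# Hypothetical toy sample (not from the penguins data)
x <- c(2, 4, 4, 6, 9)

# Variance from the formula: sum of squared distances over n - 1
s2 <- sum((x - mean(x))^2) / (length(x) - 1)

# Standard deviation is the square root of the variance
s_manual <- sqrt(s2)
s_manual

# The built-in standard deviation function gives the same value
stats::sd(x)
```

Here the variance is 7, so both lines return the square root of 7. That value is in the same units as x, whereas the variance is in squared units.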

Basic Code

For a single quantitative variable, x, here is the general structure for calculating a standard deviation in R using the sd() function from the mosaic package.

sd(~x, data = mydata)

Run the code below to see an example using the quantitative variable bill_length_mm from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g., bill_depth_mm).
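
A minimal version of that example, assuming the mosaic and palmerpenguins packages are installed (bill_sd is just an illustrative name):

```r
library(mosaic)
library(palmerpenguins)

# Standard deviation of bill length; the result is NA because
# some bill lengths are missing in the penguins data
bill_sd <- sd(~bill_length_mm, data = penguins)
bill_sd
```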

Handling Missing Values

Notice the returned value of NA. The function needs an additional argument, na.rm = TRUE, that tells R to ignore missing values (NA) when calculating the standard deviation. Add the argument to the code above:

sd(~x, data = mydata, na.rm = TRUE)
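
Applied to the penguins data, the general structure might look like this (a sketch assuming mosaic and palmerpenguins are installed; bill_sd is an illustrative name):

```r
library(mosaic)
library(palmerpenguins)

# Drop missing bill lengths before computing the standard deviation
bill_sd <- sd(~bill_length_mm, data = penguins, na.rm = TRUE)
bill_sd   # a single number, in millimetres
```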

Multiple Standard Deviations Across Groups

When we want to calculate the standard deviation of a quantitative variable (y) within each group of a categorical variable (x), it is a simple modification of the single standard deviation code.

sd(y ~ x, data = mydata)

Run the code below to see an example using the quantitative variable bill_length_mm and species from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g. bill_depth_mm) and/or another categorical variable (e.g., sex).

Remember to add na.rm = TRUE if the result for any group is NA.
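
A sketch of the grouped calculation, assuming mosaic and palmerpenguins are installed (sd_by_species is an illustrative name):

```r
library(mosaic)
library(palmerpenguins)

# One standard deviation per species; a species with missing values returns NA
sd(bill_length_mm ~ species, data = penguins)

# Same calculation with missing values removed within each species
sd_by_species <- sd(bill_length_mm ~ species, data = penguins, na.rm = TRUE)
sd_by_species
```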

Quartiles/Percentiles

Interquartile Range (IQR)

Range