Measures of Spread
Getting Started
First, be sure you have installed mosaic. Remember, you only need to install each package once on your machine.
Then, be sure to load the mosaic package. Remember, you need to do this with each new Quarto/RMarkdown document or R session.
Data for Examples
As a reminder (see Overview of Summary Statistics), we will be using the penguins data from the palmerpenguins package:
library(palmerpenguins)
Here is a snippet of the data:
Palmer Penguins
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Gentoo | Biscoe | 43.3 | 13.4 | 209 | 4400 | female | 2007 |
Chinstrap | Dream | 50.8 | 18.5 | 201 | 4450 | male | 2009 |
Adelie | Dream | 43.2 | 18.5 | 192 | 4100 | male | 2008 |
Adelie | Biscoe | 41.0 | 20.0 | 203 | 4725 | male | 2009 |
Adelie | Dream | 33.1 | 16.1 | 178 | 2900 | female | 2008 |
Variance
The variance is the sum of the squared distances from each data point in the sample to the sample mean \(\bar{x}\), divided by \(n-1\). It is approximately the average squared distance from the mean:
\[s^2=\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}{n-1}\]
The variance is one of the most important metrics in statistics. It measures how spread out the data are around the mean, in squared units of the original variable.
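To make the formula concrete, here is a small sketch that computes the variance step by step and compares it with var(). The sample values are made up for illustration and are not from the penguins data; the variable names (x_bar, squared_dev, manual_var) are likewise only illustrative.

```r
# Small made-up sample (hypothetical values, not from the penguins data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)                 # sample size
x_bar <- mean(x)               # sample mean
squared_dev <- (x - x_bar)^2   # squared distance of each point from the mean

# Sum of squared distances divided by n - 1
manual_var <- sum(squared_dev) / (n - 1)
manual_var

# var() applied to a plain numeric vector should give the same value
var(x)
```

Dividing by n - 1 rather than n is what makes this the sample variance.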
Basic Code
For a single quantitative variable, x, here is the general structure for calculating a variance in R using the var() function from the mosaic package.
var(~x, data = mydata)
Run the code below to see an example using the quantitative variable bill_length_mm from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g., bill_depth_mm).
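A minimal sketch of what that example might look like, assuming the mosaic and palmerpenguins packages are installed. The object name bill_var is only for illustration.

```r
library(mosaic)
library(palmerpenguins)

# Variance of bill length across all penguins.
# bill_length_mm contains missing values, so the result is NA.
bill_var <- var(~bill_length_mm, data = penguins)
bill_var
```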
Notice the returned value of NA. The function needs another argument, na.rm = TRUE, which tells R to ignore missing values (NA) when calculating the variance. Add the argument to the code above:
var(~x, data = mydata, na.rm = TRUE)
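Filled in for the penguins data, the call might look like the sketch below (again assuming mosaic and palmerpenguins are installed; bill_var is an illustrative name).

```r
library(mosaic)
library(palmerpenguins)

# na.rm = TRUE drops the missing bill lengths before computing the variance
bill_var <- var(~bill_length_mm, data = penguins, na.rm = TRUE)
bill_var   # reported in squared millimeters
```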
Multiple Variances Across Groups
When we want to calculate the variance of a quantitative variable (y) measured across the values/groups of a categorical variable (x), it is a simple modification to the single variance code.
var(y ~ x, data = mydata)
Run the code below to see an example using the quantitative variable bill_length_mm and species from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g., bill_depth_mm) and/or another categorical variable (e.g., sex).
Remember to add na.rm = TRUE if any of the returned values are NA.
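As a sketch of the grouped form, assuming mosaic and palmerpenguins are installed, the following computes one variance of bill length per species. The name var_by_species is illustrative.

```r
library(mosaic)
library(palmerpenguins)

# One variance per species; na.rm = TRUE removes missing values within each group
var_by_species <- var(bill_length_mm ~ species, data = penguins, na.rm = TRUE)
var_by_species
```

The result should be a named vector with one entry per level of species, which makes the groups easy to compare side by side.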
Standard Deviation
The standard deviation is the square root of the variance. It also measures spread about the mean and is never negative. It is reported more often than the variance because it has the same units of measurement as the original observations.
\[s=\sqrt{s^2}=\sqrt{\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}{n-1}}\]
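The relationship between the two measures can be checked directly. This sketch uses a small made-up sample (not the penguins data) and illustrative variable names.

```r
# Small made-up sample (hypothetical values, not from the penguins data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

s2 <- var(x)   # variance, in squared units
s <- sd(x)     # standard deviation, in the original units

# The standard deviation should equal the square root of the variance
all.equal(s, sqrt(s2))
```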
Basic Code
For a single quantitative variable, x, here is the general structure for calculating a standard deviation in R using the sd() function from the mosaic package.
sd(~x, data = mydata)
Run the code below to see an example using the quantitative variable bill_length_mm from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g., bill_depth_mm).
Notice the returned value of NA. The function needs another argument, na.rm = TRUE, which tells R to ignore missing values (NA) when calculating the standard deviation. Add the argument to the code above:
sd(~x, data = mydata, na.rm = TRUE)
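Applied to the penguins data, the two versions of the call might look like this sketch (assuming mosaic and palmerpenguins are installed; sd_missing and sd_clean are illustrative names).

```r
library(mosaic)
library(palmerpenguins)

# Without na.rm, the missing bill lengths make the result NA
sd_missing <- sd(~bill_length_mm, data = penguins)
sd_missing

# With na.rm = TRUE, missing values are dropped first
sd_clean <- sd(~bill_length_mm, data = penguins, na.rm = TRUE)
sd_clean   # reported in millimeters, the same units as bill_length_mm
```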
Multiple Standard Deviations Across Groups
When we want to calculate the standard deviation of a quantitative variable (y) measured across the values/groups of a categorical variable (x), it is a simple modification to the single standard deviation code.
sd(y ~ x, data = mydata)
Run the code below to see an example using the quantitative variable bill_length_mm and species from the penguins data. Then replace bill_length_mm with another quantitative variable from the penguins data (e.g., bill_depth_mm) and/or another categorical variable (e.g., sex).
Remember to add na.rm = TRUE if any of the returned values are NA.
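A sketch of the grouped standard deviation for the penguins data, assuming mosaic and palmerpenguins are installed (sd_by_species is an illustrative name):

```r
library(mosaic)
library(palmerpenguins)

# One standard deviation per species; na.rm = TRUE removes missing values within each group
sd_by_species <- sd(bill_length_mm ~ species, data = penguins, na.rm = TRUE)
sd_by_species
```

Because each value is in millimeters, the groups can be compared directly against typical bill lengths.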