View(sw_films)
PA 10: Exploring the Star Wars Universe
This task is complex. It requires many different types of abilities. Everyone will be good at some of these abilities but nobody will be good at all of them. In order to produce the best product possible, you will need to use the skills of each member of your group.
Goals for the Activity
- Apply methods that use lists and iteration (using purrr) to extract data from various non-tabular data sets.
- Create new data sets through the cleaning, organization, and joining of data from various sources
- Create visualizations to explore the data
- May the force be with you!
Throughout the activity, be sure to follow the Style Guide by doing the following:
- load the appropriate packages at the beginning of the Rmarkdown
- use proper spacing
- name all code chunks
- comment at least once in each code chunk to describe why you made your coding decisions
- add appropriate labels to all graphic axes
Review: Extracting Information from Different Data Sets
Here is information about the first 7 Star Wars films:
We are going to explore the data contained in several lists similar to this one (and the previously explored sw_people), combining skills from all of our previous R code learning experiences.
How do the following two codes compare?
sw_films[[4]][["title"]]
sw_films %>% pluck(4, "title")
Insert Answer Here
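As a point of comparison, here is a minimal sketch of both indexing styles on a small made-up list. The object `toy_films` and its contents are hypothetical stand-ins, not the real sw_films data, and the sketch is only meant to help you explore the behavior yourself.

```r
library(purrr)

# Hypothetical toy list shaped loosely like sw_films (not the real data)
toy_films <- list(
  list(title = "Film A", episode_id = 1),
  list(title = "Film B", episode_id = 2)
)

# Base R: index the second element, then the "title" element inside it
base_title <- toy_films[[2]][["title"]]

# purrr: pluck() walks the same path in a single call
pluck_title <- toy_films %>% pluck(2, "title")

# Both approaches reach the same value
identical(base_title, pluck_title)

# One difference: pluck() returns NULL when the path does not exist,
# whereas toy_films[[5]] would stop with a "subscript out of bounds" error
missing_title <- pluck(toy_films, 5, "title")
is.null(missing_title)
```

Try swapping in other positions or names to see where the two styles behave the same and where they differ.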
Suppose we want to pull out just the titles as a character vector. Select the correct code (comment out the rest) to perform this action. You may want to run each line of code one at a time (remember Ctrl + Enter for Windows with your cursor on that line of code).
#comment out the incorrect codes
sw_films %>% map("title")
sw_films %>% map_chr("title")
sw_films %>% map_dfc("title")
Suppose we want to apply a function to count the number of specific kinds of ships and vehicles in our data.
Notice that for each film, the “starships” vector contains links to information on those starships (though note this data is out of date and should be linked at swapi.dev, not swapi.co).
sw_films[[1]][["starships"]]
So if we count the number of webpage links, that tells us the number of starships that appear in that movie. Here are three different ways to count the number of urls under starships. Can you think of another? (It is ok if you can’t.) Compare and contrast how the three codes work differently to do the same thing.
sw_films %>% map(., "starships") %>% map_dbl(~length(.))
map_dbl(sw_films, ~length(.x$starships))
sw_films %>% map_dbl(., ~length(.x$starships))
Insert Answer
Part 1: Evaluating Hyperdrive in the Star Wars Episodes
We will use the third method from the previous section to extract out the information we want from sw_films. For each row, specify if we should use a regular map(), map_dbl(), or map_chr().
NOTE: Sometimes code like this gets a little finicky in R if you try to run it with Ctrl + Enter. Instead, use the code chunk green arrow to run the whole code chunk or highlight all of the code and then use the shortcut to run it.
sw_ships_1 <- sw_films %>% {
  tibble(
    title = map____(., "title"), #character
    episode = map____(., "episode_id"), #numeric
    starships = map____(., ~length(.x$starships)), #numeric
    vehicles = map____(., ~length(.x$vehicles)), #numeric
    planets = map____(., ~length(.x$planets)) #numeric
  )}
sw_ships_1
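The curly braces around tibble() in this pattern matter: without them, %>% would also pass the whole list as the first argument of tibble(). Below is a minimal sketch of the brace pattern on a made-up list. The names `toy_films` and `toy_ships` and all values are hypothetical, and the typed map functions shown are simply one way to build such a table.

```r
library(purrr)
library(tibble)

# Hypothetical toy list shaped loosely like sw_films (not the real data)
toy_films <- list(
  list(title = "Film A", episode_id = 1, starships = c("url1", "url2")),
  list(title = "Film B", episode_id = 2, starships = character(0))
)

# Inside the braces, `.` refers to the piped list and is only used
# where we write it explicitly
toy_ships <- toy_films %>% {
  tibble(
    title = map_chr(., "title"),                    # one string per film
    episode = map_dbl(., "episode_id"),             # one number per film
    starships = map_dbl(., ~ length(.x$starships))  # count of urls per film
  )
}

toy_ships
```

Each typed map function returns an atomic vector with one value per list element, which is what a tibble column needs.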
Let’s do a bit more data cleaning to 1) assign the Trilogy classification to each episode, 2) calculate the total number of starships (which have hyperdrive) and vehicles (which do not have hyperdrive), and 3) calculate the proportion of total ships that have hyperdrive. Fill in the missing codes.
sw_ships <- sw_ships_1 ____
  #create a new variable called trilogy
  ____(trilogy = case_when(episode %in% 1:3 ~ trilogies[1],
                           episode %in% 4:6 ~ trilogies[2],
                           episode %in% 7 ~ trilogies[3])) %>%
  #create a new variable called total_ships which adds vehicles and starships together
  ____(total_ships = vehicles + starships) %>%
  #create a new variable called prop that calculate the percent hyperdrive
  ____(prop = starships / total_ships * 100)
Hyperdrive Use Across Films
Now, let’s make a plot examining how often hyperdrive ships appear in each episode. You will modify the code below to replicate the following visualization:
Fill in the blanks with the appropriate functions.
sw_ships %>%
  #be sure to order titles by order/episode
  ggplot(aes(y = ____(title, desc(episode)),
             x = prop)) +
  #we want bars but our data is already summarized!
  geom_____(aes(fill = trilogy)) +
  labs(
    title = "The Rise of Hyperdrive",
    subtitle = "Percentage of Ships with Hyperdrive Capability"
  ) +
  #you may need to install `scales` package if you haven't already
  scale_x_continuous(labels = scales::label_percent(scale = 1)) +
  theme_minimal() +
  #what aesthetic do we modify to change the bar color
  scale______viridis_d(end = 0.8) +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "bottom",
        legend.title = element_blank())
Canvas Quiz Question 1
Which movie has the second highest percentage of Hyperdrive ships?
Insert Answer Here
Hyperdrive Prevalence within the Universe
We can also look at a plot to see if there is a correlation between the total number of ships and the number with hyperdrive (starships).
Fill in the blanks with the appropriate functions.
sw_ships %>%
  ggplot(aes(x = total_ships,
             y = starships)) +
  #make points
  geom_____(aes(color = trilogy)) +
  #fit a model
  geom_____(method = "lm") +
  #what does geom_text() do?
  geom_text(aes(label = title),
            vjust = -1,
            hjust = "inward",
            size = 2.75) +
  labs(title = "Hyperdrive Correlations",
       subtitle = "The Number of Ships with Hyperdrive vs Total Ships") +
  theme_minimal() +
  #what aesthetic do we want to modify the color of points?
  scale______viridis_d(end = 0.8) +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "bottom",
        legend.title = element_blank())
Canvas Quiz Question 2
What do you notice about the use of hyperdrive type vehicles in the episodes?
Insert Answer
Part 2: The Physical Features of Star Wars Characters
Recall the data for “people” in Star Wars:
View(sw_people)
We want to extract out name, height, and mass as character vectors (for now; we will deal with some issues in height and mass later to change them into double type vectors) and keep films as a list for now. Fill in the correct map type functions for each one.
sw_peeps <- sw_people %>% {
  tibble(
    name = ____(., "name"), #character
    height = ____(., "height"), #character
    mass = ____(., "mass"), #character
    films = map____(., "films") #list
  )}
sw_peeps
Notice that the films column contains lists of urls for each film reference. Let’s pull out that same information from the sw_films data to have the title of the episode and the url as a character vector, and the episode number as a numeric value. Fill in the correct map type functions.
film_names <- sw_films %>% {
  tibble(
    episode_id = ____(., "episode_id"), #double
    episode_name = ____(., "title"), #character
    url = ____(., "url") #character
  )}
film_names
Now we can finish cleaning up our data by doing the following:
- turn height and mass into numeric vectors;
- match the films/urls to their episode_names and assign that back to sw_peeps.
sw_peeps2 <- sw_peeps %>%
  #use a function from readr to extract the numbers and replace "unknown" with na
  mutate(height = parse_____(height, na = "unknown"),
         mass = parse_____(mass, na = "unknown")) %>%
  #unnest the lists in films
  ____(cols = c("films")) %>%
  #join the film data with episodes names to the people data
  _____join(film_names, by = c("films" = "url")) %>%
  #remove the `films` url from the data frame
  ____(-films) %>%
  #add the variable trilogy
  ____(trilogy = case_when(episode_id %in% 1:3 ~ trilogies[1],
                           episode_id %in% 4:6 ~ trilogies[2],
                           episode_id %in% 7 ~ trilogies[3]))
sw_peeps2
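To see what this kind of cleaning pipeline does to the shape of the data, here is a minimal sketch on made-up tables. The objects `toy_peeps` and `toy_film_names` and their values are hypothetical, and the specific functions used (parse_number(), unnest(), left_join(), select()) are one reasonable set of choices, not necessarily the ones intended for the blanks above.

```r
library(readr)
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per person, films stored as a list-column
toy_peeps <- tibble(
  name = c("Person A", "Person B"),
  height = c("172", "unknown"),
  films = list(c("url/1", "url/2"), "url/2")
)

# Hypothetical lookup table from film url to film name
toy_film_names <- tibble(
  url = c("url/1", "url/2"),
  episode_name = c("Film A", "Film B")
)

toy_clean <- toy_peeps %>%
  # text to numbers; the string "unknown" becomes NA
  mutate(height = parse_number(height, na = "unknown")) %>%
  # one row per person-film pair instead of one row per person
  unnest(cols = c(films)) %>%
  # attach the film name that matches each url
  left_join(toy_film_names, by = c("films" = "url")) %>%
  # the url is no longer needed once the name is attached
  select(-films)

toy_clean
```

Note how the row count grows after unnesting: Person A appears in two films, so that person now occupies two rows.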
Size of Characters in the Star Wars Universe
We can now create a plot of height and mass by trilogy group to see if the physique of characters differed across Trilogies (keeping in mind the third set of Trilogies is incomplete in this data set).
sw_peeps2 %>%
  filter(name != "Jabba Desilijic Tiure") %>% #major outlier removed
  #map the correct aesthetics
  ggplot(aes(x = ____,
             y = ____,
             color = ____))+
  geom_point(position = "jitter") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Height of Character (cm)",
       y = "Mass of Character (kg)",
       color = "Trilogy Group",
       title = "Character Characteristics in Star Wars") +
  theme_minimal() +
  scale_color_viridis_d(end = 0.8)
Canvas Quiz Question 3
Write some code to identify who is the heaviest Star Wars character, excluding Jabba Desilijic Tiure (look at the graph to help guide this).
Insert Answer Here
OPTIONAL CHALLENGE PROBLEM
Your professor wants to use purrr to try to generate a height and mass scatterplot for each episode, but I don’t want to type out all that code. Here is where I got so far, but I am not convinced this is the most sophisticated or effective way to do this. Do some research and see if you can find a way to put this process into production!
#start from sw_peeps2, which has numeric height/mass and the episode_name column
plots_sw <- sw_peeps2 %>%
  nest(data = !episode_name) %>%
  mutate(plot = map(data, ~ggplot(., aes(y=mass, x=height)) +
                      geom_point() +
                      geom_smooth(method = "lm", se = FALSE) +
                      labs(title = paste0(episode_name))))
print(plots_sw$plot)
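One thing to watch in the attempt above: inside mutate(), episode_name refers to the whole column, so every plot receives all of the episode names as its title rather than just its own. A possible direction, sketched here on made-up data, is to iterate over the nested data and the names in parallel with map2(). The object `toy_sizes` and its values are hypothetical, and this is only one of several ways to approach the challenge.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(tibble)

# Hypothetical toy data standing in for the cleaned people table
toy_sizes <- tibble(
  episode_name = rep(c("Film A", "Film B"), each = 3),
  height = c(150, 170, 190, 160, 180, 200),
  mass = c(50, 70, 95, 60, 80, 110)
)

toy_plots <- toy_sizes %>%
  # one row per episode, with that episode's rows stored in `data`
  nest(data = !episode_name) %>%
  # map2() pairs each nested data frame (.x) with its own episode name (.y)
  mutate(plot = map2(data, episode_name, ~ ggplot(.x, aes(x = height, y = mass)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = .y)))

# walk() prints each plot for its side effect without returning a list
walk(toy_plots$plot, print)
```

Storing the plots in a list-column keeps each plot next to the episode it describes, which makes later steps such as saving each plot to a file easier to write as another iteration.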