Cleaning Data Homework
Here is the homework for Week 2, which gives you an opportunity to practice your dplyr skills some more.
- In the code above, we frequently used not_cancelled rather than flights as our data. How did this simplify our code? Think especially about the functions we used within summarise().
- You might suspect that there is a relationship between the average delay (on a given day) and the proportion of flights that are cancelled on that day. For example, if there is bad weather, many flights might start off delayed, but then end up cancelled. Let’s test this intuition. First, find the average delay and the proportion of flights cancelled each day. Second, plot them against one another and comment on the relationship. Did our intuition hold?
- No one likes to be delayed when flying. To try and avoid this, you might wonder what hour of the day is least likely to have a departure delay. What hour is it? Also, compute the percentage of flights that leave on time or early in each hour (i.e., the flights you want to find!). What hour of the day are you most likely to find these flights?
- Which carriers are most likely to have a departure delay of at least 30 minutes? Hint: the ifelse() function may be helpful.
- What destination has the smallest average arrival delay?
- BONUS: Load the Lahman package with library(Lahman); it contains data on baseball players and their batting statistics. First, convert the Batting data to a tibble (the tidyverse data structure we’ll cover in a future lecture) by calling: batting <- as_tibble(Lahman::Batting). Then find the players with the best or worst batting averages (batting average is simply the number of hits a player has, divided by the number of at bats they have). Why would this lead you astray? Now condition on players who had at least 500 at bats. How would your answer change? Remember that with a built-in R dataset like Batting, you can write ?Batting in the R Console to display the help file, which will explain what the variables mean.
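
As a nudge on the ifelse() hint in the carrier question, here is a minimal sketch of the general pattern: turn a condition into a 0/1 column, then take its mean within each group to get a proportion. The toy tibble and the names toy, late, and prop_late are made up for illustration; this is not the flights data and not a full solution.

```r
library(dplyr)

# Hypothetical toy data standing in for a much larger table
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "BB", "BB", "BB"),
  dep_delay = c(5, 45, NA, 60, 35, -2)
)

toy_summary <- toy %>%
  filter(!is.na(dep_delay)) %>%                     # drop rows with a missing delay
  mutate(late = ifelse(dep_delay >= 30, 1, 0)) %>%  # 1 if delayed 30+ minutes, else 0
  group_by(carrier) %>%
  summarise(prop_late = mean(late), n = n()) %>%    # mean of a 0/1 column is a proportion
  arrange(desc(prop_late))

toy_summary
```

Keeping n alongside the proportion is a deliberate choice: a proportion computed from only a handful of rows can look extreme, which is worth bearing in mind for the BONUS question as well.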