tidyverse.Rmd

# Tidyverse


```{r tvsetup, include=FALSE, eval=TRUE, cache=FALSE}
knitr::opts_chunk$set(echo = T, eval=T, cache.rebuild = F)
```

```{r tidyhex, out.width='10%', echo=FALSE}
knitr::include_graphics('img/hex_tidyverse.svg')
```

## What is the Tidyverse?

The tidyverse consists of a few key packages:

- <span class="pack">ggplot2</span>: data visualization
- <span class="pack">tibble</span>: tibbles, a modern re-imagining of data frames
- <span class="pack">tidyr</span>: data tidying
- <span class="pack">readr</span>: data import
- <span class="pack">purrr</span>: functional programming, e.g. alternate approaches to apply
- <span class="pack">dplyr</span>: data manipulation

And of course the <span class="pack">tidyverse</span> package itself, which will load all of the above in a way that will avoid naming conflicts.

```{r tv_load, eval=FALSE}
library(tidyverse)
```

```{r tv_startup_message, echo=FALSE, message=TRUE}
"Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -------------------------
filter(): dplyr, stats
lag():    dplyr, stats" %>% 
  message()
```


In addition, there are other packages like <span class="pack">lubridate</span>, <span class="pack">rvest</span>, <span class="pack">stringr</span> and others in the **hadleyverse** that are also greatly useful.

## What is Tidy?

*Tidy data* refers to data arranged in a way that makes data processing, analysis, and visualization simpler.

In a tidy data set: 

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.

Think *long* before *wide*.


## dplyr

<span class="pack">dplyr</span> provides a grammar of data manipulation (like <span class="pack">ggplot2</span> does for visualization).  It is the next iteration of <span class="pack">plyr</span>, but <span class="pack">plyr</span> is deprecated and no longer used.  It's focused on tools for working with data frames, with over 100 functions that might be of specific use to you.

It has three main goals:

- Make the most important data manipulation tasks easier.
- Do them faster.
- Use the same interface to work with data frames, data tables or a database.


Some key operations include:


- <span class="func">select</span>: grab columns
    - select helpers: <span class="func">one\_of</span>, <span class="func">starts\_with</span>, <span class="func">num_range</span> etc.
- <span class="func">filter</span>/<span class="func">slice</span>: grab rows
- <span class="func">group_by</span>: grouped operations
- <span class="func">mutate</span>/<span class="func">transmute</span>: create new variables
- <span class="func">summarize</span>: summarize/aggregate

There are various (SQL-like) join/merge functions:

- <span class="func">inner_join</span>, <span class="func">left_join</span> etc.

And there are a lot of little things like:

- <span class="func">n</span>, <span class="func">n\_distinct</span>, <span class="func">nth</span>, <span class="func">n\_groups</span>, <span class="func">count</span>, <span class="func">recode</span>, <span class="func">between</span>

In addition, there is no need to quote variable names.


### An example

Let's say we want to select from our data the following variables:

  - Start with the **ID** variable
  - The variables **X1** through **X10**, which are not all grouped together, and there are many more *X\** columns
  - The variables **var1** and **var2**, which are the only variables with *var* in their name
  - Any variable with a name that starts with **XYZ**
    
How might we go about this in a dataset of possibly hundreds or even thousands of columns? There are several base R approaches that we could go with, but often they will be tedious, or require multiple objects to be created just to get the columns you want.  Let's start with the worst choice.

```{r baseRexample1a, eval=FALSE}
newData = oldData[,c(1,2,3,4, etc.)]
```

Using numeric indexes, or rather *magic numbers*, is not conducive to readability or reproducibility.  If anything changes about the data columns, the numbers may no longer be applicable, and you'd have to redo the line again.

We could name the variables explicitly.

```{r baseRexample1b, eval=FALSE}
newData = oldData[,c('ID','X1', 'X2', etc.)]
```

This would be fine if there are only a handful.  But if you're trying to reduce a 1000 column data set to several dozen it's tedious, and generally not pretty regardless.

A more advanced alternative regards a two-step approach with [regular expressions][Regular Expressions]. This requires that you know something about regex (and you should), but it is difficult to read/understand by those who don't, and often by even yourself if it's more complicated.  In any case, you first will need to create an object that represents the column names first, otherwise it looks unwieldy if used within brackets or a function like <span class="func">subset</span>.

```{r baseRexample1c, eval=FALSE}
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep(colnames(oldData), '^XYZ', value=T))

newData = oldData[,cols]

# or via subset
newData = subset(oldData, select = cols)
```


Now consider there is even more to do. What if you also want observations where **Z** is **Yes**, Q is **No**, and only the observations with the top 50 values of **var2**, ordered by **var1** (descending)?  Probably the more straightforward way in R to do so would be something like the following, where each part is broken out and we continuously write over the object as we modify it.

```{r baseRexample2, eval=FALSE}
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]
```

And this is for fairly straightforward operations.


Now consider doing all of the previous in one piped operation.  The <span class="pack">dplyr</span> package will allow us to do something like the following.

```{r pipeExample, eval=FALSE}
newData = oldData %>% 
  select(num_range('X', 1:10), 
         contains('var'), 
         starts_with('XYZ')) %>% 
  filter(Z == 'Yes', 
         Q == 'No') %>% 
  top_n(n=50, var2) %>% 
  arrange(desc(var1))
```

Even if it hadn't been explained before, you might have been able to guess a little as to what was going on.  The code is fairly succinct, we don't have to keep referencing objects repeatedly, and no explicit intermediary objects are created.

<span class="pack">dplyr</span> and piping is an *alternative*. You can do all this sort of stuff with base R, for example, with functions like <span class="func">with</span>, <span class="func">within</span>, <span class="func">subset</span>, <span class="func">transform</span>, etc.  Though the initial base R approach depicted is fairly concise, in general, it can potentially be: 

- more verbose
- less legible
- less amenable to additional data changes
- requires esoteric knowledge (e.g. regular expressions)
- often requires creation of new objects (even if we just want to explore)
- often slower, possibly greatly


## Running Example

The following data was scraped initially scraped from the web as follows.  It is data from the NBA basketball league for the last season with things like player names, position, team name, points per game, field goal percentage, and various other statistics.  We'll use it as an example to demonstrate various functionality found within <span class="pack">dplyr</span>.

```{r basketballDataScrape, eval=FALSE}
library(rvest)

current_year = lubridate::year(Sys.Date())

url = glue::glue("http://www.basketball-reference.com/leagues/NBA_{current_year-1}_totals.html")

bball = read_html(url) %>% 
  html_nodes("#totals_stats") %>% 
  html_table() %>% 
  data.frame() 

save(bball, file='data/bball.RData')
```

However you can just load it into your workspace as below. Note that when initially gathered from the website, the data is all character strings.  We'll fix this later.  The following shows the data as it will eventually be.

```{r load_bball}
load('data/bball.RData')
glimpse(bball[,1:5])
```


## Selecting Columns

Often you do not need the entire data set. While this is easily handled in base R (as shown earlier), it can be more clear to use <span class="func">select</span> in <span class='pack'>dplyr</span>.  Now we won't have to create separate objects, use quotes or $, etc.

```{r select1}
bball %>% 
  select(Player, Tm, Pos) %>% 
  head()
```

What if we want to drop some variables?

```{r select2}
bball %>%     
  select(-Player, -Tm, -Pos)  %>% 
  head()
```

### Helper functions

Sometimes, we have a lot of variables to select, and if they have a common naming scheme, this can be very easy.

```{r select3}
bball %>% 
  select(Player, contains("3P"), ends_with("RB")) %>% 
  arrange(desc(TRB)) %>% 
  head()
```

The <span class="func">select</span> also has helper functions to make selecting columns even easier.  I probably don't even need to explain what's being done above, and this is the power of the tidyverse way.  Here is the list of *helper functions* to be aware of:

- <span class="func">starts_with</span>: starts with a prefix
- <span class="func">ends_with</span>: ends with a suffix
- <span class="func">contains</span>: contains a literal string
- <span class="func">matches</span>: matches a regular expression
- <span class="func">num_range</span>: a numerical range like x01, x02, x03.
- <span class="func">one_of</span>: variables in character vector.
- <span class="func">everything</span>: all variables.


## Filtering Rows

There are repeated header rows in this data[^headers], so we need to drop them.  This is also why everything was character string when we first scraped it, because having any character strings in a column coerces the entire column to be character,  since all elements of a vector [need to be of the same type][vectors].  Character string is chosen over others because anything can be converted to a string, but not everything can be a number.  

Filtering by rows requires the basic indexing knowledge [we talked about before][indexing], especially Boolean indexing. In the following, `Rk`, or rank, is for all intents and purposes just a row id, but if it equals the actual text 'Rk' instead of something else, we know we're dealing with a header row, so we'll drop it.

```{r filter0}
bball = bball %>% 
  filter(Rk != "Rk")
```

- <span class='func'>filter</span> returns rows with matching conditions.
- <span class='func'>slice</span> allows for a numeric indexing approach[^noslice].

Say we want to look at forwards (SF or PF) over the age of 35.  The following will do this, and since some players play on multiple teams, we'll want only the unique information on the variables of interest. The function <span class="func">distinct</span> allows us to do this.


```{r filter1}
bball %>% 
  filter(Age > 35, Pos == "SF" | Pos == "PF") %>% 
  distinct(Player, Pos, Age)              
```

Maybe we want just the first 10 rows.  This is often the case when we perform some operation and need to quickly verify that what we're doing is working in principle.

```{r filter2}
bball %>% 
  slice(1:10)
```

We can use filtering even with variables just created. 

```{r uniteFilterArrange}
bball %>% 
  unite("posTeam", Pos, Tm) %>%         # create a new variable
  filter(posTeam == "SG_GSW") %>%       # use it for filtering
  select(Player, posTeam, Age) %>%      # use it for selection
  arrange(desc(Age))                    # descending order 
```

Being able to use a newly created variable on the fly, possibly only to filter or create some other variable, goes a long way toward easy visualization and generation of desired summary statistics.


## Generating New Data

One of the most common data processing tasks is generating new variables. The function <span class='func'>mutate</span> takes a vector and returns one of the same dimension.  In addition, there is <span class="func">mutate_at</span>, <span class="func">mutate_if</span>, and <span class="func">mutate_all</span> to help with specific scenarios.


To demonstrate, we'll use <span class='func'>mutate_at</span> to make appropriate columns numeric, i.e. everything except `Player`, `Pos`, and ` Tm`.  It takes two inputs, variables and functions to apply.  As there are multiple variables and (potentially) multiple functions, we use the <span class="func">vars</span> and <span class="func">funs</span> functions to denote them[^funs].

```{r mutateAt}
bball = bball %>% 
  mutate(across(c(-Player, -Pos, -Tm), as.numeric))

glimpse(bball[,1:7])
```

Now that the data columns are of the correct type, the following demonstrates how we can use the standard <span class="func">mutate</span> function to create composites of existing variables.

```{r mutate}
bball = bball %>% 
  mutate(
    trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
    effectiveFG  = (FG + (.5 * X3P)) / FGA,
    shootingDif  = trueShooting - FG.
  )

summary(select(bball, shootingDif))  # select and others don't have to be piped to use
```


## Grouping and Summarizing Data

Another very common task is to look at group-based statistics, and we can use <span class='func'>group_by</span> and <span class="func">summarize</span> to help us in this regard[^summarise].  Base R has things like <span class="func">aggregate</span>, <span class="func">by</span>, and <span class="func">tapply</span> for this, but they should not be used, as this approach is much more straightforward, flexible, and faster.

Conceptually we are doing a three-phase task: **split**, **apply**, **combine**.  We split the data into subsets, apply a function, and then combine the results back into a single output.  In applying a function, we may do any of the previously demonstrated tasks: calculate some statistic, generate new data, or even filter to a reduced part of the data.

For this demonstration, I'm going to start putting together several things we've demonstrated thus far. Ultimately we'll create a variable called <span class="objclass">trueShooting</span>, which represents 'true shooting percentage', and get an average for each position, and compare it to the average field goal percentage.

```{r groupby}
bball %>%   
  select(Pos, FG, FGA, FG., FTA, X3P, PTS) %>% 
  mutate(
    trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
    effectiveFG = (FG + (.5 * X3P)) / FGA,
    shootingDif = trueShooting - FG.
  ) %>%  
  group_by(Pos) %>%                                                 
  summarize(
    `Mean FG%` = mean(FG., na.rm = TRUE),
    `Mean True Shooting` = mean(trueShooting, na.rm = TRUE)
  )    
```

We can do even more with grouped data.  Specifically, we can create a new *list-column* in the data, the elements of which can be anything, even the results of an analysis for each group.  As such, we can use <span class="pack">tidyr's</span> <span class="func">unnest</span> to get back to a standard data frame.

To demonstrate, the following will group data by position, then get the correlation between field-goal percentage and free-throw shooting percentage.  Some players are listed with multiple positions, so we will reduce those to whatever their first position is using <span class="func" style = "">case_when</span>.  

```{r do-old, echo=FALSE, eval=FALSE}
bball %>% 
  mutate(
    Pos = case_when(
      Pos == 'PG-SG' ~ 'PG',
      Pos == 'C-PF'  ~ 'C',
      Pos == 'SF-SG' ~ 'SF',
      Pos == 'PF-C'  | Pos == 'PF-SF' ~ 'PF',
      Pos == 'SG-PF' | Pos == 'SG-SF' ~ 'SG',
      TRUE ~ Pos
    )) %>% 
  group_by(Pos) %>%     
  do(FgFt_Corr = cor(.$FG., .$FT., use = 'complete')) %>%
  unnest(FgFt_Corr)

library(nycflights13)
carriers = group_by(flights, carrier)
group_size(carriers)

mods = carriers %>% 
  do(model = lm(arr_delay ~ dep_time, data = .)) 

mods

mods %>%
  summarize(carrier = carrier,
            `Adjusted Rsq` = summary(model)$adj.r.squared) %>%
  head()
```

```{r newdo}
bball %>% 
  mutate(
    Pos = case_when(
      Pos == 'PG-SG' ~ 'PG',
      Pos == 'C-PF'  ~ 'C',
      Pos == 'SF-SG' ~ 'SF',
      Pos == 'PF-C'  | Pos == 'PF-SF' ~ 'PF',
      Pos == 'SG-PF' | Pos == 'SG-SF' ~ 'SG',
      TRUE ~ Pos
    )) %>% 
  nest_by(Pos) %>%     
  mutate(FgFt_Corr = list(cor(data$FG., data$FT., use = 'complete'))) %>% 
  unnest(c(Pos, FgFt_Corr))
```

As a reminder, data frames are lists. As such, anything can go into the 'columns', even regression models!

```{r newdo2}
library(nycflights13)

carriers = group_by(flights, carrier)

group_size(carriers)  # if you're curious, there is a function to quickly get group Ns

mods = flights %>% 
  nest_by(carrier) %>% 
  mutate(model = list(lm(arr_delay ~ dep_time, data = data)) )

mods

mods %>%
  summarize(
    carrier = carrier,
    `Adjusted Rsq` = summary(model)$adj.r.squared,
    coef_dep_time = coef(model)[2]
  )
```


<div class='note'>
You can use <span class="func">group_by</span> on more than one variable, e.g. `group_by(var1, var2)`

<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;"> 
</div>
 

## Renaming Columns

Tibbles in the tidyverse don't really have a problem with variable names starting with numbers or incorporating symbols and spaces.  I would still suggest it is poor practice, because even if your data set looks fine, you'll possibly encounter problems with modeling and visualization packages using that data.  However, as a demonstration, we can 'fix' some of the variable names.

One issue is that when we scraped the data and converted it to a <span class="objclass">data.frame</span>, the names that started with a number, like `3P` for 'three point baskets made', were made into `X3P`, because that's the way R works by default. In addition, `3P%`, i.e. three point percentage made, was made into `3P.` with a dot for the percent sign.  Same goes for the 2P (two-pointers) and FT (free-throw) variables.

We can use rename to change column names. A basic example is as follows.

```{r rename_ex, eval=FALSE}
data %>% 
  rename(new_name = old_name,
         new_name2 = old_name2)
```

Very straightforward.  However, oftentimes we'll need to change *patterns*, as with our current problem.  The following uses <span class="func">str_replace</span> and <span class="func">str_remove</span> from <span class="pack">stringr</span> to look for a pattern in a name, and replace that pattern with some other pattern.  It uses *regular expressions* for the patterns.


```{r rename_ex2}
bball = bball %>%
  rename_with(
    str_replace,      # function
    contains('.'),    # columns
    pattern = '\\.',  # function arguments
    replacement = '%'
  ) %>% 
  rename_with(str_remove, starts_with('X'), pattern = 'X')

colnames(bball)
```


## Merging Data

Merging data is yet another very common data task, as data often comes from multiple sources. In order to do this, we need some common identifier among the sources by which to join them. The following is a list of <span class="pack">dplyr</span> join functions.

<span class="func">inner_join</span>: return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

<span class="func">left_join</span>: return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

<span class="func">right_join</span>: return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

<span class="func">semi_join</span>: return all rows from x where there are matching values in y, keeping just columns from x. It differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

<span class="func">anti_join</span>: return all rows from x where there are not matching values in y, keeping just columns from x.

<span class="func">full_join</span>: return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.


Probably the most common is a left join, where we have one primary data set, and are adding data from another source to it while retaining it as a base. The following is a simple demonstration.


```{r merge_demo, echo=-(1:2), message=TRUE}
band_members = tibble(
  Name = c('Seth', 'Francis', 'Bubba'),
  Band = c('Com Truise', 'Pixies', 'The New Year')
)
band_instruments = tibble(
  Name = c('Francis', 'Bubba', 'Seth'),
  Instrument = c('Guitar', 'Guitar', 'Synthesizer')
)

band_members
band_instruments

left_join(band_members, band_instruments)
```


When we don't have a one to one match, the result of the different types of join will become more apparent.

```{r merge_demo2, echo=-(1:2), message=TRUE}
band_members = tibble(
  Name = c('Seth', 'Francis', 'Bubba', 'Stephen'),
  Band = c('Com Truise', 'Pixies', 'The New Year', 'Pavement')
)
band_instruments = tibble(
  Name = c('Seth', 'Francis', 'Bubba', 'Steve'),
  Instrument = c('Synthesizer', 'Guitar', 'Guitar', 'Rage')
)

band_members
band_instruments

left_join(band_members, band_instruments)  
right_join(band_members, band_instruments)
inner_join(band_members, band_instruments)
full_join(band_members, band_instruments)
anti_join(band_members, band_instruments)
anti_join(band_instruments, band_members)
```

Merges can get quite complex, and involve multiple data sources. In many cases you may have to do a lot of processing before getting to the merge, but <span class="pack">dplyr's</span> joins will help quite a bit.


## Pivoting axes

The <span class="pack">tidyr</span> package can be thought of as a specialized subset of <span class="pack">dplyr's</span> functionality, as well as an update to the previous <span class="pack">reshape</span> and <span class="pack">reshape2</span> packages[^reshape]. Some of its functions for manipulating data you'll want to be familiar with are:

- <span class="func">pivot_longer</span>: convert data from a wider format to longer one
- <span class="func">pivot_wider</span>:  convert data from a longer format to wider one
- <span class="func">unite</span>: paste together multiple columns into one
- <span class="func">separate</span>: complement of <span class="func" style = "">unite</span>
- <span class="func">unnest</span>: expand 'list columns'

The following example shows how we take a 'wide-form' data set, where multiple columns represent different stock prices, and turn it into two columns, one representing stock name, and one for the price.  We need to know which columns to work on, which is the first entry.  This function works very much like <span class="func" style = "">select</span>, where you can use helpers. Then we need to give a name to the column(s) representing the indicators of what were multiple columns in the wide format. And finally we need to specify the column(s) of the values.

```{r pivot}
library(tidyr)

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

stocks %>% head

stocks %>% 
  pivot_longer(
    cols      = -time,   # works similar to using select()
    names_to  = 'stock', # the name of the column that will have column names as labels
    values_to = 'price'  # the name of the column for the values
  ) %>% 
  head()
```

Here is a more complex example where we can handle multiple repeated entries. We additionally add another column for labeling, and posit the separator for the column names.

```{r pivot2}
library(tidyr)

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X_1 = rnorm(10, 0, 1),
  X_2 = rnorm(10, 0, 1),
  Y_1 = rnorm(10, 0, 2),
  Y_2 = rnorm(10, 0, 2),
  Z_1 = rnorm(10, 0, 4),
  Z_2 = rnorm(10, 0, 4)
)

head(stocks)

stocks %>% 
  pivot_longer(
    cols = -time,
    names_to = c('stock', 'entry'),
    names_sep = '_',
    values_to = 'price'
  ) %>% 
  head()
```


Note that the latter is an example of *tidy data* while the former is not.  Why do we generally prefer such data?  Precisely because the most common data operations, grouping, filtering, etc., would work notably more efficiently with such data. This is especially the case for visualization.

The following demonstrates the <span class="func" style = "">separate</span> function utilized for a very common data processing task- dealing with names. Here' we'll separate <span class="objclass">player</span> into first and last names based on the space.

```{r tidyrSpread}
bball %>% 
  separate(Player, into=c('first_name', 'last_name'), sep=' ') %>% 
  select(1:5) %>% 
  head()
```

Note that this won't necessarily apply to every name, so further processing may be required.


## More Tidyverse

- dplyr functions: There are over a hundred utility functions that perform very common tasks.  You really need to be aware of them, as their use will come up often.

- <span class="pack" style = "">broom</span>: Convert statistical analysis objects from R into tidy data frames, so that they can more easily be combined, reshaped and otherwise processed with tools like dplyr, tidyr and ggplot2.

- tidy*: a lot of packages out there are now 'tidy', though not a part of the official tidyverse. Some examples of the ones I've used:
    - <span class="pack" style = "">tidycensus</span>
    - <span class="pack" style = "">tidybayes</span>
    - <span class="pack" style = "">tidytext</span>
    - <span class="pack" style = "">modelr</span>
    
Seriously, there are [a lot](https://www.r-pkg.org/search.html?q=tidy).

## Personal Opinion

The <span class="pack">dplyr</span> grammar is clear for a lot of standard data processing tasks, and some not so common.

Extremely useful for data exploration and visualization.

- No need to create/overwrite existing objects
- Can overwrite columns and use as they are created
- Makes it easy to look at anything, and do otherwise tedious data checks

Drawbacks:

- Not as fast as <span class="pack">data.table</span> or even some base R approaches for many things[^multidplyr]
- The *mindset* can make for unnecessary complication
    - e.g. There is no need to pipe to create a single new variable
- Some approaches, are not very intuitive
- Notably less ability to work with some very common data structures (e.g. matrices)
    

All in all, if you've only been using base R approaches, the tidyverse will change your R life!  It makes all the sorts of things you do all the time easier and clearer.  Highly recommended!


## Tidyverse Exercises

### Exercise 0

Install and load the <span class="pack">dplyr</span> <span class="pack">ggplot2movies</span> packages. Look at the help file for the `movies` data set, which contains data from IMDB.


```{r, eval=FALSE}
install.packages('ggplot2movies')
library(ggplot2movies)
data('movies')
```


### Exercise 1

Using the movies data set, perform each of the following actions separately.

#### Exercise 1a

Use <span class="func">mutate</span> to create a centered version of the rating variable.  A centered variable is one whose mean has been subtracted from it. The process will take the following form:

```{r ex1a_1, eval=FALSE}
data %>% 
  mutate(new_var_name = '?')
```

```{r ex1a_2, echo=FALSE, eval=FALSE}
movies %>% 
  mutate(ratingCen = rating - mean(rating)) 
```

#### Exercise 1b

Use <span class="func">filter</span> to create a new data frame that has only movies from the years 2000 and beyond. Use the greater than or equal operator `>=`.

```{r ex1b, echo=FALSE, eval=FALSE}
movies %>% 
  filter(year >= 2000) 
```

#### Exercise 1c

Use <span class="func">select</span> to create a new data frame that only has the `title`, `year`, `budget`, `length`, `rating` and `votes` variables.  There are at least 3 ways to do this.

```{r ex1c, echo=FALSE, eval=FALSE}
movies %>% 
  select(title, year, budget, length, rating, votes) 
movies %>% 
  select(1:6)
movies %>% 
  select(-num_range('r',1:10), -mpaa, -starts_with('A'), -Comedy, -starts_with('D'), -Romance, -Short)
```


#### Exercise 1d

Rename the `length` column to `length_in_min` (i.e. length in minutes).


### Exercise 2

Use <span class="func">group_by</span> to group the data by year, and <span class="func">summarize</span> to create a new variable that is the average budget. The <span class="func">summarize</span> function works just like <span class="func">mutate</span> in this case.

Use the <span class="func">mean</span> function to get the average, but you'll also need to use the argument `na.rm = TRUE` within it because the earliest years have no budget recorded.


```{r ex2, echo=FALSE, eval=FALSE}
movies %>% 
  group_by(year) %>% 
  summarise(AvgBudget=mean(budget, na.rm=T)) %>% 
  tail
```


### Exercise 3

Use <span class="func">pivot_longer</span> to create a 'tidy' data set from the following.

```{r ex3, eval=FALSE, echo=1:3}
dat = tibble(id = 1:10,
             x = rnorm(10),
             y = rnorm(10))
dat %>% pivot_longer(key = var, value = score, -id)
```


### Exercise 4

Now put several actions together in one set of piped operations. 

- Filter movies released *after* 1990
- select the same variables as before but also the `mpaa`, `Action`, and `Drama` variables
- group by `mpaa` *and* (your choice) `Action` *or* `Drama`
- get the average rating

It should spit out something like the following:

```{r ex4, echo=FALSE}
movies %>%
  filter(year>=1990) %>% 
  select(title, year, budget, length, rating, votes, mpaa, Action, Drama) %>% 
  group_by(mpaa, Drama) %>% 
  summarise(AvgRating = mean(rating))
```

## Python Pandas Notebook

[Available on GitHub](https://github.com/m-clark/data-processing-and-visualization/blob/master/jupyter_notebooks/pandaverse.ipynb)


[^headers]: You may be thinking- 'it's `r lubridate::year(Sys.Date())` and why on earth would anyone do that?!'.  Peruse most sports websites and you'll see that fundamental web design basics escape them.  See also, financial sites.

[^summarise]: As Hadley Wickham is from New Zealand, and his examples use <span class="func">summarise</span>, you'll probably see it about as much as you do the other spelling, especially since it will come up first in autocomplete.

[^noslice]: If you're following good programming practices, you'll likely never use <span class="func">slice</span>.

[^funs]: For more than one function, you could have supplied `funs(function1, function2, ...).

[^reshape]: Some still use <span class="pack">reshape2</span> but there is no reason to and it is no longer developed.

[^multidplyr]: There is <span class="pack">multidplyr</span>, <span class="pack" style = "">dtplyr</span>, and now <span class="pack" style = "">tidyfast</span> to help get as fast as possible for large data processing.