-
Notifications
You must be signed in to change notification settings - Fork 13
/
tidyverse.Rmd
739 lines (498 loc) · 30.5 KB
/
tidyverse.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
# Tidyverse
```{r tvsetup, include=FALSE, eval=TRUE, cache=FALSE}
knitr::opts_chunk$set(echo = T, eval=T, cache.rebuild = F)
```
```{r tidyhex, out.width='10%', echo=FALSE}
knitr::include_graphics('img/hex_tidyverse.svg')
```
## What is the Tidyverse?
The tidyverse consists of a few key packages:
- <span class="pack">ggplot2</span>: data visualization
- <span class="pack">tibble</span>: tibbles, a modern re-imagining of data frames
- <span class="pack">tidyr</span>: data tidying
- <span class="pack">readr</span>: data import
- <span class="pack">purrr</span>: functional programming, e.g. alternate approaches to apply
- <span class="pack">dplyr</span>: data manipulation
And of course the <span class="pack">tidyverse</span> package itself, which will load all of the above in a way that will avoid naming conflicts.
```{r tv_load, eval=FALSE}
library(tidyverse)
```
```{r tv_startup_message, echo=FALSE, message=TRUE}
"Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages -------------------------
filter(): dplyr, stats
lag(): dplyr, stats" %>%
message()
```
In addition, there are other packages like <span class="pack">lubridate</span>, <span class="pack">rvest</span>, <span class="pack">stringr</span> and others in the **hadleyverse** that are also greatly useful.
## What is Tidy?
*Tidy data* refers to data arranged in a way that makes data processing, analysis, and visualization simpler.
In a tidy data set:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Think *long* before *wide*.
## dplyr
<span class="pack">dplyr</span> provides a grammar of data manipulation (like <span class="pack">ggplot2</span> does for visualization). It is the next iteration of <span class="pack">plyr</span>, but <span class="pack">plyr</span> is deprecated and no longer used. It's focused on tools for working with data frames, with over 100 functions that might be of specific use to you.
It has three main goals:
- Make the most important data manipulation tasks easier.
- Do them faster.
- Use the same interface to work with data frames, data tables or a database.
Some key operations include:
- <span class="func">select</span>: grab columns
- select helpers: <span class="func">one\_of</span>, <span class="func">starts\_with</span>, <span class="func">num_range</span> etc.
- <span class="func">filter</span>/<span class="func">slice</span>: grab rows
- <span class="func">group_by</span>: grouped operations
- <span class="func">mutate</span>/<span class="func">transmute</span>: create new variables
- <span class="func">summarize</span>: summarize/aggregate
There are various (SQL-like) join/merge functions:
- <span class="func">inner_join</span>, <span class="func">left_join</span> etc.
And there are a lot of little things like:
- <span class="func">n</span>, <span class="func">n\_distinct</span>, <span class="func">nth</span>, <span class="func">n\_groups</span>, <span class="func">count</span>, <span class="func">recode</span>, <span class="func">between</span>
In addition, there is no need to quote variable names.
### An example
Let's say we want to select from our data the following variables:
- Start with the **ID** variable
- The variables **X1** through **X10**, which are not all grouped together, and there are many more *X\** columns
- The variables **var1** and **var2**, which are the only variables with *var* in their name
- Any variable with a name that starts with **XYZ**
How might we go about this in a dataset of possibly hundreds or even thousands of columns? There are several base R approaches that we could go with, but often they will be tedious, or require multiple objects to be created just to get the columns you want. Let's start with the worst choice.
```{r baseRexample1a, eval=FALSE}
newData = oldData[,c(1,2,3,4, etc.)]
```
Using numeric indexes, or rather *magic numbers*, is not conducive to readability or reproducibility. If anything changes about the data columns, the numbers may no longer be applicable, and you'd have to redo the line again.
We could name the variables explicitly.
```{r baseRexample1b, eval=FALSE}
newData = oldData[,c('ID','X1', 'X2', etc.)]
```
This would be fine if there are only a handful. But if you're trying to reduce a 1000 column data set to several dozen it's tedious, and generally not pretty regardless.
A more advanced alternative regards a two-step approach with [regular expressions][Regular Expressions]. This requires that you know something about regex (and you should), but it is difficult to read/understand by those who don't, and often by even yourself if it's more complicated. In any case, you first will need to create an object that represents the column names first, otherwise it looks unwieldy if used within brackets or a function like <span class="func">subset</span>.
```{r baseRexample1c, eval=FALSE}
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep(colnames(oldData), '^XYZ', value=T))
newData = oldData[,cols]
# or via subset
newData = subset(oldData, select = cols)
```
Now consider there is even more to do. What if you also want observations where **Z** is **Yes**, Q is **No**, and only the observations with the top 50 values of **var2**, ordered by **var1** (descending)? Probably the more straightforward way in R to do so would be something like the following, where each part is broken out and we continuously write over the object as we modify it.
```{r baseRexample2, eval=FALSE}
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]
```
And this is for fairly straightforward operations.
Now consider doing all of the previous in one piped operation. The <span class="pack">dplyr</span> package will allow us to do something like the following.
```{r pipeExample, eval=FALSE}
newData = oldData %>%
select(num_range('X', 1:10),
contains('var'),
starts_with('XYZ')) %>%
filter(Z == 'Yes',
Q == 'No') %>%
top_n(n=50, var2) %>%
arrange(desc(var1))
```
Even if it hadn't been explained before, you might have been able to guess a little as to what was going on. The code is fairly succinct, we don't have to keep referencing objects repeatedly, and no explicit intermediary objects are created.
<span class="pack">dplyr</span> and piping is an *alternative*. You can do all this sort of stuff with base R, for example, with functions like <span class="func">with</span>, <span class="func">within</span>, <span class="func">subset</span>, <span class="func">transform</span>, etc. Though the initial base R approach depicted is fairly concise, in general, it can potentially be:
- more verbose
- less legible
- less amenable to additional data changes
- requires esoteric knowledge (e.g. regular expressions)
- often requires creation of new objects (even if we just want to explore)
- often slower, possibly greatly
## Running Example
The following data was scraped initially scraped from the web as follows. It is data from the NBA basketball league for the last season with things like player names, position, team name, points per game, field goal percentage, and various other statistics. We'll use it as an example to demonstrate various functionality found within <span class="pack">dplyr</span>.
```{r basketballDataScrape, eval=FALSE}
library(rvest)
current_year = lubridate::year(Sys.Date())
url = glue::glue("http://www.basketball-reference.com/leagues/NBA_{current_year-1}_totals.html")
bball = read_html(url) %>%
html_nodes("#totals_stats") %>%
html_table() %>%
data.frame()
save(bball, file='data/bball.RData')
```
However you can just load it into your workspace as below. Note that when initially gathered from the website, the data is all character strings. We'll fix this later. The following shows the data as it will eventually be.
```{r load_bball}
load('data/bball.RData')
glimpse(bball[,1:5])
```
## Selecting Columns
Often you do not need the entire data set. While this is easily handled in base R (as shown earlier), it can be more clear to use <span class="func">select</span> in <span class='pack'>dplyr</span>. Now we won't have to create separate objects, use quotes or $, etc.
```{r select1}
bball %>%
select(Player, Tm, Pos) %>%
head()
```
What if we want to drop some variables?
```{r select2}
bball %>%
select(-Player, -Tm, -Pos) %>%
head()
```
### Helper functions
Sometimes, we have a lot of variables to select, and if they have a common naming scheme, this can be very easy.
```{r select3}
bball %>%
select(Player, contains("3P"), ends_with("RB")) %>%
arrange(desc(TRB)) %>%
head()
```
The <span class="func">select</span> also has helper functions to make selecting columns even easier. I probably don't even need to explain what's being done above, and this is the power of the tidyverse way. Here is the list of *helper functions* to be aware of:
- <span class="func">starts_with</span>: starts with a prefix
- <span class="func">ends_with</span>: ends with a suffix
- <span class="func">contains</span>: contains a literal string
- <span class="func">matches</span>: matches a regular expression
- <span class="func">num_range</span>: a numerical range like x01, x02, x03.
- <span class="func">one_of</span>: variables in character vector.
- <span class="func">everything</span>: all variables.
## Filtering Rows
There are repeated header rows in this data[^headers], so we need to drop them. This is also why everything was character string when we first scraped it, because having any character strings in a column coerces the entire column to be character, since all elements of a vector [need to be of the same type][vectors]. Character string is chosen over others because anything can be converted to a string, but not everything can be a number.
Filtering by rows requires the basic indexing knowledge [we talked about before][indexing], especially Boolean indexing. In the following, `Rk`, or rank, is for all intents and purposes just a row id, but if it equals the actual text 'Rk' instead of something else, we know we're dealing with a header row, so we'll drop it.
```{r filter0}
bball = bball %>%
filter(Rk != "Rk")
```
- <span class='func'>filter</span> returns rows with matching conditions.
- <span class='func'>slice</span> allows for a numeric indexing approach[^noslice].
Say we want to look at forwards (SF or PF) over the age of 35. The following will do this, and since some players play on multiple teams, we'll want only the unique information on the variables of interest. The function <span class="func">distinct</span> allows us to do this.
```{r filter1}
bball %>%
filter(Age > 35, Pos == "SF" | Pos == "PF") %>%
distinct(Player, Pos, Age)
```
Maybe we want just the first 10 rows. This is often the case when we perform some operation and need to quickly verify that what we're doing is working in principle.
```{r filter2}
bball %>%
slice(1:10)
```
We can use filtering even with variables just created.
```{r uniteFilterArrange}
bball %>%
unite("posTeam", Pos, Tm) %>% # create a new variable
filter(posTeam == "SG_GSW") %>% # use it for filtering
select(Player, posTeam, Age) %>% # use it for selection
arrange(desc(Age)) # descending order
```
Being able to use a newly created variable on the fly, possibly only to filter or create some other variable, goes a long way toward easy visualization and generation of desired summary statistics.
## Generating New Data
One of the most common data processing tasks is generating new variables. The function <span class='func'>mutate</span> takes a vector and returns one of the same dimension. In addition, there is <span class="func">mutate_at</span>, <span class="func">mutate_if</span>, and <span class="func">mutate_all</span> to help with specific scenarios.
To demonstrate, we'll use <span class='func'>mutate_at</span> to make appropriate columns numeric, i.e. everything except `Player`, `Pos`, and ` Tm`. It takes two inputs, variables and functions to apply. As there are multiple variables and (potentially) multiple functions, we use the <span class="func">vars</span> and <span class="func">funs</span> functions to denote them[^funs].
```{r mutateAt}
bball = bball %>%
mutate(across(c(-Player, -Pos, -Tm), as.numeric))
glimpse(bball[,1:7])
```
Now that the data columns are of the correct type, the following demonstrates how we can use the standard <span class="func">mutate</span> function to create composites of existing variables.
```{r mutate}
bball = bball %>%
mutate(
trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
effectiveFG = (FG + (.5 * X3P)) / FGA,
shootingDif = trueShooting - FG.
)
summary(select(bball, shootingDif)) # select and others don't have to be piped to use
```
## Grouping and Summarizing Data
Another very common task is to look at group-based statistics, and we can use <span class='func'>group_by</span> and <span class="func">summarize</span> to help us in this regard[^summarise]. Base R has things like <span class="func">aggregate</span>, <span class="func">by</span>, and <span class="func">tapply</span> for this, but they should not be used, as this approach is much more straightforward, flexible, and faster.
Conceptually we are doing a three-phase task: **split**, **apply**, **combine**. We split the data into subsets, apply a function, and then combine the results back into a single output. In applying a function, we may do any of the previously demonstrated tasks: calculate some statistic, generate new data, or even filter to a reduced part of the data.
For this demonstration, I'm going to start putting together several things we've demonstrated thus far. Ultimately we'll create a variable called <span class="objclass">trueShooting</span>, which represents 'true shooting percentage', and get an average for each position, and compare it to the average field goal percentage.
```{r groupby}
bball %>%
select(Pos, FG, FGA, FG., FTA, X3P, PTS) %>%
mutate(
trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
effectiveFG = (FG + (.5 * X3P)) / FGA,
shootingDif = trueShooting - FG.
) %>%
group_by(Pos) %>%
summarize(
`Mean FG%` = mean(FG., na.rm = TRUE),
`Mean True Shooting` = mean(trueShooting, na.rm = TRUE)
)
```
We can do even more with grouped data. Specifically, we can create a new *list-column* in the data, the elements of which can be anything, even the results of an analysis for each group. As such, we can use <span class="pack">tidyr's</span> <span class="func">unnest</span> to get back to a standard data frame.
To demonstrate, the following will group data by position, then get the correlation between field-goal percentage and free-throw shooting percentage. Some players are listed with multiple positions, so we will reduce those to whatever their first position is using <span class="func" style = "">case_when</span>.
```{r do-old, echo=FALSE, eval=FALSE}
bball %>%
mutate(
Pos = case_when(
Pos == 'PG-SG' ~ 'PG',
Pos == 'C-PF' ~ 'C',
Pos == 'SF-SG' ~ 'SF',
Pos == 'PF-C' | Pos == 'PF-SF' ~ 'PF',
Pos == 'SG-PF' | Pos == 'SG-SF' ~ 'SG',
TRUE ~ Pos
)) %>%
group_by(Pos) %>%
do(FgFt_Corr = cor(.$FG., .$FT., use = 'complete')) %>%
unnest(FgFt_Corr)
library(nycflights13)
carriers = group_by(flights, carrier)
group_size(carriers)
mods = carriers %>%
do(model = lm(arr_delay ~ dep_time, data = .))
mods
mods %>%
summarize(carrier = carrier,
`Adjusted Rsq` = summary(model)$adj.r.squared) %>%
head()
```
```{r newdo}
bball %>%
mutate(
Pos = case_when(
Pos == 'PG-SG' ~ 'PG',
Pos == 'C-PF' ~ 'C',
Pos == 'SF-SG' ~ 'SF',
Pos == 'PF-C' | Pos == 'PF-SF' ~ 'PF',
Pos == 'SG-PF' | Pos == 'SG-SF' ~ 'SG',
TRUE ~ Pos
)) %>%
nest_by(Pos) %>%
mutate(FgFt_Corr = list(cor(data$FG., data$FT., use = 'complete'))) %>%
unnest(c(Pos, FgFt_Corr))
```
As a reminder, data frames are lists. As such, anything can go into the 'columns', even regression models!
```{r newdo2}
library(nycflights13)
carriers = group_by(flights, carrier)
group_size(carriers) # if you're curious, there is a function to quickly get group Ns
mods = flights %>%
nest_by(carrier) %>%
mutate(model = list(lm(arr_delay ~ dep_time, data = data)) )
mods
mods %>%
summarize(
carrier = carrier,
`Adjusted Rsq` = summary(model)$adj.r.squared,
coef_dep_time = coef(model)[2]
)
```
<div class='note'>
You can use <span class="func">group_by</span> on more than one variable, e.g. `group_by(var1, var2)`
<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;">
</div>
## Renaming Columns
Tibbles in the tidyverse don't really have a problem with variable names starting with numbers or incorporating symbols and spaces. I would still suggest it is poor practice, because even if your data set looks fine, you'll possibly encounter problems with modeling and visualization packages using that data. However, as a demonstration, we can 'fix' some of the variable names.
One issue is that when we scraped the data and converted it to a <span class="objclass">data.frame</span>, the names that started with a number, like `3P` for 'three point baskets made', were made into `X3P`, because that's the way R works by default. In addition, `3P%`, i.e. three point percentage made, was made into `3P.` with a dot for the percent sign. Same goes for the 2P (two-pointers) and FT (free-throw) variables.
We can use rename to change column names. A basic example is as follows.
```{r rename_ex, eval=FALSE}
data %>%
rename(new_name = old_name,
new_name2 = old_name2)
```
Very straightforward. However, oftentimes we'll need to change *patterns*, as with our current problem. The following uses <span class="func">str_replace</span> and <span class="func">str_remove</span> from <span class="pack">stringr</span> to look for a pattern in a name, and replace that pattern with some other pattern. It uses *regular expressions* for the patterns.
```{r rename_ex2}
bball = bball %>%
rename_with(
str_replace, # function
contains('.'), # columns
pattern = '\\.', # function arguments
replacement = '%'
) %>%
rename_with(str_remove, starts_with('X'), pattern = 'X')
colnames(bball)
```
## Merging Data
Merging data is yet another very common data task, as data often comes from multiple sources. In order to do this, we need some common identifier among the sources by which to join them. The following is a list of <span class="pack">dplyr</span> join functions.
<span class="func">inner_join</span>: return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
<span class="func">left_join</span>: return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
<span class="func">right_join</span>: return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
<span class="func">semi_join</span>: return all rows from x where there are matching values in y, keeping just columns from x. It differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
<span class="func">anti_join</span>: return all rows from x where there are not matching values in y, keeping just columns from x.
<span class="func">full_join</span>: return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
Probably the most common is a left join, where we have one primary data set, and are adding data from another source to it while retaining it as a base. The following is a simple demonstration.
```{r merge_demo, echo=-(1:2), message=TRUE}
band_members = tibble(
Name = c('Seth', 'Francis', 'Bubba'),
Band = c('Com Truise', 'Pixies', 'The New Year')
)
band_instruments = tibble(
Name = c('Francis', 'Bubba', 'Seth'),
Instrument = c('Guitar', 'Guitar', 'Synthesizer')
)
band_members
band_instruments
left_join(band_members, band_instruments)
```
When we don't have a one to one match, the result of the different types of join will become more apparent.
```{r merge_demo2, echo=-(1:2), message=TRUE}
band_members = tibble(
Name = c('Seth', 'Francis', 'Bubba', 'Stephen'),
Band = c('Com Truise', 'Pixies', 'The New Year', 'Pavement')
)
band_instruments = tibble(
Name = c('Seth', 'Francis', 'Bubba', 'Steve'),
Instrument = c('Synthesizer', 'Guitar', 'Guitar', 'Rage')
)
band_members
band_instruments
left_join(band_members, band_instruments)
right_join(band_members, band_instruments)
inner_join(band_members, band_instruments)
full_join(band_members, band_instruments)
anti_join(band_members, band_instruments)
anti_join(band_instruments, band_members)
```
Merges can get quite complex, and involve multiple data sources. In many cases you may have to do a lot of processing before getting to the merge, but <span class="pack">dplyr's</span> joins will help quite a bit.
## Pivoting axes
The <span class="pack">tidyr</span> package can be thought of as a specialized subset of <span class="pack">dplyr's</span> functionality, as well as an update to the previous <span class="pack">reshape</span> and <span class="pack">reshape2</span> packages[^reshape]. Some of its functions for manipulating data you'll want to be familiar with are:
- <span class="func">pivot_longer</span>: convert data from a wider format to longer one
- <span class="func">pivot_wider</span>: convert data from a longer format to wider one
- <span class="func">unite</span>: paste together multiple columns into one
- <span class="func">separate</span>: complement of <span class="func" style = "">unite</span>
- <span class="func">unnest</span>: expand 'list columns'
The following example shows how we take a 'wide-form' data set, where multiple columns represent different stock prices, and turn it into two columns, one representing stock name, and one for the price. We need to know which columns to work on, which is the first entry. This function works very much like <span class="func" style = "">select</span>, where you can use helpers. Then we need to give a name to the column(s) representing the indicators of what were multiple columns in the wide format. And finally we need to specify the column(s) of the values.
```{r pivot}
library(tidyr)
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks %>% head
stocks %>%
pivot_longer(
cols = -time, # works similar to using select()
names_to = 'stock', # the name of the column that will have column names as labels
values_to = 'price' # the name of the column for the values
) %>%
head()
```
Here is a more complex example where we can handle multiple repeated entries. We additionally add another column for labeling, and posit the separator for the column names.
```{r pivot2}
library(tidyr)
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X_1 = rnorm(10, 0, 1),
X_2 = rnorm(10, 0, 1),
Y_1 = rnorm(10, 0, 2),
Y_2 = rnorm(10, 0, 2),
Z_1 = rnorm(10, 0, 4),
Z_2 = rnorm(10, 0, 4)
)
head(stocks)
stocks %>%
pivot_longer(
cols = -time,
names_to = c('stock', 'entry'),
names_sep = '_',
values_to = 'price'
) %>%
head()
```
Note that the latter is an example of *tidy data* while the former is not. Why do we generally prefer such data? Precisely because the most common data operations, grouping, filtering, etc., would work notably more efficiently with such data. This is especially the case for visualization.
The following demonstrates the <span class="func" style = "">separate</span> function utilized for a very common data processing task- dealing with names. Here' we'll separate <span class="objclass">player</span> into first and last names based on the space.
```{r tidyrSpread}
bball %>%
separate(Player, into=c('first_name', 'last_name'), sep=' ') %>%
select(1:5) %>%
head()
```
Note that this won't necessarily apply to every name, so further processing may be required.
## More Tidyverse
- dplyr functions: There are over a hundred utility functions that perform very common tasks. You really need to be aware of them, as their use will come up often.
- <span class="pack" style = "">broom</span>: Convert statistical analysis objects from R into tidy data frames, so that they can more easily be combined, reshaped and otherwise processed with tools like dplyr, tidyr and ggplot2.
- tidy*: a lot of packages out there are now 'tidy', though not a part of the official tidyverse. Some examples of the ones I've used:
- <span class="pack" style = "">tidycensus</span>
- <span class="pack" style = "">tidybayes</span>
- <span class="pack" style = "">tidytext</span>
- <span class="pack" style = "">modelr</span>
Seriously, there are [a lot](https://www.r-pkg.org/search.html?q=tidy).
## Personal Opinion
The <span class="pack">dplyr</span> grammar is clear for a lot of standard data processing tasks, and some not so common.
Extremely useful for data exploration and visualization.
- No need to create/overwrite existing objects
- Can overwrite columns and use as they are created
- Makes it easy to look at anything, and do otherwise tedious data checks
Drawbacks:
- Not as fast as <span class="pack">data.table</span> or even some base R approaches for many things[^multidplyr]
- The *mindset* can make for unnecessary complication
- e.g. There is no need to pipe to create a single new variable
- Some approaches, are not very intuitive
- Notably less ability to work with some very common data structures (e.g. matrices)
All in all, if you've only been using base R approaches, the tidyverse will change your R life! It makes all the sorts of things you do all the time easier and clearer. Highly recommended!
## Tidyverse Exercises
### Exercise 0
Install and load the <span class="pack">dplyr</span> <span class="pack">ggplot2movies</span> packages. Look at the help file for the `movies` data set, which contains data from IMDB.
```{r, eval=FALSE}
install.packages('ggplot2movies')
library(ggplot2movies)
data('movies')
```
### Exercise 1
Using the movies data set, perform each of the following actions separately.
#### Exercise 1a
Use <span class="func">mutate</span> to create a centered version of the rating variable. A centered variable is one whose mean has been subtracted from it. The process will take the following form:
```{r ex1a_1, eval=FALSE}
data %>%
mutate(new_var_name = '?')
```
```{r ex1a_2, echo=FALSE, eval=FALSE}
movies %>%
mutate(ratingCen = rating - mean(rating))
```
#### Exercise 1b
Use <span class="func">filter</span> to create a new data frame that has only movies from the years 2000 and beyond. Use the greater than or equal operator `>=`.
```{r ex1b, echo=FALSE, eval=FALSE}
movies %>%
filter(year >= 2000)
```
#### Exercise 1c
Use <span class="func">select</span> to create a new data frame that only has the `title`, `year`, `budget`, `length`, `rating` and `votes` variables. There are at least 3 ways to do this.
```{r ex1c, echo=FALSE, eval=FALSE}
movies %>%
select(title, year, budget, length, rating, votes)
movies %>%
select(1:6)
movies %>%
select(-num_range('r',1:10), -mpaa, -starts_with('A'), -Comedy, -starts_with('D'), -Romance, -Short)
```
#### Exercise 1d
Rename the `length` column to `length_in_min` (i.e. length in minutes).
### Exercise 2
Use <span class="func">group_by</span> to group the data by year, and <span class="func">summarize</span> to create a new variable that is the average budget. The <span class="func">summarize</span> function works just like <span class="func">mutate</span> in this case.
Use the <span class="func">mean</span> function to get the average, but you'll also need to use the argument `na.rm = TRUE` within it because the earliest years have no budget recorded.
```{r ex2, echo=FALSE, eval=FALSE}
movies %>%
group_by(year) %>%
summarise(AvgBudget=mean(budget, na.rm=T)) %>%
tail
```
### Exercise 3
Use <span class="func">pivot_longer</span> to create a 'tidy' data set from the following.
```{r ex3, eval=FALSE, echo=1:3}
dat = tibble(id = 1:10,
x = rnorm(10),
y = rnorm(10))
dat %>% pivot_longer(key = var, value = score, -id)
```
### Exercise 4
Now put several actions together in one set of piped operations.
- Filter movies released *after* 1990
- select the same variables as before but also the `mpaa`, `Action`, and `Drama` variables
- group by `mpaa` *and* (your choice) `Action` *or* `Drama`
- get the average rating
It should spit out something like the following:
```{r ex4, echo=FALSE}
movies %>%
filter(year>=1990) %>%
select(title, year, budget, length, rating, votes, mpaa, Action, Drama) %>%
group_by(mpaa, Drama) %>%
summarise(AvgRating = mean(rating))
```
## Python Pandas Notebook
[Available on GitHub](https://github.com/m-clark/data-processing-and-visualization/blob/master/jupyter_notebooks/pandaverse.ipynb)
[^headers]: You may be thinking- 'it's `r lubridate::year(Sys.Date())` and why on earth would anyone do that?!'. Peruse most sports websites and you'll see that fundamental web design basics escape them. See also, financial sites.
[^summarise]: As Hadley Wickham is from New Zealand, and his examples use <span class="func">summarise</span>, you'll probably see it about as much as you do the other spelling, especially since it will come up first in autocomplete.
[^noslice]: If you're following good programming practices, you'll likely never use <span class="func">slice</span>.
[^funs]: For more than one function, you could have supplied `funs(function1, function2, ...).
[^reshape]: Some still use <span class="pack">reshape2</span> but there is no reason to and it is no longer developed.
[^multidplyr]: There is <span class="pack">multidplyr</span>, <span class="pack" style = "">dtplyr</span>, and now <span class="pack" style = "">tidyfast</span> to help get as fast as possible for large data processing.