-
Notifications
You must be signed in to change notification settings - Fork 13
/
data_table.Rmd
483 lines (339 loc) · 16 KB
/
data_table.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
# data.table
```{r dtsetup, include=FALSE, eval=TRUE, cache=FALSE}
knitr::opts_chunk$set(eval=T, echo=T, cache = T)
```
```{r dthex, out.width='10%', echo=FALSE}
knitr::include_graphics('img/hex_datatable.png')
```
Another package for data processing that has been useful to many is <span class="pack">data.table</span>. It works in a notably different way than <span class="pack">dplyr</span>. However, you'd use it for the same reasons, e.g. subset, grouping, update, ordered joins etc., but with key advantages in speed and memory efficiency. Like <span class="pack">dplyr</span>, the data objects are both <span class="objclass">data.frames</span> and a package specific class.
```{r dterror, eval=TRUE}
library(data.table)
dt = data.table(x = sample(1:10, 6),
g = 1:3,
y = runif(6))
class(dt)
```
## data.table Basics
In general, <span class="pack">data.table</span> works with brackets as in base R data frames. However, in order to use <span class="pack">data.table</span> effectively you'll need to forget the data frame similarity. The brackets actually work like a function call, with several key arguments. Consider the following notation to start.
```{r dtgeneric, eval=FALSE}
x[i, j, by, keyby, with = TRUE, ...]
```
Importantly: *you don't use the brackets as you would with data.frames*. What **i** and **j** can be are fairly complex.
In general, you use **i** for filtering by rows.
```{r dtrows, results='hold'}
dt[2] # rows! not columns as with standard data.frame
dt[2,]
```
You use **j** to select (by name!) or create new columns. We can define a new column with the <span class="func">:=</span> operator.
```{r dtcols, results='hold'}
dt[,x]
dt[,z := x+y] # dt now has a new column
dt[,z]
dt[g > 1, mean(z), by = g]
dt
```
Because **j** is an argument, dropping columns is awkward.
```{r dtdropcol, eval=1:2, results='hold'}
dt[, -y] # creates negative values of y
dt[, -'y', with = F] # drops y, but now needs quotes
dt[, y := NULL] # drops y, but this is just a base R approach
dt$y = NULL
```
Data table does not make unnecessary copies. For example if we do the following...
```{r dtmemory}
DT = data.table(A = 5:1, B = letters[5:1])
DT2 = DT
DT3 = copy(DT)
```
DT2 and DT are just names for the same table. You'd actually need to use the copy function to make an explicit copy, otherwise whatever you do to DT2 will be done to DT.
```{r dtnocopy}
DT2[,q:=1]
DT
```
```{r dtcopy}
DT3
```
## Grouped Operations
We can now attempt a 'group-by' operation, along with creation of a new variable. Note that these operations actually modify the <span class="objclass">dt</span> object *in place*, a key distinction with <span class="pack">dplyr</span>. Fewer copies means less of a memory hit.
```{r dtgroup}
dt1 = dt2 = dt
dt[, sum(x, y), by = g] # sum of all x and y values
dt1[, mysum := sum(x), by = g] # add new variable to the original data
dt1
```
We can also create groupings on the fly. For a new summary data set, we'll take the following approach- we create a grouping based on whether `g` is a value of one or not, then get the mean and sum of `x` for those two categories. The corresponding <span class="pack" style = "">dplyr</span> approach is also shown (but not evaluated) for comparison.
```{r dtgroup2, eval=1}
dt2[, list(mean_x = mean(x), sum_x = sum(x)), by = g == 1]
dt2 %>%
group_by(g == 1) %>%
summarise(mean_x = mean(x), sum_x = sum(x))
```
## Faster!
As mentioned, the reason to use <span class="pack">data.table</span> is speed. If you have large data or large operations it'll be useful.
### Joins
Joins can not only be faster but also easy to do. Note that the `i` argument can be a <span class="objclass">data.table</span> object itself. I compare its speed to the comparable <span class="pack">dplyr's</span> <span class="func">left_join</span> function.
```{r dtjoin, eval=FALSE}
dt1 = setkey(dt1, x)
dt1[dt2]
dt1_df = dt2_df = as.data.frame(dt1)
left_join(dt1_df, dt2_df, by = 'x')
```
```{r dtjoin_timing, echo=FALSE}
# load('data/timing.RData')
# timing_join %>%
# kable_df(justify='lr')
dt1 = setkey(dt1, x)
dt1_df = dt2_df = as.data.frame(dt1)
# dt1[dt2]
# timing_join =
microbenchmark::microbenchmark(dt_join = dt1[dt2],
dplyr_join = left_join(dt1_df, dt2_df, by='x')) %>%
summary() %>%
select(expr, mean) %>%
mutate(mean = round(mean, 2)) %>%
rename(func = expr, `mean (microseconds)` = mean) %>%
kable_df(justify = 'lr')
```
### Group by
We can use the <span class="func">setkey</span> function to order a data set by a certain column(s). This ordering is done by reference; again, no copy is made. Doing this will allow for faster grouped operations, though you likely will only see the speed gain with very large data. The timing regards creating a new variable
```{r dtsetkey, echo=1:10}
test_dt0 = data.table(x = rnorm(10000000),
g = sample(letters, 10000000, replace = T))
test_dt1 = copy(test_dt0)
test_dt2 = setkey(test_dt1, g)
identical(test_dt0, test_dt1)
identical(test_dt1, test_dt2)
test_dt0 = test_dt0[, mean := mean(x), by = g]
test_dt1 = test_dt1[, mean := mean(x), by = g]
test_dt2 = test_dt2[, mean := mean(x), by = g]
# timing_group_by_1 %>% kable_df(justify='lr')
timing_group_by_1 =
microbenchmark::microbenchmark(
test_dt0 = test_dt0[, mean := mean(x), by = g],
test_dt1 = test_dt1[, mean := mean(x), by = g],
test_dt2 = test_dt2[, mean := mean(x), by = g],
times = 25
) %>%
summary() %>%
select(expr, mean) %>%
mutate(mean = round(mean, 2)) %>%
rename(func = expr, `mean (milliseconds)` = mean)
timing_group_by_1 %>%
kable_df(justify = 'lr')
```
### String matching
The <span class="func">chin</span> function returns a vector of the *positions* of (first) matches of its first argument in its second, where both arguments are character vectors. Essentially it's just like the <span class="func">%in%</span> function for character vectors.
Consider the following. We sample the first 14 letters 1000 times with replacement and see which ones match in a subset of another subset of letters. I compare the same operation to <span class="pack">stringr</span> and the <span class="pack">stringi</span> package whose functionality <span class="pack">stringr</span> using. They are both far slower than <span class="func" style = "">chin</span>.
```{r dtcharacters, echo=2:6, eval=2:8}
set.seed(1234)
lets_1 = sample(letters[1:14], 1000, replace=T)
lets_1 %chin% letters[13:26] %>% head(10)
# stri_detect_regex(lets_1, paste(letters[13:26], collapse='|'))
# str_detect(lets_1, paste(letters[13:26], collapse='|'))
# timing_chmatch %>% kable_df(justify='lr')
#
# library(stringr); library(stringi)
# timing_chmatch =
# microbenchmark::microbenchmark(
# dt = lets_1 %chin% letters[13:26],
# stringi= stri_detect_regex(lets_1, paste(letters[13:26], collapse='|')),
# stringr= str_detect(lets_1, paste(letters[13:26], collapse='|'))
# ) %>%
# summary() %>%
# select(expr, mean) %>%
# mutate(mean=round(mean, 2)) %>%
# rename(func=expr, `mean (microseconds)`=mean)
```
### Reading files
If you use <span class="pack">data.table</span> for nothing else, you'd still want to consider it strongly for reading in large text files. The function <span class="func">fread</span> may be quite useful in being memory efficient too. I compare it to <span class="pack">readr</span>.
```{r dtfread, eval=FALSE}
fread('data/cars.csv')
```
```{r dtfread_timing, echo=FALSE}
library(readr)
timing_read =
microbenchmark::microbenchmark(
dt = fread('data/cars.csv'),
readr = read_csv('data/cars.csv')
) %>%
summary() %>%
select(expr, mean) %>%
mutate(mean=round(mean, 2)) %>%
rename(func=expr, `mean (microseconds)`=mean)
timing_read %>% kable_df(justify='lr')
```
### More speed
The following demonstrates some timings from [here](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega/34167477#34167477). I reproduced it on my own machine based on 50 million observations. The grouped operations that are applied are just a sum and length on a vector.
By the way, never, ever use <span class="func">aggregate</span>. For anything.
```{r dttimings_big, eval=FALSE, echo=FALSE}
library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)
timing_group_by_big = list()
# sapply
timing_group_by_big[["sapply"]] = system.time({
lt = split(x, grp)
r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})
# lapply
timing_group_by_big[["lapply"]] = system.time({
lt = split(x, grp)
r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})
# tapply
timing_group_by_big[["tapply"]] = system.time(
r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)
# by
timing_group_by_big[["by"]] = system.time(
r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# aggregate
timing_group_by_big[["aggregate"]] = system.time(
r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# dplyr
timing_group_by_big[["dplyr"]] = system.time({
df = tibble(x, grp)
r.dplyr = summarise(group_by(df, grp), sum(x), n())
})
# data.table
timing_group_by_big[["data.table"]] = system.time({
dt = setnames(setDT(list(x, grp)), c("x","grp"))
r.data.table = dt[, .(sum(x), .N), grp]
})
save(timing_group_by_big, file = 'data/timing_group_by_big.RData')
```
```{r dttimings, eval=TRUE, echo=FALSE}
load('data/timing_group_by_big.RData')
as.data.table(sapply(timing_group_by_big, `[[`, "elapsed"), keep.rownames = TRUE)[, .(fun = V1, elapsed = V2)][order(-elapsed)] %>%
kable_df()
```
Ever.
Really.
Another thing to note is that the tidy approach is more about clarity and code efficiency relative to base R, as well as doing important background data checks and returning more usable results. In practice, it likely won't be notably faster except in some cases, like with <span class="func">aggregate</span>.
## Pipe with data.table
Piping can be done with <span class="objclass" style = "">data.table</span> objects too, using the brackets, but it's awkward at best.
```{r dtpipebad, eval=F}
mydt[, newvar := mean(x), ][, newvar2 := sum(newvar), by = group][, -'y', with = FALSE]
mydt[, newvar := mean(x),
][, newvar2 := sum(newvar), by = group
][,-'y', with=FALSE]
```
Probably better to just use a standard pipe and dot approach if you really need it.
```{r dtpipebad2, eval=F}
mydt[, newvar := mean(x), ] %>%
.[, newvar2 := sum(newvar), by = group] %>%
.[, -'y', with = FALSE]
```
## data.table Summary
Faster and more memory-efficient methods are great to have. If you have large data this is one package that can help.
- For reading data
- Especially for group-by and joins.
Drawbacks:
- Complex
- The syntax can be awkward
- It doesn't work like a data.frame, which can be confusing
- Piping with brackets isn't really feasible, and the dot approach is awkward
- Does not have its own 'verse', though many packages use it
If speed and/or memory is (potentially) a concern, <span class="pack">data.table</span>.
For interactive exploration, <span class="pack">dplyr</span>.
Piping allows one to use both, so no need to choose.
And on the horizon...
## Faster dplyr Alternatives
So we have <span class="pack" style = "">data.table</span> as a starting point for faster data processing operations, but there are others. The <span class="pack">dtplyr</span> package implements the <span class="pack">data.table</span> back-end for <span class="pack">dplyr</span>, so that you can seamlessly use them together. The newer package <span class="pack" style = "">tidyfast</span> works directly with a <span class="objclass" style = "">data.table</span> object, but uses dplyr-esque functions. The following shows times for a counting unique arrival times in the <span class="pack" style = "">nycflights13</span> <span class="objclass" style = "">flights</span> data (`r nrow(nycflights13::flights)` rows).
```{r dtplyr, echo=FALSE, eval=TRUE}
# something weird with n() function in chunk; saved out
# detach(package:data.table); library(dplyr); library(dtplyr)
# set.seed(123)
# n = 5e7 possible conflict with n()?
# k = 5e5
# x = runif(5e7)
# grp = sample(k, 5e7, TRUE)
#
# dtplyrtest = system.time({
# dt = tbl_dt(data.frame(x, grp))
# r.dplyr = summarise(group_by(dt, grp), sum(x), n())
# })
# timingsdtplyr = append(timing[6:7], list(dtplyr=dtplyrtest))
# save(timingsdtplyr, 'data/timingsdtplyr.RData')
# data(flights, 'nycflights13')
# head(flights)
#
# library(tidyverse)
# library(dtplyr)
# library(data.table)
# library(tidyfast)
#
#
# flights_dtp = lazy_dt(flights)
#
# flights_dt = data.table(flights)
#
# library(microbenchmark)
#
# test = microbenchmark::microbenchmark(
# dplyr = count(flights, arr_time),
# dtplyr = as_tibble(count(flights_dtp, arr_time)),
# tidyfast = dt_count(flights_dt, arr_time),
# data.table = flights_dt[, .(n = .N), by = arr_time],
# times = 500
# )
#
# test_table = as.data.frame(print(test))
# colnames(test_table)[c(1, 5)] = c('package', 'timing') # knitr was having issues with the names as they expr and median are functions
# test_table = test_table[c('package', 'timing')]
# save(test, test_table, file = 'data/timingsdtplyr.RData')
load('data/timingsdtplyr.RData')
test_table %>%
arrange(desc(timing)) %>%
kable_df() %>%
kableExtra::add_footnote('Median time in milliseconds to do a count of arr_time on nycflights::flights')
ggplot(test, aes(expr, time, color = expr, fill = expr)) +
geom_violin(alpha = .6) +
# ggbeeswarm::geom_beeswarm(dodge.width = 5, alpha = .6) +
visibly::theme_clean()
```
Just for giggles I did the same in Python with a <span class="pack">pandas</span> <span class="objclass">DataFrame</span>, and it was notably slower than all of these options (almost 10x slower than standard dplyr). A lot of folks that use Python think R is slow, but that is mostly because they don't know how to effectively program with R for data science.
```{python timing, engine = '~/anaconda3/bin/python', eval = FALSE, echo = FALSE}
import pandas as pd
flights = r.flights
flights.set_index(["arr_time", 'year']).count(level="arr_time")
def test():
flights.set_index(["arr_time", 'year']).count(level="arr_time")
test()
import timeit
timeit.Timer.timeit() # see documentation
test_result = timeit.timeit(stmt="test()", setup="from __main__ import test", number=100)
# result is in seconds for the total number of runs
test_result/100 # per run
test_result/100*1000 # in milliseconds
```
#### Out of memory situations
For very large data sets, especially in cases where distributed data solutions like Spark (and <span class="pack" style = "">sparklyr</span>) are not viable for practical or security reasons, you may need to try another approach. The <span class="pack" style = "">disk.frame</span> package does data processing on disk rather than in memory, as is the case with default R approaches. This allows you to process data that may be too large or time consuming to do so otherwise. For example, it'd be a great option if you are starting out with extremely large data, but for which your subset of interest is easily manageable within R. With <span class="pack" style = "">disk.frame</span>, you can do the initial filtering and selection before bringing it into memory.
## data.table Exercises
### Exercise 0
Install and load the <span class="pack">data.table</span> package.
Create the following data table.
```{r dt-ex0, eval=FALSE}
mydt = data.table(
expand.grid(x = 1:3,
y = c('a', 'b', 'c')),
z = sample(1:20, 9)
)
```
### Exercise 1
Create a new object that contains only the 'a' group. Think back to how you use a logical to select rows.
```{r, echo=FALSE, eval=FALSE}
mydt_a = mydt[y == 'a', ]
```
### Exercise 2
Create a new object that is the sum of z grouped by x. You don't need to name the sum variable.
```{r, echo=FALSE, eval=FALSE}
mydt_grpsum = mydt[, sum(z), by = x]
mydt_grpsum = mydt[, list(sumz = sum(z)), by = x]
```