data_table.Rmd

# data.table


```{r dtsetup, include=FALSE, eval=TRUE, cache=FALSE}
knitr::opts_chunk$set(eval=T, echo=T, cache = T)
```


```{r dthex, out.width='10%', echo=FALSE}
knitr::include_graphics('img/hex_datatable.png')
```


Another package for data processing that has been useful to many is <span class="pack">data.table</span>. It works in a notably different way than <span class="pack">dplyr</span>. However, you'd use it for the same reasons, e.g. subset, grouping, update, ordered joins etc., but with key advantages in speed and memory efficiency. Like <span class="pack">dplyr</span>, the data objects are both <span class="objclass">data.frames</span> and a package specific class.


```{r dterror, eval=TRUE}
library(data.table)
dt = data.table(x = sample(1:10, 6),
                g = 1:3,
                y = runif(6))
class(dt)
```

## data.table Basics

In general, <span class="pack">data.table</span> works with brackets as in base R data frames.  However, in order to use <span class="pack">data.table</span> effectively you'll need to forget the data frame similarity. The brackets actually work like a function call, with several key arguments.  Consider the following notation to start.

```{r dtgeneric, eval=FALSE}
x[i, j, by, keyby, with = TRUE, ...]
```


Importantly: *you don't use the brackets as you would with data.frames*. What **i** and **j** can be are fairly complex. 

In general, you use **i** for filtering by rows.
 
```{r dtrows, results='hold'}
dt[2]    # rows! not columns as with standard data.frame
dt[2,]
```

You use **j** to select (by name!) or create new columns.  We can define a new column with the <span class="func">:=</span> operator.

 
```{r dtcols, results='hold'}
dt[,x]
dt[,z := x+y]  # dt now has a new column
dt[,z]
dt[g > 1, mean(z), by = g]
dt
```

Because **j** is an argument, dropping columns is awkward.

```{r dtdropcol, eval=1:2, results='hold'}
dt[, -y]             # creates negative values of y
dt[, -'y', with = F] # drops y, but now needs quotes
dt[, y := NULL]      # drops y, but this is just a base R approach
dt$y = NULL
```


Data table does not make unnecessary copies.  For example if we do the following...

```{r dtmemory}
DT = data.table(A = 5:1, B = letters[5:1])
DT2 = DT
DT3 = copy(DT)  
```

DT2 and DT are just names for the same table.  You'd actually need to use the copy function to make an explicit copy, otherwise whatever you do to DT2 will be done to DT.

```{r dtnocopy}
DT2[,q:=1]
DT
```

```{r dtcopy}
DT3 
```


## Grouped Operations

We can now attempt a 'group-by' operation, along with creation of a new variable. Note that these operations actually modify the <span class="objclass">dt</span> object *in place*, a key distinction with <span class="pack">dplyr</span>.  Fewer copies means less of a memory hit.

```{r dtgroup}
dt1 = dt2 = dt
dt[, sum(x, y), by = g]                # sum of all x and y values
dt1[, mysum := sum(x), by = g]         # add new variable to the original data
dt1
```


We can also create groupings on the fly. For a new summary data set, we'll take the following approach- we create a grouping based on whether `g` is a value of one or not, then get the mean and sum of `x` for those two categories.  The corresponding <span class="pack" style = "">dplyr</span> approach is also shown (but not evaluated) for comparison.

```{r dtgroup2, eval=1}
dt2[, list(mean_x = mean(x), sum_x = sum(x)), by = g == 1]
dt2 %>% 
  group_by(g == 1) %>% 
  summarise(mean_x = mean(x), sum_x = sum(x))
```


## Faster!

As mentioned, the reason to use <span class="pack">data.table</span> is speed.  If you have large data or large operations it'll be useful.


### Joins

Joins can not only be faster but also easy to do. Note that the `i` argument can be a <span class="objclass">data.table</span> object itself.  I compare its speed to the comparable <span class="pack">dplyr's</span> <span class="func">left_join</span> function.

```{r dtjoin, eval=FALSE}
dt1 = setkey(dt1, x)
dt1[dt2]

dt1_df = dt2_df = as.data.frame(dt1)
left_join(dt1_df, dt2_df, by = 'x')
```

```{r dtjoin_timing, echo=FALSE}
# load('data/timing.RData')
# timing_join %>% 
#   kable_df(justify='lr')
dt1 = setkey(dt1, x)
dt1_df = dt2_df = as.data.frame(dt1)
# dt1[dt2]
# timing_join =
  microbenchmark::microbenchmark(dt_join = dt1[dt2],
                                 dplyr_join = left_join(dt1_df, dt2_df, by='x')) %>%
  summary() %>%
  select(expr, mean) %>%
  mutate(mean = round(mean, 2)) %>%
  rename(func = expr, `mean (microseconds)` = mean) %>%
  kable_df(justify = 'lr')
```

### Group by

We can use the <span class="func">setkey</span> function to order a data set by a certain column(s).  This ordering is done by reference; again, no copy is made.  Doing this will allow for faster grouped operations, though you likely will only see the speed gain with very large data. The timing regards creating a new variable


```{r dtsetkey, echo=1:10}
test_dt0 = data.table(x = rnorm(10000000),
                      g = sample(letters, 10000000, replace = T))
test_dt1 = copy(test_dt0)
test_dt2 = setkey(test_dt1, g)

identical(test_dt0, test_dt1)
identical(test_dt1, test_dt2)

test_dt0 = test_dt0[, mean := mean(x), by = g]
test_dt1 = test_dt1[, mean := mean(x), by = g]
test_dt2 = test_dt2[, mean := mean(x), by = g]

# timing_group_by_1 %>% kable_df(justify='lr')

timing_group_by_1 =
  microbenchmark::microbenchmark(
    test_dt0 = test_dt0[, mean := mean(x), by = g],
    test_dt1 = test_dt1[, mean := mean(x), by = g],
    test_dt2 = test_dt2[, mean := mean(x), by = g],
    times = 25
  ) %>%
  summary() %>%
  select(expr, mean) %>%
  mutate(mean = round(mean, 2)) %>%
  rename(func = expr, `mean (milliseconds)` = mean)

timing_group_by_1 %>% 
  kable_df(justify = 'lr')
```

### String matching

The <span class="func">chin</span> function returns a vector of the *positions* of (first) matches of its first argument in its second, where both arguments are character vectors.  Essentially it's just like the <span class="func">%in%</span> function for character vectors. 

Consider the following. We sample the first 14 letters 1000 times with replacement and see which ones match in a subset of another subset of letters.  I compare the same operation to <span class="pack">stringr</span> and the <span class="pack">stringi</span> package whose functionality <span class="pack">stringr</span> using.  They are both far slower than <span class="func" style = "">chin</span>.

```{r dtcharacters, echo=2:6, eval=2:8}
set.seed(1234)
lets_1 = sample(letters[1:14], 1000, replace=T)

lets_1 %chin% letters[13:26] %>% head(10)

# stri_detect_regex(lets_1, paste(letters[13:26], collapse='|'))
# str_detect(lets_1, paste(letters[13:26], collapse='|'))


# timing_chmatch %>% kable_df(justify='lr')
# 
# library(stringr); library(stringi)
# timing_chmatch =
#   microbenchmark::microbenchmark(
#   dt = lets_1 %chin% letters[13:26],
#   stringi= stri_detect_regex(lets_1, paste(letters[13:26], collapse='|')),
#   stringr= str_detect(lets_1, paste(letters[13:26], collapse='|'))
# ) %>%
#   summary() %>%
#   select(expr, mean) %>%
#   mutate(mean=round(mean, 2)) %>%
#   rename(func=expr, `mean (microseconds)`=mean)


```


### Reading files

If you use <span class="pack">data.table</span> for nothing else, you'd still want to consider it strongly for reading in large text files.  The function <span class="func">fread</span> may be quite useful in being memory efficient too.  I compare it to <span class="pack">readr</span>.

```{r dtfread, eval=FALSE}
fread('data/cars.csv')
```

```{r dtfread_timing, echo=FALSE}
library(readr)
timing_read =
  microbenchmark::microbenchmark(
  dt = fread('data/cars.csv'),
  readr = read_csv('data/cars.csv')
) %>%
  summary() %>%
  select(expr, mean) %>%
  mutate(mean=round(mean, 2)) %>%
  rename(func=expr, `mean (microseconds)`=mean)

timing_read %>% kable_df(justify='lr')
```


### More speed

The following demonstrates some timings from [here](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega/34167477#34167477).  I reproduced it on my own machine based on 50 million observations. The grouped operations that are applied are just a sum and length on a vector. 


By the way, never, ever use <span class="func">aggregate</span>. For anything.


```{r dttimings_big, eval=FALSE, echo=FALSE}
library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)

timing_group_by_big = list()

# sapply
timing_group_by_big[["sapply"]] = system.time({
    lt = split(x, grp)
    r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})

# lapply
timing_group_by_big[["lapply"]] = system.time({
    lt = split(x, grp)
    r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})

# tapply
timing_group_by_big[["tapply"]] = system.time(
    r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)

# by
timing_group_by_big[["by"]] = system.time(
    r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# aggregate
timing_group_by_big[["aggregate"]] = system.time(
    r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# dplyr
timing_group_by_big[["dplyr"]] = system.time({
    df = tibble(x, grp)
    r.dplyr = summarise(group_by(df, grp), sum(x), n())
})

# data.table
timing_group_by_big[["data.table"]] = system.time({
    dt = setnames(setDT(list(x, grp)), c("x","grp"))
    r.data.table = dt[, .(sum(x), .N), grp]
})

save(timing_group_by_big, file = 'data/timing_group_by_big.RData')
```


```{r dttimings, eval=TRUE, echo=FALSE}
load('data/timing_group_by_big.RData')
as.data.table(sapply(timing_group_by_big, `[[`, "elapsed"), keep.rownames = TRUE)[, .(fun = V1, elapsed = V2)][order(-elapsed)] %>% 
  kable_df()
```

Ever.

Really.

Another thing to note is that the tidy approach is more about clarity and code efficiency relative to base R, as well as doing important background data checks and returning more usable results.  In practice, it likely won't be notably faster except in some cases, like with <span class="func">aggregate</span>.

## Pipe with data.table

Piping can be done with <span class="objclass" style = "">data.table</span> objects too, using the brackets, but it's awkward at best.

```{r dtpipebad, eval=F}
mydt[, newvar := mean(x), ][, newvar2 := sum(newvar), by = group][, -'y', with = FALSE]
mydt[, newvar := mean(x),
     ][, newvar2 := sum(newvar), by = group
       ][,-'y', with=FALSE]
```

Probably better to just use a standard pipe and dot approach if you really need it.

```{r dtpipebad2, eval=F}
mydt[, newvar := mean(x), ] %>%
  .[, newvar2 := sum(newvar), by = group] %>%
  .[, -'y', with = FALSE]
```

## data.table Summary

Faster and more memory-efficient methods are great to have. If you have large data this is one package that can help.

  - For reading data
  - Especially for group-by and joins.

Drawbacks:

- Complex
- The syntax can be awkward
- It doesn't work like a data.frame, which can be confusing
- Piping with brackets isn't really feasible, and the dot approach is awkward
- Does not have its own 'verse', though many packages use it


If speed and/or memory is (potentially) a concern, <span class="pack">data.table</span>.

For interactive exploration, <span class="pack">dplyr</span>.

Piping allows one to use both, so no need to choose.

And on the horizon...

## Faster dplyr Alternatives

So we have <span class="pack" style = "">data.table</span> as a starting point for faster data processing operations, but there are others. The <span class="pack">dtplyr</span> package implements the <span class="pack">data.table</span> back-end for <span class="pack">dplyr</span>, so that you can seamlessly use them together.  The newer package <span class="pack" style = "">tidyfast</span> works directly with a <span class="objclass" style = "">data.table</span> object, but uses dplyr-esque functions. The following shows times for a counting unique arrival times in the <span class="pack" style = "">nycflights13</span> <span class="objclass" style = "">flights</span> data (`r nrow(nycflights13::flights)` rows).

```{r dtplyr, echo=FALSE, eval=TRUE}
# something weird with n() function in chunk; saved out
# detach(package:data.table); library(dplyr); library(dtplyr)
# set.seed(123)
# n = 5e7  possible conflict with n()?
# k = 5e5
# x = runif(5e7)
# grp = sample(k, 5e7, TRUE)
# 
# dtplyrtest = system.time({
#     dt = tbl_dt(data.frame(x, grp))
#     r.dplyr = summarise(group_by(dt, grp), sum(x), n())
# })
# timingsdtplyr = append(timing[6:7], list(dtplyr=dtplyrtest))
# save(timingsdtplyr, 'data/timingsdtplyr.RData')


# data(flights, 'nycflights13')
# head(flights)
# 
# library(tidyverse)
# library(dtplyr)
# library(data.table)
# library(tidyfast)
# 
# 
# flights_dtp = lazy_dt(flights)
# 
# flights_dt = data.table(flights)
# 
# library(microbenchmark)
# 
# test = microbenchmark::microbenchmark(
#   dplyr      = count(flights, arr_time),
#   dtplyr     = as_tibble(count(flights_dtp, arr_time)),
#   tidyfast   = dt_count(flights_dt, arr_time),
#   data.table = flights_dt[, .(n = .N), by = arr_time],
#   times = 500
# )
# 
# test_table = as.data.frame(print(test))
# colnames(test_table)[c(1, 5)] = c('package', 'timing')  # knitr was having issues with the names as they expr and median are functions
# test_table = test_table[c('package', 'timing')]
# save(test, test_table, file = 'data/timingsdtplyr.RData')


load('data/timingsdtplyr.RData')

test_table %>%
  arrange(desc(timing)) %>%
  kable_df() %>% 
  kableExtra::add_footnote('Median time in milliseconds to do a count of arr_time on nycflights::flights')

ggplot(test, aes(expr, time, color = expr, fill = expr)) +
  geom_violin(alpha = .6)  +
  # ggbeeswarm::geom_beeswarm(dodge.width = 5, alpha = .6)  +
  visibly::theme_clean()
```

Just for giggles I did the same in Python with a <span class="pack">pandas</span> <span class="objclass">DataFrame</span>, and it was notably slower than all of these options (almost 10x slower than standard dplyr).  A lot of folks that use Python think R is slow, but that is mostly because they don't know how to effectively program with R for data science.

```{python timing, engine = '~/anaconda3/bin/python', eval = FALSE, echo = FALSE}
import pandas as pd

flights = r.flights

flights.set_index(["arr_time", 'year']).count(level="arr_time")

def test():
  flights.set_index(["arr_time", 'year']).count(level="arr_time")

test()
import timeit

timeit.Timer.timeit() # see documentation

test_result = timeit.timeit(stmt="test()", setup="from __main__ import test", number=100)

# result is in seconds for the total number of runs
test_result/100  # per run
test_result/100*1000  # in milliseconds
```

#### Out of memory situations

For very large data sets, especially in cases where distributed data solutions like Spark (and <span class="pack" style = "">sparklyr</span>) are not viable for practical or security reasons, you may need to try another approach.  The <span class="pack" style = "">disk.frame</span> package does data processing on disk rather than in memory, as is the case with default R approaches.  This allows you to process data that may be too large or time consuming to do so otherwise. For example, it'd be a great option if you are starting out with extremely large data, but for which your subset of interest is easily manageable within R. With <span class="pack" style = "">disk.frame</span>, you can do the initial filtering and selection before bringing it into memory.

## data.table Exercises


### Exercise 0

Install and load the <span class="pack">data.table</span> package.

Create the following data table.

```{r dt-ex0, eval=FALSE}
mydt = data.table(
  expand.grid(x = 1:3, 
              y = c('a', 'b', 'c')), 
  z = sample(1:20, 9)
)
```

### Exercise 1

Create a new object that contains only the 'a' group. Think back to how you use a logical to select rows.

```{r, echo=FALSE, eval=FALSE}
mydt_a = mydt[y == 'a', ]
```


### Exercise 2

Create a new object that is the sum of z grouped by x. You don't need to name the sum variable.


```{r, echo=FALSE, eval=FALSE}
mydt_grpsum = mydt[, sum(z), by = x]
mydt_grpsum = mydt[, list(sumz = sum(z)), by = x]
```