-
Notifications
You must be signed in to change notification settings - Fork 13
/
iterative.Rmd
480 lines (320 loc) · 19.9 KB
/
iterative.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
# Iterative Programming
<div class="" style="text-align: center;">
<i class="fas fa-sync-alt fa-5x" ></i>
</div>
<br>
Almost everything you do when dealing with data will need to be done again, and again, and again. If you are copy-pasting your way to repetitively do the same thing, you're not only doing things inefficiently, you're almost certainly setting yourself up for trouble if anything changes about the data or underlying process.
In order to avoid this, you need to be familiar with basic programming, and a starting point is to use an iterative approach to repetitive problems. Let's look at the following. Let's say we want to get the means of some columns in our data set. Do you do something like this?
```{r iter_1, eval=FALSE}
means1 = mean(df$x)
means2 = mean(df$y)
means3 = mean(df$z)
means4 = mean(df$q)
```
Now consider what you have to change if you change a variable name, decide to do a median, or the data object name changes. Any minor change with the data will cause you to have to redo that code, and possibly every line of it.
## For Loops
A <span class="emph">for loop</span> will help us get around the problem. The idea is that we want to perform a particular action *for* every iteration of some sequence. That sequence may be over columns, rows, lines in a text, whatever. Here is a loop.
```{r iter_loop_demo, eval=FALSE}
for (column in c('x','y','z','q')) {
mean(df[[column]])
}
```
What's going on here? We've created an iterative process in which, *for* every *element* in `c('x','y','z','q')`, we are going to do something. We use the completely arbitrary word `column` as a placeholder to index which of the four columns we're dealing with at a given point in the process. On the first iteration, `column` will equal `x`, on the second `y`, and so on. We then take the mean of `df[[column]]`, which will be `df[['x']]`, then `df[['y']]`, etc.
Here is an example with the <span class="pack">nycflights</span> data, which regards flights that departed New York City in 2013. The <span class="objclass">weather</span> data set has columns for things like temperature, humidity, and so forth.
```{r nyc_flights}
weather = nycflights13::weather
for (column in c('temp', 'humid', 'wind_speed', 'precip')) {
print(mean(weather[[column]], na.rm = TRUE))
}
```
You can check this for yourself by testing a column or two directly with just `mean(df$x)`.
Now if the data name changes, the columns we want change, or we want to calculate something else, we usually end up only changing one thing, rather than at least changing one at a minimum, and probably many more things. In addition, the amount of code is the same whether the loop goes over 100 columns or 4.
Let's do things a little differently.
```{r nyc_flights2}
columns = c('temp', 'humid', 'wind_speed', 'precip')
nyc_means = rep(NA, length(columns))
for (i in seq_along(columns)) {
column = columns[i]
nyc_means[i] = mean(weather[[column]], na.rm = TRUE)
# alternative without the initial first step
# nyc_means[i] = mean(weather[[columns[i]]], na.rm = TRUE)
}
nyc_means
```
By creating a <span class="objclass">columns</span> object, if anything changes about the columns we want, that's the only line in the code that would need to be changed. The `i` is now a place holder for a number that goes from 1 to the length of <span class="objclass">columns</span> (i.e. `r length(columns)`). We make an empty <span class="objclass">nyc_means</span> object that's the length of the columns, so that each element will eventually be the mean of the corresponding column.
In the following I remove precipitation and add visibility and air pressure.
```{r nyc_flights3}
columns = c('temp', 'humid', 'wind_speed', 'visib', 'pressure')
nyc_means = rep(NA, length(columns))
for (i in seq_along(columns)) {
nyc_means[i] = mean(weather[[columns[i]]], na.rm = TRUE)
}
nyc_means %>% round(2)
```
Had we been copy-pasting, this would require deleting or commenting out a line in our code, pasting two more, and changing each one after pasting to represent the new columns. That's tedious, and not a fun way to code.
### A slight speed gain
Note that you do not have to create an empty object like we did. The following works also.
```{r nyc_flights4}
columns = c('temp', 'humid', 'wind_speed', 'visib', 'pressure')
nyc_means = numeric()
for (i in seq_along(columns)) {
nyc_means[i] = mean(weather[[columns[i]]], na.rm = TRUE)
}
nyc_means %>% round(2)
```
However, the other approach is slightly faster, because memory is already allocated for all elements of <span class="objclass">nyc_means</span>, rather than updating it every iteration of the loop. This speed gain can become noticeable when dealing with thousands of columns and complex operations.
```{r loop_timing, eval=FALSE, echo=FALSE}
X = matrix(rnorm(1000000), ncol=10000)
means1 = rep(NA, ncol(X))
means2 = c()
loop1 = function(X) {
for (i in 1:ncol(X)) {
means1[i] = mean(X[,i])
}
}
loop2 = function(X) {
for (i in 1:ncol(X)) {
means2[i] = mean(X[,i])
}
}
microbenchmark::microbenchmark(loop1(X), loop2(X))
```
### While alternative
When you look at some people's R code, you may see a loop of a different sort.
```{r nyc_flights_while}
columns = c('temp','humid','wind_speed', 'visib', 'pressure')
nyc_means = c()
i = 1
while (i <= length(columns)) {
nyc_means[i] = mean(weather[[columns[i]]], na.rm = TRUE)
i = i + 1
}
nyc_means %>% round(2)
```
This involves a <span class="emph">while</span> statement. It states, while `i` is less than or equal to the length (number) of <span class="objclass">columns</span>, compute the value of the i^th^ element of <span class="objclass">nyc_means</span> as the mean of i^th^ column of <span class="objclass">weather</span>. After that, increase the value of `i`. So, we start with `i = 1`, compute that subsequent mean, `i` now equals 2, do the process again, and so on. The process will stop as soon as the value of `i` is greater than the length of <span class="objclass">columns</span>.
*There is zero difference to using the while approach vs. the for loop*. While is often used when there is a check to be made, e.g. in modeling functions that have to stop the estimation process at some point, or else they'd go on indefinitely. In that case the while syntax is probably more natural. Either is fine.
### Loops summary
Understanding loops is fundamental toward spending less time processing data and more time toward exploring it. Your code will be more succinct and more able to handle the usual changes that come with dealing with data. Now that you have a sense of it, know that once you are armed with the sorts of things we'll be talking about next- apply functions, writing functions, and vectorization - you'll likely have little need to write explicit loops. While there is always a need for iterative processing of data, R provides even more efficient means to do so.
## Implicit Loops
Writing loops is straightforward once you get the initial hang of it. However, R offers alternative ways to do loops that can simplify code without losing readability. As such, even when you loop in R, you don't have to do so explicitly.
### apply family
A family of functions comes with R that allows for a succinct way of looping when it is appropriate. Common functions in this family include:
- <span class="func">apply</span>
- arrays, matrices, data.frames
- <span class="func">lapply</span>, <span class="func">sapply</span>, <span class="func">vapply</span>
- lists, data.frames, vectors
- <span class="func">tapply</span>
- grouped operations (table apply)
- <span class="func">mapply</span>
- multivariate version of sapply
- <span class="func">replicate</span>
- performs an operation N times
As an example we'll consider standardizing variables, i.e. taking a set of numbers, subtracting the mean, and dividing by the standard deviation. This results in a variable with mean of 0 and standard deviation of 1. Let's start with a loop approach.
```{r loopvsapply1, eval=FALSE}
for (i in 1:ncol(mydf)) {
x = mydf[, i]
for (j in 1:length(x)) {
x[j] = (x[j] - mean(x)) / sd(x)
}
}
```
The above would be a really bad way to use R. It goes over each column individually, then over each value of the column.
Conversely, <span class="func">apply</span> will take a matrix or data frame, and apply a function over the margin, row or column, you want to loop over. The first argument is the data you're considering, the margin is the second argument (1 for rows, 2 for columns[^arrays]), and the function you want to apply to those rows is the third argument. The following example is much cleaner compared to the loop, and now you'd have a function you can use elsewhere if needed.
```{r loopvsapply2, eval=F}
stdize <- function(x) {
(x - mean(x)) / sd(x)
}
apply(mydf, 2, stdize) # 1 for rows, 2 for columnwise application
```
Many of the other apply functions work similarly, taking an object and a function to do the work on the object (possibly implicit), possibly with other arguments specified if necessary.
#### lapply
Let's say we have a list object, or even just a vector of values. There are no rows or columns to iterate over, so what do we do here?
```{r lapply_example}
x = list('aba', 'abb', 'abc', 'abd', 'abe')
lapply(x, str_remove, pattern = 'ab')
```
The <span class="func">lapply</span> operation iterates over each element of the list and applies a function to them. In this case, the function is <span class="func">str_remove</span>. It has an argument for the string pattern we want to take out of the character string that is fed to it ('ab'). For example, for 'aba' we will be left with just the 'a'.
As can be seen, <span class="func">lapply</span> starts with a <span class="objclass">list</span> and returns a <span class="objclass">list</span>. The only difference with <span class="func">sapply</span> is that <span class="func">sapply</span> will return a simplified form if possible[^slapply].
```{r sapply_example}
sapply(x, str_remove, pattern = 'ab')
```
In this case we just get a vector back.
### Apply functions
It is important to be familiar with the apply family for efficient data processing, if only because you'll regularly come code employing these functions. A summary of benefits includes:
- Cleaner/simpler code
- Environment kept clear of unnecessary objects
- Potentially more reproducible
- more likely to use generalizable functions
- Parallelizable
Note that apply functions are <span class="emph">NOT</span> necessarily faster than explicit loops, and if you create an empty object for the loop as discussed previously, the explicit loop will likely be faster. On top of that, functions like <span class="func">replicate</span> and <span class="func">mapply</span> are especially slow.
However, the apply family can ALWAYS *potentially* be faster than standard R loops do to parallelization. With base R's <span class="pack">parallel</span> package, there are parallel versions of the apply family, e.g.<span class="func">parApply</span>, <span class="func">parLapply</span> etc. As every modern computer has at least four cores to play with, you'll always potentially have nearly a 4x speedup by using the parallel apply functions.
Apply functions and similar approaches should be a part of your regular R experience. We'll talk about other options that may have even more benefits, but you need to know the basics of how apply functions work in order to use those.
<div class='note'>
I use R every day, and very rarely use explicit loops. Note that there is no speed difference for a <span class="func">for</span> loop vs. using <span class="func">while</span>. And if you must use an explicit loop, create an empty object of the dimension/form you need, and then fill it in via the loop. This will be notably faster.
I pretty much never use an explicit double loop, as a little more thinking about the problem will usually provide a more efficient path to solving the problem.
<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;">
</div>
### purrr
The <span class="pack">purrr</span> package allows you to take the <span class="func">apply</span> family approach to the tidyverse. And with packages <span class="pack" style = "">future</span> + <span class="pack" style = "">furrr</span>, they too are parallelizable.
Consider the following. We'll use the <span class="func">map</span> function to map the <span class="func">sum</span> function to each element in the list, the same way we would with <span class="func">lapply</span>.
```{r purr_example, eval=TRUE, echo=-(1:3)}
# detach(package:maps)
map = purrr::map # could not remove conflict with something, nor able to debug
x = list(1:3, 4:6, 7:9)
map(x, sum)
```
The map functions take some getting used to, and in my experience they are typically slower than the apply functions, sometimes notably so. However they allow you stay within the tidy realm, which has its own benefits, and have more control over the nature of the output[^purrr], which is especially important in reproducibility, package development, producing production-level code, etc. The key idea is that the map functions will always return something the same length as the input given to it.
[^purrr]: The main thing to note is that both the benefit and difficulty of using the `map_*` functions is they are more strict with inputs, which typically should be lists or list-like objects like data frames, and outputs.
The <span class="pack">purrr</span> functions want a list or vector, i.e. they don't work with data.frame objects in the same way we've done with mutate and summarize except in the sense that data.frames are lists.
```{r purr_example2, eval=-1}
mtcars %>%
map(scale) # returns a list, not shown
mtcars %>%
map_df(scale) # returns a df
mtcars %>%
map_dbl(sum) # returns a numeric (double) vector of column sums
diamonds %>%
map_at(
vars(carat, depth, price),
function(x)
as.integer(x > median(x))
) %>%
as_tibble()
```
However, working with lists is very useful, so let's turn to that.
## Looping with Lists
Aside from data frames, you may think you don't have much need for list objects. However, list objects make it very easy to iterate some form of data processing.
Let's say you have models of increasing complexity, and you want to easily summarise and/or compare them. We create a list for which each element is a model object. We then apply a function, e.g. to get the AIC value for each, or adjusted R square (this requires a custom function).
```{r purr_models}
library(mgcv) # for gam
mtcars$cyl = factor(mtcars$cyl)
mod_lm = lm(mpg ~ wt, data = mtcars)
mod_poly = lm(mpg ~ poly(wt, 2), data = mtcars)
mod_inter = lm(mpg ~ wt * cyl, data = mtcars)
mod_gam = gam(mpg ~ s(wt), data = mtcars)
mod_gam_inter = gam(mpg ~ cyl + s(wt, by = cyl), data = mtcars)
model_list = list(
mod_lm = mod_lm,
mod_poly = mod_poly,
mod_inter = mod_inter,
mod_gam = mod_gam,
mod_gam_inter = mod_gam_inter
)
# lowest wins
model_list %>%
map_dbl(AIC) %>%
sort()
# highest wins
model_list %>%
map_dbl(
function(x)
if_else(inherits(x, 'gam'),
summary(x)$r.sq,
summary(x)$adj)
) %>%
sort(decreasing = TRUE)
```
Let's go further and create a plot of these results. We'll map to a data frame, use <span class="func" style = "">pivot_longer</span> to melt it to two columns of model and value, then use ggplot2 to plot the results[^ggplotalpha].
```{r purr_models2}
model_list %>%
map_df(
function(x)
if_else(inherits(x, 'gam'),
summary(x)$r.sq,
summary(x)$adj)
) %>%
pivot_longer(cols = starts_with('mod'),
names_to = 'model',
values_to = "Adj. Rsq") %>%
arrange(desc(`Adj. Rsq`)) %>%
mutate(model = factor(model, levels = model)) %>% # sigh
ggplot(aes(x = model, y = `Adj. Rsq`)) +
geom_point(aes(color = model), size = 10, show.legend = F)
```
Why not throw in AIC also?
```{r purr_models3}
mod_rsq =
model_list %>%
map_df(
function(x)
if_else(
inherits(x, 'gam'),
summary(x)$r.sq,
summary(x)$adj
)
) %>%
pivot_longer(cols = starts_with('mod'),
names_to = 'model',
values_to = 'Rsq')
mod_aic =
model_list %>%
map_df(AIC) %>%
pivot_longer(cols = starts_with('mod'),
names_to = 'model',
values_to = 'AIC')
left_join(mod_rsq, mod_aic) %>%
arrange(AIC) %>%
mutate(model = factor(model, levels = model)) %>%
pivot_longer(cols = -model, names_to = 'measure', values_to = 'value') %>%
ggplot(aes(x = model, y = value)) +
geom_point(aes(color = model), size = 10, show.legend = F) +
facet_wrap(~ measure, scales = 'free')
```
#### List columns
As data.frames are lists, anything can be put into a column just as you would a list element. We'll use <span class="func">pmap</span> here, as it can take more than one argument, and we're feeding all columns of the data.frame. You don't need to worry about the details here, we just want to create a column that is actually a list. In this case the column will contain a data frame in each entry.
```{r list_tibble}
mtcars2 = as.matrix(mtcars)
mtcars2[sample(1:length(mtcars2), 50)] = NA # add some missing data
mtcars2 = data.frame(mtcars2) %>%
rownames_to_column(var = 'observation') %>%
as_tibble()
head(mtcars2)
mtcars2 =
mtcars2 %>%
mutate(
newvar =
pmap(., ~ data.frame(
N = sum(!is.na(c(...))),
Missing = sum(is.na(c(...)))
)
)
)
```
Now check out the list column.
```{r list_tibble2}
mtcars2
mtcars2$newvar %>% head(3)
```
Unnest it with the <span class="pack">tidyr</span> function.
```{r list_tibble3}
mtcars2 %>%
unnest(newvar)
```
This is a pretty esoteric demonstration, and not something you'd normally want to do, as <span class="func">mutate</span> or other approaches would be far more efficient and sensical. However, the idea is that you might want to retain the information you might otherwise store in a list with the data that was used to create it. As an example, you could potentially attach models as a list column to a dataframe that contains meta-information about each model. Once you have a list column, you can use that column as you would any list for iterative programming.
## Iterative Programming Exercises
### Exercise 1
With the following <span class="objclass">matrix</span>, use <span class="func">apply</span> and the <span class="func">sum</span> function to get row or column sums of the matrix <span class="objclass">x</span>.
```{r app_ex1, echo=1, eval=FALSE}
x = matrix(1:9, 3, 3)
apply()
```
### Exercise 2
With the following list object <span class="objclass">x</span>, use <span class="func">lapply</span> and <span class="func">sapply</span> and the <span class="func">sum</span> function to get sums for the elements. There is no margin to specify for a list, so just supply the list and the sum function.
```{r app_ex2, echo=1, eval=FALSE}
x = list(1:3, 4:10, 11:100)
lapply()
sapply()
```
### Exercise 3
As in the previous example, use a <span class="func" style = "">map</span> function to create a data frame of the column means. See `?map` to see all your options.
```{r purrr_ex3, echo=1:5, eval=FALSE}
d = tibble(
x = rnorm(100),
y = rnorm(100, 10, 2),
z = rnorm(100, 50, 10),
)
d %>% map_df(mean)
```
[^slapply]: <span class="func">sapply</span> is actually just a wrapper for <span class="func">lapply</span>. If you supply the argument `simplified=F`, it is identical. Otherwise, it attempts to return a vector or matrix.
[^arrays]: You can have 3 or more dimensional arrays, and <span class="func">apply</span> will work over those dimensions (or any combination of them), but this doesn't come up too often.
[^ggplotalpha]: Note that ggplot2 will change the ordering of the variable unless you coerce it by creating a factor. This pretty much defeats the entire purpose of retaining characters over factors everywhere else in the tidyverse.