iterative.Rmd

# Iterative Programming

<div class="" style="text-align: center;">
<i class="fas fa-sync-alt fa-5x" ></i>
</div>

<br>

Almost everything you do when dealing with data will need to be done again, and again, and again.  If you are copy-pasting your way to repetitively do the same thing, you're not only doing things inefficiently, you're almost certainly setting yourself up for trouble if anything changes about the data or underlying process.

In order to avoid this, you need to be familiar with basic programming, and a starting point is to use an iterative approach to repetitive problems.  Let's look at the following.  Let's say we want to get the means of some columns in our data set.  Do you do something like this?

```{r iter_1, eval=FALSE}
means1 = mean(df$x)
means2 = mean(df$y)
means3 = mean(df$z)
means4 = mean(df$q)
```


Now consider what you have to change if you change a variable name, decide to do a median, or the data object name changes.  Any minor change with the data will cause you to have to redo that code, and possibly every line of it.

## For Loops

A <span class="emph">for loop</span> will help us get around the problem.  The idea is that we want to perform a particular action *for* every iteration of some sequence. That sequence may be over columns, rows, lines in a text, whatever.  Here is a loop.

```{r iter_loop_demo, eval=FALSE}
for (column in c('x','y','z','q')) {
  mean(df[[column]])
}
```

What's going on here?  We've created an iterative process in which, *for* every *element* in `c('x','y','z','q')`, we are going to do something.  We use the completely arbitrary word `column` as a placeholder to index which of the four columns we're dealing with at a given point in the process. On the first iteration, `column` will equal `x`, on the second `y`, and so on.  We then take the mean of `df[[column]]`, which will be `df[['x']]`, then `df[['y']]`, etc.

Here is an example with the <span class="pack">nycflights</span> data, which regards flights that departed New York City in 2013.  The <span class="objclass">weather</span> data set has columns for things like temperature, humidity, and so forth.


```{r nyc_flights}
weather = nycflights13::weather

for (column in c('temp', 'humid', 'wind_speed', 'precip')) {
  print(mean(weather[[column]], na.rm = TRUE))
}
```

You can check this for yourself by testing a column or two directly with just `mean(df$x)`.

Now if the data name changes, the columns we want change, or we want to calculate something else, we usually end up only changing one thing, rather than at least changing one at a minimum, and probably many more things.  In addition, the amount of code is the same whether the loop goes over 100 columns or 4.

Let's do things a little differently.

```{r nyc_flights2}
columns = c('temp', 'humid', 'wind_speed', 'precip')
nyc_means = rep(NA, length(columns))

for (i in seq_along(columns)) {
  column = columns[i]
  nyc_means[i] = mean(weather[[column]], na.rm = TRUE)
  
  # alternative without the initial first step
  # nyc_means[i] = mean(weather[[columns[i]]], na.rm = TRUE)  
}

nyc_means
```

By creating a <span class="objclass">columns</span> object, if anything changes about the columns we want, that's the only line in the code that would need to be changed. The `i` is now a place holder for a number that goes from 1 to the length of <span class="objclass">columns</span> (i.e. `r length(columns)`).  We make an empty <span class="objclass">nyc_means</span> object that's the length of the columns, so that each element will eventually be the mean of the corresponding column.

In the following I remove precipitation and add visibility and air pressure.

```{r nyc_flights3}
columns = c('temp', 'humid', 'wind_speed', 'visib', 'pressure')
nyc_means = rep(NA, length(columns))

for (i in seq_along(columns)) {
  nyc_means[i] = mean(weather[[columns[i]]], na.rm = TRUE)
}

nyc_means %>% round(2)
```

Had we been copy-pasting, this would require deleting or commenting out a line in our code, pasting two more, and changing each one after pasting to represent the new columns.  That's tedious, and not a fun way to code.

### A slight speed gain

Note that you do not have to create an empty object like we did.  The following works also.

```{r nyc_flights4}
columns = c('temp', 'humid', 'wind_speed', 'visib', 'pressure')
nyc_means = numeric()

for (i in seq_along(columns)) {
  nyc_means[i] = mean(weather[[columns[i]]], na.rm = TRUE)
}

nyc_means %>% round(2)
```

However, the other approach is slightly faster, because memory is already allocated for all elements of <span class="objclass">nyc_means</span>, rather than updating it every iteration of the loop. This speed gain can become noticeable when dealing with thousands of columns and complex operations.

```{r loop_timing, eval=FALSE, echo=FALSE}
X = matrix(rnorm(1000000), ncol=10000)
means1 = rep(NA, ncol(X))
means2 = c()

loop1 = function(X) {
  for (i in 1:ncol(X)) {
    means1[i] = mean(X[,i])
  }
}

loop2 = function(X) {
  for (i in 1:ncol(X)) {
    means2[i] = mean(X[,i])
  }
}

microbenchmark::microbenchmark(loop1(X), loop2(X)) 
```


### While alternative

When you look at some people's R code, you may see a loop of a different sort. 

```{r nyc_flights_while}
columns = c('temp','humid','wind_speed', 'visib', 'pressure')
nyc_means = c()

i = 1
while (i <= length(columns)) {
  nyc_means[i] = mean(weather[[columns[i]]], na.rm = TRUE)
  i = i + 1
}

nyc_means %>% round(2)
```

This involves a <span class="emph">while</span> statement.  It states, while `i` is less than or equal to the length (number) of <span class="objclass">columns</span>, compute the value of the i^th^ element of <span class="objclass">nyc_means</span> as the mean of i^th^ column of <span class="objclass">weather</span>.  After that, increase the value of `i`.  So, we start with `i = 1`, compute that subsequent mean, `i` now equals 2, do the process again, and so on.  The process will stop as soon as the value of `i` is greater than the length of <span class="objclass">columns</span>.

*There is zero difference to using the while approach vs. the for loop*.  While is often used when there is a check to be made, e.g. in modeling functions that have to stop the estimation process at some point, or else they'd go on indefinitely.  In that case the while syntax is probably more natural.  Either is fine.

### Loops summary

Understanding loops is fundamental toward spending less time processing data and more time toward exploring it.  Your code will be more succinct and more able to handle the usual changes that come with dealing with data.  Now that you have a sense of it, know that once you are armed with the sorts of things we'll be talking about next- apply functions, writing functions, and vectorization - you'll likely have little need to write explicit loops.  While there is always a need for iterative processing of data, R provides even more efficient means to do so.

## Implicit Loops

Writing loops is straightforward once you get the initial hang of it. However, R offers alternative ways to do loops that can simplify code without losing readability.  As such, even when you loop in R, you don't have to do so explicitly.

### apply family

A family of functions comes with R that allows for a succinct way of looping when it is appropriate. Common functions in this family include:

- <span class="func">apply</span>
    - arrays, matrices, data.frames
- <span class="func">lapply</span>, <span class="func">sapply</span>, <span class="func">vapply</span>
    - lists, data.frames, vectors
- <span class="func">tapply</span>
    - grouped operations (table apply)
- <span class="func">mapply</span>
    - multivariate version of sapply
- <span class="func">replicate</span>
    - performs an operation N times
    
As an example we'll consider standardizing variables, i.e. taking a set of numbers, subtracting the mean, and dividing by the standard deviation.  This results in a variable with mean of 0 and standard deviation of 1.  Let's start with a loop approach.

```{r loopvsapply1, eval=FALSE}
for (i in 1:ncol(mydf)) {
  x = mydf[, i]
  for (j in 1:length(x)) {
    x[j] = (x[j] - mean(x)) / sd(x)
  }
}
```

The above would be a really bad way to use R. It goes over each column individually, then over each value of the column. 

Conversely, <span class="func">apply</span> will take a matrix or data frame, and apply a function over the margin, row or column, you want to loop over.  The first argument is the data you're considering, the margin is the second argument (1 for rows, 2 for columns[^arrays]), and the function you want to apply to those rows is the third argument.  The following example is much cleaner compared to the loop, and now you'd have a function you can use elsewhere if needed.

```{r loopvsapply2, eval=F}
stdize <- function(x) {
  (x - mean(x)) / sd(x)
}

apply(mydf, 2, stdize)   # 1 for rows, 2 for columnwise application
```


Many of the other apply functions work similarly, taking an object and a function to do the work on the object (possibly implicit), possibly with other arguments specified if necessary.


#### lapply

Let's say we have a list object, or even just a vector of values. There are no rows or columns to iterate over, so what do we do here?

```{r lapply_example}
x = list('aba', 'abb', 'abc', 'abd', 'abe')

lapply(x, str_remove, pattern = 'ab')
```

The <span class="func">lapply</span> operation iterates over each element of the list and applies a function to them.  In this case, the function is <span class="func">str_remove</span>.  It has an argument for the string pattern we want to take out of the character string that is fed to it ('ab').  For example, for 'aba' we will be left with just the 'a'. 

As can be seen, <span class="func">lapply</span> starts with a <span class="objclass">list</span> and returns a <span class="objclass">list</span>.  The only difference with <span class="func">sapply</span> is that <span class="func">sapply</span> will return a simplified form if possible[^slapply].

```{r sapply_example}
sapply(x, str_remove, pattern = 'ab')
```

In this case we just get a vector back.

    
### Apply functions

It is important to be familiar with the apply family for efficient data processing, if only because you'll regularly come code employing these functions. A summary of benefits includes:
  
- Cleaner/simpler code
- Environment kept clear of unnecessary objects
- Potentially more reproducible
    - more likely to use generalizable functions
- Parallelizable
    
Note that apply functions are <span class="emph">NOT</span> necessarily faster than explicit loops, and if you create an empty object for the loop as discussed previously, the explicit loop will likely be faster.  On top of that, functions like <span class="func">replicate</span> and <span class="func">mapply</span> are especially slow.

However, the apply family can ALWAYS *potentially* be faster than standard R loops do to parallelization.  With base R's <span class="pack">parallel</span> package, there are parallel versions of the apply family, e.g.<span class="func">parApply</span>, <span class="func">parLapply</span> etc.  As every modern computer has at least four cores to play with, you'll always potentially have nearly a 4x speedup by using the parallel apply functions.

Apply functions and similar approaches should be a part of your regular R experience.  We'll talk about other options that may have even more benefits, but you need to know the basics of how apply functions work in order to use those.  


<div class='note'>
I use R every day, and very rarely use explicit loops.  Note that there is no speed difference for a <span class="func">for</span> loop vs. using <span class="func">while</span>.  And if you must use an explicit loop, create an empty object of the dimension/form you need, and then fill it in via the loop.  This will be notably faster.

I pretty much never use an explicit double loop, as a little more thinking about the problem will usually provide a more efficient path to solving the problem.

<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;"> 
</div>


### purrr

The <span class="pack">purrr</span> package allows you to take the <span class="func">apply</span> family approach to the tidyverse.  And with packages <span class="pack" style = "">future</span> + <span class="pack" style = "">furrr</span>, they too are parallelizable.

Consider the following.  We'll use the <span class="func">map</span> function to map the <span class="func">sum</span> function to each element in the list, the same way we would with <span class="func">lapply</span>.

```{r purr_example, eval=TRUE, echo=-(1:3)}
# detach(package:maps)

map = purrr::map  # could not remove conflict with something, nor able to debug

x = list(1:3, 4:6, 7:9)

map(x, sum)
```


The map functions take some getting used to, and in my experience they are typically slower than the apply functions, sometimes notably so.  However they allow you stay within the tidy realm, which has its own benefits, and have more control over the nature of the output[^purrr], which is especially important in reproducibility, package development, producing production-level code, etc.  The key idea is that the map functions will always return something the same length as the input given to it.


[^purrr]:  The main thing to note is that both the benefit and difficulty of using the `map_*` functions is they are more strict with inputs, which typically should be lists or list-like objects like data frames, and outputs.

The <span class="pack">purrr</span> functions want a list or vector, i.e. they don't work with data.frame objects in the same way we've done with mutate and summarize except in the sense that data.frames are lists. 


```{r purr_example2, eval=-1}
mtcars %>% 
  map(scale)    # returns a list, not shown

mtcars %>% 
  map_df(scale) # returns a df

mtcars %>% 
  map_dbl(sum)  # returns a numeric (double) vector of column sums

diamonds %>%
  map_at(
    vars(carat, depth, price),
    function(x)
      as.integer(x > median(x))
  ) %>% 
  as_tibble()
```


However, working with lists is very useful, so let's turn to that.


## Looping with Lists

Aside from data frames, you may think you don't have much need for list objects.  However, list objects make it very easy to iterate some form of data processing.

Let's say you have models of increasing complexity, and you want to easily summarise and/or compare them.  We create a list for which each element is a model object. We then apply a function, e.g. to get the AIC value for each, or adjusted R square (this requires a custom function).

```{r purr_models}
library(mgcv) # for gam

mtcars$cyl = factor(mtcars$cyl)

mod_lm = lm(mpg ~ wt, data = mtcars)

mod_poly = lm(mpg ~ poly(wt, 2), data = mtcars)

mod_inter = lm(mpg ~ wt * cyl, data = mtcars)

mod_gam = gam(mpg ~ s(wt), data = mtcars)

mod_gam_inter = gam(mpg ~ cyl + s(wt, by = cyl), data = mtcars)

model_list = list(
  mod_lm        = mod_lm,
  mod_poly      = mod_poly,
  mod_inter     = mod_inter,
  mod_gam       = mod_gam,
  mod_gam_inter = mod_gam_inter
)

# lowest wins
model_list %>% 
  map_dbl(AIC) %>% 
  sort()

# highest wins
model_list %>% 
  map_dbl(
    function(x)
    if_else(inherits(x, 'gam'),
            summary(x)$r.sq,
            summary(x)$adj)
    ) %>% 
  sort(decreasing = TRUE)
```


Let's go further and create a plot of these results. We'll map to a data frame, use <span class="func" style = "">pivot_longer</span> to melt it to two columns of model and value, then use ggplot2 to plot the results[^ggplotalpha].

```{r purr_models2}
model_list %>% 
  map_df(
    function(x)
      if_else(inherits(x, 'gam'),
              summary(x)$r.sq,
              summary(x)$adj)
  ) %>% 
  pivot_longer(cols = starts_with('mod'),
               names_to = 'model',
               values_to = "Adj. Rsq") %>% 
  arrange(desc(`Adj. Rsq`)) %>%
  mutate(model = factor(model, levels = model)) %>% # sigh
  ggplot(aes(x = model, y = `Adj. Rsq`)) +
  geom_point(aes(color = model), size = 10, show.legend = F)
```

Why not throw in AIC also?

```{r purr_models3}
mod_rsq = 
  model_list %>% 
  map_df(
    function(x)
      if_else(
        inherits(x, 'gam'),
        summary(x)$r.sq,
        summary(x)$adj
      )
  ) %>% 
  pivot_longer(cols = starts_with('mod'),
               names_to = 'model',
               values_to = 'Rsq')

mod_aic =
  model_list %>% 
  map_df(AIC) %>%
  pivot_longer(cols = starts_with('mod'),
               names_to = 'model',
               values_to = 'AIC')

left_join(mod_rsq, mod_aic) %>%
  arrange(AIC) %>%
  mutate(model = factor(model, levels = model)) %>%
  pivot_longer(cols = -model, names_to = 'measure', values_to = 'value') %>%
  ggplot(aes(x = model, y = value)) +
  geom_point(aes(color = model), size = 10, show.legend = F) +
  facet_wrap(~ measure, scales = 'free')
```

#### List columns

As data.frames are lists, anything can be put into a column just as you would a list element.  We'll use <span class="func">pmap</span> here, as it can take more than one argument, and we're feeding all columns of the data.frame.  You don't need to worry about the details here, we just want to create a column that is actually a list. In this case the column will contain a data frame in each entry.

```{r list_tibble}
mtcars2 = as.matrix(mtcars)

mtcars2[sample(1:length(mtcars2), 50)] = NA   # add some missing data

mtcars2 = data.frame(mtcars2) %>%
  rownames_to_column(var = 'observation') %>%
  as_tibble()

head(mtcars2)

mtcars2 = 
  mtcars2 %>% 
  mutate(
    newvar = 
      pmap(., ~ data.frame(
        N = sum(!is.na(c(...))), 
        Missing = sum(is.na(c(...)))
        )
      )
  )
```

Now check out the list column.

```{r list_tibble2}
mtcars2

mtcars2$newvar %>% head(3)
```

Unnest it with the <span class="pack">tidyr</span> function.

```{r list_tibble3}
mtcars2 %>% 
  unnest(newvar)
```

This is a pretty esoteric demonstration, and not something you'd normally want to do, as <span class="func">mutate</span> or other approaches would be far more efficient and sensical.  However, the idea is that you might want to retain the information you might otherwise store in a list with the data that was used to create it. As an example, you could potentially attach models as a list column to a dataframe that contains meta-information about each model. Once you have a list column, you can use that column as you would any list for iterative programming.  


## Iterative Programming Exercises

### Exercise 1

With the following <span class="objclass">matrix</span>, use <span class="func">apply</span> and the <span class="func">sum</span> function to get row or column sums of the matrix <span class="objclass">x</span>.

```{r app_ex1, echo=1, eval=FALSE}
x = matrix(1:9, 3, 3)
apply()
```

### Exercise 2

With the following list object <span class="objclass">x</span>, use <span class="func">lapply</span> and <span class="func">sapply</span> and the <span class="func">sum</span> function to get sums for the elements.  There is no margin to specify for a list, so just supply the list and the sum function.  

```{r app_ex2, echo=1, eval=FALSE}
x = list(1:3, 4:10, 11:100)
lapply()
sapply()
```


### Exercise 3

As in the previous example, use a <span class="func" style = "">map</span> function to create a data frame of the column means.  See `?map` to see all your options.

```{r purrr_ex3, echo=1:5, eval=FALSE}
d = tibble(
  x = rnorm(100),
  y = rnorm(100, 10, 2),
  z = rnorm(100, 50, 10),
)

d %>% map_df(mean)
```

[^slapply]: <span class="func">sapply</span> is actually just a wrapper for <span class="func">lapply</span>.  If you supply the argument `simplified=F`, it is identical. Otherwise, it attempts to return a vector or matrix.

[^arrays]: You can have 3 or more dimensional arrays, and <span class="func">apply</span> will work over those dimensions (or any combination of them), but this doesn't come up too often.


[^ggplotalpha]: Note that ggplot2 will change the ordering of the variable unless you coerce it by creating a factor. This pretty much defeats the entire purpose of retaining characters over factors everywhere else in the tidyverse.