# Writing Functions


<div class="" style="text-align: center;">
<i class="fas fa-terminal fa-5x" ></i>
</div>


You can't do anything in R without using functions, but have you ever written your own?  Why would you?

- Efficiency
- Customized functionality
- Reproducibility
- Extend the work that's already been done

There are many benefits to writing your own functions, and it's actually easy to do.  Once you get the basic concept down, you'll likely find yourself using your own functions more and more.

## A Starting Point

Let's assume you want to calculate the mean, standard deviation, and number of missing values for a variable called `myvar`.  We could do something like the following:

```{r func_ex, eval=FALSE}
mean(myvar)
sd(myvar)
sum(is.na(myvar))
```

Now let's say you need to do it for several variables.  Here's what your custom function could look like.  It takes a single input, the variable you want information about, and returns a data frame with that info.

```{r func_ex2}
my_summary <- function(x) {
  data.frame(
    mean = mean(x),
    sd = sd(x),
    N_missing = sum(is.na(x))
  )
}
```


In the above, `x` is an arbitrary name for an input. You can name it whatever you want, but the more meaningful the better.  In R (and other languages) these inputs are called *arguments*, and they determine in part what the function eventually produces as output.

```{r func_ex3, echo=-1}
mtcars = datasets::mtcars  # to undo previous factors
my_summary(mtcars$mpg)
```

Works fine. However, data typically isn't that pretty. It often has missing values.

```{r func_ex_missing}
load('data/gapminder.RData')
my_summary(gapminder_2019$lifeExp)
```


If there are actually missing values, we need to set `na.rm = TRUE`, or <span class="func">mean</span> and <span class="func">sd</span> will return `NA`. Let's try it. We can either hard-code it, as in the first function below, or add an argument that lets us control how NAs are handled, as in the second.

```{r func_ex3ext}
my_summary <- function(x) {
  data.frame(
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    N_missing = sum(is.na(x))
  )
}


my_summary_na <- function(x, remove_na = TRUE) {
  data.frame(
    mean = mean(x, na.rm = remove_na),
    sd = sd(x, na.rm = remove_na),
    N_missing = sum(is.na(x))
  )
}


my_summary(gapminder_2019$lifeExp)

my_summary_na(gapminder_2019$lifeExp, remove_na = FALSE)
```
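
To see what the `remove_na` switch does without relying on the gapminder data, here is a minimal sketch using a small made-up vector. The `toy` object is a hypothetical stand-in, and the function definition is simply repeated so the chunk runs on its own.

```{r func_ex3ext_toy}
# same definition as above, repeated so this chunk is self-contained
my_summary_na <- function(x, remove_na = TRUE) {
  data.frame(
    mean = mean(x, na.rm = remove_na),
    sd = sd(x, na.rm = remove_na),
    N_missing = sum(is.na(x))
  )
}

toy <- c(2, 4, 6, NA)   # made-up values with one missing

my_summary_na(toy)                     # mean = 4, sd = 2, N_missing = 1
my_summary_na(toy, remove_na = FALSE)  # mean and sd are NA, N_missing is still 1
```

The missing count is the same either way, since `is.na` doesn't depend on `na.rm`.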


Seems to work fine.  Let's add how many total observations there are.

```{r func_ex4}
my_summary <- function(x) {
  # create an arbitrarily named object with the summary information
  summary_data = data.frame(
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    N_total = length(x),
    N_missing = sum(is.na(x))
  )
  
  # return the result!
  summary_data       
}
```

That was easy! Let's try it.

```{r func_ex5}
my_summary(gapminder_2019$lifeExp)
```


Now let's do it for every column!  We've used the <span class="func">map</span> function before; now let's use a variant that returns a data frame.

```{r func_ex6}
gapminder_2019 %>% 
  select_if(is.numeric) %>% 
  map_dfr(my_summary, .id = 'variable')
```

The <span class="func">map_dfr</span> function works just like our previous usage in the [iterative programming][Iterative programming] section, except that it creates mini data frames and then row-binds them together.
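
If it helps to see the 'mini data frames, then row-bind' idea spelled out, the following is a rough base R sketch of the same pattern. It is not how <span class="func">map_dfr</span> is implemented; `small_summary` is a hypothetical helper, and the two `mtcars` columns are picked arbitrarily.

```{r map_dfr_base_sketch}
# hypothetical helper: returns a one-row data frame for a single variable
small_summary <- function(x) {
  data.frame(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

cars_subset <- datasets::mtcars[c('mpg', 'hp')]

# step 1: one mini data frame per column, kept in a named list
pieces <- lapply(cars_subset, small_summary)

# step 2: row-bind the pieces and keep the list names as an id column
combined <- cbind(
  variable = names(pieces),
  do.call(rbind, pieces)
)
rownames(combined) <- NULL

combined
```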

This shows that writing the first part of any function can be straightforward. Then, once in place, you can usually add functionality without too much trouble.  Eventually you could have something very complicated, but which will make sense to you because you built it from the ground up.  

As you start out, keep in mind that the first decisions to make are:

- What are the inputs (<span class="emph">arguments</span>) to the function?
- What is the <span class="emph">value</span> to be returned? 

When you think about writing a function, just write the code that can do it first.  The goal is then to generalize beyond that single use case.  RStudio even has a shortcut to help you get started. Consider our starting point. Highlight the code, choose Code > Extract Function from the menu (Ctrl + Alt + X, or Cmd + Option + X on a Mac), then give it a name.

```{r func_ex_redux, eval=FALSE}
mean(myvar)
sd(myvar)
sum(is.na(myvar))
```

It should look something like this.

```{r func_ex_redux_redux, eval=FALSE}
test_fun <- function(myvar) {
  mean(myvar)
  sd(myvar)
  sum(is.na(myvar))
}
```

RStudio could tell that you would need at least one input, `myvar`.  From there, you can tweak the function as you see fit.
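
One thing to watch for with the extracted version: an R function returns the value of its last evaluated expression, so as written `test_fun` only gives back the missing count. A minimal sketch of the issue and one possible fix follows. The name `test_fun_all` is hypothetical, and returning a list is just one option among several (a data frame, as used earlier, works too).

```{r func_ex_redux_return}
test_fun <- function(myvar) {
  mean(myvar)          # computed, then discarded
  sd(myvar)            # computed, then discarded
  sum(is.na(myvar))    # last expression, so this is what comes back
}

test_fun(c(1, 2, 3, NA))   # 1, the number of missing values

# collect the pieces in one object so that all of them are returned
test_fun_all <- function(myvar) {
  list(
    mean = mean(myvar, na.rm = TRUE),
    sd = sd(myvar, na.rm = TRUE),
    n_missing = sum(is.na(myvar))
  )
}

test_fun_all(c(1, 2, 3, NA))
```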


Note that what goes in and what comes out could be anything, even nothing!

```{r nothing_function}
two <- function() {
  2
}

two()
```

Or even another function!

```{r function_function}
center <- function(type) {
  if (type == 'mean') {
    mean
  } else {
    median
  }
}

center(type = 'mean')

myfun = center(type = 'mean')

myfun(1:5)

myfun = center(type = 'median')

myfun(1:4)
```

We can also set default values for the inputs.

```{r default_arg}
hi <- function(name = 'Beyoncé') {
  paste0('Hi ', name, '!')
}

hi()
hi(name = 'Jay-Z')
```

If you are working within an RStudio project, it would be a good idea to create a folder for your functions and save each in its own script. When you need a function, just use the following:

```{r source_func, eval=FALSE}
source('my_functions/awesome_func.R')
```
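
If the folder grows to hold several scripts, one option is to source them all in a loop. The sketch below builds a throwaway folder in a temporary directory so that it runs anywhere. The folder name follows the example above, while `shout.R` and the `shout` function are made up for illustration.

```{r source_all_sketch}
# throwaway folder standing in for my_functions/
func_dir <- file.path(tempdir(), 'my_functions')
dir.create(func_dir, showWarnings = FALSE)

# write a tiny, made-up function to its own script
writeLines(
  "shout <- function(x) paste0(toupper(x), '!')",
  file.path(func_dir, 'shout.R')
)

# source every .R file found in the folder
for (f in list.files(func_dir, pattern = '\\.R$', full.names = TRUE)) {
  source(f)
}

shout('hi')   # "HI!"
```

Keeping one function per script, as suggested, means a loop like this picks up new functions without any further changes.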


This would make it easy to even create your own personal package with the functions you create.  


However you go about creating a function, and for whatever purpose, try to make clear decisions at the beginning:

- What is the (specific) goal of your function?  
- What is the minimum needed to obtain that goal?


<br>
<div class='note'>
There is even a keyboard shortcut to insert a roxygen-style documentation skeleton for your function automatically!

<div style='text-align:center; font-family:"Roboto Mono"'>Cmd/Ctrl + Option/Alt + Shift + R</div>
<br>

<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;"> 
</div>


## DRY

An oft-quoted mantra in programming is ***D**on't **R**epeat **Y**ourself*.  One context regards iterative programming, where we would rather write one line of code than one hundred.  More generally though, we would like to gain efficiency where possible.  A good rule of thumb is that if you are writing the same set of code more than twice, you should write a function to do it instead.

Consider the following example, where we want to subset the data given a set of conditions.  Given the cylinder, engine displacement, and mileage, we'll get different parts of the data.

```{r dry, eval=FALSE}
good_mileage_displ_low_cyl_4  = if_else(cyl == 4 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_low_cyl_6  = if_else(cyl == 6 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_low_cyl_8  = if_else(cyl == 8 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_4 = if_else(cyl == 4 & displ > mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_6 = if_else(cyl == 6 & displ > mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_8 = if_else(cyl == 8 & displ > mean(displ) & hwy > 30, 'yes', 'no')
```

It was tedious, but that's not much code.  But now consider: what if you want to change the mpg cutoff? The mean to median? Something else?  You have to change all of it.  Screw that, let's write a function instead! What kinds of inputs will we need?

- cylinder: Which cylinder type we want
- mpg_cutoff: The cutoff for 'good' mileage
- displ_fun: Whether the displacement cutoff is based on the mean or something else
- displ_low: Whether we are interested in low or high displacement vehicles
- cls: The class of the vehicle (e.g. compact or suv)

```{r mpgfunc}

good_mileage <- function(
  cylinder = 4,
  mpg_cutoff = 30,
  displ_fun = mean,
  displ_low = TRUE,
  cls = 'compact'
) {
  
  if (displ_low == TRUE) {              # condition to check, if it holds,
    result <- mpg %>%                   # filter data given the arguments
      filter(
        cyl == cylinder,
        displ <= displ_fun(displ),
        hwy >= mpg_cutoff,
        class == cls
      )
  } 
  else {                                # if the condition doesn't hold, filter 
    result <- mpg %>%                   # the data this way instead
      filter(
        cyl == cylinder,
        displ >= displ_fun(displ),      # the only change is here
        hwy >= mpg_cutoff,
        class == cls
      )
  }
  
  result                                # return the object
}
```

So what's going on here? Not a whole lot really. The function just filters the data to observations that match the input criteria, and returns that result at the end.  We also gave the arguments *default values*, which you can do at your discretion.

## Conditionals

The core of the `good_mileage` function is a *conditional statement* with the standard <span class="func">if</span>...<span class="func">else</span> structure.  The <span class="func">if</span> part checks whether some condition holds.  If it does, the code in the brackets that follow is run.  If not, R skips to the <span class="func">else</span> part. You may have used the <span class="func">ifelse</span> function in base R, or <span class="pack">dplyr</span>'s <span class="func">if_else</span> as above, which are shortcuts for this approach that work on whole vectors.  We can also add conditional else statements (<span class="func">else if</span>), drop the <span class="func">else</span> part entirely, nest conditionals within other conditionals, etc.  Like loops, conditional statements look very similar across all programming languages.

JavaScript:

```{javascript, eval=FALSE}
if (Math.random() < 0.5) {
  console.log("You got Heads!")
} else {
  console.log("You got Tails!")
}
```

Python: 

```{python, eval=FALSE}
if x == 2:
  print(x)
else:
  print(x*x)
```
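
Back in R, the `else if` variant mentioned above could look like the following sketch. The function name `describe_mileage` and the cutoffs are made up for illustration.

```{r else_if_sketch}
describe_mileage <- function(hwy) {
  # expects a single number; if () does not work elementwise on vectors
  if (hwy >= 30) {
    'good'
  } else if (hwy >= 20) {
    'okay'
  } else {
    'poor'
  }
}

describe_mileage(35)   # "good"
describe_mileage(24)   # "okay"
describe_mileage(12)   # "poor"
```

The conditions are checked in order, and only the first one that holds is used.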


In any case, with our `good_mileage` function at the ready, we can now do the things we want to as needed:

```{r mpgfunc_demo}
good_mileage(mpg_cutoff = 40)

good_mileage(
  cylinder = 8,
  mpg_cutoff = 15,
  displ_low = FALSE,
  cls = 'suv'
)
```

Let's extend the functionality by adding a year argument (the only values available are 2008 and 1999).

```{r mpgfunc_extend, echo=1:6}
good_mileage <- function(
  cylinder = 4,
  mpg_cutoff = 30,
  displ_fun = mean,
  displ_low = TRUE,
  cls = 'compact',
  yr = 2008
) {
  
  if (displ_low) {
    result = mpg %>%
      filter(
        cyl == cylinder,
        displ <= displ_fun(displ),
        hwy >= mpg_cutoff,
        class == cls,
        year == yr
      )
  } else {
    result = mpg %>%
      filter(
        cyl == cylinder,
        displ >= displ_fun(displ),
        hwy >= mpg_cutoff,
        class == cls,
        year == yr
      )
  }
  
  result
}
```

```{r mpgfunc_extend_demo}
good_mileage(
  cylinder = 8,
  mpg_cutoff = 19,
  displ_low = FALSE,
  cls = 'suv',
  yr = 2008
)
```

So we now have something that is *flexible*, *reusable*, and *extensible*, and it took less code than writing out the individual lines of code.
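
In the spirit of DRY, note that the two branches of `good_mileage` differ only in the comparison operator. One possible way to tighten that up is sketched below. It assumes <span class="pack">dplyr</span> and <span class="pack">ggplot2</span> (for the `mpg` data) are installed, and `good_mileage_dry` and `cmp` are hypothetical names rather than anything used elsewhere in this document.

```{r mpgfunc_dry_sketch}
library(dplyr)

good_mileage_dry <- function(
  cylinder = 4,
  mpg_cutoff = 30,
  displ_fun = mean,
  displ_low = TRUE,
  cls = 'compact',
  yr = 2008
) {
  # pick the comparison once, instead of duplicating the whole filter
  cmp <- if (displ_low) `<=` else `>=`

  ggplot2::mpg %>%
    filter(
      cyl == cylinder,
      cmp(displ, displ_fun(displ)),
      hwy >= mpg_cutoff,
      class == cls,
      year == yr
    )
}

good_mileage_dry(
  cylinder = 8,
  mpg_cutoff = 19,
  displ_low = FALSE,
  cls = 'suv',
  yr = 2008
)
```

Operators like `<=` are ordinary functions in R, which is why one can be stored in `cmp` and called like any other function.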


## Anonymous functions

Oftentimes we just need a quick and easy function for a one-off application, especially when using apply/map functions.  Consider the following two lines of code.

```{r lambda, eval=FALSE}
apply(mtcars, 2, sd)
apply(mtcars, 2, function(x) x / 2 )
```

The difference between the two is that for the latter, our function didn't have to be a named object already available.  We created a function on the fly just to serve a specific purpose.  Base R has no function that does nothing but divide by two, but since the operation is simple, we just created one as needed.

To further illustrate this, we'll create a robust standardization function that uses the median and median absolute deviation rather than the mean and standard deviation.

```{r lambda_ex}
# some variables have a mad = 0, and so return Inf (x/0) or NaN (0/0)
# apply(mtcars, 2, function(x) (x - median(x))/mad(x)) %>% 
#   head()

mtcars %>%
  map_df(function(x) (x - median(x))/mad(x))
```


Even if you don't use [anonymous functions](https://en.wikipedia.org/wiki/Anonymous_function) (sometimes called *lambda* functions), it's important to understand them, because you'll often see other people's code using them.
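
Because you will run into several spellings in other people's code, here is a brief sketch of three equivalent ways to write the divide-by-two function. The `\(x)` shorthand assumes R 4.1.0 or later, and the `~ .x` formula style is specific to <span class="pack">purrr</span>. The object names are made up for illustration.

```{r lambda_shorthand_sketch}
halve_long  <- sapply(1:4, function(x) x / 2)    # classic anonymous function
halve_short <- sapply(1:4, \(x) x / 2)           # shorthand, R >= 4.1.0
halve_purrr <- purrr::map_dbl(1:4, ~ .x / 2)     # purrr formula style

identical(halve_long, halve_short)   # TRUE
identical(halve_long, halve_purrr)   # TRUE
```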


<br>
<div class='note'>
While it goes beyond the scope of this document at present, I should note that RStudio has a very nice and easy-to-use debugger.  Once you get comfortable writing functions, you can use the debugger to troubleshoot problems that arise and to test new functionality (see the 'Debug' menu).  In addition, you can profile functions to see which parts are, for example, more memory intensive, or otherwise serve as a bottleneck (see the 'Profile' menu).  You can use the profiler on any code, not just functions.
<br>

<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;"> 
</div>


## Writing Functions Exercises

### Exercise 1

Write a function that takes the log of the sum of two values (i.e. just two single numbers) using the <span class="func">log</span> function.  Just remember that within a function, you can write R code just like you normally would.

```{r wf_ex1, eval=FALSE}
log_sum <- function(a, b) {
  ?
}
```


### Exercise 1b

What happens if the sum of the two numbers is negative?  You can't take the log of a negative value, so R returns `NaN` with a warning rather than stopping.  How might we deal with this?  Try using a conditional to provide an error message using the <span class="func">stop</span> function.  The first part is basically identical to the function you just wrote.  But given that result, you will need to check whether it is negative or not.  The message can be whatever you want.

```{r wf_ex1b, eval=FALSE}
log_sum <- function(a, b) {
  
  ?
  
  if (? < 0) {
    stop('Your message here.')
  } else {
    ?
    return(your_log_sum_results)    # this is an arbitrary name, change accordingly
  }
}
```


### Exercise 2


Let's write a function that will take a numeric variable and convert it to a character string of 'positive' vs. 'negative'.  We can use the `if {}... else {}` structure, <span class="func">ifelse</span>, or <span class="pack">dplyr</span>::<span class="func">if_else</span>; any of them can accomplish this.  In this case, the input is a single vector of numbers, and the output will recode any negative value to 'negative' and positive values to 'positive' (or whatever you want).  Here is an example of how we would just do it as a one-off.

```{r wf_ex2, eval=FALSE}
set.seed(123)  # so you get the exact same 'random' result
x <- rnorm(10)
if_else(x < 0, "negative", "positive")
```

Now try your hand at writing a function for that.

```{r wf_ex2b, eval=FALSE}
pos_neg <- function(?) {
  ?
}
```