-
Notifications
You must be signed in to change notification settings - Fork 13
/
functions.Rmd
445 lines (315 loc) · 14.4 KB
/
functions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
# Writing Functions
<div class="" style="text-align: center;">
<i class="fas fa-terminal fa-5x" ></i>
</div>
You can't do anything in R without using functions, but have you ever written your own? Why would you?
- Efficiency
- Customized functionality
- Reproducibility
- Extend the work that's already been done
There are many benefits to writing your own functions, and it's actually easy to do. Once you get the basic concept down, you'll likely find yourself using your own functions more and more.
## A Starting Point
Let's assume you want to calculate the mean, standard deviation, and number of missing values for a variable, called `myvar`. We could do something like the following
```{r func_ex, eval=FALSE}
mean(myvar)
sd(myvar)
sum(is.na(myvar))
```
Now let's say you need to do it for several variables. Here's what your custom function could look like. It takes a single input, the variable you want information about, and returns a data frame with that info.
```{r func_ex2}
my_summary <- function(x) {
data.frame(
mean = mean(x),
sd = sd(x),
N_missing = sum(is.na(x))
)
}
```
In the above, `x` is an arbitrary name for an input. You can name it whatever you want, but the more meaningful the better. In R (and other languages) these are called *arguments*, but these inputs will determine in part what is eventually produced as output by the function.
```{r func_ex3, echo=-1}
mtcars = datasets::mtcars # to undo previous factors
my_summary(mtcars$mpg)
```
Works fine. However, data typically isn't that pretty. It often has missing values.
```{r func_ex_missing}
load('data/gapminder.RData')
my_summary(gapminder_2019$lifeExp)
```
If there are actually missing values, we need to set `na.rm = TRUE` or the <span class="func">mean</span> and <span class="func">sd</span> will return `NA`. Let's try it. We can either hard bake it in, as in the initial example, or add an argument to let us control how to handle NAs with our custom function.
```{r func_ex3ext}
my_summary <- function(x) {
data.frame(
mean = mean(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
N_missing = sum(is.na(x))
)
}
my_summary_na <- function(x, remove_na = TRUE) {
data.frame(
mean = mean(x, na.rm = remove_na),
sd = sd(x, na.rm = remove_na),
N_missing = sum(is.na(x))
)
}
my_summary(gapminder_2019$lifeExp)
my_summary_na(gapminder_2019$lifeExp, remove_na = FALSE)
```
Seems to work fine. Let's add how many total observations there are.
```{r func_ex4}
my_summary <- function(x) {
# create an arbitrarily named object with the summary information
summary_data = data.frame(
mean = mean(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
N_total = length(x),
N_missing = sum(is.na(x))
)
# return the result!
summary_data
}
```
That was easy! Let's try it.
```{r func_ex5}
my_summary(gapminder_2019$lifeExp)
```
Now let's do it for every column! We've used the <span class="func">map</span> function before, now let's use a variant that will return a data frame.
```{r func_ex6}
gapminder_2019 %>%
select_if(is.numeric) %>%
map_dfr(my_summary, .id = 'variable')
```
The <span class="func">map_dfr</span> function is just like our previous usage in the [iterative programming][Iterative programming] section, just that it will create mini-data.frames then row-bind them together.
This shows that writing the first part of any function can be straightforward. Then, once in place, you can usually add functionality without too much trouble. Eventually you could have something very complicated, but which will make sense to you because you built it from the ground up.
Keep in mind as you start out that your initial decisions to make are:
- What are the inputs (<span class="emph">arguments</span>) to the function?
- What is the <span class="emph">value</span> to be returned?
When you think about writing a function, just write the code that can do it first. The goal is then to generalize beyond that single use case. RStudio even has a shortcut to help you get started. Consider our starting point. Highlight the code, hit Ctrl/Cmd + Shft + X, then give it a name.
```{r func_ex_redux, eval=FALSE}
mean(myvar)
sd(myvar)
sum(is.na(myvar))
```
It should look something like this.
```{r func_ex_redux_redux, eval=FALSE}
test_fun <- function(myvar) {
mean(myvar)
sd(myvar)
sum(is.na(myvar))
}
```
RStudio could tell that you would need at least one input `myvar`, but beyond that, you're now on your way to tweaking the function as you see fit.
Note that what goes in and what comes out could be anything, even nothing!
```{r nothing_function}
two <- function() {
2
}
two()
```
Or even another function!
```{r function_function}
center <- function(type) {
if (type == 'mean') {
mean
}
else {
median
}
}
center(type = 'mean')
myfun = center(type = 'mean')
myfun(1:5)
myfun = center(type = 'median')
myfun(1:4)
```
We can also set default values for the inputs.
```{r default_arg}
hi <- function(name = 'Beyoncé') {
paste0('Hi ', name, '!')
}
hi()
hi(name = 'Jay-Z')
```
If you are working within an RStudio project, it would be a good idea to create a folder for your functions and save each as their own script. When you need the function just use the following:
```{r source_func, eval=FALSE}
source('my_functions/awesome_func.R')
```
This would make it easy to even create your own personal package with the functions you create.
However you go about creating a function and for whatever purpose, try to make a clear decision at the beginning
- What is the (specific) goal of your function?
- What is the minimum needed to obtain that goal?
<br>
<div class='note'>
There is even a keyboard shortcut to create R style documentation automatically!
<div style='text-align:center; font-family:"Roboto Mono"'>Cmd/Ctrl + Option/Alt + Shift + R</div>
<br>
<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;">
</div>
## DRY
An oft-quoted mantra in programming is ***D**on't **R**epeat **Y**ourself*. One context regards iterative programming, where we would rather write one line of code than one-hundred. More generally though, we would like to gain efficiency where possible. A good rule of thumb is, if you are writing the same set of code more than twice, you should write a function to do it instead.
Consider the following example, where we want to subset the data given a set of conditions. Given the cylinder, engine displacement, and mileage, we'll get different parts of the data.
```{r dry, eval=FALSE}
good_mileage_displ_low_cyl_4 = if_else(cyl == 4 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_low_cyl_6 = if_else(cyl == 6 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_low_cyl_8 = if_else(cyl == 8 & displ < mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_4 = if_else(cyl == 4 & displ > mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_6 = if_else(cyl == 6 & displ > mean(displ) & hwy > 30, 'yes', 'no')
good_mileage_displ_high_cyl_8 = if_else(cyl == 8 & displ > mean(displ) & hwy > 30, 'yes', 'no')
```
It was tedious, but that's not much code. But now consider- what if you want to change the mpg cutoff? The mean to median? Something else? You have to change all of it. Screw that- let's write a function instead! What kinds of inputs will we need?
- cyl: Which cylinder type we want
- mpg_cutoff: The cutoff for 'good' mileage
- displ_fun: Whether the displacement to be based on the mean or something else
- displ_low: Whether we are interested in low or high displacement vehicles
- cls: the class of the vehicle (e.g. compact or suv)
```{r mpgfunc}
good_mileage <- function(
cylinder = 4,
mpg_cutoff = 30,
displ_fun = mean,
displ_low = TRUE,
cls = 'compact'
) {
if (displ_low == TRUE) { # condition to check, if it holds,
result <- mpg %>% # filter data given the arguments
filter(
cyl == cylinder,
displ <= displ_fun(displ),
hwy >= mpg_cutoff,
class == cls
)
}
else { # if the condition doesn't hold, filter
result <- mpg %>% # the data this way instead
filter(
cyl == cylinder,
displ >= displ_fun(displ), # the only change is here
hwy >= mpg_cutoff,
class == cls
)
}
result # return the object
}
```
So what's going on here? Not a whole lot really. The function just filters the data to observations that match the input criteria, and returns that result at the end. We also put *default values* to the arguments, which can be done to your discretion.
## Conditionals
The core of the above function uses a *conditional statement* using standard <span class="func">if</span>...<span class="func">else</span> structure. The <span class="func">if</span> part determines whether some condition holds. If it does, then proceed to the next step in the brackets. If not, skip to the <span class="func">else</span> part. You may have used the <span class="func">ifelse</span> function in base R, or <span class="pack">dplyr</span>'s <span class="func">if_else</span> as above, which are a short cuts for this approach. We can also add conditional else statements (<span class="func">else if</span>), drop the <span class="func">else</span> part entirely, nest conditionals within other conditionals, etc. Like loops, conditional statements look very similar across all programming languages.
JavaScript:
```{javascript, eval=FALSE}
if (Math.random() < 0.5) {
console.log("You got Heads!")
} else {
console.log("You got Tails!")
}
```
Python:
```{python, eval=FALSE}
if x == 2:
print(x)
else:
print(x*x)
```
In any case, with our function at the ready, we can now do the things we want to as needed:
```{r mpgfunc_demo}
good_mileage(mpg_cutoff = 40)
good_mileage(
cylinder = 8,
mpg_cutoff = 15,
displ_low = FALSE,
cls = 'suv'
)
```
Let's extend the functionality by adding a year argument (the only values available are 2008 and 1999).
```{r mpgfunc_extend, echo=1:6}
good_mileage <- function(
cylinder = 4,
mpg_cutoff = 30,
displ_fun = mean,
displ_low = TRUE,
cls = 'compact',
yr = 2008
) {
if (displ_low) {
result = mpg %>%
filter(cyl == cylinder,
displ <= displ_fun(displ),
hwy >= mpg_cutoff,
class == cls,
year == yr)
}
else {
result = mpg %>%
filter(cyl == cylinder,
displ >= displ_fun(displ),
hwy >= mpg_cutoff,
class == cls,
year == yr)
}
result
}
```
```{r mpgfunc_extend_demo}
good_mileage(
cylinder = 8,
mpg_cutoff = 19,
displ_low = FALSE,
cls = 'suv',
yr = 2008
)
```
So we now have something that is *flexible*, *reusable*, and *extensible*, and it took less code than writing out the individual lines of code.
## Anonymous functions
Oftentimes we just need a quick and easy function for a one-off application, especially when using apply/map functions. Consider the following two lines of code.
```{r lambda, eval=FALSE}
apply(mtcars, 2, sd)
apply(mtcars, 2, function(x) x / 2 )
```
The difference between the two is that for the latter, our function didn't have to be a named object already available. We created a function on the fly just to serve a specific purpose. A function doesn't exist in base R that just does nothing but divide by two, but since it is simple, we just created it as needed.
To further illustrate this, we'll create a robust standardization function that uses the median and median absolute deviation rather than the mean and standard deviation.
```{r lambda_ex}
# some variables have a mad = 0, and so return Inf (x/0) or NaN (0/0)
# apply(mtcars, 2, function(x) (x - median(x))/mad(x)) %>%
# head()
mtcars %>%
map_df(function(x) (x - median(x))/mad(x))
```
Even if you don't use [anonymous functions](https://en.wikipedia.org/wiki/Anonymous_function) (sometimes called *lambda* functions), it's important to understand them, because you'll often see other people's code using them.
<br>
<div class='note'>
While it goes beyond the scope of this document at present, I should note that RStudio has a very nice and easy to use debugger. Once you get comfortable writing functions, you can use the debugger to troubleshoot problems that arise, and test new functionality (see the 'Debug' menu). In addition, one can profile functions to see what parts are, for example, more memory intensive, or otherwise serve as a bottleneck (see the 'Profile' menu). You can use the profiler on any code, not just functions.
<br>
<img class='img-note' src="img/R.ico" style="display:block; margin: 0 auto;">
</div>
## Writing Functions Exercises
### Excercise 1
Write a function that takes the log of the sum of two values (i.e. just two single numbers) using the <span class="func">log</span> function. Just remember that within a function, you can write R code just like you normally would.
```{r wf_ex1, eval=FALSE}
log_sum <- function(a, b) {
?
}
```
### Excercise 1b
What happens if the sum of the two numbers is negative? You can't take a log of a negative value, so it's an error. How might we deal with this? Try using a conditional to provide an error message using the <span class="func">stop</span> function. The first part is basically identical to the function you just did. But given that result, you will need to check for whether it is negative or not. The message can be whatever you want.
```{r wf_ex1b, eval=FALSE}
log_sum <- function(a, b) {
?
if (? < 0) {
stop('Your message here.')
} else {
?
return(your_log_sum_results) # this is an arbitrary name, change accordingly
}
}
```
### Exercise 2
Let's write a function that will take a numeric variable and convert it to a character string of 'positive' vs. 'negative'. We can use `if {}... else {}` structure, <span class="func">ifelse</span>, or <span class="pack">dplyr</span>::<span class="func">if_else</span>- they all would accomplish this. In this case, the input is a single vector of numbers, and the output will recode any negative value to 'negative' and positive values to 'positive' (or whatever you want). Here is an example of how we would just do it as a one-off.
```{r wf_ex2, eval=FALSE}
set.seed(123) # so you get the exact same 'random' result
x <- rnorm(10)
if_else(x < 0, "negative", "positive")
```
Now try your hand at writing a function for that.
```{r wf_ex2b, eval=FALSE}
pos_neg <- function(?) {
?
}
```