
Tidyverse functions

Peter Desmet edited this page Jan 30, 2019 · 9 revisions

Tidyverse functions are part of the tidyverse collection of R packages. These functions work well together, keep the code readable and are good for exploring and transforming data, which is why we try to use only these functions in the mapping script. For mapping data to Darwin Core, we noticed you can mostly get by with just three functions: mutate(), recode() and case_when(), which are discussed below. To learn more about the other tidyverse functions used in the mapping script, type ?function_name in your RStudio console or check the documentation of the tidyverse packages.

A quick word about piping

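Piping means chaining functions with the pipe operator %>% (or "pipe"), which passes the result on its left as the first argument to the function on its right. It is easy to use and greatly improves the readability of your code: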

# Take the dataframe "taxon", group the values of the column "kingdom" and show a count for each unique value
taxon %>%
  group_by(kingdom) %>%
  count()

This is much more readable than the classic approach of nesting functions:

# Take the dataframe "taxon", group the values of the column "kingdom" and show a count for each unique value
count(group_by(taxon, kingdom))
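To try both styles yourself, here is a minimal sketch on a small, made-up taxon data frame (the rows are invented for illustration). It assumes the dplyr package is installed, and it compares the two results as plain data frames:

```r
library(dplyr)

# Hypothetical toy data; only the column name "kingdom" comes from the example above
taxon <- tibble(
  scientific_name = c("Canis lupus", "Vulpes vulpes", "Quercus robur"),
  kingdom = c("Animalia", "Animalia", "Plantae")
)

# Piped version
piped <- taxon %>%
  group_by(kingdom) %>%
  count()

# Nested version
nested <- count(group_by(taxon, kingdom))

# Both contain the same counts: Animalia 2, Plantae 1
same_result <- identical(as.data.frame(piped), as.data.frame(nested))
print(piped)
```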

mutate()

mutate() adds a column to your data frame or updates an existing one. You use it to add a new Darwin Core term to your data frame and populate it with one or more values. To allow comparison between the source data and the Darwin Core terms, do not update existing columns.

The basic code for mutate() looks like this:

input_data %<>% mutate(new_column_name = ...)

With:

  • input_data: a data frame with your input data, i.e. the source checklist data
  • %<>%: the assignment pipe from the magrittr package, a shorter way of writing input_data <- input_data %>% ...
  • mutate(): a function to add or update a column
  • new_column_name: the name of the column you want to add to the data frame, i.e. the Darwin Core term. If a column with that name already exists, mutate() will update that column, which you want to avoid. That is why we suggest prefixing all Darwin Core column names with dwc_, so you don't accidentally update one of the source columns. The prefix will be removed in the post-processing step.
  • ...: the value(s) to populate this new column with, whether these are static, unaltered or altered
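The pieces above can be put together in a small runnable sketch. The data are made up for illustration, and we assume both dplyr and magrittr are installed (magrittr has to be loaded explicitly to get %<>%):

```r
library(dplyr)
library(magrittr) # provides the %<>% assignment pipe

# Hypothetical source data
taxon <- tibble(
  scientific_name = c("Canis lupus", "Vulpes vulpes"),
  kingdom = c("Animalia", "Animalia")
)

# Long form: pipe into mutate(), then assign the result back
taxon <- taxon %>% mutate(dwc_kingdom = kingdom)

# Short form with the assignment pipe: same effect in one step
taxon %<>% mutate(dwc_scientificName = scientific_name)

# The source columns are untouched; the dwc_ columns are added after them
column_names <- names(taxon)
print(column_names)
```

Because of the dwc_ prefix, the source column kingdom and the new column dwc_kingdom can sit side by side and be compared.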

Mapping static values

Some Darwin Core terms have the same static value for every record in the data, i.e. their content is constant for the whole dataset. This is mostly the case for record-level terms (metadata) in the taxon core, but other terms can be static as well.

To map to a static value, write that value in "double quotes":

taxon %<>% mutate(dwc_license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(dwc_kingdom = "Animalia")

Mapping unaltered values

To copy the unaltered value of a source column to a Darwin Core term, use the name of that column as your value:

taxon %<>% mutate(dwc_scientificName = scientific_name)

Mapping altered values

If you want to standardize, correct or combine the source data before mapping it to a Darwin Core term, you will have to write an expression in your mutate() function to do that. A simple example is concatenating the values from two columns together:

taxon %<>% mutate(dwc_scientificName = paste(genus, species))

The range of possibilities and bugs (e.g. the example above will create odd values if one of the source columns is empty) is too wide to cover here, but for standardizing or correcting values there are two functions we would like to introduce: recode() and case_when(). Both are used in conjunction with mutate().
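To make the pitfall with empty source columns concrete, here is a sketch on made-up data, assuming "empty" means NA. The guard shown is only one possible approach, and the column name dwc_naive is purely illustrative:

```r
library(dplyr)
library(magrittr)

# Hypothetical data: the second record has no species epithet
taxon <- tibble(
  genus = c("Canis", "Vulpes"),
  species = c("lupus", NA)
)

# Naive concatenation: the missing value becomes the literal text "NA"
taxon %<>% mutate(dwc_naive = paste(genus, species))

# One possible guard: only paste when species is present
taxon %<>% mutate(dwc_scientificName = case_when(
  is.na(species) ~ genus,
  TRUE ~ paste(genus, species)
))

taxon$dwc_naive          # "Canis lupus" "Vulpes NA"
taxon$dwc_scientificName # "Canis lupus" "Vulpes"
```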

recode()

recode() replaces specific source values with new, altered values in a one-to-one mapping. It is useful for correcting specific typos or mapping values to controlled vocabularies. The basic code is:

input_data %<>% mutate(dwc_term = recode(column,
  "value_1" = "dwc_value_1",
  "value_2" = "dwc_value_2",
  .default = "", # Option to handle other source values, drop this to leave them as is
  .missing = "" # Option to handle NA values
))
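The effect of .default and .missing is easiest to see side by side. The sketch below uses made-up data, and the replacement values "other" and "unknown" as well as the column names dwc_keep and dwc_replace are chosen only for illustration:

```r
library(dplyr)
library(magrittr)

# Hypothetical source column with one unlisted value and one NA
input_data <- tibble(column = c("value_1", "value_2", "value_3", NA))

input_data %<>% mutate(
  # Without .default and .missing: unlisted values and NA pass through unchanged
  dwc_keep = recode(column,
    "value_1" = "dwc_value_1",
    "value_2" = "dwc_value_2"
  ),
  # With both options: unlisted values and NA are replaced
  dwc_replace = recode(column,
    "value_1" = "dwc_value_1",
    "value_2" = "dwc_value_2",
    .default = "other",
    .missing = "unknown"
  )
)

input_data$dwc_keep    # "dwc_value_1" "dwc_value_2" "value_3" NA
input_data$dwc_replace # "dwc_value_1" "dwc_value_2" "other" "unknown"
```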

Correcting specific typos

input_data %<>% mutate(scientific_name = recode(scientific_name,
  "AseroÙ rubra" = "Asero rubra"
))

In the above example the typo AseroÙ rubra is corrected to Asero rubra. All other scientific_name values are left untouched (the .default parameter is not used). This is an exception to the rule of not updating source columns: the column scientific_name is overwritten with the recoded values, as that column will be used as the basis for Taxon IDs.

Add comments to explain why you recoded some values:

taxon %<>% mutate(dwc_phylum = recode(phylum, 
  "Crustacea" = "Arthropoda" # Crustacea is not a phylum
))

Controlled vocabularies

taxon %<>% mutate(dwc_taxonRank = recode(rankmarker,
  "infrasp."  = "infraspecificname",
  "sp."       = "species",
  "var."      = "variety",
  .default    = ""
))

In the above example the rankmarker is mapped to the GBIF vocabulary for taxon ranks. Any source value that isn't listed is mapped to an empty string (.default = "").
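Since .default = "" silently empties every value you did not list, it can help to check the result. One possible way, sketched here on made-up data, is to count each combination of source value and mapped value (the value "f." and the name mapping_check are illustrative only):

```r
library(dplyr)
library(magrittr)

# Hypothetical source data; "f." is deliberately not covered by the mapping
taxon <- tibble(rankmarker = c("sp.", "sp.", "var.", "infrasp.", "f."))

taxon %<>% mutate(dwc_taxonRank = recode(rankmarker,
  "infrasp."  = "infraspecificname",
  "sp."       = "species",
  "var."      = "variety",
  .default    = ""
))

# List each source value next to its mapped value, with a count
mapping_check <- taxon %>%
  group_by(rankmarker, dwc_taxonRank) %>%
  count()

print(mapping_check)
```

Rows where dwc_taxonRank is empty (here "f.") show which source values fell through to .default.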

case_when()

case_when() lets you assign values based on conditions, rather than on the specific values used by recode(). It is useful when the mapping of a term depends on multiple source values. The basic code is:

input_data %<>% mutate(dwc_term = case_when(
  conditional_statement_1 ~ "dwc_value_1",
  conditional_statement_2 ~ "dwc_value_2",
  TRUE ~ "dwc_value_3" # Option to handle all other conditions
))

You can read this as: if conditional_statement_1 is true then map to dwc_value_1, if conditional_statement_2 is true then map to dwc_value_2, else map to dwc_value_3.

Use multiple source values

distribution %<>% mutate(dwc_locality = case_when(
  !is.na(locality) ~ locality,
  country_code == "BE" ~ "Belgium",
  country_code == "GB" ~ "United Kingdom",
  country_code == "MK" ~ "Macedonia",
  country_code == "NL" ~ "The Netherlands",
  TRUE ~ ""
))

In the above example the Darwin Core term locality is populated with the value of the source column locality if that is not empty (!is.na). Otherwise, specific country_code values are mapped to a country name. In all other cases (e.g. another country_code) the locality is left empty (TRUE ~ ""). Note how two source columns (locality and country_code) are used for this mapping.
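The conditions are evaluated from top to bottom and the first one that is true wins. The following sketch runs a shortened version of the mapping on made-up data to show this:

```r
library(dplyr)
library(magrittr)

# Hypothetical distribution data
distribution <- tibble(
  locality = c("Ghent", NA, NA, NA),
  country_code = c("BE", "BE", "NL", "FR")
)

distribution %<>% mutate(dwc_locality = case_when(
  !is.na(locality) ~ locality,
  country_code == "BE" ~ "Belgium",
  country_code == "NL" ~ "The Netherlands",
  TRUE ~ ""
))

distribution$dwc_locality
# "Ghent" "Belgium" "The Netherlands" ""
```

The first record has country_code "BE", but because its locality is not empty, the first condition applies and the result is "Ghent" rather than "Belgium".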