src/dwc_mapping.Rmd

---
title: "Darwin Core mapping"
subtitle: "For: Checklist of alien species of the Scheldt estuary"
author:
- Lien Reyserhove
- Sanne Govaert
date: "`r Sys.Date()`"
output:
  html_document:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_depth: 3
    toc_float: yes
---

This document describes how we map the checklist data to Darwin Core. The source file for this document can be found [here](https://github.com/trias-project/alien-scheldt-checklist/blob/master/src/dwc_mapping.Rmd).

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = TRUE)
```

Load libraries:

```{r message = FALSE}
library(tidyverse)      # To do data science
library(here)           # To find files
library(janitor)        # To clean input data
```

# Read source data

The data is maintained in [this Google Spreadsheet](https://docs.google.com/spreadsheets/d/1LeXXbry2ArK2rngsmFjz_xErwE1KwQ8ujtvHNmTVA6E/edit?gid=541627250#gid=541627250).

Read the relevant worksheet (published as csv):

```{r}
raw_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTl8IEk2fProQorMu5xKQPdMXl3OQp-c0f6eBXitv0BiVFZ3JSJCde0PtbFXuETgguf6vK8b43FDX1C/pub?gid=541627250&single=true&output=csv", show_col_types = FALSE)
```

Copy the source data to the repository to keep track of changes:

```{r}
write.csv(raw_data, here::here("data", "raw", "alien_scheldt_dump.csv"), row.names = FALSE)
```

# Preprocessing: tidy data and add taxon ID's

To link taxa with information in the extension(s), each taxon needs a unique and relatively stable `taxonID`. We have created one in the form of `dataset_shortname:taxon:hash`, where `hash` is unique code based on scientific name and kingdom. Once this is created, it is added to the source data. 

```{r}
input_data <-
  raw_data %>%
  remove_empty("rows") %>%
  clean_names() %>%
  mutate(
    taxon_id = paste(
      "alien-scheldt-checklist",
      "taxon",
      .data$taxon_id_hash,
      sep = ":"
    )
  )
```

# Darwin Core mapping

## Taxon core

Create a dataframe with unique taxa only (ignoring multiple distribution rows). Map the data to [Darwin Core Taxon](http://rs.gbif.org/core/dwc_taxon_2015-04-24.xml).

```{r}
taxon <-
  input_data %>%
  distinct(taxon_id, .keep_all = TRUE) %>%
  mutate(
    language = "en",
    license = "http://creativecommons.org/publicdomain/zero/1.0/",
    rightsHolder = "INBO",
    accessRights = "https://www.inbo.be/en/norms-data-use",
    datasetID = "",
    institutionCode = "INBO",
    datasetName = "Checklist of alien species in the Scheldt estuary in Flanders, Belgium",
    taxonID = taxon_id,
    scientificName = scientific_name,
    kingdom = kingdom,
    phylum = phylum,
    class = class,
    order = order,
    family = family,
    genus = genus,
    taxonRank = taxon_rank,
    nomenclaturalCode = nomenclatural_code,
    .keep = "none"
  ) %>% 
  arrange(taxonID) %>% 
  select(
    "language", "license", "rightsHolder", "accessRights", "datasetID",
    "institutionCode", "datasetName", "taxonID", "scientificName", "kingdom", 
    "phylum", "class", "order", "family", "genus", "taxonRank",
    "nomenclaturalCode"
  )
```

## Distribution extension

Create a dataframe with all data (including multiple distributions). Map the data to [Species Distribution](http://rs.gbif.org/extension/gbif/1.0/distribution.xml).

Information for `eventDate` is contained in `date_first_observation` and `date_last_observation`, which we will express here in an ISO 8601 date format `yyyy/yyyy` (`start_date/end_date`).

Not all cells for `date_first_observation` (DFO) and/or `date_last_observation` (DLO) are populated. So, we used the following rules for those records: 

***case 1.*** If `DFO` is empty and `DLO` is empty, `eventDate` is `NA` 
***case 2.***  If `DFO` is empty and `DLO` is not empty: eventDate = `/DLO`
***case 3.*** If `DFO` is not empty and `DLO` is empty, eventDate is `DFO/`

```{r}
distribution <-
  input_data %>%
  # pathway
  pivot_longer(
    names_to = "key",
    values_to = "pathway",
    starts_with("introduction_pathway"),
    values_drop_na = FALSE) %>%
  filter( # keep NA value for species with no pathway provided
    !is.na(pathway) |
      (is.na(pathway) & key == "introduction_pathway_1")
    ) %>%
  # other terms
  mutate(
    taxonID = taxon_id,
    locationID = case_when(
      location == "Flanders" ~ "ISO_3166-2:BE-VLG",
      location == "Wallonia" ~ "ISO_3166-2:BE-WAL",
      location == "Brussels" ~ "ISO_3166-2:BE-BRU"
    ),
    locality = case_when(
      location == "Flanders" ~ "Flemish Region",
      location == "Wallonia" ~ "Walloon Region",
      location == "Brussels" ~ "Brussels-Capital Region"
    ),
    countryCode = country_code,
    occurrenceStatus = occurrence_status,
    establishmentMeans = establishment_means,
    degreeOfEstablishment = degree_of_establishment,
    eventDate = case_when(
      is.na(date_first_observation) & is.na(date_last_observation) ~ NA,
      is.na(date_first_observation) ~ paste0("/", date_last_observation),
      is.na(date_last_observation) ~ paste0(date_first_observation, "/"),
      !is.na(date_first_observation) & !is.na(date_last_observation) ~ 
        paste(date_first_observation, date_last_observation, sep = "/")
    ),
    source = source,
    occurrenceRemarks = occurrence_remarks
  ) %>%
  select(
    "taxonID", "locationID", "locality", "countryCode", "occurrenceStatus",
    "establishmentMeans", "degreeOfEstablishment", "pathway", 
    "eventDate", "source", "occurrenceRemarks"
  ) %>%
  arrange(taxonID)
```

## Species profile extension

In this extension we will express broad habitat characteristics of the species (e.g. `isTerrestrial`).

Create a dataframe with unique taxa only (ignoring multiple distribution rows).
Only keep records for which `terrestrial`, `marine` and `freshwater` is not empty.

Map the data to [Species Profile](http://rs.gbif.org/extension/gbif/1.0/speciesprofile.xml).

```{r}
species_profile <-
  input_data %>%
  distinct(taxon_id, .keep_all = TRUE) %>% 
  filter(
    !is.na(terrestrial) |
      !is.na(marine) |
      !is.na(freshwater)
  ) %>% 
  mutate(
    .keep = "none",
    taxonID = taxon_id,
    isMarine = marine,
    isFreshwater = freshwater,
    isTerrestrial = terrestrial
  ) %>% 
  arrange(taxonID)
```

## Description extension

In the description extension we want to include the native range of a species

```{r}
description <-
  input_data %>% 
  # unique taxa only (ignoring multiple distribution rows)
  distinct(taxon_id, .keep_all = TRUE) %>% 
  # Separate values on `|` 
  mutate(native_range = strsplit(native_range, "\\|")) %>% 
  unnest(native_range) %>% 
  filter(!is.na(native_range)) %>% 
  mutate(
    .keep = "none",
    taxonID = taxon_id,
    description = str_trim(native_range),
    type = "native range",
    language = "en"
    ) %>% 
  select("taxonID", "description", "type", "language") %>% 
  arrange(taxonID)
```

Save to CSV:

```{r}
write_csv(taxon, here("data", "processed", "taxon.csv"), na = "")
write_csv(distribution, here("data", "processed", "distribution.csv"), na = "")
write_csv(species_profile, here("data", "processed", "speciesprofile.csv"), na = "")
write_csv(description, here("data", "processed", "description.csv"), na = "")
```