
Overview

Biological recorders contribute valuable biodiversity data, and extensive infrastructure exists to support dataflows from recorders submitting records to databases. However, we lack infrastructure dedicated to providing informative feedback to recorders in response to the data they have contributed. By developing this infrastructure, we can create a feedback loop leading to better data and more engaged data providers.

We might want to provide feedback to biological recorders, or other interested parties such as land managers or groups, for a variety of reasons:

  • Increase the volume of species data by improving user engagement
  • Improve data quality by supporting user skill improvement
  • Get data where it is needed most by delivering persuasive messaging to encourage species recording in places or taxonomic groups where we need data (adaptive sampling).

The code in this repository is developed to programmatically generate effective digital engagements (‘data stories’) from biodiversity species data. It provides tools for turning species recording data into data stories in HTML format. Separate scripts will be used to dispatch the HTML content to recipients. We use R markdown as a flexible templating system to provide extensible scripts that allow the development of different digital engagements. This templating system can then be used by data managers to design digital engagements to send to their participants. Attention should be given to ensuring that this software is computationally efficient so that it has the potential to be scaled.

This work inherits some ideas and concepts from the MyDECIDE campaign delivered under the DECIDE project. The code and scripts from MyDECIDE are available here: https://github.com/simonrolph/DECIDE-WP3-newsletter/tree/main

How it works

The email generation process is managed by the R package targets. The targets package is a Make-like pipeline tool for statistics and data science in R. The package skips costly runtime for tasks that are already up to date and orchestrates the necessary computation. The pipeline is described in _targets.R.

Pipeline inputs and outputs

Inputs

  • Raw data (.csv)
  • R markdown template (.Rmd)
  • HTML template (.html)
  • Computation script - background (.R)
  • Computation script - focal (.R)

Intermediates

  • Focal data
  • Background data
  • Computed objects - background
  • Computed objects - focal

Outputs

  • Recorder feedback items (.html)
  • Metadata table (.csv)

Below is a schematic diagram providing an overview of the email generation process. Input data is loaded from an external source (e.g. an Indicia database). For each person/place/project the data is split into the focal data (relating to the person, place or project you wish to deliver targeted feedback to) and the background data (all other data). Computations are then applied to these datasets, for example to calculate summary statistics such as the number of records made in a certain period. The data and the computations are fed into an R markdown document containing the code you have developed to generate effective digital engagements. An HTML template is also combined here to specify any generic elements such as formatting, headers/footers, and logos.

[Schematic diagram of the pipeline managed by {targets}. Elements: Data source; Biological records (.csv); Focal records; Background records; Computed objects (focal); Computed objects (background); R markdown template; HTML template; Rendered feedback item.]

The input data is made available in the /data folder. It must be in a certain format in order to work correctly. During the pipeline the data is split into the user_data which only includes the species records of the target user and the bg_data (background) which is the data for everyone.
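The split described above can be sketched in base R. This is a minimal illustration, not the repository's own code: the `records` data frame and the `target_user` value are made up, while the `user_data` and `bg_data` names follow the text.

```r
# Toy records; in the pipeline these would come from the CSV in /data.
records <- data.frame(
  user_id = c(1, 1, 2, 4),
  species = c("Peacock", "Holly Blue", "Large white", "Meadow brown")
)

target_user <- 1  # hypothetical focal user

# Focal data: only the target user's species records.
user_data <- records[records$user_id == target_user, ]

# Background data: the data for everyone.
bg_data <- records
```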

The email rendering is done using R markdown. R markdown is used as a very flexible templating system that allows developers to produce documents in HTML (and other formats). It combines markdown with code chunks. We use parameterised R markdown to render the email with user-specific data and computed data-derived objects.
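As a rough sketch of what parameterised rendering looks like, the example below writes a tiny template to a temporary file and renders it with `rmarkdown::render()`. The template text, the `user_name` parameter, the output file name and the recipient values are all invented for illustration; only `user_computed_objects` echoes a parameter name used in this repository. Rendering is skipped if rmarkdown or pandoc is not available.

```r
# Write a tiny parameterised template to a temporary file.
template_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Your recording summary\"",
  "output: html_document",
  "params:",
  "  user_name: \"recorder\"",
  "  user_computed_objects: !r list(n_records = 0)",
  "---",
  "",
  "Hello `r params$user_name`, you have submitted",
  "`r params$user_computed_objects$n_records` records."
), template_file)

# Values for one (made-up) recipient; these override the YAML defaults.
render_params <- list(
  user_name = "Ella W",
  user_computed_objects = list(n_records = 2)
)

# Render only if rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  output_file <- rmarkdown::render(
    template_file,
    output_file = "feedback_user_1.html",  # hypothetical file name
    output_dir = tempdir(),
    params = render_params,
    quiet = TRUE
  )
}
```

Calling `render()` once per recipient with a different `params` list is what makes a single template produce a personalised document for each person.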

Content in the emails can be generated using frequently used R packages such as dplyr for data manipulation and ggplot2 for creating data visualisations. There are various R packages available for generating maps, and there are example scripts that use ggspatial.

The emails are rendered in an email-ready format borrowed from the R package blastula. They are rendered as ‘self contained’ html files so there are no external local image files.

It is not recommended to carry out computationally heavy calculations within the R markdown template, so a computation step can be done before rendering. These computations should be coded in scripts located in the computations folder. The computations are applied separately to the user_data and the bg_data, using either the same or different computation scripts.
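To illustrate the idea of computing once and rendering afterwards, here is a small base R sketch that applies one computation function to both datasets. The function name `compute_objects` is the one this repository requires of computation scripts; the toy data and the particular summaries are assumptions made for the example.

```r
# A computation script defines one function that returns a named list.
compute_objects <- function(data) {
  records_per_user <- table(data$user_id)
  list(
    n_records = nrow(data),
    n_species = length(unique(data$species)),
    mean_n_records = mean(as.numeric(records_per_user))
  )
}

# Toy data standing in for the real records.
records <- data.frame(
  user_id = c(1, 1, 2, 4),
  species = c("Peacock", "Holly Blue", "Large white", "Peacock")
)
user_data <- records[records$user_id == 1, ]
bg_data <- records

# The same function is applied separately to each dataset.
user_computed_objects <- compute_objects(user_data)
bg_computed_objects <- compute_objects(bg_data)
```

The two resulting lists are small, so passing them to the template is cheap even when the underlying data is large.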

A configuration file (config.yml), which is loaded in using config::get(), is where you define the data file, the computation scripts and the template file.

The rendered html items are saved in a folder renders/[batch_id] where you have set a batch identifier. The folder contains html files for each recipient and a .csv with columns for each file name and the identifier.
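The output layout can be pictured with the following sketch, which builds a batch folder containing placeholder HTML files and a metadata table. It is illustrative only: the files are written under `tempdir()` rather than the real renders folder, the `feedback_<id>.html` naming is hypothetical, and the `user_id` and `file` column names are assumed from the email-sending example later in this document.

```r
batch_id <- "test_001"

# tempdir() is used so the sketch leaves nothing behind in the project.
render_dir <- file.path(tempdir(), "renders", batch_id)
dir.create(render_dir, recursive = TRUE, showWarnings = FALSE)

# One placeholder HTML file per recipient (hypothetical file naming).
user_ids <- c(1, 2)
files <- file.path(render_dir, paste0("feedback_", user_ids, ".html"))
for (f in files) {
  writeLines("<html><body>Placeholder feedback</body></html>", f)
}

# Metadata table linking each rendered file to its recipient.
meta_table <- data.frame(user_id = user_ids, file = files)
meta_file <- file.path(render_dir, paste0("meta_table_", batch_id, ".csv"))
write.csv(meta_table, meta_file, row.names = FALSE)
```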

All the R code for making the pipeline work is located in the R folder; you shouldn’t need to edit any of these files.

Getting started

Provided in this code is a minimal example to show how it works. This can then be used as a starting point for developing your own personalised feedback.

Fork and clone the repository

If you are going to develop your own feedback from this code, it is recommended that you fork the repository to your own GitHub account. Start by forking the Recorder Feedback repository on GitHub. This will create a copy of the project under your GitHub account, allowing you to make changes and contributions without affecting the original repository.

Clone your forked repository to your local machine using Git. Open a terminal or command prompt and execute the following command:

git clone https://github.com/your-github-username/recorder-feedback.git

Or alternatively you can use the RStudio IDE to clone the repository to a new project: https://happygitwithr.com/rstudio-git-github.html#clone-the-test-github-repository-to-your-computer-via-rstudio

Create key files

Use a terminal to copy the files under new names, or do the equivalent in a file explorer or RStudio:

cp example_targets.R _targets.R
cp example_run_pipeline.R run_pipeline.R
cp example_config.yml config.yml

Keeping the example_ versions of the files in this repo unedited, and editing only the copies you have just made, makes it easier to pull updates from the main repo into your fork.
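If you would rather make the copies from R than from a terminal, something like the following could work. The helper name `copy_if_missing` is made up for this sketch; it refuses to overwrite a file you may already have edited. Run it from the repository root.

```r
# Copy `from` to `to` unless `to` already exists (protects your edits).
copy_if_missing <- function(from, to) {
  if (file.exists(to)) {
    message("Keeping existing ", to)
    return(invisible(FALSE))
  }
  invisible(file.copy(from, to))
}

# Example file -> working copy, as listed in the commands above.
key_files <- c(
  "example_targets.R" = "_targets.R",
  "example_run_pipeline.R" = "run_pipeline.R",
  "example_config.yml" = "config.yml"
)

for (from in names(key_files)) {
  # Skip quietly when not run from the repository root.
  if (file.exists(from)) copy_if_missing(from, key_files[[from]])
}
```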

Install required R packages

Navigate to the project directory and install the necessary R packages using the renv package manager. Open R or RStudio and execute the following commands:

install.packages(c("renv"))
renv::restore()

This will ensure that you have all the required packages installed and ready to use for generating feedback. You can find an introduction to {renv} here: https://rstudio.github.io/renv/articles/renv.html

Generate test data

To help you get started there is a very minimal example of generating feedback items from some simulated data. Run the provided script generate_test_data.R to generate test data for email rendering. Execute the following command in R or RStudio:

source("R/generate_test_data.R")

This script will create sample data that you can use to test the email generation process. The sample data is saved as simulated_participants.csv and data/simulated_data_raw.csv.

simulated_participants.csv

##   user_id      name               email
## 1       1    Ella W    Ella.W@email.com
## 2       2    Emma X    Emma.X@email.com
## 3       3 Matthew S Matthew.S@email.com
## 4       4 Madison R Madison.R@email.com

data/simulated_data_raw.csv

##   latitude   longitude      species       date user_id      name
## 1 51.62816 -0.01707810      Peacock 2023-11-05       4 Madison R
## 2 51.48911 -0.92391545  Large white 2023-12-29       2    Emma X
## 3 51.57612 -0.66409614   Holly Blue 2023-12-30       1    Ella W
## 4 51.03935 -0.22017373      Peacock 2023-12-21       1    Ella W
## 5 51.78953 -0.29878872 Meadow brown 2023-12-12       4 Madison R
## 6 51.36310 -0.03141985  Large white 2023-08-01       2    Emma X
##                 email
## 1 Madison.R@email.com
## 2    Emma.X@email.com
## 3    Ella.W@email.com
## 4    Ella.W@email.com
## 5 Madison.R@email.com
## 6    Emma.X@email.com

Get participant data from controller app

Here we provide an example script and functions for getting data from a controller app, see: https://github.com/BiologicalRecordsCentre/recorder-feedback-controller for more details.

This repository includes the get_subscribers_from_controller function. To use it, follow these steps:

Ensure that you have the necessary libraries and configuration settings in place:

  • Libraries: The function uses httr for HTTP requests and jsonlite for parsing JSON responses. These should already be available when using renv for package management.
  • API Endpoint: You need a deployed Recorder Feedback Controller app and access to its API, which provides subscriber information.
  • API Token: This token is required to authenticate your requests to the API. It is configured in your controller app.
  • List ID: The unique ID for the email list from which you want to retrieve subscribers. It is configured in your controller app.

You need to provide the following parameters for the config:

controller_app_base_url: "https://api.your-email-service.com/"
controller_app_api_key: "your_api_token"
participant_data_file: "path/to/save/subscribers.csv"

Now you can use the get_subscribers_from_controller function to fetch subscribers from a specific email list.

Example Code:

# Load required libraries
library(httr)
library(jsonlite)
# config is not attached with library(): it is called as config::get() below,
# which avoids masking base::get()

# Load configuration settings (from config.yml file or manually define them)
config <- config::get()

# Example parameters
api_url <- config$controller_app_base_url  # Base URL for your email service API
api_token <- config$controller_app_api_key # API token for authentication
email_list_id <- "1"                       # ID of the email list to query

# Call the function to get subscribers from the email list
subscribers_df <- get_subscribers_from_controller(
  api_url = api_url,
  email_list_id = email_list_id,
  api_token = api_token
)

# View the retrieved data
print(subscribers_df)

# Optionally, save the subscribers data to a CSV file
write.csv(subscribers_df, config$participant_data_file, row.names = FALSE)

The function returns a data.frame containing the subscribers on the specified email list. The example above then saves it to the path set as participant_data_file in the config.

Get records data from an Indicia warehouse

You’ll need the following:

  • Indicia Warehouse API URL (set in config.yml): The base URL for the Indicia warehouse.
  • API Authentication Credentials:
    • client_id: Your Indicia warehouse client ID (set in config.yml)
    • shared_secret: Your shared secret for authentication (set in config.yml)
  • User Warehouse ID: The unique identifier for the user whose records you want to retrieve (contained in data downloaded from controller app)
  • Number of Records: The number of records you want to retrieve for that user (maximum is 10,000)

Once you have the required parameters, you can call the get_user_records_from_indicia function to retrieve species records for a specific user from the Indicia warehouse.

You can see how this function can be run in a loop to generate and save data for all users in the script get_users_and_records.R.

Generating feedback items

Now that we’ve generated the test (or real) data we can run the pipeline, but before we do, let’s have a look at the other code required to make the pipeline work.

Firstly, the config file:

config.yml

## default:
##   participant_data_file: "data/simulated_participants.csv"
##   data_file: "data/simulated_data_raw.csv"
##   computation_script_bg: "computations/computations_example.R"
##   computation_script_user: "computations/computations_example.R"
##   default_template_file: "templates/example.Rmd"
##   template_html_file: "templates_html/basic_template.html"
##   pandoc_path: "C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools"
##   controller_app_base_url: ""
##   controller_app_api_key: ""
##   indicia_warehouse_base_url: ""
##   indicia_warehouse_client_id: ""
##   indicia_warehouse_secret: ""
## 

This file specifies which R/Rmd/HTML files are used in the email generation process. The config file is implemented using the R package {config}. The values specified in the config file are loaded in the R code using config::get(). Learn about the package here: https://rstudio.github.io/config/articles/introduction.html

  • participant_data_file: The file path for the input data with the list of participants
  • data_file: The file path for the input data containing biological records
  • computation_script_bg: The file path for the script for computations for the background data
  • computation_script_user: The file path for the script for computations for the focal data (this could be the same script as computation_script_bg)
  • default_template_file: The file path for the R markdown file containing the effective digital engagement (EDE) format
  • template_html_file: The file path for the HTML template containing formatting/header/footer etc.

The computation scripts are located in the computations folder. Each script defines an R function (which must be called compute_objects) that takes either the focal or background data as its argument and returns a named list of computed objects. These objects can then be used in the R markdown file. For example, here are the computations provided in computations_example.R.

## compute_objects <- function(data){
##   #mean number of species
##   mean_n_species = data %>% 
##     group_by(user_id) %>% 
##     summarise(n_species = length(unique(species))) %>%
##     pull(n_species) %>%
##     mean()
##   
##   #mean number of records
##   mean_n_records = data %>% 
##     group_by(user_id) %>% 
##     summarise(n_records = n()) %>%
##     pull(n_records) %>%
##     mean()
##   
##   #return the list of precalculated objects
##   list(mean_n_species = mean_n_species,
##        mean_n_records = mean_n_records)
##   
## }

If you don’t need any computations then you can set computation_script_bg: computations/computations_none.R which provides a dummy function:

## compute_objects <- function(data){
##   list()
## }

The data (focal and background) and the computed objects (focal and background) are all used in rendering the final HTML feedback item. Please take a look at the example template provided in templates/example.Rmd. Essentially, any R code you might use in analyses or data visualisation can be used here. However, please be aware that slower, more complex R code will increase the time it takes to generate feedback. Some key principles:

  • The computed objects are passed to the R markdown as parameters which you can access using params$bg_computed_objects or params$user_computed_objects. Note that these are named lists, so if your computation script returned list(number_of_records=323) for the background data then you’d access that value in the R markdown using params$bg_computed_objects$number_of_records.
  • You can use inline R code: https://rmarkdown.rstudio.com/lesson-4.html
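The access pattern in the principles above can be tried outside a template with a stand-in `params` list. In a real render R markdown creates `params` for you; here it is built by hand, the value 323 is the one used in the text, and the focal value 12 is made up.

```r
# Stand-in for the `params` object that R markdown creates when rendering.
params <- list(
  bg_computed_objects = list(number_of_records = 323),
  user_computed_objects = list(number_of_records = 12)
)

# Computed objects are reached by name through the nested lists.
share <- params$user_computed_objects$number_of_records /
  params$bg_computed_objects$number_of_records

# Inside a template the same expressions could sit in inline R code.
sentence <- sprintf(
  "You contributed %d of the %d records (%.1f%%).",
  params$user_computed_objects$number_of_records,
  params$bg_computed_objects$number_of_records,
  100 * share
)
```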

The HTML template templates_html/basic_template.html contains the formatting for the email. You only need to edit this if you wish to change the look and feel of the emails.

Finally, now that we’ve had a look at all the components used in the pipeline, you can trigger the pipeline using targets::tar_make() or source("run_pipeline.R"). The pipeline can also be run from the command line, which is useful if you want to trigger it on a schedule (e.g. a cron job). You can provide a batch ID as a command line argument.

Rscript run_pipeline.R test_001
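For readers unfamiliar with command line arguments in R, the sketch below shows one way a script could pick up a batch ID passed to Rscript. It is not taken from run_pipeline.R: the helper name `get_batch_id` and the timestamp fallback are assumptions made for illustration.

```r
# Return the first command line argument as the batch ID, or fall back
# to a timestamp-based ID when no argument was supplied.
get_batch_id <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) >= 1 && nzchar(args[1])) {
    args[1]
  } else {
    format(Sys.time(), "batch_%Y%m%d_%H%M%S")
  }
}

batch_id <- get_batch_id()
```

A fallback of this kind means a scheduled run still gets a unique folder if the argument is forgotten.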

View generated emails

Once the targets pipeline has completed, you can view the generated email renders. Execute the following command in R or RStudio:

source("R/view_renders.R")
view_renders(batch_id="test_001",5)

Replace “test_001” with the batch identifier you set, and 5 with the number of renders you want to view.

Development details

Now that you have the project set up and have generated test feedback, you can customize the email template and scripts to generate personalized feedback items according to your specific requirements. Edit the template (example.Rmd) and other scripts as needed to tailor the feedback content and format. By following the steps above, you can quickly set up the project environment and start generating informative feedback for biological recorders.

Input data

Input data (the full dataset) must be provided as a CSV (comma-separated values) file. The columns within the data are up to you, but for consistency we recommend using Darwin Core terms: https://dwc.tdwg.org/terms/#occurrence & https://dwc.tdwg.org/terms/#location

The columns you specify here (or have been specified by whatever data source you are using) must then be used in the computation scripts and R markdown template.

In the records data (as specified in config as data_file) and the participant data (as specified in config as participant_data_file) there must be a shared user_id column which is used to link the users to their records. For example, for data from Indicia systems this might be the warehouse ID.
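A quick way to check the link between the two tables is to join them on `user_id` and look for mismatches. The sketch below uses made-up toy data and base R's `merge()`; it is a sanity check you might run yourself, not part of the pipeline.

```r
# Toy participant and records tables sharing a user_id column.
participants <- data.frame(
  user_id = c(1, 2, 3),
  name = c("Ella W", "Emma X", "Matthew S")
)
records <- data.frame(
  user_id = c(1, 1, 2, 5),
  species = c("Peacock", "Holly Blue", "Large white", "Meadow brown")
)

# Both tables must contain the linking column.
stopifnot("user_id" %in% names(participants), "user_id" %in% names(records))

# Records joined to the participant who made them.
linked <- merge(records, participants, by = "user_id")

# Participants with no records, and records from unknown users.
no_records <- setdiff(participants$user_id, records$user_id)
unknown_users <- setdiff(records$user_id, participants$user_id)
```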

Computing objects

Simply presenting raw data limits the ability to produce meaningful feedback. Therefore, as part of the feedback generation pipeline, we compute objects. We define these computations as R functions which take the input data as their argument. The computation functions return a named list of objects which can then be referred to in the R markdown file in order to show them to the user. You can define different computation functions for the background computations and the focal computations, or use the same computation file for each.

The R Markdown

The R markdown file is where you will spend the majority of your time developing the feedback you wish to send to recorders. R markdown was chosen as it provides all the facilities of R. If you need to use additional R packages, please ensure that you update your renv.lock file to capture this.

The HTML template

The basic HTML template provided is lifted from the R package {blastula} and provides formatting such as a container and headers/footers. If you want to change the look and feel of your feedback items you should copy the basic template and rename your copy to custom_template.html. You can edit the look and feel of the content by editing the CSS contained within the <style> tags in the HTML. You then need to edit config.yml to ensure that your new template is being used in your pipeline.

Sending emails

You could send emails from R using packages such as blastula (https://CRAN.R-project.org/package=blastula) to set up SMTP. Define a send_email() function in send_email.R, then run code similar to the following. It uses the generated meta table .csv to get the file path of each email’s content and the participant_data_file to get each recipient’s email address, then loops through the rows and sends each email with your send_email() function.

source("R/send_email.R")
config <- config::get()
batch_id <- "test_001" # the batch identifier used when running the pipeline
meta_table <- read.csv(paste0("renders/", batch_id, "/meta_table_", batch_id, ".csv"))
# read the participant data once, outside the loop
participants <- read.csv(config$participant_data_file)
for (i in seq_len(nrow(meta_table))) {
  # get their email address
  recipient <- participants$email[participants$user_id == meta_table$user_id[i]]
  send_email(recipient, meta_table$file[i])
}

Examples

Here are some examples showing what sort of feedback items are possible.

Demonstrator - This is simply an example of the different elements you can combine together using R markdown; plots, maps and images.

Month in review - This is an example of a ‘month in review’ retrospective feedback piece.