
Overview

Biological recorders contribute valuable biodiversity data, and extensive infrastructure exists to support dataflows from recorders submitting records to databases. However, we lack infrastructure dedicated to providing informative feedback to recorders in response to the data they have contributed. By developing this infrastructure, we can create a feedback loop leading to better data and more engaged data providers.

We might want to provide feedback to biological recorders, or other interested parties such as land managers or groups, for a variety of reasons:

  • Increase the volume of species data by improving user engagement
  • Improve data quality by supporting user skill improvement
  • Get data where it is needed most by delivering persuasive messaging to encourage species recording in places or for taxonomic groups where we need data (adaptive sampling).

The code in this repository is developed to programmatically generate effective digital engagements (‘data stories’) from biodiversity species data. It provides tools for turning species recording data into data stories in HTML format. Separate scripts are used to dispatch the HTML content to recipients. We use R Markdown as a flexible templating system, providing extensible scripts that allow the development of different digital engagements. Data managers can then use this templating system to design digital engagements to send to their participants. Attention should be given to keeping this software computationally efficient so that it has the potential to be scaled.

This work inherits some ideas and concepts from the MyDECIDE campaign delivered under the DECIDE project. The code and scripts from MyDECIDE are available here: https://github.com/simonrolph/DECIDE-WP3-newsletter/tree/main

Content Generation Pipeline

The process from gathering data to distributing content is defined in run_pipeline.Rmd (and a derived .R version for sourcing). Within this sits the content generation pipeline, which automates slicing up the data for each user, performing per-user computations and generating the email content.

The email generation process is managed by the R package targets, a Make-like pipeline tool for statistics and data science in R. The package skips costly runtime for tasks that are already up to date and orchestrates the necessary computation. The pipeline is described in _targets.R.

Input data is loaded from an external source (e.g. an Indicia database). For each person/place/project, the data is split into the focal data (relating to the person, place or project you wish to deliver targeted feedback to) and the background data (all other data). Computations are then applied to these datasets; this might be to calculate summary statistics such as the number of records made in a certain period. The data and the computations are fed into an R Markdown document containing code which you have developed to generate effective digital engagements. An HTML template is also combined here to specify any generic elements such as formatting, headers/footers, and logos.
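As an illustration only (the real pipeline in _targets.R is more involved), a minimal targets pipeline has this shape; the target names and user ID here are hypothetical:

# _targets.R (minimal illustrative sketch, not the repository's actual pipeline)
library(targets)

list(
  # load the full records dataset from the path given in config.yml
  tar_target(all_records, read.csv(config::get()$data_file)),

  # focal data: records for one hypothetical user
  tar_target(focal_records, all_records[all_records$user_id == 1, ]),

  # a per-user computation, e.g. a simple record count
  tar_target(focal_n_records, nrow(focal_records))
)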

Gathering Data

All data on users and their records is stored in the /data folder. The contents of this folder are ignored by git because they are likely to contain personal data such as email addresses. There are two tables as .csv files, representing the users and their records. Both tables contain a user_id column which can be used to match users to their records.

User data

This contains data about the users for which you are creating personalised feedback. There are three key columns:

  • user_id - This is a unique identifier for a user. This could be their ID from a biological recording website.
  • name - This is the name by which the user is addressed in the personalised feedback.
  • email - This is their email address, if you are using email as a dispatch method.

You can also add any other columns you wish; this information will be passed to the parameterised R Markdown template through the R object params$extra_params. This is useful if you wish to show different feedback content to different recorders. For example, a taxon column with values of butterfly or dragonfly could be used in if/else logic (e.g. if(params$extra_params$taxon == "butterfly"){print("Butterfly!")}) within the R Markdown template to show different content to different users. This could also be used to split a group for A/B testing. Generating the data for these other columns is not part of the provided pipeline, so you will need to manipulate this csv yourself, typically by writing a script and saving it in R/util (utility scripts), as sketched below.
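For instance, a hypothetical utility script saved in R/util might assign users to A/B groups (the file path follows the test data used later in this guide):

# R/util/add_ab_group.R (hypothetical example)
users <- read.csv("data/simulated_participants.csv")

# alternately assign each user to group A or B for an A/B test
users$ab_group <- rep(c("A", "B"), length.out = nrow(users))

# write back; ab_group then reaches the template as params$extra_params$ab_group
write.csv(users, "data/simulated_participants.csv", row.names = FALSE)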

Using the controller app

If you are using the controller app to host your user data, you will need to provide the API endpoints and authentication details in config.yml (see the configuration section below). run_pipeline.R will then call the API and download the user data.

Records data

This contains the biodiversity data. The only column you must have is user_id; the remaining columns are not prescribed and depend on your use case. For a typical biological recording use case you will have columns describing who recorded what, where, and when. Some basic headings (used in the example below) are: latitude, longitude, species, date, user_id (required). Some recommended extra headings include species_vernacular (common name) and species_group (is it a frog, a bird, etc.).

During the pipeline the data is split into the user_data, which only includes the species records of the target user, and the bg_data (background), which is the data for everyone. The user data and background data are passed to the R Markdown template as params$user_data and params$bg_data.
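Conceptually, the split looks like the following sketch (in practice the targets pipeline handles this for you; the user ID is illustrative):

# read the full records dataset
records <- read.csv("data/simulated_data_raw.csv")

# focal data: only the target user's records
user_data <- records[records$user_id == 1, ]

# background data: everyone's records
bg_data <- records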

Generating content

Computations

Computationally expensive or reusable logic (e.g. aggregations, statistics, plots) should not be placed inside the .Rmd file. Instead, define it in separate R scripts located in the computations/ folder. Each file defines a function called compute_objects() which takes the user_data or bg_data as its argument.

Which computations the pipeline should carry out is defined in config.yml. The same computation can be applied to both the user and background data, or a different computation to each. These scripts return a list of R objects (tables, plots, metrics) that are passed to the R Markdown template through params$bg_computed_objects and params$user_computed_objects.

Rendering with R Markdown

We use R Markdown to render the emails; Pandoc will need to be installed and its path set in config.yml. This format allows plain markdown to be combined with embedded R code chunks, making it ideal for integrating visualisations, text, and summaries dynamically. The R Markdown template to use is defined in config.yml. The .Rmd template receives inputs via parameters defined in the YAML header at the top of the template. The rendered output is a self-contained HTML file, ready for email delivery (no external image dependencies).

  • R Markdown Template (.Rmd): Customisable to include user-specific content, visual summaries, and feedback messages.
  • HTML Layout Template (.html): Provides structure and styling (e.g., branding, headers/footers, colours). Content that will be consistent across every email makes sense to put here, saving R computation time.
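Under the hood, rendering amounts to a parameterised render call along these lines (an illustrative sketch; the output file and batch names are hypothetical and the actual parameter set is defined in the YAML header of the template):

# illustrative render call; the targets pipeline does this for each user
rmarkdown::render(
  input = "templates/example.Rmd",
  output_file = "user_1.html",
  output_dir = "renders/batch_001",
  params = list(
    user_data = user_data,
    bg_data = bg_data,
    user_computed_objects = list(),
    bg_computed_objects = list()
  )
)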

Building your R markdown is the interesting bit. The parameterised markdown + targets pipeline approach means you have the data from users/background, the user/background computed objects, and the extra params from the extra columns in the users table. You can use any R package here, so long as you have it installed; we recommend using renv for package management. The development loop can be slow if you have to run the whole content generation pipeline. Instead, run the pipeline (which has a target for the user parameters), then load the parameters for a single user with params <- targets::tar_read(user_params__[USER_ID]); you can then run the chunks in the markdown interactively using this params object. To see your targets, look in _targets/objects.
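Putting that together, an interactive development session might look like this (the user ID is a placeholder; substitute a real ID from your users table):

# run the pipeline once so the per-user parameter targets exist
targets::tar_make()

# load the params object for a single user
params <- targets::tar_read(user_params__1)

# now run chunks from the .Rmd interactively against this object
names(params)
head(params$user_data)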

The basic HTML template provided is lifted from the R package {blastula} and provides formatting such as a container and headers/footers. If you want to change the look and feel of your feedback items, copy the basic template and rename your copy to custom_template.html. You can edit the look and feel of the content by editing the CSS contained within the <style> tags in the HTML. You then need to edit config.yml to ensure that your new template is used in your pipeline.

Output Location

All rendered email content is stored in the renders/[batch_id]/ folder:

  • One HTML file per recipient
  • A metadata .csv file containing:
    • user_id
    • email
    • rendered file name
    • associated content_key (if used)

Sending emails

After the HTML files are generated, the final step is delivering them to recipients. Email delivery is handled using the blastula package, which provides tools for sending richly formatted HTML emails directly from R. Emails are sent to addresses listed in the metadata CSV (renders/[batch_id]/meta_table.csv). You can manage sending credentials and settings via the config.yml file. The pipeline loops through all rows in the metadata table and sends the appropriate email to each participant.
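As a hedged sketch of what a single send might look like with blastula (the pipeline's own sending code lives in run_pipeline.Rmd; the setting names follow the configuration tables below, and creds_envvar() reads the password from the SMTP_PASSWORD environment variable by default):

# illustrative single send, assuming the "envvar" credential mode
library(blastula)

cfg <- config::get()

# compose a minimal email; the pipeline sends the pre-rendered HTML instead
email <- compose_email(body = md("Hello! Here is your recorder feedback."))

smtp_send(
  email,
  to = cfg$mail_test_recipient,
  from = cfg$mail_default_sender,
  subject = cfg$mail_default_subject,
  credentials = creds_envvar(
    user = cfg$mail_username,
    host = cfg$mail_server,
    port = cfg$mail_port,
    use_ssl = cfg$mail_use_ssl
  )
)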

Configuration

A configuration file (config.yml), loaded using config::get(), is where you define the data file, the computation scripts and the template file. This configuration file defines all the necessary settings used across the data pipeline. It follows YAML format and is organised under the default configuration profile, which is typically used for local development. To activate it in R, use:
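config <- config::get()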

📁 File Paths

  • participant_data_file: Path to save or load the participant data CSV
  • data_file: Path to save the gathered records CSV
  • computation_script_bg: Background computations script path
  • computation_script_user: User-level computations script path
  • default_template_file: R Markdown template for reports
  • template_html_file: HTML template for rendering
  • pandoc_path: Path to Pandoc/Quarto binaries (for rendering R Markdown)

📥 Data Gathering

  • gather_from_controller_app: Toggle to fetch data from the Controller App
  • controller_app_base_url: Base API URL of the Controller App
  • controller_app_web_url: Base web URL of the Controller App
  • controller_app_api_key: API token used to authenticate against the Controller App API
  • controller_app_list_id: Email list ID used to fetch subscribers
  • gather_bio_script: Filepath to the script which gathers biodiversity data

In .Renviron you need to provide any secrets/keys needed by the script defined in gather_bio_script. For example, in the provided R/gather/gather_indicia.R there is a call to Sys.getenv("INDICIA_WAREHOUSE_SECRET"), which must be provided in .Renviron:
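# .Renviron (the value is a placeholder)
INDICIA_WAREHOUSE_SECRET=your_secret_here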

📧 Email Settings

  • mail_server: SMTP server address
  • mail_port: SMTP port number
  • mail_use_tls: Whether to use TLS encryption (TRUE/FALSE)
  • mail_use_ssl: Whether to use SSL encryption (TRUE/FALSE)
  • mail_username: SMTP login username
  • mail_password: SMTP login password (if not using environment variables)
  • mail_default_sender: Default email sender address
  • mail_default_name: Display name for the sender
  • mail_default_subject: Default subject line for outgoing emails
  • mail_creds: Credential mode: "anonymous" or "envvar"
  • mail_test_recipient: Optional hardcoded recipient for testing

Getting started

Provided in this code is a minimal example to show how it works. This can then be used as a starting point for developing your own personalised feedback.

Fork and clone the repository

If you are going to develop your own feedback from this code, it is recommended that you fork the repository to your own GitHub account. Start by forking the Recorder Feedback repository on GitHub. This will create a copy of the project under your GitHub account, allowing you to make changes and contributions without affecting the original repository.

Clone your forked repository to your local machine using Git. Open a terminal or command prompt and execute the following command:

git clone https://github.com/your-github-username/recorder-feedback.git

Or alternatively you can use the RStudio IDE to clone the repository to a new project: https://happygitwithr.com/rstudio-git-github.html#clone-the-test-github-repository-to-your-computer-via-rstudio

Create key files

Use the terminal to copy files with new names, or do the equivalent action in a file explorer or RStudio:

cp example_config.yml config.yml

Keeping example_ versions of the files in this repo, which you don’t edit (you instead edit the derived file you have just copied), makes it easier to pull updates from the main repo into your fork.

Install required R packages

Navigate to the project directory and install the necessary R packages using the renv package manager. Open R or RStudio and execute the following commands:

install.packages(c("renv"))
renv::restore()

This will ensure that you have all the required packages installed and ready to use for generating feedback. You can find an introduction to {renv} here: https://rstudio.github.io/renv/articles/renv.html

Generate test data

To help you get started there is a very minimal example of generating feedback items from simulated data. Run the provided script R/gather/generate_test_users.R to generate test data for email rendering. Execute the following command in R or RStudio:

source("R/gather/generate_test_users.R")

This script will create sample data that you can use to test the email generation process. The sample data is saved as data/simulated_participants.csv and data/simulated_data_raw.csv.

data/simulated_participants.csv

head(read.csv("data/simulated_participants.csv"))

data/simulated_data_raw.csv

head(read.csv("data/simulated_data_raw.csv"))

Generating feedback items

Now we’ve generated the test (or real) data, we can run the pipeline. But before we do, let’s have a look at the other code required to make the pipeline work.

Firstly, the config file:

config.yml

default:
  participant_data_file: "data/simulated_participants.csv"
  data_file: "data/simulated_data_raw.csv"
  computation_script_bg: "computations/computations_example.R"
  computation_script_user: "computations/computations_example.R"
  default_template_file: "templates/example.Rmd"
  template_html_file: "templates_html/basic_template.html"
  pandoc_path: "C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools"
  gather_from_controller_app: False
  controller_app_base_url: "http://localhost:5000/api/"
  controller_app_web_url: "http://localhost:5000/"
  controller_app_api_key: "complicated_token"
  controller_app_list_id: "1"
  gather_bio_script: "R/gather/gather_simulated.R"
  mail_server: ""
  mail_port: 000
  mail_use_tls: True
  mail_use_ssl: True
  mail_username: ''
  mail_password: ''
  mail_default_sender: ''
  mail_default_name: 'Recorder Feedback'
  mail_default_subject: "Your personalised recorder feedback"
  mail_creds: "anonymous"
  mail_test_recipient: ''

This file specifies which R/Rmd/HTML files are used in the email generation process. The config file is implemented using the R package {config}; the values specified in the config file are loaded into the R code using config::get(). Learn about the package here: https://rstudio.github.io/config/articles/introduction.html

  • participant_data_file: The file path for the input data with the list of participants
  • data_file: The file path for the input data containing biological records
  • computation_script_bg: The file path for the script for computations on the background data
  • computation_script_user: The file path for the script for computations on the focal data (this could be the same script as computation_script_bg)
  • default_template_file: The file path for the R markdown file containing the effective digital engagement (EDE) format
  • template_html_file: The file path for the HTML template containing formatting/header/footer etc.

The computation scripts are located in the computations folder. Each script defines an R function (which must be called compute_objects) which takes either the focal or background data as its argument and returns a named list of computed objects. These objects can then be used in the R markdown file. For example, here are the example computations provided in computations_example.R:

compute_objects <- function(data){
  #mean number of species
  mean_n_species = data %>%
    group_by(user_id) %>%
    summarise(n_species = length(unique(species))) %>%
    pull(n_species) %>%
    mean()

  #mean number of records
  mean_n_records = data %>%
    group_by(user_id) %>%
    summarise(n_records = n()) %>%
    pull(n_records) %>%
    mean()

  #return the list of precalculated objects
  list(mean_n_species = mean_n_species,
       mean_n_records = mean_n_records)
}

If you don’t need any computations then you can set computation_script_bg: computations/computations_none.R which provides a dummy function:

compute_objects <- function(data){
  list()
}

The data (focal and background) and the computed objects (focal and background) are all used in rendering the final HTML feedback item. Please take a look at the example template provided in templates/example.Rmd. Essentially, any R code you might use in analyses or data visualisation can be used here. However, be aware that slower, more complex R code will increase the time it takes to generate feedback. Some key principles:

  • The computed objects are passed to the R markdown as parameters, which you can access using params$bg_computed_objects or params$user_computed_objects. Note that these are the named list objects: if you defined one as list(number_of_records=323), then to access it in the R markdown you’d use params$bg_computed_objects$number_of_records (see the sketch after this list).
  • You can use inline R code: https://rmarkdown.rstudio.com/lesson-4.html
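For example, a chunk in the template might use the computed objects from computations_example.R like this (a sketch assuming those example computations are configured):

# inside a chunk in templates/example.Rmd
# report the mean number of records across all recorders
round(params$bg_computed_objects$mean_n_records, 1)

# compare the focal user's record count against the background mean
nrow(params$user_data) > params$bg_computed_objects$mean_n_records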

Finally, now we’ve had a look at all the components used in the pipeline, you can trigger it using targets::tar_make() or source("run_pipeline.R"). The pipeline can also be called from the command line, which is useful if you want to trigger the pipeline run as part of a schedule (e.g. a cron job):

Rscript run_pipeline.R

View generated emails

Once the targets pipeline has completed, you can view the generated email renders. Execute the following command in R or RStudio:

source("R/view_renders.R")

Details

Controller app

Here we provide an example script and functions for getting data from a controller app, see: https://github.com/BiologicalRecordsCentre/recorder-feedback-controller for more details.

This repository includes the get_subscribers_from_controller function. To use it, follow these steps:

Ensure that you have the necessary libraries and configuration settings in place:

  • Libraries: The function uses httr for HTTP requests and jsonlite for parsing JSON responses. These should already be available when using renv for package management.
  • API Endpoint: You need a deployed Recorder Feedback Controller app and have access to its API that provides subscriber information.
  • API Token: This token is required to authenticate your requests to the API, this will be configured in your controller app.
  • List ID: The unique ID for the email list from which you want to retrieve subscribers, this will be configured in your controller app.

You need to provide the following parameters for the config:

controller_app_base_url: "https://api.your-email-service.com/"
controller_app_api_key: "your_api_token"
participant_data_file: "path/to/save/subscribers.csv"

Now you can use the get_subscribers_from_controller function to fetch subscribers from a specific email list.

Example Code:

# Load required libraries
library(httr)
library(jsonlite)
library(config)

# Load configuration settings (from config.yml file or manually define them)
config <- config::get()

# Example parameters
api_url <- config$controller_app_base_url  # Base URL for your email service API
api_token <- config$controller_app_api_key # API token for authentication
email_list_id <- "1"                       # ID of the email list to query

# Call the function to get subscribers from the email list
subscribers_df <- get_subscribers_from_controller(
  api_url = api_url,
  email_list_id = email_list_id,
  api_token = api_token
)

# View the retrieved data
print(subscribers_df)

# Optionally, save the subscribers data to a CSV file
write.csv(subscribers_df, config$participant_data_file, row.names = FALSE)

The function returns a data.frame containing the list of subscribers from the specified email list, which can then be saved to the participant_data_file as shown above.

Indicia warehouse

You’ll need the following:

  • Indicia Warehouse API URL (set in .Renviron): The base URL for the Indicia warehouse.
  • API Authentication Credentials:
    • client_id: Your Indicia warehouse client ID (set in .Renviron)
    • shared_secret: Your shared secret for authentication (set in .Renviron)
  • User Warehouse ID: The unique identifier for the user whose records you want to retrieve (contained in the data downloaded from the controller app)
  • Number of Records: The number of records you want to retrieve for that user (maximum 10,000)

Input data

Input data (the full dataset) must be provided as a csv (comma separated values). The columns within the data are up to you but for consistency we recommend using Darwin Core terms: https://dwc.tdwg.org/terms/#occurrence & https://dwc.tdwg.org/terms/#location

The columns you specify here (or have been specified by whatever data source you are using) must then be used in the computation scripts and R markdown template.

In the records data (as specified in config as data_file) and the participant data (as specified in config as participant_data_file) there must be a shared user_id column, which is used to link users to their records. For example, for data from Indicia systems this might be the warehouse ID. An illustrative records file is shown below.
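For example, a minimal records CSV using the basic and recommended headings from the records data section might look like this (the rows are made up):

user_id,species,species_vernacular,species_group,latitude,longitude,date
1,Pieris brassicae,Large White,butterfly,51.602,-1.110,2024-06-15
2,Aeshna cyanea,Southern Hawker,dragonfly,52.010,-0.850,2024-07-02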

Deployment

A full pipeline is included in run_pipeline.Rmd (also run_pipeline.R). This is an R Markdown script for a feedback delivery pipeline, designed to automate the gathering, processing, and emailing of feedback to users. Here’s a breakdown of its structure and purpose:

1. Configuration

  • Sets up the runtime environment.
  • Uses renv for dependency management:
    • Ensures the correct packages are installed and restored.
    • Differentiates between local development and deployment on Posit Connect.
  • Generates a unique batch_id for tracking a specific run.
  • Loads configurations from a config.yml file, which defines parameters like API keys and file paths.

2. Gather

  • Fetch Subscribers:
    • Uses get_subscribers_from_controller.R to retrieve a list of email subscribers via an API.
    • Stores the list as a CSV file.
  • Fetch User Records:
    • Loops through each subscriber, fetching records of observational data (e.g., species, location, date).
    • Processes raw data into a structured format (e.g., extracting latitude, longitude, and species details).
    • Appends all records into a single dataframe, saving it as a CSV file.

3. Generate

  • Sets the BATCH_ID environment variable.
  • Runs the pipeline using targets::tar_make(), which carries out the per-user computations and renders the feedback reports.

4. Send

  • Prepares for email delivery using the blastula library.
  • Retrieves metadata from meta_table.csv and participant data from the saved CSV.
  • Sends the feedback emails.

Automation

If you want to automate the pipeline you can use crontab. For this we use the entrypoint.sh script as the thing that gets triggered, which then runs run_pipeline.R.

You can create a cron job with crontab by using:

crontab -e

This will open a file for you to write the crontab job into, using crontab syntax (use crontab guru to help: https://crontab.guru/). You can see what cron jobs have been created with:

crontab -l

Here’s an example that triggers at 7am every day:

0 7 * * * /path/to/recorder-feedback-content/entrypoint.sh >> /path/to/recorder-feedback-content/logs/job_$(date +\%Y\%m\%d_\%H\%M\%S).log 2>&1

Use absolute paths, because crontab runs in a minimal environment.

The >> part means that the console output is appended to a .log file with the date in the filename.

Examples

Here are some examples showing what sort of feedback items are possible.

Demonstrator - This is simply an example of the different elements you can combine using R markdown: plots, maps and images.

Month in review - This is an example of a ‘month in review’ retrospective feedback piece.