
Overview

Biological recorders contribute valuable biodiversity data, and extensive infrastructure exists to support dataflows from recorders submitting records to databases. However, we lack infrastructure dedicated to providing informative feedback to recorders in response to the data they have contributed. By developing this infrastructure, we can create a feedback loop that leads to better data and more engaged data providers.

We might want to provide feedback to biological recorders, or other interested parties such as land managers or groups, for a variety of reasons:

  • Increase the volume of species data by improving user engagement
  • Improve data quality by supporting user skill improvement
  • Get data where it is needed most by delivering persuasive messaging to encourage species recording in places or for taxonomic groups where data is most needed (adaptive sampling).

The code in this repository is developed to programmatically generate effective digital engagements (‘data stories’) from biodiversity species data. It provides tools for turning species recording data into data stories in HTML format; separate scripts are used to dispatch the HTML content to recipients. We use R markdown as a flexible templating system, providing extensible scripts that support the development of different digital engagements. Data managers can then use this templating system to design digital engagements to send to their participants. Attention should be given to computational efficiency so that the software has the potential to be scaled.

This work inherits some ideas and concepts from the MyDECIDE campaign delivered under the DECIDE project. The code and scripts from MyDECIDE are available here: https://github.com/simonrolph/DECIDE-WP3-newsletter/tree/main

Pipeline

The email generation process is managed by the R package targets. The targets package is a Make-like pipeline tool for statistics and data science in R: it skips costly computation for targets that are already up to date and orchestrates the work that is needed. The pipeline is described in _targets.R.

Pipeline inputs and outputs

Inputs

  • Raw data (.csv)
  • R markdown template (.Rmd)
  • HTML template (.html)
  • Computation script - background (.R)
  • Computation script - focal (.R)

Intermediates

  • Focal data
  • Background data
  • Computed objects - background
  • Computed objects - focal

Outputs

  • Recorder feedback items (.html)
  • Metadata table (.csv)

All the R code that makes the pipeline work is located in the R folder; you shouldn’t need to edit any of these files.

Below is a schematic diagram providing an overview of the email generation process. Input data is loaded from an external source (e.g. an Indicia database). For each person/place/project the data is split into the focal data (relating to the person, place or project you wish to deliver targeted feedback to) and the background data (all other data). Computations are then applied to these datasets, for example to calculate summary statistics such as the number of records made in a certain period. The data and the computations are fed into an R markdown document containing code which you have developed to generate effective digital engagements. An HTML template is also combined here to specify generic elements such as formatting, headers/footers, and logos.

[Schematic diagram: biological records (.csv) from the data source are split into focal records and background records; computation scripts produce computed objects (focal) and computed objects (background); these are combined with the R markdown template and HTML template to produce the rendered feedback item. The whole pipeline is managed by {targets}.]

Gathering Data

The input data is made available in the /data folder and must be in a specific format in order to work correctly. During the pipeline, the data is split into user_data, which includes only the species records of the target user, and bg_data (background), which contains everyone else’s records.
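The split itself is straightforward; a minimal sketch in base R, assuming a user_id column as used elsewhere in this pipeline (the example records are invented for illustration):

```r
# Invented example records for three recorders
records <- data.frame(
  user_id = c(1, 1, 2, 3),
  species = c("Pieris rapae", "Bombus terrestris", "Pieris rapae", "Apis mellifera")
)

target_user <- 1

# Focal data: only the target user's records
user_data <- records[records$user_id == target_user, ]

# Background data: everyone else's records
bg_data <- records[records$user_id != target_user, ]

nrow(user_data)  # 2
```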

Generating content

Content generation for the email feedback is handled using parameterised R Markdown and a flexible templating system. This enables the creation of personalised, data-driven emails in HTML format for each recipient.

Rendering with R Markdown

We use R Markdown to render the emails. This format allows for combining plain markdown with embedded R code chunks, making it ideal for integrating visualisations, text, and summaries dynamically.

  • The .Rmd template receives inputs via parameters for:
    • user_data – records specific to the target recipient
    • bg_data – background records (everyone else)
    • computed objects derived from both datasets
  • The rendered output is a self-contained HTML file, ready for email delivery (no external image dependencies).
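Under the hood this corresponds to a parameterised render call. A minimal sketch, assuming the template declares these params in its YAML header; the output file and batch folder names are illustrative, and in practice the {targets} pipeline drives this for you:

```r
library(rmarkdown)

# Render one self-contained HTML feedback item for a single recipient
render(
  input       = "templates/example.Rmd",    # the .Rmd template
  output_file = "user_1.html",              # illustrative file name
  output_dir  = "renders/batch_01",         # illustrative batch folder
  params = list(
    user_data             = user_data,      # focal records
    bg_data               = bg_data,        # background records
    user_computed_objects = user_objects,   # output of the focal computation script
    bg_computed_objects   = bg_objects      # output of the background computation script
  )
)
```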

Computation Scripts

Computationally expensive or reusable logic (e.g. aggregations, statistics, plots) should not be placed inside the .Rmd file. Instead, define these in separate R scripts:

  • Located in the computations/ folder
  • Defined for both:
    • user_data (focal computation)
    • bg_data (background computation)
  • These scripts return pre-processed R objects (tables, plots, metrics) that are injected into the R Markdown template

Templates

  • R Markdown Template (.Rmd): Customisable to include user-specific content, visual summaries, and feedback messages.
  • HTML Layout Template (.html): Provides structure and styling (e.g., branding, headers/footers, colors).

Output Location

All rendered email content is stored in the renders/[batch_id]/ folder:

  • One HTML file per recipient
  • A metadata .csv file containing:
    • user_id
    • email
    • rendered file name
    • associated content_key (if used)

Sending emails

After the HTML files are generated, the final step is delivering them to recipients. Email delivery is handled using the blastula package, which provides tools for sending richly formatted HTML emails directly from R. Emails are sent to addresses listed in the metadata CSV (renders/[batch_id]/meta_table.csv). You can manage sending credentials and settings via the config.yml file. The pipeline loops through all rows in the metadata table and sends the appropriate email to each participant.
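The sending step can be sketched as a loop over the metadata table. Here send_feedback_email() is a hypothetical wrapper around blastula’s sending functions, batch_01 is an illustrative batch_id, and the column names are assumed from the metadata description above; see run_pipeline.R for the real implementation:

```r
library(blastula)

# Metadata written by the rendering pipeline
meta <- read.csv("renders/batch_01/meta_table.csv")

for (i in seq_len(nrow(meta))) {
  # send_feedback_email() is hypothetical: it would load the rendered HTML
  # and dispatch it with blastula using the credentials from config.yml
  send_feedback_email(
    to        = meta$email[i],
    html_file = file.path("renders/batch_01", meta$file_name[i])
  )
}
```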

Configuration

A configuration file (config.yml), which is loaded in using config::get(), is where you define the data file, the computation scripts and the template files. This configuration file defines all the necessary settings used across the data pipeline. It follows YAML format and is organised under the default configuration profile, which is typically used for local development. To load it in R, use config::get():
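For example:

```r
library(config)

# Loads the 'default' profile from config.yml in the working directory
config <- config::get()

# Individual settings are then available by name, e.g.
config$data_file
config$default_template_file
```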

📁 File Paths

  • participant_data_file: Path to save or load the participant data CSV
  • data_file: Path to save the gathered records CSV
  • computation_script_bg: Background computations script path
  • computation_script_user: User-level computations script path
  • default_template_file: R markdown template for reports
  • template_html_file: HTML template for rendering
  • pandoc_path: Path to Pandoc/Quarto binaries (for rendering R markdown)

📥 Data Gathering

  • gather_from_controller_app: Toggle to fetch data from the Controller App
  • controller_app_base_url: Base API URL of the Controller App
  • controller_app_web_url: Base web URL of the Controller App
  • controller_app_api_key: API token used to authenticate against the Controller App API
  • controller_app_list_id: Email list ID used to fetch subscribers
  • gather_bio_script: File path of the script which gathers biodiversity data

In .Renviron you need to provide any secrets/keys needed by the script defined in gather_bio_script. For example, the provided R/gather/gather_indicia.R calls Sys.getenv("INDICIA_WAREHOUSE_SECRET"), which must therefore be set in .Renviron.
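For example, a minimal .Renviron entry for the provided Indicia gathering script might look like this (the value is a placeholder; restart your R session after editing .Renviron):

```
INDICIA_WAREHOUSE_SECRET=your_shared_secret_here
```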

📧 Email Settings

  • mail_server: SMTP server address
  • mail_port: SMTP port number
  • mail_use_tls: Whether to use TLS encryption (TRUE/FALSE)
  • mail_use_ssl: Whether to use SSL encryption (TRUE/FALSE)
  • mail_username: SMTP login username
  • mail_password: SMTP login password (if not using environment variables)
  • mail_default_sender: Default email sender address
  • mail_default_name: Display name for the sender
  • mail_default_subject: Default subject line for outgoing emails
  • mail_creds: Credential mode: "anonymous" or "envvar"
  • mail_test_recipient: Optional hardcoded recipient for testing

Scheduling jobs

You can automate the execution of run_pipeline.R by setting up a cron job that runs the Bash wrapper script (entrypoint.sh) daily.

1. Make Sure the Bash Script Is Executable

Ensure the Bash script has execution permissions:

chmod +x entrypoint.sh

2. Edit Your Crontab

To add a scheduled task, open your crontab:

crontab -e

This opens the crontab file in your default text editor.

3. Add a Cron Job Entry

To run the pipeline once daily at 7:00 AM, add the following line to the crontab:

0 7 * * * /path/to/entrypoint.sh >> /path/to/logs/pipeline.log 2>&1

Explanation:

  • 0 7 * * * → Run at 7:00 AM every day
  • /path/to/entrypoint.sh → Full path to your Bash script
  • >> /path/to/logs/pipeline.log 2>&1 → Appends both standard output and errors to a log file

Tip: Always use absolute paths in cron jobs—cron runs in a minimal environment and won’t know about your user-specific PATH.


🧪 Test Your Script First

You can run the script manually to confirm it works:

bash entrypoint.sh


Getting started

Provided in this code is a minimal example showing how it works. This can then be used as a starting point for developing your own personalised feedback.

Fork and clone the repository

If you are going to develop your own feedback from this code, it is recommended that you fork the repository to your own GitHub account. Start by forking the Recorder Feedback repository on GitHub. This creates a copy of the project under your GitHub account, allowing you to make changes and contributions without affecting the original repository.

Clone your forked repository to your local machine using Git. Open a terminal or command prompt and execute the following command:

git clone https://github.com/your-github-username/recorder-feedback.git

Or alternatively you can use the RStudio IDE to clone the repository to a new project: https://happygitwithr.com/rstudio-git-github.html#clone-the-test-github-repository-to-your-computer-via-rstudio

Create key files

Use the terminal to copy files under new names, or do the equivalent in a file explorer or RStudio:

cp example_config.yml config.yml

Keeping example_ versions of the files in this repo, which you don’t edit, and instead editing a derived file that you have just copied, makes it easier to pull updates from the main repo into your fork.

Install required R packages

Navigate to the project directory and install the necessary R packages using the renv package manager. Open R or RStudio and execute the following commands:

install.packages(c("renv"))
renv::restore()

This will ensure that you have all the required packages installed and ready to use for generating feedback. You can find an introduction to {renv} here: https://rstudio.github.io/renv/articles/renv.html

Generate test data

To help you get started there is a very minimal example of generating feedback items from simulated data. Run the provided script R/gather/generate_test_users.R to generate test data for email rendering. Execute the following command in R or RStudio:

source("R/gather/generate_test_users.R")

This script creates sample data that you can use to test the email generation process. The sample data is saved as data/simulated_participants.csv and data/simulated_data_raw.csv.

data/simulated_participants.csv

head(read.csv("data/simulated_participants.csv"))

data/simulated_data_raw.csv

head(read.csv("data/simulated_data_raw.csv"))

Get participant data from controller app

Here we provide an example script and functions for getting data from a controller app, see: https://github.com/BiologicalRecordsCentre/recorder-feedback-controller for more details.

This repository includes the get_subscribers_from_controller function. In order to use it, follow these steps:

Ensure that you have the necessary libraries and configuration settings in place:

  • Libraries: The function uses httr for HTTP requests and jsonlite for parsing JSON responses; these should already be available when using renv for package management.
  • API Endpoint: You need a deployed Recorder Feedback Controller app and access to its API, which provides subscriber information.
  • API Token: Required to authenticate your requests to the API; this is configured in your controller app.
  • List ID: The unique ID of the email list from which you want to retrieve subscribers; this is configured in your controller app.

You need to provide the following parameters for the config:

controller_app_base_url: "https://api.your-email-service.com/"
controller_app_api_key: "your_api_token"
participant_data_file: "path/to/save/subscribers.csv"

Now you can use the get_subscribers_from_controller function to fetch subscribers from a specific email list.

Example Code:

# Load required libraries
library(httr)
library(jsonlite)
library(config)

# Load configuration settings (from config.yml file or manually define them)
config <- config::get()

# Example parameters
api_url <- config$controller_app_base_url  # Base URL for your email service API
api_token <- config$controller_app_api_key # API token for authentication
email_list_id <- "1"                       # ID of the email list to query

# Call the function to get subscribers from the email list
subscribers_df <- get_subscribers_from_controller(
  api_url = api_url,
  email_list_id = email_list_id,
  api_token = api_token
)

# View the retrieved data
print(subscribers_df)

# Optionally, save the subscribers data to a CSV file
write.csv(subscribers_df, config$participant_data_file, row.names = FALSE)

The function returns a data.frame containing the list of subscribers from the specified email list, which can then be saved to the file given by participant_data_file.

Get Records data from an Indicia warehouse

You’ll need the following:

  • Indicia Warehouse API URL (set in .Renviron): The base URL for the Indicia warehouse.
  • API Authentication Credentials:
    • client_id: Your Indicia warehouse client ID (set in .Renviron)
    • shared_secret: Your shared secret for authentication (set in .Renviron)
  • User Warehouse ID: The unique identifier for the user whose records you want to retrieve (contained in data downloaded from controller app)
  • Number of Records: The number of records you want to retrieve for that user (maximum is 10,000)

Generating feedback items

Now that we’ve generated the test (or real) data we can run the pipeline. Before we do, let’s look at the other code required to make the pipeline work.

Firstly, the config file:

config.yml

default:
  participant_data_file: "data/simulated_participants.csv"
  data_file: "data/simulated_data_raw.csv"
  computation_script_bg: "computations/computations_example.R"
  computation_script_user: "computations/computations_example.R"
  default_template_file: "templates/example.Rmd"
  template_html_file: "templates_html/basic_template.html"
  pandoc_path: "C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools"
  gather_from_controller_app: False
  controller_app_base_url: "http://localhost:5000/api/"
  controller_app_web_url: "http://localhost:5000/"
  controller_app_api_key: "complicated_token"
  controller_app_list_id: "1"
  gather_bio_script: "R/gather/gather_simulated.R"
  mail_server: ""
  mail_port: 000
  mail_use_tls: True
  mail_use_ssl: True
  mail_username: ''
  mail_password: ''
  mail_default_sender: ''
  mail_default_name: 'Recorder Feedback'
  mail_default_subject: "Your personalised recorder feedback"
  mail_creds: "anonymous"
  mail_test_recipient: ''

This file specifies which R/Rmd/HTML files are used in the email generation process. The config file is implemented using the R package {config}; the values specified in it are loaded in the R code using config::get(). Learn about the package here: https://rstudio.github.io/config/articles/introduction.html

  • participant_data_file: The file path for the input data with the list of participants
  • data_file: The file path for the input data containing biological records
  • computation_script_bg: The file path of the script for computations on the background data
  • computation_script_user: The file path of the script for computations on the focal data (this can be the same script as computation_script_bg)
  • default_template_file: The file path of the R markdown file containing the feedback (‘effective digital engagement’) template
  • template_html_file: The file path of the HTML template containing formatting/header/footer etc.

The computation scripts are located in the computations folder. Each script defines an R function (which must be called compute_objects) that takes either the focal or background data as its argument and returns a named list of computed objects. These objects can then be used in the R markdown file. For example, here are the example computations provided in computations_example.R:

compute_objects <- function(data){
  # the example computations use dplyr verbs and the pipe
  library(dplyr)

  # mean number of species per user
  mean_n_species <- data %>%
    group_by(user_id) %>%
    summarise(n_species = length(unique(species))) %>%
    pull(n_species) %>%
    mean()

  # mean number of records per user
  mean_n_records <- data %>%
    group_by(user_id) %>%
    summarise(n_records = n()) %>%
    pull(n_records) %>%
    mean()

  # return the list of precalculated objects
  list(mean_n_species = mean_n_species,
       mean_n_records = mean_n_records)
}

If you don’t need any computations then you can set computation_script_bg: computations/computations_none.R, which provides a dummy function:

compute_objects <- function(data){
  list()
}

The data (focal and background) and the computed objects (focal and background) are all used in rendering the final HTML feedback item. Please take a look at the example template provided in templates/example.Rmd. Essentially, any R code you might use in analyses or data visualisation can be used here. However, be aware that slower, more complex R code will increase the time it takes to generate feedback. Some key principles:

  • The computed objects are passed to the R markdown as parameters, which you can access using params$bg_computed_objects or params$user_computed_objects. Note that these are the named lists, so if you defined list(number_of_records = 323) then in the R markdown you’d access it as params$bg_computed_objects$number_of_records.
  • You can use inline R code: https://rmarkdown.rstudio.com/lesson-4.html
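For example, inside the R markdown template you could report a computed object inline (assuming the mean_n_records object returned by the example computation script):

```
On average, recorders have submitted `r round(params$bg_computed_objects$mean_n_records)` records each.
```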

The HTML template templates_html/basic_template.html contains the formatting for the email. You only need to edit this if you wish to change the look and feel of the emails.

Finally, now that we’ve looked at all the components used in the pipeline, you can trigger it using targets::tar_make() or source("run_pipeline.R"). The pipeline can also be invoked from the command line, which is useful if you want to trigger a run as part of a schedule (e.g. a cron job):

Rscript run_pipeline.R

View generated emails

Once the targets pipeline has completed, you can view the generated email renders. Execute the following command in R or RStudio:

source("R/view_renders.R")

Development details

Now that you have the project set up and have generated test feedback, you can customise the email template and scripts to generate personalised feedback items according to your specific requirements. Edit the template (example.Rmd) and other scripts as needed to tailor the feedback content and format. By following these steps, you can quickly set up the project environment and start generating informative feedback for biological recorders.

Input data

Input data (the full dataset) must be provided as a csv (comma separated values). The columns within the data are up to you but for consistency we recommend using Darwin Core terms: https://dwc.tdwg.org/terms/#occurrence & https://dwc.tdwg.org/terms/#location

The columns you specify here (or have been specified by whatever data source you are using) must then be used in the computation scripts and R markdown template.

In the records data (specified in the config as data_file) and the participant data (specified as participant_data_file) there must be a shared user_id column, which is used to link users to their records. For data from Indicia systems, for example, this might be the warehouse ID.
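A minimal sketch of that link (all values and columns other than user_id are invented):

```r
# Invented participant and records tables sharing a user_id column
participants <- data.frame(
  user_id = c(101, 102),
  email   = c("a@example.com", "b@example.com")
)
records <- data.frame(
  user_id = c(101, 101, 102),
  species = c("Pieris rapae", "Bombus terrestris", "Apis mellifera")
)

# The records belonging to the first participant are matched on user_id
user_records <- records[records$user_id == participants$user_id[1], ]
nrow(user_records)  # 2
```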

Computing objects

Presenting raw data alone limits the ability to produce meaningful feedback. Therefore, as part of the feedback generation pipeline, we compute objects. We define these computations as R functions which take the input data as their argument and return a named list of objects, which can then be referred to in the R markdown file to show to the user. You can define different computation functions for the background and focal computations, or use the same computation file for both.

The R Markdown

The R markdown file is where you will spend the majority of your time developing the feedback you wish to send to recorders. R markdown was chosen because it provides all the facilities of R. If you need additional R packages, please ensure that you update your renv.lock file to capture this.

The HTML template

The basic HTML template provided is lifted from R package {blastula} and provides formatting such as a container and headers/footers. If you want to change the look and feel of your feedback items you should copy the basic template and rename your copy to custom_template.html. You can edit the look and feel of the content by editing the css contained within the <style> tags in the html. You then need to edit config.yml to ensure that your new template is being used in your pipeline.

Deployment

A full pipeline is included in run_pipeline.Rmd (also available as run_pipeline.R). This R markdown script implements the feedback delivery pipeline, automating the gathering, processing, and emailing of feedback to users. Here’s a breakdown of its structure and purpose:

1. Configuration

  • Sets up the runtime environment.
  • Uses renv for dependency management:
    • Ensures the correct packages are installed and restored.
    • Differentiates between local development and deployment on Posit Connect.
  • Generates a unique batch_id for tracking a specific run.
  • Loads configurations from a config.yml file, which defines parameters like API keys and file paths.

2. Gather

  • Fetch Subscribers:
    • Uses get_subscribers_from_controller.R to retrieve a list of email subscribers via an API.
    • Stores the list as a CSV file.
  • Fetch User Records:
    • Loops through each subscriber, fetching records of observational data (e.g., species, location, date).
    • Processes raw data into a structured format (e.g., extracting latitude, longitude, and species details).
    • Appends all records into a single dataframe, saving it as a CSV file.

3. Generate

  • Sets the BATCH_ID environment variable.
  • Runs the pipeline using targets::tar_make(), which performs the computations and renders the feedback items.

4. Send

  • Prepares for email delivery using the blastula library.
  • Retrieves metadata from meta_table.csv and participant data from the saved CSV.
  • Sends the feedback emails.

Automation

If you want to automate the pipeline you can use crontab. For this we use the entrypoint.sh script as the entry point that cron triggers, which in turn runs run_pipeline.R.

You can create a cron job with:

crontab -e

This opens a file in which to write the crontab entry, using crontab syntax (use crontab guru to help: https://crontab.guru/).

You can see which cron jobs have been created with:

crontab -l

Here’s an example that triggers at 7am every day:

0 7 * * * /path/to/recorder-feedback-content/entrypoint.sh >> /path/to/recorder-feedback-content/logs/job_$(date +\%Y\%m\%d_\%H\%M\%S).log 2>&1

Use absolute paths, because crontab runs in a minimal environment.

The >> redirection means that the console output is appended to a .log file with the date in the filename.

Examples

Here are some examples showing what sort of feedback items are possible.

Demonstrator - This is simply an example of the different elements you can combine using R markdown: plots, maps and images.

Month in review - This is an example of a ‘month in review’ retrospective feedback piece.