Biological recorders contribute valuable biodiversity data, and extensive infrastructure exists to support dataflows from recorders submitting records to databases. However, we lack infrastructure dedicated to providing informative feedback to recorders in response to the data they have contributed. By developing this infrastructure, we can create a feedback loop leading to better data and more engaged data providers.
We might want to provide feedback to biological recorders, or other interested parties such as land managers or groups, for a variety of reasons:
The code in this repository is developed to programmatically generate effective digital engagements (‘data stories’) from biodiversity species data. It provides tools for turning species recording data into data stories in HTML format. Separate scripts will be used to dispatch the HTML content to recipients. We use R markdown as a flexible templating system to provide extensible scripts that allow the development of different digital engagements. This templating system can then be used by data managers to design digital engagements to send to their participants. The software should be computationally efficient so that it has the potential to be scaled.
This work inherits some ideas and concepts from the MyDECIDE campaign delivered under the DECIDE project. The code and scripts from MyDECIDE are available here: https://github.com/simonrolph/DECIDE-WP3-newsletter/tree/main
The email generation process is managed by R package targets. The
targets package is a Make-like pipeline tool for statistics and data
science in R. The package skips costly runtime for tasks that are
already up to date and orchestrates the necessary computation. The
pipeline is described in _targets.R.
Pipeline inputs and outputs
Inputs: data file (.csv), R markdown template (.Rmd), HTML template (.html) and computation scripts (.R)
Intermediates
Outputs: rendered HTML files (.html) and a metadata table (.csv)
Below is a schematic diagram providing an overview of the email generation process. Input data is loaded from an external source (e.g. an Indicia database). For each person/place/project the data is split into the focal data (relating to the person, place or project you wish to deliver targeted feedback to) and the background data (all other data). Computations are then applied to these datasets, for example to calculate summary statistics such as the number of records made in a certain period. The data and the computations are fed into an R markdown document containing the code you have developed to generate effective digital engagements. An HTML template is also combined here to specify any generic elements such as formatting, headers/footers, and logos.
The input data is made available in the /data folder. It must be in a certain format in order to work correctly. During the pipeline the data is split into user_data, which only includes the species records of the target user, and bg_data (background), which is the data for everyone.
The email rendering is done using R markdown. R markdown is used as a very flexible templating system that allows developers to produce documents in HTML (and other formats). It combines markdown with code chunks. We use parameterised R markdown to render the email with user-specific data and computed data-derived objects.
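To make the idea of parameterised R markdown concrete, here is a minimal sketch. The template content, the parameter names user_name and user_data, and the output file name are all assumptions for illustration; the repository's templates/example.Rmd is the real reference. The render call is guarded because it needs the rmarkdown package and pandoc.

```r
# A minimal parameterised R markdown template, written to a temporary file.
template_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Your recording summary\"",
  "output:",
  "  html_document:",
  "    self_contained: true",
  "params:",
  "  user_name: \"\"",
  "  user_data: !r data.frame()",
  "---",
  "",
  "Hello `r params$user_name`, you have made `r nrow(params$user_data)` records."
), template_path)

# Illustrative user-specific data passed in as a parameter
user_data <- data.frame(species = c("Peacock", "Holly Blue"))

# Rendering needs the rmarkdown package and pandoc, so guard the call.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  output_path <- rmarkdown::render(
    input = template_path,
    output_file = "feedback_user_1.html",  # hypothetical file name
    output_dir = tempdir(),
    params = list(user_name = "Ella W", user_data = user_data),
    quiet = TRUE
  )
}
```

The same template can then be rendered once per recipient by changing only the params list.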
Content in the emails can be generated using frequently used R packages such as dplyr for data manipulation and ggplot2 for creating data visualisations. There are various R packages available for generating maps; the example scripts use ggspatial.
The emails are rendered in an email-ready format borrowed from the R package blastula. They are rendered as ‘self contained’ html files so there are no external local image files.
It is not recommended to carry out computationally heavy calculations within the R markdown template. Instead, a computation step can be done before rendering. These computations should be coded in scripts located in the computations folder. The computations are applied separately to user_data and bg_data, using either the same computation script or different ones.
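Because each computation script defines a function with the same name (compute_objects), one way to apply two scripts separately is to source each into its own environment. The sketch below illustrates that idea only; run_computation and the two stand-in scripts are hypothetical, and the pipeline code in the R folder may do this differently.

```r
# Hypothetical helper: load a computation script into its own environment
# and run the compute_objects() function it defines.
run_computation <- function(script_path, data) {
  env <- new.env()
  sys.source(script_path, envir = env)
  env$compute_objects(data)
}

# Two stand-in scripts written to temporary files for this illustration
user_script <- tempfile(fileext = ".R")
writeLines(
  "compute_objects <- function(data) list(n_records = nrow(data))",
  user_script
)
bg_script <- tempfile(fileext = ".R")
writeLines(
  "compute_objects <- function(data) list(n_recorders = length(unique(data$user_id)))",
  bg_script
)

records <- data.frame(
  user_id = c(1, 1, 2, 4),
  species = c("Holly Blue", "Peacock", "Large white", "Peacock")
)
user_data <- records[records$user_id == 1, ]
bg_data <- records

user_computed_objects <- run_computation(user_script, user_data)
bg_computed_objects <- run_computation(bg_script, bg_data)
```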
A configuration file (config.yml), which is loaded in using config::get(), is where you define the data file, the computation scripts and the template file.
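Since the config mostly holds file paths, a quick check that those files exist can catch typos before the pipeline runs. This is a sketch under assumptions: missing_config_files is a hypothetical helper, and a plain list stands in for the object that config::get() would return.

```r
# Hypothetical helper: return the config entries whose files do not exist.
missing_config_files <- function(cfg, keys) {
  paths <- unlist(cfg[keys])
  paths[!file.exists(paths)]
}

# Stand-in for the list returned by config::get(), using one real temporary file
existing_file <- tempfile(fileext = ".csv")
invisible(file.create(existing_file))
cfg <- list(
  data_file = existing_file,
  default_template_file = file.path(tempdir(), "no_such_template.Rmd")
)

not_found <- missing_config_files(cfg, c("data_file", "default_template_file"))
names(not_found)  # names of the settings that point at missing files
```

In a real project you would pass cfg <- config::get() instead of the stand-in list.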
The rendered HTML items are saved in the folder renders/[batch_id], where batch_id is the batch identifier you have set. The folder contains an HTML file for each recipient and a .csv with columns for each file name and the identifier.
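The output layout can be pictured with the following sketch. The column names user_id and file and the meta_table_[batch_id].csv naming follow the email-sending example later in this README; the feedback_[user_id].html file names are an assumption, and tempdir() is used only so the illustration does not write into your project.

```r
batch_id <- "test_001"

# tempdir() is used only for this illustration; the pipeline writes to renders/[batch_id]
out_dir <- file.path(tempdir(), "renders", batch_id)
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# One row per recipient: the identifier and the (hypothetically named) rendered file
user_ids <- c(1, 2, 4)
meta_table <- data.frame(
  user_id = user_ids,
  file = file.path(out_dir, paste0("feedback_", user_ids, ".html"))
)

meta_path <- file.path(out_dir, paste0("meta_table_", batch_id, ".csv"))
write.csv(meta_table, meta_path, row.names = FALSE)
```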
All the R code for making the pipeline work is located in the R folder. You shouldn’t need to edit any of these files.
Provided in this code is a minimal example to show how it works. This can then be used as a starting point for developing your own personalised feedback.
If you are going to develop your own feedback from this code, it is recommended that you fork the Recorder Feedback repository to your own GitHub account. This will create a copy of the project under your GitHub account, allowing you to make changes and contributions without affecting the original repository.
Clone your forked repository to your local machine using Git. Open a terminal or command prompt and execute the following command:
git clone https://github.com/your-github-username/recorder-feedback.git
Or alternatively you can use the RStudio IDE to clone the repository to a new project: https://happygitwithr.com/rstudio-git-github.html#clone-the-test-github-repository-to-your-computer-via-rstudio
Use the terminal to copy the files with new names, or do the equivalent in a file explorer or RStudio:
cp example_targets.R _targets.R
cp example_run_pipeline.R run_pipeline.R
cp example_config.yml config.yml
Keeping the example_ versions of the files in this repo unedited, and instead editing the derived files that you have just copied, makes it easier to pull updates from the main repo into your fork.
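If you prefer to stay in R, the same copies can be made with file.copy(). This is a sketch: copy_example_files is a hypothetical helper, and the demonstration runs in a temporary folder with empty stand-in files so it does not touch a real project.

```r
# Hypothetical helper: copy each example_ file to its working name,
# without overwriting files you have already edited.
copy_example_files <- function(dir = ".") {
  examples <- c("example_targets.R", "example_run_pipeline.R", "example_config.yml")
  working <- c("_targets.R", "run_pipeline.R", "config.yml")
  file.copy(file.path(dir, examples), file.path(dir, working), overwrite = FALSE)
}

# Demonstration in a temporary folder containing empty stand-in files
demo_dir <- tempfile("project_")
dir.create(demo_dir)
invisible(file.create(file.path(
  demo_dir,
  c("example_targets.R", "example_run_pipeline.R", "example_config.yml")
)))

copied <- copy_example_files(demo_dir)
```

In a real project you would call copy_example_files() from the project root.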
Navigate to the project directory and install the necessary R packages using the renv package manager. Open R or RStudio and execute the following commands:
install.packages(c("renv"))
renv::restore()
This will ensure that you have all the required packages installed and ready to use for generating feedback. You can find an introduction to {renv} here: https://rstudio.github.io/renv/articles/renv.html
To help you get started there is a very minimal example of generating feedback items from some simulated data. Run the provided script generate_test_data.R to generate test data for email rendering. Execute the following command in R or RStudio:
source("R/generate_test_data.R")
This script will create sample data that you can use to test the email generation process. The sample data is saved as data/simulated_participants.csv and data/simulated_data_raw.csv.
simulated_participants.csv
## user_id name email
## 1 1 Ella W Ella.W@email.com
## 2 2 Emma X Emma.X@email.com
## 3 3 Matthew S Matthew.S@email.com
## 4 4 Madison R Madison.R@email.com
data/simulated_data_raw.csv
## latitude longitude species date user_id name
## 1 51.62816 -0.01707810 Peacock 2023-11-05 4 Madison R
## 2 51.48911 -0.92391545 Large white 2023-12-29 2 Emma X
## 3 51.57612 -0.66409614 Holly Blue 2023-12-30 1 Ella W
## 4 51.03935 -0.22017373 Peacock 2023-12-21 1 Ella W
## 5 51.78953 -0.29878872 Meadow brown 2023-12-12 4 Madison R
## 6 51.36310 -0.03141985 Large white 2023-08-01 2 Emma X
## email
## 1 Madison.R@email.com
## 2 Emma.X@email.com
## 3 Ella.W@email.com
## 4 Ella.W@email.com
## 5 Madison.R@email.com
## 6 Emma.X@email.com
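For orientation, data with the same columns as shown above could be simulated along these lines. This is only a sketch and not the contents of R/generate_test_data.R; the coordinate ranges, date range and number of records are assumptions chosen to resemble the output above.

```r
set.seed(42)  # make the simulated data reproducible

participants <- data.frame(
  user_id = 1:4,
  name = c("Ella W", "Emma X", "Matthew S", "Madison R"),
  email = c("Ella.W@email.com", "Emma.X@email.com",
            "Matthew.S@email.com", "Madison.R@email.com")
)

n_records <- 20
records <- data.frame(
  latitude = runif(n_records, 51.0, 51.8),
  longitude = runif(n_records, -1.0, 0.0),
  species = sample(c("Peacock", "Large white", "Holly Blue", "Meadow brown"),
                   n_records, replace = TRUE),
  date = as.Date("2023-07-01") + sample(0:183, n_records, replace = TRUE),
  user_id = sample(participants$user_id, n_records, replace = TRUE)
)

# Attach each recorder's name and email via the shared user_id column
records <- merge(records, participants, by = "user_id")
head(records)
```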
Here we provide an example script and functions for getting data from a controller app, see: https://github.com/BiologicalRecordsCentre/recorder-feedback-controller for more details.
This repository includes the get_subscribers_from_controller function. To use it, follow these steps:
Ensure that you have the necessary libraries and configuration settings in place: httr for HTTP requests and jsonlite for parsing JSON responses. These should already be available when using renv for package management.
You need to provide the following parameters for the config:
controller_app_base_url: "https://api.your-email-service.com/"
controller_app_api_key: "your_api_token"
participant_data_file: "path/to/save/subscribers.csv"
Now you can use the get_subscribers_from_controller function to fetch subscribers from a specific email list.
Example Code:
# Load required libraries
library(httr)
library(jsonlite)
library(config)
# Load configuration settings (from config.yml file or manually define them)
config <- config::get()
# Example parameters
api_url <- config$controller_app_base_url # Base URL for your email service API
api_token <- config$controller_app_api_key # API token for authentication
email_list_id <- "1" # ID of the email list to query
# Call the function to get subscribers from the email list
subscribers_df <- get_subscribers_from_controller(
  api_url = api_url,
  email_list_id = email_list_id,
  api_token = api_token
)
# View the retrieved data
print(subscribers_df)
# Optionally, save the subscribers data to a CSV file
write.csv(subscribers_df, config$participant_data_file, row.names = FALSE)
The function returns a data.frame containing the list of subscribers from the specified email list. In the example above this is then saved to the participant_data_file.
You’ll need the following:
indicia_warehouse_base_url (set in config.yml): The base URL for the Indicia warehouse.
indicia_warehouse_client_id (set in config.yml)
indicia_warehouse_secret (set in config.yml)
Once you have the required parameters, you can call the get_user_records_from_indicia function to retrieve species records for a specific user from the Indicia warehouse.
You can see how this function can be run in a loop to generate and save data for all users in the script get_users_and_records.R.
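The general shape of such a loop is sketched below. It is not the code in get_users_and_records.R: fetch_all_records is a hypothetical wrapper, and fake_fetch stands in for get_user_records_from_indicia, whose real arguments are not documented here.

```r
# Hypothetical wrapper: call a fetch function once per user and stack the results.
# fetch_fun stands in for get_user_records_from_indicia().
fetch_all_records <- function(user_ids, fetch_fun) {
  per_user <- lapply(user_ids, function(id) {
    tryCatch(
      fetch_fun(id),
      error = function(e) {
        # Skip users whose request fails, but report the problem
        warning("Could not fetch records for user ", id, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  per_user <- Filter(Negate(is.null), per_user)
  do.call(rbind, per_user)
}

# Stand-in fetch function returning two records per user
fake_fetch <- function(id) {
  data.frame(user_id = id, species = c("Peacock", "Holly Blue"))
}

all_records <- fetch_all_records(c(1, 2, 4), fake_fetch)
```

The combined data frame could then be written to the path given as data_file in the config.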
Now that we’ve generated the test (or real) data we can run the pipeline, but before we do, let’s have a look at the other code required to make the pipeline work.
Firstly, the config file:
config.yml
## default:
## participant_data_file: "data/simulated_participants.csv"
## data_file: "data/simulated_data_raw.csv"
## computation_script_bg: "computations/computations_example.R"
## computation_script_user: "computations/computations_example.R"
## default_template_file: "templates/example.Rmd"
## template_html_file: "templates_html/basic_template.html"
## pandoc_path: "C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools"
## controller_app_base_url: ""
## controller_app_api_key: ""
## indicia_warehouse_base_url: ""
## indicia_warehouse_client_id: ""
## indicia_warehouse_secret: ""
##
This file specifies which R/Rmd/HTML files are used in the email generation process. The config file is implemented using R package {config}. The values specified in the config file are loaded in the R code using config::get(). Learn about the package here: https://rstudio.github.io/config/articles/introduction.html
participant_data_file: The file path for the input data with the list of participants
data_file: The file path for the input data containing biological records
computation_script_bg: The file path for the script for computations for the background data
computation_script_user: The file path for the script for computations for the focal data (this could be the same script as computation_script_bg)
default_template_file: The file path for the R markdown file containing the EDE (effective digital engagement) format
template_html_file: The file path for the HTML template containing formatting/header/footer etc.
The computation scripts are located in the computations folder. Each script defines an R function (which must be called compute_objects) which takes either the focal or background data as its argument and returns a named list of computed objects. These objects can then be used in the R markdown file. For example, here are the example computations provided in computations_example.R.
## compute_objects <- function(data){
##   #mean number of species
##   mean_n_species = data %>%
##     group_by(user_id) %>%
##     summarise(n_species = length(unique(species))) %>%
##     pull(n_species) %>%
##     mean()
##
##   #mean number of records
##   mean_n_records = data %>%
##     group_by(user_id) %>%
##     summarise(n_records = n()) %>%
##     pull(n_records) %>%
##     mean()
##
##   #return the list of precalculated objects
##   list(mean_n_species = mean_n_species,
##        mean_n_records = mean_n_records)
##
## }
If you don’t need any computations then you can set computation_script_bg: computations/computations_none.R which provides a dummy function:
## compute_objects <- function(data){
##   list()
## }
The data (focal and background) and the computed objects (focal and background) are all used in rendering the final HTML feedback item. Please take a look at the example template provided in templates/example.Rmd.
Essentially, any R code you might use in analyses or data visualisation can be used here. However, please be aware that slower, more complex R code will increase the time it takes to generate feedback. Some key principles:
Computed objects are accessed via params$bg_computed_objects or params$user_computed_objects. Note that these are named list objects, so if within your computation script you defined the list as list(number_of_records=323), then in order to access this in the R markdown you’d use params$bg_computed_objects$number_of_records.
The HTML template templates_html/basic_template.html contains the formatting for the email. You only need to edit this if you wish to change the look and feel of the emails.
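Returning to the computed objects in params, code inside a template chunk might combine focal and background values as in the sketch below. Outside R markdown there is no params object, so a plain list stands in for it here; n_records is a hypothetical name, while mean_n_records matches the example computation script.

```r
# Stand-in for the params object that R markdown provides inside the template
params <- list(
  user_computed_objects = list(n_records = 12),
  bg_computed_objects = list(mean_n_records = 8.5)
)

user_n <- params$user_computed_objects$n_records
bg_mean <- params$bg_computed_objects$mean_n_records

# Compare the recipient's total with the background average
comparison <- if (user_n > bg_mean) "above" else "at or below"
sentence <- sprintf(
  "You have made %s records, which is %s the average of %s per recorder.",
  user_n, comparison, bg_mean
)
sentence
```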
Finally, now that we’ve had a look at all the components that are used in the pipeline, you can trigger the pipeline using targets::tar_make() or source("run_pipeline.R").
The pipeline can also be called from the command line. This is useful if you want to trigger the pipeline run as part of a schedule (e.g. a cron job). You can provide a batch ID as a command line argument.
Rscript run_pipeline.R test_001
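A script can pick up such an argument with commandArgs(). The sketch below shows one possible approach and is not the contents of run_pipeline.R; get_batch_id and the timestamp fallback are assumptions.

```r
# Hypothetical helper: use the first command line argument as the batch ID,
# falling back to a timestamp when none is supplied.
get_batch_id <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) >= 1 && nzchar(args[1])) {
    args[1]
  } else {
    format(Sys.time(), "batch_%Y%m%d_%H%M%S")
  }
}

get_batch_id("test_001")     # argument supplied
get_batch_id(character(0))   # no argument: timestamp-based identifier
```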
Once the targets pipeline has completed, you can view the generated email renders. Execute the following command in R or RStudio:
source("R/view_renders.R")
view_renders(batch_id="test_001",5)
Replace “test_001” with the batch identifier you set, and 5 with the number of renders you want to view.
Now that you have the project set up and have generated test feedback, you can customize the email template and scripts to generate personalized feedback items according to your specific requirements. Edit the template (example.Rmd) and other scripts as needed to tailor the feedback content and format. By following the steps above, you can quickly set up the project environment and start generating informative feedback for biological recorders.
Input data (the full dataset) must be provided as a csv (comma separated values). The columns within the data are up to you but for consistency we recommend using Darwin Core terms: https://dwc.tdwg.org/terms/#occurrence & https://dwc.tdwg.org/terms/#location
The columns you specify here (or have been specified by whatever data source you are using) must then be used in the computation scripts and R markdown template.
In the records data (as specified in config as data_file) and the participant data (as specified in config as participant_data_file) there must be a shared user_id column which is used to link the users to their records. For example, for data from Indicia systems this might be the warehouse ID.
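A quick consistency check on the shared user_id column can be sketched as follows. check_user_ids is a hypothetical helper, not part of the repository, and the small data frames are made up for the illustration.

```r
# Hypothetical helper: report user_id values that appear in only one of the two tables.
check_user_ids <- function(records, participants) {
  stopifnot("user_id" %in% names(records), "user_id" %in% names(participants))
  list(
    records_without_participant = setdiff(records$user_id, participants$user_id),
    participants_without_records = setdiff(participants$user_id, records$user_id)
  )
}

participants <- data.frame(user_id = 1:4)
records <- data.frame(
  user_id = c(1, 1, 2, 4, 9),
  species = c("Peacock", "Holly Blue", "Large white", "Peacock", "Meadow brown")
)

id_check <- check_user_ids(records, participants)
id_check$records_without_participant   # user 9 has records but is not a participant
id_check$participants_without_records  # user 3 is a participant with no records
```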
Simply presenting raw data limits the ability to produce meaningful feedback. Therefore, as part of the feedback generation pipeline, we compute objects. We define these computations as R functions which take the input data as their argument. The computation functions return a named list of objects which can then be referred to in the R markdown file in order to show to the user. You can define different computation functions for the background computations and the focal computations, or use the same computation file for each.
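As an illustration of what a custom computation function could return, here is a sketch written in base R. The objects it computes (n_records, n_species, top_species, latest_record) are hypothetical choices, and it assumes species and date columns as in the simulated data, with at least one record present.

```r
# Sketch of a custom computation function returning a named list of objects.
# Assumes the data has species and date columns and at least one row.
compute_objects <- function(data) {
  species_counts <- sort(table(data$species), decreasing = TRUE)
  list(
    n_records = nrow(data),
    n_species = length(unique(data$species)),
    top_species = names(species_counts)[1],
    latest_record = max(as.Date(data$date))
  )
}

example_data <- data.frame(
  species = c("Peacock", "Peacock", "Holly Blue"),
  date = c("2023-11-05", "2023-12-21", "2023-12-30")
)

computed <- compute_objects(example_data)
computed$top_species
```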
The R markdown file is where you will spend the majority of your time developing the feedback you wish to send to recorders. R markdown was chosen as it provides all the facilities of R. If you need to use additional R packages, please ensure that you update your renv.lock file to capture this.
The basic HTML template provided is lifted from R package {blastula} and provides formatting such as a container and headers/footers. If you want to change the look and feel of your feedback items you should copy the basic template and rename your copy to custom_template.html. You can edit the look and feel of the content by editing the CSS contained within the <style> tags in the HTML. You then need to edit config.yml to ensure that your new template is being used in your pipeline.
You could send emails from R using packages such as blastula (https://CRAN.R-project.org/package=blastula) to set up SMTP. Define a function in send_email.R and then run code similar to the following, which uses the generated meta table (meta_table_[batch_id].csv) to get the file path of the email content and the participant_data_file to get the email address, then loops through the recipients and sends the emails using your send_email() function.
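source("R/send_email.R")  # defines your send_email() function
config <- config::get()
batch_id <- "test_001"  # the batch identifier used when rendering

# Table linking each rendered file to a user_id
meta_table <- read.csv(paste0("renders/", batch_id, "/meta_table_", batch_id, ".csv"))
# Read the participants once, outside the loop
participants <- read.csv(config$participant_data_file)

for (i in seq_len(nrow(meta_table))) {
  # Get this recipient's email address
  recipient <- participants$email[participants$user_id == meta_table$user_id[i]]
  send_email(recipient, meta_table$file[i])
}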
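Before configuring SMTP, it can be useful to test the loop with a dry-run version of send_email() that only checks the rendered file and logs what would be sent. This is a hypothetical stand-in, not the repository's function; a real implementation would replace the message() call with your email-sending code.

```r
# Hypothetical dry-run stand-in for send_email(): checks that the rendered file
# exists and logs the intended delivery instead of sending anything.
send_email <- function(recipient, html_file) {
  if (!file.exists(html_file)) {
    stop("Rendered file not found: ", html_file)
  }
  message("Would send ", basename(html_file), " to ", recipient)
  invisible(TRUE)
}

# Demonstration with a temporary stand-in for a rendered feedback item
demo_file <- tempfile(fileext = ".html")
writeLines("<html><body>Example feedback</body></html>", demo_file)
dry_run_result <- send_email("Ella.W@email.com", demo_file)
```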
Here are some examples showing what sort of feedback items are possible.
Demonstrator - This is simply an example of the different elements you can combine together using R markdown; plots, maps and images.
Month in review - This is an example of a ‘month in review’ retrospective feedback piece.