Biological recorders contribute valuable biodiversity data, and extensive infrastructure exists to support dataflows from recorders submitting records to databases. However, we lack infrastructure dedicated to providing informative feedback to recorders in response to the data they have contributed. By developing this infrastructure, we can create a feedback loop leading to better data and more engaged data providers.
We might want to provide feedback to biological recorders, or to other interested parties such as land managers or recording groups, for a variety of reasons.
The code in this repository is developed to programmatically generate effective digital engagements ('data stories') from biodiversity species data. It provides tools for turning species recording data into data stories in HTML format; separate scripts are used to dispatch the HTML content to recipients. We use R Markdown as a flexible templating system, providing extensible scripts that support the development of different digital engagements. Data managers can then use this templating system to design digital engagements to send to their participants. Attention should be given to keeping this software computationally efficient so that it has the potential to be scaled.
This work inherits some ideas and concepts from the MyDECIDE campaign delivered under the DECIDE project. The code and scripts from MyDECIDE are available here: https://github.com/simonrolph/DECIDE-WP3-newsletter/tree/main
The process from getting data to distributing content is defined in `run_pipeline.Rmd` (and a derived `.R` version for sourcing). Within this sits the content generation pipeline, which automates slicing up the data for each user, doing the per-user computations and generating the email content.
The email generation process is managed by the R package {targets}. The targets package is a Make-like pipeline tool for statistics and data science in R: it skips costly runtime for tasks that are already up to date and orchestrates the necessary computation. The pipeline is described in `_targets.R`.
Input data is loaded from an external source (e.g. an Indicia database). For each person/place/project the data is split into the focal data (relating to the person, place or project you wish to deliver targeted feedback to) and the background data (all other data). Computations are then applied to these datasets, for example to calculate summary statistics such as the number of records made in a certain period. The data and the computations are fed into an R Markdown document containing the code you have developed to generate effective digital engagements. An HTML template is also combined at this stage to specify generic elements such as formatting, headers/footers and logos.
All data for users and their records is stored in the `/data` folder. The contents of this folder are ignored by git because they are likely to contain personal data such as email addresses. There are two tables as `.csv` files, representing the users and their records. Both tables contain a `user_id` column which can be used to match users to their records.
This contains data about the users for which you are creating personalised feedback. There are three key columns:

- `user_id` - a unique identifier for a user. This could be their ID from a biological recording website.
- `name` - the name by which the user is addressed in the personalised feedback.
- `email` - their email address, if you are using email as a dispatch method.

You can also add any other columns you wish and this information will be passed to the parameterised R Markdown template through the R object `params$extra_params`. This is useful if you wish to show different feedback content to different recorders. For example, a column `taxon` which could have the values `butterfly` or `dragonfly` could be used in if/else logic statements (e.g. `if(params$extra_params$taxon == "butterfly"){print("Butterfly!")}`) within the R Markdown template to show different content to different users. This could also be used for splitting a group for A/B testing. Generating the data for these other columns is not part of the provided pipeline, so you will need to manipulate this csv yourself, typically by writing a script and saving it in `R/util` (utility scripts).
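For illustration, here is a small, hypothetical participants table built from R. The `taxon` column is an example of an extra column that would be surfaced via `params$extra_params`; the file path and values are invented for this sketch.

```r
# A hypothetical participants table with the three key columns plus an
# example extra column; file path and values are illustrative only
users <- data.frame(
  user_id = c(1, 2),
  name    = c("Ana", "Ben"),
  email   = c("ana@example.com", "ben@example.com"),
  taxon   = c("butterfly", "dragonfly")  # extra column -> params$extra_params$taxon
)
write.csv(users, "data/participants_example.csv", row.names = FALSE)
```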
If you are using the controller app to host your user data, you will need to provide the API endpoints and authentication details in `config.yml` (see the configuration section below). `run_pipeline.R` will then call the API and download the user data.
This contains the biodiversity data. The only column you must have is the `user_id` column; the remaining columns are not prescribed and are up to your use case. For a typical biological recording use case you will have columns describing who recorded, what was recorded, where and when. Some basic headings (used in the example below) are: `latitude`, `longitude`, `species`, `date`, `user_id` (required). Some recommended extra headings include: `species_vernacular` (common name) and `species_group` (is it a frog, a bird etc.).
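As a sketch of what this might look like in practice, here is a hypothetical records table using the basic headings above (species names, coordinates and dates are invented):

```r
# A hypothetical records table using the suggested headings; only user_id is
# required by the pipeline, everything else is up to your use case
records <- data.frame(
  user_id            = c(1, 1, 2),
  species            = c("Pieris rapae", "Aglais io", "Anax imperator"),
  species_vernacular = c("Small White", "Peacock", "Emperor Dragonfly"),
  species_group      = c("butterfly", "butterfly", "dragonfly"),
  latitude           = c(51.5, 51.6, 52.1),
  longitude          = c(-0.1, -0.2, 0.3),
  date               = c("2024-05-01", "2024-05-03", "2024-06-12")
)
```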
During the pipeline the data is split into the `user_data`, which only includes the species records of the target user, and the `bg_data` (background), which is the data for everyone. The user data and background data are passed to the R Markdown template as `params$user_data` and `params$bg_data`.
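Conceptually the split looks something like the sketch below. The real split happens inside the targets pipeline; the `records` data frame and the focal user ID here are hypothetical.

```r
library(dplyr)

focal_user_id <- 1                                       # hypothetical focal user
user_data <- filter(records, user_id == focal_user_id)   # records for the target user only
bg_data   <- records                                     # background = everyone's records
```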
Computationally expensive or reusable logic (e.g. aggregations, statistics, plots) should not be placed inside the `.Rmd` file. Instead, define these in separate R scripts located in the `computations/` folder. Each file defines a function called `compute_objects()` which can take the `user_data` or `bg_data` as a function argument.
Which computations the pipeline should carry out is defined in `config.yml`. The same computation can be applied to both user and background data, or a different computation can be used for each. These scripts return a list of R objects (tables, plots, metrics) that are passed to the R Markdown template through `params$bg_computed_objects` and `params$user_computed_objects`.
We use R Markdown to render the emails. Pandoc will need to be installed and the path set in `config.yml`. This format allows plain markdown to be combined with embedded R code chunks, making it ideal for integrating visualisations, text and summaries dynamically. The R Markdown template to use is defined in `config.yml`. The `.Rmd` template receives inputs via parameters defined in the YAML at the top of the template. The rendered output is a self-contained HTML file, ready for email delivery (no external image dependencies).
- R Markdown template (`.Rmd`): customisable to include user-specific content, visual summaries and feedback messages.
- HTML template (`.html`): provides structure and styling (e.g. branding, headers/footers, colors). Content that will be consistent across every email makes sense to put here, to save R computation time.

Building your R Markdown is the interesting bit. The parameterised markdown + targets pipeline approach means you have the user/background data, the user/background computed objects, and the extra params from the extra columns in the users table. You can use any R package here so long as you have it installed; we recommend using renv for package management. The development loop can be slow if you have to run the whole content generation pipeline, so instead you can run the pipeline, which has a target for each user's parameters, then do `params <- targets::tar_read(user_params__[USER_ID])` and run the chunks in the markdown interactively using the `params` object you have loaded. To see your targets, look in `_targets/objects`.
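For example, an interactive development session might look like the sketch below, assuming the pipeline names the per-user parameter targets `user_params__[USER_ID]`; user ID 1 is used purely as an example.

```r
# Run the pipeline once so the per-user parameter targets exist
targets::tar_make()

# Load the parameters for a single (hypothetical) user and explore them;
# the template chunks can then be run interactively against this object
params <- targets::tar_read(user_params__1)
str(params$user_data)     # the focal user's records
str(params$extra_params)  # extra columns from the participants table
```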
The basic HTML template provided is lifted from the R package {blastula} and provides formatting such as a container and headers/footers. If you want to change the look and feel of your feedback items, you should copy the basic template and rename your copy to `custom_template.html`. You can edit the look and feel of the content by editing the CSS contained within the `<style>` tags in the HTML. You then need to edit `config.yml` to ensure that your new template is used in your pipeline.
All rendered email content is stored in the `renders/[batch_id]/` folder, along with a metadata `.csv` file containing:

- `user_id`
- `email`
- file name
- `content_key` (if used)

After the HTML files are generated, the final step is delivering them to recipients. Email delivery is handled using the blastula package, which provides tools for sending richly formatted HTML emails directly from R. Emails are sent to the addresses listed in the metadata CSV (`renders/[batch_id]/meta_table.csv`). You can manage sending credentials and settings via the `config.yml` file. The pipeline loops through all rows in the metadata table and sends the appropriate email to each participant.
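As a rough sketch of what dispatch with blastula involves (this is not the repository's exact sending code; the recipient address and the `SMTP_PASSWORD` environment variable are assumptions):

```r
library(blastula)

cfg <- config::get()

# Compose a simple message body; in the real pipeline the rendered HTML
# feedback item forms the email content
email <- compose_email(body = md("Your personalised recorder feedback."))

smtp_send(
  email,
  to      = "recipient@example.com",   # would come from meta_table.csv
  from    = cfg$mail_default_sender,
  subject = cfg$mail_default_subject,
  credentials = creds_envvar(
    user        = cfg$mail_username,
    pass_envvar = "SMTP_PASSWORD",     # assumed env var holding the password
    host        = cfg$mail_server,
    port        = cfg$mail_port,
    use_ssl     = cfg$mail_use_ssl
  )
)
```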
A configuration file (`config.yml`), which is loaded in using `config::get()`, is where you define the data files, the computation scripts and the template files. This configuration file defines all the necessary settings used across the data pipeline. It follows a YAML format and is organised under the `default` configuration profile, which is typically used for local development. To activate it in R, use:
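For example (the keys shown are those listed in the tables below):

```r
# Load the 'default' configuration profile from config.yml
config <- config::get()

# Individual settings are then available by name
config$participant_data_file
config$default_template_file
```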
| Key | Description |
|---|---|
| `participant_data_file` | Path to save or load the participant data CSV |
| `data_file` | Path to save the gathered records CSV |
| `computation_script_bg` | Background computations script path |
| `computation_script_user` | User-level computations script path |
| `default_template_file` | R Markdown template for reports |
| `template_html_file` | HTML template for rendering |
| `pandoc_path` | Path to Pandoc/Quarto binaries (for rendering R Markdown) |
| Key | Description |
|---|---|
| `gather_from_controller_app` | Toggle to fetch data from the Controller App |
| `controller_app_base_url` | Base API URL of the Controller App |
| `controller_app_web_url` | Base web URL of the Controller App |
| `controller_app_api_key` | API token used to authenticate against the Controller App API |
| `controller_app_list_id` | Email list ID used to fetch subscribers |
| Key | Description |
|---|---|
| `gather_bio_script` | Filepath to script which gathers biodiversity data |
In `.Renviron` you need to provide any secrets/keys needed by the script defined in `gather_bio_script`. For example, the provided `R/gather/gather_indicia.R` contains a call to `Sys.getenv("INDICIA_WAREHOUSE_SECRET")`, so this value must be provided in `.Renviron`.
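A small optional sketch of how you might check that the secret is available before running the pipeline (the variable name is the one used by `gather_indicia.R`; the check itself is not part of the provided code):

```r
# Fail early if the secret required by the gather script has not been set
if (Sys.getenv("INDICIA_WAREHOUSE_SECRET") == "") {
  stop("INDICIA_WAREHOUSE_SECRET is not set; add it to .Renviron and restart R")
}
```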
| Key | Description |
|---|---|
| `mail_server` | SMTP server address |
| `mail_port` | SMTP port number |
| `mail_use_tls` | Whether to use TLS encryption (TRUE/FALSE) |
| `mail_use_ssl` | Whether to use SSL encryption (TRUE/FALSE) |
| `mail_username` | SMTP login username |
| `mail_password` | SMTP login password (if not using environment variables) |
| `mail_default_sender` | Default email sender address |
| `mail_default_name` | Display name for the sender |
| `mail_default_subject` | Default subject line for outgoing emails |
| `mail_creds` | Credential mode: "anonymous" or "envvar" |
| `mail_test_recipient` | Optional hardcoded recipient for testing |
Provided in this code is a minimal example to show how it works. This can then be used as a starting point for developing your own personalised feedback.
If you are going to develop your own feedback from this code, it is recommended that you fork the repository to your own GitHub account. Start by forking the Recorder Feedback repository on GitHub. This will create a copy of the project under your GitHub account, allowing you to make changes and contributions without affecting the original repository.
Clone your forked repository to your local machine using Git. Open a terminal or command prompt and execute the following command:
git clone https://github.com/your-github-username/recorder-feedback.git
Or alternatively you can use the RStudio IDE to clone the repository to a new project: https://happygitwithr.com/rstudio-git-github.html#clone-the-test-github-repository-to-your-computer-via-rstudio
Use the terminal to copy files with new names, or do the equivalent action in a file explorer or RStudio:

cp example_config.yml config.yml

Keeping `example_` versions of the files in this repo, which you don't edit, and instead editing a derived copy that you have just made, makes it easier to pull updates from the main repo into your fork.
Navigate to the project directory and install the necessary R packages using the renv package manager. Open R or RStudio and execute the following commands:
install.packages(c("renv"))
renv::restore()
This will ensure that you have all the required packages installed and ready to use for generating feedback. You can find an introduction to {renv} here: https://rstudio.github.io/renv/articles/renv.html
To help you get started, there is a very minimal example of generating feedback items from some simulated data. Run the provided script `R/gather/generate_test_users.R` to generate test data for email rendering. Execute the following command in R or RStudio:

source("R/gather/generate_test_users.R")

This script will create sample data that you can use to test the email generation process. The sample data is saved as `data/simulated_participants.csv` and `data/simulated_data_raw.csv`.
To preview `data/simulated_participants.csv`:

head(read.csv("data/simulated_participants.csv"))

And `data/simulated_data_raw.csv`:

head(read.csv("data/simulated_data_raw.csv"))
Now we've generated the test (or real) data, we can run the pipeline, but before we do, let's have a look at the other code required to make the pipeline work. Firstly, the config file:
config.yml
```yaml
default:
  participant_data_file: "data/simulated_participants.csv"
  data_file: "data/simulated_data_raw.csv"
  computation_script_bg: "computations/computations_example.R"
  computation_script_user: "computations/computations_example.R"
  default_template_file: "templates/example.Rmd"
  template_html_file: "templates_html/basic_template.html"
  pandoc_path: "C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools"
  gather_from_controller_app: False
  controller_app_base_url: "http://localhost:5000/api/"
  controller_app_web_url: "http://localhost:5000/"
  controller_app_api_key: "complicated_token"
  controller_app_list_id: "1"
  gather_bio_script: "R/gather/gather_simulated.R"
  mail_server: ""
  mail_port: 000
  mail_use_tls: True
  mail_use_ssl: True
  mail_username: ''
  mail_password: ''
  mail_default_sender: ''
  mail_default_name: 'Recorder Feedback'
  mail_default_subject: "Your personalised recorder feedback"
  mail_creds: "anonymous"
  mail_test_recipient: ''
```
This file specifies which R/Rmd/HTML files are used in the email generation process. The config file is implemented using the R package {config}; the values specified in the config file are loaded into the R code using `config::get()`. Learn about the package here: https://rstudio.github.io/config/articles/introduction.html
- `participant_data_file`: the file path for the input data with the list of participants
- `data_file`: the file path for the input data containing biological records
- `computation_script_bg`: the file path for the script for computations on the background data
- `computation_script_user`: the file path for the script for computations on the focal data (this could be the same script as `computation_script_bg`)
- `default_template_file`: the file path for the R Markdown file containing the EDE format
- `template_html_file`: the file path for the HTML template containing formatting/header/footer etc.

The computation scripts are located in the `computations` folder. Each script is an R function (which must be called `compute_objects`) which takes either the focal or background data as its argument and returns a named list of computed objects. These objects can then be used in the R Markdown file. For example, here are the example computations provided in `computations_example.R`:
```r
compute_objects <- function(data){
  # mean number of species
  mean_n_species = data %>%
    group_by(user_id) %>%
    summarise(n_species = length(unique(species))) %>%
    pull(n_species) %>%
    mean()

  # mean number of records
  mean_n_records = data %>%
    group_by(user_id) %>%
    summarise(n_records = n()) %>%
    pull(n_records) %>%
    mean()

  # return the list of precalculated objects
  list(mean_n_species = mean_n_species,
       mean_n_records = mean_n_records)
}
```
If you don't need any computations then you can set `computation_script_bg: computations/computations_none.R`, which provides a dummy function:
```r
compute_objects <- function(data){
  list()
}
```
The data (focal and background) and the computed objects (focal and background) are all used in rendering the final HTML feedback item. Please take a look at the example template provided in `templates/example.Rmd`. Essentially, any R code you might use in analyses or data visualisation can be used here. However, please be aware that slower, more complex R code will increase the time it takes to generate feedback. A key principle is to keep expensive work in the computation scripts and access the results via `params$bg_computed_objects` or `params$user_computed_objects`. Note that these are named lists, so if within the computation script you defined `list(number_of_records = 323)`, then in order to access this in the R Markdown you'd use `params$bg_computed_objects$number_of_records`.
Finally, now we've had a look at all the components that are used in the pipeline, you can trigger the pipeline using `targets::tar_make()` or `source("run_pipeline.R")`. The pipeline can also be called from the command line, which is useful if you want to trigger the pipeline run as part of a schedule (e.g. a cron job):
Rscript run_pipeline.R
Once the targets pipeline has completed, you can view the generated email renders. Execute the following command in R or RStudio:
source("R/view_renders.R")
Here we provide an example script and functions for getting data from a controller app; see https://github.com/BiologicalRecordsCentre/recorder-feedback-controller for more details. This repository includes the `get_subscribers_from_controller` function. In order to use it, follow these steps:
Ensure that you have the necessary libraries and configuration settings in place: `httr` for HTTP requests and `jsonlite` for parsing JSON responses. These should already be available when using renv for package management. You need to provide the following parameters in the config:
```yaml
controller_app_base_url: "https://api.your-email-service.com/"
controller_app_api_key: "your_api_token"
participant_data_file: "path/to/save/subscribers.csv"
```
Now you can use the `get_subscribers_from_controller` function to fetch subscribers from a specific email list. Example code:
```r
# Load required libraries
library(httr)
library(jsonlite)
library(config)

# Load configuration settings (from config.yml file or manually define them)
config <- config::get()

# Example parameters
api_url <- config$controller_app_base_url   # Base URL for your email service API
api_token <- config$controller_app_api_key  # API token for authentication
email_list_id <- "1"                        # ID of the email list to query

# Call the function to get subscribers from the email list
subscribers_df <- get_subscribers_from_controller(
  api_url = api_url,
  email_list_id = email_list_id,
  api_token = api_token
)

# View the retrieved data
print(subscribers_df)

# Optionally, save the subscribers data to a CSV file
write.csv(subscribers_df, config$participant_data_file, row.names = FALSE)
```
The function returns a `data.frame` containing the list of subscribers from the specified email list; this is saved to the `participant_data_file` path.
You'll need the following, each provided in `.Renviron`: the base URL for the Indicia warehouse, plus any authentication keys/secrets required by the gather script (for example `INDICIA_WAREHOUSE_SECRET`, used by `R/gather/gather_indicia.R`).

Input data (the full dataset) must be provided as a csv (comma separated values). The columns within the data are up to you, but for consistency we recommend using Darwin Core terms: https://dwc.tdwg.org/terms/#occurrence & https://dwc.tdwg.org/terms/#location
The columns you specify here (or have been specified by whatever data source you are using) must then be used in the computation scripts and R markdown template.
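For illustration, a records table using Darwin Core style column names might look like the sketch below; the values and the choice of terms are hypothetical, and only `user_id` is required by the pipeline.

```r
# A hypothetical records table using Darwin Core terms for the optional columns
records_dwc <- data.frame(
  user_id          = c(1, 2),
  scientificName   = c("Pieris rapae", "Anax imperator"),
  vernacularName   = c("Small White", "Emperor Dragonfly"),
  decimalLatitude  = c(51.5, 52.1),
  decimalLongitude = c(-0.1, 0.3),
  eventDate        = c("2024-05-01", "2024-06-12")
)
write.csv(records_dwc, "data/records_example.csv", row.names = FALSE)
```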
In the records data (as specified in config as `data_file`) and the participant data (as specified in config as `participant_data_file`) there must be a shared `user_id` column which is used to link the users to their records. For example, for data from Indicia systems this might be the warehouse user ID.
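As a quick sketch, the link between the two tables can be checked with a join on the shared column. The file paths come from the config; the join itself is illustrative and not part of the pipeline.

```r
library(dplyr)

config <- config::get()
records      <- read.csv(config$data_file)
participants <- read.csv(config$participant_data_file)

# Every record should match a participant via the shared user_id column
records_with_users <- left_join(records, participants, by = "user_id")
```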
A full pipeline is included in `run_pipeline.Rmd` (also `run_pipeline.R`). This is an R Markdown script for a feedback delivery pipeline, designed to automate the gathering, processing and emailing of feedback to users. Here's a breakdown of its structure and purpose:

- Uses renv for dependency management.
- Generates a `batch_id` for tracking a specific run (this can be set via the `BATCH_ID` environment variable).
- Loads the `config.yml` file, which defines parameters like API keys and file paths.
- Uses `get_subscribers_from_controller.R` to retrieve a list of email subscribers via an API.
- Runs the targets pipeline via `targets::tar_make()`, which is responsible for the computations and rendering the feedback reports.
- Sends emails using the blastula library, reading `meta_table.csv` and participant data from the saved CSV.

If you want to automate the pipeline you can use `crontab`. For this we use the `entrypoint.sh` script as the thing that we trigger, which in turn runs `run_pipeline.R`.
You can create a cron job with crontab by using `crontab -e`, which will open a file for you to write the crontab job into. You can see what cron jobs have been created with `crontab -l`. Use the crontab syntax (use crontab guru to help: https://crontab.guru/). Here's an example that triggers at 7am every day:

0 7 * * * /path/to/recorder-feedback-content/entrypoint.sh >> /path/to/recorder-feedback-content/logs/job_$(date +\%Y\%m\%d_\%H\%M\%S).log 2>&1

Use absolute paths, because crontab runs in a minimal environment. The `>>` means that the console output is appended to a `.log` file with the date in the filename.
Here are some examples showing what sort of feedback items are possible.
- Demonstrator - an example of the different elements you can combine using R Markdown: plots, maps and images.
- Month in review - an example of a 'month in review' retrospective feedback piece.