imerss-bioinfo
v1.12.0
Published
Working data of the IMERSS biodiversity informatics working group, together with algorithms
Downloads
5
Readme
IMERSS Bioinformatics Working Group Data Archive, Pipelines and Visualisations
This repository houses the working data of the IMERSS biodiversity informatics working group together with algorithms for transforming, reconciling and projecting observation and checklist data into formats suitable for publication, submission to global authorities such as GBIF, as well as map-based and graphical visualisations.
Visualisations Portal
A portal presenting a showcase of the latest versions of all of IMERSS' user-facing visualisations is available in GitHub Pages.
Data Archive
All data is stored in CSV files in subdirectories under data although some unprocessed checklist data is a variety of formats such as XLSX and PDF.
Authoritative data resulting from the full reconciliation of upstream catalogues and curated summaries for the Galiano Data Paper part 1: Marine Zoology is held in data/dataPaper-I and data/dataPaper-I-in. In particular, the reconciled and normalised observation data is at data/dataPaper-I/reintegrated-obs.csv, reconciled checklists derived from this data is at /data/dataPaper-I/reintegrated.csv. The corresponding checklists derived from the curated summaries is at /data/dataPaper-I-in/reintegrated.csv.
Full intructions for running the data paper part I pipeline are held in Galiano Data Paper Vol I.md, but here follows an overview of the scripts and overall installation instructions.
Data Pipeline
The full data pipeline for deriving the observation data and its checklists from the upstream raw catalogues for the data paper is held at /data/dataPaper-I/fusion.json5. This is stored in the JSON5 format. This file refers to the most up-to-date versions of the constituent catalogues in their various other subdirectories under /data. Many of these subdirectories contain their own "fusion" files for producing other summaries and visualisations as seen in the mini-portal.
All of the code operating the data pipelines is written in JavaScript running either in node.js or the browser. After installing git to check out this repository via
git clone IMERSS/imerss-biodata
, and then installing node.js from its download, you can install the pipeline's dependencies by running
npm install
in the checkout directory.
The scripts in src will then be ready to run via various node
commands. These scripts are currently of a
basic quality and not easily usable without being in close contact with members of the IMERSS BIWG team. Please join us
in our Matrix channels IMERSS general and
IMERSS tech.
More detailed documentation for these scripts will be forthcoming, but the principal ones involved in the data paper pipeline are:
taxonomise.js [obs or summary CSV file] [--map map JSON file] [-o output file]
Main point of ingesting a collection of catalogues or summary in some CSV form. Accepts a "fusion" JSON5 file laying out
all of the constituent catalogues files as CSV together with accompanying "map" files mapping columns for ingestion and
rendering. Produces one or two reintegrated
files combining the catalogues. This operates several stages of
normalisation, including normalising species names with respect to the internal ontology mapping file
taxon-swaps.json5, filtering out unwanted taxa, georeferencing correction patches,
and filtering with respect to a project boundary defined in GeoJSON. Details of all the capabilities of this
"Swiss Army knife" data processor can be followed in the data paper fusion file data/dataPaper-I/fusion.json5.
inatObs.js
Downloads all observations in an iNaturalist project, whilst remaining within the iNaturalist team's recommended request
rate limit (one per second). Currently hardwired to download Animalia from the Galiano project. Produces a CSV file suitable
to form part of the input to taxonomise.js
.
materialise.js
From the curated summaries end, downloads a collection of summaries mapping in a Google Sheets directory. Currently hardwired to download the "Animalia" at Galiano Data Paper 2021/Marine Life/Animalia in the author's drive mapping.
compile.js
Given the folder of files output by materialise.js
, combines them together - currently hardwired to produce a file Animalia.csv
.
compare.js
Given the results of two taxonomise.js
outputs as reintegrated.csv
file checklists, compares them for any discrepancies,
after casting out any records for higher taxa which are trumped by a more specific species records. Outputs two CSV files
excess1.csv
and excess2.csv
.
arphify.js
Given a pipeline specification such as the one in [data/dataPaper-I-in/arpha-out.json5], accepts both a normalised observation file such as data/dataPaper-I/reintegrated-obs.csv and a normalised curated summary file such as /data/dataPaper-I-in/reintegrated.csv and emits a directory of XLSX spreadsheets in the form accepted by the ARPHA writing tool used for submission of biodiversity data papers as well as a Darwin Core CSV file Materials.csv suitable for submission to GBIF.
wormify.js
Accepts arguments as for taxonomise.js
. Produces a scratch reintegrated-WoRMS.csv
file after downloading and
caching WoRMS taxon files into data/WoRMS which compares the authority
value listed against the
one found in the WoRMS API.
We dream of turning these pipelines into easily usable pluralistic graphical pipelines deployed on public live infrastructure such as github and Google Sheets.
Visualisations
Sunburst visualisation and map view
Observation and checklist data derived from condensed summaries such as, e.g. data/dataPaper-I/reintegrated.csv is in a sunburst partition layout inspired by https://bl.ocks.org/mbostock/4348373, https://www.jasondavies.com/coffee-wheel/, as well as a map-based view rendered with Leaflet.
Preparing data for visualisation
Data is compiled into a compressed JSON representation from CSV sources via a command-line script.
To convert a CSV file, run marmalise.js
e.g. via a line such as
node src/marmalise.js data/dataPaper-I/reintegrated.csv --map data/dataPaper-I/combinedOutMap.json
By default this will produce a Life.json.lz4
file which can be copied into a suitable location, e.g. in the
directories and then referred to in the JavaScript initialisation block seen, e.g. in the various index.html
files in this root. You can supply a -o
option to output a file of a chosen name at a chosen path.
To preview the web UI, host this project via some suitable static web server and then access its index.html
.
Condensed versions of the visualisation source files suitable for production hosting (JS and CSS) can be output to the
directory via node build.js
.
You can see such visualisations running online at locations like
https://biogaliano.org/map-prototype/
https://biogaliano.org/galiano-data-paper-map-view/
and also a gallery in our visualisations portal.
These visualisations are entirely static and so easy to host at any kind of site simply by uploading a folder of files and injecting an initialisation block into the markup such as
<script>
hortis.sunburstLoader(".fl-imerss-container", {
colourCount: "undocumentedCount",
selectOnStartup: "Life",
vizFile: "data/Galiano/Galiano-Life.json.lz4",
phyloMap: "%resourceBase/json/emptyPhyloMap.json",
commonNames: false
});
</script>
together with <script>
and <style>
references to the built files.