bioconductor
v0.3.1
Bioconductor classes, generics and methods, implemented in Javascript.
Bioconductor objects in Javascript
This package aims to provide Javascript implementations of Bioconductor data structures for use in web applications. Much like the original R code, we focus on the use of common generics to provide composability, allowing users to construct complex objects that "just work". We also attempt to circumvent Javascript's pass-by-reference behavior to avoid unintended modifications to unrelated objects when calling setter methods from their nested child objects.
Quick start
Here, we perform some generic operations on a DataFrame object, equivalent to Bioconductor's S4Vectors::DFrame class.
// Import using ES6 notation
import * as bioc from "bioconductor";

// Construct a DataFrame
let results = new bioc.DataFrame(
    {
        logFC: new Float64Array([-1, -2, 1.3, 2.1]),
        pvalue: new Float64Array([0.01, 0.02, 0.001, 1e-8])
    },
    {
        rowNames: [ "p53", "SNAP25", "MALAT1", "INS" ]
    }
);

// Run generics
bioc.LENGTH(results);
bioc.SLICE(results, [ 2, 3, 1 ]);
bioc.CLONE(results);

let more_results = new bioc.DataFrame(
    {
        logFC: new Float64Array([0, 0.1, -0.1]),
        pvalue: new Float64Array([1e-5, 1e-4, 0.5])
    },
    {
        rowNames: [ "GFP", "mCherry", "tdTomato" ]
    }
);
bioc.COMBINE([results, more_results]);
See the reference documentation for more details.
Using generics
Our generics allow users to operate on different objects in a consistent manner.
For example, a DataFrame allows us to store any object as a column as long as it defines methods for the LENGTH, SLICE, CLONE and COMBINE generics.
This enables the construction of complex objects like a DataFrame nested inside another DataFrame.
let genomic_results = new bioc.DataFrame(
    {
        logFC: new Float64Array([-1, -2, 1.3, 2.1]),
        pvalue: new Float64Array([0.01, 0.02, 0.001, 1e-8]),
        location: new bioc.DataFrame({
            "chromosome": [ "chrA", "chrB", "chrC", "chrD" ],
            "start": [ 1, 2, 3, 4 ],
            "width": [ 10, 20, 30, 40 ],
            // Signed type, as an unsigned array would wrap -1 around to 255.
            "strand": new Int8Array([-1, 1, 1, -1 ])
        })
    },
    {
        rowNames: [ "p53", "SNAP25", "MALAT1", "INS" ]
    }
);

let subset = bioc.SLICE(genomic_results, { start: 2, end: 4 });
bioc.LENGTH(subset);
subset.column("location");
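The examples above pass either an array of indices or a { start, end } object to SLICE. As a rough standalone illustration of how such an argument might be interpreted, here is a minimal sketch on plain arrays and TypedArrays. The helper name sliceSketch is hypothetical, and treating end as exclusive is an assumption made for this example only; the reference documentation is the authority on the package's actual semantics.

```javascript
// Hypothetical helper for illustration; not part of bioconductor.js.
function sliceSketch(values, index) {
    // Assumption: a range object selects positions start, start + 1, ..., end - 1.
    if (!Array.isArray(index) && !ArrayBuffer.isView(index)) {
        return values.slice(index.start, index.end);
    }

    // Otherwise 'index' lists explicit positions, possibly reordered or repeated.
    // Reusing the input's constructor keeps TypedArrays as TypedArrays.
    let output = new values.constructor(index.length);
    index.forEach((position, i) => {
        output[i] = values[position];
    });
    return output;
}

const logFC = new Float64Array([-1, -2, 1.3, 2.1]);
const byRange = sliceSketch(logFC, { start: 2, end: 4 }); // entries 1.3, 2.1
const byIndex = sliceSketch(logFC, [2, 3, 1]);            // entries 1.3, 2.1, -2
```

Both calls return new arrays, so the input is left unchanged.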
Alternatively, we could store a GRanges (see below) as a column of our DataFrame.
All generics on the parent DataFrame will be automatically applied to the GRanges column.
let old_location = genomic_results.column("location");
let new_location = new bioc.GRanges(
    old_location.column("chromosome"),
    new bioc.IRanges(old_location.column("start"), old_location.column("width")),
    { strand: old_location.column("strand") }
);
genomic_results.$setColumn("location", new_location);

subset = bioc.SLICE(genomic_results, { start: 2, end: 4 });
subset.column("location");
We mimic R's S4 generics using methods in Javascript classes.
For example, each vector-like class should define a _bioconductor_LENGTH method to quantify its concept of "length".
The LENGTH function will then call this method to obtain a length value for any instance of any supported class.
We prefix this method with _bioconductor_ to avoid collisions with other properties;
this allows safe monkey patching of third-party classes if they are sufficiently vector-like.
(Admittedly, the LENGTH function is not really necessary, as users could just call _bioconductor_LENGTH directly.
However, the latter is long and unpleasant to type, so we might as well wrap it in something that's easier to remember.
Calling the method directly would also require monkey patching of built-in classes like Arrays and TypedArrays, which is somewhat concerning as it risks interfering with the behavior of other packages.
By defining our own LENGTH function, we can safely handle the built-in classes as special cases without modifying their prototypes.)
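The dispatch pattern described above can be sketched in a few lines. This is a simplified re-implementation for illustration, not the package's actual code: the function lengthSketch and the class RunLengthVector are hypothetical names, and the assumption that _bioconductor_LENGTH takes no arguments is ours.

```javascript
// Hypothetical stand-in for a LENGTH-style generic; the real implementation may differ.
function lengthSketch(x) {
    // Built-in arrays and TypedArrays are handled as special cases,
    // so their prototypes never need to be modified.
    if (Array.isArray(x) || (ArrayBuffer.isView(x) && !(x instanceof DataView))) {
        return x.length;
    }

    // Any other object must opt in by defining the prefixed method.
    if (x !== null && typeof x === "object" && typeof x._bioconductor_LENGTH === "function") {
        return x._bioconductor_LENGTH();
    }

    throw new Error("no LENGTH method available for this object");
}

// A hypothetical vector-like class that opts in to the generic.
class RunLengthVector {
    constructor(values, runLengths) {
        this._values = values;
        this._runLengths = runLengths;
    }

    // Its "length" is the number of elements after expanding all runs.
    _bioconductor_LENGTH() {
        return this._runLengths.reduce((total, n) => total + n, 0);
    }
}

const rle = new RunLengthVector(["a", "b"], [3, 2]);
const lengths = [
    lengthSketch([1, 2, 3]),
    lengthSketch(new Float64Array(4)),
    lengthSketch(rle)
];
// lengths is [3, 4, 5]
```

The caller never needs to know which branch was taken, which is what lets a container hold built-in arrays and custom classes side by side.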
Mimicking copy-on-write
We mimic R's copy-on-write behavior by returning a new object from any setter, rather than mutating the existing object.
This avoids silent pass-by-reference changes in separate objects, which would be particularly problematic in complex classes that contain many child objects.
In the example below, another_reference still retains the original set of row names while only modified has its row names removed.
// Construct a DataFrame
let results = new bioc.DataFrame(
    {
        logFC: new Float64Array([-1, -2, 1.3, 2.1]),
        pvalue: new Float64Array([0.01, 0.02, 0.001, 1e-8])
    },
    {
        rowNames: [ "p53", "SNAP25", "MALAT1", "INS" ]
    }
);

let another_reference = results;
let modified = results.setRowNames(null);
For users who are very sure that they are only operating on a single instance of the object,
or for those who wish to exploit pass-by-reference behavior to modify multiple objects at once,
we can use mutating setters for slightly better efficiency.
These are prefixed with $ signs to indicate their potentially unexpected behavior.
results.$setRowNames(null);
another_reference.rowNames(); // this will now be null.
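To show how the two kinds of setter can coexist on one class, here is a minimal standalone sketch. The class LabelledVector and its methods are hypothetical and only mirror the naming convention described above; they are not taken from the package.

```javascript
// Hypothetical class for illustration; not part of bioconductor.js.
class LabelledVector {
    constructor(values, names = null) {
        this._values = values;
        this._names = names;
    }

    names() {
        return this._names;
    }

    // Copy-on-write setter: build a shallow copy, modify it and return it,
    // leaving 'this' (and every other reference to it) untouched.
    setNames(names) {
        let target = new LabelledVector(this._values, this._names);
        target._names = names;
        return target;
    }

    // Mutating setter: changes 'this' in place, visible through all references.
    $setNames(names) {
        this._names = names;
        return this;
    }
}

const original = new LabelledVector([1, 2], ["a", "b"]);
const alias = original;

const copy = original.setNames(null);
const aliasNamesAfterCopy = alias.names(); // still ["a", "b"]; only 'copy' lost its names

original.$setNames(null);
const aliasNamesAfterMutation = alias.names(); // now null, as 'alias' is the same object
```

The copy in this sketch is shallow, so the underlying values are still shared between the original and the copy; only the slot being set is replaced.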
Note that this copy-on-write paradigm only applies to the setters defined in the bioconductor.js classes.
Assignments to base objects (e.g., arrays, TypedArrays) will still exhibit pass-by-reference behavior.
If there is a risk of inadvertently modifying a shared object, users should consider applying CLONE to their object before modifying it.
// Returns a base object, i.e., Float64Array of log-fold changes.
let lfc = results.column("logFC");
// We clone it so that changes don't propagate to 'results' by reference.
// We can then apply our arbitrary modifications to the copy.
let lfc_copy = bioc.CLONE(lfc);
lfc_copy[0] = 100;
// Only 'more_modified' will contain the new log-FCs;
// 'results' itself is not affected.
let more_modified = results.setColumn("logFC", lfc_copy);
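The underlying hazard is plain Javascript behavior and can be reproduced without the package. The sketch below uses only built-in TypedArray methods; slice() is shown as one standard way to obtain an independent copy and is not meant to describe how CLONE is implemented.

```javascript
const stored = new Float64Array([-1, -2, 1.3, 2.1]);

// Plain assignment only creates another reference to the same array.
const alias = stored;
alias[0] = 100;
const storedAfterAlias = stored[0]; // 100: the write went through to 'stored'

// slice() with no arguments returns an independent copy with its own buffer.
const copy = stored.slice();
copy[1] = -200;
const storedAfterCopy = stored[1]; // -2: 'stored' is unaffected
```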
Representing (genomic) ranges
We can construct equivalents of Bioconductor's IRanges and GRanges objects, representing integer and genomic ranges respectively.
Similarly, Bioconductor's GRangesList is implemented as a GroupedGRanges in this package.
let ir = new bioc.IRanges(/* start = */ [1,2,3], /* width = */ [ 10, 20, 30 ]);
let gr = new bioc.GRanges([ "chrA", "chrB", "chrC" ], ir, { strand: [ 1, 0, -1 ] });
// Generics still work on these range objects:
bioc.LENGTH(gr);
bioc.SLICE(gr, [ 2, 1, 0 ]);
bioc.CLONE(gr);
We can find overlaps between two sets of ranges, akin to Bioconductor's findOverlaps() function:
let index = gr.buildOverlapIndex();
let gr2 = new bioc.GRanges([ "chrA", "chrC", "chrA" ], new bioc.IRanges([5, 3, 2], [9, 9, 9]));
let overlaps = index.overlap(gr2);
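To make the notion of an overlap concrete, the following standalone sketch performs a brute-force search over the same coordinates as the example above. Everything in it is an assumption for illustration: the function naiveOverlaps is hypothetical, each range is a plain { seqname, start, width } record covering positions start to start + width - 1, strand is ignored, and the result is an array holding the matching subject indices for each query range. The package's index is presumably more efficient, and the format of its output may differ.

```javascript
// Hypothetical brute-force overlap search; not the package's algorithm.
function naiveOverlaps(query, subject) {
    return query.map(q => {
        let hits = [];
        subject.forEach((s, j) => {
            const sameSequence = q.seqname === s.seqname;
            // Two intervals [start, start + width) overlap if each
            // starts before the other one ends.
            const intersects = q.start < s.start + s.width && s.start < q.start + q.width;
            if (sameSequence && intersects) {
                hits.push(j);
            }
        });
        return hits;
    });
}

const subject = [
    { seqname: "chrA", start: 1, width: 10 },
    { seqname: "chrB", start: 2, width: 20 },
    { seqname: "chrC", start: 3, width: 30 }
];
const query = [
    { seqname: "chrA", start: 5, width: 9 },
    { seqname: "chrC", start: 3, width: 9 },
    { seqname: "chrA", start: 2, width: 9 }
];

const hits = naiveOverlaps(query, subject);
// hits is [[0], [2], [0]]
```

This compares every query against every subject, which is fine for a handful of ranges but scales poorly; avoiding that cost is the usual motivation for building an index first.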
We can store per-range metadata in the elementMetadata field of each object, just like Bioconductor's mcols().
let meta = gr.elementMetadata();
meta.$setColumn("symbol", [ "Nanog", "Snap25", "Malat1" ]);
gr.$setElementMetadata(meta);
gr.elementMetadata().columnNames();
Handling experimental assays
The SummarizedExperiment object is a data structure for storing experimental data in a matrix-like object,
along with further annotations on the rows (usually features) and columns (usually samples).
To illustrate, let's mock up a small count matrix, ostensibly from an RNA-seq experiment:
// Making a column-major dense matrix of random data.
let ngenes = 100;
let nsamples = 20;
let expression = new Int32Array(ngenes * nsamples);
expression.forEach((x, i) => expression[i] = Math.random() * 10);
let mat = new bioc.DenseMatrix(ngenes, nsamples, expression);

// Mocking up row names, column annotations.
let rownames = [];
for (let g = 0; g < ngenes; g++) {
    rownames.push("Gene_" + String(g));
}

let treatment = new Array(nsamples);
treatment.fill("control", 0, 10);
treatment.fill("treated", 10, nsamples);
let sample_meta = new bioc.DataFrame({ group: treatment });
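For readers unfamiliar with the column-major layout mentioned in the comment above, this short standalone sketch shows where each matrix entry sits in the flat array: all values of the first column come first, then the second column, and so on. The helper columnMajorEntry is a hypothetical name used only here.

```javascript
// Hypothetical helper: look up entry (row, column) in a column-major buffer.
function columnMajorEntry(values, numberOfRows, row, column) {
    return values[row + column * numberOfRows];
}

// A 3 x 2 matrix stored column by column:
//   column 0 holds 1, 2, 3 and column 1 holds 4, 5, 6.
const nrow = 3;
const buffer = new Int32Array([1, 2, 3, 4, 5, 6]);

const topRight = columnMajorEntry(buffer, nrow, 0, 1);   // 4
const bottomLeft = columnMajorEntry(buffer, nrow, 2, 0); // 3
```

With this layout, the counts for one sample (a column) are contiguous in memory, while the counts for one gene (a row) are spread out at intervals of the row count.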
We can now store all of this information in a SummarizedExperiment:
let se = new bioc.SummarizedExperiment({ counts: mat },
{ rowNames: rownames, columnData: sample_meta });
This can be manipulated by generics for two-dimensional objects:
bioc.NUMBER_OF_ROWS(se);
bioc.SLICE_2D(se, { start: 0, end: 50 }, [0, 2, 4, 8, 10, 12, 14, 16, 18]);
bioc.COMBINE_COLUMNS([se, se]);
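As a rough picture of what column-wise operations involve for a dense column-major matrix, the sketch below combines and subsets columns on plain { nrow, ncol, values } records. The helpers combineColumnsSketch and sliceColumnsSketch are hypothetical and cover only the matrix values; a SummarizedExperiment would also need to apply the same operation to its column annotations, which is not shown.

```javascript
// Hypothetical helpers on { nrow, ncol, values } records; not the package's code.
function combineColumnsSketch(matrices) {
    const nrow = matrices[0].nrow;
    if (!matrices.every(m => m.nrow === nrow)) {
        throw new Error("all matrices must have the same number of rows");
    }

    // In column-major storage, appending columns is just concatenating buffers.
    const ncol = matrices.reduce((total, m) => total + m.ncol, 0);
    let values = new Int32Array(nrow * ncol);
    let offset = 0;
    for (const m of matrices) {
        values.set(m.values, offset);
        offset += m.values.length;
    }
    return { nrow, ncol, values };
}

function sliceColumnsSketch(matrix, columns) {
    let values = new Int32Array(matrix.nrow * columns.length);
    columns.forEach((column, k) => {
        // Each requested column is a contiguous block of 'nrow' values.
        const start = column * matrix.nrow;
        values.set(matrix.values.subarray(start, start + matrix.nrow), k * matrix.nrow);
    });
    return { nrow: matrix.nrow, ncol: columns.length, values };
}

const small = { nrow: 2, ncol: 2, values: new Int32Array([1, 2, 3, 4]) };
const doubled = combineColumnsSketch([small, small]);  // values 1, 2, 3, 4, 1, 2, 3, 4
const picked = sliceColumnsSketch(doubled, [3, 0]);    // values 3, 4, 1, 2
```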
Similar implementations are provided for the RangedSummarizedExperiment and SingleCellExperiment classes.
Supported classes and generics
For classes:
|Javascript|R/Bioconductor equivalent|
|---|---|
| DataFrame | S4Vectors::DFrame |
| IRanges | IRanges::IRanges |
| GRanges | GenomicRanges::GRanges |
| GroupedGRanges | GenomicRanges::GRangesList |
| SummarizedExperiment | SummarizedExperiment::SummarizedExperiment |
| RangedSummarizedExperiment | SummarizedExperiment::RangedSummarizedExperiment |
| SingleCellExperiment | SingleCellExperiment::SingleCellExperiment |
For generics:
|Javascript|R/Bioconductor equivalent|
|---|---|
| LENGTH | base::NROW |
| SLICE | S4Vectors::extractROWS |
| COMBINE | S4Vectors::bindROWS |
| CLONE | - |
| NUMBER_OF_ROWS | base::NROW |
| NUMBER_OF_COLUMNS | base::NCOL |
| SLICE_2D | base::"[" |
| COMBINE_ROWS | S4Vectors::bindROWS |
| COMBINE_COLUMNS | S4Vectors::bindCOLS |
Further reading
A high-level description of Bioconductor data structures is given in the "Orchestrating high-throughput genomic analysis with Bioconductor" paper.
The formulation of the generics was mostly based on the code in the S4Vectors package.
The implementation of each class is based on the code in the corresponding R package, e.g., GRanges in GenomicRanges.