blast-level

v0.0.11

Published

2 years ago

Streaming BLAST indexes for leveldb databases. Automatically keep an up-to-date BLAST database for your leveldb sequence data and run streaming BLAST queries on the data.

bschulzsf

autonomous

juul

WARNING: This module is not yet ready for production use. Proceed with caution.

Dependencies

Ensure that you have a recent NCBI BLAST+ installed on your system. You need version 2.4.0 or later.

You might be able to get away with:

sudo apt install ncbi-blast+

Otherwise, download the latest BLAST+ release from NCBI and unzip the binaries into /usr/local/bin.

You can check if the correct BLAST+ binaries are installed using:

require('blast-level').check([binPath], [cb])

Where the optional binPath is your path to the BLAST tools.

Install required node modules:

npm install

Usage

var level = require('level');
var blastLevel = require('blast-level');

var db = level('mydb', {valueEncoding: 'json'});
var blastDB = blastLevel(db, {
    type: 'nt', // this is a nucleotide database (as opposed to amino acids)
    seqProp: 'sequence', // property in 'mydb' that stores the sequence data
    changeProp: 'updated', // property in 'mydb' that stores last updated time
    path: 'my/blastdb/dir' // directory to use for storing BLAST databases
});

db.put('my_unique_id', {
  name: "Green Fluorescent Protein",
  sequence: "atgagcaaaggcgaagaactgtttaccggcgtggtgccgattctggtggaactggatgg",
  updated: new Date().getTime()
}, function(err) {
  if(err) return console.error(err);

  var stream = blastDB.query('caaaggcgaaactgtttacc');

  stream.on('data', function(data) {
    console.log("Result:", data);
  });

  stream.on('error', function(err) {
    console.log("Error:", err);
  });

  stream.on('end', function() {
    console.log("end of results")
  });
});

If you would rather receive the query results in a plain callback than as a stream, pass a callback to .query (see the API section for the callback arguments):

blastDB.query('caaaggcgaaactgtttacc', function(err, metadata, data) {
  if(err) return console.error(err);

  // data is an array of results by default
  data.forEach(function(result) {
    console.log("result:", result);
  });
  console.log("end of results");
});

Modes

blast-level can operate in two different modes:

blastdb: Fastest search but sequence changes trigger partial BLAST db rebuild
direct: TODO not yet implemented. At least 2-3x slower than blastdb and puts much more load on the node.js process + leveldb but no BLAST db is ever written to disk.

blastdb

blastdb mode keeps a native BLAST database on disk. This is the fastest option. In this mode all existing data is kept in a primary BLAST database called main, and all changes since the main database was last rebuilt are kept in a separate update database. This is done because BLAST databases cannot be modified in place: without the split, the entire database would have to be re-written every time a change occurs. Instead of re-writing the entire database on every .put or .del, only the update database, which holds the added or changed sequences, is re-written.

A main BLAST database is created from all sequence data in leveldb when the db is first opened, and a second database holds all sequences changed since the main database was last rebuilt. The main database can be rebuilt manually by calling .rebuild(), which should be done periodically, or automatically whenever the blast-level database is opened by setting rebuild: true (default false). If rebuildOnChange is set to true (default false) then a single BLAST db is kept containing all sequence data, and that entire database is rebuilt every time sequence data changes. See the Implementation section for more details.

direct

TODO: This mode has not yet been implemented.

direct mode does not keep a native BLAST database of the sequence data. Instead, all of the sequence data is streamed from leveldb and piped into the blastn command every time a query is run.

API

blastLevel(db, [opts]) (constructor)

Constructor with all properties (defaults shown):

var blastDB = blastLevel(db, {
    mode: 'blastdb', // or 'direct' or 'streaming' (slow)
    type: 'nt', // 'nt' for nucleotide database. 'aa' for amino acid database
    seqProp: undefined, // property of leveldb value that contains the DNA/AA sequence or file path
    changeProp: undefined, // property of leveldb value that contains a value that will have changed if the sequence was changed, e.g. a timestamp for when the leveldb value was last updated or a hash of the sequence
    filterChanged: true, // filter seqs that have changed since last rebuild. only relevant in 'blastdb' mode when rebuildOnChange is false
    seqFormatted: false, // false if plaintext, true if FASTA, GenBank, SBOL, etc. 
    seqIsFile: false, // is seqProp a path to a file or array of files (or a function that returns a path to a file or array of files)? if false then seqProp should be a string or array of strings or a function returning either of those.
    seqFileBasePath: '.', // if seqIsFile, this is the base path
    seqFileEncoding: 'utf8', // string encoding of sequence files
    path: undefined, // path to use for storing BLAST database (blastdb mode only)
    listen: true, // listen for changes on level db instance and update BLAST db automatically
    rebuild: false, // rebuild the BLAST db on initialization (now)
    rebuildOnChange: false, // rebuild BLAST db whenever the leveldb is changed
    binPath: undefined, // path where BLAST+ binaries are located if not in PATH
    debug: 0 // debug output levels are 0 (off), 1 (brief) and 2 (full)
});

The option seqProp must be defined. Additionally path must be defined in 'blastdb' mode and changeProp must be defined in 'blastdb' mode unless rebuildOnChange is true or filterChanged is false.
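
The rules above can be written down as a small check. This is only a sketch: validateOpts is a hypothetical helper, not part of the blast-level API, and it assumes the documented defaults (mode 'blastdb', filterChanged true, rebuildOnChange false).

```javascript
// Hypothetical helper (not part of blast-level): returns a list of
// problems with a constructor options object, per the rules above.
function validateOpts(opts) {
  var mode = opts.mode || 'blastdb';                // documented default
  var filterChanged = opts.filterChanged !== false; // documented default: true
  var errors = [];

  if (opts.seqProp === undefined) {
    errors.push('seqProp must be defined');
  }
  if (mode === 'blastdb') {
    if (opts.path === undefined) {
      errors.push("path must be defined in 'blastdb' mode");
    }
    // changeProp is only needed when changed sequences are filtered
    // and the database is not rebuilt on every change
    if (opts.changeProp === undefined && !opts.rebuildOnChange && filterChanged) {
      errors.push('changeProp must be defined unless rebuildOnChange is true or filterChanged is false');
    }
  }
  return errors;
}

// Missing changeProp in the default configuration yields one error
console.log(validateOpts({seqProp: 'sequence', path: 'my/blastdb/dir'}));
```

An empty array means the combination of options satisfies the documented requirements.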

seqProp can be a simple property name like 'sequence' or it can be a property path like 'foo.bar.baz.sequence'. It can also be a synchronous function that takes the value as its only argument and returns the sequence or file path. If the value is undefined for a leveldb value then that value will be skipped.
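
To illustrate the three forms seqProp can take, here is a rough sketch of how such a property could be resolved against a leveldb value. resolveProp is a hypothetical name used for illustration; the module's internal lookup may differ.

```javascript
// Hypothetical sketch: resolve a property name, a dotted property path
// or a synchronous function against a leveldb value.
function resolveProp(value, prop) {
  if (typeof prop === 'function') return prop(value);

  var parts = prop.split('.');
  var cur = value;
  for (var i = 0; i < parts.length; i++) {
    // a missing intermediate object means the value has no sequence
    if (cur === undefined || cur === null) return undefined;
    cur = cur[parts[i]];
  }
  return cur;
}

var value = {foo: {bar: {baz: {sequence: 'atgagc'}}}};

console.log(resolveProp({sequence: 'atgagc'}, 'sequence'));
console.log(resolveProp(value, 'foo.bar.baz.sequence'));
console.log(resolveProp(value, function(v) { return v.foo.bar.baz.sequence; }));
console.log(resolveProp({}, 'foo.bar')); // undefined, so this value would be skipped
```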

changeProp is like seqProp but must reference/return a value that changes whenever the sequence for that leveldb value changes. This could be a hash of the sequence but it could also simply be the time-date when the value was last updated.
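
As a sketch of the hash approach, a changeProp function could look like the following. The name sequenceHash and the choice of SHA-256 are assumptions made for this example, and it assumes the sequence is stored as a string under the 'sequence' property.

```javascript
var crypto = require('crypto');

// Hypothetical changeProp function: returns a value that changes
// whenever the sequence changes, and only then.
function sequenceHash(value) {
  if (!value || typeof value.sequence !== 'string') return undefined;
  return crypto.createHash('sha256').update(value.sequence).digest('hex');
}

var before = sequenceHash({name: 'GFP', sequence: 'atgagcaaaggc'});
var renamed = sequenceHash({name: 'Green Fluorescent Protein', sequence: 'atgagcaaaggc'});
var edited = sequenceHash({name: 'GFP', sequence: 'atgagcaaaggg'});

console.log(before === renamed); // true: only the name changed
console.log(before === edited);  // false: the sequence changed
```

Compared to a timestamp, a hash does not flag a value as changed when only non-sequence properties were edited, at the cost of hashing every sequence. It would be passed to the constructor as changeProp: sequenceHash.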

If seqFormatted is true then streaming-sequence-extractor is used to extract sequence information from FASTA, GenBank or SBOL data. The format is autodetected. This can be used with seqIsFile to consume sequences in a variety of formats based on file paths stored in the database.

.query(sequence, [opts], [cb])

Run a query on the BLAST database.

opts:

output: 'array' or 'stream'
type: 'blastn' or 'blastp' auto-detected based on blastdb constructor opts.type

The blastn-short or blastp-short tasks will be used rather than the blastn or blastp tasks if the query sequence is under 50 nt for blastn or under 30 aa for blastp. This is based on the BLAST+ manual recommendations.
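
The task selection described above amounts to a length check. A minimal sketch, where pickTask is a hypothetical helper name and the thresholds are the ones stated above:

```javascript
// Hypothetical sketch of the task selection rule: short queries use the
// "-short" variant of the task that matches the database type.
function pickTask(dbType, querySeq) {
  var len = querySeq.length;
  if (dbType === 'nt') return len < 50 ? 'blastn-short' : 'blastn';
  if (dbType === 'aa') return len < 30 ? 'blastp-short' : 'blastp';
  throw new Error('Unknown database type: ' + dbType);
}

// The 20 nt query from the Usage section selects the short task
console.log(pickTask('nt', 'caaaggcgaaactgtttacc')); // prints "blastn-short"
```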

If a callback is provided then it will be called like so:

cb(err, metadata, data)

where the data argument will be an array by default or a stream if opts.output is 'stream'.

If no callback is provided then a stream will be returned, even if output is set to 'array'. You will not be able to access the metadata using this calling convention.

.rebuild([cb])

Rebuild the blast database. The callback will receive args:

cb(err, count);

Where count is the number of entries added to the blast database.

.check([cb])

Check if the correct versions of all required NCBI BLAST+ binaries are installed. If no callback is specified then prints the results to stdout/stderr.

.status(cb)

Get information about BLAST database status such as the name, path and size of each BLAST database. This is only relevant in blastdb mode.

The callback is called with cb(err, status) where status is an object looking something like this:

{
  "opts": {
    "mode": "blastdb",
    "type": "nt",
    "seqProp": "seq",
    "changeProp": "updated",
    "filterChanged": true,
    "seqFormatted": false,
    "seqIsFile": false,
    "seqFileBasePath": ".",
    "seqFileEncoding": "utf8",
    "path": "/tmp/tmp-7943PLJ1i00go537",
    "listen": false,
    "rebuild": false,
    "rebuildOnChange": false,
    "binPath": "",
    "debug": false
  },
  "mainDB": {
    "name": "main-1",
    "path": "/tmp/tmp-7943PLJ1i00go537/main-1.*",
    "size": 188
  },
  "updateDB": {
    "name": "update-2",
    "path": "/tmp/tmp-7943PLJ1i00go537/update-2.*",
    "size": 192
  },
  "ratio": "0.98",
  "shouldRebuild": false
}

opts is just a copy of the options used by this blastDB instance when the call was made.

mainDB and updateDB will only be present if the main and update databases exist. The name, absolute path and size in bytes are included.

ratio is the fraction of the combined size of both BLAST databases that is made up by the main database. For optimal performance you want most of your data to be in the main database, since the main database isn't auto-rebuilt on every change (unless you've specified rebuildOnChange:true or aren't using blastdb mode). You can use ratio as an indicator of when a rebuild is necessary. If either the update or main database is missing then the ratio will be 1.

shouldRebuild is a boolean that becomes true if the ratio drops below 0.1.
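
One way to act on shouldRebuild is to check the status periodically and rebuild only when needed. The sketch below uses only .status(cb) and .rebuild(cb) as documented here; rebuildIfNeeded and the hourly interval are assumptions made for illustration.

```javascript
// Hypothetical helper: rebuild the main database only if status()
// reports that a rebuild is due. Calls cb(err, rebuilt, count).
function rebuildIfNeeded(blastDB, cb) {
  blastDB.status(function(err, status) {
    if (err) return cb(err);
    if (!status.shouldRebuild) return cb(null, false);

    blastDB.rebuild(function(err, count) {
      if (err) return cb(err);
      cb(null, true, count);
    });
  });
}

// Possible usage with a real blastDB instance (interval is arbitrary):
//
// setInterval(function() {
//   rebuildIfNeeded(blastDB, function(err, rebuilt, count) {
//     if (err) return console.error(err);
//     if (rebuilt) console.log('rebuilt main db with', count, 'entries');
//   });
// }, 60 * 60 * 1000);
```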

.put(key, value, [opts], cb)

Same as a .put directly on the database but will wait for the index to finish updating before calling the callback.

.del(key, [opts], cb)

Same as a .del directly on the database but will wait for the index to finish updating before calling the callback.

.batch(array, [opts], cb)

Same as a .batch directly on the database but will wait for the index to finish updating before calling the callback.

Note: Chained batch mode not yet implemented.

Implementation

This module relies on the official NCBI BLAST+ toolset being installed somewhere on your system. It is implemented as a wrapper rather than a native js module due to the somewhat complicated and lightly documented nature of the BLAST+ codebase ¯_(ツ)_/¯

This module creates an actual BLAST database in BLAST database format by streaming the output of a leveldb database into the makeblastdb command line tool with metadata referencing the original leveldb entries. When a BLAST query is performed it is executed using the blastn or blastp command and the results are referenced to the original leveldb entries and streamed out.
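
makeblastdb accepts FASTA input, so the streaming step boils down to turning each leveldb entry into a FASTA record. The sketch below assumes the leveldb key is used directly as the FASTA identifier and contains no whitespace; how the module actually encodes its metadata in the record header is not shown here. toFastaRecord is a hypothetical name.

```javascript
// Hypothetical sketch: format one leveldb entry as a FASTA record.
// Assumes the key contains no whitespace or newlines.
function toFastaRecord(key, sequence) {
  // FASTA sequence data should not contain stray whitespace
  var seq = String(sequence).replace(/\s+/g, '');
  return '>' + key + '\n' + seq + '\n';
}

var entries = [
  {key: 'my_unique_id', value: {sequence: 'atgagcaaaggcgaagaactg'}},
  {key: 'another_id', value: {sequence: 'ttaccggcgtggtgccgattc'}}
];

// In the module this would be written to the stdin of makeblastdb
var fasta = entries.map(function(entry) {
  return toFastaRecord(entry.key, entry.value.sequence);
}).join('');

process.stdout.write(fasta);
```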

Since none of the BLAST+ command line tools allow modifying a BLAST database (appending is sorta supported, see the Notes section) at first glance it seems that the entire BLAST database must be re-written every time the leveldb database changes in ways that modify the sequence data. However, this module implements a workaround.

If opts.rebuildOnChange is false (the default) then two databases are kept: a 'main' database, which is built the first time blast-level is initialized on a leveldb database containing any sequence data (or on first write), and an 'update' database, which contains all sequences added or changed since the last rebuild and is rewritten on every change to the leveldb database (except deletions). Queries are performed on both databases as if they were a single database, and results from the main database are ignored if the sequence has changed since it was added to the main database.

A rebuild of the main database can be triggered by manually calling .rebuild() and should probably be scheduled to run periodically when server load is minimal. After a rebuild the main database will be up to date and the update database will disappear, only to re-appear as soon as an update is made. This implementation puts minimal load on the nodejs process and leveldb database since the rebuild of the update db is accomplished mostly by a blast command.

If opts.rebuildOnChange is true then the main database will be rebuilt in its entirety on every change by streaming all leveldb sequences into a new database. This is very rarely a good idea. The 'direct' mode accomplishes the same thing without keeping any on-disk BLAST database, so only use rebuildOnChange if you also need to be able to run BLAST queries directly on the on-disk db.
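
The filtering of outdated hits can be pictured roughly as follows. This is a simplified, synchronous sketch under stated assumptions: the hit shape ({key, change, db}), the in-memory currentValues lookup and the name filterHits are all hypothetical, and the real module looks values up in leveldb asynchronously.

```javascript
// Hypothetical sketch: keep only hits whose indexed change value still
// matches the current leveldb value. `change` is the changeProp value
// that was recorded when the sequence was written to the BLAST database.
function filterHits(hits, currentValues, changeProp) {
  return hits.filter(function(hit) {
    if (!Object.prototype.hasOwnProperty.call(currentValues, hit.key)) {
      return false; // deleted from leveldb since the index was built
    }
    // drop hits that were indexed from an older version of the value
    return currentValues[hit.key][changeProp] === hit.change;
  });
}

var currentValues = {
  a: {sequence: 'atgagc', updated: 1},
  b: {sequence: 'atgagg', updated: 2} // changed after the main db was built
};

var hits = [
  {key: 'a', change: 1, db: 'main'},
  {key: 'b', change: 1, db: 'main'},   // stale: superseded by the update db
  {key: 'b', change: 2, db: 'update'},
  {key: 'c', change: 1, db: 'main'}    // stale: 'c' was deleted
];

console.log(filterHits(hits, currentValues, 'updated'));
```

Only the hit for 'a' from the main database and the hit for 'b' from the update database survive.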

Gotchas

When operating in blastdb mode with rebuildOnChange: false, a sequence that is deleted or changed in leveldb is not deleted from the main BLAST database. If a query results in a hit on a deleted or changed sequence, the hit will be reported by BLAST but will not be passed on to your callback. Since BLAST has a maximum number of hits that it reports for each query (usually 30), this can result in fewer hits than expected being reported for no apparent reason. In the extreme case where all top 30 hits for a query have been deleted since the last rebuild, no hits will be reported at all, even though there may be hits on sequences with lower scores than the 30 deleted sequences. This is probably not fixable without changing the NCBI BLAST+ codebase. If you have a use case where this may become an issue you should consider using the 'direct' mode or manually triggering a rebuild more often.

ToDo

Next version

switch away from level-changes so we can catch .on('batch')
add support for blastx, tblastx and tblastn
write more unit tests

Future

implement direct mode (don't keep any on-filesystem blastdb)
support megablast and maybe blastpgp
write more unit tests
use makembindex command to speed up queries?
make it work with non-JSON value databases?

Design decisions and BLAST+ limitations

This section discusses some of the early considerations that fed into the current design.

Queries without a BLAST database

It is possible to run e.g. blastn without a BLAST database. The syntax is:

blastn -query /path/to/query/file -subject /path/to/subject/file

Both query and subject file can contain multiple FASTA sequences.

You can use stdin as either the source of the query or the subject, but not both:

blastn -query /path/to/query/file -subject -
blastn -query - -subject /path/to/subject/file

If you need an input stream for both query and subject then you need to do something like:

./blastn -query - -subject <(nc -lU /tmp/mysocket)

and then to send the stream of data:

./program_outputting_fasta_sequences | nc -U /tmp/mysocket

This looks encouraging since it seems like we can use two input streams and one output stream and have a nice streaming blastn interface. Unfortunately because blastn sorts the output by best match first, it makes sense that it waits until the query is complete before outputting anything. It looks like this sorting cannot be turned off without altering the codebase, so you have no way of getting proper streaming result output other than to execute the blastn command once for each sequence in the database.

The file examples/multiexec.js implements the "call blastn once for each sequence" strategy. This was compared to two other strategies: using a normal blast database as input and using a stream of fasta sequences as input. The NCBI vector database was used as a test set with the T7 promoter sequence as the query and the "-task blastn-short" option set. Here are the results on my i5-2520M @ 2.5 GHz with an SSD (though the source files had purposefully been accessed recently, so they should already have been in RAM).

blast database: 0.182 seconds
fasta stream: 0.442 seconds
multiexec: 12.354 seconds

The filesize of the vector blast database was 1.4 MB and the fasta version was 4.8 MB.

The blastn results were capped at 300 while multiexec yielded 901 results. Those 300 were of course the best results, so cutting the multiexec results to 300 would not have been a fair comparison. I could not find any blastn option that would give more than 300 results (if this is possible somehow, please let me know).

It is likely that the difference between multiexec and the other strategies would be much diminished when working with very long sequences, since the cost of executing a new instance of blastn is per sequence.

The multiexec strategy is too slow to seriously consider. The blast database strategy is obviously the fastest, but it comes at the cost of maintaining an up to date blast database of all sequences. Since it does not seem to be possible to modify an existing blast database (see next section) this requires either rebuilding the entire blast database every time any sequence is added, deleted or changed, or doing something clever like keeping one blast database for all existing sequences and another for all changes since the last build, and then rebuilding the main database e.g. every night at 4 am.

It's probably not a good idea to rebuild the entire blast database on each change, unless the database is rather small and changes are rare.

If the fasta stream strategy were used then it would only introduce a slowdown of about a factor of 2.5, though granted that was with a fasta stream read from a file rather than from a database. This strategy seems like the winner since it'd be simple to pipe a leveldb read stream into blastn and be done with it. However, this would mean that a node.js process would need to traverse the entire leveldb database, reading the entire values, parsing the JSON and passing on only the sequence data. If queries end up taking multiple seconds then this could become a noticeable burden on the server, leaving it tied up for long periods on each database update.
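
For reference, the slowdown factors follow directly from the timings listed above. A small calculation, using only those three measurements:

```javascript
// Timings in seconds, copied from the benchmark results above
var timings = {
  'blast database': 0.182,
  'fasta stream': 0.442,
  'multiexec': 12.354
};

// Slowdown of a strategy relative to querying a prebuilt BLAST database
function slowdown(strategy) {
  return timings[strategy] / timings['blast database'];
}

console.log('fasta stream:', slowdown('fasta stream').toFixed(1) + 'x'); // prints "fasta stream: 2.4x"
console.log('multiexec:', slowdown('multiexec').toFixed(1) + 'x');       // prints "multiexec: 67.9x"
```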

Modifying a BLAST database

makeblastdb does not support modifying an existing BLAST database, which would seem to force a complete rebuild every time the data changes. It does, however, support concatenating existing databases and creating single-entry databases. Together these allow a crude form of appending: first create a new database from the sequence(s) to be appended, then concatenate that database with the existing one. This isn't exactly an append operation since it writes an entire new database rather than appending to the existing one, but it is still much faster than rebuilding from leveldb, so I document it here in case someone finds it useful:

# Create a new BLAST database from the sequence(s) to be "appended":
cat seq_to_append1.fasta seq_to_append2.fasta | makeblastdb -dbtype nucl -title 'to_append' -out /tmp/to_append

# Concatenate the to_append database with the existing database
makeblastdb -dbtype nucl -title 'newdb' -in '/path/to/existing/db /tmp/to_append' -input_type blastdb -out /path/to/concatenated/db
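
From node.js the concatenation step could be driven with child_process.spawn. The sketch below only mirrors the flags of the second command above; concatArgs and runConcat are hypothetical helpers, and the database paths are placeholders.

```javascript
var spawn = require('child_process').spawn;

// Build the makeblastdb arguments for concatenating two existing
// BLAST databases. The two inputs go into one space-separated -in value.
function concatArgs(existingDb, appendDb, outDb, title) {
  return [
    '-dbtype', 'nucl',
    '-title', title,
    '-in', existingDb + ' ' + appendDb,
    '-input_type', 'blastdb',
    '-out', outDb
  ];
}

// Run makeblastdb and call cb(err) once when it has finished.
// Requires makeblastdb to be in PATH, so it is not invoked here.
function runConcat(existingDb, appendDb, outDb, title, cb) {
  var done = false;
  function finish(err) {
    if (done) return; // 'error' and 'close' can both fire
    done = true;
    cb(err);
  }
  var child = spawn('makeblastdb', concatArgs(existingDb, appendDb, outDb, title), {stdio: 'inherit'});
  child.on('error', finish);
  child.on('close', function(code) {
    finish(code === 0 ? null : new Error('makeblastdb exited with code ' + code));
  });
}

console.log(concatArgs('/path/to/existing/db', '/tmp/to_append', '/path/to/concatenated/db', 'newdb'));
```

Because the arguments are passed as an array, no shell quoting is needed, but database paths containing spaces would still be ambiguous inside the -in value.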

BLAST symbolic concat

It is possible to use blastdb_aliastool to create a BLAST database that simply references multiple existing databases, which makes it possible to query several databases at once as if they were a single database. However, it is also possible to simply list multiple BLAST databases when running a query, and they will be treated as a single database.
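
Listing several databases in one query means passing them as a single space-separated -db value. A minimal sketch of building such an argument list for blastn, with the query read from stdin; queryArgs is a hypothetical helper and the database paths are placeholders modelled on the status example above.

```javascript
// Hypothetical sketch: build blastn arguments that search several
// BLAST databases as if they were one. Paths must not contain spaces,
// since a space separates the database names in the -db value.
function queryArgs(dbPaths, task) {
  return [
    '-task', task,
    '-db', dbPaths.join(' '),
    '-query', '-' // read the query sequence from stdin
  ];
}

var args = queryArgs(['/tmp/blastdb/main-1', '/tmp/blastdb/update-2'], 'blastn-short');
console.log(args);

// With BLAST+ installed this could be run as:
//   var child = require('child_process').spawn('blastn', args);
//   child.stdin.end('>query\ncaaaggcgaaactgtttacc\n');
```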

Operating system support

This module has only been tested on Debian/Ubuntu systems. It will likely work on other *nix systems.

Copyright and license

License: AGPLv3