bioinformatics-hub

v1.4.1

Published

3 years ago

an open source bioinformatics package for retrieving, processing and analyzing the sequences and structures of biomolecules, such as proteins and nucleotides.

tianguanghushui

DNA RNA protein sequence analysis prediction bioinformatics NCBI nucleotide

Bioinformatics-Hub

Bioinformatics-Hub is an open source bioinformatics package for retrieving, processing and analyzing the sequences and structures of biomolecules, such as proteins and nucleotides.

At the current stage, this package can retrieve protein and nucleotide sequences from NCBI, analyze protein and nucleotide sequences, and predict motifs in proteins and nucleotides. Sequences are primarily handled as a single string in FASTA format or as a JavaScript object.

We will add protein structure retrieval and analysis modules in the future.

Bioinformatics-Hub demo site

This website was created using the Bioinformatics-Hub package: https://bioinformatics-hub.vercel.app/

Outline

Installation
Handle user provided sequences
- Setup bioinformaticsHub object
- Set/store sequences in BioinformaticsHub application
- Get all sequence Ids
- Get sequence by Id
- Get all sequences as a JavaScript object keyed by sequence Id
- Handle invalid input, blanks, numbers, unsupported characters in user-provided sequences
Use NCBI Sequence Retriever (ncbiSeqRetriever) module
- Retrieve protein sequences from NCBI
- Retrieve nucleotide sequences from NCBI
Predict/scan motifs in protein/nucleotides sequences
Routine bioinformatics tools for protein/nucleotide sequences
- Check if the nucleotide or protein sequences contain invalid characters
- Get reverse complementary sequences
- Restriction sites detector (under development)
- Translate DNA/RNA sequences to protein sequences
PDB module: retrieve and analyze protein structures (future task)
- Retrieve protein structure from protein data bank (future task)
- Retrieve information from protein structure (future task)
Version Changes

Installation

Running this package requires a Node.js environment. Run the command below to install this package into your JavaScript project.

npm install --save bioinformatics-hub

Handle user-provided sequences

Setup bioinformaticsHub object

Step 1: Import "bioinformatics-hub" package and create a bioinformaticsHub object.

const BioinformaticsHub = require("bioinformatics-hub");
const bioInformaticsHub = new BioinformaticsHub();

Step 2: Store user-provided sequences in the bioinformaticsHub object.
This application can store one or multiple sequences provided by the user. The sequences should be provided as a single string in FASTA format. However, if the user wants to store only one sequence for analysis, the sequence Id (the line starting with ">" in FASTA format) is optional. If the only sequence does not have a sequence Id, this application will automatically assign "Unnamed sequence 1" as its sequence Id.
Here is an example of how to store two sequences in this application.
```
>sequence_id_1
SLLKASSTLDNLFKELDKNGDGEVSYEEF
>sequence_id_2
DKDKD
```
These two sequences can be stored in this application as shown below.
```
const inputSequence = ">sequence_id_1\nSLLKASSTLDNLFKELDKNGDGEVSYEEF\n>sequence_id_2\r\nDKDKD";

bioInformaticsHub.setFastaSequences(inputSequence); 
```
Now the bioinformaticsHub setup is complete, and the user can use bioinformaticsHub to retrieve sequence Ids and sequences, or to retrieve the sequences as a JavaScript object. The user can also use bioinformaticsHub to perform sequence analysis, such as predicting motifs in these sequences.
- Retrieve all sequence Ids of the stored sequences using the code below:
```
const sequenceIdArray = bioInformaticsHub.getAllSequenceIds();
console.log(sequenceIdArray); 
// Console output: [ 'sequence_id_1', 'sequence_id_2' ]
```
- Retrieve a specific sequence by a sequence Id using the code below:
```
const sequence1 = bioInformaticsHub.getSequenceById("sequence_id_1"); 
console.log(sequence1); 
// Console output: SLLKASSTLDNLFKELDKNGDGEVSYEEF

const sequence2 = bioInformaticsHub.getSequenceById("sequence_id_2"); 
console.log(sequence2); 
// console output: DKDKD
```
- Retrieve all sequences with Ids as a javascript object using the code below:
```
const sequencesWithIds = bioInformaticsHub.getAllSequencesWithIds(); 
console.log(sequencesWithIds); 
```
  console output:
```
{ 
  sequence_id_1: 'SLLKASSTLDNLFKELDKNGDGEVSYEEF', 
  sequence_id_2: 'DKDKD' 
}
```
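The object form is convenient for further processing. As a minimal sketch, assuming an object shaped like the console output above, the hypothetical helper `toFastaString` (not part of the package) turns such an object back into a FASTA string:

```javascript
// Hypothetical helper, not part of bioinformatics-hub:
// builds a FASTA string from an object keyed by sequence Id.
function toFastaString(sequencesWithIds) {
  return Object.entries(sequencesWithIds)
    .map(([id, sequence]) => `>${id}\n${sequence}`)
    .join("\n");
}

// Object copied from the console output shown above.
const storedSequences = {
  sequence_id_1: "SLLKASSTLDNLFKELDKNGDGEVSYEEF",
  sequence_id_2: "DKDKD"
};

console.log(toFastaString(storedSequences));
```

The resulting string could be passed to setFastaSequences() again, since it follows the same FASTA layout as the input example.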

Use NCBI Sequence Retriever (ncbiSeqRetriever) module

NCBI Sequence Retriever (the ncbiSequenceRetriever module) is a lightweight JavaScript module for fetching nucleotide or protein sequences from NCBI databases. This module is a simplified wrapper around the EFetch utility of the NCBI E-utilities API.

The NCBI Sequence Retriever module can fetch up to 100 protein sequences or up to 10 short nucleotide sequences in a single retrieval call. We strongly recommend that each queried nucleotide sequence be shorter than 100,000 bp.
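When a list of accession Ids is longer than these limits, it has to be split before calling the retriever. The following is a minimal sketch of such a split; `splitIntoBatches` and the two constants are hypothetical names introduced here for illustration, and the Ids are placeholders.

```javascript
// Hypothetical helper, not part of bioinformatics-hub:
// splits an array of accession Ids into batches of at most batchSize Ids.
function splitIntoBatches(ids, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    batches.push(ids.slice(i, i + batchSize));
  }
  return batches;
}

// Limits taken from the paragraph above.
const PROTEIN_BATCH_SIZE = 100;
const NUCLEOTIDE_BATCH_SIZE = 10;

// 250 placeholder Ids, only used to show the batch sizes.
const manyIds = Array.from({ length: 250 }, (_, i) => `ID_${i + 1}`);
console.log(splitIntoBatches(manyIds, PROTEIN_BATCH_SIZE).map((batch) => batch.length));
// [ 100, 100, 50 ]
```

Each batch can then be passed to one retrieval call.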

The retrieved sequences can be returned as a string in FASTA format or as a JavaScript object.

NCBI Sequence Retriever is also available as a standalone npm package named "ncbi-sequence-retriever".

Fetch protein sequences from NCBI

The examples below fetch multiple protein sequences from the NCBI protein database using user-provided accession Ids.

Return the sequences as a string in FASTA format

const {ncbiSequenceRetriever} = require ("bioinformatics-hub/modules");

const proteinIds = ["AAA49004.1","AAK64208.1"];  // add up to 100 accession Ids in this array
ncbiSequenceRetriever.retrieveProteinSequences(proteinIds).then((sequences)=>{
  console.log(sequences);
});

The output from above code:

>AAA49004.1 parvalbumin, partial [Gallus gallus]
FIEEDELKFVLKGFTPDGRDLSDKETKALLAAGDKDGDGKIGVEK

>AAK64208.1 calbindin D9k [Mus musculus]
MCAEKSPAEMKSIFQKYAAKEGDPDQLSKEELKLLIQSEFPSLLKASSTLDNLFKELDKNGDGEVSYEEF
EAFFKKLSQ

Return sequences as a javascript object

const {ncbiSequenceRetriever} = require ("bioinformatics-hub/modules");

const proteinIds = ["AAA49004.1","AAK64208.1"];  // add up to 100 accession Ids in this array
ncbiSequenceRetriever.retrieveProteinSequences(proteinIds, "JSON").then((sequences)=>{
  console.log(sequences);
});

The output from above code:

{ 'AAA49004.1 parvalbumin, partial [Gallus gallus]': 
    'FIEEDELKFVLKGFTPDGRDLSDKETKALLAAGDKDGDGKIGVEK',
  'AAK64208.1 calbindin D9k [Mus musculus]':
    'MCAEKSPAEMKSIFQKYAAKEGDPDQLSKEELKLLIQSEFPSLLKASSTLDNLFKELDKNGDGEVSYEEFEAFFKKLSQ' 
}
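In the object form, each key is the full FASTA header (accession Id followed by a description). As a sketch under that assumption, the hypothetical helpers below (not part of the package) re-index such an object by accession Id alone:

```javascript
// Hypothetical helpers, not part of bioinformatics-hub.

// Split a header such as "AAK64208.1 calbindin D9k [Mus musculus]"
// at the first space into accession Id and description.
function splitHeader(header) {
  const firstSpace = header.indexOf(" ");
  if (firstSpace === -1) {
    return { accession: header, description: "" };
  }
  return {
    accession: header.slice(0, firstSpace),
    description: header.slice(firstSpace + 1)
  };
}

// Re-index the retrieved object so that each key is the accession Id only.
function indexByAccession(retrieved) {
  const byAccession = {};
  for (const [header, sequence] of Object.entries(retrieved)) {
    const { accession, description } = splitHeader(header);
    byAccession[accession] = { description, sequence };
  }
  return byAccession;
}

// Object copied from the output shown above.
const retrievedProteins = {
  "AAA49004.1 parvalbumin, partial [Gallus gallus]":
    "FIEEDELKFVLKGFTPDGRDLSDKETKALLAAGDKDGDGKIGVEK",
  "AAK64208.1 calbindin D9k [Mus musculus]":
    "MCAEKSPAEMKSIFQKYAAKEGDPDQLSKEELKLLIQSEFPSLLKASSTLDNLFKELDKNGDGEVSYEEFEAFFKKLSQ"
};

console.log(indexByAccession(retrievedProteins)["AAA49004.1"].description);
// parvalbumin, partial [Gallus gallus]
```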

Fetch nucleotide sequences

The examples below fetch one mRNA sequence from the NCBI nucleotide database using a user-provided accession Id:

Return the sequences as a string in FASTA format

const {ncbiSequenceRetriever} = require ("bioinformatics-hub/modules");

const nucleotidesIds = ["M65068.1"];  // add up to 10 accession Ids in this array
ncbiSequenceRetriever.retrieveNucleotideSequences(nucleotidesIds).then((sequences)=>{
  console.log(sequences);
});

The output from above code:

>M65068.1 Chicken parvalbumin mRNA, partial cds
TTTATTGAGGAGGATGAGCTAAAGTTTGTACTGAAGGGCTTTACCCCAGATGGCAGAGACCTATCAGACA
AAGAGACAAAGGCTCTTCTGGCTGCTGGAGATAAGGACGGTGATGGCAAAATCGGCGTGGAAAAA

Return sequences as a javascript object

const {ncbiSequenceRetriever} = require ("bioinformatics-hub/modules");

const nucleotidesIds = ["M65068.1"];  // add up to 10 accession Ids in this array
ncbiSequenceRetriever.retrieveNucleotideSequences(nucleotidesIds, "JSON").then((sequences)=>{
  console.log(sequences);
});

The output from above code:

{
  'M65068.1 Chicken parvalbumin mRNA, partial cds': 
    'TTTATTGAGGAGGATGAGCTAAAGTTTGTACTGAAGGGCTTTACCCCAGATGGCAGAGACCTATCAGACAAAGAGACAAAGGCTCTTCTGGCTGCTGGAGATAAGGACGGTGATGGCAAAATCGGCGTGGAAAAA' 
}

Optional API key

The retrieveNucleotideSequences() and retrieveProteinSequences() methods in ncbiSequenceRetriever accept an optional API key string as the third input parameter. This parameter is undefined by default. Passing a valid API key as the third input parameter increases the allowed number of sequence retrieval calls from 3 calls per second to 10 calls per second from one IP address.

Since December 1, 2018, NCBI has enforced the use of API keys, which offer enhanced levels of supported access to the E-utilities. Any site (IP address) posting more than 3 requests per second to the E-utilities without an API key receives an error message. By including an API key, a site can post up to 10 requests per second by default. More rules about API keys can be found at this link: https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Coming_in_December_2018_API_Key
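One simple way to stay under these limits is to run retrieval calls one after another with a pause in between. The sketch below is only an illustration of that idea: `minIntervalMs`, `sleep`, and `runThrottled` are hypothetical names and are not provided by the package.

```javascript
// Hypothetical helpers, not part of bioinformatics-hub.

// Minimum pause between calls: 3 requests per second without an API key,
// 10 requests per second with one (limits quoted above).
function minIntervalMs(apiKey) {
  const requestsPerSecond = apiKey ? 10 : 3;
  return Math.ceil(1000 / requestsPerSecond);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run async tasks sequentially, pausing between consecutive tasks.
// Each task is a function that returns a promise.
async function runThrottled(tasks, apiKey) {
  const results = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) {
      await sleep(minIntervalMs(apiKey));
    }
    results.push(await tasks[i]());
  }
  return results;
}

console.log(minIntervalMs(undefined)); // 334
console.log(minIntervalMs("fake_api_key")); // 100
```

A task here could be, for example, `() => ncbiSequenceRetriever.retrieveProteinSequences(batch)` for one batch of accession Ids. Running calls strictly in sequence is slower than necessary, but it keeps the request rate easy to reason about.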

Sample code with API key as the third input argument:

const {ncbiSequenceRetriever} = require ("bioinformatics-hub/modules");

const nucleotidesIds = ["M65068.1"];  
const apiKey = "fake_api_key";  // if you have a valid API key, set up in this line.
ncbiSequenceRetriever.retrieveNucleotideSequences(nucleotidesIds, "JSON", apiKey).then((sequences)=>{
  console.log(sequences);
});

Predict motifs in protein or nucleotide sequences with user-provided sequences

This application can predict motifs in one or multiple sequences using user-defined patterns. The user should define one or multiple patterns based on the pattern syntax rules.

Pattern syntax

The standard IUPAC one-letter codes for amino acids (for example, "P" stands for proline) and nucleotides (for example, "C" stands for cytosine) should be used to define a pattern.
"[]" with listed amino acid or nucleotide letters means that the listed letters are allowed in this position. The "," character between letters is optional. For example, [A,T,C] or [ATC] means that "A", "T", or "C" is allowed in this position.
"{}" with listed amino acid or nucleotide letters means that the listed letters are NOT allowed in this position. The "," character between letters is optional. For example, {DE} or {D,E} means that neither "D" nor "E" is allowed in this position.
The wildcard "x" means that any amino acid or nucleotide is accepted in this position.
Each element in a pattern can be separated from its neighbor by a "-". This is optional. For example, [A,T]-x-{C,A}, [A,T]x{C,A}, [AT]-x-{CA}, and [AT]x{CA} are identical patterns.
Repetition of an element in multiple continuous positions can be represented by following that element with a "(number of repeats)" or a "(minimum repeats, maximum repeats)".
- Examples:
- A(3) stands for AAA.
- T(2,5) stands for TT, TTT, TTTT, or TTTTT.
The user should add a "^" or "<" to indicate that the pattern must start at the N-terminal (or 5' terminal).
The user should add a "$" or ">" to indicate that the pattern must end at the C-terminal (or 3' terminal).
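These rules map closely onto regular expressions. The sketch below shows one possible translation into a JavaScript RegExp; it is not the package's implementation, `patternToRegExp` and `findMatches` are hypothetical names, and matching is assumed to be case-sensitive with non-overlapping matches.

```javascript
// Hypothetical helpers, not part of bioinformatics-hub.

// Translate a pattern written in the syntax above into a global RegExp.
function patternToRegExp(pattern) {
  let source = "";
  let i = 0;
  while (i < pattern.length) {
    const ch = pattern[i];
    if (ch === "[" || ch === "{") {
      // [ATC] -> allowed letters, {DE} -> excluded letters; "," is optional.
      const close = ch === "[" ? "]" : "}";
      const end = pattern.indexOf(close, i);
      if (end === -1) throw new SyntaxError(`Missing "${close}" in pattern`);
      const letters = pattern.slice(i + 1, end).replace(/[^A-Za-z]/g, "");
      source += (ch === "[" ? "[" : "[^") + letters + "]";
      i = end + 1;
    } else if (ch === "(") {
      // (3) or (2,5) -> {3} or {2,5}
      const end = pattern.indexOf(")", i);
      if (end === -1) throw new SyntaxError('Missing ")" in pattern');
      source += "{" + pattern.slice(i + 1, end).replace(/\s/g, "") + "}";
      i = end + 1;
    } else if (ch === "x") {
      source += "."; // wildcard
      i++;
    } else if (ch === "-") {
      i++; // optional separator
    } else if (ch === "^" || ch === "<") {
      source += "^"; // N-terminal or 5' terminal anchor
      i++;
    } else if (ch === "$" || ch === ">") {
      source += "$"; // C-terminal or 3' terminal anchor
      i++;
    } else if (/[A-Za-z]/.test(ch)) {
      source += ch; // literal residue
      i++;
    } else {
      throw new SyntaxError(`Unsupported character "${ch}" in pattern`);
    }
  }
  return new RegExp(source, "g");
}

// Collect every non-overlapping match with its start index.
function findMatches(sequence, pattern) {
  return Array.from(sequence.matchAll(patternToRegExp(pattern)), (match) => ({
    startIndex: match.index,
    matched_sequence: match[0]
  }));
}

console.log(findMatches("SLLKASSTLDNLFKELDKNGDGEVSYEEF", "[D]-x-[DNS]-{FLIVWY}-[DNESTG]"));
// [ { startIndex: 16, matched_sequence: 'DKNGD' } ]
```

The sample sequence and pattern are the ones used elsewhere in this README; the field names in the returned objects simply mirror those shown in the prediction output.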

Predict a single motif in a single sequence

The sample sequence for prediction:

>seq1
SLLKASSTLDNLFKELDKNGDGEVSYEEF

The sample pattern for prediction:

[D]-x-[DNS]-{FLIVWY}-[DNESTG]

General workflow for prediction:

- Import the "bioinformatics-hub" package.
- Create a bioinformaticsHub object.
- Store sequences in the bioinformaticsHub object.
- Retrieve the PredictionAssistant object.
- Set one or multiple patterns.
- Call the predict() method; the prediction results are returned.

The example shown below demonstrates the prediction workflow for predicting one user-defined motif (named "patternId_1") in one protein sequence (named "seq1").

const BioinformaticsHub = require("bioinformatics-hub");

const bioInformaticsHub = new BioinformaticsHub();
const result = bioInformaticsHub.setFastaSequences(">seq1\nSLLKASSTLDNLFKELDKNGDGEVSYEEF") // store sequences here
                                .getPredictionAssistant()
                                .setPatterns({"patternId_1": "[D]-x-[DNS]-{FLIVWY}-[DNESTG]"}) // store pattern here
                                .predict();
console.log(result); // print the prediction results

Running the above code finds all matches of the given pattern (patternId_1) in the given sequence (seq1). The expected output is shown below:

[
  { 
    sequenceId: 'seq1',
    sequence: 'SLLKASSTLDNLFKELDKNGDGEVSYEEF', 
    contained_motifs: [ 'patternId_1' ],    // Indicate which motifs are found in "seq1"
    motifs:
      { 
        patternId_1:     // the prediction result on "patternId_1"
          {
            pattern_signature: '[D]-x-[DNS]-{FLIVWY}-[DNESTG]', 
            matched_sequences: 
              [ 
                { 
                  startIndex: 16, 
                  matched_sequence: 'DKNGD' 
                }
              ]
          }
      } 
  } 
]
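The nested result can be flattened into one row per match for printing or export. This is a minimal sketch that assumes the result always has the shape shown above; `listMatches` is a hypothetical helper, not part of the package.

```javascript
// Result object copied from the expected output above.
const predictionResult = [
  {
    sequenceId: "seq1",
    sequence: "SLLKASSTLDNLFKELDKNGDGEVSYEEF",
    contained_motifs: ["patternId_1"],
    motifs: {
      patternId_1: {
        pattern_signature: "[D]-x-[DNS]-{FLIVWY}-[DNESTG]",
        matched_sequences: [{ startIndex: 16, matched_sequence: "DKNGD" }]
      }
    }
  }
];

// Hypothetical helper: one row per matched motif occurrence.
function listMatches(results) {
  const rows = [];
  for (const { sequenceId, motifs } of results) {
    for (const [patternId, motif] of Object.entries(motifs)) {
      for (const match of motif.matched_sequences) {
        rows.push({
          sequenceId,
          patternId,
          startIndex: match.startIndex,
          matched: match.matched_sequence
        });
      }
    }
  }
  return rows;
}

console.log(listMatches(predictionResult));
```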

Predict motifs in protein or nucleotide sequences based on NCBI ACCESSION ids

The ncbiMotifScanner module in BioinformaticsHub can predict/scan motifs in sequences when the user provides NCBI ACCESSION ids and patterns.

Scan/predict protein sequences based on NCBI accession ids

const {ncbiMotifScanner} = require("bioinformatics-hub/modules");

const sequenceIdArray = ["CAA44792.1", "EEB16923.1"];
const patterns = {
  "EF-Hand": "[DNS]x[DNS]{FLIVWY}[DNESTG][DNQGHRK]{GP}[LIVMC][DENQSTAGC]x(2)[ED]",
  "zinc finger": "[C]x(2,5)[C]x(12,13)[H]x(2,5)[H]"
};

ncbiMotifScanner.scanProteinMotifsWithNcbiIds(sequenceIdArray, patterns).then((value) =>{
  // print the whole output object
  console.log(value);
});

Scan/predict nucleotide sequences based on NCBI accession ids

const {ncbiMotifScanner} = require("bioinformatics-hub/modules");

const sequenceIdArray = ["NM_009788.4"];
const patterns = {
  "pattern_1": "[A,T]x[C,G]x{A}[G]x(2,5)[C,G]"
};

ncbiMotifScanner.scanNucleotideMotifsWithNcbiIds(sequenceIdArray, patterns).then((value) =>{
  // print the whole output object
  console.log(value);
});

Optional API key

The ncbiMotifScanner.scanProteinMotifsWithNcbiIds() and ncbiMotifScanner.scanNucleotideMotifsWithNcbiIds() methods accept an optional third input parameter, apiKey. You can read more about the API key in the NCBI Sequence Retriever section.

Handle invalid input, numbers, gaps, blanks, and comment lines in user-provided sequence strings

This application has built-in functions to validate and clean up the input sequences. Each situation is handled differently when the bioInformaticsHub.setFastaSequences(string); method is called.

Invalid input
- The input string cannot be blank, empty, null or undefined.
- Two sequences should not share the same sequence Id.
- Sequence Ids can contain any characters. Sequences should only contain letters a-z, A-Z, "-" (optional, stands for a gap) and * (optional, stands for termination).
- A sequence that has a sequence Id but a blank sequence is invalid. In the example shown below, sequence 1 is not valid because its sequence is blank.
```
>sequence 1

>sequence 2
AAAATTTAAAATTT
```
Single sequence without a sequence Id
A single sequence without a sequence Id is a valid input sequence. This application will automatically add "Unnamed sequence 1" as its sequence Id.
The first sequence in multiple sequences does not have a sequence Id
If the first sequence in multiple sequences does not have a sequence Id, this application will automatically add "Unnamed sequence 1" as its sequence Id.
Numbers in sequences
Numbers in sequences will be removed automatically.
Blanks in sequences
Blanks, such as \r\n, in sequences will be removed automatically. Blank lines inside of a sequence will be removed as well.
Comment lines
Comment lines (started with ";") in a FASTA sequence will be removed automatically.

Based on above rules, "Sample sequence 3" and "Sample sequence 4" shown below will have the identical sequences when saved in BioinformaticsHub application.

>Sample sequence 3
; this is a comment line, will be removed by CommentLineRemover;
; Numbers and blanks in below sequence will also be removed 
AAACTCCTCTTTGATTCTTCTAG CTGTTTCACTATTGGGCAACCAGACACCAGAATGAGTACTAAAAAGT 12023423

CTCCTGAGGAACTGAAGAGGATTTTTG 55AAAAATATGCAGCCAAAGAAGGTGATCCAGACCAGTTGTCAAA

>Sample sequence 4
AAACTCCTCTTTGATTCTTCTAGCTGTTTCACTATTGGGCAACCAGACACCAGAATGAGTACTAAAAAGT
CTCCTGAGGAACTGAAGAGGATTTTTGAAAAATATGCAGCCAAAGAAGGTGATCCAGACCAGTTGTCAAA

Routine bioinformatics tools for protein/nucleotide sequences

Bioinformatics-hub contains a few methods (tools) which can be used for the routine sequence manipulation and analysis.

Check if the nucleotide or protein sequences contain invalid characters

Here is a demo on how to use Bioinformatics-hub to check if the input protein sequences or nucleotide sequences contain invalid letters.

Valid letters in a protein sequence include "*", "-", "X", and all single-letter amino acid symbols. Letters can be upper case or lower case.

Valid letters in a nucleotide sequence include "*", "-", "X", "N", "A", "T", "C", "G", and "U". Letters can be upper case or lower case.

Check if protein sequences contain invalid letters

// input FASTA sequences
>seq1
DKD GNGY
>seq2
BBCKKK

const BioinformaticsHub = require("bioinformatics-hub");
const bioinformaticsHub = new BioinformaticsHub();

const proteinSequences = ">seq1\nDKD GNGY\n>seq2\nBBCKKK";
const output = bioinformaticsHub.setFastaSequences(proteinSequences)
                                .getProteinSequenceAssistant()
                                .containsInvalidCharacters();
console.log(output);
// console output: { seq1: false, seq2: true }

Check if nucleotide sequences contain invalid letters

// input sequences in FASTA format
>seq1
AAAATTTAAAAA
>seq2
BBAATTCCGGTCA

const BioinformaticsHub = require("bioinformatics-hub");
const bioinformaticsHub = new BioinformaticsHub();

const dnaSequences = ">seq1\nAAAATTTAAAAA \n>seq2\nBBAATTCCGGTCA";
const output = bioinformaticsHub.setFastaSequences(dnaSequences)
                                .getNucleotideSequenceAssistant()
                                .containsInvalidCharacters();
console.log(output);
// console output: { seq1: false, seq2: true }
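For nucleotide sequences, the list of valid letters given above fits in a single regular expression. The sketch below shows the idea on already cleaned sequences; the names are hypothetical and this is not the package's implementation.

```javascript
// Valid nucleotide letters listed above, upper or lower case.
const VALID_NUCLEOTIDE_LETTERS = /^[ATCGUNX*-]*$/i;

// Hypothetical helper: true if the sequence has a letter outside the valid set.
function hasInvalidNucleotideLetters(sequence) {
  return !VALID_NUCLEOTIDE_LETTERS.test(sequence);
}

// Hypothetical helper: apply the check to an object keyed by sequence Id.
function checkNucleotideSequences(sequencesById) {
  return Object.fromEntries(
    Object.entries(sequencesById).map(([id, sequence]) => [
      id,
      hasInvalidNucleotideLetters(sequence)
    ])
  );
}

console.log(checkNucleotideSequences({ seq1: "AAAATTTAAAAA", seq2: "BBAATTCCGGTCA" }));
// { seq1: false, seq2: true }
```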

Get reverse complementary sequences

Bioinformatics-hub has methods to get reverse, complementary, and reverse complementary sequences from multiple input sequences in FASTA format. An error is thrown when getting complementary or reverse complementary sequences if any of the input sequences contains invalid letters. Only the following letters are considered valid for a nucleotide sequence: "*", "-", "X", "N", "A", "T", "C", "G", and "U". Letters can be upper case or lower case.

// input sequence in FASTA format
>seq1
AAAAUTTGCNN

const BioinformaticsHub = require("bioinformatics-hub");
const bioinformaticsHub = new BioinformaticsHub();

const dnaSequences = ">seq1\nAAAAUTTGCNN";

let output = bioinformaticsHub.setFastaSequences(dnaSequences) // the same sequences only need to be set once
                              .getNucleotideSequenceAssistant()
                              .getReverseSequences();
console.log(output);
// console output: { seq1: 'NNCGTTUAAAA' }

output = bioinformaticsHub.getNucleotideSequenceAssistant()
                          .getComplementarySequences();
console.log(output);
// console output: { seq1: 'TTTTAAACGNN' }

output = bioinformaticsHub.getNucleotideSequenceAssistant()
                          .getReverseComplementarySequences();
console.log(output);
// console output: { seq1: 'NNGCAAATTTT' }
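The three operations can be sketched in a few lines of plain JavaScript. The mapping below is inferred from the outputs above (both "T" and "U" pair with "A", and "A" is written back as "T"); the function names are hypothetical, and returning upper-case letters is an assumption of this sketch.

```javascript
// Base pairing inferred from the console outputs above.
const COMPLEMENT = { A: "T", T: "A", U: "A", C: "G", G: "C" };

// Complement each letter; N, X, "*" and "-" are kept as they are.
function complement(sequence) {
  return Array.from(sequence, (letter) => {
    const upper = letter.toUpperCase();
    if (COMPLEMENT[upper]) return COMPLEMENT[upper];
    if ("NX*-".includes(upper)) return upper;
    throw new Error(`Invalid nucleotide letter: ${letter}`);
  }).join("");
}

function reverse(sequence) {
  return Array.from(sequence).reverse().join("");
}

function reverseComplement(sequence) {
  return reverse(complement(sequence));
}

console.log(reverse("AAAAUTTGCNN")); // NNCGTTUAAAA
console.log(complement("AAAAUTTGCNN")); // TTTTAAACGNN
console.log(reverseComplement("AAAAUTTGCNN")); // NNGCAAATTTT
```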

Translate DNA/RNA sequences to protein sequences

Bioinformatics-hub can translate multiple DNA/RNA sequences into protein sequences. Each nucleotide sequence produces 6 protein sequences: three translated from the 5' terminal and three translated from the 3' terminal (the reverse complementary sequence). The translateToProtein() method throws an error if any of the input sequences contains invalid letters. Only the following letters are considered valid for a nucleotide sequence: "*", "-", "X", "N", "A", "T", "C", "G", and "U". Letters can be upper case or lower case.

// input sequences in FASTA format
>seq1
AAAAUTTGCNN
>seq2
xxxxxxactatgaattattttgagcataacggtgtattaatgaaatattttc

const BioinformaticsHub = require("bioinformatics-hub");
const bioinformaticsHub = new BioinformaticsHub();

const dnaSequences = ">seq1\nAAAAUTTGCNN\n>seq2\nxxxxxxactatgaattattttgagcataacggtgtattaatgaaatattttc";
let output = bioinformaticsHub.setFastaSequences(dnaSequences)
                              .getNucleotideSequenceAssistant()
                              .translateToProtein();
console.log(output);
// console output:
{ seq1:
   { '5\' to 3\' Frame 1': 'KIC',
     '5\' to 3\' Frame 2': 'KFA',
     '5\' to 3\' Frame 3': 'NLX',
     '3\' to 5\' Frame 1': 'XQI',
     '3\' to 5\' Frame 2': 'XKF',
     '3\' to 5\' Frame 3': 'ANF' },
  seq2:
   { '5\' to 3\' Frame 1': 'XXTMNYFEHNGVLMKYF',
     '5\' to 3\' Frame 2': 'XXL-IILSITVY--NIF',
     '5\' to 3\' Frame 3': 'XXYELF-A-RCINEIF',
     '3\' to 5\' Frame 1': 'ENISLIHRYAQNNS-XX',
     '3\' to 5\' Frame 2': 'KIFH-YTVMLKIIHSXX',
     '3\' to 5\' Frame 3': 'KYFINTPLCSK-FIXX' } }
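For readers who want to see how a single reading frame is produced, here is a minimal sketch using the standard genetic code. It is not the package's implementation: the names are hypothetical, stop codons are written as "-" to mirror the output above, and any codon containing a letter other than A, C, G, T or U becomes "X". This means the sketch does not resolve ambiguous codons, so "GCN" gives "X" here although the output above shows "A".

```javascript
// Standard genetic code in T, C, A, G order: index = 16 * first + 4 * second + third.
const BASES = "TCAG";
const AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// Translate one codon; "U" is treated as "T".
function translateCodon(codon) {
  const [first, second, third] = Array.from(
    codon.toUpperCase().replace(/U/g, "T"),
    (base) => BASES.indexOf(base)
  );
  if (first < 0 || second < 0 || third < 0) {
    return "X"; // codon contains a letter other than A, C, G, T, U
  }
  const aminoAcid = AMINO_ACIDS[16 * first + 4 * second + third];
  return aminoAcid === "*" ? "-" : aminoAcid; // stop codon shown as "-"
}

// Translate one forward reading frame; offset is 0, 1 or 2.
function translateFrame(sequence, offset) {
  let protein = "";
  for (let pos = offset; pos + 3 <= sequence.length; pos += 3) {
    protein += translateCodon(sequence.slice(pos, pos + 3));
  }
  return protein;
}

const seq2 = "xxxxxxactatgaattattttgagcataacggtgtattaatgaaatattttc";
console.log(translateFrame(seq2, 0)); // XXTMNYFEHNGVLMKYF
console.log(translateFrame(seq2, 1)); // XXL-IILSITVY--NIF
console.log(translateFrame(seq2, 2)); // XXYELF-A-RCINEIF
```

The three 3' to 5' frames follow from applying the same function to the reverse complementary sequence.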

Version changes

1.4.1
- Fixed bugs.
- Updated README.md
- New feature: Added a method to translate nucleotide sequences to protein sequences.
- New feature: Added a method to check if nucleotide or protein sequences contain invalid characters.
- New feature: Added methods to get reverse, complementary, and reverse complementary sequences.
1.4.0-SNAPSHOT and 1.4.1-SNAPSHOT
- Explored new features and fixed bugs.
1.3.4
- Fixed a bug caused by making requests over http instead of https by updating the ncbi-sequence-retriever version.
1.3.3
- Fixed a bug that caused an error when used with React by updating the ncbi-sequence-retriever version.
1.3.2
- Fixed a bug that caused an error when used with React.
1.3.1
- Bug fix.
1.3.0
- New feature: Given an array of ACCESSION ids from NCBI protein or nucleotide databases, the ncbiMotifScanner module can predict/scan motifs in these sequences.
1.2.0
- New feature: Retrieve nucleotide or protein sequences in batch by ACCESSION Ids from NCBI.
1.1.2
- Updated README.MD
1.1.1
- Updated README.MD
1.1.0
- New feature: Read multiple protein or nucleotide sequences.
- New feature: Remove any numbers, blanks, and comment lines (lines starting with ";") in each sequence.
- New feature: Retrieve a specific sequence by sequenceId.
- New feature: Retrieve all sequence Ids.
- New feature: Retrieve all sequences as javascript object indexed by sequence Ids.
- New feature: Scan and predict protein/nucleotide motifs in multiple sequences provided by user in FASTA format.

For internal Bioinformatics-Hub developers

Test

Jest is used to run all unit tests in this repository. Tests are written in the test folder, which mirrors the file structure of the .js files in the src folder. Use this command to run all the unit tests:

npm run test

Note: Each commit on the master branch must have 0 test failures.

ESLint

We use ESLint to enforce JavaScript formatting. The ESLint rules are defined in the .eslintrc.json file. Run this command to check for formatting issues:

npm run lint

Run this command to check js format and auto-fix issues:

npm run lint-fix
