
spacy v0.0.4 – JavaScript API for spaCy with Python REST API (2,228 downloads)

spaCy JS

JavaScript interface for accessing linguistic annotations provided by spaCy. This project is mostly experimental and was developed for fun to play around with different ways of mimicking spaCy's Python API.

The results will still be computed in Python and made available via a REST API. The JavaScript API resembles spaCy's Python API as closely as possible (with a few exceptions, as the values are all pre-computed and it's tricky to express complex recursive relationships).

const spacy = require('spacy');

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('This is a text about Facebook.');
    for (let ent of doc.ents) {
        console.log(ent.text, ent.label);
    }
    for (let token of doc) {
        console.log(token.text, token.pos, token.head.text);
    }
})();

⌛️ Installation

Installing the JavaScript library

You can install the JavaScript package via npm:

npm install spacy

Setting up the Python server

First, clone this repo and install the requirements. If you've installed the package via npm, you can also use the api/server.py and requirements.txt in your ./node_modules/spacy directory. It's recommended to use a virtual environment.

pip install -r requirements.txt

You can then run the REST API. By default, this will serve the API via 0.0.0.0:8080:

python api/server.py

If you like, you can install more models and specify a comma-separated list of models to load as the first argument when you run the server. All models need to be installed in the same environment.

python api/server.py en_core_web_sm,de_core_news_sm

| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| models | positional (str) | Comma-separated list of models to load and make available. | en_core_web_sm |
| --host, -ho | option (str) | Host to serve the API. | 0.0.0.0 |
| --port, -p | option (int) | Port to serve the API. | 8080 |

🎛 API

spacy.load

"Load" a spaCy model. This method mostly exists for consistency with the Python API. It sets up the REST API and nlp object, but doesn't actually load anything, since the models are already available via the REST API.

const nlp = spacy.load('en_core_web_sm');

| Argument | Type | Description |
| --- | --- | --- |
| model | String | Name of model to load, e.g. 'en_core_web_sm'. Needs to be available via the REST API. |
| api | String | Alternative URL of REST API. Defaults to http://0.0.0.0:8080. |
| RETURNS | Language | The nlp object. |

nlp async

The nlp object created by spacy.load can be called on a string of text and makes a request to the REST API. The easiest way to use it is to wrap the call in an async function and use await:

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('This is a text.');
})();

| Argument | Type | Description |
| --- | --- | --- |
| text | String | The text to process. |
| RETURNS | Doc | The processed Doc. |

Doc

Just like in the original API, the Doc object can be constructed with an array of words and spaces. It also takes an additional attrs object, which corresponds to the JSON-serialized linguistic annotations created in doc2json in api/server.py.

The Doc behaves just like the regular spaCy Doc – you can iterate over its tokens, index into individual tokens, access the Doc attributes and properties and also use native JavaScript methods like map and slice (since there's no real way to make Python's slice notation like doc[2:4] work).
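Since Python-style slice notation isn't available, Array methods fill the gap. A minimal sketch, using a plain array of token-like objects as a stand-in for a real Doc (which is iterable and indexable in the same way):

```javascript
// Stand-in array of token-like objects; a real Doc returned by nlp()
// can be iterated and indexed just like this.
const doc = [{ text: 'This' }, { text: 'is' }, { text: 'a' }, { text: 'text' }];

// Python's doc[2:4] becomes:
const middle = Array.from(doc).slice(2, 4);

// Collect all token texts:
const texts = Array.from(doc).map(token => token.text);

console.log(middle.map(t => t.text)); // [ 'a', 'text' ]
console.log(texts.join(' '));         // 'This is a text'
```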

Construction

import { Doc } from 'spacy';

const words = ['Hello', 'world', '!'];
const spaces = [true, false, false];
const doc = Doc(words, spaces);
console.log(doc.text); // 'Hello world!'

| Argument | Type | Description |
| --- | --- | --- |
| words | Array | The individual token texts. |
| spaces | Array | Whether the token at this position is followed by a space or not. |
| attrs | Object | JSON-serialized attributes, see doc2json. |
| RETURNS | Doc | The newly constructed Doc. |

Symbol iterator and token indexing

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Hello world');

    for (let token of doc) {
        console.log(token.text);
    }
    // Hello
    // world

    const token1 = doc[0];
    console.log(token1.text);
    // Hello
})();

Properties and Attributes

| Name | Type | Description |
| --- | --- | --- |
| text | String | The Doc text. |
| length | Number | The number of tokens in the Doc. |
| ents | Array | A list of Span objects, describing the named entities in the Doc. |
| sents | Array | A list of Span objects, describing the sentences in the Doc. |
| nounChunks | Array | A list of Span objects, describing the base noun phrases in the Doc. |
| cats | Object | The document categories predicted by the text classifier, if available in the model. |
| isTagged | Boolean | Whether the part-of-speech tagger has been applied to the Doc. |
| isParsed | Boolean | Whether the dependency parser has been applied to the Doc. |
| isSentenced | Boolean | Whether the sentence boundary detector has been applied to the Doc. |
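As an illustration of the shapes above, a sketch with static stand-in data (the entity values here are hypothetical samples; a real Doc produced by nlp() exposes the same property names):

```javascript
// Hand-built stand-in mirroring the Doc properties listed above.
const doc = {
    text: 'Apple is looking at buying a U.K. startup.',
    ents: [
        { text: 'Apple', label: 'ORG' },
        { text: 'U.K.', label: 'GPE' }
    ],
    isTagged: true,
    isParsed: true
};

// Iterate over the named entities, as you would on a real Doc.
for (let ent of doc.ents) {
    console.log(`${ent.text} (${ent.label})`);
}
// Apple (ORG)
// U.K. (GPE)
```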

Span

A Span object is a slice of a Doc and consists of one or more tokens. Just like in the original API, it can be constructed from a Doc, a start and end index and an optional label, or by slicing a Doc.

Construction

import { Doc, Span } from 'spacy';

const doc = Doc(['Hello', 'world', '!'], [true, false, false]);
const span = Span(doc, 1, 3);
console.log(span.text); // 'world!'

| Argument | Type | Description |
| --- | --- | --- |
| doc | Doc | The reference document. |
| start | Number | The start token index. |
| end | Number | The end token index. This is exclusive, i.e. "up to token X". |
| label | String | Optional label. |
| RETURNS | Span | The newly constructed Span. |
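To show how the exclusive end index maps onto text, here is a hypothetical helper (not part of the library) that rebuilds a span's text from words and spaces, roughly what Span(doc, 1, 3) yields in the example above:

```javascript
// Hypothetical sketch of how a Span's text can be derived from the parent
// Doc's words and spaces. The end index is exclusive, as in the table above.
function spanText(words, spaces, start, end) {
    return words
        .slice(start, end)
        .map((word, i) => word + (spaces[start + i] ? ' ' : ''))
        .join('')
        .trim();
}

console.log(spanText(['Hello', 'world', '!'], [true, false, false], 1, 3));
// 'world!'
```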

Properties and Attributes

| Name | Type | Description |
| --- | --- | --- |
| text | String | The Span text. |
| length | Number | The number of tokens in the Span. |
| doc | Doc | The parent Doc. |
| start | Number | The Span's start index in the parent document. |
| end | Number | The Span's end index in the parent document. |
| label | String | The Span's label, if available. |

Token

For token attributes that exist as string and ID versions (e.g. Token.pos vs. Token.pos_), only the string versions are exposed.

Usage Examples

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Hello world');

    for (let token of doc) {
        console.log(token.text, token.pos, token.isLower);
    }
    // Hello INTJ false
    // world NOUN true
})();

Properties and Attributes

| Name | Type | Description |
| --- | --- | --- |
| text | String | The token text. |
| whitespace | String | Whitespace character following the token, if available. |
| textWithWs | String | Token text with trailing whitespace. |
| orth | Number | ID of the token text. |
| doc | Doc | The parent Doc. |
| head | Token | The syntactic parent, or "governor", of this token. |
| i | Number | Index of the token in the parent document. |
| entType | String | The token's named entity type. |
| entIob | String | IOB code of the token's named entity tag. |
| lemma | String | The token's lemma, i.e. the base form. |
| norm | String | The normalised form of the token. |
| lower | String | The lowercase form of the token. |
| shape | String | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
| prefix | String | A length-N substring from the start of the token. Defaults to N=1. |
| suffix | String | Length-N substring from the end of the token. Defaults to N=3. |
| pos | String | The token's coarse-grained part-of-speech tag. |
| tag | String | The token's fine-grained part-of-speech tag. |
| isAlpha | Boolean | Does the token consist of alphabetic characters? |
| isAscii | Boolean | Does the token consist of ASCII characters? |
| isDigit | Boolean | Does the token consist of digits? |
| isLower | Boolean | Is the token lowercase? |
| isUpper | Boolean | Is the token uppercase? |
| isTitle | Boolean | Is the token titlecase? |
| isPunct | Boolean | Is the token punctuation? |
| isLeftPunct | Boolean | Is the token left punctuation? |
| isRightPunct | Boolean | Is the token right punctuation? |
| isSpace | Boolean | Is the token a whitespace character? |
| isBracket | Boolean | Is the token a bracket? |
| isCurrency | Boolean | Is the token a currency symbol? |
| likeUrl | Boolean | Does the token resemble a URL? |
| likeNum | Boolean | Does the token resemble a number? |
| likeEmail | Boolean | Does the token resemble an email address? |
| isOov | Boolean | Is the token out-of-vocabulary? |
| isStop | Boolean | Is the token a stop word? |
| isSentStart | Boolean | Does the token start a sentence? |
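For a feel of the boolean flags, a sketch with a hand-built stand-in token (hypothetical values for the text 'Hello'; a real Token from a processed Doc has the same shape):

```javascript
// Stand-in token illustrating a few of the flags listed above.
const token = {
    text: 'Hello',
    pos: 'INTJ',
    isAlpha: true,   // consists of alphabetic characters
    isLower: false,  // not all-lowercase
    isTitle: true,   // titlecase
    likeNum: false   // doesn't resemble a number
};

if (token.isTitle && token.isAlpha) {
    console.log(`${token.text} is a capitalised word (${token.pos})`);
}
// Hello is a capitalised word (INTJ)
```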

🔔 Run Tests

Python

First, make sure you have pytest and all dependencies installed. You can then run the tests by pointing pytest to /tests:

python -m pytest tests

JavaScript

This project uses Jest for testing. Make sure you have all dependencies and development dependencies installed. You can then run:

npm run test

To allow testing the code without a REST API providing the data, the test suite currently uses a mock of the Language class, which returns static data located in tests/util.js.
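The same idea can be sketched as a hypothetical mini-mock (the real mock lives in tests/util.js): instead of POSTing the text to the REST API, it resolves with canned annotations:

```javascript
// Hypothetical stand-in for the nlp object: no HTTP request is made;
// the promise resolves with static, precomputed-looking data instead.
function mockNlp(text) {
    const words = text.split(' ');
    return Promise.resolve({
        text,
        length: words.length,
        ents: []  // static data: this sketch reports no entities
    });
}

mockNlp('Hello world').then(doc => console.log(doc.length)); // 2
```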

✅ Ideas and Todos

  • [ ] Add Travis CI integration.
  • [ ] Improve JavaScript tests.
  • [ ] Experiment with Node.js bindings to make the Python integration easier. To be fair, running a separate API in an environment controlled by the user, rather than hiding it a few levels deep, is often much easier. But maybe there are some modern Node tricks this project could benefit from.