happynodetokenizer

v7.1.0

A simple, Twitter-aware tokenizer.

😄 HappyNodeTokenizer

A basic Twitter-aware tokenizer for JavaScript environments.

A TypeScript port of HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz.

Features

  • Accurate port of both libraries (verify with npm run test)
  • TypeScript definitions
  • Uses generators and memoization for efficiency
  • Customizable and easy to use

Install

  npm install --save happynodetokenizer

Usage

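HappyNodeTokenizer exports a function called tokenizer(), which takes an optional configuration object (see "The Options Object" below) and returns a tokenizer function. Calling that function with a string returns a generator function that yields token objects.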

Example

import { tokenizer } from 'happynodetokenizer';

const text = 'RT @ #happyfuncoding: this is a typical Twitter tweet :-)';

// these are the default options
const opts = {
  'mode': 'stanford',
  'normalize': undefined,
  'preserveCase': true,
};

// create a tokenizer instance with our options
const myTokenizer = tokenizer(opts);

// calling myTokenizer returns a generator function
const tokenGenerator = myTokenizer(text);

// you can turn the generator into an array of token objects like this:
const tokens = [...tokenGenerator()];

// you can also convert token objects to an array of strings like this:
const values = Array.from(tokens, (token) => token.value);
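
Because tokens come from a generator, they can also be consumed lazily instead of being spread into an array. The following is a minimal sketch, assuming the generator follows the standard JavaScript iteration protocol. Both take and fakeTokens are hypothetical helpers for illustration and are not part of HappyNodeTokenizer; fakeTokens stands in for tokenGenerator() so the sketch runs without the library.

```javascript
// Hypothetical helper: yield at most `n` items from any iterable, then stop.
function* take(iterable, n) {
  if (n <= 0) return;
  let count = 0;
  for (const item of iterable) {
    yield item;
    count += 1;
    if (count >= n) return;
  }
}

// Stand-in for tokenGenerator(); the token shape is copied from the README output.
function* fakeTokens() {
  yield { start: 22, end: 25, tag: 'word', value: 'this' };
  yield { start: 27, end: 28, tag: 'word', value: 'is' };
  yield { start: 30, end: 30, tag: 'word', value: 'a' };
}

// Only the first two tokens are ever produced.
const firstTwo = [...take(fakeTokens(), 2)].map((token) => token.value);
```

With the real library, the same helper would presumably be called as take(tokenGenerator(), 2).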

Output

The tokens variable in the above example will look like this:

[
  { end: 1, start: 0, tag: 'word', value: 'RT' },
  { end: 3, start: 3, tag: 'punct', value: '@' },
  { end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },
  { end: 20, start: 20, tag: 'punct', value: ':' },
  { end: 25, start: 22, tag: 'word', value: 'this' },
  { end: 28, start: 27, tag: 'word', value: 'is' },
  { end: 30, start: 30, tag: 'word', value: 'a' },
  { end: 38, start: 32, tag: 'word', value: 'typical' },
  { end: 46, start: 40, tag: 'word', value: 'Twitter' },
  { end: 52, start: 48, tag: 'word', value: 'tweet' },
  { end: 56, start: 54, tag: 'emoticon', value: ':-)' }
]
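
The start and end values in this output appear to be zero-based, inclusive character offsets into the input string (for example, "this" spans 22 to 25). Assuming that holds, the matched text can be recovered from the input with slice(). The helper name sourceText is hypothetical and not part of the library.

```javascript
const text = 'RT @ #happyfuncoding: this is a typical Twitter tweet :-)';

// Hypothetical helper: `end` is treated as inclusive, so add 1 for slice().
function sourceText(input, token) {
  return input.slice(token.start, token.end + 1);
}

// Offsets copied from the README output above.
const hashtag = sourceText(text, { start: 5, end: 19 });
const emoticon = sourceText(text, { start: 54, end: 56 });
```

Slicing the input this way returns the text with its original casing, whatever casing the token's value has.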

When preserveCase in the options object is false, a token object may also contain a variation property, which holds the token as originally matched whenever that differs from the value property. For example:

[
  { end: 1, start: 0, tag: 'word', value: 'rt', variation: 'RT' },
  { end: 3, start: 3, tag: 'punct', value: '@' },
  { end: 19, start: 5, tag: 'hashtag', value: '#happyfuncoding' },
  ...
  { end: 46, start: 40, tag: 'word', value: 'twitter', variation: 'Twitter' },
  ...
]
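
Since variation is only present when the matched text differs from value, code that needs the original surface form has to fall back to value. A minimal sketch, assuming the token shape shown above; originalForm is a hypothetical helper name:

```javascript
// Hypothetical helper: prefer the originally matched form when present.
function originalForm(token) {
  return token.variation ?? token.value;
}

// Token objects copied from the README output above.
const lowercased = [
  { end: 1, start: 0, tag: 'word', value: 'rt', variation: 'RT' },
  { end: 3, start: 3, tag: 'punct', value: '@' },
];

const restored = lowercased.map(originalForm);
```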

The Options Object

The options object and its properties are optional. The defaults are:

{
  'mode': 'stanford',
  'normalize': undefined,
  'preserveCase': true,
};

mode

string - valid options: stanford (default), or dlatk

stanford mode uses the original HappyFunTokenizer pattern. See GitHub.

dlatk mode uses the modified HappierFunTokenizing pattern. See GitHub.

normalize

string - valid options: "NFC" | "NFD" | "NFKC" | "NFKD" (default = undefined)

The Unicode normalization form to apply to strings (e.g., when set, mañana becomes manana).

Normalization is disabled when set to null or undefined (default).
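
For reference, the four form names are the ones accepted by the built-in String.prototype.normalize(). The sketch below shows what the standard method does on its own. Whether HappyNodeTokenizer also strips combining marks to get from mañana to manana is an assumption here, so that step is shown separately.

```javascript
const word = 'ma\u00F1ana'; // "mañana" with a precomposed ñ (U+00F1)

// NFC keeps ñ as a single code point.
const composed = word.normalize('NFC');

// NFD splits ñ into "n" followed by a combining tilde (U+0303).
const decomposed = word.normalize('NFD');

// Assumed extra step: drop the combining marks left behind by decomposition.
const stripped = decomposed.replace(/\p{M}/gu, '');
```

Normalization alone changes how ñ is encoded but not how it looks; the plain "manana" only appears once the combining mark is removed.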

preserveCase

boolean - valid options: true (default), or false

Preserves the case of the input string if true, otherwise all tokens are converted to lowercase. Does not affect emoticons.

Tags

HappyNodeTokenizer yields token objects. As the output above shows, each token object has four properties: start, end, value and tag (plus variation when preserveCase is false and the matched text differs from value). The value is the token itself, start and end locate the token in the input string, and the tag is a descriptor from the following table, depending on which mode you are using:

| Tag | Stanford | DLATK | Example |
| ------------- | ------------- | ----- | -------- |
| phone | ✔️ | ✔️ | +1 (800) 123-4567 |
| url | ❌ | ✔️ | http://www.youtube.com |
| url_scheme | ❌ | ✔️ | http:// |
| url_authority | ❌ | ✔️ | [0-3] |
| url_path_query | ❌ | ✔️ | /index.html?s=search |
| htmltag | ❌ | ✔️ | <em class='grumpy'> |
| emoticon | ✔️ | ✔️ | >:( |
| username | ✔️ | ✔️ | @somefaketwitterhandle |
| hashtag | ✔️ | ✔️ | #tokenizing |
| punct | ✔️ | ✔️ | , |
| word | ✔️ | ✔️ | hello |
| <UNK> | ✔️ | ✔️ | (anything left unmatched) |
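
Tags make it straightforward to filter or summarize tokens after tokenizing. A minimal sketch, assuming tokens have already been collected into an array with the shape shown in the Output section; countByTag is a hypothetical helper and not part of the library:

```javascript
// Hypothetical helper: count how many tokens carry each tag.
function countByTag(tokens) {
  const counts = {};
  for (const { tag } of tokens) {
    counts[tag] = (counts[tag] ?? 0) + 1;
  }
  return counts;
}

// Sample token objects copied from the README output.
const sample = [
  { start: 5, end: 19, tag: 'hashtag', value: '#happyfuncoding' },
  { start: 22, end: 25, tag: 'word', value: 'this' },
  { start: 27, end: 28, tag: 'word', value: 'is' },
  { start: 54, end: 56, tag: 'emoticon', value: ':-)' },
];

const counts = countByTag(sample);

// Keep only plain words, dropping hashtags, emoticons and punctuation.
const words = sample.filter((token) => token.tag === 'word').map((token) => token.value);
```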

Testing

To compare the results of HappyNodeTokenizer against HappyFunTokenizer and HappierFunTokenizing, run:

npm run test

The goal of this project is to provide an accurate port of HappyFunTokenizer and HappierFunTokenizing. Therefore, any pull requests with test failures will not be accepted.

Acknowledgements

Based on HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz.

Uses the "he" library by Mathias Bynens under the MIT license.

License

(C) 2017-24 P. Hughes. All rights reserved.

Shared under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.