stopword-sami

v0.6.3

Sami stopword lists (Northern, Lule and South Sami) for natural language processing, plus the code to create and refine them. Example uses include search engines and machine learning.

Downloads

22

NPM version NPM downloads MIT License

What

WIP! A project to generate stopword lists for the Sami languages, starting with Northern Sami, Lule Sami and South Sami.

Financial support

The Sami Parliament is financially supporting the project. Hooray! This will make it possible to finish the project within a few months.

| Sámediggi | Sámedigge | Saemiedigkie |
| --------- | --------- | ------------ |
| The Sami Parliament | The Sami Parliament | The Sami Parliament |

Other Sami languages

These are not planned as of now, but could be if we find text sources and someone to help us verify the lists.

  • [ ] Kildin Sami
  • [ ] Skolt Sami (East Sami)
  • [ ] Inari Sami
  • [ ] Pite Sami
  • [ ] Ume Sami

When the quality of the stopword lists is good enough, they will be added to the stopword module. Northern Sami will most likely be the first to reach that quality, followed by Lule Sami and South Sami.

Why stopword lists for Sami languages?

For example, to be able to build good search engines or do machine learning on content written in the different Sami languages.
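As a rough illustration of the search-engine case, a stopword list is typically used to drop very common words from a text before it is indexed. The sketch below is self-contained and does not use this package's API: `removeStopwords` is a hypothetical helper, and the tokens are placeholders, not real Sami words.

```javascript
// Remove stopwords from a tokenised text before indexing it.
// `stopwords` is any array of lowercase words. The values used below are
// placeholders, not real Sami words and not the lists from this repo.
function removeStopwords(tokens, stopwords) {
  const blocked = new Set(stopwords);
  // Compare case-insensitively, but keep the original token in the output.
  return tokens.filter((token) => !blocked.has(token.toLowerCase()));
}

const placeholderStopwords = ['stop1', 'stop2'];
const tokens = ['Stop1', 'term1', 'stop2', 'term2'];
const indexed = removeStopwords(tokens, placeholderStopwords);
console.log(indexed); // [ 'term1', 'term2' ]
```

Only the content-bearing tokens are left for the index, which is the effect a good stopword list should have on real text.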

Install

If you can avoid crawling and just use the content from this repo, please do. That means less unnecessary traffic on nrk.no. The content is here and published to npm, and it will be updated every month, or more often if you need it.

npm install stopword-sami

To crawl and calculate

To get more content, you first need more IDs. Run the crawlIds command first, then the crawlContent command, and finally the calcStopwords command.

npm run crawlIds && npm run crawlContent && npm run calcStopwords
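To give a feel for what a calculation step like this can produce, the sketch below ranks words by document frequency, meaning the number of crawled documents each word appears in. This is one common way to find stopword candidates and is only an assumption here. It is not the actual algorithm behind calcStopwords or the stopword-trainer, and `rankByDocumentFrequency` and the sample documents are hypothetical.

```javascript
// Rank words by document frequency: how many documents each word occurs in.
function rankByDocumentFrequency(documents) {
  const counts = new Map();
  for (const text of documents) {
    // Count each word once per document. \p{L} matches any Unicode letter,
    // so letters such as á, č and ŋ stay inside their words.
    const words = new Set(text.toLowerCase().match(/\p{L}+/gu) || []);
    for (const word of words) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  // Highest document frequency first; ties are broken alphabetically.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([word, documentFrequency]) => ({ word, documentFrequency }));
}

// Placeholder documents, not crawled content.
const sampleDocuments = ['aa bb cc', 'aa bb dd', 'aa ee'];
const ranked = rankByDocumentFrequency(sampleDocuments);
console.log(ranked[0]); // { word: 'aa', documentFrequency: 3 }
```

Words near the top of such a ranking are candidates only. They still need manual verification before they belong in a stopword list.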

Work ahead

  • [x] Generate lists of IDs to crawl. Use nrk-sapmi-crawler to build lists of documents to crawl. These documents will later be crawled, and their text content will be the basis for ongoing stopword training. The more content, the better the lists.

  • [x] Crawl content (work in progress). Once the ID lists cover enough content and nrk-sapmi-crawler can also crawl documents, crawl the actual documents.

  • [X] Start training stopword lists. Run the stopword-trainer on the crawled text. After that, we'll ask for help to manually verify the lists and to suggest words for a red-list for each Sami language. The stopword lists are black-lists: words that you don't want. Every now and then, a word you do want sneaks into a stopword list. Adding it to the red-list makes sure it won't end up in the finished stopword list.

  • [X] Apply for funding for the last part of the project.

  • [ ] Find people who know the Lule and South Sami languages to verify the lists. North Sami is already covered.

  • [ ] Verify lists and generate red-lists. We need help generating red-lists so the lists can be cleaned and cut off.

  • [ ] Decide cutoff. How many words to keep in each list.

  • [ ] Add lists that have beta quality to stopword module.

  • [ ] Update daq-proc and daq-proc demo to showcase new stopword lists.

  • [X] Lightning talk at NDC Oslo

  • [ ] Blog posts to market the lists
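The red-list and cutoff steps in the list above can be sketched as a small filter. This is a minimal illustration under stated assumptions, not the project's code: `finishStopwordList` is a hypothetical helper, the candidate list is assumed to be ordered with the most frequent words first, and the words are placeholders.

```javascript
// Build a finished stopword list from ranked candidates.
// `candidates`: words ordered most frequent first (assumed ordering).
// `redList`: words that must never end up in the stopword list.
// `cutoff`: how many words to keep in the finished list.
function finishStopwordList(candidates, redList, cutoff) {
  const wanted = new Set(redList);
  // Remove red-listed words first, then cut the list to its final length.
  return candidates.filter((word) => !wanted.has(word)).slice(0, cutoff);
}

// Placeholder words, not real stopwords.
const candidates = ['w1', 'w2', 'w3', 'w4', 'w5'];
const redList = ['w2'];
const finished = finishStopwordList(candidates, redList, 3);
console.log(finished); // [ 'w1', 'w3', 'w4' ]
```

Applying the red-list before the cutoff is a design choice made for this sketch: it keeps the finished list at the chosen length even when some candidates are removed.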

Help needed

We need help verifying the generated lists, and help understanding the traits of the different Sami languages when that time comes.

Also, to generate and train stopword lists, we need text sources. For Northern Sami we will get what we need, but for Lule Sami and South Sami the material is a little thin. Maybe we just have to wait for NRK to create more content. For the rest of the languages, we have no source so far. If you know of a data set, or a source that could be used to generate one, please give us a hint!

Applications: Markdown to Word/PDF conversion

So far, Pandoc has worked well:

pandoc application-draft-02.md -f markdown -s -o application-draft.docx