ooooevan-jieba

v1.0.3

test my jieba module

Downloads

3

Readme

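Learning jieba word segmentation

Implements the cut, cutHMM, cutAll, and cutForSearch methods, among others.

Known issues: the code has not been cleaned up yet; cutHMM is too slow and unoptimized (for long strings, building the tree is slow and runs out of memory); and so on.

Reference: https://github.com/ustcdane/annotated_jieba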

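Problems encountered along the way

1. With the same dictionary, my own cut(str, true) and hmm methods give results that differ from the official ones, while cut(str) gives the same result

For example: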

str = '到MI京研大厦'
nodejieba.cut(str); // ['到', 'M', 'I', '京', '研', '大厦']
nodejieba.cut(str, true); // ['到', 'MI', '京研', '大厦']
nodejieba.cutHMM(str);  // [ '到', 'MI', '京', '研大厦' ]
cut(str, true);   // [ '到M', 'I京', '研', '大厦' ]
hmm(str);  // ['到M', 'I京', '研大', '厦']

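Stepping through the source in a debugger, I found two regex matches in hmm. Chinese and non-Chinese text have to be handled separately.

/([\u4E00-\u9FD5]+)/;   // separates Chinese from non-Chinese text
/([a-zA-Z0-9]+(?:\.\d+)?%?)/; // matches letters, digits, and other symbols, so letters and digits can be separated from other characters; the trailing % is odd, is it meant to match percentages?
/([a-zA-Z0-9]+(?:\.\d)?)|([0-9]+(?:\.\d)?%?)/;  // matches percentages; only digits can be followed by a percent sign

After adding the corresponding code to the hmm function, the problem above was resolved.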
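
The splitting step can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper name splitBlocks and the shape of the returned objects are assumptions, and only the first two regexes above are used. Chinese runs would then go to the HMM, while the other blocks are emitted as they are.

```javascript
// Hypothetical helper (not part of the package): split a string into
// Chinese blocks and non-Chinese blocks before HMM segmentation.
const RE_HAN = /([\u4E00-\u9FD5]+)/;           // runs of Chinese characters
const RE_SKIP = /([a-zA-Z0-9]+(?:\.\d+)?%?)/;  // letters, digits, decimals, optional %

function splitBlocks(str) {
  const blocks = [];
  // split() with a capturing group keeps the matched runs in the result.
  for (const part of str.split(RE_HAN)) {
    if (!part) continue; // skip the empty strings split() produces at the edges
    if (RE_HAN.test(part)) {
      blocks.push({ text: part, han: true });   // candidate for the HMM
    } else {
      for (const piece of part.split(RE_SKIP)) {
        if (piece) blocks.push({ text: piece, han: false }); // emitted unchanged
      }
    }
  }
  return blocks;
}

console.log(splitBlocks('到MI京研大厦').map((b) => b.text));
// [ '到', 'MI', '京研大厦' ]
```

With this split, 'MI' never reaches the HMM, so it can no longer be glued to a neighbouring Chinese character as in the '到M' / 'I京' result above.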

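2. The hmm function builds a tree from the BEMS states without any optimization; this is slow, and long input runs out of memory

By comparison, I found that my implementation differs from the official one. My approach was:

const treeObj = getWordTree(str); // turn the string into one big object holding the data for every character node
const treeArr = recursion(treeObj);  // turn the big object into an equivalent 2D array of 2**n arrays, i.e. 2**n different paths
const conarr = getAppropriate(treeArr); // pick the best array out of the 2**n arrays

This approach is very literal. I wrote it straight from the description of the HMM algorithm, and it keeps far too much unneeded data, which can blow up memory. There are 2**n paths rather than 4**n because each position offers only two types, not four: the first character can only be B or S, never E or M, and each later character has only two possible types given the type of the character before it.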

str = '程序员'
treeObj = {
  b:{
    next:{
      e:{
        next:{...}
      },
      m:{
        next:{...}
      }
    },
    value: -7.5608
  },
  s:{
    next:{
      s:{
        next:{...}
      },
      b:{
        next:{...}
      }
    },
    value: -11.0326
  }
}
treeArr = [ // length is 8
  [{type:"B",value:-7.5608},{type:"E",value:-8.363},{type:"B",value:-9.5638}],
  [],
  ...
]
conarr = [ // optimal path
  {type: "B", value: -7.5608126080925, string: "程"},
  {type: "M", value: -9.818418731874154, string: "序"},
  {type: "E", value: -6.2620785681194855, string: "员"}
]

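When testing longer strings, it runs out of memory, because this process stores too many unused variables.

I switched to the official approach. The idea is as follows: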
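
To get a feel for how fast the number of stored paths grows, here is a small counting sketch. It is only an illustration: the function name countPaths is hypothetical, and the successor table is read off the tree shown above (after B comes E or M, after S comes S or B, and so on).

```javascript
// Allowed next state for each BEMS state, as in the tree above.
const NEXT = { B: ['E', 'M'], E: ['B', 'S'], M: ['E', 'M'], S: ['B', 'S'] };

// Hypothetical helper: count every BEMS path of length n that the
// exhaustive tree approach would have to materialize.
function countPaths(n) {
  if (n <= 0) return 0;
  let counts = { B: 1, E: 0, M: 0, S: 1 }; // the first character is B or S
  for (let i = 1; i < n; i++) {
    const next = { B: 0, E: 0, M: 0, S: 0 };
    for (const s of Object.keys(counts)) {
      for (const t of NEXT[s]) next[t] += counts[s];
    }
    counts = next;
  }
  return counts.B + counts.E + counts.M + counts.S;
}

console.log(countPaths(3));  // 8, matching the length of treeArr above
console.log(countPaths(30)); // 1073741824
```

Thirty characters already mean over a billion arrays, which is why the exhaustive version cannot handle long input.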

str = '程序员'
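const prevTrans = {   // possible types of the previous state
  B: ['E', 'S'],
  E: ['B', 'M'],
  M: ['B', 'M'],
  S: ['E', 'S']
}

"probabilities for '程'" = {B:'', E:'', M:'', S:''}
"probabilities for '序'" = {B:'', E:'', M:'', S:''}
"probabilities for '员'" = {B:'', E:'', M:'', S:''}
/*
The probabilities of the first character are fixed; each later character depends on the one before it. Take B for '序' (which has B, E, M, S): the possible combinations with the previous type are EB or SB, so compute both and keep the one with the higher probability.
So the probability table stores the probability of the combination: each character's probability builds on the probabilities of the characters before it. This is similar to the maximum-probability segmentation in the cut function, except that cut works from back to front and this works from front to back.
"probabilities for '程'" = {B:'', E:'', M:'', S:''} // probabilities of each type for the string '程'
"probabilities for '序'" = {B:'', E:'', M:'', S:''} // probabilities of each type for the string '程序'
"probabilities for '员'" = {B:'', E:'', M:'', S:''} // probabilities of each type for the string '程序员'
*/

// Once the probabilities are known, just save the path while computing them
// From the type with the highest final probability, the path can be read off
path = {B: "BEB", E: "BME", M: "BMM", S: "BES"}

After switching to the official approach, things are indeed much better, and there is no more worrying about using too much memory.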
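
The idea above is the Viterbi algorithm. Below is a minimal sketch of it, not the module's actual hmm function: the names viterbi, tagsToWords, and toy are hypothetical, and every probability in the toy model is made up for illustration (a real model comes from training data). Only four probabilities and four paths are kept per character, so memory grows linearly with the input.

```javascript
const STATES = ['B', 'E', 'M', 'S'];
const PREV_TRANS = { B: ['E', 'S'], E: ['B', 'M'], M: ['B', 'M'], S: ['E', 'S'] };
const MIN = -3.14e100; // stand-in for log(0)

// Log probability lookup that falls back to MIN for missing entries.
const get = (table, key) => (table && key in table ? table[key] : MIN);

// Returns the most likely tag string, e.g. "BME", for a non-empty char array.
function viterbi(chars, start, trans, emit) {
  if (chars.length === 0) return '';
  let prob = {};
  let path = {};
  for (const s of STATES) {
    prob[s] = get(start, s) + get(emit[s], chars[0]);
    path[s] = s;
  }
  for (let i = 1; i < chars.length; i++) {
    const nextProb = {};
    const nextPath = {};
    for (const s of STATES) {
      // Keep only the better of the two allowed previous states.
      let best = null;
      let bestProb = -Infinity;
      for (const p of PREV_TRANS[s]) {
        const candidate = prob[p] + get(trans[p], s);
        if (candidate > bestProb) { bestProb = candidate; best = p; }
      }
      nextProb[s] = bestProb + get(emit[s], chars[i]);
      nextPath[s] = path[best] + s; // save the path alongside the probability
    }
    prob = nextProb;
    path = nextPath;
  }
  // A complete segmentation must end a word, so the last tag is E or S.
  return prob.E >= prob.S ? path.E : path.S;
}

// Turn a tag string back into words: E and S close a word.
function tagsToWords(chars, tags) {
  const words = [];
  let current = '';
  for (let i = 0; i < chars.length; i++) {
    current += chars[i];
    if (tags[i] === 'E' || tags[i] === 'S') { words.push(current); current = ''; }
  }
  if (current) words.push(current);
  return words;
}

// Toy model with made-up numbers and uniform emissions.
const uniform = { '程': 0, '序': 0, '员': 0 };
const toy = {
  start: { B: Math.log(0.6), S: Math.log(0.4) },
  trans: {
    B: { E: Math.log(0.4), M: Math.log(0.6) },
    E: { B: Math.log(0.5), S: Math.log(0.5) },
    M: { E: Math.log(0.7), M: Math.log(0.3) },
    S: { B: Math.log(0.5), S: Math.log(0.5) },
  },
  emit: { B: uniform, E: uniform, M: uniform, S: uniform },
};

const chars = [...'程序员'];
const tags = viterbi(chars, toy.start, toy.trans, toy.emit);
console.log(tags, tagsToWords(chars, tags)); // BME [ '程序员' ]
```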

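3. Path problem with custom dictionaries

Before testing, the path was handled with path.resolve(__dirname, './xx.utf8'); writing the path directly also worked, because different locations did not have to be considered yet.
Once the code became a standalone module, users can call the load method to use a custom dictionary, and at that point the custom dictionary cannot be found, because the path is still relative to the module itself.
Running test.js from the test directory finds the file and works normally; from any other directory the file cannot be found and the dictionary cannot be used.

├jieba
│  ├─dict  # default dictionaries
│  ├─test
│  │  │─doc
│  │  │─test.js  # test location
│  │  └─user.dict.utf8  # dictionary file
│  ├─index.js  # module location
│


From an initial look, it seems the path module cannot solve this either, because it resolves relative to the module's own file, and inside the module there is no way to know the path relative to the user's file; it probably needs a solution along the lines of Node module packaging.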