web-extract

v0.0.6

Published

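A lightweight web content extraction tool that scrapes structured data quickly from a DSL-based configuration. It suits common scraping scenarios such as blog posts and product lists.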

Downloads

346

Readme

Web Content Extractor

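A lightweight web content extraction tool that scrapes structured data quickly from a DSL-based configuration. It suits common scraping scenarios such as blog posts and product lists.

Feature Highlights

  • 🎯 Declarative DSL configuration: scrape web content without writing extraction code
  • 🌳 Handles nested DOM structures automatically
  • 🔍 Supports precise matching by CSS class name
  • 📑 Multiple extraction methods (text/attribute/HTML)

Quick Install

Basic Usage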

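// Example: scrape the changelog of a tech blog
// Note: fetchHtml and parseWithDSL are used as in the original README,
// which does not show where they are imported from.
const cheerio = require('cheerio');

const run = async () => {
  const html = await fetchHtml('https://example.com/changelog');
  const $ = cheerio.load(html);

  const blogDSL = {
    name: 'Changelog',
    selector: 'article.post',  // locate the article container
    method: 'texts',           // extraction mode
    multiple: true,            // match multiple elements
    validTags: [               // allowed tags
      'h2:title',              // h2 with class "title"
      'p:content',             // paragraph with class "content"
      'time'                   // time tag
    ],
    limit: 5                   // maximum number of items to extract
  };

  const results = parseWithDSL($('main.content'), blogDSL);
  console.log(results);
};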
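
The comments in the example above suggest that a validTags rule of the form tag:class means "this tag with this class". As a rough illustration of that reading only, the sketch below converts such rules into ordinary CSS selectors. The helper validTagToSelector is hypothetical and is not part of web-extract; how the library actually interprets these rules internally is not documented here.

```javascript
// Hypothetical helper (not part of web-extract): turns a validTags rule such
// as 'h2:title' into a CSS selector such as 'h2.title'.
// A bare tag such as 'time' is returned unchanged.
function validTagToSelector(rule) {
  const [tag, className] = rule.split(':');
  if (!tag) {
    throw new Error(`Invalid validTags rule: "${rule}"`);
  }
  return className ? `${tag}.${className}` : tag;
}

const selectors = ['h2:title', 'p:content', 'time'].map(validTagToSelector);
console.log(selectors.join(', ')); // prints "h2.title, p.content, time"
```

Under this assumption, the rules in blogDSL would restrict text extraction to h2.title, p.content, and time elements inside each matched article.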

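DSL Configuration

{
  name: 'field name',         // required: key of the result field
  selector: 'CSS selector',   // required: locates the target element
  method: 'text',             // optional: extraction method
                              // [text|attr|html|texts]
  args: ['href'],             // optional: argument used only by the attr method
  multiple: false,            // optional: whether to match multiple elements
  validTags: ['tag:class'],   // optional: filter rules for texts mode
  limit: 10,                  // optional: maximum number of items to scrape
  join: '\n',                 // optional: separator used to join multiple elements
  children: { ... }           // optional: nested child element configuration
}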
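
Because a configuration is a plain object, it can be checked before it is handed to the extractor. The following is a minimal sketch of such a check, based only on the fields listed above. The helper normalizeDSL is hypothetical and not part of web-extract. The defaults it fills in (method 'text', multiple false) and the requirement that attr comes with args are assumptions taken from the sample values, not confirmed library behavior.

```javascript
// Hypothetical helper (not part of web-extract): validates a DSL node against
// the documented fields and fills in assumed defaults.
const METHODS = ['text', 'attr', 'html', 'texts'];

function normalizeDSL(node) {
  if (typeof node.name !== 'string' || node.name === '') {
    throw new Error('DSL node requires a non-empty "name"');
  }
  if (typeof node.selector !== 'string' || node.selector === '') {
    throw new Error(`DSL node "${node.name}" requires a "selector"`);
  }
  const method = node.method ?? 'text'; // assumed default
  if (!METHODS.includes(method)) {
    throw new Error(`Unknown method "${method}" in "${node.name}"`);
  }
  // Assumption: the attr method needs an attribute name in args, e.g. ['href'].
  if (method === 'attr' && !(Array.isArray(node.args) && node.args.length > 0)) {
    throw new Error(`Method "attr" in "${node.name}" needs args`);
  }
  const result = { ...node, method, multiple: node.multiple ?? false };
  if (node.children) {
    result.children = normalizeDSL(node.children); // validate nested config too
  }
  return result;
}

const linkDSL = normalizeDSL({ name: 'links', selector: 'a', method: 'attr', args: ['href'] });
console.log(linkDSL.multiple); // prints false
```

Failing early on a misspelled method or a missing selector tends to be easier to debug than an empty result from a scrape.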

Combined Usage

// Multi-level product information scraping
const productDSL = {
  name: 'products',
  selector: '.product-list',
  multiple: true,
  children: {
    name: 'detail',
    selector: '.item',
    method: 'texts',
    validTags: [
      'h3:name',         // product name
      'span:price',      // price label
      'img:product-img'  // image element
    ]
  }
};
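
One way to reason about a nested configuration like this is to list the selector path each level would be scoped to. The sketch below assumes that a child selector is matched inside the elements matched by its parent; the README does not state this explicitly. The helper selectorPaths is hypothetical and not part of web-extract.

```javascript
// Hypothetical helper (not part of web-extract): lists each level of a nested
// DSL with the selector path it would be scoped to, assuming children are
// matched inside their parent's elements.
function selectorPaths(node, parentPath = '') {
  const path = parentPath ? `${parentPath} ${node.selector}` : node.selector;
  const own = [{ name: node.name, path }];
  return node.children ? own.concat(selectorPaths(node.children, path)) : own;
}

const nestedDSL = {
  name: 'products',
  selector: '.product-list',
  children: { name: 'detail', selector: '.item' }
};

for (const { name, path } of selectorPaths(nestedDSL)) {
  console.log(`${name}: ${path}`);
}
// prints:
// products: .product-list
// detail: .product-list .item
```

Printing the paths this way can help confirm that each level targets the intended part of the page before running a scrape.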