web-extract

v0.0.6

Published

20 days ago

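A lightweight web content extraction tool that uses a DSL configuration to scrape structured data quickly. It suits common scraping tasks such as blog posts and product lists.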

Downloads

346

0 High
0 Medium
0 Low

smilingxinyi

Web Content Extractor

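A lightweight web content extraction tool that uses a DSL configuration to scrape structured data quickly. It suits common scraping tasks such as blog posts and product lists.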

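Features

🎯 Declarative DSL configuration: extract web content without writing scraping code
🌳 Handles nested DOM structures automatically
🔍 Precise matching by CSS class name
📑 Multiple extraction methods (text / attribute / HTML)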

Quick Install

Basic Usage

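// Example: scrape the changelog of a tech blog
const cheerio = require('cheerio');
// fetchHtml and parseWithDSL are used as in the original example;
// this README does not show where they are imported from.

const run = async () => {
  const html = await fetchHtml('https://example.com/changelog');
  const $ = cheerio.load(html);

  const blogDSL = {
    name: 'Changelog',
    selector: 'article.post',  // locate the article containers
    method: 'texts',          // extraction mode
    multiple: true,           // match multiple elements
    validTags: [              // tags allowed in the output
      'h2:title',            // h2 with class "title"
      'p:content',            // paragraph with class "content"
      'time'                  // time element, any class
    ],
    limit: 5                  // maximum number of items to extract
  };

  // Run the DSL against the subtree rooted at main.content
  const results = parseWithDSL($('main.content'), blogDSL);
  console.log(results);
};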
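
The `validTags` entries above follow a `tag:class` pattern. As a rough sketch of how such a rule might map onto an ordinary CSS selector, the snippet below uses a hypothetical helper named `ruleToSelector`. It is not part of the package's documented API, and the library's real matching logic may differ.

```javascript
// Hypothetical helper, not part of the package's documented API.
// Converts a validTags rule such as 'h2:title' into a CSS selector.
function ruleToSelector(rule) {
  const [tag, className] = rule.split(':');
  // A rule without ':' (for example 'time') matches on tag name only.
  return className ? `${tag}.${className}` : tag;
}

const selectors = ['h2:title', 'p:content', 'time'].map(ruleToSelector);
console.log(selectors); // [ 'h2.title', 'p.content', 'time' ]
```

Reading the rules this way makes it easy to check them by hand in a browser console with `document.querySelectorAll` before running an extraction.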

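DSL Configuration

{
  name: 'field name',        // required: key used for this field in the result
  selector: 'css selector',  // required: locates the target element(s)
  method: 'text',            // optional: extraction method
                            // [text|attr|html|texts]
  args: ['href'],           // optional: arguments for the attr method only
  multiple: false,          // optional: whether to match multiple elements
  validTags: ['tag:class'], // optional: filter rules for texts mode
  limit: 10,                // optional: maximum number of items to extract
  join: '\n',               // optional: separator used to join multiple elements
  children: { ... }         // optional: nested configuration for child elements
}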
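
Since only `name` and `selector` are required, it can help to validate a configuration before running it. The following is a minimal sketch with a hypothetical `normalizeDSL` function. The defaults (`method: 'text'`, `multiple: false`) and the rule that `attr` needs `args` are assumptions inferred from the example values above, not confirmed library behavior.

```javascript
// Hypothetical validator; the defaults and the attr/args rule below are
// assumptions based on the reference above, not confirmed library behavior.
const METHODS = ['text', 'attr', 'html', 'texts'];

function normalizeDSL(config) {
  if (!config.name) throw new Error('DSL config requires "name"');
  if (!config.selector) throw new Error('DSL config requires "selector"');

  const method = config.method ?? 'text';
  if (!METHODS.includes(method)) {
    throw new Error(`Unknown method "${method}"`);
  }
  if (method === 'attr' && !(config.args && config.args.length > 0)) {
    throw new Error('method "attr" requires args, e.g. ["href"]');
  }
  // Explicit values in config override the assumed default for multiple.
  return { multiple: false, ...config, method };
}

const linkDSL = normalizeDSL({
  name: 'links',
  selector: 'a.more',
  method: 'attr',
  args: ['href'],
});
console.log(linkDSL.multiple); // false
```

Failing early on a missing `selector` or a misspelled `method` tends to be easier to debug than an empty result set.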

Combined Usage

// Scrape multi-level product information
const productDSL = {
  name: 'products',
  selector: '.product-list',
  multiple: true,
  children: {
    name: 'detail',
    selector: '.item',
    method: 'texts',
    validTags: [
      'h3:name',        // product name
      'span:price',      // price label
      'img:product-img'  // image element
    ]
  }
};
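
To see how a nested configuration nests, the sketch below walks the `children` chain and lists each level with its combined selector. `describeDSL` is a hypothetical inspection helper, and it assumes a child selector is resolved relative to its parent's matches, which this README implies but does not state.

```javascript
// Hypothetical helper for inspecting a nested DSL; not part of the package.
// Assumes a child selector is resolved relative to its parent's matches.
function describeDSL(config, parentName = '', parentSelector = '') {
  const name = parentName ? `${parentName}.${config.name}` : config.name;
  const selector = parentSelector
    ? `${parentSelector} ${config.selector}`
    : config.selector;
  const rows = [{ name, selector }];
  if (config.children) {
    // Recurse into the single nested child configuration.
    rows.push(...describeDSL(config.children, name, selector));
  }
  return rows;
}

const nestedDSL = {
  name: 'products',
  selector: '.product-list',
  multiple: true,
  children: { name: 'detail', selector: '.item', method: 'texts' },
};

const levels = describeDSL(nestedDSL).map((row) => `${row.name} -> ${row.selector}`);
console.log(levels);
// [ 'products -> .product-list', 'products.detail -> .product-list .item' ]
```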
