# rag-crawler

Crawl a website to generate a knowledge file for RAG.
## Installation

```sh
npm i -g rag-crawler
# or
yarn global add rag-crawler
```
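After installing, you can confirm the CLI is available using the documented `--version` flag:

```sh
$ rag-crawler --version
```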
## Usage

```
Usage: rag-crawler [options] <startUrl> [outPath]

Crawl a website to generate a knowledge file for RAG

Examples:
  rag-crawler https://sigoden.github.io/mynotes/languages/
  rag-crawler https://sigoden.github.io/mynotes/languages/ data.json
  rag-crawler https://sigoden.github.io/mynotes/languages/ pages/
  rag-crawler https://github.com/sigoden/mynotes/tree/main/src/languages/

Arguments:
  startUrl                     The URL to start crawling from. Don't forget the trailing slash. [required]
  outPath                      The output path. If omitted, output to stdout

Options:
  --preset <value>             Use predefined crawl options (default: "auto")
  -c, --max-connections <int>  Maximum concurrent connections when crawling the pages
  -e, --exclude <values>       Comma-separated list of path names to exclude from crawling
  --extract <css-selector>     Extract specific content using a CSS selector. If omitted, extract all content
  --no-log                     Disable logging
  -V, --version                output the version number
  -h, --help                   display help for command
```
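These options can be combined in a single run. For instance (a hypothetical invocation; the excluded path and CSS selector are assumptions chosen for illustration):

```sh
# Crawl with up to 8 concurrent connections, skip paths containing "shell",
# and keep only content matching the <article> selector
$ rag-crawler https://sigoden.github.io/mynotes/languages/ out.json -c 8 -e shell --extract 'article'
```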
### Output to stdout

```sh
$ rag-crawler https://sigoden.github.io/mynotes/languages/
```

```json
[
  {
    "path": "https://sigoden.github.io/mynotes/languages/",
    "text": "# Languages ..."
  },
  {
    "path": "https://sigoden.github.io/mynotes/languages/shell.html",
    "text": "# Shell ..."
  }
  ...
]
```
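Because the stdout output is a plain JSON array, it composes with standard JSON tooling; for example, with jq (shown here as one option):

```sh
# Print only the crawled page paths
$ rag-crawler https://sigoden.github.io/mynotes/languages/ --no-log | jq -r '.[].path'
```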
### Output to a JSON file

```sh
$ rag-crawler https://sigoden.github.io/mynotes/languages/ knowledge.json
```
### Output to separate files

```sh
$ rag-crawler https://sigoden.github.io/mynotes/languages/ pages/
...
$ tree pages
pages
└── mynotes
    ├── languages
    │   ├── markdown.md
    │   ├── nodejs.md
    │   ├── rust.md
    │   └── shell.md
    └── languages.md
```
### Crawl Markdown files in a GitHub tree

```sh
$ rag-crawler https://github.com/sigoden/mynotes/tree/main/src/languages/ knowledge.json
```

Many documentation sites host their source Markdown files on GitHub. The crawler has been optimized to crawl these files directly from GitHub.
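As an illustration of why fetching from GitHub directly works well, a `github.com` tree URL can be rewritten into raw file URLs. The sketch below mirrors GitHub's public URL layout; it is an assumption for illustration, not rag-crawler's actual implementation:

```ts
// Sketch: rewrite a GitHub tree URL into the raw URL of a file under it.
// Illustrative only; rag-crawler's internals may differ.
function rawUrl(treeUrl: string, file: string): string {
  const m = treeUrl.match(
    /^https:\/\/github\.com\/([^/]+)\/([^/]+)\/tree\/([^/]+)\/(.*)$/
  );
  if (!m) throw new Error(`not a GitHub tree URL: ${treeUrl}`);
  const [, owner, repo, branch, dir] = m;
  return `https://raw.githubusercontent.com/${owner}/${repo}/${branch}/${dir}${file}`;
}

// rawUrl("https://github.com/sigoden/mynotes/tree/main/src/languages/", "shell.md")
// => "https://raw.githubusercontent.com/sigoden/mynotes/main/src/languages/shell.md"
```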
## Preset

A preset consists of predefined crawl options. You can review the built-in presets at ./src/preset.ts.
### Why use a preset?

Let's use GitHub Wiki as an example. To enhance scraping quality, we need to configure both `--exclude` and `--extract`:

```sh
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --exclude _history --extract '#wiki-body'
```

Since all GitHub Wiki websites share these crawl options, we can define a preset for reusability:
```ts
{
  name: "github-wiki",
  test: "github.com/([^/]+)/([^/]+)/wiki",
  options: {
    exclude: ["_history"],
    extract: "#wiki-body",
  },
}
```
This allows for a simplified command:

```sh
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --preset github-wiki
# or
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --preset auto
# or
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json # `--preset` defaults to "auto"
```
When the preset is set to `auto`, rag-crawler will automatically determine the appropriate preset. It does this by checking if the `startUrl` matches the preset's `test` regex.
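A minimal sketch of that selection step, assuming presets shaped like the object above (the function name is illustrative, not part of the package's API):

```ts
interface Preset {
  name: string;
  test: string; // regex matched against startUrl
  options: { exclude?: string[]; extract?: string };
}

// Return the first preset whose `test` regex matches the start URL.
// Illustrative only; rag-crawler's actual resolution logic may differ.
function resolvePreset(startUrl: string, presets: Preset[]): Preset | undefined {
  return presets.find((preset) => new RegExp(preset.test).test(startUrl));
}

// resolvePreset("https://github.com/sigoden/aichat/wiki", presets)?.name
// => "github-wiki"
```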
### Custom presets

You can add custom presets by editing the `~/.rag-crawler.json` file:
```json
[
  {
    "name": "github-wiki",
    "test": "github.com/([^/]+)/([^/]+)/wiki",
    "options": {
      "exclude": ["_history"],
      "extract": "#wiki-body"
    }
  },
  ...
]
```
## License

The project is under the MIT License. Refer to the LICENSE file for detailed information.