markdown-crawler

v1.0.11

A web crawler that extracts content from web pages and converts it to clean Markdown, with support for code blocks and GitHub Flavored Markdown.

Downloads

548

A web crawler optimized for AI reading that converts web content into structured Markdown. It strips noise from each page and extracts the core content, producing clean text that AI models can parse and process. Its distinguishing feature is that it merges all related pages (the target page and every page in its subdirectories) into a single YAML file, with each page's content stored as clearly structured Markdown.

Why Is It Suitable for AI Reading?

  • 🧠 Intelligently extracts main content, removing ads, navigation bars, and other distractions
  • 🎯 Preserves logical structure and semantic relationships of articles
  • 📋 Converts to standardized Markdown format for easy AI parsing
  • 🔄 Automatically handles special characters and encoding
  • 📊 Integrates all pages in YAML format for batch processing

Features

  • 🚀 Uses Playwright for web crawling, supporting modern web pages and dynamic content
  • 📝 Uses Mozilla's Readability algorithm for intelligent content extraction
  • ✨ Automatically converts to structured Markdown format, removing unnecessary styles and noise
  • 🎨 Supports GitHub Flavored Markdown (GFM), preserving important formatting
  • 💻 Supports syntax highlighting for code blocks, maintaining technical document readability
  • 🔗 Automatically crawls all related pages, integrating them into a single file

Data Integration Benefits

  • 📚 Automatically crawls target URL and all subdirectory pages
  • 🗂️ Integrates all page content into a single YAML file
  • 📖 Maintains the integrity of titles and content for each page
  • 🎯 Generates Markdown format suitable for both human reading and AI processing
  • 🔍 Facilitates quick browsing and searching of large amounts of related content
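The README does not spell out how the crawler decides which pages count as "subdirectory pages". One plausible reading is a same-origin, path-prefix rule. The Python sketch below illustrates that rule only; it is an assumption about the behavior, not the tool's confirmed logic, and the function name `in_scope` is hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is the start page or sits beneath its path.

    Assumed rule (not taken from the tool's source): same scheme and host,
    and the candidate path equals the start path or lies under it.
    """
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)

    # Pages on a different scheme or host are out of scope.
    if (start.scheme, start.netloc) != (candidate.scheme, candidate.netloc):
        return False

    # Treat the start path as a directory so "/docs" does not match "/docs-old".
    prefix = start.path if start.path.endswith("/") else start.path + "/"
    return candidate.path == start.path or candidate.path.startswith(prefix)


print(in_scope("https://example.com/docs", "https://example.com/docs/api"))
print(in_scope("https://example.com/docs", "https://example.com/docs-old"))
```

Under this rule the first call is in scope and the second is not, because the prefix check is applied to whole path segments.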

Usage

# Basic usage
npx markdown-crawler <url> <output-filename>

# Example: Crawl website and save as output.yaml
npx markdown-crawler https://example.com output

# Wrap URLs that contain spaces in double quotes
npx markdown-crawler "https://example.com/my page" output

# The .yaml extension is added to the output filename automatically
# Results are saved in the current working directory
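If the crawler is launched from a script rather than typed by hand, passing the arguments as a list avoids the quoting problem for URLs with spaces. The Python sketch below assumes `npx` is available on the PATH and uses only the `<url> <output-filename>` form shown above. The helper names `build_command`, `expected_output_path`, and `crawl` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(url: str, output_name: str) -> list[str]:
    """Return the argument list for `npx markdown-crawler <url> <output-filename>`."""
    # Each argument is a separate list item, so a URL that contains
    # spaces needs no extra quoting when handed to subprocess.run.
    return ["npx", "markdown-crawler", url, output_name]


def expected_output_path(output_name: str) -> Path:
    """Path where the README says the result is saved: <cwd>/<output_name>.yaml."""
    return Path.cwd() / f"{output_name}.yaml"


def crawl(url: str, output_name: str) -> Path:
    """Run the crawler, then return the expected output path (existence is not checked)."""
    subprocess.run(build_command(url, output_name), check=True)
    return expected_output_path(output_name)
```

Calling `crawl("https://example.com/my page", "output")` would run the same command as the quoted example above and return the path of `output.yaml` in the current working directory.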

Output Format

The tool writes all crawled pages to a single YAML file, as a list of entries that each hold a page title and its Markdown content:

- title: "Main Page Title"
  content: |
    # Main Page Content
    Here is the main page content...

- title: "Subpage 1 Title"
  content: |
    # Subpage 1 Content
    Here is subpage 1 content...

- title: "Subpage 2 Title"
  content: |
    # Subpage 2 Content
    Here is subpage 2 content...
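Because the output is a plain list of `title`/`content` entries, downstream processing can stay simple. The Python sketch below assumes the file has already been parsed into a list of dictionaries, for example with PyYAML's `yaml.safe_load` (a third-party library; the README does not name any consumer library). The names `Page`, `find_pages`, and `to_prompt` are hypothetical.

```python
from typing import TypedDict


class Page(TypedDict):
    title: str
    content: str


def find_pages(pages: list[Page], keyword: str) -> list[str]:
    """Return titles of pages whose title or content mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        page["title"]
        for page in pages
        if needle in page["title"].lower() or needle in page["content"].lower()
    ]


def to_prompt(pages: list[Page], max_chars: int = 4000) -> str:
    """Concatenate whole pages into one text block of at most max_chars characters."""
    parts: list[str] = []
    used = 0
    for page in pages:
        block = f"## {page['title']}\n\n{page['content'].strip()}\n\n"
        if used + len(block) > max_chars:
            break  # stop at the first page that would exceed the budget
        parts.append(block)
        used += len(block)
    return "".join(parts)


# Stand-in for data loaded from output.yaml with a YAML parser.
pages: list[Page] = [
    {"title": "Main Page Title", "content": "# Main Page Content\nHere is the main page content...\n"},
    {"title": "Subpage 1 Title", "content": "# Subpage 1 Content\nHere is subpage 1 content...\n"},
]
print(find_pages(pages, "subpage 1"))  # → ['Subpage 1 Title']
```

Cutting on page boundaries, rather than mid-page, keeps each title attached to its content when the combined text has to fit a size limit.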

Output characteristics:

  • Automatically extracts title and main content from each page
  • Maintains content hierarchy and formatting
  • Removes unnecessary styles and scripts
  • Generates clear and readable Markdown format
  • Suitable for both human reading and AI model processing

System Requirements

  • Node.js >= 16.0.0
  • npm or yarn package manager
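The Node.js requirement can be checked before a scripted run. The Python sketch below parses the string printed by `node --version` (for example `v18.17.0`) and compares it with the 16.0.0 minimum listed above. The names `parse_node_version` and `meets_requirement` are hypothetical, and obtaining the version string is left to the caller.

```python
import re

MIN_NODE = (16, 0, 0)  # minimum version from the requirements list


def parse_node_version(text: str) -> tuple[int, int, int]:
    """Parse a string such as 'v18.17.0' into (major, minor, patch)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized Node.js version string: {text!r}")
    major, minor, patch = (int(group) for group in match.groups())
    return (major, minor, patch)


def meets_requirement(text: str) -> bool:
    """Return True if the version string is at least MIN_NODE."""
    return parse_node_version(text) >= MIN_NODE


print(meets_requirement("v16.0.0"))   # → True
print(meets_requirement("v14.21.3"))  # → False
```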

License

This project is licensed under the MIT License - see the LICENSE file for details.