@mmiscool/scrape_to_markdown

v1.0.7

Published

10 days ago

A library to scrape articles from the web and convert them into Markdown format.

Downloads

288

Vulnerabilities: 0 High, 0 Medium, 0 Low

mmiscool

scraper markdown web cli npm

Scrape to Markdown

This project extracts the URL-fetching and Markdown-generation functionality from clipper.js and packages it as a library that you can add to your project with npm.

Scrape to Markdown is a lightweight JavaScript library that scrapes articles or web pages and converts their content into Markdown. This is particularly useful for archiving, content generation, or data processing tasks.

Features

Extracts main content from web pages.
Converts HTML content into Markdown using Turndown.
Handles GitHub Flavored Markdown (GFM) for better compatibility.
Fallback mechanism for handling URLs that return raw Markdown.
Built-in support for readability parsing via @mozilla/readability.

Installation

Install the library using npm:

npm install @mmiscool/scrape_to_markdown

Usage

Import the Library

import { scrapeToMarkdown } from '@mmiscool/scrape_to_markdown';

Scrape a Web Page to Markdown

(async () => {
    const url = 'https://example.com/some-article';
    try {
        const markdown = await scrapeToMarkdown(url);
        console.log(markdown);
    } catch (error) {
        console.error('Error scraping the URL:', error);
    }
})();

Fallback for Raw Markdown URLs

The library can handle cases where the URL directly provides Markdown content. It will return the raw Markdown if no HTML is detected.
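
The README does not describe how the library decides that no HTML is present, so the following is only a sketch of one possible check, not the library's actual logic. `looksLikeHtml` and `toMarkdown` are hypothetical helpers: a body counts as HTML when it contains a doctype or a common structural tag, and anything else is passed through as raw Markdown.

```javascript
// Hypothetical helper, not part of @mmiscool/scrape_to_markdown.
// Treats a body as HTML if it has a doctype or a common structural tag.
function looksLikeHtml(body) {
    return /<!doctype html|<(html|head|body|article|div|p)[\s>]/i.test(body);
}

// Returns the body untouched when it already appears to be Markdown;
// otherwise hands it to the supplied HTML-to-Markdown converter.
function toMarkdown(body, convertHtml) {
    return looksLikeHtml(body) ? convertHtml(body) : body;
}
```

A heuristic like this misclassifies Markdown files that embed inline HTML such as `<div>`, so checking the response `Content-Type` header first would likely be more reliable where the server sets it.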

API

`scrapeToMarkdown(url: string): Promise<string>`

Scrapes the content from the provided URL and converts it to Markdown.

Parameters:
- url: The URL of the web page to scrape.

Returns: A Promise resolving to the Markdown content.

`extract_from_url(page: string): Promise<string>`

Uses JSDOM and @mozilla/readability to extract and convert the primary content from a web page into Markdown.

`extract_from_html(html: string): Promise<string>`

Converts raw HTML input into Markdown.

`oldScrapeToMarkdown(url: string): Promise<string>`

Legacy scraper for handling edge cases or simpler scraping needs.
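
The README does not say whether `scrapeToMarkdown` ever falls back to the legacy scraper on its own. If you want that behavior, one option is a caller-side wrapper like the sketch below. `scrapeWithFallback` is a hypothetical helper, not part of the library; it accepts any list of functions with the shape `(url) => Promise<string>` and returns the first non-empty result.

```javascript
// Caller-side pattern, not library behavior: try each scraper in order and
// return the first non-empty Markdown string. If every scraper throws or
// returns an empty string, rethrow the most recent failure.
async function scrapeWithFallback(url, scrapers) {
    let lastError = new Error('No scrapers were provided');
    for (const scrape of scrapers) {
        try {
            const markdown = await scrape(url);
            if (typeof markdown === 'string' && markdown.trim() !== '') {
                return markdown;
            }
            lastError = new Error('Scraper returned empty Markdown');
        } catch (error) {
            lastError = error;
        }
    }
    throw lastError;
}
```

Assuming both functions are exported as listed above, a call might look like `scrapeWithFallback(url, [scrapeToMarkdown, oldScrapeToMarkdown])`.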

Dependencies

This library relies on the following NPM packages:

axios for HTTP requests.
cheerio for parsing HTML content.
turndown for converting HTML to Markdown.
turndown-plugin-gfm for GitHub Flavored Markdown support.
@mozilla/readability for extracting readable content from web pages.
jsdom for DOM simulation.

Examples

Scraping a Blog Post

import { scrapeToMarkdown } from '@mmiscool/scrape_to_markdown';

(async () => {
    const url = 'https://medium.com/some-blog-post';
    const markdown = await scrapeToMarkdown(url);
    console.log(markdown);
})();
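
For archiving or data processing you will often have more than one URL. The sketch below shows one way to scrape a batch so that a single failing page does not abort the rest. `scrapeAll` is a hypothetical helper, not part of the library; its `scrape` argument is any function with the shape `(url) => Promise<string>`, such as `scrapeToMarkdown`.

```javascript
// Hypothetical helper: scrape several URLs concurrently and report each
// outcome separately, so one failed page does not abort the whole batch.
async function scrapeAll(urls, scrape) {
    const settled = await Promise.allSettled(urls.map((url) => scrape(url)));
    return settled.map((outcome, index) => ({
        url: urls[index],
        ok: outcome.status === 'fulfilled',
        markdown: outcome.status === 'fulfilled' ? outcome.value : null,
        error: outcome.status === 'rejected' ? outcome.reason : null,
    }));
}
```

All requests start at once here, which is fine for a handful of pages. For larger batches you would probably want to limit concurrency to avoid overloading the target site.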

Converting Raw HTML to Markdown

import { extract_from_html } from '@mmiscool/scrape_to_markdown';

const html = `
    <article>
        <h1>Example Article</h1>
        <p>This is an example paragraph.</p>
    </article>
`;

(async () => {
    const markdown = await extract_from_html(html);
    console.log(markdown);
})();

Credits

Clipper uses the following open source libraries:

Mozilla Readability - For parsing article content
Turndown - For converting HTML to Markdown
Crawlee - For crawling websites

License

Apache 2.0
