@settingdust/article-extractor

v0.3.2

Published

5 months ago

settingdust

website webpage article extractor readability metadata jsonld json-ld browser

Article Extractor

article-extractor is a tool that extracts content from a webpage.

Installation & Usage

Node.js

yarn add @settingdust/article-extractor
npm i @settingdust/article-extractor

import { extract } from '@settingdust/article-extractor'

API

See index.ts for more detail.

`extract`

Extracts data from an HTML string or a Document object.

export declare function extract(html: string | Document): Promise<DefaultExtracted>;
export declare function extract<T>(html: string | Document, options: Omit<ExtractOptions<T>, 'extractors'>): Promise<DefaultExtracted>;
export declare function extract<T>(html: string | Document, options?: ExtractOptions<T>): Promise<NestedPartialK<T & TitleExtracted & UrlExtracted>>;

DefaultExtracted is the default result type; it is inferred from the default extractors.

interface DefaultExtracted {
  title: string
  url: string
  content?: string
  author?: { url?: string; name?: string }
  date?: { published?: Date; modified?: Date }
}
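
Because only title and url are required in DefaultExtracted, code that consumes the result has to handle missing fields. The following is a minimal sketch of that, not part of the package: `formatArticle` and the `sample` object are hypothetical, and the interface is copied locally so the snippet runs on its own.

```typescript
// Local copy of the DefaultExtracted shape documented above,
// so the sketch is self-contained.
interface DefaultExtracted {
  title: string
  url: string
  content?: string
  author?: { url?: string; name?: string }
  date?: { published?: Date; modified?: Date }
}

// Hypothetical helper (not part of the package): builds a one-line summary
// and falls back to placeholders when optional fields are missing.
function formatArticle(article: DefaultExtracted): string {
  const author = article.author?.name ?? 'unknown author'
  const published = article.date?.published
    ? article.date.published.toISOString().slice(0, 10)
    : 'unknown date'
  return `${article.title} (${author}, ${published}) <${article.url}>`
}

// Hand-written sample; in real use the value would come from `await extract(html)`.
const sample: DefaultExtracted = {
  title: 'Hello',
  url: 'https://exam.ple/hello',
  date: { published: new Date('2024-01-02T00:00:00Z') }
}

const summary = formatArticle(sample)
console.log(summary)
```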

`ExtractOptions`

Type parameter T should be inferred from extractors.

interface ExtractOptions<T> {
  /**
   * URL of the page. **May differ from the URL in the final result.**
   * @see urlExtractor
   */
  url?: string
  /**
   * Extractors to run.
   * @see defaultExtractors
   */
  extractors?: Extractor<T>[]
  /**
   * Options passed to sanitize-html.
   * @see https://www.npmjs.com/package/sanitize-html
   */
  sanitizeHtml?: sanitize.IOptions
  /**
   * Language used when parsing dates.
   * @see mapToNearestLanguage
   */
  lang?: string
}

Default Extractors

import('./author-extractor')
import('./author-url-extractor')
import('./published-date-extractor')
import('./modified-date-extractor')
import('./content-extractor')

Extractor

Custom extractors are supported.

type ExtractOperator = (document: Document, url?: string) => string[]

interface Extractor<T> {
  /**
   * Operators that fetch raw strings from the document.
   */
  operators: ExtractOperators
  /**
   * Processes the raw strings from {@link operators}, for example by validating and filtering them.
   */
  processor: (value: string[], context?: ExtractorContext) => string[]
  /**
   * Picks one string as the final result and transforms it to the target type (e.g. {@link Date}).
   */
  selector: (value: string[], title?: string, context?: ExtractorContext) => T
}

interface ExtractorContext {
  url?: string
  sanitizeHtml?: sanitize.IOptions
  lang?: string
}
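
The three parts of an extractor can be read as a small pipeline: operators collect raw strings, the processor cleans them, and the selector turns them into the final value. The sketch below illustrates that reading under stated assumptions. `DocumentLike`, `ExtractorSketch`, `tagsExtractor`, `runExtractor` and `fakeDocument` are all hypothetical names, a plain array of pairs stands in for ExtractOperators, and the library's actual execution order and internals may differ.

```typescript
// Minimal stand-in for the DOM Document so the sketch runs without a browser (assumption).
interface DocumentLike {
  querySelectorAll(selector: string): ArrayLike<{ textContent: string | null }>
}

type ExtractOperator = (document: DocumentLike, url?: string) => string[]

interface ExtractorContext {
  url?: string
  lang?: string
}

// Simplified mirror of the Extractor interface above; a plain array of
// [key, operator] pairs stands in for ExtractOperators.
interface ExtractorSketch<T> {
  operators: [string, ExtractOperator][]
  processor: (value: string[], context?: ExtractorContext) => string[]
  selector: (value: string[], title?: string, context?: ExtractorContext) => T
}

// Hypothetical custom extractor: collects tag names from elements matching `.tag`.
const tagsExtractor: ExtractorSketch<{ tags: string[] }> = {
  operators: [
    ['tag-class', (document) =>
      Array.from(document.querySelectorAll('.tag'), (el) => el.textContent ?? '')]
  ],
  // Trim, drop empty strings, then remove duplicates while keeping first-seen order.
  processor: (value) =>
    value
      .map((v) => v.trim())
      .filter((v) => v.length > 0)
      .filter((v, i, all) => all.indexOf(v) === i),
  selector: (value) => ({ tags: value })
}

// Assumed pipeline order: operators -> processor -> selector.
function runExtractor<T>(
  extractor: ExtractorSketch<T>,
  document: DocumentLike,
  context?: ExtractorContext
): T {
  let raw: string[] = []
  for (const [, operator] of extractor.operators) {
    raw = raw.concat(operator(document, context?.url))
  }
  return extractor.selector(extractor.processor(raw, context), undefined, context)
}

// Fake document used only for this demonstration.
const fakeDocument: DocumentLike = {
  querySelectorAll: (selector) =>
    selector === '.tag'
      ? [{ textContent: ' news ' }, { textContent: 'tech' }, { textContent: 'news' }, { textContent: '' }]
      : []
}

const tagsResult = runExtractor(tagsExtractor, fakeDocument)
console.log(tagsResult.tags)
```

Keeping validation in the processor and type conversion in the selector means each step stays small and can be tested separately.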

/**
 * Manages operators as an ordered list that can also be addressed by index.
 * Note: digit-only string keys do not keep their insertion order in an object, so their index has to be set manually.
 */
declare class ExtractOperators extends Array<[string, ExtractOperator]> {
  constructor(items?: {
    [key: string]: ExtractOperator;
  });

  push(...items: [key: string, extractor: ExtractOperator][]): number;

  set(key: string, extractor: ExtractOperator, index?: number): this;

  get: (key: string) => [string, ExtractOperator];
}
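
The note about digit strings refers to standard JavaScript behaviour: an object enumerates integer-like keys first, in ascending numeric order, before any other string keys. The short sketch below shows the effect with made-up keys and why an array of pairs avoids it; it does not use the package itself.

```typescript
// Integer-like keys are enumerated first, in ascending numeric order,
// regardless of the order in which they were written.
const byObject = Object.keys({ meta: 0, '10': 0, jsonld: 0, '2': 0 })

// An array of [key, value] pairs keeps exactly the order given.
const pairs: [string, number][] = [['meta', 0], ['10', 0], ['jsonld', 0], ['2', 0]]
const byPairs = pairs.map(([key]) => key)

console.log(byObject) // [ '2', '10', 'meta', 'jsonld' ]
console.log(byPairs)  // [ 'meta', '10', 'jsonld', '2' ]
```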

Content

Page content is extracted by Readability. There is an API for adding a custom CSS selector extractor or other custom extractors.

import { selectors } from './src/selector-extractors'

selectors.put(new URLPattern('*://exam.ple/*'), {
  selector: ['#id'],
  ignored: ['.bad', 'header']
})
