@settingdust/article-extractor
v0.3.2
Article Extractor
article-extractor is a tool that extracts content from a webpage.
Installation & Usage
Node.js
yarn add @settingdust/article-extractor
npm i @settingdust/article-extractor
import { extract } from '@settingdust/article-extractor'
API
See index.ts for more detail.
extract
Extracts data from an HTML string or a Document object.
export declare function extract(html: string | Document): Promise<DefaultExtracted>;
export declare function extract<T>(html: string | Document, options: Omit<ExtractOptions<T>, 'extractors'>): Promise<DefaultExtracted>;
export declare function extract<T>(html: string | Document, options?: ExtractOptions<T>): Promise<NestedPartialK<T & TitleExtracted & UrlExtracted>>;
DefaultExtracted is the default result type. It is inferred from the default extractors.
interface DefaultExtracted {
title: string
url: string
content?: string
author?: { url?: string; name?: string }
date?: { published?: Date; modified?: Date }
}
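Since every field other than title and url is optional, callers have to guard the nested properties. The following is a minimal sketch of consuming a result. The interface is copied locally from the declaration above so the snippet runs without the package, and `describeArticle` is a hypothetical helper, not part of the library.

```typescript
// Local copy of the documented result shape, so the sketch is self-contained.
interface DefaultExtracted {
  title: string
  url: string
  content?: string
  author?: { url?: string; name?: string }
  date?: { published?: Date; modified?: Date }
}

// Hypothetical helper (not part of the package): builds a one-line summary
// and falls back to placeholders when optional fields are missing.
function describeArticle(article: DefaultExtracted): string {
  const author = article.author?.name ?? 'unknown author'
  const published = article.date?.published
  // Keep only the YYYY-MM-DD part of the ISO timestamp.
  const when = published ? published.toISOString().slice(0, 10) : 'undated'
  return `${article.title} (${author}, ${when}) <${article.url}>`
}

const sample: DefaultExtracted = {
  title: 'Hello',
  url: 'https://exam.ple/hello',
  date: { published: new Date('2022-05-01T00:00:00Z') }
}
const summary = describeArticle(sample)
// summary === 'Hello (unknown author, 2022-05-01) <https://exam.ple/hello>'
```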
ExtractOptions
Type parameter T should be inferred from extractors.
interface ExtractOptions<T> {
/**
* Url for the page. **may not be the final result**
* @see urlExtractor
*/
url?: string
/**
* Extractors to run during extraction.
* @see defaultExtractors
*/
extractors?: Extractor<T>[]
/**
* Options of sanitize-html
* @see https://www.npmjs.com/package/sanitize-html
*/
sanitizeHtml?: sanitize.IOptions
/**
* Language hint used when parsing dates
* @see mapToNearestLanguage
*/
lang?: string
}
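As an illustration, an options object might look like the one below. This is a sketch under stated assumptions: `SanitizeOptionsLike` is a simplified local stand-in for `sanitize.IOptions` that models only `allowedTags` and `allowedAttributes`, and `BasicExtractOptions` mirrors ExtractOptions without the `extractors` field so the snippet runs without the package installed.

```typescript
// Simplified stand-in for sanitize-html's IOptions; only two options are modelled.
interface SanitizeOptionsLike {
  allowedTags?: string[]
  allowedAttributes?: { [tag: string]: string[] }
}

// Local mirror of ExtractOptions without `extractors` (assumption for this sketch).
interface BasicExtractOptions {
  url?: string
  sanitizeHtml?: SanitizeOptionsLike
  lang?: string
}

const options: BasicExtractOptions = {
  // Hint for the page URL; per the doc comment it may not be the final result.
  url: 'https://exam.ple/posts/1',
  // Keep only a small set of tags and attributes in the extracted content.
  sanitizeHtml: {
    allowedTags: ['p', 'a', 'img', 'blockquote'],
    allowedAttributes: { a: ['href'], img: ['src', 'alt'] }
  },
  // Language hint used when parsing dates.
  lang: 'en'
}

// With the package installed, the object would be passed as the second argument:
//   const article = await extract(html, options)
```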
Default Extractors
import('./author-extractor')
import('./author-url-extractor')
import('./published-date-extractor')
import('./modified-date-extractor')
import('./content-extractor')
Extractor
Custom extractors are supported.
type ExtractOperator = (document: Document, url?: string) => string[]
interface Extractor<T> {
/**
* Operators that fetch strings from the document
*/
operators: ExtractOperators
/**
* Process raw strings from {@link operators}, for example by validating and filtering them.
*/
processor: (value: string[], context?: ExtractorContext) => string[]
/**
* Pick one string as final result and transform to target type (eg. {@link Date}).
*/
selector: (value: string[], title?: string, context?: ExtractorContext) => T
}
interface ExtractorContext {
url?: string
sanitizeHtml?: sanitize.IOptions
lang?: string
}
/**
* Class for managing operators that can be addressed by index.
* Note: digit-string keys do not keep their insertion order in an object, so the index has to be set manually.
*/
declare class ExtractOperators extends Array<[string, ExtractOperator]> {
constructor(items?: {
[key: string]: ExtractOperator;
});
push(...items: [key: string, extractor: ExtractOperator][]): number;
set(key: string, extractor: ExtractOperator, index?: number): this;
get: (key: string) => [string, ExtractOperator];
}
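The three stages (operators, processor, selector) can be sketched as follows. Everything here is a simplified local model, not the library's implementation: `DocLike` stands in for Document, operators are kept as a plain array of pairs instead of ExtractOperators, `runExtractor` is a hypothetical driver, and the `{ section }` result shape is an assumption made for the example.

```typescript
// Minimal stand-in for the part of Document this sketch uses.
type DocLike = { querySelectorAll(selector: string): { textContent: string | null }[] }
type ExtractOperator = (document: DocLike, url?: string) => string[]

interface ExtractorContext { url?: string; lang?: string }

interface Extractor<T> {
  // Plain array of [key, operator] pairs; a simplified stand-in for ExtractOperators.
  operators: [string, ExtractOperator][]
  processor: (value: string[], context?: ExtractorContext) => string[]
  selector: (value: string[], title?: string, context?: ExtractorContext) => T
}

// Hypothetical driver: collect raw strings, process them, then select a result.
function runExtractor<T>(extractor: Extractor<T>, document: DocLike, context: ExtractorContext = {}): T {
  let raw: string[] = []
  for (const [, operator] of extractor.operators) {
    raw = raw.concat(operator(document, context.url))
  }
  return extractor.selector(extractor.processor(raw, context), undefined, context)
}

// Hypothetical custom extractor that reads a section name from the page.
const sectionExtractor: Extractor<{ section?: string }> = {
  operators: [
    ['category-class', (document) => document.querySelectorAll('.category').map((it) => it.textContent ?? '')],
    ['tag-link', (document) => document.querySelectorAll('a[rel="tag"]').map((it) => it.textContent ?? '')]
  ],
  // Trim whitespace and drop empty candidates.
  processor: (values) => values.map((it) => it.trim()).filter((it) => it.length > 0),
  // Pick the first surviving candidate.
  selector: (values) => ({ section: values[0] })
}

// Fake document so the sketch runs without a DOM.
const fakeDocument: DocLike = {
  querySelectorAll: (selector) =>
    selector === '.category' ? [{ textContent: '  Tech ' }, { textContent: '' }] : []
}

const result = runExtractor(sectionExtractor, fakeDocument) // { section: 'Tech' }

// Background for the ordering note: plain objects list integer-like keys first,
// in ascending order, regardless of insertion order.
const keyOrder = Object.keys({ b: 0, '2': 0, a: 0, '1': 0 }) // ['1', '2', 'b', 'a']
```

Keeping operators as ordered pairs rather than object properties is what lets a key such as '1' stay in the position where it was inserted.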
Content
Page content is extracted by readability. There is an API for adding a custom CSS selector extractor or another custom extractor.
import { selectors } from './src/selector-extractors'
selectors.put(new URLPattern('*://exam.ple/*'), {
selector: ['#id'],
ignored: ['.bad', 'header']
})
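The idea of per-site rules keyed by a URL pattern can be sketched as below. This is only an illustration and not how the library stores or matches rules: `wildcardToRegExp`, `putRule` and `findRule` are hypothetical helpers, and the `*` wildcard matcher is a rough stand-in for URLPattern, whose real matching semantics are richer.

```typescript
// Rule shape taken from the example above.
interface SelectorRule {
  selector: string[]
  ignored?: string[]
}

// Stand-in for URLPattern: escape regex metacharacters, then treat `*` as "anything".
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&').replace(/\*/g, '.*')
  return new RegExp('^' + escaped + '$')
}

// Rules are kept in insertion order; the first matching pattern wins.
const rules: [RegExp, SelectorRule][] = []

function putRule(pattern: string, rule: SelectorRule): void {
  rules.push([wildcardToRegExp(pattern), rule])
}

function findRule(url: string): SelectorRule | undefined {
  for (const [pattern, rule] of rules) {
    if (pattern.test(url)) return rule
  }
  return undefined
}

putRule('*://exam.ple/*', { selector: ['#id'], ignored: ['.bad', 'header'] })

const matched = findRule('https://exam.ple/post/1')     // the rule registered above
const unmatched = findRule('https://other.site/post/1') // undefined
```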
License
article-extractor Copyright (c) 2022 SettingDust, released under the MIT License