scraping-pipeline
v0.0.4
An asynchronous pipeline package that helps you design various scraping tasks with less effort
Node.js Scraping Pipelines
Introduction
Scraping pipeline is an asynchronous TypeScript module.
It helps organize the code in pipeline applications.
It contains some generic functionality to scrape, parse, process, modify and send data.
It also lets you define custom modules when the generic functionality is not enough.
Quick Start
How to install
npm i scraping-pipeline
Here are some examples to help you understand the features
Basic pipeline with custom modules
import { Pipeline, Modules } from 'scraping-pipeline';
const yourFunctionToGetSomeCsv = async (): Promise<string> => {
  // Declared with let: a const without an initializer does not compile.
  let someCsv: string;
  ...
  return someCsv;
};
const yourFunctionToStoreData = async (data: any) => {
  ...
};
const getter = new Modules.General.Custom(yourFunctionToGetSomeCsv);
const parser = new Modules.General.CsvParser({ headers: true });
const saver = new Modules.General.Custom(yourFunctionToStoreData);
const pipeline = new Pipeline([getter, parser, saver]);
pipeline.run().then(() => { console.log('Done') });
Components and Types
Pipeline
Pipeline is the main component of the package.
It is initialized with a pipe of Modules.
Pipeline has a run method.
Running the Pipeline executes the Modules in sequence and feeds Data from one to the next.
The first module receives no input Data.
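The sequential behavior described above can be modeled without the package. The following is a minimal sketch, not the package's implementation: Step and runSteps are hypothetical names, and each step is assumed to be an async function whose result becomes the next step's input.

```typescript
// Hypothetical stand-in for a module: an async step mapping input to output.
type Step = (input: unknown) => Promise<unknown>;

// Runs the steps in order and feeds each result into the next step.
// The first step receives undefined because nothing precedes it.
async function runSteps(steps: Step[]): Promise<unknown> {
  let current: unknown = undefined;
  for (const step of steps) {
    current = await step(current);
  }
  return current;
}

// Two illustrative steps: produce a CSV string, then count its data rows.
const fetchCsv: Step = async () => 'name,price\napple,1\npear,2';
const countRows: Step = async (csv) => String(csv).split('\n').length - 1;

runSteps([fetchCsv, countRows]).then((result) => {
  console.log(result); // logs 2
});
```

Awaiting each step before starting the next keeps the order deterministic, which matches the "in sequence" wording above.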
Modules
Modules are small components that usually perform a single task.
All Modules implement Modules.Base and extend Modules.Common<InputType, OutputType>.
There are some General Modules designed for standard tasks.
CsvParser
Modules.General.CsvParser is a module that parses CSV Data and returns structured output.
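To make "structured output" concrete, here is a naive, self-contained sketch of header-based CSV parsing. It is an illustration only and not how CsvParser works internally: parseCsvWithHeaders is a hypothetical helper, the one-object-per-row output shape is an assumption, and quoted fields are not supported.

```typescript
// Naive illustration: splits on newlines and commas, no quoted-field support.
// Each data row becomes an object keyed by the header row (assumed shape).
function parseCsvWithHeaders(csv: string): Record<string, string>[] {
  const lines = csv.trim().split(/\r?\n/);
  const headers = (lines.shift() ?? '').split(',');
  return lines
    .filter((line) => line.length > 0)
    .map((line) => {
      const cells = line.split(',');
      const row: Record<string, string> = {};
      headers.forEach((header, i) => {
        // Missing cells are filled with an empty string.
        row[header] = cells[i] ?? '';
      });
      return row;
    });
}

console.log(parseCsvWithHeaders('name,price\napple,1\npear,2'));
```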
ArrayParser
Modules.General.ArrayParser is a generic module that converts string arrays into a meaningful structure.
This module may be useful when you need to parse raw data from documents.
It takes a ParsingTemplate as a constructor argument, which tells the parser how to convert the array to structured data.
Custom
Modules.General.Custom<InputType, OutputType> uses a custom async function to solve custom problems.
It takes an async function as a processor, which performs the task.
The processor function gets 3 arguments:
- data: InputType
- previous: any
- old: any[]
It returns a value of type OutputType.
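A processor following the three-argument description above might look like the sketch below. The Processor type alias and the Row interface are hypothetical names introduced for illustration; only the argument list (data, previous, old) and the async return come from the description.

```typescript
// Assumed shape of a processor, based on the three arguments listed above.
type Processor<I, O> = (data: I, previous: any, old: any[]) => Promise<O>;

// Hypothetical input and output row types for this example.
interface Row { name: string; price: string; }
interface PricedRow { name: string; price: number; }

// Drops rows whose price is empty or not numeric, then converts price to a number.
// previous and old are accepted to match the signature but unused here.
const normalizePrices: Processor<Row[], PricedRow[]> = async (data, _previous, _old) =>
  data
    .filter((row) => row.price.trim() !== '' && !Number.isNaN(Number(row.price)))
    .map((row) => ({ name: row.name, price: Number(row.price) }));
```

A function like this would presumably be wrapped the same way as in the Quick Start example, with new Modules.General.Custom(normalizePrices).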
Data
Data<T> is a generic type used to pass Data between modules.
Data contains the current, previous and old data. It stores all data passed across the Pipeline.
Usually you don't need to think about Data<T>; it is used at a lower level of the pipeline.
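As a rough mental model, Data<T> could be pictured as in the sketch below. The field names follow the description above, but the field types and the advance helper are assumptions made for illustration; in particular, how values move from current to previous to old is a guess, not documented behavior.

```typescript
// Hypothetical shape: field names from the description, types assumed.
interface Data<T> {
  current: T;
  previous: unknown;
  old: unknown[];
}

// Assumed transition between modules: the old current becomes previous,
// and the old previous (if any) is appended to the history in old.
function advance<T, U>(data: Data<T>, next: U): Data<U> {
  return {
    current: next,
    previous: data.current,
    old: data.previous === undefined ? data.old : [...data.old, data.previous],
  };
}

const first: Data<string> = { current: 'a', previous: undefined, old: [] };
const second = advance(first, 'b');
const third = advance(second, 'c');
console.log(third); // { current: 'c', previous: 'b', old: [ 'a' ] }
```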
License
May be freely distributed under the MIT license.