@data-master/web-extractor
v2.0.0
Web Extractor
Class: WebRule
Represents a web rule with its properties and methods.
Constructor
WebRule(id: string)
Creates a new instance of the WebRule class with the specified id.
- id (string): The unique identifier for the web rule.
Properties
- version (number): The version of the web rule.
- id (string): The unique identifier for the web rule.
- structures (array): An array of WebStructure instances associated with the web rule.
- fields (array): An array of WebFields instances associated with the web rule.
- meta (object): Additional metadata associated with the web rule.
- type (string): The type of the web rule.
Methods
setVersion(version: number)
Sets the version of the web rule.
- version (number): The version number to set.
addStructure(structure: WebStructure)
Adds a WebStructure instance to the web rule.
- structure (WebStructure): The WebStructure instance to add.
addFields(field: WebFields)
Adds a WebFields instance to the web rule.
- field (WebFields): The WebFields instance to add.
toJSON(): object
Serializes the web rule object to a JSON representation.
- Returns: An object representing the serialized web rule.
setMeta(key: string, value: any)
Sets a metadata value for the specified key.
- key (string): The key of the metadata.
- value (any): The value to set for the specified key.
run(): object
Executes the web rule by running its associated structures and fields.
- Returns: An object containing the storage information and the timestamp of the execution.
static fromJSON(ruleJSON: string): WebRule
Creates a new WebRule instance from a JSON representation.
- ruleJSON (string): The JSON representation of the web rule.
- Returns: A new WebRule instance created from the JSON representation.
Note: This method is static and does not require an existing instance of the class.
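Since fromJSON is documented as taking a string while toJSON returns an object, a round trip presumably goes through JSON.stringify, which calls an object's toJSON method automatically. The sketch below illustrates the idea with RuleLike, a hypothetical stand-in class; the serialized field names are assumptions and not the package's actual format.

```javascript
// RuleLike is a hypothetical stand-in, not the real WebRule class.
class RuleLike {
  constructor(id) {
    this.id = id;
    this.version = 1;
    this.meta = {};
  }

  // JSON.stringify calls this method automatically when serializing.
  toJSON() {
    return { id: this.id, version: this.version, meta: this.meta };
  }

  // Mirrors the documented signature: the input is a JSON string.
  static fromJSON(ruleJSON) {
    const data = JSON.parse(ruleJSON);
    const rule = new RuleLike(data.id);
    rule.version = data.version;
    rule.meta = data.meta;
    return rule;
  }
}

const originalRule = new RuleLike("23232");
originalRule.version = 3;

const ruleText = JSON.stringify(originalRule);
const restoredRule = RuleLike.fromJSON(ruleText);
console.log(ruleText);
console.log(restoredRule.version);
```

Storing the string form rather than the object keeps the saved rule independent of any live class instances.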
Class: WebFields
Represents a collection of web fields with their properties and methods. Extends the WebRetrieveItem class.
Constructor
WebFields(id: string)
Creates a new instance of the WebFields class with the specified id.
- id (string): The unique identifier for the web fields.
Properties
Inherited from WebRetrieveItem:
- id (string): The unique identifier for the web fields.
- steps (array): An array of WebRetrieveMethod instances representing the steps to retrieve the item.
- source (WebRetrieveItem|null): The source WebRetrieveItem instance from which to retrieve the item.
- debug (object): Debug information associated with the web fields.
- retrieved (boolean): A flag indicating whether the item has been retrieved.
- sourceValue (any|null): The value retrieved from the source.
- value (any|null): The final retrieved value.
- type (string): The type of the web fields.
Specific to WebFields:
- items (object): An object representing the collection of web field items. Each item is identified by its unique id.
- confidence (number): The confidence level associated with the web fields.
Methods
setConfidence(confidence: number)
Sets the confidence level for the web fields.
- confidence (number): The confidence level to set.
addFieldItem(item: WebFieldItem)
Adds a WebFieldItem instance to the collection of web fields.
- item (WebFieldItem): The WebFieldItem instance to add.
toJSON(): object
Serializes the web fields object to a JSON representation.
- Returns: An object representing the serialized web fields.
static fromJSON(fieldJSON: object): WebFields
Creates a new WebFields instance from a JSON representation.
- fieldJSON (object): The JSON representation of the web fields.
- Returns: A new WebFields instance created from the JSON representation.
Note: This method is static and does not require an existing instance of the class.
run(): array
Executes the retrieval process for the web fields by running the associated steps on the source value.
- Returns: An array of field objects containing the retrieved values for each field.
Note: The retrieval process considers the source value, source items, and field items to generate the fields.
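One way to read the note above is that every source item yields one field object, with one key per field item. The sketch below shows that reading only; runFields and the plain extractor functions are hypothetical stand-ins, and the real shape of the returned field objects is not confirmed by this documentation.

```javascript
// Hypothetical stand-in: each "field item" is a function from a source item to a value.
function runFields(sourceValue, fieldItems) {
  // Treat a single source value as a one-element collection.
  const sourceItems = Array.isArray(sourceValue) ? sourceValue : [sourceValue];

  // Build one field object per source item, keyed by field item id.
  return sourceItems.map((sourceItem) => {
    const field = {};
    for (const [id, extract] of Object.entries(fieldItems)) {
      field[id] = extract(sourceItem);
    }
    return field;
  });
}

// Rows are plain arrays here; in a browser they would be DOM nodes.
const sampleRows = [
  ["1", "Lovelace", "Ada"],
  ["2", "Hopper", "Grace"],
];

const sampleFields = runFields(sampleRows, {
  fname: (row) => row[2],
  lname: (row) => row[1],
});
console.log(sampleFields);
```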
Class: WebFieldItem
Represents a single web field item with its properties and methods. Extends the WebRetrieveItem class.
Constructor
WebFieldItem(id: string)
Creates a new instance of the WebFieldItem class with the specified id.
- id (string): The unique identifier for the web field item.
Properties
Inherited from WebRetrieveItem:
- id (string): The unique identifier for the web field item.
- steps (array): An array of WebRetrieveMethod instances representing the steps to retrieve the item.
- source (WebRetrieveItem|null): The source WebRetrieveItem instance from which to retrieve the item.
- debug (object): Debug information associated with the web field item.
- retrieved (boolean): A flag indicating whether the item has been retrieved.
- sourceValue (any|null): The value retrieved from the source.
- value (any|null): The final retrieved value.
- type (string): The type of the web field item.
Methods
static fromJSON(json: object): WebFieldItem
Creates a new WebFieldItem instance from a JSON representation.
- json (object): The JSON representation of the web field item.
- Returns: A new WebFieldItem instance created from the JSON representation.
Note: This method is static and does not require an existing instance of the class.
run(value: any): any
Executes the retrieval process for the web field item by running the associated steps on the provided value.
- value (any): The value on which to run the retrieval steps.
- Returns: The retrieved value after running the steps.
Note: The retrieval process considers the provided value and the associated steps to generate the retrieved value.
Class: WebRetrieveItem
Represents a web retrieve item with its properties and methods.
Constructor
WebRetrieveItem(id: string)
Creates a new instance of the WebRetrieveItem class with the specified id.
- id (string): The unique identifier for the web retrieve item.
Properties
- id (string): The unique identifier for the web retrieve item.
- steps (array): An array of WebRetrieveMethod instances representing the steps to retrieve the item.
- source (WebRetrieveItem|null): The source WebRetrieveItem instance from which to retrieve the item.
- debug (object): Debug information associated with the web retrieve item.
- retrieved (boolean): A flag indicating whether the item has been retrieved.
- sourceValue (any|null): The value retrieved from the source.
- value (any|null): The final retrieved value.
- type (string): The type of the web retrieve item.
Methods
setType(type: string)
Sets the type of the web retrieve item.
- type (string): The type to set for the web retrieve item.
setRetrieved(flag: boolean = true)
Sets the retrieved flag indicating whether the item has been retrieved.
- flag (boolean): The flag value. Defaults to true.
setSource(source: WebRetrieveItem|null)
Sets the source WebRetrieveItem instance from which to retrieve the item.
- source (WebRetrieveItem|null): The source WebRetrieveItem instance.
setSourceValue(sourceValue: any)
Sets the source value and updates the debug information.
- sourceValue (any): The source value to set.
retrieveFromSource()
Retrieves the value from the source WebRetrieveItem instance and sets the source value and retrieved flag.
addSteps(steps: WebRetrieveMethod)
Adds a WebRetrieveMethod instance to the steps for retrieving the item.
- steps (WebRetrieveMethod): The WebRetrieveMethod instance to add.
toJSON(): object
Serializes the web retrieve item object to a JSON representation.
- Returns: An object representing the serialized web retrieve item.
runSteps(sourceValue: any): any
Executes the steps of retrieving the item by sequentially running each WebRetrieveMethod.
- sourceValue (any): The source value to start with.
- Returns: The final retrieved value.
run(): any
Executes the retrieval process by retrieving the item from the source and running the steps if necessary.
- Returns: The final retrieved value.
static fromJSON(json: object): WebRetrieveItem
Creates a new WebRetrieveItem instance from a JSON representation.
- json (object): The JSON representation of the web retrieve item.
- Returns: A new WebRetrieveItem instance created from the JSON representation.
Note: This method is static and does not require an existing instance of the class.
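The sequential behaviour described for runSteps can be pictured as a fold over the step list, where each step receives the previous step's output. This is a minimal sketch under that assumption; StepLike and runStepsSketch are hypothetical names, and the real method may also record debug information or stop early.

```javascript
// StepLike is a hypothetical stand-in for anything exposing run(value).
class StepLike {
  constructor(transform) {
    this.transform = transform;
  }

  run(value) {
    return this.transform(value);
  }
}

// Feed the source value through every step in order.
function runStepsSketch(steps, sourceValue) {
  return steps.reduce((value, step) => step.run(value), sourceValue);
}

const demoSteps = [
  new StepLike((text) => text.trim()),
  new StepLike((text) => text.toUpperCase()),
];

const demoResult = runStepsSketch(demoSteps, "  ada ");
console.log(demoResult);
```

With an empty step list the source value comes back unchanged, which matches the idea that steps only refine what the source provides.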
Class: WebRetrieveMethod
Represents a web retrieve method with its properties and methods.
Constructor
WebRetrieveMethod()
Creates a new instance of the WebRetrieveMethod class.
Properties
- type (string): The type of the web retrieve method.
- parameters (object): The parameters associated with the web retrieve method.
- method (string): The method used for retrieval.
- debug (object): Debug information associated with the web retrieve method.
- classInstance (string): The class instance identifier.
Methods
setMethod(method: string)
Sets the method used for retrieval.
- method (string): The method used for retrieval.
setParameter(key: string, value: any)
Sets a parameter value for the specified key.
- key (string): The key of the parameter.
- value (any): The value to set for the specified key.
getParameter(key: string): any
Retrieves the value of a parameter for the specified key.
- key (string): The key of the parameter.
- Returns: The value of the parameter.
run(item: any): any
Executes the web retrieve method by processing the provided item.
- item (any): The item to be processed by the method.
- Returns: The processed item.
fromJSON(stepJSON: object)
Populates the WebRetrieveMethod instance from a JSON representation.
- stepJSON (object): The JSON representation of the web retrieve method.
toJSON(): object
Serializes the web retrieve method object to a JSON representation.
- Returns: An object representing the serialized web retrieve method.
Class: WebStructure
Represents a web structure with its properties and methods. Extends the WebRetrieveItem class.
Constructor
WebStructure(id: string)
Creates a new instance of the WebStructure class with the specified id.
- id (string): The unique identifier for the web structure.
Properties
Inherited from WebRetrieveItem:
- id (string): The unique identifier for the web structure.
- steps (array): An array of WebRetrieveMethod instances representing the steps to retrieve the item.
- source (WebRetrieveItem|null): The source WebRetrieveItem instance from which to retrieve the item.
- debug (object): Debug information associated with the web structure.
- retrieved (boolean): A flag indicating whether the item has been retrieved.
- sourceValue (any|null): The value retrieved from the source.
- value (any|null): The final retrieved value.
- type (string): The type of the web structure.
Methods
static fromJSON(structureJSON: object): WebStructure
Creates a new WebStructure instance from a JSON representation.
- structureJSON (object): The JSON representation of the web structure.
- Returns: A new WebStructure instance created from the JSON representation.
Note: This method is static and does not require an existing instance of the class.
Web Data Retrieval Methods
This project provides a set of classes that implement various web data retrieval methods. These methods are designed to extract specific content from web documents based on different criteria. The classes are implemented in JavaScript and extend the WebRetrieveMethod class, which provides common functionality for data retrieval.
WebRetrieveMethod
This is the base class for all web data retrieval methods. It contains shared functionality and properties.
Constructor
- No Arguments: Creates an instance of the WebRetrieveMethod class.
Methods
- setMethod(method): Sets the method identifier for the retrieval process.
- setParameter(name, value): Sets a parameter used by the retrieval method.
- run(content): The main method responsible for extracting data from the web content. It takes the web content as input and returns the extracted data.
RetriveByDocumentTextContent
This class retrieves data by directly accessing the text content of a web document.
Constructor
- No Arguments: Creates an instance of the RetriveByDocumentTextContent class.
Methods
- run(documentNode): Takes a DOM element documentNode as input and extracts its text content using the textContent property. It then passes the text content to the parent class's run method for further processing.
RetriveByDocumentGetAttribute
This class retrieves data by accessing a specific attribute of a web document.
Constructor
- attribute: The attribute name to retrieve data from.
Methods
- run(documentNode): Takes a DOM element documentNode as input, retrieves the value of the specified attribute using the getAttribute method, and passes it to the parent class's run method for further processing.
RetriveByDocumentParsedTextContent
This class retrieves data by parsing the inner HTML of a web document and extracting text content.
Constructor
- No Arguments: Creates an instance of the RetriveByDocumentParsedTextContent class.
Methods
- run(documentNode): Takes a DOM element documentNode as input, parses its inner HTML, removes HTML tags, and passes the resulting text content to the parent class's run method for further processing.
RetriveByVisibleElement
This class retrieves data from a visible web element by checking its visibility and dimensions.
Constructor
- No Arguments: Creates an instance of the RetriveByVisibleElement class.
Methods
- run(documentNode): Takes a DOM element documentNode as input and checks its visibility and dimensions. It passes the element itself if it is visible, or null if it is hidden, to the parent class's run method for further processing.
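A visibility test based on dimensions can be sketched with the standard offsetWidth and offsetHeight properties of HTMLElement. The helper name visibleOrNull is hypothetical, and the package's actual check is not documented here; it may also inspect computed styles such as display or visibility.

```javascript
// Hypothetical helper: treat an element as visible when its rendered box
// has a non-zero width and height, otherwise return null.
function visibleOrNull(element) {
  const hasSize = element.offsetWidth > 0 && element.offsetHeight > 0;
  return hasSize ? element : null;
}

// Plain objects stand in for DOM elements so the sketch runs outside a browser.
const shownElement = { offsetWidth: 120, offsetHeight: 24 };
const hiddenElement = { offsetWidth: 0, offsetHeight: 0 };

console.log(visibleOrNull(shownElement) === shownElement);
console.log(visibleOrNull(hiddenElement));
```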
RetriveByTextSplit
This class retrieves data by splitting a text content using a specified delimiter.
Constructor
- splitby: The delimiter used to split the text content.
Methods
- run(content): Takes the text content as input, splits it using the specified delimiter, and passes the resulting array of segments to the parent class's run method for further processing.
RetriveByRegEx
This class retrieves data by matching a regular expression pattern in the text content.
Constructor
- regex: The regular expression pattern to match.
Methods
- run(content): Takes the text content as input, matches it against the specified regular expression, and passes the array of matched results to the parent class's run method for further processing.
RetriveByFromArray
This class retrieves data from an array by accessing a specific index.
Constructor
- index: The index of the element to retrieve from the array.
Methods
- run(content): Takes an array content as input, retrieves the element at the specified index, and passes it to the parent class's run method for further processing.
RetriveByStringTrim
This class retrieves data by trimming whitespace from a string.
Constructor
- No Arguments: Creates an instance of the RetriveByStringTrim class.
Methods
- run(content): Takes a string content as input, trims leading and trailing whitespace from it, and passes the trimmed string to the parent class's run method for further processing.
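The text-oriented methods above correspond closely to built-in JavaScript operations. The sketch below uses only those built-ins on made-up input strings to show what each kind of step would hand to the next one; whether the package calls these exact functions internally is an assumption.

```javascript
// Split: a delimiter turns one string into an array of segments.
const statusLine = "Status:  Active ";
const statusParts = statusLine.split(":");

// From array: pick one element by index.
const pickedPart = statusParts[1];

// Trim: remove leading and trailing whitespace.
const trimmedPart = pickedPart.trim();

// Regular expression: match() returns an array whose element 0 is the whole
// match and whose element 1 is the first capture group.
const totalMatch = "Total: 42 items".match(new RegExp("(\\d+) items"));

console.log(statusParts);
console.log(trimmedPart);
console.log(totalMatch[0], totalMatch[1]);
```

This is why a regular expression step is usually followed by an index step: index 1 selects the first captured group rather than the full match.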
Constants: XPathResultType
A collection of constants representing different XPath result types.
- ANY_TYPE (number): Represents any type of result.
- NUMBER_TYPE (number): Represents a number result.
- STRING_TYPE (number): Represents a string result.
- BOOLEAN_TYPE (number): Represents a boolean result.
- UNORDERED_NODE_ITERATOR_TYPE (number): Represents an unordered node iterator result.
- ORDERED_NODE_ITERATOR_TYPE (number): Represents an ordered node iterator result.
- UNORDERED_NODE_SNAPSHOT_TYPE (number): Represents an unordered node snapshot result.
- ORDERED_NODE_SNAPSHOT_TYPE (number): Represents an ordered node snapshot result.
- ANY_UNORDERED_NODE_TYPE (number): Represents any unordered node result.
- FIRST_ORDERED_NODE_TYPE (number): Represents the first ordered node result.
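These names match the constants of the standard DOM XPathResult interface, which assigns them the values 0 through 9. The object below lists those standard values; that the package's XPathResultType uses the same numbers is an assumption, so check the exported constants before relying on literal values.

```javascript
// Numeric values as defined by the DOM XPathResult interface.
// Assumption: the package's XPathResultType mirrors these values.
const XPathResultTypeSketch = Object.freeze({
  ANY_TYPE: 0,
  NUMBER_TYPE: 1,
  STRING_TYPE: 2,
  BOOLEAN_TYPE: 3,
  UNORDERED_NODE_ITERATOR_TYPE: 4,
  ORDERED_NODE_ITERATOR_TYPE: 5,
  UNORDERED_NODE_SNAPSHOT_TYPE: 6,
  ORDERED_NODE_SNAPSHOT_TYPE: 7,
  ANY_UNORDERED_NODE_TYPE: 8,
  FIRST_ORDERED_NODE_TYPE: 9,
});

console.log(XPathResultTypeSketch.ORDERED_NODE_SNAPSHOT_TYPE);
```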
Class: RetriveByXpath
Represents a web retrieval method that uses XPath expressions to select nodes from an XML or HTML document. Extends the WebRetrieveMethod class.
Constructor
RetriveByXpath(xpathExpression: string)
Creates a new instance of the RetriveByXpath class with the specified XPath expression.
- xpathExpression (string): The XPath expression used for node selection.
Properties
Inherited from WebRetrieveMethod:
- type (string): The type of the web retrieval method.
- parameters (object): A dictionary of parameters for the retrieval method.
- method (string): The specific retrieval method.
- debug (object): Debug information associated with the retrieval method.
- classInstance (string): The class instance identifier.
Methods
setExpression(xpathExpression: string): void
Sets the XPath expression used for node selection.
- xpathExpression (string): The XPath expression to set.
setNamespaceResolver(namespaceResolver: any): void
Sets the namespace resolver for the XPath expression.
- namespaceResolver (any): The namespace resolver to set.
setResultType(resultType: number): void
Sets the result type for the XPath expression.
- resultType (number): The result type to set. Should be one of the constants defined in XPathResultType.
run(contextNode?: Node): any
Executes the retrieval process by evaluating the XPath expression and selecting nodes from the provided context node.
- contextNode (Node): The context node from which to evaluate the XPath expression. If not specified, the document node will be used as the context.
- Returns: The retrieved nodes or the result of running the retrieved nodes through the superclass's run method, depending on the result type.
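In a browser, evaluating an XPath expression against a context node is done with the standard document.evaluate call. The sketch below shows the snapshot variant, which returns a fixed list that can be read by index. The function name selectAllByXpath is hypothetical, and this is not necessarily how RetriveByXpath is implemented.

```javascript
// Value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM standard.
const ORDERED_NODE_SNAPSHOT = 7;

// Hypothetical helper: collect every node matched by an XPath expression.
// `doc` is a Document; `contextNode` defaults to the document itself.
function selectAllByXpath(doc, expression, contextNode = doc) {
  const result = doc.evaluate(expression, contextNode, null, ORDERED_NODE_SNAPSHOT, null);
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i += 1) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// Only runs where a DOM is available, for example in a browser console.
if (typeof document !== "undefined") {
  console.log(selectAllByXpath(document, "//a").length);
}
```

Passing a row element as contextNode together with a relative expression such as td[2] limits the search to that row, which is the pattern the example at the end of this document relies on.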
Class: RetriveByXpathSingleNode
Represents a web retrieval method that uses XPath expressions to select a single node from an XML or HTML document. Extends the RetriveByXpath class.
Constructor
RetriveByXpathSingleNode(xpathExpression: string)
Creates a new instance of the RetriveByXpathSingleNode class with the specified XPath expression.
- xpathExpression (string): The XPath expression used for node selection.
Properties
Inherited from RetriveByXpath:
- type (string): The type of the web retrieval method.
- parameters (object): A dictionary of parameters for the retrieval method.
- method (string): The specific retrieval method.
- debug (object): Debug information associated with the retrieval method.
- classInstance (string): The class instance identifier.
Methods
- Inherits all methods from the RetriveByXpath class.
Class: RetriveByXpathMultipleNodes
Represents a web retrieval method that uses XPath expressions to select multiple nodes from an XML or HTML document. Extends the RetriveByXpath class.
Constructor
RetriveByXpathMultipleNodes(xpathExpression: string)
Creates a new instance of the RetriveByXpathMultipleNodes class with the specified XPath expression.
- xpathExpression (string): The XPath expression used for node selection.
Properties
Inherited from RetriveByXpath:
- type (string): The type of the web retrieval method.
- parameters (object): A dictionary of parameters for the retrieval method.
- method (string): The specific retrieval method.
- debug (object): Debug information associated with the retrieval method.
- classInstance (string): The class instance identifier.
Methods
- Inherits all methods from the RetriveByXpath class.
Web Data Retrieval Methods - Query Selector
This project provides two classes that implement web data retrieval methods based on query selector expressions. These methods allow users to specify elements on a web page using CSS-like selectors and retrieve specific content from those elements. The classes are implemented in JavaScript and extend the WebRetrieveMethod class, which provides common functionality for data retrieval.
RetriveByQuerySelector
This class retrieves data by using the querySelector method to select a single element on the web page.
Constructor
- selectorExpression: The CSS-like selector expression to identify the target element.
Methods
- run(contextNode): Takes a DOM element contextNode as input (optional, defaults to the document) and uses the querySelector method with the specified selector expression to find the target element. It passes the element if found, or null if not found, to the parent class's run method for further processing.
RetriveByQuerySelectorAll
This class retrieves data by using the querySelectorAll method to select multiple elements on the web page.
Constructor
- selectorExpression: The CSS-like selector expression to identify the target elements.
Methods
- run(contextNode): Takes a DOM element contextNode as input (optional, defaults to the document), uses the querySelectorAll method with the specified selector expression to find all matching elements, and passes an array of the matched elements to the parent class's run method for further processing.
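Both classes build on standard DOM calls: querySelector returns the first match or null, and querySelectorAll returns a NodeList, which is array-like but not a true array. The sketch below shows the two calls and the conversion with Array.from. The helper names are hypothetical and the package's internals may differ.

```javascript
// Hypothetical helper: first matching element, or null when nothing matches.
function selectOne(contextNode, selector) {
  return contextNode.querySelector(selector);
}

// Hypothetical helper: all matching elements as a real array, so that
// array methods such as map and filter are available.
function selectMany(contextNode, selector) {
  return Array.from(contextNode.querySelectorAll(selector));
}

// Only runs where a DOM is available, for example in a browser console.
if (typeof document !== "undefined") {
  console.log(selectOne(document, "body"));
  console.log(selectMany(document, "a").length);
}
```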
Please note that the above documentation provides an overview of the classes and their methods' functionalities. For the actual implementation and usage of these classes, you would need to see the complete code and how it is integrated into the web data retrieval system.
Example
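// Assumes the package exports these classes by name; adjust the import to match your installation
const {
  WebRule, WebStructure, WebFields, WebFieldItem,
  RetriveByDocumentTextContent, RetriveByXpathSingleNode, RetriveByXpathMultipleNodes,
  RetriveByRegEx, RetriveByFromArray, RetriveByStringTrim,
} = require('@data-master/web-extractor');
// Create retrieval methods
const retriveTxtContent = new RetriveByDocumentTextContent();
const retriveuserListBody = new RetriveByXpathSingleNode('//*[@id="id_user_list_body"]');
const retrieveuserRows = new RetriveByXpathMultipleNodes('//tr[@class="clickable"]');
const retriveName = new RetriveByXpathSingleNode('td[2]');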
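// Create web structures
const userInfoStructure = new WebStructure("user");
userInfoStructure.addSteps(retriveuserListBody);
userInfoStructure.addSteps(retrieveuserRows);
// userHeaderInfoStructure is used below but was not defined in the original example.
// The id is a placeholder; add the steps that locate the header element on your page.
const userHeaderInfoStructure = new WebStructure("user-header");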
// Create field items
const fnameFieldItem = new WebFieldItem('fname');
fnameFieldItem.addSteps(new RetriveByXpathSingleNode('td[3]'));
fnameFieldItem.addSteps(retriveTxtContent);
const lnameFieldItem = new WebFieldItem('lname');
lnameFieldItem.addSteps(new RetriveByXpathSingleNode('td[2]'));
lnameFieldItem.addSteps(retriveTxtContent);
// Create web fields
const nameGenderDOBFields = new WebFields('name-dob-gender');
nameGenderDOBFields.setSource(userInfoStructure);
nameGenderDOBFields.addFieldItem(fnameFieldItem);
nameGenderDOBFields.addFieldItem(lnameFieldItem);
// Create field items for user header info
const fullname = new WebFieldItem('fullname');
// Backslashes are doubled because the pattern is written as a string literal
fullname.addSteps(new RetriveByRegEx("([^\\s]+) \\([^)]+\\)"));
fullname.addSteps(new RetriveByFromArray(1));
fullname.addSteps(new RetriveByStringTrim());
const gender = new WebFieldItem('gender');
gender.addSteps(new RetriveByRegEx("\\(\\s*(Male|Female|Other|Unknown|Declined to Specify)\\s*\\|"));
gender.addSteps(new RetriveByFromArray(1));
gender.addSteps(new RetriveByStringTrim());
const dob = new WebFieldItem('dob');
// No surrounding slashes: the pattern is a string, not a regex literal
dob.addSteps(new RetriveByRegEx("(\\w+ \\d{1,2}, \\d{4})"));
dob.addSteps(new RetriveByFromArray(1));
dob.addSteps(new RetriveByStringTrim());
// Create web fields for user header info
const nogdF = new WebFields('nogd');
nogdF.setSource(userHeaderInfoStructure);
nogdF.addFieldItem(fullname);
nogdF.addFieldItem(gender);
nogdF.addFieldItem(dob);
// Create web scraping rule
const scrapRule = new WebRule("23232");
scrapRule.setVersion(3);
scrapRule.addStructure(userInfoStructure);
scrapRule.addStructure(userHeaderInfoStructure);
scrapRule.addFields(nameGenderDOBFields);
scrapRule.addFields(nogdF);
// Run the web scraping rule
scrapRule.run();
This example demonstrates the usage of various web retrieval methods, web structures, field items, web fields, and a web scraping rule to retrieve and extract data from a web page. The retrieved data can be further processed or used as needed.
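The regular expression patterns in the example are written as ordinary strings, so every backslash meant for the regular expression has to be doubled. The sketch below shows what happens otherwise and runs the three patterns with built-in JavaScript. The header text is invented for illustration, and it is an assumption that RetriveByRegEx compiles its string argument the way new RegExp does.

```javascript
// In a string literal, an unrecognized escape such as "\s" loses its backslash.
const singleBackslash = "([^\s]+)";
const doubleBackslash = "([^\\s]+)";
console.log(singleBackslash); // the pattern now excludes the letter "s", not whitespace
console.log(doubleBackslash);

// Hypothetical header text, invented for this illustration.
const headerText = "JaneDoe (Female | Jan 5, 1990)";

const fullnameMatch = headerText.match(new RegExp("([^\\s]+) \\([^)]+\\)"));
const genderMatch = headerText.match(
  new RegExp("\\(\\s*(Male|Female|Other|Unknown|Declined to Specify)\\s*\\|")
);
const dobMatch = headerText.match(new RegExp("(\\w+ \\d{1,2}, \\d{4})"));

// Index 1 holds the first capture group, which is why each field item in the
// example follows its regular expression step with RetriveByFromArray(1).
console.log(fullnameMatch[1], genderMatch[1], dobMatch[1]);
```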