wget-parser
v2.0.0
Parses the wget spider output into an object
Spider parser
Parses the spider output from wget into an object structure of links.
This object can then be processed further, for example to build a tree of a website's hierarchy as the basis for sitemap generation.
Tested using wget v1.15 on Linux.
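The further processing mentioned above might look something like the following sketch. It assumes the parse result has the shape shown in the Output section (a `links` array whose entries carry `url.hostname` and `url.pathname`). `buildTree` is a hypothetical helper and not part of wget-parser, and the sample URLs are made up.

```javascript
// Hypothetical helper, not part of wget-parser: groups parsed links
// into a nested tree keyed by hostname, then by path segment.
function buildTree(result) {
  // Null-prototype objects avoid clashes with keys such as "constructor".
  var root = Object.create(null);
  (result.links || []).forEach(function (item) {
    var host = item.url.hostname;
    var node = root[host] = root[host] || Object.create(null);
    // Split "/a/b/" into ["a", "b"], ignoring empty segments.
    var segments = (item.url.pathname || '/').split('/').filter(Boolean);
    segments.forEach(function (segment) {
      node = node[segment] = node[segment] || Object.create(null);
    });
  });
  return root;
}

// Sample input mimicking the parser output shape (made-up URLs).
var sample = {
  links: [
    { url: { hostname: 'example.com', pathname: '/' } },
    { url: { hostname: 'example.com', pathname: '/docs/intro' } },
    { url: { hostname: 'example.com', pathname: '/docs/api' } }
  ],
  broken: []
};

console.log(JSON.stringify(buildTree(sample), null, 2));
```

A nested object keeps the sketch short; a real sitemap generator would probably also want to keep the full `href` at each leaf.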
Usage
var parser = require('wget-parser')
  , buf = Buffer.alloc(0); // buffer should contain the spider output
console.dir(parser(buf));
parser.Parser: The parser class.
parser.Link: The class that represents a link.
parser.ParseStream: Parse stream class.
Stream support is available; see the test spec for example usage.
wget-parser
A program that reads from stdin and prints the result of the parse as JSON. It exits with error code 1 if any broken links are found.
cat test/fixtures/mock.txt | wget-parser
cat test/fixtures/broken.txt | wget-parser; echo $?;
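The exit-status behaviour described above can be approximated in a few lines. This is a sketch rather than the program's actual source: `exitCodeFor` is a hypothetical name, and it assumes only that the parse result has a `broken` array, as in the Output section.

```javascript
// Hypothetical helper: map a parse result to a process exit code.
// Assumes `result.broken` is an array, as in the example output.
function exitCodeFor(result) {
  return result.broken && result.broken.length > 0 ? 1 : 0;
}

var clean = { links: [], broken: [] };
// Placeholder entry: the real shape of a broken link is not shown in this readme.
var failing = { links: [], broken: [{}] };

console.log(exitCodeFor(clean));   // 0
console.log(exitCodeFor(failing)); // 1
```

In a command-line script the returned value would typically be assigned to `process.exitCode`, so that callers can branch on `$?` as in the second command above.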
wget-spider
A program that performs a spider with wget and pipes the output to wget-parser:
wget-spider http://google.com
Output
Example output from the parser:
{
  "links": [
    {
      "url": {
        "protocol": "http:",
        "slashes": true,
        "auth": null,
        "host": "google.com",
        "port": null,
        "hostname": "google.com",
        "hash": null,
        "search": null,
        "query": null,
        "pathname": "/",
        "path": "/",
        "href": "http://google.com/"
      },
      "link": "http://google.com/",
      "line": "--2016-02-10 16:11:57-- http://google.com/"
    },
    {
      "url": {
        "protocol": "http:",
        "slashes": true,
        "auth": null,
        "host": "www.google.co.id",
        "port": null,
        "hostname": "www.google.co.id",
        "hash": null,
        "search": "?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "query": "gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "pathname": "/",
        "path": "/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
        "href": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
      },
      "link": "http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ",
      "line": "--2016-02-10 16:11:57-- http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ"
    }
  ],
  "broken": []
}
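As an illustration of consuming this structure, the sketch below collects the `href` of every parsed link per hostname. `groupByHost` is a hypothetical helper, not something the package exports, and the input is a trimmed-down copy of the example output above.

```javascript
// Hypothetical helper: collect the href of every parsed link per hostname.
function groupByHost(result) {
  var hosts = Object.create(null);
  result.links.forEach(function (item) {
    var host = item.url.hostname;
    (hosts[host] = hosts[host] || []).push(item.url.href);
  });
  return hosts;
}

// Trimmed-down copy of the example output (only the fields used here).
var output = {
  links: [
    { url: { hostname: 'google.com', href: 'http://google.com/' } },
    {
      url: {
        hostname: 'www.google.co.id',
        href: 'http://www.google.co.id/?gws_rd=cr&ei=zfC6Vv6KKYexuATc3pu4DQ'
      }
    }
  ],
  broken: []
};

console.log(JSON.stringify(groupByHost(output), null, 2));
```

Grouping by `hostname` rather than `host` ignores any port number; which of the two is preferable depends on how the site is served.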
Developer
Test
To run the test suite:
npm test
Cover
To generate a code coverage report, run:
npm run cover
Lint
Run the source tree through jshint and jscs:
npm run lint
Clean
Remove generated files:
npm run clean
Readme
To build the readme file from the partial definitions:
npm run readme
Generated by mdp(1).