robots-parser-combinator
A proper robots.txt parser and combinator that works with eulalie.
Usage
Given a robots.txt file like this:
User-agent: *
Allow: /blog/index.html # site blog
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://www.mysite.com/sitemaps/profiles-sitemap.xml # extra profile urls
# save the robots
> const fs = require('fs')
> const parser = require('robots-parser-combinator')
> const robotstxt = fs.readFileSync('./robots.txt', 'utf8')
>
> const goodRobots = parser.parse(robotstxt)
[ { useragent: { value: '*' } },
{ allow: { value: '/blog/index.html', comment: 'site blog' } },
{ disallow: { value: '/cgi-bin/' } },
{ disallow: { value: '/tmp/' } },
{ sitemap:
{ value: 'http://www.mysite.com/sitemaps/profiles-sitemap.xml',
comment: 'extra profile urls' } },
'save the robots' ]
> const badRobots = parser.parse('')
[]
Alternatively, you can feed the parser.robotstxt combinator into eulalie to parse robots.txt yourself.
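For example, a minimal sketch of driving the combinator through eulalie directly. This assumes eulalie's stream and parse functions and that a successful parse exposes its result on .value:
> const eulalie = require('eulalie')
> const result = eulalie.parse(parser.robotstxt, eulalie.stream(robotstxt))
> result.value // on success, the same array of directives that parser.parse returns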
You can also parse robots.txt files containing nonstandard extensions like Crawl-delay or Host by using the parser.parseNS function. The combinators for the nonstandard extensions are also provided.
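For instance, a minimal sketch; the crawldelay and host key names in the output are assumptions here, inferred from the record shape that parser.parse produces above:
> const nsRobots = parser.parseNS('User-agent: *\nCrawl-delay: 10\nHost: www.mysite.com')
> nsRobots
[ { useragent: { value: '*' } },
  { crawldelay: { value: '10' } },
  { host: { value: 'www.mysite.com' } } ]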
Implementation
The parser is an implementation of the BNF form for robots.txt based on the Google spec, and references RFC 1945 and RFC 1808 when appropriate.
LWS (linear-white-space) is defined using the rule from RFC 5234 rather than the one from RFC 1945. There is a small but significant inconsistency between the two rules:
RFC 5234 linear-white-space:
WSP = SP / HTAB
LWSP = *(WSP / CRLF WSP)
RFC 1945 linear-white-space:
LWS = [CRLF] 1*( SP | HT )
The RFC 1945 linear-white-space rule consumes at least one space or tab character, while the RFC 5234 rule matches zero or more. Because of this inconsistency, the parser defaults to the more permissive RFC 5234 rule. You can switch to the stricter RFC 1945 rule by calling parser.setStrictLWS(true) before parsing.
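In REPL terms:
> parser.setStrictLWS(true) // require at least one SP / HT, per RFC 1945
> const strictRobots = parser.parse(robotstxt)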
All of the BNF rules in the robots.txt spec are provided as combinators. Since the combinators are compatible with eulalie, you can use them to get partial aspects of a robots.txt file or as part of a larger combinator.
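As a sketch of parsing a single directive with one of the rule combinators: the combinator name parser.sitemap is hypothetical here (chosen to match the BNF rule), and the result shape is inferred from parser.parse's output above:
> const eulalie = require('eulalie')
> const line = eulalie.parse(parser.sitemap, eulalie.stream('Sitemap: http://www.mysite.com/sitemap.xml'))
> line.value
{ sitemap: { value: 'http://www.mysite.com/sitemap.xml' } }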
License
Licensed under the MIT license.