= BlogSearch index building tool for generic static webpages (crawler)
:project-version: 0.0.3
:rootdir: https://github.com/kbumsik/blogsearch
ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]
CAUTION: This is part of the link:{rootdir}[BlogSearch project]. If you would like to understand the overall concept, go to link:{rootdir}[the parent directory].
== 1. Building a search index file
=== The easiest way
[source,bash]
----
npx blogsearch-crawler
----
=== The formal way
[source,bash]
----
npm install -g blogsearch-crawler
blogsearch-crawler
----
=== Configuration
CAUTION: For more details on how to configure fields, go to link:{rootdir}#whats-in-the-index[the "What's in the index file" section of the main project].
.blogsearch.config.js
[source,javascript,options="nowrap",subs="verbatim,attributes"]
----
module.exports = {
  // [Mandatory] This must be 'simple'.
  type: 'simple',
  // [Mandatory] Generated blogsearch database file.
  output: './my_blog_index.db.wasm',
  // [Mandatory] List of entries to parse. The crawler uses glob patterns internally.
  // How to use glob: https://github.com/isaacs/node-glob
  entries: [
    './reactjs.org/public/docs/**/*.html'
  ],
  // [Mandatory] Field configurations.
  // See: {rootdir}#whats-in-the-index
  fields: {
    title: {
      // The value can be a CSS selector.
      parser: 'article > header',
    },
    body: {
      // Set false if you want to reduce the size of the database.
      hasContent: true,
      // The parser can be a function as well.
      parser: (entry, page) => {
        // Use the puppeteer page object.
        // It's okay to return a promise.
        return page.$eval('article > div > div:first-child', el => el.textContent);
      }
    },
    url: {
      // By setting this false, the search engine won't index the URL.
      indexed: false,
      parser: (entry, page) => {
        // entry is a string of the path being parsed.
        return entry.replace('./reactjs.org/public', 'https://reactjs.org');
      },
    },
    categories: {
      // This is disabled because the target website doesn't have categories.
      enabled: false,
      // This dummy parser is unused because the field is disabled.
      parser: () => 'categories-1, categories-2',
    },
    tags: {
      // This is disabled because the target website doesn't have tags.
      enabled: false,
      // This dummy parser is unused because the field is disabled.
      parser: () => 'tags-1, tags-2',
    },
  }
};
----
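Before running a full crawl, it can help to verify that the `entries` glob matches the files you expect and that a CSS selector used by a parser actually resolves on a sample page. The following standalone script is a minimal sketch, not part of blogsearch-crawler: the file name `check-config.js` is hypothetical, and it assumes the same `./reactjs.org/public` layout as the config above, using the `glob` and `puppeteer` packages directly.

[source,javascript]
----
// check-config.js -- hypothetical standalone sanity check, not part of blogsearch-crawler.
// Requires: npm install glob puppeteer  (assumes the glob v7-style `sync` API)
const path = require('path');
const glob = require('glob');
const puppeteer = require('puppeteer');

(async () => {
  // Preview which files the `entries` pattern from the config actually matches.
  const entries = glob.sync('./reactjs.org/public/docs/**/*.html');
  console.log(`Matched ${entries.length} entries`);
  if (entries.length === 0) return;

  // Dry-run a selector against the first matched file, the same way a
  // function parser receives a puppeteer `page` object from the crawler.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(`file://${path.resolve(entries[0])}`);
  const body = await page.$eval(
    'article > div > div:first-child', // same selector as the `body` parser above
    el => el.textContent || ''
  );
  console.log(body.slice(0, 200));
  await browser.close();
})();
----

If a selector throws here, it will fail inside the crawler for the same pages, so this is a cheap way to iterate on field parsers before building the full index.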
== 2. Enabling the search engine in the webpage
After building the index file, you need to enable the search engine in the web page. Go to link:../blogsearch[the blogsearch engine] for instructions.
Again, if you would like to understand the concept of BlogSearch, go to link:{rootdir}/[the parent directory].