discovery-web-crawler
v1.2.1
Crawls a website and populates a Watson Discovery Collection.
Install
npm install discovery-web-crawler
Usage
The following snippet crawls the Watson stories pages on the IBM website and indexes them in a Watson Discovery collection.
const DiscoveryWebCrawler = require('discovery-web-crawler')

const crawler = new DiscoveryWebCrawler({
    serviceUrl: 'YOUR_SERVICE_URL',
    apikey: 'YOUR_APIKEY',
    environmentId: 'YOUR_ENVIRONMENT_ID',
    collectionId: 'YOUR_COLLECTION_ID',
    url: 'https://www.ibm.com/watson/stories/', // Starting point URL
    maxDepth: 3, // Max crawler depth
    fetchCondition: queueItem => queueItem.path.startsWith('/watson/'), // Condition to crawl this URL
    urlCondition: url => !url.match('/list'), // Condition to index this URL
    parse: async $ => ({ text: $('main').text().replace(/\s+/g, ' ').trim() }), // Cheerio API to extract JSON from HTML content
})

crawler.start()
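To see what the two conditions and the whitespace cleanup in parse do, the sketch below runs the same expressions against a few sample URLs without starting a crawl. It assumes only what the snippet above shows: that fetchCondition receives an object with a path string and that urlCondition receives the URL as a string. The sample URLs and the names samples, decisions and normalize are hypothetical and are not part of the package.

```javascript
// Same predicates as in the usage snippet, pulled out so they can be tried in isolation.
const fetchCondition = queueItem => queueItem.path.startsWith('/watson/')
const urlCondition = url => !url.match('/list')

// Same whitespace cleanup that the parse option applies to the extracted text.
const normalize = text => text.replace(/\s+/g, ' ').trim()

// Hypothetical sample URLs; a real crawl discovers these on its own.
const samples = [
    'https://www.ibm.com/watson/stories/',
    'https://www.ibm.com/watson/stories/list',
    'https://www.ibm.com/cloud/',
]

// Assumption: queueItem carries the URL path in a `path` property.
const decisions = samples.map(url => {
    const path = new URL(url).pathname
    return { url, fetch: fetchCondition({ path }), index: urlCondition(url) }
})

console.log(decisions)
console.log(normalize('  Watson \n\t stories  '))
```

With these inputs, the first URL passes both conditions, the second passes fetchCondition but is rejected by urlCondition because it contains /list, and the third fails fetchCondition because its path does not start with /watson/.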
Run tests
npm run test
Author
👤 Marco Cardoso
- GitHub: @MarcoABCardoso
- LinkedIn: @marco-cardoso
Show your support
Give a ⭐️ if this project helped you!