larry-crawler

v0.0.1

Published

3 years ago

A simple yet flexible Twitter Crawler for Kayako Twitter Challenge

the-mad-king

Twitter Web Scrawling Kayako

larry-crawler

Kayako Twitter challenge

Installation

npm install --save larry-crawler

Usage

Navigate to the node_modules directory which contains larry-crawler.

cd larry-crawler/usage
node get-tweets.js

Test

npm test

Output

The application fetches tweets in batches of 100. Unless forcefully killed (CTRL+C), the app will keep running until all tweets matching the defined criteria have been fetched. See result.

NOTE: A batch might produce fewer than 100 tweets in the output if you've applied a secondary filter (like retweetCount). For example, if 100 tweets were retrieved for the specified hashtag and 30 of them haven't been retweeted, then only 70 tweets are supplied in the response.statuses Array.

Module API

To access the class that larry-crawler exposes for crawling Twitter:

const {TwitterCrawler} = require ('./larry-crawler');

Get your app or user credentials from https://dev.twitter.com/, then create a new object like:

const crawler = new TwitterCrawler ({

	consumerKey: process.env.TWITTER_CONSUMER_KEY,
	consumerSecret: process.env.TWITTER_CONSUMER_SECRET,
	accessTokenKey: process.env.TWITTER_ACCESS_TOKEN_KEY,
	accessTokenSecret: process.env.TWITTER_ACCESS_TOKEN_SECRET

});

If you have a Twitter app, use bearerToken instead of accessTokenKey and accessTokenSecret.
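As a minimal sketch of choosing between the two credential styles, the helper below builds the options object from environment-style variables. The function name buildCredentials and the variable name TWITTER_BEARER_TOKEN are assumptions made for illustration; only the option keys come from the text above.

```javascript
// Hypothetical helper, not part of larry-crawler: builds the options object
// passed to `new TwitterCrawler(...)`.
// TWITTER_BEARER_TOKEN is an assumed environment variable name.
function buildCredentials(env) {
  const base = {
    consumerKey: env.TWITTER_CONSUMER_KEY,
    consumerSecret: env.TWITTER_CONSUMER_SECRET
  };

  // App credentials: a bearer token replaces the access token pair.
  if (env.TWITTER_BEARER_TOKEN) {
    return Object.assign({}, base, { bearerToken: env.TWITTER_BEARER_TOKEN });
  }

  // User credentials: access token key and secret.
  return Object.assign({}, base, {
    accessTokenKey: env.TWITTER_ACCESS_TOKEN_KEY,
    accessTokenSecret: env.TWITTER_ACCESS_TOKEN_SECRET
  });
}
```

The result can then be passed to the constructor, for example new TwitterCrawler (buildCredentials (process.env)).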

The new object exposes a getTweets() method, which fetches tweets matching the given criteria and returns a Promise.

const criteria = { hashtags: ['custserv'], retweetCount: {$gt: 0} };

crawler.getTweets (criteria).then ((response) => {
  console.log (JSON.stringify (response, null, 2));
}).catch ((error) => {
  // Report the failure instead of silently swallowing it
  console.error (error);
});

To set the max_id parameter for pagination,

criteria.maxIdString = status.id_str

where status is an item in the response.statuses Array.

See get-tweets.js for a full example.
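The pagination step can be sketched as a loop that keeps lowering maxIdString until a batch comes back empty. This is an illustration under stated assumptions, not the code in get-tweets.js: the names nextMaxIdString, fetchAllTweets and maxBatches are hypothetical, BigInt (Node 10.4+) stands in for the module's own string decrement, and stopping on the first empty batch is a simplification, since a secondary filter could empty a batch while older tweets still exist.

```javascript
// Hypothetical helper: max_id is inclusive in the Twitter search API, so the
// next request should start one below the lowest ID already received.
function nextMaxIdString(statuses) {
  let lowest = BigInt(statuses[0].id_str);
  for (const status of statuses) {
    const id = BigInt(status.id_str);
    if (id < lowest) lowest = id;
  }
  return (lowest - 1n).toString();
}

// `crawler` is any object whose getTweets(criteria) returns a Promise of
// { statuses: [...] }. `maxBatches` is an assumed safety cap.
async function fetchAllTweets(crawler, criteria, maxBatches = 10) {
  const all = [];
  const query = Object.assign({}, criteria); // do not mutate the caller's object

  for (let batch = 0; batch < maxBatches; batch++) {
    const response = await crawler.getTweets(query);
    const statuses = response.statuses || [];
    if (statuses.length === 0) break; // simplification: treat empty as "done"

    all.push(...statuses);
    query.maxIdString = nextMaxIdString(statuses);
  }
  return all;
}
```

IDs are compared and decremented as BigInt values rather than Numbers so that no precision is lost on 64-bit IDs.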

Technical Details

The module has only one dependency: twitter.

Searching by hashtag is simple, since the Twitter API has built-in support for it. To further refine tweets by their number of retweets, the module contains a class, SecondaryFilterForTweets.

See Working with search API
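A secondary filter of this kind might look roughly like the sketch below. It is not the module's SecondaryFilterForTweets class: the function names are hypothetical, only the $gt operator appears in this README (the other operators are assumed by analogy), and retweet_count is the field the Twitter API reports on each status.

```javascript
// Comparison operators keyed by name. Only $gt is documented in the README;
// $gte, $lt, $lte and $eq are assumptions added for illustration.
const operators = {
  $gt: (actual, expected) => actual > expected,
  $gte: (actual, expected) => actual >= expected,
  $lt: (actual, expected) => actual < expected,
  $lte: (actual, expected) => actual <= expected,
  $eq: (actual, expected) => actual === expected
};

// True when the status satisfies every operator in the condition,
// e.g. { $gt: 0 } or { $gte: 1, $lt: 10 }.
function matchesRetweetCount(status, condition) {
  return Object.keys(condition).every((name) => {
    const compare = operators[name];
    if (!compare) throw new Error('Unsupported operator: ' + name);
    return compare(status.retweet_count, condition[name]);
  });
}

// Returns a new array with only the statuses that pass the retweet filter.
function filterStatuses(statuses, criteria) {
  if (!criteria.retweetCount) return statuses.slice();
  return statuses.filter((status) => matchesRetweetCount(status, criteria.retweetCount));
}
```

With the criteria { retweetCount: {$gt: 0} }, statuses that have never been retweeted are dropped, which is why a batch can contain fewer than 100 tweets.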

Since a maximum of 100 tweets are sent per request, an effective pagination strategy had to be implemented using the max_id parameter so that ALL the tweets can be retrieved, going back to the very beginning. This strategy was followed to achieve pagination.
The primary challenge was dealing with the 64-bit integer ID provided by the Twitter API. JS numbers can only represent integers exactly up to 53 bits. Hence, the application uses the id_str field at all times, and a special decrement function has been written in usage/utils.js to operate on the string ID.

See Working with 64-bit id in Twitter
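To illustrate why the string form matters, here is a minimal sketch of decrementing a decimal ID string without converting it to a Number. It is not the function from usage/utils.js; the name decrementIdString and its error handling are assumptions.

```javascript
// Hypothetical sketch: subtracts 1 from a non-negative decimal integer held
// in a string, using schoolbook borrowing so that no precision is lost.
function decrementIdString(idStr) {
  if (!/^\d+$/.test(idStr)) {
    throw new TypeError('Expected a string of decimal digits');
  }

  const digits = idStr.split('');
  let i = digits.length - 1;

  // Borrow through trailing zeros: '1000' becomes '0999' after this loop
  // and the decrement below.
  while (i >= 0 && digits[i] === '0') {
    digits[i] = '9';
    i--;
  }
  if (i < 0) {
    throw new RangeError('Cannot decrement zero');
  }
  digits[i] = String(Number(digits[i]) - 1);

  // Drop leading zeros but keep a lone '0'.
  return digits.join('').replace(/^0+(?=\d)/, '');
}

// Converting through Number silently loses the last digit of a large ID:
console.log(Number('9007199254740993'));             // 9007199254740992
console.log(decrementIdString('9007199254740993'));  // '9007199254740992'
```

On runtimes that support BigInt, (BigInt (idStr) - 1n).toString () gives the same result with less code.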
