
deepsalter

v7.3.0

Trust should be earned. Let's do something about it.

About Deepsalter

What it does

Deepsalter watches new reddit submissions, checks whether each one names one or more journos who have a page on Deepfreeze.it, and posts a reply containing links to their Deepfreeze pages.

How it does it

Getting new submissions

The bot simply polls reddit at a regular interval; whenever it finds new posts, it analyzes them.
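A minimal sketch of that loop might look like the following. It uses reddit's public JSON listing for brevity (the real bot goes through the OAuth API), and the fixed interval is an assumption, since Deepsalter actually tunes its own timing:

    // Minimal sketch of the main polling loop (Node 18+ for built-in fetch).
    // The fixed interval is an assumption; Deepsalter adjusts its timing.
    const POLL_INTERVAL_MS = 10000;

    async function fetchNewPosts(subreddit, limit = 20) {
      const res = await fetch(`https://www.reddit.com/r/${subreddit}/new.json?limit=${limit}`);
      const listing = await res.json();
      return listing.data.children;
    }

    async function mainLoop(subreddits) {
      while (true) {
        for (const subreddit of subreddits) {
          const posts = await fetchNewPosts(subreddit);
          for (const post of posts) {
            console.log(post.data.id, post.data.title); // analysis would go here
          }
        }
        await new Promise(resolve => setTimeout(resolve, POLL_INTERVAL_MS));
      }
    }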

Processing link posts

If a post is a link with no body, Deepsalter scrapes the linked webpage looking for the author and the body of the article, throwing away everything else (sidebars and the like). This functionality relies partially on Mercury, which takes care of extracting the title and body of the article. The author-guessing function was adapted from unfluff. Since unfluff can't possibly identify the author correctly on every webpage, especially when the website is 100% dogshit, Deepsalter also relies on a number of special rules defined in data/matchers.json. Those precise rules are tried first, and if they fail Deepsalter resorts to guessing.
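The lookup order could be sketched like this. The matcher format shown here (hostname mapped to a CSS selector) is an assumption about what data/matchers.json contains, and guessAuthor stands in for the function adapted from unfluff:

    // Sketch of the author lookup: precise per-site rules first, generic
    // guessing second. The matcher format is an assumption.
    const cheerio = require('cheerio');
    const matchers = require('./data/matchers.json');

    function findAuthor(url, html) {
      const selector = matchers[new URL(url).hostname];
      if (selector) {
        const author = cheerio.load(html)(selector).first().text().trim();
        if (author) return author;
      }
      return guessAuthor(html); // fallback adapted from unfluff
    }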

To scrape a webpage successfully, it's important to get its raw, unmodified HTML code. When a page has been archived on archive.is or the Wayback Machine, Deepsalter discards it and gets the live page instead. Multiple archives are supported - Deepsalter peels them away like an onion until it gets to the original page.
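The peeling can be pictured as repeatedly stripping known archive URL wrappers until nothing matches anymore. The URL shapes below are assumptions, not an exhaustive list of what Deepsalter handles:

    // Sketch of the "onion peeling" idea. The URL patterns are assumptions.
    function unwrapArchives(url) {
      const patterns = [
        /^https?:\/\/archive\.(?:is|today|ph)\/[\w]+\/(https?:\/\/.+)$/i, // archive.is style
        /^https?:\/\/web\.archive\.org\/web\/\d+\/(https?:\/\/.+)$/i      // Wayback Machine style
      ];
      let current = url;
      let changed = true;
      while (changed) {
        changed = false;
        for (const pattern of patterns) {
          const match = current.match(pattern);
          if (match) {
            current = match[1];
            changed = true;
          }
        }
      }
      return current;
    }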

If the webpage doesn't exist anymore, the scrape fails; Deepsalter pretends that the page is empty and carries on.

Processing self posts

If a reddit post is instead a self post, Deepsalter scrapes anything linked in the self post body. The scraping mechanism is the same.
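A sketch of the link extraction, using a deliberately simple regex that will miss edge cases the real bot presumably handles:

    // Sketch: pull URLs out of a self post body and de-duplicate them.
    function extractLinks(selftext) {
      const urls = selftext.match(/https?:\/\/[^\s)\]]+/g) || [];
      return [...new Set(urls)];
    }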

And finally!!

Once everything has been collected, Deepsalter matches the list of journos that have a page on Deepfreeze.it against the post title, its body (if it's a self post) and any of the scraped links. This may result in a list of journos who are named in any of those resources or who are the authors of the linked articles.
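The matching step might look roughly like this; the field names on the journo and article objects are assumptions:

    // Sketch of the matching step. journos would come from the Deepfreeze
    // database; the field names here are assumptions.
    function findNamedJournos(journos, post, scrapedArticles) {
      const haystacks = [
        post.title,
        post.selftext || '',
        ...scrapedArticles.map(a => `${a.author || ''} ${a.body || ''}`)
      ].map(text => text.toLowerCase());

      return journos.filter(journo =>
        haystacks.some(text => text.includes(journo.name.toLowerCase()))
      );
    }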

The list of journos is used to generate a comment that is then posted as a reply to the submission.

Other tech stuff

Deepsalter doesn't write anything to disk and doesn't use any database, relying on reddit's own "save" function instead. Its internal state can be safely thrown away whenever it's done saving and sending replies, making it highly resilient to reboots and failures of all kinds.
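The bookkeeping idea can be sketched as follows. Posts that reddit reports as already saved were handled in a previous run and can be skipped; redditGet and redditPost are hypothetical thin wrappers over the API, not Deepsalter's real function names:

    // Sketch of the save-based bookkeeping. redditGet/redditPost are
    // hypothetical thin wrappers over authenticated API requests.
    async function newUnhandledPosts(subreddit, limit) {
      const listing = await redditGet(`/r/${subreddit}/new`, { limit });
      // when fetched with the bot account, each item carries a `saved` flag
      return listing.data.children.filter(post => !post.data.saved);
    }

    async function markHandled(post) {
      // POST /api/save marks the submission as saved for the bot account
      await redditPost('/api/save', { id: post.data.name });
    }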

While several reddit API wrappers are available, I eventually decided not to use any of them, opting for the smallest and simplest implementation I could write - just a handful of functions.
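For a sense of scale, here is what one of those functions might look like: fetching an OAuth token via the password grant reddit offers to script-type apps. This is a sketch, not Deepsalter's actual code; the env field names match the variables documented below, and Node 18+ is assumed for the built-in fetch:

    // Sketch: fetch an OAuth token using reddit's password grant.
    async function getToken(env) {
      const basic = Buffer.from(
        `${env.reddit_clientId}:${env.reddit_clientSecret}`
      ).toString('base64');
      const res = await fetch(env.reddit_authUrl, {
        method: 'POST',
        headers: {
          Authorization: `Basic ${basic}`,
          'Content-Type': 'application/x-www-form-urlencoded',
          'User-Agent': env.reddit_userAgent
        },
        body: new URLSearchParams({
          grant_type: 'password',
          username: env.reddit_username,
          password: env.reddit_password
        })
      });
      return (await res.json()).access_token;
    }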

Deepsalter automatically adjusts the timing of its requests in order to be as responsive as possible without going over reddit's API usage budget of 60 requests per minute.
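The pacing could be implemented as a simple point budget that refills once per window. This is a sketch of the idea, with numbers mirroring the defaults documented below:

    // Sketch of the budget: spend one point per request, wait out the
    // window once the points run out. Defaults mirror the docs below.
    function makeBudget(points = 60, durationMs = 61000) {
      let remaining = points;
      let resetAt = Date.now() + durationMs;
      return async function spend() {
        if (Date.now() >= resetAt) {
          remaining = points;
          resetAt = Date.now() + durationMs;
        }
        if (remaining <= 0) {
          // budget exhausted: sleep until the window resets
          await new Promise(resolve => setTimeout(resolve, resetAt - Date.now()));
          remaining = points;
          resetAt = Date.now() + durationMs;
        }
        remaining -= 1;
      };
    }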

This project has been feature-complete for a while now. Open an issue or contact whoever is operating it to request corrections or additions.

USAGE

Deepsalter no longer accepts command-line arguments. Run it with yarn start.

CONFIGURATION

As of September 2017, in order to reduce bloat, Deepsalter no longer supports logging to a file directly. Use pm2 to capture logs, or whatever is supported by the cloud service you're using. All Deepsalter does is write to stderr/stdout.

As of June 2018, Deepsalter no longer supports reading a JSON configuration file from an arbitrary path. Either set the environment variables described below or put a file called .env containing key=value pairs in its source folder, as explained in the dotenv documentation.

Deepsalter understands the following environment variables. Names are case-sensitive.

deepfreeze_endpoint: Deepfreeze API endpoint. Deepsalter will GET a JSON document from that address.

deepfreeze_journoPageBaseURL: URL fragment to prepend to the urlencoded journo name. The resulting URL will be the link to the journo page.

deepfreeze_TTL: In hours, how long before the Deepfreeze database is re-fetched.

reddit_clientId: Reddit authentication details.

reddit_clientSecret: Reddit authentication details.

reddit_username: Reddit authentication details.

reddit_password: Reddit authentication details.

reddit_tokenExpiry: In minutes, how long before the auth token is refreshed. Defaults to 55.

reddit_authUrl: Reddit auth endpoint, defaults to https://www.reddit.com/api/v1/access_token

reddit_apiBaseUrl: Reddit OAuth API endpoint, defaults to https://oauth.reddit.com/

reddit_subreddits: Comma-separated list of subreddits Deepsalter will watch. Defaults to an empty list.

reddit_limit: How many posts Deepsalter should fetch from each subreddit's new feed at the start of every cycle of the main loop. Defaults to 20.

reddit_userAgent: User agent string Deepsalter sends to reddit with each request. Defaults to Node/${process.version} Deepsalter/v${package_json.version}

reddit_delay: In milliseconds, how long Deepsalter should wait between requests even if the budget isn't exhausted. Defaults to 100.

reddit_concurrency: How many requests Deepsalter should send concurrently. Defaults to 2; currently capped to 1 by architectural constraints that should be lifted before the final release of v5.0.0.

reddit_budget: How many points Deepsalter can spend during the period set in reddit_budgetDuration. Defaults to 60 points. Every request sent to reddit burns one point.

reddit_budgetDuration: In milliseconds, how long the budget lasts before it's reset to the value specified in reddit_budget. Defaults to 61000.

reddit_signature: Text that should be appended to every comment Deepsalter generates. Defaults to a simple cautionary message that also links here.

scraper_userAgent: User agent string Deepsalter sends to websites when it downloads a webpage.

scraper_maxDownloadSize: In bytes, maximum size of a webpage. The download of anything larger is interrupted and the result is nulled.

scraper_delay: In milliseconds, how long to wait before moving on to the next webpage. Defaults to 100.

scraper_concurrency: How many websites Deepsalter should scrape concurrently. Currently capped to 1 by design because scraping burns too much memory and cloud hosts would terminate the process far too often if it scraped more than one webpage at a time.

scraper_budget: You can set a budget and budgetDuration for scraping, if you want. Defaults to Infinity.

scraper_budgetDuration: Defaults to Infinity.
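Put together, a minimal .env might look like this. Every value below is a placeholder, not a real endpoint or credential:

    # Sample .env - all values are placeholders, substitute your own
    deepfreeze_endpoint=https://example.com/deepfreeze/export.json
    deepfreeze_journoPageBaseURL=https://example.com/journalists/
    deepfreeze_TTL=24
    reddit_clientId=your-client-id
    reddit_clientSecret=your-client-secret
    reddit_username=your-bot-account
    reddit_password=your-bot-password
    reddit_subreddits=subredditone,subreddittwo
    reddit_limit=20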

RUNNING THE BOT

Run it:

    cd source/directory
    yarn
    yarn start

You can use screen or pm2 to keep it running, or write a system service script. It runs just fine on Windows.
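For example, one common pm2 pattern (assuming pm2 is installed globally; npm start runs the same package script as yarn start):

    pm2 start npm --name deepsalter -- start
    pm2 logs deepsalter

Since Deepsalter only writes to stderr/stdout, pm2 logs is also how you capture its output, as noted under CONFIGURATION.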