archivator

v1.0.2

Published

3 years ago

Ever wanted to archive your own copy of articles you enjoyed reading and to be able to search through them?


renoirb

bookmark archive offline-copy crawler downloader node-website-scraper archiving

Archivator



CURRENT STATUS: This is the frozen v1.x branch. Future work happens on the v3.x-dev branch, but v1.x remains usable as-is; see renoirb/archivator-demo.

Summary

This project is a means to try out ECMAScript 2017 tooling and do something useful. See Challenge below.

The objective of this project is to:

(Note: a check mark :white_check_mark: below denotes work that has been done and should be usable.)

- :white_check_mark: Cache HTML payload of source Web Pages URLs we want archived (see src/fetcher.js)
- :white_check_mark: Store files for each source URL at a consistent path name (see src/normalizer/slugs.js) (see v3.x-dev url-dirname-normalizer)
  - :white_check_mark: Extract assets, download them for archiving purposes (see src/transformer.js at extractAssets and src/normalizer/assets.js) (see v3.x-dev @archivator/archivable)
  - :white_check_mark: Download images ("assets") from Web Pages (see v3.x-dev @archivator/archivable)
  - :white_check_mark: Rename assets in archive and adjust archived version to use cached copies (see src/normalizer/hash.js and src/transformer.js at reworkAssetReference) (see v3.x-dev @archivator/archivable)
  - :white_check_mark: Do not download tracking images and/or ignore inline base64 images
- Read link list from different source list
  - RSS XML document
  - :white_check_mark: CSV file (defaults to archive/index.csv)
- :white_check_mark: Extract the main content for each article (see src/transformer.js at extractAssets) (see v3.x-dev @archivator/archivable)
- :white_check_mark: Export into simplified excerpt document (see src/transformer.js at markdownify) (see v3.x-dev @archivator/content-divinator)
- Add documents into a search index
- Make a stand-alone bundle using Rollup
- :white_check_mark: (incomplete) Make it usable as an external module (see renoirb/archivator-demo)
- :white_check_mark: Make it an NPM package
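
To illustrate the "do not download tracking images and/or ignore inline base64 images" objective, here is a minimal sketch of an asset filter. It is not the code from src/normalizer/assets.js: the function name `shouldDownload` and the `TRACKER_HOSTS` blocklist (with its placeholder host) are assumptions made up for this example.

```javascript
// Hypothetical blocklist; the hosts actually ignored by archivator are not documented here.
const TRACKER_HOSTS = new Set(['tracker.example.com']);

// Decide whether an image reference found in a page is worth downloading.
// `src` is the raw attribute value, `pageUrl` the URL of the page it came from.
function shouldDownload(src, pageUrl) {
  const value = String(src).trim();

  // Inline base64 (or any other data: URI) has nothing to fetch.
  if (/^data:/i.test(value)) {
    return false;
  }

  let resolved;
  try {
    // Resolve relative references against the page URL.
    resolved = new URL(value, pageUrl);
  } catch (err) {
    return false; // Not a parseable URL, skip it.
  }

  // Only plain HTTP(S) resources can be fetched.
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
    return false;
  }

  return !TRACKER_HOSTS.has(resolved.hostname);
}
```

A host blocklist is only one possible heuristic; checking for 1×1 pixel dimensions would be another.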

Use

Install production-only dependencies.

Assuming you have dist/ compiled (see Build below), and you deleted node_modules/.

npm install --only=production

Edit example.js and add more URLs (if you want).

node example.js

Run through Babel

yarn install

Create a folder archive/ and add an index file listing the pages to read and fetch.

The file is CSV, using a semicolon (;) as the separator. The fields are:

URL to read from
CSS selector to the main part of the content you want to keep
One or many CSS selectors (i.e. comma-separated, as CSS already supports) of elements you want removed from the archive (e.g. ads)

// file archive/index.csv
https://renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy/;article;
https://renoirboulanger.com/blog/2015/05/add-openstack-instance-meta-data-info-salt-grains/;article;
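
As a rough illustration of the index format above, the following sketch splits each line into its three fields. The names `parseIndex`, `parseIndexLine`, `selector` and `truncate` are assumptions for this example, not archivator's actual API, and skipping lines that start with `//` is also an assumption based on the sample file.

```javascript
// Split one "url;selector;truncate" line into named fields.
// Missing trailing fields default to an empty string.
function parseIndexLine(line) {
  const [url = '', selector = '', truncate = ''] = line
    .split(';')
    .map(field => field.trim());
  return { url, selector, truncate };
}

// Parse the whole index file content, ignoring blank lines and
// (by assumption) comment lines such as "// file archive/index.csv".
function parseIndex(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line !== '' && !line.startsWith('//'))
    .map(parseIndexLine);
}
```

Note that a plain split on `;` would break on a URL that itself contains a semicolon; a real CSV parser would be the safer choice for such input.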

Run fetcher

npm start

You should see the following in the terminal output

...
Archived renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy
Archived renoirboulanger.com/blog/2015/05/add-openstack-instance-meta-data-info-salt-grains
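
The names printed above look like the source URL reduced to host plus path. A minimal sketch of such a mapping, assuming that is all there is to it (the real logic lives in src/normalizer/slugs.js and may handle more cases; `urlToDirname` is a hypothetical name):

```javascript
// Turn a page URL into a relative directory name: hostname followed by
// the non-empty path segments. Scheme, query string and fragment are dropped.
function urlToDirname(pageUrl) {
  const { hostname, pathname } = new URL(pageUrl);
  const segments = pathname.split('/').filter(Boolean);
  return [hostname, ...segments].join('/');
}

console.log(urlToDirname('https://renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy/'));
// prints "renoirboulanger.com/blog/2015/05/converting-dynamic-site-static-copy"
```

Dropping the query string means two URLs differing only by their parameters would share a directory, which this sketch does not try to resolve.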

And you should see a few files getting created:

cache.html: the raw HTML file downloaded from the origin
cache.json: a JSON cache of metadata gathered during the process
index.md: the simplified article converted to Markdown
Files named with letters and numbers are images found in the document

archive/
 `-renoirboulanger.com/
   `-blog/
     `-2015/
       `-05/
         `-add-openstack-instance-meta-data-info-salt-grains/
           |- cache.html
           |- cache.json
           |- 5e6327f278a336349f8bb6b26163dabedb173bcd.png
           |- 881811befc2fa6ad9c8ec058e1be3bd231fdcc1f.png
           |- b69a780dc3278f5d86296d2f219821eeac385f20.jpg
           |- c0e21ae7f0a56374116f08b44087d07ab8710035.png
           |- c3d25fac5b0c573275b15822294e484097edd945
           |- cd5f2a6cfa00a45e755b07013e59cb7c03bb9826.jpg
           |- eb31cca43b832b0016a2211e6e0058b263f4a1c0.png
           |- f6c4338884f46d3942589fcc29611fa68b600bad.png
           |- index.md
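
The asset file names in the listing are 40 hexadecimal characters long, which matches the length of a SHA-1 digest. The sketch below shows one way such names could be produced. It assumes the hash is computed over the asset URL and that the extension is taken from the URL path; neither is confirmed by this README (see src/normalizer/hash.js for the real behavior), and `assetFileName` is a hypothetical name.

```javascript
const crypto = require('crypto');
const path = require('path');

// Build a stable local file name for a remote asset:
// SHA-1 hex digest of the URL, plus the lower-cased extension if the URL path has one.
function assetFileName(assetUrl) {
  const hash = crypto.createHash('sha1').update(assetUrl).digest('hex');
  const ext = path.posix.extname(new URL(assetUrl).pathname).toLowerCase();
  return hash + ext;
}
```

With this approach an asset whose URL has no extension ends up as a bare hash, like the extension-less file in the listing, and the same URL always maps to the same file name.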

Run tests

npm test

Run xo (coding convention linter)

npm run lint

Build

IMPORTANT: This is no longer supported and is broken; see the note in dist/README.md

Run in Node.js, as ECMAScript 5 transpiled code.

yarn install
npm run build

This should do the same as running npm start on a modern Node.js (v6+) with Babel:

node dist/cli.js

Challenge

Make an archiving system while learning how to use bleeding edge JavaScript.

Use ECMAScript 2017's async/await along with generators (function * (){ /* ... */ yield 'something'; })
Figure out how to export into ES5
Figure out how to package, test and so on
As few dependencies as possible for development
(Ideally) No dependencies to run once bundled
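
To make the first challenge item concrete, here is a small sketch of a generator feeding an async function. It is only an illustration of the two language features working together, not archivator's code: `entries`, `fakeFetch` and `archiveAll` are made-up names, and `fakeFetch` stands in for a real network request.

```javascript
// Generator: lazily yields one entry per URL.
function * entries(urls) {
  for (const url of urls) {
    yield { url };
  }
}

// Stand-in for a network fetch; resolves with a fake HTML payload.
async function fakeFetch(entry) {
  return `<html><!-- ${entry.url} --></html>`;
}

// Walk the generator and await each fetch in turn (sequentially).
async function archiveAll(urls) {
  const archived = [];
  for (const entry of entries(urls)) {
    const html = await fakeFetch(entry);
    archived.push({ url: entry.url, length: html.length });
  }
  return archived;
}

archiveAll(['https://example.org/a', 'https://example.org/b'])
  .then(result => console.log(result.map(item => item.url)));
```

Awaiting inside the loop keeps requests sequential; collecting the promises and using `Promise.all` would run them concurrently instead.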
