atrica
v1.0.2
Published
library that helps you building a web-crawler, espcially for web-privacy measurements
Downloads
4
Readme
Atrica
Atrica is a set of tools, built on top of google-chrome pupeteer API, meant to help you write crawlers, especially for web privacy measurement. It lets you start, control, and gather data from many popular browsers (Chrome, Opera, (and other chromium-based browsers such as the next version of Edge), Firefox, TorBrowser, etc...).
This project features:
- Puppeteer's API for chrome, and a part of that API for WebExtension-compatible browsers.
- This API is enriched with other functionalities such as browser.cookies() that returns a list of all the cookies, browser.clearAllBrowsingData() that remove cookies, localstore, cache, etc... and browser.evaluate() that lets you execute a function with access to the WebExtension API.
- A simple profile manager that lets you start a clean browsing session, install extensions into it, save it, clone it, delete it.
- A flexible logger that can save requests and response in a sqlite database. You can customize the schemas and create your own tables.
- A crawler builder that helps you create a reliable crawler that lets you start several browsers concurrently and restart them if they crash or become unresponsive.
This repository is looking for a maintainer.
Table of contents
Installation
- Install the lastest version of nodejs (atrica has not been tested with node < 12.x)
- Create a new folder for your project:
mkdir my-crawler cd my-crawler
- Initialize your project and install atrica:
npm init npm install atrica
- Write a script using atrica (See guide below. You can start with Basic example)
- Run the script:
node script.js
Warning: this project depends on puppeteer which will automatically download chromium-browser. It's about 100Mo so it might take a while on slow connections.
Documentation
The documentation can be found here: https://fukuda-lab.github.io/atrica/
Guide
This is a step by step guide with code and explainations.
The full codes can be found in the examples
folder.
Warning: Puppeteer and atrica heavily rely on javascript's await/async feature.
You should first learn how to use it.
Basic example
In this example we create a very simple crawler, launching chrome and have it visit wikipedia and other wikimedia websites.
Hopefully the following code is self explainaroty, but here are a few remarks:
- In the first line of the main function, we create a new profile. The 'browser' parameter can take two values:
- "chromium" will create a profile for a chromium-based browser (so it will also work with chrome, opera, vivaldi, Edge, etc...)
- "firefox" will create a profile for a Firefox-based browser (will work with TorBrowser, etc...)
- See next subsection for more details on profiles
- In puppeteer, a 'page' is actually a browser tab, so browser.newPage create a new tab.
// examples/basic.js
const atrica = require("atrica");
async function main() {
const profile = await atrica.profile({ browser: "chromium" });
const browser = await atrica.launch(profile);
const domains = ["wikipedia.org", "wiktionary.org", "wikiquote.org"];
for(let domain of domains) {
const url = `https://${domain}`;
const page = await browser.newPage();
await page.goto(url);
await atrica.utils.sleep(1);
await page.close();
}
await browser.close();
process.exit();
}
main();
Profiles
All the code in this subsection is an extract from examples/profiles.js
(and examples/profile-tor.js
).
You are encouraged to also read this file to get the whole picture.
Browser binary path
If you create a profile with the minimum options like this:
const profile = await atrica.profile({ browser: "chromium" });
then atrica will use the chromium version downloaded with puppeteer.
If you want to use firefox, or a different chromium-based browser, you can
use the binary
option, which expect the full path to the browser binary file:
const operaProfile = await atrica.profile({
browser: "chromium",
binary: "/snap/bin/opera",
});
Reusing profiles
By default, the profiles are saved in your OS temporary folder (/tmp for linux)
You can use the path
option to choose where the profile will be saved:
const trainingProfile = await atrica.profile({
browser: "firefox",
binary: "/usr/bin/firefox",
path: "./privacy-badger-profile"
});
Then later in an other script you can load that profile. However keep in mind that if you launch a browser with a profile, that profile will be modified by the browser. Therefore you should copy the profile first if you don't want to modify the original profile.
// Loading the trained profile.
const trainedProfile = await atrica.profile({
browser: "firefox",
binary: "/usr/bin/firefox",
path: "./privacy-badger-profile"
});
// We don't want to modifiy trainedProfile, we might need it later
// so we make a copy of it (stored in /tmp/xxxxxx)
const trainedProfileCopy = await trainedProfile.copy();
Extensions
Downloading extensions
Atrica has utility functions to automatically download extensions and put them in a cache. To use those functions you need the id of the extension, but you can easily extract it from the url of the extension in the chrom web store:
// For example if you find ublock origin in the webstore at the following url:
let url = "https://chrome.google.com/webstore/detail/ublock-origin/cjpalhdlnbpafiamejdnhcphjbkeiagm?hl=en";
// Then the id is:
let id = "ublock-origin/cjpalhdlnbpafiamejdnhcphjbkeiagm";
// For mozilla addons store, if you find pribacy-badger at the follow url:
let url2 = "https://addons.mozilla.org/en-US/firefox/addon/privacy-badger17/";
// Then the id is:
let id2 = "privacy-badger17";
Then you can download the extension with:
// Chromium-based browsers
const ublockPath = await atrica.profile.getChromeExtension({
id: "ublock-origin/cjpalhdlnbpafiamejdnhcphjbkeiagm"
});
// Firefox-based browsers
const privacyBadger = await atrica.profile.getFirefoxExtension({
id: "privacy-badger17",
name: "privacy-badger" //Optional
});
Installing extensions
Once you have downloaded the extensions you can install them in the profile. ** However with this method you can only install them TEMPORARLY (except for Firefox)** This means that all the data and configuration related to the extension will be lost if the browser restart.
const operaProfile = await atrica.profile({
browser: "chromium",
binary: "/snap/bin/opera",
// Works with both firefox and chromium
extensions: [ublockPath]
});
However with Firefox (or TorBrowser) you can install the extensions permanently, like this:
const firefox = await atrica.launch(trainingProfile);
// Firefox-based browsers only
firefox.install(privacyBadger);
Unfortunately, to my knowledge, chromium does not offer any way to install an extension permanently programmatically. However you can quickly and easily do it manually (see next paragraph)
Manual extension installation and configuration
Atrica does not offer any API to automatically configure an extension. However you can do it manually, like this:
# For firefox : create a profile at /home/user/firefox-manual-profile
# Then you can use firefox graphical user interface to install and configure
# all the extensions you want, and it will be saved in the firefox-manual-profile folder
firefox -no-remote -profile=/home/user/firefox-manual-profile
# Same thing for chromium-based browsers
chromium-browser --user-data-dir=/home/user/chromim-manual-profile
WARNING : the profiles (especially firefox's profiles) contains hard-coded absolute paths, so simply copying a profile folder from a machine to another might not work you can maybe replace the old paths with the new one with tools like sed, however, for better portabilty you should probably use Docker and its virtual file system (if you need to share a profile with manual configuration).
Cleaning a profile
If you want to keep an extension configuration but remove traces of previous browsing in the profile (cookies, localstorage, cache, etc...) you can use the following methods:
// Remove cookies, localStorage, cache
await browser.clearAllBrowsingData();
// Remove only what you specify in the options
// See documentation for more information
await browser.clearBrowsingData(/* options */)
Headless mode and other puppeteer's options
Puppeteer's options
You can set puppeteer options using the options
attribute in the profile's options :
const profile = await atrica.profile({
browser: "chromium",
// Works for both chromium and firefox
options: {
defaultViewport: {
width: 1920,
height: 1080
}
}
});
const browser = await atrica.launch(profile);
The availables options for chromium are documented here and those for firefox are documented here
Headless mode
By default, atrica do not launch the browser in headless mode. That is because chromium do not support headless mode with extensions. Moreover Atrica is using an extension in the background for firefox support and to enrich puppeteer API. If you are using firefox only, then there is no problem, firefox support headless mode with extension.
However if you are using chrome you have three options :
- Using chrome without headless mode with atrica enabled and other extensions
- Using chrome with headless mode with atrica disabled. In that case:
- You cannot use the enriched API (browser.clearAllBrowsingData, browser.cookies, browser.evaluate)
- You cannot load extensions
- On linux you can install
xvfb
, which creates a virtual screen and run chromium inside it with atrica and extensions enabled. Very usefull if you are on a linux server. In that case you can just start your script like this:# Launch crawler in a virtual screen xvfb-run -a node crawler.js
You can enable headless mode like this:
// For chromium
const profile = await atrica.profile({
browser: "chromium",
options: { headless: true }
});
// You then need to disable atrica extension
const browser = await atrica.launch(profile, {atrica: false});
// For firefox
const fprofile = await atrica.profile({
browser: "firefox",
options: { headless: true }
});
// You don't need to disable atrica for firefox
const firefox = await atrica.launch(fprofile);
TorBrowser
If you want to use Atrica with TorBrowser you must be aware of a few things: TorBrowser is a modified version of Firefox that comes together with a default profile. This default profile has a few extensions installed by default:
- HTTPS everywhere
- NoScript
- TorButton
- TorLauncher: That last extension is what helps you configure the tor proxy before launching firefox
That profile is located at /path-to-tor-browser/Browser/TorBrowser/Data/Browser/profile.default
.
We are going to use this profile as a starting point.
TorBrowser with Tor Proxy
If you want to use TorBrowser with the tor proxy and atrica you will have to change the default profile. You might want to make a copy of the default profile first. By default, all network traffic go through the Tor proxy. However we want the atrica extension installed in TorBrowser to be able communicate with the atrica server in the nodejs process. Only then can we control the browser. So we have to add an exception to the proxy rules for localhost.
- Launch TorBrowser
- Open the Preferences and type "proxy" in the search bar
- Click on "Settings..."
- In the "No proxy for" (the textarea at the bottom) type "127.0.0.1"
You can now use with atrica.
You have an example in examples/tor-profile.js
, but you need to remove env: { TOR_TRANSPROXY: "1" }
TorBrowser without Tor Proxy
The Tor proxy considerably slows down the crawling, so if you're only interested in the TorBrowser's privacy performance
- You need to disable TorLauncher extension in the default profile
- Launch tor browser
- about:addons in url bar
- disable TorLauncher
- You need to set an environment variable
TOR_TRANSPROXY=1
when launching tor You can do that with theenv
option in atrica.profile:let profile = await (await atrica.profile({ browser: "firefox", path: defaultProfilePath, binary: binaryPath, env: { TOR_TRANSPROXY: "1" }, })).copy();
You have an example in examples/tor-profile.js
Logger
Atrica has a logger tool that helps you save the requests, responses and cookies for targetted tabs.
Logger Basic example
Here is a basic example of how to use the logger API:
// examples/basic-logger.js
const atrica = require("atrica");
async function main() {
const profile = await atrica.profile({ browser: "chromium" });
const browser = await atrica.launch(profile);
const domains = ["wikipedia.org", "wiktionary.org", "wikiquote.org"];
// Create the logger, saves results in './crawl-results'
const logger = await atrica.logger(browser, './crawl-results');
for(let domain of domains) {
const sessionName = domain;
const session = await logger.newSession(sessionName);
const url = `https://${domain}`;
const page = await browser.newPage();
session.listen(page);
await page.goto(url);
await atrica.utils.sleep(1);
await page.close();
await session.saveCookies();
await session.close();
await browser.clearAllBrowsingData();
}
await browser.close();
process.exit();
}
main();
First we create the logger, which will save the resulst in the "./crawl-results" folder.
const logger = await atrica.logger(browser, './crawl-results');
The folder has the following structure:
- database.sqlite : a sqlite database, containing the requests, responses, cookies, etc...
- files : a folder containing the files
Then we create a session. Sessions allow you to group requests, responses and cookies. In the example above, we create one session per domain.
const session = await logger.newSession(sessionName);
Then we need to tell the session which page should be listened. The session will only log requests, responses, etc... from pages who are being 'listened' :
session.listen(page);
Then you can save all the cookies in the browser in the session with:
await session.saveCookies();
You need to call session.close() to make sure all requests and responses are properly saved in the database
await session.close();
Database structure
- sessions : a table containing the session
| id | name | | :---: | :----: | | int | string |
- pages : everytime the main frame of a tab change location, a new row is added in this table
| id | url | sessionId | requestId | | :---: | :----: | :-------: | :-------: | | int | string | int | int |
- requests : represent an HTTP request sent by the browser
| id | url | method | headers | resourceType | pageId | prevId | nextId | sourceId | | :---: | :----: | :----: | :----------: | :----------------------: | :----: | :----: | :----: | :------: | | int | string | string | string(json) | string(image,script,...) | int | int | int | int |
The prevId, nextId and sourceId are used to represent a redirection chain.
sourceId is the id of the first request in the chain, while prevId and nextId are the ids
of the previous and next requests.
- responses : represent the responses to an HTTP request
| id | status | headers | bodySize | bodyLocation | requestId |
| :---: | :----: | :-----------: | :--------------: | :------------------------: | :-------: |
| int | int | string (json) | int (# of bytes) | string (in files
folder) | int |
- cookies : represent the cookies in the session
| id | name | value | domain | hostOnly | path | secure | httpOnly | sameSite | isSession | expirationDate | storeId | sessionId | | :---: | :----: | :----: | :----: | :------: | :----: | :----: | :------: | :------: | :-------: | :------------: | :-----: | :-------: | | int | string | string | string | bool | string | bool | bool | string | bool | string | int | int |
See MDN cookie documentation for more details
Custom data models
Atrica uses sequelize as an ORM to communicate with sqlite, and use it to let you easily customize the database schema.
The default models are : Request
, Response
, Session
, Page
, Cookie
You can customize those models, or create your own.
Here is an example :
// from examples/logger-custom.js
const logger = await atrica.logger(browser, "./custom-crawl-results", {
Request: {
schema: ({ STRING }) => ({
frameType: STRING
}),
onRequest: (values, request, context) => {
// the session name is the domain in this example (see below)
const domain = context.session.name;
if (request.frame() === context.page.mainFrame()) {
values.frameType = "main";
} else if (isFirstParty(request.frame().url(), domain)) {
values.frameType = "1st_party_iframe";
} else {
values.frameType = "3rd_party_iframe";
}
}
},
Script: {
schema: Sequelize => ({
first_party_scripts: Sequelize.INTEGER,
third_party_scripts: Sequelize.INTEGER
}),
relations: (Script, models) => {
let Session = models.Session;
Session.hasMany(Script, { as: "scripts", foreignKey: "sessionId" });
Script.belongsTo(Session, {
as: "session",
foreignKey: "sessionId"
});
},
onSession: (values, session) => {
session.custom_props = {
first_party_scripts: 0,
third_party_scripts: 0
};
session.on("close", async () => {
session.custom_props.sessionId = session.model.id;
await session.logger.models.Script.create(session.custom_props);
});
},
onRequest: (values, request, context) => {
const session = context.session;
const domain = context.session.name;
if (request.resourceType() === "script") {
if (isFirstParty(request.url(), domain)) {
session.custom_props.first_party_scripts++;
} else {
session.custom_props.third_party_scripts++;
}
}
}
}
});
The third parameter of logger must have the following structure:
{
Model1: {
schema: (Sequelize) => ({attr1: Sequelize.STRING(100), ...})
relations: (Model1, models) => {Model1.belongsTo(models.?); Model1.hasMany(...)}
onSession: (values, session) => {...},
onPage: (values, request, context) => {},
onRequest: (values, request, context) => {...},
onResponse: (values, response, context) => {...},
onResponseBody: (body, values, response, context) => {...}
},
Model2: {...},
...
}
The schema function must describe the schema of the table.
If the model already exists, then the property returned by the function will be added to the schema, otherwise it will create a new table with the schema returned.
So in our example above, we are adding a frameType
column of type STRING to our table.
Then in the onRequest
we are adding frameType
to the values that will be sent to the database.
So in onSession
the values
parameter are the values that will be sent to the database to create a new row in the sessions
table, in onPage
it's for the pages
table, in onRequest
for the requests
table, in onResponse
for the responses
table and in onResponseBody
also for the responses
table.
In our example above, we are also creating a new scripts
table which, for each session, count
the number of first-party scripts, and third-party scripts.
Then we are creating a foreign-key in the scripts
table to associate it to the sessions
table.
Javascript is a flexible script language and lets you add attributes to any object at any time.
Therefore, in onSession we are adding a custom_props attribute to keep our scripts counters.
Then we use the session's close
event to create a new row in the scripts
.
You are responsible for inserting new rows in your custom tables, with Model.create or Model.bulkCreate.
Note that a large number of insertion in the database may afect the performance of the crawler, so you are
encouraged to use Model.bulkCreate when you can.
Read the documentation of sequelize for more information on how to create models.
Crawler
In the previous examples we used a simple loop in order to crawl the different urls. But when crawling the web, we may stumble upon some poorly designed or malicious website that might make the browser crash or make it unresponsive. We might also want to use several browsers at once in order to speed-up the crawling. Atrica has a Crawler class in order to make it easier to handle those issues.
const atrica = require("atrica");
// tasks
const domains = [
"wikipedia.org",
"wiktionary.org",
"wikiquote.org",
"wikibooks.org"
];
async function setup(isRestarting, workerId) {
const profile = await atrica.profile({ browser: "chromium" });
const browser = await atrica.launch(profile);
const logger = await atrica.logger(browser, "./crawling-result");
// return context
return [browser, logger];
}
// worker(task, context)
async function worker(domain, [browser, logger]) {
const url = `https://${domain}`;
const sessionName = domain;
const session = await logger.newSession(sessionName);
const page = await browser.newPage();
session.listen(page);
// Max time
await page.goto(url, { timeout: 30000 }).catch(console.error);
await page.close();
await session.close();
}
async function cleanup(domain, [browser, logger]) {
await browser.close();
// If the brower failed during this session, we delete it
await logger.deleteSession(domain);
}
async function main() {
const crawler = await atrica.crawler();
crawler.setup(setup);
crawler.tasks(domains);
crawler.worker(worker);
crawler.timeout(40)
crawler.concurrency(2); // Launching 2 browsers
crawler.cleanup(cleanup);
await crawler.crawl();
process.exit();
}
main();
First we define a setup
function. Its role will be to setup a worker or in other words,
launching the browser and creating a logger. It must return a context
object, that
will be passed as argument to the worker
function.
In this example we set concurrency to 2
. This will create 2
workers.
Each worker will call the setup
function with their own workerId
, and thus, two browsers will be launched.
Then each worker will take a task on the task stack, and call worker
with that task and the context returned by the setup
function as parameters. Each worker will repeat this process until there is no task left on the stack.
If there is an error during the execution of the worker
function, or if it takes more than a certain time (set with crawler.timeout(duration)
), then we call the cleanup
function followed by the setup
function again to restart the browser. Finally we call the worker
function for the same task and with the new context created by setup
.
Advanced
Executing scripts
Puppeteer lets you execute scripts in a given page with the page.evaluate
method (see puppeteer's documentation for more details).
Atrica enrich puppeteer API and provide a browser.evaulate
method, working in a similar way,
but that execute the provided function inside atrica's extension, giving you access to
Chrome Extensions APIs and Firefox WebExtension APIs (similar to Chrome Extensions APIs but with a few changes)
Here is an example:
// examples/evaluate.js
const atrica = require("atrica");
async function main() {
const profile = await atrica.profile({ browser: "chromium" });
const browser = await atrica.launch(profile);
const domains = ["wikipedia.org", "wiktionary.org", "wikiquote.org"];
const getHistory = () => {
return new Promise(resolve => {
chrome.history.search({ text: "" }, resolve);
});
};
for (let domain of domains) {
const url = `https://${domain}`;
const page = await browser.newPage();
await page.goto(url);
// Native puppeteer feature
let title = await page.evaluate(() => window.document.title);
console.log(title);
await page.close();
}
// Using atrica browser.evaluate method to access Chrome Extensions API
let history = await browser.evaluate(getHistory);
console.log(history);
await browser.close();
process.exit();
}
main();
Using DevTools protocol with chromium-based browsers
Puppeteer lets us use the Chrome Devtools Protocol directly, allowing us to do advanced stuff. Here is a tutorial explaining how to do that. A detailed documentation of the Devtools protocol is available here Here is an example with atrica:
// examples/devtools-protocol.js
const atrica = require("atrica");
async function main() {
const profile = await atrica.profile({ browser: "chromium" });
const browser = await atrica.launch(profile);
const domains = ["wikipedia.org", "wiktionary.org", "wikiquote.org"];
for (let domain of domains) {
const url = `https://${domain}`;
const page = await browser.newPage();
// Chrome Devtools Protocol client
const client = await page.target().createCDPSession();
await client.send("Network.enable");
let wsTransferedBytes = 0;
client.on(
"Network.webSocketFrameReceived",
({ response: { opcode, mask, payloadData } }) => {
if (opcode == 1) wsTransferedBytes += payloadData.length;
else wsTransferedBytes += Math.floor(payloadData.length * (6 / 8));
}
);
await page.goto(url);
await atrica.utils.sleep(1);
// An alternative way to get the cookies on chrome if you
// want to disable atrica's extension to enable headless mode
let { cookies } = await client.send("Network.getAllCookies");
console.log(`After visiting ${domain}:`);
console.log(` - Cookies: ${cookies.length}`);
console.log(` - Bytes transfered using websockets: ${wsTransferedBytes}`);
console.log();
// Removing all cookies from browser
await client.send("Network.clearBrowserCookies");
await page.close();
}
await browser.close();
process.exit();
}
main();
In this example we create a client for each tab (page) and we use it to get
all the cookies in the browser and then to delete them.
We also use the event Network.webSocketFrameReceived
to calculate the amount
of data transfered during the browsing.