@webis-de/scriptor
v0.14.0
Webis Scriptor
Plug-and-play reproducible web analysis.
Scriptor runs your web analyses on rendered web pages in an up-to-date browser. It owes much of its power to the Playwright browser automation library, but integrates pywb's archiving and replay capabilities for provenance and reproducibility. Use cases are as diverse as high-fidelity web archiving, content extraction, and web user simulation.
Installation
Make sure you have both Docker and a recent NodeJS installation. If you do not want to install NodeJS, you can also run the Docker container directly.
# install packages to run './bin/scriptor.js':
npm install --omit=dev
# install into system path to run 'scriptor', may require sudo or similar:
npm install --global
# if scriptor cannot be found, set the node path (adjust to your system):
export NODE_PATH=/usr/local/lib/node_modules/
Quickstart
To run scriptor, you need permission to execute docker run.
Take a snapshot:
scriptor --input "{\"url\":\"https://github.com/webis-de/scriptor\"}" --output-directory output1
Use an input directory for more configuration options (e.g., configure the browser with all options of Playwright):
scriptor --input docs/example/snapshot-input/ --output-directory output2
Replace the default script with your own (see Developing Own Scripts):
scriptor --script-directory path/to/my/own/script --output-directory output3
Have a look at available features:
scriptor --help
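Escaping the JSON passed to --input by hand is easy to get wrong. As a minimal sketch, assuming Node.js is available and scriptor is on the PATH, the argument list for the snapshot command above could be built with JSON.stringify and handed to child_process.spawn, which bypasses shell quoting. The helpers buildSnapshotArgs and runSnapshot are hypothetical names and not part of Scriptor:

```javascript
const { spawn } = require('child_process');

// Hypothetical helper (not part of Scriptor): builds the argument list for
// the snapshot command shown above without any manual quote escaping.
function buildSnapshotArgs(url, outputDirectory) {
  return ['--input', JSON.stringify({ url }), '--output-directory', outputDirectory];
}

// Starts scriptor with inherited stdio; assumes 'scriptor' is on the PATH.
function runSnapshot(url, outputDirectory) {
  return spawn('scriptor', buildSnapshotArgs(url, outputDirectory), { stdio: 'inherit' });
}

// Example call (commented out so that loading this file has no side effects):
// runSnapshot('https://github.com/webis-de/scriptor', 'output1');
```

Passing the arguments as an array means the URL may contain quotes or spaces without breaking the command.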
Output Directory Structure
output/
├─ browserContexts/
|  └─ default/          # Shares the name of the browser context
|     ├─ userData/      # Browser files (cache, cookies, ...)
|     ├─ video/         # Recorded videos if --video is set
|     ├─ warcs/         # Recorded web archive collection
|     ├─ archive.har    # Recorded web archive in HAR format
|     ├─ browser.json   # Used browser context options
|     └─ trace.zip      # Playwright trace
├─ id.txt               # Hash of the directory to identify it
├─ input-id.txt         # Hash of the input directory (if any)
└─ logs/
   └─ scriptor.log      # Container log
Scripts usually place additional data into the output directory. For example, the default script adds a snapshot.
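Given the layout above, a post-processing step might want to know which run it is looking at and which browser contexts were recorded. The following is a minimal sketch that relies only on the file names shown in the tree; describeOutput is a hypothetical helper, not a Scriptor function:

```javascript
const fs = require('fs');
const path = require('path');

// Summarizes a Scriptor output directory, assuming the layout shown above.
// Files that are absent (e.g., input-id.txt) are reported as null.
function describeOutput(outputDirectory) {
  const readTrimmed = (name) => {
    const file = path.join(outputDirectory, name);
    return fs.existsSync(file) ? fs.readFileSync(file, 'utf8').trim() : null;
  };
  const contextsDirectory = path.join(outputDirectory, 'browserContexts');
  const contexts = fs.existsSync(contextsDirectory)
    ? fs.readdirSync(contextsDirectory, { withFileTypes: true })
        .filter((entry) => entry.isDirectory())
        .map((entry) => entry.name)
    : [];
  return { id: readTrimmed('id.txt'), inputId: readTrimmed('input-id.txt'), contexts };
}
```

For a run with a single browser, the contexts list would be expected to contain only "default".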
The warcs directory is created using pywb and thus follows its directory structure. Note that efforts exist to standardize this structure, and they are looking for feedback!
To view the trace.zip, see the Playwright docs or load it directly into the progressive web app.
Scriptor uses Bunyan for logging. The Bunyan CLI lets you filter and pretty-print the logs.
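Bunyan writes one JSON record per line with a numeric level field (30 for info, 40 for warn, 50 for error), so the log can also be filtered programmatically. This is a minimal sketch under the assumption that logs/scriptor.log contains such line-delimited records; readLogRecords is a hypothetical helper:

```javascript
const fs = require('fs');

// Bunyan's numeric level for "warn"; records at or above it are warnings,
// errors, or fatal errors.
const WARN = 40;

// Returns the parsed log records at or above minLevel. Lines that are not
// valid JSON (e.g., stray container output) are skipped.
function readLogRecords(logFile, minLevel = WARN) {
  return fs.readFileSync(logFile, 'utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .flatMap((line) => {
      try { return [JSON.parse(line)]; } catch (error) { return []; }
    })
    .filter((record) => record !== null
      && typeof record.level === 'number'
      && record.level >= minLevel);
}
```

Calling readLogRecords('output1/logs/scriptor.log') would then return only the warnings and errors of a run.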
Running on Archives (Replay)
Scriptor can be configured to use resources from web archives instead of the live web. Use --replay to restrict it to resources contained in the WARC files of the input or script directory. Use --replay rw to use these resources but allow falling back to the live web. Use --warc-input <warc-file-or-directory> to include resources from the specified file (or all files in a specified directory).
Developing Own Scripts
Create a Script.js and extend AbstractScriptorScript:
const { AbstractScriptorScript, files, pages } = require('@webis-de/scriptor');

module.exports = class extends AbstractScriptorScript {
  constructor() { super("MyScript", "0.1.0"); } // log script name and version
  async run(browserContexts, scriptDirectory, inputDirectory, outputDirectory) { }
}
The directory that contains your Script.js is called the "script directory": use the --script-directory option to specify it on the command line, and your script's run method will be used instead of the default script's. The script and input directories are read-only. Everything the script produces should be written to the output directory.
Controlling the Browser(s)
Each of the browserContexts is a Playwright BrowserContext object, roughly corresponding to a browser session. Your script can use the BrowserContext's newPage method to create a new Page (like a browser tab), which is the object used to open, read, and manipulate web pages. pages.js adds even more methods to this end.
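As a rough illustration of working with a BrowserContext, the sketch below opens a URL in a new page and writes the page title to the output directory. It uses only Playwright's newPage, goto, title, and close methods; saveTitle and the file name title.txt are hypothetical choices for this example:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper for use inside a script's run method: opens the URL in
// a new page of the given BrowserContext and stores the page title in the
// output directory.
async function saveTitle(browserContext, url, outputDirectory) {
  const page = await browserContext.newPage();
  try {
    await page.goto(url);
    const title = await page.title();
    fs.writeFileSync(path.join(outputDirectory, 'title.txt'), title + '\n');
    return title;
  } finally {
    await page.close(); // release the tab even if navigation failed
  }
}
```

Inside run, this could be called as await saveTitle(browserContexts["default"], url, outputDirectory), where url comes from the script's own configuration.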
If the script uses a single browser (the usual case), the run method should start with
const browserContext = browserContexts["default"];
which gets a browser context configured using the browserContexts/default/browser.json files in the script and input directory (specified by --input) if they exist. The following configuration precedence applies (lowest to highest): defaults < script directory browser.json < input directory browser.json < scriptor command line options (e.g., --show-browser). In addition to Playwright's options, the browserType option specifies which browser to use: "chromium" (default), "firefox", or "webkit".
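The precedence order can be pictured as a merge in which later sources overwrite earlier ones. The sketch below only illustrates that order; it is not Scriptor's merging code, which may treat nested options differently, and the option values (including the idea that --show-browser corresponds to headless: false) are assumptions made for the example:

```javascript
// Illustration of the precedence order only; Scriptor's actual merging code
// may differ (for example, in how nested options are combined).
function mergeBrowserOptions(defaults, scriptOptions, inputOptions, commandLineOptions) {
  // Later spreads win, so sources are listed from lowest to highest priority.
  return { ...defaults, ...scriptOptions, ...inputOptions, ...commandLineOptions };
}

const merged = mergeBrowserOptions(
  { browserType: 'chromium', headless: true },   // defaults (assumed values)
  { viewport: { width: 1280, height: 720 } },    // script directory browser.json
  { browserType: 'firefox' },                    // input directory browser.json
  { headless: false }                            // command line (assumed mapping)
);
console.log(merged);
```

In this example the input directory's browserType overrides the default, the command line overrides headless, and the script directory's viewport survives because no higher-priority source sets it.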
Place directories inside browserContexts to receive correspondingly named browser contexts in run's browserContexts parameter. An output directory is created for each browser context.
Configuring the Script
Most scripts have parameters, which should be specified in a config.json in the input directory, or by other options of --input. A config.json in the script directory can be used to specify defaults, though these could also be specified in the script's code. The recommended way to read the JSON files is:
const defaultScriptOptions = { ... };
const requiredScriptOptions = [ ... ];
const scriptOptions = files.readOptions(
  files.getExisting("config.json", [ scriptDirectory, inputDirectory ]),
  defaultScriptOptions, requiredScriptOptions);
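To make the intended layering concrete, here is a plain Node.js stand-in that mimics what the text describes: defaults first, then config.json from the script directory, then config.json from the input directory, followed by a check for required options. It is explicitly not the implementation of files.readOptions or files.getExisting, whose exact behavior may differ; readOptionsSketch is a hypothetical name:

```javascript
const fs = require('fs');
const path = require('path');

// Stand-in for illustration (NOT Scriptor's files.readOptions): reads
// fileName from each directory that has it, lets later directories override
// earlier ones, and fails if a required option is still missing.
function readOptionsSketch(fileName, directories, defaults = {}, required = []) {
  const options = { ...defaults };
  for (const directory of directories) {
    if (!directory) { continue; } // e.g., no input directory was given
    const file = path.join(directory, fileName);
    if (fs.existsSync(file)) {
      Object.assign(options, JSON.parse(fs.readFileSync(file, 'utf8')));
    }
  }
  const missing = required.filter((name) => options[name] === undefined);
  if (missing.length > 0) {
    throw new Error('Missing required options: ' + missing.join(', '));
  }
  return options;
}
```

Failing early on missing required options keeps a misconfigured run from starting a browser session it cannot use.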
Return Value and Chaining
If a run can be continued from its output with the same or a different script (see "chaining"), the script's run method should return true, like the default script. By default, Scriptor stores the browser state (for each browser context) in the output directory so that it is loaded automatically when that output directory is used as the input directory for a new run. As a developer, you only have to make sure that you store all the input files for your script (updated, if necessary) at the same location in the output directory. Chaining is intended to create "checkpoints" from which to continue after a crash or to serve as intermediate archives. Note that a script may return true in some cases and false in others.
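Storing the input files at the same location in the output directory can be as simple as copying them over before returning true. The sketch below assumes the script knows the relative names of its input files (config.json is used as an example); carryOverInputFiles is a hypothetical helper and not part of the Scriptor API:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: copies the named files from the input directory to
// the same relative location in the output directory. Returns the names that
// were copied; files missing from the input are skipped.
function carryOverInputFiles(inputDirectory, outputDirectory, relativeNames) {
  const copied = [];
  for (const name of relativeNames) {
    const source = path.join(inputDirectory, name);
    if (!fs.existsSync(source)) { continue; }
    const target = path.join(outputDirectory, name);
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.copyFileSync(source, target);
    copied.push(name);
  }
  return copied;
}
```

A run method could call carryOverInputFiles(inputDirectory, outputDirectory, ["config.json"]) as its last step and then return true. If the script changes its options during the run, it would write the updated file instead of copying the original.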
Scriptor API
Scriptor provides several static functions to assist you with manipulating Playwright pages or when dealing with the Scriptor directory structure. See the API documentation.
Chaining
Usually, the output directory of a Scriptor run can serve as the input directory for the next run (as indicated by the script's return value; see Developing Own Scripts). To automate such chaining, use --chain [name] to create the series of output directories within --output-directory. A JSON file in the --output-directory (identified by name) is continuously updated to point to the last successful run and is read on start-up, so that you can execute the same scriptor command to continue from the last successful run if the chain aborted for some reason.
Manual Browser Interaction
Scriptor allows for manual interactions with the browser, which can be useful to set cookies or similar. Specifically, the --show-browser option lets scripts use the page.pause method, which pauses the script until the user hits the resume button in the dialog that pops up. The same dialog can also record interactions as JavaScript code. For such simple use cases, the Manual script can be used: it contains (in essence) only the call to pause.
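A script step built around page.pause might look like the sketch below. It is not the code of the Manual script; runManualStep and startUrl are hypothetical names, and the sketch assumes Scriptor was started with --show-browser so that the pause dialog is actually shown:

```javascript
// Sketch of a run-method step for manual interaction. Assumes Scriptor was
// started with --show-browser; startUrl is a hypothetical parameter.
async function runManualStep(browserContext, startUrl) {
  const page = await browserContext.newPage();
  await page.goto(startUrl);
  await page.pause();   // blocks until the user resumes from the dialog
  return page.url();    // the address the user ended up on
}
```

Returning the final URL gives the calling script a simple way to record where the manual session ended.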
Since Scriptor runs in a container, it cannot directly open the browser window on your machine. Instead, it runs a VNC server inside the container that you can connect to with a VNC client at localhost:5942 to see the browser window. Depending on your operating system, you might already have a VNC client installed. If not, VNC Viewer is available for all major operating systems. The config options of --show-browser let you change the width and height of the virtual display, change the port, allow remote access, and set a password. See --help.
If you want to run Scriptor on one machine and interact with it from another machine, make sure to read how to use x11vnc (Scriptor uses x11vnc as its VNC server), especially the sections on how to encrypt your traffic. By default, however, the Scriptor docker container is configured to accept only connections from the machine it is started on.
Running without NodeJS
At the cost of reduced convenience (timeout, nicer interface), you can run Scriptor with only a Docker installation:
docker run -it --rm \
--volume <script-directory>:/script:ro \
--volume <input-directory>:/input:ro \
--volume <output-directory>:/output \
ghcr.io/webis-de/scriptor:latest <parameters>
- <script/input/output-directory> are the absolute paths to the respective directories
- The <script-directory> line can be omitted to run the Snapshot script
- The <input-directory> line can be omitted to not set --input or when the config is set by --input "{...}" in the <parameters>
- The <parameters> are additional options; see docker run -it --rm ghcr.io/webis-de/scriptor:latest --help
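The rules in the list above can be captured in a small helper that assembles the docker run arguments and leaves out the optional volumes. This is a minimal sketch, assuming Node.js is used only to launch Docker; buildDockerArgs is a hypothetical helper, and path.resolve is used to turn relative paths into the absolute ones Docker requires:

```javascript
const path = require('path');

const IMAGE = 'ghcr.io/webis-de/scriptor:latest';

// Hypothetical helper: assembles the 'docker run' arguments listed above.
// scriptDirectory and inputDirectory are optional; parameters are passed on
// to Scriptor unchanged.
function buildDockerArgs({ scriptDirectory, inputDirectory, outputDirectory, parameters = [] }) {
  const args = ['run', '-it', '--rm'];
  if (scriptDirectory) {
    args.push('--volume', path.resolve(scriptDirectory) + ':/script:ro');
  }
  if (inputDirectory) {
    args.push('--volume', path.resolve(inputDirectory) + ':/input:ro');
  }
  args.push('--volume', path.resolve(outputDirectory) + ':/output');
  return [...args, IMAGE, ...parameters];
}
```

The result can be passed to require('child_process').spawn('docker', args, { stdio: 'inherit' }). Note that the -it flags expect the command to be started from a terminal.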
Chaining can also be used without NodeJS. However, the Docker container exits after a single run (by design). Use the same command to continue the chain.