dbay-vogue
v0.2.0
Published
website scraper
Downloads
1
Readme
Web Site Scraper
Table of Contents generated with DocToc
Web Site Scraper
- Vogue
v:
Vogue
is the so-called 'hub'; the main object to be instantiatedv.scrapers:
Vogue_scrapers
v.scrapers.d:
{ [dsk]: Vogue_scraper, }
v.vdb:
Vogue_db
v.server:
Vogue_server
v.scheduler:
Vogue_scheduler
Vogue_ops
slim on-page script (renders trendlines &c)Vogue_telegraph
for websocket-based client/server communication- based on Datom XEmitter
Vogue Scraper
Scraped Data Format
Mandatory and Standard Items
Freeform Items
- field
details
must be a JSONifiable object (preferably a flat one) - can contain further items as seen fit
- will be rendered as a list of named entries in a format determined by CSS
Vogue_scraper
has prototype methodhtml_from_details()
to turn contents ofdetails
field into HTML; can be overridden for individual purposes
- field
Vogue Scheduler
Construction
schedular.add_interval: ( cfg ) ->
- Mandatory properties of
cfg
:task: function | asyncfunction
: the task to run. It will be called without arguments.repeat: duration
: how often to repeat a task.
- Optional properties of
cfg
:jitter: duration | percentage
(defaultnull
meaning0 seconds
): how much to randomly vary the repetition rate. Example: whenrepetition
is1 hour
,jitter
is5 minutes
and the task has been first started at00:00
, then it will be repeated somewhere between00:55
and01:05
. A jitter of25%
would equal15 minutes
in this case.timeout: duration
(default:null
meaning no timeout): how long to wait for a task to finish. When a task has been stopped because of timeout, the next session will be started following the same rules as if it finished normally.pause: duration
(default:null
meaning0 seconds
): specifies the minim time between the finishing of one session and the earliest allowed start of the next one, taking jitter into account. Example: When a task has been scheduled with{ repeat: '10 minutes', jitter: '3 minutes', pause: '5 minutes', }
, is first run at00:00
and took 9 minutes until00:09
to finish, it would—withoutpause
orjitter
—be scheduled to run next at00:09
(i.e. immediately). However, we have to wait at least until00:09 + 5 minutes
to account for the pause, and because a non-zerojitter
could cause the start to be that much earlier, we have to postpone until00:09 + 5 minutes + 3 minutes = 00:17
.Note that for simplicity, we have only used minutes and hours in the above, which in reality could have the undesirable effect that a task scheduled to have a pause of
1 minute
finishes at00:59:59
only to be run again at01:00:00
—just a second later. To avoid this, durations are internally reckoned in milliseconds, so1 minute
really means60,000 milliseconds
and1 hour
equals3,600,000 milliseconds
.In case a session runs longer than its
repeat
duration,
- Mandatory properties of
0|0———0|1———0|2———0|3———0|4———0|5———0|6———0|7———0|8———0|9———1|0———1|1———1|2———1|3———1|4———1|5———1|6———1|7———1|8———1|9
session pause jitter
|—————————————————————————————————————————————————————|=============================|~~~~~~~~~~~~~~~~~|
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat jitter interval for start of next session
|+++++++++++++++++|+++++++++++++++|
session pause jitter
|—————————————————|=============================|~~~~~~~~~~~~~~~~~| jitter
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat interval for start of next session
|+++++++++++++++++|+++++++++++++++|
session pause jitter
|—————|=============================|~~~~~~~~~~~~~~~~~| jitter
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::|~~~~~~~~~~~~~~~~~|
repeat interval for start of next session
|+++++++++++++++++|+++++++++++++++|
Concurrent Writes
- any number of sessions may be active at any given point in time
- for a number of reasons, it would be nice if no partial results from any session would appear in DB
- the solution for this would usually be to wrap each session into a DB transaction
- however, SQLite does not allow concurrent writes and sessions can take a long time (e.g. because the network is slow); this risks transaction timeouts which in turn call for logic to handle this failure mode
- on the other hand the expected number of rows to be inserted by any session is low (not more than a few dozens up to hundreds of rows), so at least the performance gain from a transaction-per-session model should be negligible
- another alternative is to queue rows in memory and pass them to a single writer method that does an atomic insert
- yet another possibility is to use implicit (or explicit, but short-duration) transactions and allow partial results
- in case each session also registers its start and finish timestamps we can always detect incomplete sessions and treat them accordingly (hide them or add notification)
- the 'random inserts with implicit short transactions' (RIWIST) model then seems to be the simplest way to deal with concurrent sessions
- we won't get consecutive row numbers with RIWIST, but that should not pose a problem
- since each insert is done immediately instead of being deferred, we get a chance to grab the (possibly
updated) row with
insert ... returning *;
statements, so that is a plus over a deferred model should we ever need that data.
Relevant Data Types
(absolute) duration
: a string spelling out a duration using a float literal and a unit, separated by a space. Allowed units areweek
,day
,hour
,minute
,second
, all in singular and plural. Examples:'1.5 seconds'
,'1e2 minutes'
,'1 week'
.percentage (for relative durations)
: a string spelling out a percentage with a float literal immediately followed by a percent sign,%
. Examples:'4.2%'
,0%
. The amount must be between0
and100
.
Async Primitives
These 'five letter' methods are available both on the Vogue_scheduler
class and its instances; for many
people, the most strightforward way to understand what these methods do will be to read the very simple
definitions and recognize they are very thin shims over the somewhat less convenient JavaScript methods:
every: ( dts, f ) -> setInterval f, dts * 1000
after: ( dts, f ) -> setTimeout f, dts * 1000
cease: ( toutid ) -> clearTimeout toutid
sleep: ( dts ) -> new Promise ( done ) => setTimeout done, dts * 1000
defer: ( f = -> ) -> await sleep 0; return await f()
In each case, dts
denotes an interval (delta time) measured in seconds (not milliseconds) and f
denotes a function. every()
and after()
return so-called timeout IDs (toutid
s), i.e. values that are
recognized by cease()
(clearTimeout()
, clearInterval()
) to stop a one-off or repetetive timed function
call. sleep()
returns a promise that should be awaited as in await sleep 3
, which will allow another
task on the event loop to return and resume execution no sooner than after 3000 milliseconds have elapsed.
Finally, there is defer()
, which should also be await
ed. It is a special use-case of sleep()
where the
timeout is set to zero, so the remaining effect is that other tasks on the event loop get a chance to run.
It accepts an optional function argument whose (synchronous or asynchronous) result will be returned.
Vogue DB
To Do
[–] documentation
[–] repurpose
posts
table to only contain unique posts with primary key( pid, version )
whereversion
is a small integer that allows to store updateddetails
for a given post.- [–] versioned
posts
would e.g. allow pricing fluctions to be linked to constant articles; IOW with versions, a constant 'article'/'product' can now have a varyingprice
detail - [–] old
posts
becomesposts_and_sessions
(?) which details occurrences ofposts
in completedsessions
('sightings').
- [–] versioned
[–] introduce tagging in its simplest form (only value-less, affirmative tags such as
+interesting
,+seen
,+hide
,pin-to-top
). Tags are linked to unversionedposts
viapid
.- [–] consider to allow to tie tags to versioned posts with the additional rule that a 'version' may
be a wildcard like
*
that applies to all versions of a post. Tying a tag to the version that I see in this listing then means 'OK not interested in this case but do show again if any details should have changed', tying a tag to symbolic version*
communicates 'I've seen this and am not interested in being alerted to any changes'. - [–] Some tags may be given systematic behavior:
+seen
—de-emphasize/grey-out post where it appears+watch
—the opposite of+seen
, hilite post where it appears+pin-to-top
—always show post near top of listings even when its position in the data source's listing is further down- other tags may be added at user request and be associated with a certain style / color / theme (maybe associate arbitrary CSS with each tag)
- [–] consider to allow to tie tags to versioned posts with the additional rule that a 'version' may
be a wildcard like
[–] allow to specify a list of keys into
details
containing facts that are to be writ large, such as the price of a product. Maybe also allow to track specific facets with a sparkline?[–] allow to specify from which key to derive
rank
from? In that case it would be best to keep all the data indetails
including № in the original listing and have a mapping like{ rank: 'nr', }
,{ rank: 'price', }
to determine how to derive which aspect from which field.- [–] problem with this: if the № becomes part of
details
then change in position of item in listing would in itself be treated as an update to the details of an item, which is not what we want- i.e. change in rank is categorically different from change in other details
- [–] problem with this: if the № becomes part of
[–] consider to either wrap
cheerio
handling (or make it optional), or use a library with a more transparent API[–]
Vogue_kaiju
(shim over Playwright) for in-browser scraping of dynamic HTML pages[–] see https://github.com/kudla/promise-status-async, https://stackoverflow.com/a/53328182/7568091 on how to check for promise status; https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/race on how to implement session timeout; https://github.com/dondevi/promise-abortable for cancelling promise
[–] what causes the error when
foreign key ( pid ) references scr__posts
inscr_tagged_posts
is uncommented?[–] make it so that sparklines are flush right i.e. the data point for the current session is 'now', and 'now' has maximum of the displayed x values
[–] add chart with all combined sparklines from 'latest' view
[–] add
status
field (first
,last
,return
,cont
) to each data point of trends to allow for coloring in charts[–] may want to implement option where only those sessions are show where at least one change occurred
[–] simplify construction of customized instances; currently one has to do:
path = 'path/to/dbay-vogue.db' db = new DBay { path, } vdb = new Vogue_db { db, } vogue = new Vogue { vdb, }
i.e. one must know and get hold of the secondary classes involved and introduce them at appropriate junctures.
Is Done
- [+] name
- [+] POC
- [+] rename
round
->session
- [+] rename
seq
->rank
- [+] rename
progress
->trends
? - [+] use D3, plot to produce sparklines (MVP)
- [+]
Vogue_scraper
->Vogue_scraper_ABC
- [+] in HN scraping, one could say that the 'ultimate' identity is really the content URL, which can be associated with any number of discussion URLs as time passes
- [+] only display the most recent session entry for each item ID (and up to so-and-so many items per page), i.e. if an item was present in sessions 12 and 11, only show the entry for session 12 (with the sparkline that does show a dot for session 11); if an item dropped out in session 9, also show it (as long as the table does not exceed the line limit)