a-extractor

v2.0.2

Published

3 years ago

Article content extraction database

croqaz

text article extraction

📃 Article extractor

Database of expressions used for extracting content from blogs and articles.

The main database is in JSON5 format, a subset of JavaScript syntax. It is also available as plain JSON, for convenience.

The extraction expressions are Cheerio selectors, which are similar to jQuery selectors.

The targeted information is:

the author
the date when the article was written
and of course, the article text, as clean as possible

This project is designed to be used with Clean-Mark, but you can use it however you want.

86 domains available

abcnews.go.com
aeon.co
agroinfo.ro
arenait.net
arstechnica.com
articles.latimes.com
artsy.net
bbc.com
beta.theglobeandmail.com
bigthink.com
bindiribli.ro
bossfeed.net
businessinsider.com
collectivelyconscious.net
curentul.info
dailymail.co.uk
deepdotweb.com
digi24.ro
earthsky.org
edition.cnn.com
engadget.com
express.co.uk
farnamstreetblog.com
fastcompany.com
finesociety.ro
firstpost.com
foxnews.com
galacticconnection.com
gandeste.org
gazetadambovitei.ro
gnosticwarrior.com
hackread.com
hbr.org
hotnews.ro
howtogeek.com
huffingtonpost.com
info.localytics.com
infoalert.ro
irishmirror.ie
isgp-studies.com
jamesclear.com
jurnalul.ro
latimes.com
life.ro
mashable.com
merckmanuals.com
money.cnn.com
nautil.us
nbcnews.com
ncbi.nlm.nih.gov
neonnettles.com
news.com.au
newscientist.com
newyorker.com
nytimes.com
nzherald.co.nz
observator.tv
pri.org
qz.com
romaniaa.ro
rt.com
rts.earth
smh.com.au
start-up.ro
stiri.tvr.ro
stirileprotv.ro
techcrunch.com
techradar.com
telegraph.co.uk
theatlantic.com
theguardian.com
theliberal.ie
thenextweb.com
theverge.com
thrillist.com
torrentfreak.com
usatoday.com
usnews.com
vox.com
wakingtimes.com
wall-street.ro
washingtonpost.com
weforum.org
wsj.com
yahoo.com
ziare.com
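
To illustrate how a domain-keyed database like this could be consulted, here is a minimal sketch in plain JavaScript. It assumes the database is an object keyed by domain name. The field names (`author`, `date`, `content`), the placeholder selectors, and the `findEntry` helper are all hypothetical and are not taken from the actual database file.

```javascript
// Hypothetical excerpt of the database. The selectors are placeholders,
// NOT the real expressions stored for these domains.
const database = {
  'bbc.com': { author: '.author-name', date: 'time', content: 'article' },
  'money.cnn.com': { author: '.byline', date: 'time', content: 'article' },
};

// Find the entry that applies to an article URL.
// Tries the full hostname first, then drops the leftmost label
// ("www.bbc.com" -> "bbc.com") until a key matches or only the
// top-level domain is left.
function findEntry(articleUrl, db) {
  const labels = new URL(articleUrl).hostname.toLowerCase().split('.');
  while (labels.length >= 2) {
    const candidate = labels.join('.');
    if (Object.prototype.hasOwnProperty.call(db, candidate)) {
      return { domain: candidate, selectors: db[candidate] };
    }
    labels.shift();
  }
  return null; // no expressions known for this site
}

const match = findEntry('https://www.bbc.com/news/some-article', database);
console.log(match ? match.domain : 'no entry');
```

Dropping labels from the left lets one entry cover `www.` and other subdomains, while more specific keys such as `money.cnn.com` still take priority because they are tried first.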

Important

Clean-Mark already has algorithms that extract most of this information when a website is SEO-friendly, e.g. when it follows schema.org/Article, Microformats, or the Open Graph protocol. But it's not a perfect tool 🤖 and it needs help from us humans 🙄
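
As a rough illustration of what "SEO-friendly" means here, the sketch below reads schema.org/Article metadata embedded as JSON-LD in a page. It is not Clean-Mark's actual algorithm: the function names are hypothetical, and scanning HTML with a regular expression is a simplification that a real tool would likely replace with an HTML parser such as Cheerio.

```javascript
// Collect the parsed body of every <script type="application/ld+json"> tag.
function readJsonLd(html) {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks = [];
  for (const match of html.matchAll(pattern)) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Skip blocks that are not valid JSON.
    }
  }
  return blocks;
}

// Return { author, date } from the first article-like JSON-LD item, or null.
function articleMetadata(html) {
  const articleTypes = new Set(['Article', 'NewsArticle', 'BlogPosting']);
  for (const block of readJsonLd(html)) {
    const items = Array.isArray(block) ? block : [block];
    for (const item of items) {
      if (item && articleTypes.has(item['@type'])) {
        // "author" may be a string, an object with a name, or a list of either.
        const author = Array.isArray(item.author) ? item.author[0] : item.author;
        return {
          author: typeof author === 'string' ? author : (author && author.name) || null,
          date: item.datePublished || null,
        };
      }
    }
  }
  return null;
}
```

When a page carries no such metadata, `articleMetadata` returns `null`, which is the situation where hand-written expressions from this database become useful.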

Contributions

We ❤️ contributions!!!

Want to report a bug, request a feature, or contribute? Contributions are accepted only through the A-Extractor GitHub repository.

The "fork-and-pull" Git workflow:

1. Fork the repo on GitHub
2. Clone the project to your own machine
3. Work on your fork
   1. Make your changes and additions
   2. Change or add tests if needed
   3. Run tests and make sure they pass
   4. Add changes to README.md if needed
4. Commit changes to your own branch
5. Make sure you merge the latest from "upstream" and resolve conflicts if there are any
6. Push your work back up to your fork
7. Submit a pull request so that we can review your changes
