aljazeera-crawler
v1.0.3
Crawler for https://www.aljazeera.net/
aljazeera-crawler is a command line application that helps crawl the https://www.aljazeera.net/ website.
Installation
Either install the tool globally so that it is available on your system path:
npm install -g aljazeera-crawler
Or run it directly with npx, without installing it:
npx aljazeera-crawler [options]
Usage
For CLI options, use the -h (or --help) argument:
aljazeera-crawler -h
Al Jazeera Crawler
Usage: aljazeera-crawler [options]

Options:
      --version    Show version number  [boolean]
  -t, --threshold  the minimum number of words to be crawled  [number] [default: 1000]
  -d, --domain     the domain to crawl  [string] [required] [choices: "politics", "economy", "culture", "sport", "art", "technology", "heritage"]
  -h, --help       Show help  [boolean]
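The --threshold option sets a minimum word count, but this README does not describe how the tool counts words or decides when to stop. The following is only a sketch of how such a threshold could behave, assuming words are whitespace-separated tokens and that collection stops once the running total reaches the threshold. The names countWords and collectUntilThreshold are hypothetical and are not part of aljazeera-crawler.

```javascript
// Hypothetical helper: counts whitespace-separated tokens in a string.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical helper: keeps adding articles until the running word
// total reaches the threshold (default 1000, matching the CLI default).
function collectUntilThreshold(articles, threshold = 1000) {
  const kept = [];
  let total = 0;
  for (const article of articles) {
    if (total >= threshold) break;
    kept.push(article);
    total += countWords(article);
  }
  return { kept, total };
}
```

Under these assumptions the final total can exceed the threshold, because the last article is kept whole rather than cut at the exact word count.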
Let's say we want to crawl a minimum of 100,000 words in the technology domain.
We can use either:
aljazeera-crawler -t 100000 -d technology
Or:
aljazeera-crawler --threshold 100000 --domain technology
Once the crawl finishes, a file named output-technology-100000.txt will be created.
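The file name in this example appears to combine the domain and the threshold. As a minimal sketch, assuming the pattern output-<domain>-<threshold>.txt holds for other option values as well (this is inferred from the single example above, not from the tool's source), a script that post-processes the results could predict the name like this. The function outputFileName is hypothetical.

```javascript
// Hypothetical helper: builds the expected output file name.
// The pattern is an assumption inferred from the README example.
function outputFileName(domain, threshold) {
  return `output-${domain}-${threshold}.txt`;
}

console.log(outputFileName("technology", 100000));
// prints "output-technology-100000.txt"
```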
Domains
The domains that can currently be crawled are:
| Category             | Link                                                 |
| -------------------- | ---------------------------------------------------- |
| politics سياسة       | https://www.aljazeera.net/news/politics/             |
| economy اقتصاد       | https://www.aljazeera.net/news/ebusiness/            |
| culture ثقافة        | https://www.aljazeera.net/news/cultureandart/        |
| sport رياضة          | https://www.aljazeera.net/sport/                     |
| art فن               | https://www.aljazeera.net/news/arts/                 |
| technology تكنولوجيا | https://www.aljazeera.net/news/scienceandtechnology/ |
| heritage تراث        | https://www.aljazeera.net/turath/                    |
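The table above can be expressed as a simple lookup, which may be handy when scripting around the tool. This is an illustrative sketch only: DOMAIN_URLS and resolveDomainUrl are hypothetical names, and the tool's own internal mapping may be structured differently.

```javascript
// Domain names and URLs copied from the table above.
const DOMAIN_URLS = {
  politics: "https://www.aljazeera.net/news/politics/",
  economy: "https://www.aljazeera.net/news/ebusiness/",
  culture: "https://www.aljazeera.net/news/cultureandart/",
  sport: "https://www.aljazeera.net/sport/",
  art: "https://www.aljazeera.net/news/arts/",
  technology: "https://www.aljazeera.net/news/scienceandtechnology/",
  heritage: "https://www.aljazeera.net/turath/",
};

// Hypothetical helper: returns the section URL for a domain name,
// or throws if the name is not one of the listed choices.
function resolveDomainUrl(domain) {
  if (!Object.prototype.hasOwnProperty.call(DOMAIN_URLS, domain)) {
    const choices = Object.keys(DOMAIN_URLS).join(", ");
    throw new Error(`Unknown domain "${domain}". Choices: ${choices}`);
  }
  return DOMAIN_URLS[domain];
}
```

Note that the domain name passed to -d does not always match the URL path: for example, economy maps to /news/ebusiness/ and heritage maps to /turath/.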
Licence
MIT