novels-raw-scraper
v1.4.5
A library for scraping chapters and tables of contents of Chinese novels.
Novels Raw Scraper
A small library for scraping chapter text and tables of contents of Chinese novels.
Features
- Supports PTWXZ and 69Shu.
- Gets the latest chapter info.
- Automatically converts Chinese encodings to UTF-8.
- Built with TypeScript.
- Great developer experience (DX) thanks to TypeScript.
- No fancy features, just a plain and simple API.
Installation
Yarn
yarn add novels-raw-scraper
NPM
npm i novels-raw-scraper
Usage
Chapter Scraper
import { snChapterScraper } from "novels-raw-scraper";
// or
import { ptwxzChapterScraper } from "novels-raw-scraper";
// Pass in the chapter URL; the library handles the rest.
await snChapterScraper("https://www.69shu.com/txt/35345/24661030");
await ptwxzChapterScraper("https://www.ptwxz.com/html/7/7811/9426848.html");
The output is a `string[]`.
/*
[
'第4章 ,只因为在人群中多看了一眼',
'家里事物料理完毕,颜老太太就带着大孙女、三孙子,还有老仆二人,踏上了去往大儿上任的临宜县的路。',
'颜老太太在族里的辈分比较高,加之这些年,颜家没少帮衬族里,是以,他们走的时候,族长和族中辈分比较高的老者都来了。',
'“老嫂子,今年少雨,各个地方的收成都不太好,就我们颜家村,用了您老给的种子,收成比往年还要多上一成,我在这呀,替大家感谢你嘞。”',
...]
*/
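The returned array can be post-processed before it is stored or displayed. Below is a minimal sketch, assuming (based only on the sample output above) that the first element is the chapter heading and the remaining elements are paragraphs. The `Chapter` interface and the `toChapter` and `toPlainText` helpers are hypothetical names for illustration, not part of the library.

```typescript
// Hypothetical shape, inferred from the sample output above.
interface Chapter {
  title: string;
  paragraphs: string[];
}

// Split the scraped lines into a heading and body paragraphs.
// Assumption: the first non-empty line is the chapter heading.
function toChapter(lines: string[]): Chapter {
  const cleaned = lines
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const [title = "", ...paragraphs] = cleaned;
  return { title, paragraphs };
}

// Join a chapter back into plain text, with a blank line after the heading.
function toPlainText(chapter: Chapter): string {
  return [chapter.title, "", ...chapter.paragraphs].join("\n");
}
```

In practice the input would be the result of `await snChapterScraper(url)` or `await ptwxzChapterScraper(url)`.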
Table of Contents Scraper
import { snTocScraper } from "novels-raw-scraper";
// or
import { ptwxzTocScraper } from "novels-raw-scraper";
// Pass in the table of contents URL; the library handles the rest.
await snTocScraper("https://www.69shu.com/31477/");
await ptwxzTocScraper("https://www.ptwxz.com/html/7/7811/");
The output is an array of `TocOutput` objects, which can easily be mapped later on.
/*
[
{
number: 1,
title: '牢狱之灾',
url: 'https://www.69shu.com/txt/31477/22493555'
},
{
number: 2,
title: '妖物作祟',
url: 'https://www.69shu.com/txt/31477/22493560'
},
{
number: 3,
title: '仙侠世界一样能推理',
url: 'https://www.69shu.com/txt/31477/22493564'
},
{
number: 4,
title: '是时候表演真正的技术了',
url: 'https://www.69shu.com/txt/31477/22493570'
},
... 820 more items
]
*/
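As an illustration of mapping over the table of contents, the sketch below selects a range of chapters and scrapes them one at a time. It assumes each entry has the `number`, `title`, and `url` fields shown in the sample output. `TocEntry`, `chaptersInRange`, and `scrapeSequentially` are hypothetical names, and the one-second delay is an arbitrary choice to avoid hammering the source site.

```typescript
// Hypothetical local type mirroring the fields shown in the sample output.
interface TocEntry {
  number: number;
  title: string;
  url: string;
}

// Return the entries whose chapter number lies in [from, to], in ascending order.
// filter() creates a new array, so the input is not mutated by sort().
function chaptersInRange(toc: TocEntry[], from: number, to: number): TocEntry[] {
  return toc
    .filter((entry) => entry.number >= from && entry.number <= to)
    .sort((a, b) => a.number - b.number);
}

// Scrape the given entries one by one, pausing between requests.
// The scraper is passed in, so either chapter scraper can be used.
async function scrapeSequentially(
  entries: TocEntry[],
  scrapeChapter: (url: string) => Promise<string[]>,
  delayMs = 1000,
): Promise<{ number: number; lines: string[] }[]> {
  const results: { number: number; lines: string[] }[] = [];
  for (const entry of entries) {
    results.push({ number: entry.number, lines: await scrapeChapter(entry.url) });
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return results;
}
```

For example, `scrapeSequentially(chaptersInRange(toc, 1, 10), snChapterScraper)` would fetch the first ten chapters of a 69Shu table of contents, assuming the chapter scraper accepts the URLs the TOC scraper returns.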
Last Chapter Info Scraper
import { lastChapterInfo } from "novels-raw-scraper";
import puppeteer from "puppeteer";
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await lastChapterInfo({
linkSelector: "link css selector",
numberSelector: "number css selector",
sourceUrl: "url of the page where we need to check",
titleSelector: "title css selector",
page: page, // Puppeteer's page instance.
});
// OUTPUT
/*
{
link: "https://chapterslink.com/123",
number: 123,
title: "Some great title"
}
*/
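One possible use of this result is to work out which chapters have appeared since the last check. The sketch below assumes the output has the `link`, `number`, and `title` fields shown above and that chapter numbers are consecutive integers. `LastChapterInfo` and `missingChapterNumbers` are hypothetical names, not library exports.

```typescript
// Hypothetical local type mirroring the sample output above.
interface LastChapterInfo {
  link: string;
  number: number;
  title: string;
}

// List the chapter numbers published after the last one we have seen.
// Assumption: chapter numbers increase by one with no gaps.
function missingChapterNumbers(
  latest: LastChapterInfo,
  lastSeenNumber: number,
): number[] {
  const missing: number[] = [];
  for (let n = lastSeenNumber + 1; n <= latest.number; n++) {
    missing.push(n);
  }
  return missing;
}
```

An empty result means there is nothing new to fetch.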
API
We currently support two sites, with two functions per site: one for chapter text and one for the table of contents.
ptwxzChapterScraper();
ptwxzTocScraper();
snChapterScraper();
snTocScraper();
The last chapter info function can run on virtually any site.
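Because each site has its own pair of functions, calling code has to pick the right one for a given URL. The sketch below shows one way to do that by hostname. It assumes the hostnames seen in the usage examples (`69shu.com` and `ptwxz.com`) and that the `sn` prefix refers to 69Shu, as the examples suggest. `Site` and `detectSite` are hypothetical names.

```typescript
// Hypothetical site identifiers matching the function name prefixes.
type Site = "sn" | "ptwxz";

// Guess which supported site a URL belongs to from its hostname.
// Returns undefined for unsupported hosts or strings that are not valid URLs.
function detectSite(url: string): Site | undefined {
  let host: string;
  try {
    host = new URL(url).hostname.replace(/^www\./, "");
  } catch {
    return undefined; // not a parseable URL
  }
  if (host === "69shu.com") return "sn";
  if (host === "ptwxz.com") return "ptwxz";
  return undefined;
}
```

The result can then be used to choose between `snChapterScraper` and `ptwxzChapterScraper` (or the matching TOC functions).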