page-dweller

page-dweller tries to extract as many data points as possible from a webpage by combining different npm packages. It scrapes a webpage for metadata, schema information, resource links (anchors, script src, images), social profile links, emails, phone numbers, plain text, topics discussed in the page and term frequencies.

Install

npm install page-dweller

Basic implementation

Example

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.thehindu.com/news/national/opposition-protest-against-ib-ministry-advisory-in-the-backdrop-of-assam-violence/article30283682.ece?homepage=true";
    var pagedata = await dweller.getPageDetails(url);

    console.log(JSON.stringify(pagedata));
})();

Output format:

{
    header:{
        status:200,
        finalUrl:"https://example.com/",
        responseHeaders:{}
    },
    socialData:{
        twitters:String[],
        facebooks:String[],
        youtubes:String[],
        emails:String[],
        phones:String[],
        phonesUncertain:String[],
        linkedIns:String[],
        instagrams:String[]
    },
    schema: Object[],//all the ld json objects
    resources:{
        links:{
            canonical: String[],
            stylesheet: String[]
        },
        scripts:String[],//src attribute of all script element
        anchors: Object[],//{href:"a URL", text: "text content of <a> tag "}
        images: Object[]//{src:"image URL","alt":"alt text of the image"}
    },
    plainText: String,// text present inside body tag excluding script and stylesheet text
    nlpData:{
        dataGrams: Object[],//{size:1,count:43,normal:"hello"}
        topics: String[]
    }
}
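
As a rough illustration of how this result might be consumed (field names taken from the output format above; the URL is hypothetical):

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.example.com/";// hypothetical URL, for illustration only
    var pagedata = await dweller.getPageDetails(url);

    // Fields below follow the output format documented above.
    console.log("Final URL:", pagedata.header.finalUrl);
    console.log("Emails found:", pagedata.socialData.emails);
    console.log("Topics:", pagedata.nlpData.topics);
    console.log("Number of images:", pagedata.resources.images.length);
})();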

To extract only specific data points from a webpage, use the getSpecificPageData method.


Getting specific data points from a webpage

To extract specific data points from a given webpage, list the desired properties in the fields variable that is passed as an argument to the getSpecificPageData function. An empty array for a key returns the full data for that property; for example, nlpData: [] returns both datagrams and topics in the nlpData result.

var fields = {
    header: true,
    metadata: true,
    schema: true,
    plainText: true,
    social: [],//possible array values: ['twitters','facebooks','youtubes','instagrams','emails','phones','phonesUncertain','linkedIns']
    nlpData: [],//possible array values: ['datagrams','topics']
    resources: []//possible array values: ['links','anchors','scripts','images']
};
var pagedata = await getSpecificPageData(url,fields);
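
A complete, self-contained version of this call might look like the sketch below (assuming getSpecificPageData is exported on the module object, as getPageDetails is; adjust if the actual export differs):

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.example.com/";// hypothetical URL, for illustration only

    // Request only the header, metadata and a few social handles.
    var fields = {
        header: true,
        metadata: true,
        social: ['twitters', 'facebooks', 'emails']
    };

    // Assumes getSpecificPageData is exposed on the module, like getPageDetails.
    var pagedata = await dweller.getSpecificPageData(url, fields);
    console.log(JSON.stringify(pagedata, null, 2));
})();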

async Fetch function

This is an async/await wrapper around the fetch npm package.

function: fetchUrlAsync(url)

implementation:

var response = await dweller.fetchUrlAsync(url);
var finalUrl = response.header.finalUrl;
var statusCode = response.status;
var html = response.body;
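
For instance, the fields shown above can be checked before any further processing; a minimal sketch using a hypothetical URL:

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.example.com/";// hypothetical URL, for illustration only
    var response = await dweller.fetchUrlAsync(url);

    // Only continue parsing when the page was fetched successfully.
    if (response.status === 200) {
        console.log("Fetched", response.header.finalUrl);
        var $ = await dweller.loadElement(response.body);
        // ...pass $ to getMetadata, getPageResources, etc.
    } else {
        console.log("Request failed with status", response.status);
    }
})();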

Loading HTML

The jQuery object returned by loadElement is passed as a parameter to the getMetadata, getPageResources, innerText and getLdJson functions.

var url = "https://www.example.com/";
var response = await dweller.fetchUrlAsync(url);
var html = response.body;
var $ = await dweller.loadElement(html);

Getting script, stylesheet, anchor and image links

dweller.getPageResources(jQuery,fieldNameArray)

var $ = await dweller.loadElement(html);
var resources = await dweller.getPageResources($,['scripts','links','images','anchors']);

Expected Output format:


{
  "links": {
    "canonical": [
      "http://www.rannutsav.com"
    ],
    "stylesheet": [
      "https://www.rannutsav.com/assets/front/css/creative.min.css"
    ]
  },
  "scripts": [
    "https://www.rannutsav.com/assets/front/vendor/jquery/jquery.min.js",
    "https://www.google.com/recaptcha/api.js"
  ],
  "anchors": [
    {
      "href": "http://www.akshartours.com/akshar-tour-categories/international-tours/1",
      "text": "International Tour Package"
    },
    {
      "href": "tel:18002339008",
      "text": ""
    }
  ],
  "images": [
    {
      "src": "https://www.rannutsav.com/assets/front/images/WILDLIFE.jpg",
      "alt": "special offer"
    },
    {
      "src": "https://www.rannutsav.com/assets/front/images/DESERT AND BEACH .jpg",
      "alt": "special offer"
    }
  ]
}
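
As an illustration of how the result might be consumed, the sketch below collects the anchor URLs (field names taken from the expected output above; the URL is hypothetical):

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.example.com/";// hypothetical URL, for illustration only
    var response = await dweller.fetchUrlAsync(url);
    var $ = await dweller.loadElement(response.body);
    var resources = await dweller.getPageResources($, ['anchors', 'images']);

    // Collect every absolute href, ignoring tel:/mailto: links.
    var pageLinks = resources.anchors
        .map(function (anchor) { return anchor.href; })
        .filter(function (href) { return href.indexOf('http') === 0; });

    console.log(pageLinks);
})();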

Getting Metadata

Extracts the Open Graph data and meta tags (description, keywords, etc.) of the webpage.

var metadata = await dweller.getMetadata($);

Expected Output:


{
  "charset": "utf-8",
  "viewport": "width=device-width, initial-scale=1, shrink-to-fit=no",
  "description": "Its time to celebrate most awaiting colourful event of Kutch Rann Utsav at 2019, 2020. Specially designed honeymoon tent for Couple at Rann utsav, Kutch, Gujart, India. Call at +91 - 79 2644 0626, + 91 - 79 - 2646 2166 or email us at [email protected]",
  "keywords": "Rann Utsav Tour, Package, Tent Booking 2019-20",
  "revisit-after": "1 days",
  "author": "Rann Utsav",
  "Robots": "all",
  "googlebot": "index, follow",
  "MSNbot": "index, follow",
  "rating": "General",
  "distribution": "global",
  "opengraph": {
    "site_name": "Rann Utsav",
    "url": "https://www.rannutsav.com/"
  }
}
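
The description and Open Graph fields could then be read off the result (assuming the shape shown above):

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.example.com/";// hypothetical URL, for illustration only
    var response = await dweller.fetchUrlAsync(url);
    var $ = await dweller.loadElement(response.body);
    var metadata = await dweller.getMetadata($);

    // Field names follow the expected output documented above.
    console.log("Description:", metadata.description);
    if (metadata.opengraph) {
        console.log("OG site name:", metadata.opengraph.site_name);
        console.log("OG URL:", metadata.opengraph.url);
    }
})();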

Getting Social data (emails, phones, Twitter, Facebook, Instagram URLs)

The parseHandlesFromHtml function from Apify's social utils is used to extract the various social information. phonesUncertain (strings with a low chance of being a phone number) is limited to a maximum of 5 entries to avoid a large result size.

Function: getSocialData(html,fields)

var fields = {
    social:['twitters','facebooks','emails','phones']
}
var socialData = await getSocialData(html,fields);

Output format:

{
    socialData:{
        twitters:String[],
        facebooks:String[],
        youtubes:String[],
        emails:String[],
        phones:String[],
        phonesUncertain:String[],
        linkedIns:String[],
        instagrams:String[]
    }
}
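
Putting the pieces together, a full social-data extraction could look roughly like this (assuming getSocialData is exported on the module object, like the other functions shown with the dweller prefix):

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.example.com/";// hypothetical URL, for illustration only
    var response = await dweller.fetchUrlAsync(url);

    var fields = {
        social: ['twitters', 'facebooks', 'emails', 'phones']
    };

    // Assumes getSocialData is exposed on the module, like getPageDetails.
    var result = await dweller.getSocialData(response.body, fields);
    console.log(result.socialData.emails);
    console.log(result.socialData.twitters);
})();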

Getting Structured data (schema.org) from ld+json

function: getLdJson(jQueryElement)

var $ = await dweller.loadElement(response.body);
var schema = await dweller.getLdJson($);

Output:

[
  {
    "@context": "http://schema.org",
    "@type": "WebSite",
    "name": "MySmartPrice",
    "alternateName": "MySmartPrice",
    "url": "http://www.mysmartprice.com",
    "potentialAction": {
      "@type": "SearchAction",
      "target": "http://www.mysmartprice.com/msp/search/search.php?s={search_term_string}#s={search_term_string}",
      "query-input": "required name=search_term_string"
    }
  },
  {
    "@context": "http://schema.org",
    "@type": "Organization",
    "url": "http://www.mysmartprice.com",
    "logo": "https://assets.mspimages.in/logos/mysmartprice/msp.png",
    "sameAs": [
      "https://www.facebook.com/mysmartprice",
      "https://www.linkedin.com/company/mysmartprice-com",
      "https://plus.google.com/+mysmartprice/"
    ]
  }
]
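
Since the result is an array of ld+json objects, it can be filtered by @type; a minimal sketch based on the sample output above (hypothetical URL):

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.example.com/";// hypothetical URL, for illustration only
    var response = await dweller.fetchUrlAsync(url);
    var $ = await dweller.loadElement(response.body);
    var schema = await dweller.getLdJson($);

    // Keep only the Organization entries, as in the sample output above.
    var organizations = schema.filter(function (item) {
        return item['@type'] === 'Organization';
    });

    console.log(organizations.map(function (org) { return org.url; }));
})();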

Getting plain text from html

function: innerText(jQueryElement)

The innerText function extracts the text content of the body tag after removing the <script> and <style> tags from it. It appends a newline character to the end of each element's text content. This is similar to the standard innerText, which uses spaces rather than new lines after each HTML element.

var $ = await dweller.loadElement(html);
var plainText = await dweller.innerText($);

Getting NLP data such as topics and term frequencies from plain text

It uses the compromise and compromise-ngrams npm packages to extract topics and term frequencies from the plain text.

function: getNlpData(text, fieldNamesArray)
fieldNamesArray: ["topics", "datagrams"]

By default, only size:1 datagrams are generated. To get terms of other sizes, use the getDataGrams function with the options described below.

pagedata.plainText = await dweller.innerText($);// a plain string can also be used directly here
pagedata.nlpData = await dweller.getNlpData(pagedata.plainText,['topics','datagrams']);

Output:

{
    "dataGrams":[
        {
            "size":1,
            "count":40,
            "normal": "vivo"
        },
        {
            "size": 1,
            "count":35,
            "normal": "mobiles"
        },
        {
            "size": 1,
            "count": 23,
            "normal": "Upcoming"
        }
    ],
    "topics": [
        "vivo",
        "vivo mobiles",
        "upcoming mobiles"
    ]
}

Getting datagrams

It extracts all the datagrams from the text after removing stopwords.

function: getDataGrams(plaintext, options)

options:
- size (size of datagram required)
- min (minimum size of datagram)
- max (maximum size of datagram)

implementation:

var dataGrams = await getDataGrams(plainText, {size: 1});// for one-word terms
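
To get multi-word terms as well, the min and max options can be combined; a small sketch under the same assumptions (and assuming getDataGrams is exported on the module object, like getPageDetails):

const dweller = require('page-dweller');

(async () => {
    var url = "https://www.example.com/";// hypothetical URL, for illustration only
    var response = await dweller.fetchUrlAsync(url);
    var $ = await dweller.loadElement(response.body);
    var plainText = await dweller.innerText($);

    // One- to three-word terms, per the min/max options listed above.
    // Assumes getDataGrams is exposed on the module, like getPageDetails.
    var dataGrams = await dweller.getDataGrams(plainText, { min: 1, max: 3 });
    console.log(dataGrams);
})();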