urldatabase
v1.0.15
URL database with 90+ million categorized domains, using machine learning.
URL Database
URL Database is a Node.js module that provides content categories for around 90 million domains.
Two tiers of categories are available. Tier 1 categories are:
'Style & Fashion', 'Religion & Spirituality', 'Events and Attractions', 'Shopping', 'Pop Culture', 'Fine Art', 'Books and Literature', 'Television', 'Travel', 'Movies', 'Careers', 'Home & Garden', 'Hobbies & Interests', 'Family and Relationships', 'Sports', 'Real Estate', 'Food & Drink', 'Healthy Living', 'Automotive', 'Medical Health', 'Video Gaming', 'Education', 'Music and Audio', 'Technology & Computing', 'News and Politics', 'Pets', 'Personal Finance', 'Science', 'Business and Finance'
A sample of the Tier 2 categories is listed in the Appendix of this Readme.
Domain categories were determined with the following data acquisition and machine learning pipeline:
- the website of each domain was fetched
- the text of each website was extracted and pre-processed (lemmatization, removal of punctuation, etc.)
- for non-English websites, the text was translated to English using a neural machine translation (NMT) solution (the NMT models for the language pairs have BLEU scores above 40)
- each text was classified with a Tier 1 and a Tier 2 classifier
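The pre-processing step in the pipeline above can be illustrated with a minimal sketch. This is not the pipeline's actual code: `preprocessText` is a hypothetical helper that only lowercases the text and removes punctuation. Lemmatization is left out because it needs a language-specific library.

```javascript
// Hypothetical helper: normalizes raw page text before classification.
// Lemmatization is intentionally omitted; it needs a language-specific library.
function preprocessText(rawText) {
  return rawText
    .toLowerCase()
    // Replace anything that is not a letter, digit, or whitespace with a space.
    .replace(/[^\p{L}\p{N}\s]/gu, ' ')
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(preprocessText('Hello, World! Visit our shop: 50% off.'));
// hello world visit our shop 50 off
```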
Installation
npm i urldatabase
Usage example
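// The example uses the third-party `request` package (install it with `npm i request`).
var request = require('request');

// POST the domain to the categorization endpoint as form-encoded data.
var options = {
  method: 'POST',
  url: 'https://www.websitecategorizationapi.com/api/domains.php',
  headers: {
    'Content-Type': 'application/x-www-form-urlencoded'
  },
  form: {
    domain: 'www.zdf.de'
  }
};

request(options, function (error, response) {
  // `error` is already an Error object, so rethrow it as-is.
  if (error) throw error;
  // The response body holds the classification result.
  console.log(response.body);
});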
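The `request` package is deprecated, so the same call can also be sketched with the `fetch` function that is available globally in Node.js 18 or later. The sketch assumes the same endpoint and `domain` form field as the usage example above. `buildCategorizationRequest` and `categorizeDomain` are hypothetical helper names and are not part of this package.

```javascript
// Assumption: same endpoint and form field as the usage example above.
const API_URL = 'https://www.websitecategorizationapi.com/api/domains.php';

// Builds the arguments for fetch(); kept separate so it can be inspected
// without making a network call.
function buildCategorizationRequest(domain) {
  return {
    url: API_URL,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      // URLSearchParams handles the form encoding of the domain value.
      body: new URLSearchParams({ domain }).toString(),
    },
  };
}

// Hypothetical wrapper: sends the request and returns the raw response body as text.
async function categorizeDomain(domain) {
  const { url, init } = buildCategorizationRequest(domain);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage (performs a real HTTP request):
// categorizeDomain('www.zdf.de').then(console.log).catch(console.error);
```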
Why is a URL Categorization Database needed?
Internet URL and domain counts
Let's start by talking about domains and URLs. According to Verisign's 2021 statistics, there are more than 360 million registered domains. About 15% of them are active.
Of course, each domain can have many subpages, resulting in billions of URLs. Common Crawl, a well-known web crawling organization, publishes regular crawl archives; one recent batch is available at https://commoncrawl.org/2022/02/january-2022-crawl-archive-now-available/.
That single crawl alone contains 2.95 billion web pages. Common Crawl is an excellent resource for analyzing website data and offers many helpful features for fast parsing.
For instance, you can query the Common Crawl columnar index to find all URLs that contain a string such as "/pricing.php". The columnar index is stored in Parquet format; more information about Parquet is available at https://parquet.apache.org
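The filtering step itself is simple once the index rows are in memory. The sketch below assumes the rows have already been read from the Parquet files into plain objects with a `url` field (reading Parquet is not shown). `findUrlsContaining` and the sample rows are made up for illustration.

```javascript
// Assumption: rows from the columnar index have already been loaded into memory
// as plain objects with a `url` field. Reading the Parquet files is not shown.
function findUrlsContaining(rows, needle) {
  return rows
    .map((row) => row.url)
    .filter((url) => typeof url === 'string' && url.includes(needle));
}

// Made-up sample rows for illustration.
const sampleRows = [
  { url: 'https://example.test/pricing.php' },
  { url: 'https://example.test/about' },
  { url: 'https://shop.example.test/en/pricing.php?plan=pro' },
];

console.log(findUrlsContaining(sampleRows, '/pricing.php'));
```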
Any classification of URLs must be automated because of the overwhelming number of URLs on the modern web, with millions more being added every day.
Most URL categorizers, whether built for content filtering or safety, are supervised machine learning models of the text classification type, because machine learning is by far the most practical option for classification at this scale.
What categories do we attribute to URLs in the URL database?
URLs can be categorized in various ways, depending on the goal. Let's imagine that an AdTech or marketing company wants to use our URL database.
Suppose advertiser A from a certain industry, such as the automotive sector, wants to display ads on publishers' websites. For the ads to convert better, the advertiser should place them on publishers' websites whose content relates to the automotive industry.
But how can we identify websites that contain automotive-related content?
The URL database is helpful in this situation.
It was created by applying a machine learning classifier to billions of URLs, determining the type of content for each URL, and storing this information in a database of URLs.
How is data stored in the URL database?
To use the URL database, a company only has to download it and integrate it into its existing app.
There are numerous ways to store the categorization information in the database itself. It can be kept in text files, a SQL database, a NoSQL database, or a combination of these. For example, you could keep all the URLs or domains that fall under a given category in a single text file.
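As an illustration of the one-text-file-per-category layout, the sketch below turns file contents into an in-memory lookup table. It assumes one domain per line. `buildDomainIndex`, the file contents, and the `.test` domains are hypothetical; the actual dataset format may differ.

```javascript
// Hypothetical input: category name -> contents of that category's text file,
// with one domain per line.
function buildDomainIndex(fileContentsByCategory) {
  const index = new Map();
  for (const [category, contents] of Object.entries(fileContentsByCategory)) {
    for (const line of contents.split(/\r?\n/)) {
      const domain = line.trim().toLowerCase();
      if (domain) index.set(domain, category); // skip blank lines
    }
  }
  return index;
}

const index = buildDomainIndex({
  Television: 'www.zdf.de\nexample-tv.test\n',
  Automotive: 'example-cars.test\n',
});
console.log(index.get('www.zdf.de')); // Television
```

If a domain appears in more than one file, the last category read wins in this sketch. A real integration would need to decide how to handle such overlaps.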
Taxonomy of URL Categories for Web Content
When the goal is to categorize websites, one can either use a custom taxonomy or one that is an accepted standard in the field.
The IAB taxonomy serves as the industry standard for categorizing content in marketing. Its most recent iteration can be found here:
https://iabtechlab.com/press-releases/tech-lab-releases-content-taxonomy-3-0/
The Google product taxonomy may be better suited if one is interested in categorizing website content related to e-commerce:
https://www.google.com/basepages/producttype/taxonomy-with-ids.en-US.txt
Usage of URL Database
Our URL Categorization Database can be accessed either via the API, as shown above, or as a dataset file, which can serve as an offline URL database.
The offline URL database can be used in internal applications, e.g. for content filtering in a company network, by restricting employees' access to non-work websites such as shopping, social media and gaming sites.
It can also be used in cybersecurity apps or in e-commerce SaaS platforms and services.
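A content filter built on the offline database could look roughly like the sketch below. It assumes the dataset has been loaded into a `Map` from domain to Tier 1 category. The domains, the block list, and the `isBlocked` helper are hypothetical.

```javascript
// Assumption: the offline dataset has been loaded into a Map of domain -> Tier 1 category.
const categoryByDomain = new Map([
  ['shop.example.test', 'Shopping'],
  ['games.example.test', 'Video Gaming'],
  ['docs.example.test', 'Technology & Computing'],
]);

// Tier 1 categories that this hypothetical policy treats as non-work.
const blockedCategories = new Set(['Shopping', 'Video Gaming']);

// Returns true when the domain's known category is on the block list.
// Domains missing from the dataset are allowed by default in this sketch.
function isBlocked(domain, categories = categoryByDomain, blocked = blockedCategories) {
  const category = categories.get(domain.toLowerCase());
  return category !== undefined && blocked.has(category);
}

console.log(isBlocked('shop.example.test')); // true
console.log(isBlocked('docs.example.test')); // false
console.log(isBlocked('unknown.example.test')); // false
```

Allowing unknown domains is a policy choice; a stricter setup could block anything that is not in the dataset.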
Frequently asked questions
What is a URL database?
The phrase "URL database" refers to a collection of URLs or links to subpages that typically have some information assigned to them, such as the content category, language, author, root domain, hosting IP address, number of tokens (content length), topics referenced in the URL, and others.
How do I find the category for a URL?
Follow these steps:
1. Choose the most suitable taxonomy (IAB or e-commerce).
2. Enter your URL into the WebsitecategorizationAPI tool (in the dashboard) or use our API endpoints.
3. You will receive your results in about 10 seconds, and you can choose to use the top predicted category or all categories with confidence levels higher than the threshold you specified.
JSON format
Example output from URL Database for "www.zdf.de" - Tier 1:
{
  "classification": [
    { "category": "Television", "value": 0.60773588801323 },
    { "category": "Movies", "value": 0.29109074822883085 },
    { "category": "Events and Attractions", "value": 0.07486490625416359 },
    { "category": "Family and Relationships", "value": 0.005374985197691561 },
    { "category": "Hobbies & Interests", "value": 0.005101833789390943 },
    { "category": "Video Gaming", "value": 0.003984198425722353 },
    { "category": "Books and Literature", "value": 0.002492840101745817 },
    { "category": "Fine Art", "value": 0.0023078275948925885 },
    { "category": "Shopping", "value": 0.000736829495733268 },
    { "category": "Travel", "value": 0.0007148378661549944 },
    { "category": "Religion & Spirituality", "value": 0.0006182756059490645 },
    { "category": "Music and Audio", "value": 0.0006017436156576558 },
    { "category": "News and Politics", "value": 0.0005944575220540115 },
    { "category": "Pop Culture", "value": 0.0005872038218177597 },
    { "category": "Healthy Living", "value": 0.0005831789414856245 },
    { "category": "Careers", "value": 0.0005243635107021117 },
    { "category": "Automotive", "value": 0.00039890616180756646 },
    { "category": "Technology & Computing", "value": 0.0002859548776286219 },
    { "category": "Real Estate", "value": 0.00027637364331928056 },
    { "category": "Personal Finance", "value": 0.0001710230563593708 },
    { "category": "Sports", "value": 0.00016042771723498377 },
    { "category": "Education", "value": 0.00014381866308073145 },
    { "category": "Pets", "value": 0.00012728402872631592 },
    { "category": "Business and Finance", "value": 0.000123494990696087 },
    { "category": "Style & Fashion", "value": 0.00011405926539219588 },
    { "category": "Food & Drink", "value": 0.00010023782530038409 },
    { "category": "Science", "value": 0.0000877636365314911 },
    { "category": "Home & Garden", "value": 0.00007493299862686662 },
    { "category": "Medical Health", "value": 0.000021605150073794945 }
  ],
  "language": "de"
}
Here is the Tier 2 classification result for the same domain (only the highest-probability categories are shown):
{
  "classification": [
    { "category": "Comedy TV", "value": 0.12665120800837792 },
    { "category": "World Movies", "value": 0.11467298561750293 },
    { "category": "Fantasy Movies", "value": 0.07605491578220645 },
    { "category": "Drama Movies", "value": 0.05372015353841327 },
    { "category": "Drama TV", "value": 0.048950849776443935 },
    { "category": "Soap Opera TV", "value": 0.043373118622605095 },
    { "category": "Science Fiction TV", "value": 0.03838582265067825 },
    { "category": "Holiday TV", "value": 0.024368499196304464 },
    { "category": "Cinemas and Events", "value": 0.02408407549980423 },
    { "category": "Action and Adventure Movies", "value": 0.02262422360894283 },
    { "category": "Children's TV", "value": 0.01985699003319781 },
    { "category": "Crime and Mystery Movies", "value": 0.016198758949356365 },
    { "category": "Reality TV", "value": 0.01584871616578955 },
    { "category": "Horror Movies", "value": 0.014501264118914434 },
    { "category": "Video Game Genres", "value": 0.013148151950373053 },
    { "category": "Music TV", "value": 0.013036725828882795 },
    { "category": "Animation TV", "value": 0.01281354534376587 },
    { "category": "Romance Movies", "value": 0.011537290751170815 },
    { "category": "Travel Books", "value": 0.010342167707548545 },
    { "category": "Content Production", "value": 0.008501663028851797 },
    ...
  ]
}
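Given a response shaped like the examples above, picking the main category or every category above a confidence threshold takes only a few lines. The helpers `topCategory` and `categoriesAbove` are hypothetical names for this sketch and are not part of this package.

```javascript
// Hypothetical helpers for a response shaped like the examples above:
// { classification: [{ category, value }, ...], language }

// Returns the entry with the highest value, or null for an empty list.
function topCategory(result) {
  return result.classification.reduce(
    (best, entry) => (best === null || entry.value > best.value ? entry : best),
    null
  );
}

// Returns all entries whose value is above the threshold, highest first.
function categoriesAbove(result, threshold) {
  return result.classification
    .filter((entry) => entry.value > threshold)
    .sort((a, b) => b.value - a.value);
}

// Shortened copy of the Tier 1 example for www.zdf.de.
const result = {
  classification: [
    { category: 'Television', value: 0.60773588801323 },
    { category: 'Movies', value: 0.29109074822883085 },
    { category: 'Events and Attractions', value: 0.07486490625416359 },
  ],
  language: 'de',
};

console.log(topCategory(result).category); // Television
console.log(categoriesAbove(result, 0.1).map((e) => e.category)); // [ 'Television', 'Movies' ]
```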
Language support
URL Database contains English as well as non-English domains.
Appendix
Tier 2 categories of domains (first 23 out of 441):
'Beauty', 'Astrology', 'Polish', 'Fashion Trends', 'Street Style', 'Sales and Promotions', 'Celebrity Style', 'Fashion Events', 'Personal Celebrations & Life Events', 'Holiday Shopping', 'Body Art', 'Outdoor Decorating', 'Fiction', 'Personal Care', 'Interior Decorating', 'Auto Buying and Selling', 'Sci-fi and Fantasy', 'Images/Galleries', 'Gifts and Greetings Cards', 'Coupons and Discounts', 'Digital Arts', 'Soap Opera TV', "Women's Fashion"
Useful resources
URL Database package locations
https://openbase.com/js/urldatabase/documentation
https://npmtrends.com/urldatabase
https://yarnpkg.com/package/urldatabase
https://npmmirror.com/package/urldatabase
https://www.npmjs.com/package/urldatabase