purlfy
v0.3.11
pURLfy
English | 简体中文
The ultimate URL purifier.
[!NOTE] Did you know that the name "pURLfy" is a combination of "purify" and "URL"? It can be pronounced as `pjuɑrelfaɪ`.
🪄 Features
Usually, pURLfy is used for purifying URLs: removing redundant tracking parameters, skipping redirect pages, and extracting the link that really matters. However, pURLfy is not limited to this. It is a rule-based tool for transforming URLs in general, and example use cases include replacing the domain name or redirecting to an alternative of the given URL. It features:
- ⚡ Fast: Purify URLs quickly and efficiently.
- 🪶 Lightweight: Zero-dependency; minified script less than 4kb.
- 📃 Rule-based: Perform purification based on rules, making it more flexible.
- 🔄️ Async: Calling `purify` won't block your thread.
- 🔁 Iterative purification: If the URL still contains tracking parameters after a single purification (e.g. URLs returned by `redirect` rules), it will continue to be purified.
- 📊 Statistics: You can track statistics of the purification process, including the number of links purified, the number of parameters removed, the number of URLs decoded, the number of URLs redirected, and the number of characters deleted.
🤔 Usage
🚀 Quick Start
Visit our demo page, try out our Tampermonkey script, or simply run `node cli.js <url[]> [<options>]` to purify a list of URLs (for more information, please refer to the comments in the script).
// Import the `Purlfy` class in whatever way suits your platform, e.g. from https://cdn.jsdelivr.net/gh/PRO-2684/pURLfy@latest/purlfy.min.js
const purifier = new Purlfy({ // Instantiate a Purlfy object
    fetchEnabled: true,
    lambdaEnabled: true,
});
const rules = await (await fetch("https://cdn.jsdelivr.net/gh/PRO-2684/pURLfy-rules@core-0.3.x/<ruleset>.json")).json(); // Rules
// You may also use GitHub raw link for really latest rules: https://raw.githubusercontent.com/PRO-2684/pURLfy-rules/core-0.3.x/<ruleset>.json
const additionalRules = {}; // You can also add your own rules
purifier.importRules(rules, additionalRules); // Import rules
purifier.addEventListener("statisticschange", e => { // Add an event listener for statistics change
    console.log("Statistics increment:", e.detail); // Only available in platforms that support `CustomEvent`
    console.log("Current statistics:", purifier.getStatistics());
});
purifier.purify("https://example.com/?utm_source=123").then(console.log); // Purify a URL
Here's a list of test URLs that you can use to test pURLfy:
- Bilibili's short link: `https://b23.tv/SI6OEcv`
- Ordinary Tieba link: `https://tieba.baidu.com/p/7989575070?share=none&fr=none&see_lz=none&share_from=none&sfc=none&client_type=none&client_version=none&st=none&is_video=none&unique=none`
- MC Wiki's external link: `https://link.mcmod.cn/target/aHR0cHM6Ly9naXRodWIuY29tL3dheTJtdWNobm9pc2UvQmV0dGVyQWR2YW5jZW1lbnRz`
- Bing's search result: `https://www.bing.com/ck/a?!&&p=de70ef254652193fJmltdHM9MTcxMjYyMDgwMCZpZ3VpZD0wMzhlNjdlMy1mN2I2LTZmMDktMGE3YS03M2JlZjZhMzZlOGMmaW5zaWQ9NTA2Nw&ptn=3&ver=2&hsh=3&fclid=038e67e3-f7b6-6f09-0a7a-73bef6a36e8c&psq=anti&u=a1aHR0cHM6Ly9nby5taWNyb3NvZnQuY29tL2Z3bGluay8_bGlua2lkPTg2ODkyMg&ntb=1`
- A URL nested too many times that cannot be opened normally: `https://www.minecraftforum.net/linkout?remoteUrl=https%3A%2F%2Fwww.urlshare.cn%2Fumirror_url_check%3Furl%3Dhttps%253A%252F%252Fc.pc.qq.com%252Fmiddlem.html%253Fpfurl%253Dhttps%25253A%25252F%25252Fgithub.com%25252Fjiashuaizhang%25252Frpc-encrypt%25253Futm_source%25253Dtest`
📚 API
Constructor
new Purlfy({
    fetchEnabled: Boolean, // Enable the redirect mode (default: false)
    lambdaEnabled: Boolean, // Enable the lambda mode (default: false)
    maxIterations: Number, // Maximum number of iterations (default: 5)
    statistics: { // Initial statistics
        url: Number, // Number of links purified
        param: Number, // Number of parameters removed
        decoded: Number, // Number of URLs decoded (`param` mode)
        redirected: Number, // Number of URLs redirected (`redirect` mode)
        visited: Number, // Number of URLs visited (`visit` mode)
        char: Number, // Number of characters deleted
    },
    log: Function, // Log function (default is using `console.log` for output)
    fetch: async Function, // Function to fetch the given URL, should at least support `method`, `headers` and `redirect` in `options` parameter (default is using `fetch`)
})
Instance Methods
- `importRules(...rulesets: object[]): void`: Import a series of rulesets.
- `purify(url: string): Promise<object>`: Purify a URL.
    - `url`: The URL to be purified.
    - Returns a `Promise` that resolves to an object containing:
        - `url: string`: The purified URL.
        - `rule: string`: The matched rule.
- `clearStatistics(): void`: Clear statistics.
- `clearRules(): void`: Clear all imported rules.
- `getStatistics(): object`: Get statistics.
- `addEventListener("statisticschange", callback: function): void`: Add an event listener for statistics change.
    - The `callback` function will receive a `CustomEvent` or an `Event` object, depending on whether the platform supports `CustomEvent`.
    - If the platform supports `CustomEvent`, the `detail` property of the event object will contain the incremental statistics.
- `removeEventListener("statisticschange", callback: function): void`: Remove an event listener for statistics change.
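The `CustomEvent` / `Event` distinction described above can be pictured with a small stand-alone emitter. This is only a sketch of the pattern, not pURLfy's source: the `StatsEmitter` class and its `add` method are hypothetical names made up for the illustration.

```javascript
// Hypothetical stand-in for the statistics part of a purifier (not pURLfy's code).
class StatsEmitter extends EventTarget {
  #stats = {};

  // Apply an increment such as { url: 1, param: 2 } and notify listeners.
  add(increment) {
    for (const [key, value] of Object.entries(increment)) {
      this.#stats[key] = (this.#stats[key] ?? 0) + value;
    }
    // Fall back to a plain Event (without `detail`) where CustomEvent is missing.
    const event = typeof CustomEvent === "function"
      ? new CustomEvent("statisticschange", { detail: increment })
      : new Event("statisticschange");
    this.dispatchEvent(event);
  }

  getStatistics() {
    return { ...this.#stats };
  }
}

const emitter = new StatsEmitter();
const seen = [];
emitter.addEventListener("statisticschange", (e) => seen.push(e.detail ?? null));
emitter.add({ url: 1, param: 2 });
emitter.add({ url: 1, char: 17 });
console.log(seen.length, emitter.getStatistics());
```

A listener that has to work on every platform should therefore treat `e.detail` as optional and fall back to `getStatistics()` for the totals.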
Instance Properties
You can change these properties after instantiation, and they will take effect for the next call to `purify`.
- `fetchEnabled: Boolean`: Whether the redirect mode is enabled.
- `lambdaEnabled: Boolean`: Whether the lambda mode is enabled.
- `maxIterations: Number`: Maximum number of iterations.
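To give a feel for what `maxIterations` bounds, here is a rough sketch of an iterative loop applied to the deeply nested test URL from the Quick Start list. It is an assumption about the general shape of the process, not pURLfy's implementation; `purifyIteratively` and `unwrapOnce` are hypothetical helpers, and the parameter names in `unwrapOnce` are hard-coded for this one URL.

```javascript
// Hypothetical single step: unwrap one layer, or strip `utm_source` when nothing is left to unwrap.
function unwrapOnce(urlString) {
  const url = new URL(urlString);
  for (const name of ["remoteUrl", "url", "pfurl"]) {
    const inner = url.searchParams.get(name); // get() percent-decodes one level
    if (inner) return { url: inner, shouldContinue: true };
  }
  url.searchParams.delete("utm_source");
  return { url: url.href, shouldContinue: false };
}

// Repeat the step until it stops changing the URL, asks to stop, or the cap is reached.
function purifyIteratively(urlString, step, maxIterations = 5) {
  let current = urlString;
  for (let i = 0; i < maxIterations; i++) {
    const { url, shouldContinue } = step(current);
    if (url === current) break;
    current = url;
    if (!shouldContinue) break;
  }
  return current;
}

const nested = "https://www.minecraftforum.net/linkout?remoteUrl=https%3A%2F%2Fwww.urlshare.cn%2Fumirror_url_check%3Furl%3Dhttps%253A%252F%252Fc.pc.qq.com%252Fmiddlem.html%253Fpfurl%253Dhttps%25253A%25252F%25252Fgithub.com%25252Fjiashuaizhang%25252Frpc-encrypt%25253Futm_source%25253Dtest";
const fullyPurified = purifyIteratively(nested, unwrapOnce);
const cappedEarly = purifyIteratively(nested, unwrapOnce, 2);
console.log(fullyPurified);
console.log(cappedEarly);
```

With the default cap of 5 the loop has room for all four steps this URL needs; with a cap of 2 it stops at an intermediate wrapper, which is the trade-off the property controls.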
Static Properties
- `Purlfy.version: string`: The version of pURLfy.
📖 Rulesets
Community-contributed rulesets are hosted on GitHub, and you can find them at pURLfy-rules. The format of a ruleset file is as follows:
{
    "<domain>": {
        "<path>": {
            // A single rule
            "description": "<description>",
            "mode": "<mode>",
            // Other parameters
            "author": "<author>"
        },
        // ...
    },
    // ...
}
The formal definition of the format can be found in `ruleset.schema.json`.
✅ Path Matching
`<domain>`, `<path>`: The domain and a part of the path, such as `example.com/`, `/^.+\.example\.com$`, `path/` and `page`. Here's an explanation of them:
- The basic behavior is like paths on Unix file systems.
    - If not ending with `/`, its value will be treated as a rule.
    - If ending with `/`, there are more paths under it, like "folders" (theoretically, you can nest infinitely).
    - `/` is not allowed in the middle of `<domain>` or `<path>`.
- Note that if it starts with `/`, it will be treated as a RegExp pattern.
    - For example, `/^.+\.example\.com$` will match all subdomains of `example.com`, and `/^\d+$` will match a part of the path that contains only digits.
    - Do remember to escape `\`, `.` etc. in JSON strings.
    - An empty regex (i.e. `/` or `//`) will be ignored.
    - Using RegExp is not recommended unless necessary, since it will slow down the matching process.
- If it's an empty string, it will be treated as a fallback rule: this rule will be used when no other rules are matched at this level.
- If multiple rules match, the best matched rule will be used. (Exact match > RegExp match > Fallback rule)
- If you want a rule to match all paths under a domain, you can omit `<path>`, but remember to remove the `/` after the domain.
A simple example with comments showing the URLs that can be matched:
{
    "example.com/": {
        "a": {
            // The rule here will match "example.com/a"
        },
        "path/": {
            "to/": {
                "page": {
                    // The rule here will match "example.com/path/to/page"
                },
                "/^\\d+$": { // Remember to escape `\`
                    // The rule here will match all paths under "example.com/path/to/" that are composed of digits
                },
                "": {
                    // The rule here will match "example.com/path/to", excluding "page" and digits under it
                }
            },
            "": {
                // The rule here will match "example.com/path", excluding "to" under it
            }
        },
        "": {
            // The rule here will match "example.com", excluding "path" under it
        }
    },
    "example.org": {
        // The rule here will match every path under "example.org"
    },
    "": {
        // Fallback: this rule will be used for all paths that are not matched
    }
}
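The lookup described above can be approximated with a short recursive walk over the ruleset tree. Treat the following as a simplified sketch under stated assumptions rather than pURLfy's matcher: `matchUrl`, `findRule` and `keyMatches` are hypothetical names, a rule key is assumed to match as soon as its own segment matches, and candidates are tried in the order exact, RegExp, fallback.

```javascript
// A key is a "directory" when it ends with "/" (a lone "/" is an empty regex, not a directory).
const isDirectory = (key) => key.length > 1 && key.endsWith("/");

function keyMatches(key, segment) {
  if (key.startsWith("/")) {
    // RegExp key: drop the leading "/" and, for directories, the trailing "/".
    const pattern = key.slice(1, isDirectory(key) ? -1 : undefined);
    return pattern !== "" && new RegExp(pattern).test(segment); // empty regex is ignored
  }
  return (isDirectory(key) ? key.slice(0, -1) : key) === segment;
}

function findRule(node, segments) {
  const [head = "", ...rest] = segments;
  const keys = Object.keys(node).filter((k) => k !== "");
  const ordered = [
    ...keys.filter((k) => !k.startsWith("/") && keyMatches(k, head)), // exact match
    ...keys.filter((k) => k.startsWith("/") && keyMatches(k, head)), // RegExp match
    ...("" in node ? [""] : []), // fallback rule
  ];
  for (const key of ordered) {
    if (!isDirectory(key)) return node[key];
    const found = findRule(node[key], rest);
    if (found !== null) return found;
  }
  return null;
}

function matchUrl(rules, urlString) {
  const url = new URL(urlString);
  return findRule(rules, [url.hostname, ...url.pathname.split("/").filter(Boolean)]);
}

// The example ruleset from above, with a description on each rule so matches can be told apart.
const demoRules = {
  "example.com/": {
    "a": { description: "a" },
    "path/": {
      "to/": {
        "page": { description: "page" },
        "/^\\d+$": { description: "digits" },
        "": { description: "to-fallback" },
      },
      "": { description: "path-fallback" },
    },
    "": { description: "com-fallback" },
  },
  "example.org": { description: "org" },
  "": { description: "global-fallback" },
};

console.log(matchUrl(demoRules, "https://example.com/path/to/42").description);
```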
Here's an erroneous example:
{
    "example.com/": {
        "path/": { // Path ending with `/` will be treated as a "directory", thus you should remove the trailing `/`
            // Attempting to match "example.com/path"
        }
    },
    "example.org": { // Path not ending with `/` will be treated as a rule, thus you should add a trailing `/`
        "page": {
            // Attempting to match "example.org/page"
        }
    },
    "example.net/": {
        "path/to/page": { // Can't contain `/` in the middle - you should nest them
            // Attempting to match "example.net/path/to/page"
        },
        "/^\d+$": { // `\d` won't parse correctly in JSON strings, so use `\\d` instead
            // Attempting to match all paths under "example.net/" that are composed of digits
        }
    }
}
📃 A Single Rule
Paths not ending with `/` will be treated as a single rule, and there are multiple modes for a rule. The common parameters are as follows:
{
    "description": "<Rule Description>",
    "mode": "<Mode>",
    // Mode-specific parameters
    "author": "<Author>"
}
This table shows supported parameters for each mode:
| Param\Mode | `white` | `black` | `param` | `regex` | `redirect` | `visit` | `lambda` |
| ---------- | -- | --- | -- | --- | -- | --- | -- |
| `std` | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| `params` | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| `acts` | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ |
| `regex` | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| `replace` | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| ~~`ua`~~ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ |
| `headers` | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ |
| `lambda` | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| `continue` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
🟢 Whitelist Mode white
| Param | Type | Default |
| --- | --- | --- |
| `params` | `string[]` | Required |

Under Whitelist mode, only the parameters specified in `params` will be kept, and others will be removed. This is the most commonly used mode.
🔴 Blacklist Mode black
| Param | Type | Default |
| --- | --- | --- |
| `params` | `string[]` | Required |
| `std` | `Boolean` | `false` |

Under Blacklist mode, the parameters specified in `params` will be removed, and others will be kept. `std` is for controlling whether the URL search string shall be deemed standard. Only if it is `true` or the URL search string is indeed standard will the URL be processed.
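The difference between the Whitelist and Blacklist modes comes down to which side of the `params` list survives. Here is a minimal sketch using the standard `URL` API; `filterParams` is a hypothetical helper, not a pURLfy function, and the `std` check is left out because its exact definition is not given here.

```javascript
// Hypothetical helper for illustration; not part of pURLfy's API.
function filterParams(urlString, params, mode) {
  const url = new URL(urlString);
  const listed = new Set(params);
  // Snapshot the keys first: deleting while iterating a live URLSearchParams can skip entries.
  for (const key of [...url.searchParams.keys()]) {
    const keep = mode === "white" ? listed.has(key) : !listed.has(key);
    if (!keep) url.searchParams.delete(key);
  }
  return url.href;
}

const sampleUrl = "https://example.com/watch?v=abc&utm_source=mail&share=1";
const whitelisted = filterParams(sampleUrl, ["v"], "white");
const blacklisted = filterParams(sampleUrl, ["utm_source"], "black");
console.log(whitelisted); // only `v` is kept
console.log(blacklisted); // only `utm_source` is removed
```

A whitelist removes tracking parameters it has never seen before, which is one reason it suits sites whose meaningful parameters are few and well known.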
🟤 Specific Parameter Mode param
| Param | Type | Default |
| --- | --- | --- |
| `params` | `string[]` | Required |
| `acts` | `string[]` | `["url"]` |
| `continue` | `Boolean` | `true` |

Under Specific Parameter mode, pURLfy will:
- Attempt to extract the parameters specified in `params` in order, until the first existing parameter is matched.
- Decode the parameter value using the processors specified in the `acts` array in order (if any `acts` value is invalid or throws an error, it is considered a failure and the original URL is returned).
- Use the final result as the new URL.
- If `continue` is not set to `false`, purify the new URL again.
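The first three steps can be sketched as follows. This is an illustration under assumptions, not pURLfy's code: `applyParamRule` and the two-entry `decoders` table are hypothetical, `acts` must be given explicitly (the `["url"]` default is not modeled, because `searchParams.get` already percent-decodes once), and the `continue` step is omitted.

```javascript
// Hypothetical two-entry processor table; the full list is under "Processors".
const decoders = {
  url: (s) => decodeURIComponent(s),
  base64: (s) => new TextDecoder().decode(Uint8Array.from(atob(s), (c) => c.charCodeAt(0))),
};

function applyParamRule(urlString, rule) {
  const url = new URL(urlString);
  // Step 1: the first listed parameter that exists wins.
  const name = rule.params.find((p) => url.searchParams.has(p));
  if (name === undefined) return urlString;
  try {
    // Step 2: run the value through each processor in order.
    let value = url.searchParams.get(name);
    for (const act of rule.acts) {
      if (!Object.hasOwn(decoders, act)) throw new Error(`Unknown processor: ${act}`);
      value = decoders[act](value);
    }
    return value; // Step 3: the result becomes the new URL.
  } catch {
    return urlString; // Any failure returns the original URL.
  }
}

const jumpRule = { params: ["target", "u"], acts: ["base64"] };
const extracted = applyParamRule("https://example.com/jump?u=aHR0cHM6Ly9naXRodWIuY29t&from=feed", jumpRule);
console.log(extracted);
```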
🟣 Regex Mode regex
| Param | Type | Default |
| --- | --- | --- |
| `acts` | `string[]` | `[]` |
| `regex` | `string[]` | Required |
| `replace` | `string[]` | Required |
| `continue` | `Boolean` | `true` |

Under Regex mode, pURLfy will, for each `regex`-`replace` pair:
- Match the RegExp pattern specified in `regex` against the URL.
- Replace all matched parts with the "replacement string" specified in `replace`.
- Decode the result using the processors specified in the `acts` array in order (if any `acts` value is invalid or throws an error, it is considered a failure and the original URL is returned).
If you'd like to learn more about the syntax of the "replacement string", please refer to the MDN documentation.
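As a rough illustration of the `regex`-`replace` pairing, the sketch below rewrites a domain while keeping the rest of the URL through a `$1` capture reference. `applyRegexRule` and the rule contents are hypothetical, and the `acts` and `continue` handling is left out.

```javascript
// Hypothetical helper: apply each regex/replace pair in order (not pURLfy's code).
function applyRegexRule(urlString, rule) {
  return rule.regex.reduce(
    // The "g" flag makes `replace` substitute every match, not just the first one.
    (current, pattern, i) => current.replace(new RegExp(pattern, "g"), rule.replace[i]),
    urlString,
  );
}

const mirrorRule = {
  regex: ["^https?://old\\.example\\.com/(.*)$"], // backslashes are doubled, as in JSON strings
  replace: ["https://new.example.com/$1"],
};
const mirrored = applyRegexRule("http://old.example.com/a/b?x=1", mirrorRule);
console.log(mirrored);
```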
🟡 Redirect Mode redirect
[!CAUTION] For compatibility reasons, the `redirect` mode is disabled by default. Refer to the API documentation for enabling it.

| Param | Type | Default |
| --- | --- | --- |
| ~~`ua`~~ | `string` | `undefined` |
| `headers` | `object` | `{}` |
| `continue` | `Boolean` | `true` |

Under Redirect mode, pURLfy will call the constructor parameter `fetch` to get the redirected URL, by firing a `HEAD` request to the matched URL using `headers` as the headers, and return the `Location` header or the updated `response.url`. If `continue` is not set to `false`, the new URL will be purified again.

Note: the `ua` parameter will be deprecated in the future, and you should use `headers` to set the `User-Agent` header.
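The request described above might look roughly like this. It is a sketch under assumptions: `resolveRedirect` is a hypothetical helper, `redirect: "manual"` is a guess at how the `Location` header is made visible, and `fakeFetch` is a stand-in so the example runs without network access.

```javascript
// Hypothetical helper (not pURLfy's code): ask for the redirect target without following it.
async function resolveRedirect(urlString, headers, fetchFn) {
  const response = await fetchFn(urlString, { method: "HEAD", headers, redirect: "manual" });
  const location = response.headers.get("location");
  if (location) return new URL(location, urlString).href; // Location may be relative
  return response.url || urlString; // otherwise fall back to the response URL
}

// Stand-in for a real `fetch`; it always answers with a relative Location header.
async function fakeFetch(url, options) {
  return {
    url,
    headers: new Map([["location", "/landing?id=42"]]),
  };
}

const redirectDemo = resolveRedirect("https://short.example/abc", { "User-Agent": "curl/8.0" }, fakeFetch);
redirectDemo.then((target) => console.log(target));
```

Passing the fetch function in as an argument mirrors the `fetch` constructor option, which lets each platform supply its own implementation.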
🟠 Visit Mode visit
[!CAUTION] For compatibility reasons, the `redirect` mode is disabled by default. Refer to the API documentation for enabling it.

| Param | Type | Default |
| --- | --- | --- |
| ~~`ua`~~ | `string` | `undefined` |
| `headers` | `object` | `{}` |
| `acts` | `string[]` | `["regex:<url_pattern>"]` |
| `continue` | `Boolean` | `true` |

Under Visit mode, pURLfy will visit the URL with `headers` as the headers. If the URL has not been redirected, it will call the processors specified in `acts` in order (`<url_pattern>` is `https?:\/\/.(?:www\.)?[-a-zA-Z0-9@%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9@:%_\+.~#?!&\/\/=]*)`). The initial input to `acts` is of type `string`, i.e. the text returned by visiting the URL. If the URL has been redirected, the redirected URL will be returned. If `continue` is not set to `false`, the new URL will be purified again.

Note: the `ua` parameter will be deprecated in the future, and you should use `headers` to set the `User-Agent` header.
🔵 Lambda Mode lambda
[!CAUTION] For security reasons, the `lambda` mode is disabled by default. If you trust the rules provider, refer to the API documentation for enabling it.

| Param | Type | Default |
| --- | --- | --- |
| `lambda` | `string` | Required |
| `continue` | `Boolean` | `true` |

Under Lambda mode, pURLfy will try to execute the lambda function specified in `lambda` and use the result as the new URL. The function shall be async, and its body should accept a single `URL` parameter `url` and return a new `URL` object. For example:
{
    "example.com": {
        "description": "example",
        "mode": "lambda",
        "lambda": "url.searchParams.delete('key'); return url;",
        "continue": false,
        "author": "PRO-2684"
    },
    // ...
}
If the URL `https://example.com/?key=123` matches this rule, the `key` parameter will be deleted. After this operation, since `continue` is set to `false`, the URL returned by the function will not be purified again. Of course, this is not a good example, because the same result can be achieved with Blacklist mode.
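One way a function body stored as a string can be turned into an async function is the `AsyncFunction` constructor. The sketch below uses it on the rule above; it is an assumption about the mechanism, not pURLfy's code, and `applyLambdaRule` is a hypothetical name. It also shows why the mode is opt-in: the string runs as arbitrary code.

```javascript
// The AsyncFunction constructor is not a global, so it is taken from an async function's prototype.
const AsyncFunction = Object.getPrototypeOf(async function () {}).constructor;

// Hypothetical helper: build `async (url) => { <body> }` from the rule and run it.
async function applyLambdaRule(urlString, rule) {
  const fn = new AsyncFunction("url", rule.lambda); // executes untrusted text as code
  const result = await fn(new URL(urlString));
  return result.href;
}

const lambdaRule = { lambda: "url.searchParams.delete('key'); return url;" };
const lambdaDemo = applyLambdaRule("https://example.com/?key=123", lambdaRule);
lambdaDemo.then((href) => console.log(href));
```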
🖇️ Processors
Some processors support parameters; simply append them to the function name separated by a colon (`:`): `func:arg`. The following processors are currently supported:
- `url`: `string->string`, URL decoding (`decodeURIComponent`)
- `base64`: `string->string`, Base64 decoding of UTF-8 strings (Adapted from MDN)
- `slice:start:end`: `string->string`, String slicing (`s.slice(start, end)`), `start` and `end` will be converted to integers
- `regex:<regex>`: `string->string`, regex matching, returns the first match of the regex or an empty string if no match is found
- `dom`: `string->Document`, parse the string as an HTML `Document` object (you'll need to define `DOMParser` globally if using in Node.js)
- `sel:<selector>`: `Any->Element/null`, select the first element using CSS selector `<selector>` (The input shall have a `querySelector` method)
- `attr:<attribute>`: `Element->string`, get the value of the attribute `<attribute>` of the element (`getAttribute`)
- `text`: `Element->string`, get the text content of the element (`textContent`)
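To show how a chain of `func:arg` entries could be evaluated, here is a sketch covering the four string processors. It is not pURLfy's implementation: `runActs` and `stringProcessors` are hypothetical, accepting URL-safe Base64 (`-` and `_`) is an assumption of this sketch, and the chain applied to the `u` parameter of the Bing test URL is just one plausible choice.

```javascript
// Hypothetical processor table for the string->string processors only.
const stringProcessors = {
  url: (s) => decodeURIComponent(s),
  base64: (s) => {
    const standard = s.replaceAll("-", "+").replaceAll("_", "/"); // assumption: accept URL-safe input
    return new TextDecoder().decode(Uint8Array.from(atob(standard), (c) => c.charCodeAt(0)));
  },
  slice: (s, arg) => {
    const [start, end] = arg.split(":");
    return s.slice(parseInt(start, 10), end ? parseInt(end, 10) : undefined);
  },
  regex: (s, arg) => (s.match(new RegExp(arg)) ?? [""])[0],
};

function runActs(input, acts) {
  return acts.reduce((value, act) => {
    // Split at the first colon only, so a regex argument may itself contain colons.
    const sep = act.indexOf(":");
    const name = sep === -1 ? act : act.slice(0, sep);
    const arg = sep === -1 ? "" : act.slice(sep + 1);
    if (!Object.hasOwn(stringProcessors, name)) throw new Error(`Unknown processor: ${name}`);
    return stringProcessors[name](value, arg);
  }, input);
}

// The `u` parameter from the Bing test URL: two prefix characters, then Base64.
const bingTarget = runActs("a1aHR0cHM6Ly9nby5taWNyb3NvZnQuY29tL2Z3bGluay8_bGlua2lkPTg2ODkyMg", ["slice:2", "base64"]);
console.log(bingTarget);
```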
😎 Projects Using pURLfy
[!TIP] If you are using pURLfy in your project, feel free to submit a PR to add your project here!
- Our Demo Page
- ~~Our Telegram Bot @purlfy_bot~~ (Source code)
- pURLfy for Tampermonkey
- LiteLoaderQQNT-pURLfy
🎉 Acknowledgments
- Thanks to Tarnhelm for the initial inspiration of pURLfy.
- The logo of pURLfy is a combination of the "Incognito" icon and the "Ghost" icon from SVG Repo. It was combined using Inkscape and optimized using SVGOMG.