@paladincyber/santa-list-builder

v0.4.1

Published

3 years ago

Module for interrogating Santa about whether a site is good or bad, and compiling this list

santa-list-builder

What is this?

This repo is responsible for

documenting the dynamo DB table structure
parsing raw lists of various websites we want to block and then uploading them to dynamo
implementing the canonical server<->dynamo API

Where is this?

This project depends on Paladin's SDK, primarily for its URL hashing functionality. It also re-exports the SDK's client<->server API, without using that API itself, so that santa-server can consume it from here. Re-exporting the client<->server API prevents a diamond/triangle dependency between the SDK, santa-server, and this project, which could result in different versions of the hashing functions running simultaneously. In addition to the re-exported code, this project exports a dynamo API, which santa-server consumes to handle lookup requests from clients.

Dynamo

Table layout

The primary key is a compound of DK, the hashed URL, and L, a canonical JSON string. The other columns in the table are:

I, a non-null boolean field that is true when the row is an index
EAT, the (integer) timestamp for when this entry expires
S, our (integer) score for how confident we are, which is hardcoded to 50 for now
CD, a nullable boolean field that is true when the index points to a subdomain
CP, a nullable boolean field that is true when the index points to a path
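As a rough illustration of the layout above, a row could be typed as follows. Only the attribute names (DK, L, I, EAT, S, CD, CP) and the hardcoded score of 50 come from this README; `SantaRow`, `makeEntryRow`, and the choice to leave CD and CP null on non-index rows are assumptions made for the sketch.

```typescript
// Hypothetical row type; attribute names follow the table layout above.
interface SantaRow {
  DK: string;         // hashed URL, first part of the compound primary key
  L: string;          // canonical JSON string, second part of the compound primary key
  I: boolean;         // true when the row is an index
  EAT: number;        // integer timestamp for when the entry expires
  S: number;          // integer confidence score
  CD: boolean | null; // true when the index points to a subdomain
  CP: boolean | null; // true when the index points to a path
}

// The README states the score is hardcoded to 50 for now.
const DEFAULT_SCORE = 50;

// Builds a non-index row. Leaving CD and CP null here is an assumption.
function makeEntryRow(hashedUrl: string, canonicalJson: string, expiresAt: number): SantaRow {
  if (!Number.isInteger(expiresAt)) {
    throw new RangeError("EAT must be an integer timestamp");
  }
  return {
    DK: hashedUrl,
    L: canonicalJson,
    I: false,
    EAT: expiresAt,
    S: DEFAULT_SCORE,
    CD: null,
    CP: null,
  };
}
```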

JSON Structure

The L column is a canonical JSON string with two possible structures

Index

This field carries no meaning; however, for dynamo's indexing purposes it must be the same for every index entry. We hardcode it to "i".

Entry

This field has the following keys:

t, the type of the entry. It can be one of content, malicious, exempt (whitelisted by us), or alexa (something in the alexa top-X)
st, the subtype of the entry. Depending on the value of t it can have the following values:
- content: gambling, drugs, violence, aggressive, porn
- malicious: proxy, spyware, hacking, suspect, warez, phish, minejacking
- exempt: manual
- alexa: how highly ranked it is. 1000, 10000, 1000000, &c.
url, the url of the site matching the entry.
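The key and subtype rules above can be sketched as a small validator. The type and function names are hypothetical, and treating the alexa subtype as a string of digits is an assumption, since the README does not say whether the rank is stored as a string or a number.

```typescript
type EntryType = "content" | "malicious" | "exempt" | "alexa";

// Hypothetical shape of an entry stored in the L column.
interface Entry {
  t: EntryType;
  st: string;
  url: string;
}

// Allowed subtypes per type, as listed above (alexa is handled separately).
const SUBTYPES: Record<Exclude<EntryType, "alexa">, readonly string[]> = {
  content: ["gambling", "drugs", "violence", "aggressive", "porn"],
  malicious: ["proxy", "spyware", "hacking", "suspect", "warez", "phish", "minejacking"],
  exempt: ["manual"],
};

function isValidEntry(entry: Entry): boolean {
  const t = entry.t;
  if (t === "alexa") {
    // Assumption: rank buckets such as 1000 or 10000 are stored as digit strings.
    return /^[1-9][0-9]*$/.test(entry.st);
  }
  return SUBTYPES[t].indexOf(entry.st) !== -1;
}
```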

Appendix

Canonical JSON string representation

On the order of objects

The JSON object

{
  a: true,
  b: 1
}

can be stringified to either '{"a":true,"b":1}' or '{"b":1,"a":true}'. This poses problems for string comparisons of JSON strings. The solution is either to compare JSON objects only by deep comparison (i.e., by value, not by memory location) or to fix an ordering for the keys within an object, so that regardless of an object's layout in memory the keys always appear in the same order.
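The key-ordering approach can be sketched as a recursive stringifier that sorts keys before emitting them. This is a minimal illustration, not necessarily the implementation used in this project; `canonicalStringify` is a hypothetical name, and sorting by JavaScript's default string order (UTF-16 code units) is an assumption about which ordering is chosen.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serializes a JSON value with object keys in sorted order and no optional whitespace.
function canonicalStringify(value: Json): string {
  if (value === null || typeof value !== "object") {
    // Primitives: defer to the built-in serializer.
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Array order is significant, so it is preserved.
    return "[" + value.map((item) => canonicalStringify(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const members = keys.map((key) => JSON.stringify(key) + ":" + canonicalStringify(value[key]));
  return "{" + members.join(",") + "}";
}

console.log(canonicalStringify({ b: 1, a: true })); // {"a":true,"b":1}
```

With this, two objects that differ only in key order produce identical strings, so plain string comparison becomes safe.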

Co m p a c t n e s s

There is also an issue with whitespace within a JSON string.

'{"a":true,"b":1}'

' {"a":true,"b":1} '

' { "a" : true , "b" : 1 } '

These strings all represent the same JSON object. The solution is to mandate rules for nonessential whitespace and optional delimiters. In our case, there will be no unnecessary whitespace within the canonical JSON string and no commas after the last item in an array/object, which minimizes the size of the string.
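In JavaScript and TypeScript, the built-in JSON.stringify emits no optional whitespace and no trailing commas when called without an indent argument, so a parse/stringify round trip is enough to collapse the three strings above into one form. A small sketch (`compact` is a hypothetical helper name):

```typescript
// Re-serializes a JSON string without any nonessential whitespace.
function compact(json: string): string {
  return JSON.stringify(JSON.parse(json));
}

const variants: string[] = [
  '{"a":true,"b":1}',
  ' {"a":true,"b":1} ',
  ' { "a" : true , "b" : 1 } ',
];

const compacted: string[] = variants.map((v) => compact(v));
console.log(compacted.every((s) => s === '{"a":true,"b":1}')); // true
```

Note that this round trip keeps keys in their original order, so it addresses whitespace only, not key ordering.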

One encoding to rule them all

Unicode strings are an issue because there are multiple ways to encode the same Unicode string. At first glance it would appear that we could just normalize them; however, this will break domains that use non-canonical UTF-8. Additionally, since strings are 'just' a collection of arbitrary bytes, not every string can be safely treated as UTF-8, but that is a general concern, not one that applies to this project. For our purposes, we assume all strings can safely be parsed and normalized as UTF-8. This allows us to 'normalize' every string as part of stringification.

Since we are most likely to encounter unicode in the context of a domain or path, punycode should be acceptable.
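One way to obtain the punycode form of a hostname is the standard WHATWG URL class, available as a global in Node.js and in browsers, which converts internationalized hostnames to their ASCII form while parsing. The sketch below assumes that approach; `toAsciiHostname` is a hypothetical helper, not part of this project's API.

```typescript
// Returns the ASCII (punycode) form of a hostname using the WHATWG URL parser.
// Throws a TypeError if the hostname cannot be parsed.
function toAsciiHostname(hostname: string): string {
  return new URL("http://" + hostname).hostname;
}

console.log(toAsciiHostname("münchen.de")); // xn--mnchen-3ya.de
```

Because the output is plain ASCII, two visually identical hostnames that were typed with different Unicode encodings compare equal once converted.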

Wherefore art thou Numeric?

Finally, though this shouldn't come up for our purposes:

floating point numbers are disallowed,
leading 0s in integers are removed,
-0 becomes 0,
all numbers are represented in base 10 format.
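These numeric rules could be enforced at stringification time roughly as follows. `canonicalInteger` is a hypothetical name, and restricting input to safe integers is an added assumption so that large values are never printed in exponent notation. A JavaScript number value cannot carry leading zeros, so that rule only matters when parsing text.

```typescript
// Returns the canonical base 10 string for an integer, rejecting everything else.
function canonicalInteger(n: number): string {
  // Rejects floats, NaN, Infinity, and integers too large to represent exactly.
  if (!Number.isSafeInteger(n)) {
    throw new RangeError("only safe integers have a canonical form in this sketch");
  }
  // -0 and 0 must stringify identically.
  return Object.is(n, -0) ? "0" : n.toString(10);
}

console.log(canonicalInteger(-0)); // 0
console.log(canonicalInteger(42)); // 42
```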

On the shoulders of giants

This standard is heavily influenced by this page.

Browser Extensions

The browser extensions interact with the Santa backend using the Paladin-SDK. The Santa lists are used for filtering bad or malicious websites as the user browses the internet. All Santa related code and API calls are here.

If the user has the filter_malicious or filter_content setting turned on, the extension evaluates the website by calling santa.ts: checkURL with the website URL. To protect the user's privacy, we anonymize the URL by hashing its host and path components before requesting the hits from Santa. The Santa backend compares the URL to the lists and returns a list of hits (if there are any), which the extension iterates through. If the URL has multiple matches in the same category (e.g. for the domain and a subdomain), we return the most specific match. To streamline the process, there is also an in-memory cache for the Santa API calls.

Users can also whitelist URLs. This happens when a user lands on a page flagged as malicious but decides to continue to it. This whitelists the entire domain for that user.

API Calls and their signatures:

checkURL/checkURLFromCache

Takes the URL of the website and returns a list of Santa hits. This is used in the client for filling the caches when entering the website for the first time.

checkURLFromCache has the same return signature, but it fetches the results from the cache and is therefore synchronous.

The response is in the following format:

const response = {
	additional: {},
	domain_hash: data.hlist[data.hlist.length-1],
	hits: {},
};

Example of a response for a website with NSFW content:

{
	"additional": {},
	"domain_hash": "4d236d9a.89e55d4f.62933a29",
	"hits": {
		"porn.com": {
			"class": {
				"content": {
					"porn": {
						"ein": 2563545
					}
				}
			}
		}
	}
}

Another example response, for a whitelisted website:

{
	"additional": {
		"clean": "4d236d9a.fdc17bb3.e1c03b9a"
	},
	"domain_hash": "4d236d9a.fdc17bb3.e1c03b9a",
	"hits": {}
}
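To show how a client might iterate through the hits, here is a sketch that types the raw response and flattens it into a list. The nesting (URL, then class, then category, then subtype) is inferred from the two example responses above; `RawSantaResponse`, `FlatHit`, and `flattenHits` are hypothetical names, and the meaning of `ein` is not documented here, so it is carried as an opaque number.

```typescript
// Hypothetical type inferred from the example responses above.
interface RawSantaResponse {
  additional: { [key: string]: string };
  domain_hash: string;
  hits: {
    [url: string]: {
      class: { [category: string]: { [subtype: string]: { ein: number } } };
    };
  };
}

interface FlatHit {
  url: string;
  category: string;
  subtype: string;
}

// Walks the nested hits object and returns one record per (url, category, subtype).
function flattenHits(response: RawSantaResponse): FlatHit[] {
  const flat: FlatHit[] = [];
  for (const url of Object.keys(response.hits)) {
    const classes = response.hits[url].class;
    for (const category of Object.keys(classes)) {
      for (const subtype of Object.keys(classes[category])) {
        flat.push({ url, category, subtype });
      }
    }
  }
  return flat;
}
```

For the NSFW example above this yields a single record with category "content" and subtype "porn"; for the whitelisted example it yields an empty list.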

evaluateURL/evaluateURLFromCache

evaluateURL/evaluateURLFromCache has the same functionality as checkURL/checkURLFromCache, with the difference that the results are "evaluated". This means that for each category the most specific match is returned (out of the potentially returned domain, subdomain, or path matches). This function is used for reading the results for the website.

The evaluated results are in the following format:

export interface ISantaResponse {
    clean?: IFilterMatch;
    malicious?: IFilterMatch;
    content?: IFilterMatch;
}

export interface IFilterMatch extends IFilterEntry {
    url: string;
    type: string;
    pathLength?: number;
}

The categories and their meanings: clean (whitelisted URL), malicious (potential malware), content (NSFW content such as drugs or pornography).
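The "most specific match per category" step might look like the sketch below. It uses simplified, hypothetical types rather than the real IFilterMatch, and the specificity rule (longer pathLength wins, ties broken by the longer URL string) is an assumption; the README does not spell out how specificity is ranked.

```typescript
type Category = "clean" | "malicious" | "content";

// Simplified stand-in for IFilterMatch, for illustration only.
interface Match {
  category: Category;
  url: string;
  pathLength?: number;
}

type Evaluated = Partial<Record<Category, Match>>;

// Assumed ranking: deeper paths are more specific; otherwise the longer URL wins.
function isMoreSpecific(a: Match, b: Match): boolean {
  const pathA = a.pathLength ?? 0;
  const pathB = b.pathLength ?? 0;
  if (pathA !== pathB) {
    return pathA > pathB;
  }
  return a.url.length > b.url.length;
}

// Keeps only the most specific match for each category.
function evaluateMatches(matches: Match[]): Evaluated {
  const result: Evaluated = {};
  for (const match of matches) {
    const current = result[match.category];
    if (current === undefined || isMoreSpecific(match, current)) {
      result[match.category] = match;
    }
  }
  return result;
}
```

Given a domain match and a subdomain match in the same category, this keeps the subdomain match, which lines up with the domain/subdomain example given for checkURL.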

whiteListURL

This function is called when the user decides to continue to a website despite it being flagged as malicious or as unsafe content. The URL is added to the in-memory cache for one hour.
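A one-hour in-memory whitelist can be sketched as a map from key to expiry time. `WhitelistCache` is a hypothetical class, not the extension's actual implementation; the injectable clock is an addition to make the expiry behavior easy to test, and what exactly is used as the key (full URL or domain) is left to the caller.

```typescript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Hypothetical in-memory whitelist with a fixed one-hour lifetime per entry.
class WhitelistCache {
  private readonly expiry = new Map<string, number>();
  private readonly now: () => number;

  // The clock is injectable for testing; it defaults to the system clock.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  add(key: string): void {
    this.expiry.set(key, this.now() + ONE_HOUR_MS);
  }

  has(key: string): boolean {
    const expiresAt = this.expiry.get(key);
    if (expiresAt === undefined) {
      return false;
    }
    if (this.now() >= expiresAt) {
      // Drop expired entries lazily on lookup.
      this.expiry.delete(key);
      return false;
    }
    return true;
  }
}
```

Because entries live only in memory, the whitelist is lost whenever the extension's process restarts, which fits the temporary nature of the user's choice.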