@paladincyber/santa-list-builder
v0.4.1
Module for interrogating Santa about whether a site is good or bad, and compiling this list
santa-list-builder
What is this?
This repo is responsible for:
- documenting the DynamoDB table structure
- parsing raw lists of various websites we want to block and then uploading them to dynamo
- implementing the canonical server<->dynamo API
Where is this?
This project depends on Paladin's SDK, primarily for the URL hashing functionality. It also re-exports the client<->server API without using it itself; that API is consumed by santa-server. Re-exporting the client<->server API prevents a diamond/triangle dependency between the SDK, santa-server, and this project, which could result in different versions of the hashing functions running simultaneously. In addition to the re-exported code, this project exports a dynamo API, which is consumed by santa-server to handle lookup requests from clients.
Dynamo
Table layout
The primary key is a compound of DK, the hashed URL, and L, a canonical JSON string object. The other columns in the table are:
- I, a non-null boolean field that is true when the row is an index
- EAT, the (integer) timestamp for when this entry expires
- S, our (integer) score for how confident we are, which is hardcoded to 50 for now
- CD, a nullable boolean field that is true when the index points to a subdomain
- CP, a nullable boolean field that is true when the index points to a path
JSON Structure
The L column is a canonical JSON string with two possible structures:
Index
This field is meaningless; however, for dynamo's indexing purposes it must be the same for every index entry. We hardcode it to "i".
Entry
This field has the following keys:
- t, the type of the entry. It can be one of content, malicious, exempt (whitelisted by us), or alexa (something in the alexa top-X)
- st, the subtype of the entry. Depending on the value of t it can have the following values:
  - content: gambling, drugs, violence, aggressive, porn
  - malicious: proxy, spyware, hacking, suspect, warez, phish, minejacking
  - exempt: manual
  - alexa: how highly ranked it is. 1000, 10000, 1000000, &c.
- url, the url of the site matching the entry.
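The allowed combinations of t and st can be captured in a small lookup table. The following is a sketch, not the package's actual types: Entry, SUBTYPES, and isValidEntry are hypothetical names, and representing the alexa rank as a string of decimal digits is an assumption.

```typescript
type EntryType = "content" | "malicious" | "exempt" | "alexa";

// Hypothetical shape of an entry stored in the L column.
interface Entry {
  t: EntryType;
  st: string;
  url: string;
}

// Fixed subtypes per type; alexa is handled separately because its
// subtype is a rank bucket rather than a fixed name.
const SUBTYPES: Record<EntryType, readonly string[]> = {
  content: ["gambling", "drugs", "violence", "aggressive", "porn"],
  malicious: ["proxy", "spyware", "hacking", "suspect", "warez", "phish", "minejacking"],
  exempt: ["manual"],
  alexa: [],
};

// Returns true when the subtype is allowed for the entry's type.
function isValidEntry(entry: Entry): boolean {
  if (entry.t === "alexa") {
    // Assumption: rank buckets such as 1000 or 10000 are stored as digits.
    return /^[1-9][0-9]*$/.test(entry.st);
  }
  return SUBTYPES[entry.t].indexOf(entry.st) !== -1;
}
```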
Appendix
Canonical JSON string representation
On the order of objects
The JSON object
{
a: true,
b: 1
}
can be stringified to either '{"a":true,"b":1}' or '{"b":1,"a":true}'. This poses problems for string comparisons of JSON strings. The solution is either to compare JSON objects only by deep comparison (i.e. by value, not by memory location), or to fix an ordering for the keys within an object so that, regardless of an object's layout in memory, the keys always appear in the same order.
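One way to fix the key order is to sort the keys before serializing. The sketch below is illustrative only: canonicalStringify is a hypothetical name, and sorting keys by JavaScript's default string order (UTF-16 code units) is an assumption, since this README does not say which ordering the package uses.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serializes a JSON value with object keys in sorted order and no
// optional whitespace, so equal objects yield equal strings.
function canonicalStringify(value: Json): string {
  if (value === null || typeof value !== "object") {
    // Primitives: defer to the built-in serializer.
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Array order is significant, so it is preserved.
    return "[" + value.map((item) => canonicalStringify(item)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const members = keys.map(
    (key) => JSON.stringify(key) + ":" + canonicalStringify(value[key])
  );
  return "{" + members.join(",") + "}";
}
```

With this, canonicalStringify({ b: 1, a: true }) and canonicalStringify({ a: true, b: 1 }) produce the same string, so plain string comparison is safe.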
Co m p a c t n e s s
There is also an issue with whitespace within a JSON string.
'{"a":true,"b":1}'
' {"a":true,"b":1} '
' { "a" : true , "b" : 1 } '
These strings all represent the same JSON object. The solution is to mandate rules for nonessential whitespace and optional delimiters. In our case, there will be no unnecessary whitespace within the canonical JSON string and no commas after the last item in an array/object, which minimizes the size of the string.
One encoding to rule them all
Unicode strings are an issue, because there are multiple ways to encode the same unicode string. At first glance it would appear that we could just normalize them; however, this will break domains that use non-canonical UTF-8 in them. Additionally, since strings are 'just' a collection of arbitrary bytes, not every string can be safely treated as UTF-8, but that is a general concern, not one that will apply to this project. For our purposes, we assume all strings can safely be parsed and normalized as UTF-8. This allows us to 'normalize' every string as part of stringification.
Since we are most likely to encounter unicode in the context of a domain or path, punycode should be acceptable.
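To illustrate the "multiple encodings" problem, the sketch below compares two spellings of the same visible character. The README does not say which normalization form is used, so NFC here is an assumption, and normalizeText is a hypothetical helper; punycode conversion is not shown.

```typescript
// Assumption: NFC is the chosen normalization form.
function normalizeText(text: string): string {
  return text.normalize("NFC");
}

// "é" as one precomposed code point, and as "e" plus a combining accent.
const composed = "\u00e9";
const decomposed = "e\u0301";

// The raw strings differ, but they agree once normalized.
const rawEqual = composed === decomposed;
const normalizedEqual = normalizeText(composed) === normalizeText(decomposed);
```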
Wherefore art thou Numeric?
Finally, though this shouldn't come up for our purposes:
- floating point numbers are disallowed,
- leading 0s in integers are removed,
- -0 becomes 0,
- all numbers are represented in base 10 format.
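These numeric rules can be sketched as a function over the textual form of a number. canonicalIntegerText is a hypothetical helper, and rejecting invalid input by throwing is an assumption about how "disallowed" would be enforced.

```typescript
// Rewrites a base-10 integer literal into canonical form:
// no leading zeros and no negative zero. Anything that is not a
// plain base-10 integer (floats, exponents, hex) is rejected.
function canonicalIntegerText(text: string): string {
  const match = /^(-?)0*([0-9]+)$/.exec(text);
  if (match === null) {
    throw new RangeError("not a base-10 integer: " + text);
  }
  const sign = match[1];
  const digits = match[2];
  // Zero never carries a sign, so "-0" becomes "0".
  return digits === "0" ? "0" : sign + digits;
}
```

For example, canonicalIntegerText("007") yields "7" and canonicalIntegerText("-0") yields "0", while "1.5" is rejected.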
On the shoulders of giants
This standard is heavily influenced by this page.
Browser Extensions
The browser extensions interact with the Santa backend using the Paladin-SDK. The Santa lists are used for filtering bad or malicious websites when the user browses the internet. All Santa related code and API calls are here.
If the user has the filter_malicious or filter_content setting turned on, the extension will evaluate the website by calling santa.ts: checkURL with the website url. To protect the user's privacy, we anonymize the url by hashing its host and path components before requesting the hits from Santa.
The Santa backend will compare the url to the lists and return a list of hits (in case there are any) which the extension will iterate through.
If the url has multiple matches in the same category (e.g. for the domain and subdomain), we will return the most specific match.
To streamline the process, there is also an in-memory cache for the Santa API calls.
Users can also whitelist urls. This happens when the user enters a malicious web page but decides to continue to the page. This will whitelist the entire domain for the user.
API Calls and their signatures:
checkURL/checkURLFromCache
Takes the url of the website and returns a list of Santa hits. This is used in the client for filling the caches when entering the website for the first time.
checkURLFromCache has the same return signature, but it fetches the results from the cache and is therefore synchronous.
The response is in the following format:
const response = {
  additional: {},
  domain_hash: data.hlist[data.hlist.length-1],
  hits: {},
};
Example of a response for a website with NSFW content:
{
  "additional": {},
  "domain_hash": "4d236d9a.89e55d4f.62933a29",
  "hits": {
    "porn.com": {
      "class": {
        "content": {
          "porn": {
            "ein": 2563545
          }
        }
      }
    }
  }
}
Another example response, for a whitelisted website:
{
  "additional": {
    "clean": "4d236d9a.fdc17bb3.e1c03b9a"
  },
  "domain_hash": "4d236d9a.fdc17bb3.e1c03b9a",
  "hits": {}
}
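A consumer could walk such a response as sketched below. The types are inferred from the two example responses only, so they are assumptions rather than the SDK's real definitions; CheckUrlResponse, FlatHit, flattenHits, and isMarkedClean are hypothetical names, and the meaning of ein is not documented here, so it is left untouched.

```typescript
// Shape inferred from the example responses (an assumption).
interface CheckUrlResponse {
  additional: { [key: string]: string };
  domain_hash: string;
  hits: {
    [url: string]: {
      class: { [typeName: string]: { [subtype: string]: { ein: number } } };
    };
  };
}

interface FlatHit {
  url: string;
  typeName: string;
  subtype: string;
}

// Flattens the nested hits object into one record per (url, type, subtype).
function flattenHits(response: CheckUrlResponse): FlatHit[] {
  const flat: FlatHit[] = [];
  for (const url of Object.keys(response.hits)) {
    const classes = response.hits[url].class;
    for (const typeName of Object.keys(classes)) {
      for (const subtype of Object.keys(classes[typeName])) {
        flat.push({ url, typeName, subtype });
      }
    }
  }
  return flat;
}

// Assumption: a "clean" key under "additional" marks a whitelisted site.
function isMarkedClean(response: CheckUrlResponse): boolean {
  return typeof response.additional.clean === "string";
}
```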
evaluateURL/evaluateURLFromCache
evaluateURL/evaluateURLFromCache has the same functionality as checkURL/checkURLFromCache, with the difference that the results are "evaluated". This means that for each category the most specific match is returned (out of the potentially returned domain, subdomain, or path matches). This function is used for reading the results for the website.
The evaluated results are in the following format:
export interface ISantaResponse {
  clean?: IFilterMatch;
  malicious?: IFilterMatch;
  content?: IFilterMatch;
}
export interface IFilterMatch extends IFilterEntry {
  url: string;
  type: string;
  pathLength?: number;
}
The categories and their meanings: clean (whitelisted url), malicious (potential malware), content (NSFW content such as drugs, pornography, etc.).
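Picking the most specific match per category might look like the sketch below. The README does not define how specificity is measured, so the ordering used here (a larger pathLength wins, then more host labels) is an assumption, and Match, Evaluated, isMoreSpecific, and evaluateMatches are hypothetical names rather than the extension's actual code.

```typescript
type Category = "clean" | "malicious" | "content";

// Simplified stand-in for IFilterMatch.
interface Match {
  url: string;
  type: string;
  pathLength?: number;
}

type Evaluated = { clean?: Match; malicious?: Match; content?: Match };

// Assumed ordering: path matches beat host-only matches, and among
// equal path lengths the url with more dot-separated labels wins.
function isMoreSpecific(a: Match, b: Match): boolean {
  const aPath = a.pathLength === undefined ? 0 : a.pathLength;
  const bPath = b.pathLength === undefined ? 0 : b.pathLength;
  if (aPath !== bPath) {
    return aPath > bPath;
  }
  return a.url.split(".").length > b.url.split(".").length;
}

// Keeps only the most specific candidate for each category.
function evaluateMatches(
  candidates: Array<{ category: Category; match: Match }>
): Evaluated {
  const result: Evaluated = {};
  for (const candidate of candidates) {
    const current = result[candidate.category];
    if (current === undefined || isMoreSpecific(candidate.match, current)) {
      result[candidate.category] = candidate.match;
    }
  }
  return result;
}
```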
whiteListURL
This function is called when the user decides to continue to the website despite it being flagged as malicious or unsafe content. The URL is added to the in-memory cache for one hour.
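The one-hour in-memory whitelist can be pictured as a small cache with expiry times. This is a minimal sketch under stated assumptions: WhitelistCache and its methods are hypothetical names, the cache is keyed by whatever string the caller passes, and the current time is injectable so that the behaviour is easy to test.

```typescript
const ONE_HOUR_MS = 60 * 60 * 1000;

// In-memory whitelist whose entries expire after a fixed time to live.
class WhitelistCache {
  private expiryByKey: { [key: string]: number } = {};
  private ttlMs: number;

  constructor(ttlMs: number = ONE_HOUR_MS) {
    this.ttlMs = ttlMs;
  }

  // Whitelists a key starting at "now" (milliseconds since the epoch).
  add(key: string, now: number = Date.now()): void {
    this.expiryByKey[key] = now + this.ttlMs;
  }

  // Returns true while the entry exists and has not yet expired.
  has(key: string, now: number = Date.now()): boolean {
    if (!Object.prototype.hasOwnProperty.call(this.expiryByKey, key)) {
      return false;
    }
    if (now >= this.expiryByKey[key]) {
      // Drop expired entries lazily on lookup.
      delete this.expiryByKey[key];
      return false;
    }
    return true;
  }
}
```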