palantir
v4.0.1
Published
Active monitoring and alerting system using user-defined Node.js scripts.
Downloads
19
Readme
Palantir
Active monitoring and alerting system using user-defined Node.js scripts.
Features
- Programmatic test cases (write your own checks using Node.js). (🔥 ready)
- Programmatic troubleshooting (write your own troubleshooting queries for test cases). (🔥 ready)
- Programmatic notifications (write your own mechanism for sending notifications). (🔥 ready)
- Filter tests using MongoDB-like queries. (🔥 ready)
- Track historical health of individual tests. (🗺️ in roadmap)
- Create browser-session specific dashboards. (🗺️ in roadmap)
- Produce charts using troubleshooting output. (🗺️ in roadmap)
- Hosted Palantir instance with tests run using serverless infrastructure, persistent dashboards, integrated timeseries database and notifcations. (💵 commercial) (🗺️ in roadmap)
Contents
Motivation
Existing monitoring software primarily focuses on enabling visual inspection of service health metrics and relies on service maintainers to detect anomalies. This approach is time consuming and allows for human-error. Even when monitoring systems allow to define alerts based on pre-defined thresholds, a point-in-time metric is not sufficient to determine service-health. The only way to establish service-health is to write thorough integration tests (scripts) and automate their execution, just like we do in software-development.
Palantir continuously performs user-defined tests and only reports failing tests, i.e. if everything is working as expected, the system remains silent. This allows service developers/maintainers to focus on defining tests that warn about the errors that are about to occur and automate troubleshooting.
Palantir decouples monitoring, alerting and reporting mechanisms. This method allows distributed monitoring and role-based, tag-based alerting system architecture.
Further reading
Usage
monitor program
monitor
program continuously performs user-defined tests and exposes the current state via Palantir HTTP API.
$ palantir monitor --service-port 8080 --configuration ./monitor-configuration.js ./tests/**/*
Every test file must export a function that creates a TestSuiteType
(see Palantir test suite).
- Refer to Palantir test specification.
- Refer to Monitor configuration specification.
alert program
alert
program subscribes to Palantir HTTP API and alerts other systems using user-defined configuration.
$ palantir alert --configuration ./alert-configuration.js --api-url http://127.0.0.1:8080/
- Refer to Alert configuration specification.
report program
report
program creates a web UI for the Palantir HTTP API.
$ palantir report --service-port 8081 --api-url http://127.0.0.1:8080/
test program
test
program runs tests once.
$ palantir test --configuration ./monitor-configuration.js ./tests/**/*
test
program is used for test development. It allows to filter tests by description (--match-description
) and by the test tags (--match-tag
), e.g.
$ palantir test --match-description 'event count is greater' --configuration ./monitor-configuration.js ./tests/**/*
$ palantir test --match-tag 'database' --configuration ./monitor-configuration.js ./tests/**/*
Specification
Palantir test
Palantir test is an object with the following properties:
/**
* @property assert Evaluates user defined script. The result (boolean) indicates if test is passing.
* @property configuration User defined configuration accessible by the `beforeTest`.
* @property explain Provides debugging information about the test.
* @property interval A function that describes the time when the test needs to be re-run.
* @property labels Arbitrary key=value labels used to categorise the tests.
* @property name Unique name of the test. A combination of test + labels must be unique across all test suites.
* @property priority A numeric value (0-100) indicating the importance of the test. Low value indicates high priority.
*/
type TestType = {|
+assert: (context: TestContextType) => Promise<boolean>,
+configuration?: SerializableObjectType,
+explain?: (context: TestContextType) => Promise<$ReadOnlyArray<ExplanationType>>,
+interval: (consecutiveFailureCount: number) => number,
+labels: LabelsType,
+name: string,
+priority: number
|};
In practice, an example of a test used to check whether HTTP resource is available could look like this:
{
assert: async () => {
await request('https://applaudience.com/', {
timeout: interval('10 seconds')
});
},
interval: () => {
return interval('30 seconds');
},
labels: {
project: 'applaudience',
source: 'http',
type: 'liveness-check'
},
name: 'https://applaudience.com/ responds with 200'
}
Palantir test suite
monitor
program requires a list of file paths as an input. Every input file must export a function that creates a TestSuiteType
object:
type TestSuiteType = {|
+tests: $ReadOnlyArray<TestType>
|};
Example:
// @flow
import request from 'axios';
import interval from 'human-interval';
import type {
TestSuiteFactoryType
} from 'palantir';
const createIntervalCreator = (intervalTime) => {
return () => {
return intervalTime;
};
};
const createTestSuite: TestSuiteFactoryType = () => {
return {
tests: [
{
assert: async () => {
await request('https://applaudience.com/', {
timeout: interval('10 seconds')
});
},
interval: createIntervalCreator(interval('30 seconds')),
labels: {
project: 'applaudience',
scope: 'http'
},
name: 'https://applaudience.com/ responds with 200'
}
]
}
};
export default createTestSuite;
Note that the test suite factory may return a promise. Refer to asynchronously creating a test suite for a use case example.
Monitor configuration
Palantir monitor
program accepts configuration
configuration (a path to a script).
/**
* @property after Called when shutting down the monitor.
* @property afterTest Called after every test.
* @property before Called when starting the monitor.
* @property beforeTest Called before every test.
*/
type ConfigurationType = {|
+after: () => Promise<void>,
+afterTest?: (test: RegisteredTestType, context?: TestContextType) => Promise<void>,
+before: () => Promise<void>,
+beforeTest?: (test: RegisteredTestType) => Promise<TestContextType>
|};
The configuration script allows to setup hooks for different stages of the program execution.
In practice, this can be used to configure the database connection, e.g.
import {
createPool
} from 'slonik';
let pool;
export default {
afterTest: async (test, context) => {
await context.connection.release();
},
before: async () => {
pool = await createPool('postgres://');
},
beforeTest: async () => {
const connection = await pool.connect();
return {
connection
};
}
};
Note that in the above example, unless you are using database connection for all the tests, you do not want to allocate a connection for every test. You can restrict allocation of connection using test configuration, e.g.
Test that requires connection to the database:
{
assert: (context) => {
return context.connection.any('SELECT 1');
},
configuration: {
database: true
},
interval: () => {
return interval('30 seconds');
},
labels: {
scope: 'database'
},
name: 'connects to the database'
}
Monitor configuration that is aware of the configuration.database
configuration.
import {
createPool
} from 'slonik';
let pool;
export default {
afterTest: async (test, context) => {
if (!test.configuration.database) {
return;
}
await context.connection.release();
},
before: async () => {
pool = await createPool('postgres://');
},
beforeTest: async (test) => {
if (!test.configuration.database) {
return {};
}
const connection = await pool.connect();
return {
connection
};
}
};
Alert configuration
Palantir alert
program accepts configuration
configuration (a path to a script).
/**
* @property onNewFailingTest Called when a new test fails.
* @property onRecoveredTest Called when a previously failing test is no longer failing.
*/
type AlertConfigurationType = {|
+onNewFailingTest?: (registeredTest: RegisteredTestType) => void,
+onRecoveredTest?: (registeredTest: RegisteredTestType) => void
|};
The alert configuration script allows to setup event handlers used to observe when tests fail and recover.
In practice, this can be used to configure a system that notifies other systems about the failing tests, e.g.
/**
* @file Using https://www.twilio.com/ to send a text message when tests fail and when tests recover.
*/
import createTwilio from 'twilio';
const twilio = createTwilio('ACCOUNT SID', 'AUTH TOKEN');
const sendMessage = (message) => {
twilio.messages.create({
body: message,
to: '+12345678901',
from: '+12345678901'
});
};
export default {
onNewFailingTest: (test) => {
sendMessage('FAILURE ' + test.name + ' failed');
},
onRecoveredTest: (test) => {
sendMessage('RECOVERY ' + test.name + ' recovered');
}
};
The above example will send a message for every failure and recovery, every time failure/ recovery occurs. In practise, it is desired that the alerting system includes a mechanism to filter out temporarily failures. To address this requirement, Palantir implements an alert controller.
Alert controller
Palantir alert controller abstracts logic used to filter temporarily failures.
palantir
module exports a factory method createAlertController
used to create an Palantir alert controller.
/**
* @property delayFailure Returns test-specific number of milliseconds to wait before considering the test to be failing.
* @property delayRecovery Returns test-specific number of milliseconds to wait before considering the test to be recovered.
* @property onFailure Called when test is considered to be failing.
* @property onRecovery Called when test is considered to be recovered.
*/
type ConfigurationType = {|
+delayFailure: (test: RegisteredTestType) => number,
+delayRecovery: (test: RegisteredTestType) => number,
+onFailure: (test: RegisteredTestType) => void,
+onRecovery: (test: RegisteredTestType) => void
|};
type AlertControllerType = {|
+getDelayedFailingTests: () => $ReadOnlyArray<RegisteredTestType>,
+getDelayedRecoveringTests: () => $ReadOnlyArray<RegisteredTestType>,
+registerTestFailure: (test: RegisteredTestType) => void,
+registerTestRecovery: (test: RegisteredTestType) => void
|};
createAlertController(configuration: ConfigurationType) => AlertControllerType;
Use createAlertController
to implement alert throttling, e.g.
import interval from 'human-interval';
import createTwilio from 'twilio';
import {
createAlertController
} from 'palantir';
const twilio = createTwilio('ACCOUNT SID', 'AUTH TOKEN');
const sendMessage = (message) => {
twilio.messages.create({
body: message,
to: '+12345678901',
from: '+12345678901'
});
};
const controller = createAlertController({
delayFailure: (test) => {
if (test.labels.scope === 'database') {
return 0;
}
return interval('5 minute');
},
delayRecovery: () => {
return interval('1 minute');
},
onFailure: (test) => {
sendMessage('FAILURE ' + test.description + ' failed');
},
onRecovery: () => {
sendMessage('RECOVERY ' + test.description + ' recovered');
}
});
export default {
onNewFailingTest: (test) => {
controller.registerTestFailure(test);
},
onRecoveredTest: (test) => {
controller.registerTestRecovery(test);
}
};
Palantir HTTP API
Palantir monitor
program creates HTTP GraphQL API server. The API exposes information about the user-registered tests and the failing tests.
Refer to the schema.graphql or introspect the API to learn more.
Recipes
Asynchronously creating a test suite
Creating a test suite might require to query an asynchronous source, e.g. when information required to create a test suite is stored in a database. In this case, a test suite factory can return a promise that resolves with a test suite, e.g.
const createTestSuite: TestSuiteFactoryType = async () => {
const clients = await getClients(connection);
return clients.map((client) => {
return {
assert: async () => {
await request(client.url, {
timeout: interval('10 seconds')
});
},
interval: createIntervalCreator(interval('30 seconds')),
labels: {
'client.country': client.country,
'client.id': client.id,
source: 'http',
type: 'liveness-check'
},
name: client.url + ' responds with 200'
};
});
};
In the above example, getClients
is used to asynchronously retrieve information required to construct the test suite.
Refreshing a test suit
It might be desired that the test suite itself informs the monitor about new tests, e.g. the example in the asynchronously creating a test suite recipe retrieves information from an external datasource that may change over time. In this case, a test suite factory can inform the monitor
program that it should recreate the test suite, e.g.
const createTestSuite: TestSuiteFactoryType = async (refreshTestSuite) => {
const clients = await getClients(connection);
(async () => {
// Some logic used to determine when the `clients` data used
// to construct the original test suite becomes stale.
while (true) {
await delay(interval('10 seconds'));
if (JSON.stringify(clients) !== JSON.stringify(await getClients(connection))) {
// Calling `refreshTestSuite` will make Palantir monitor program
// recreate the test suite using `createTestSuite`.
refreshTestSuite();
break;
}
}
})();
return clients.map((client) => {
return {
assert: async () => {
await request(client.url, {
timeout: interval('10 seconds')
});
},
interval: createIntervalCreator(interval('30 seconds')),
labels: {
'client.country': client.country,
'client.id': client.id,
source: 'http',
type: 'liveness-check'
},
name: client.url + ' responds with 200'
};
});
};
Development
There are multiple components required to run the service.
Run npm run dev
to watch the project and re-build upon detecting a change.
In order to observe project changes and restart all the services use a program such as nodemon
, e.g.
$ NODE_ENV=development nodemon --watch dist --ext js,graphql dist/bin/index.js monitor ...
$ NODE_ENV=development nodemon --watch dist --ext js,graphql dist/bin/index.js alert ...
Use --watch
attribute multiple times to include Palantir project code and your configuration/ test scripts.
report
program run in NODE_ENV=development
use webpack-hot-middleware
to implement hot reloading.
$ NODE_ENV=development babel-node src/bin/index.js report --service-port 8081 --api-url http://127.0.0.1:8080/ | roarr pretty-print