Rehearsal
Prompt evaluation and regression testing
Modifying a prompt, even in small details, can have a big impact on the output. Rehearsal makes it easy to run various tests and evaluations against LLM output. Use cases for Rehearsal include:
- regression testing
- QA
- helping with prompt iteration
Installation
yarn add -D llm-rehearsal
Usage
One important aspect of Rehearsal is that it is completely agnostic of what is used to generate the text. Simply provide an async function that returns a `{ text: "llm response" }` object:
import { rehearsal, expectations } from 'llm-rehearsal';
const { includesString } = expectations;
// Provide an LLM function
const { testCase, run } = rehearsal(async (input: { country: string }) => {
// your custom code to call LLM here
const textResponse = await callLLM({
prompt: `What is the capital of ${input.country}?`,
});
return { text: textResponse }; // only requirement is to return llm response in `text` property
});
// Define test cases
testCase('France', {
input: { country: 'France' },
expect: [includesString('paris')],
});
testCase('Germany', {
input: { country: 'Germany' },
expect: [includesString('berlin')],
});
// Start test suite
run();
To run the tests, don't forget to call `run()` at the end and execute your file (with plain `node` for JS or `ts-node` for TS).
Expectations for all test cases
To run expectations on all test cases, use `expectForAll()`:
const { testCase, run, expectForAll } = rehearsal(
async (input: { country: string }) => {
// your custom code to call LLM here
const textResponse = await callLLM({
prompt: `What is the capital of ${input.country}?`,
});
return { text: textResponse }; // only requirement is to return llm response in `text` property
},
);
// This expectation will be run for all testCase
expectForAll([not(includesString('as a large language model'))]);
Mixing expectations
Expectations can be composed with boolean logic:
import { rehearsal, expectations } from 'llm-rehearsal';
const { includesString, not, and, or } = expectations;
const { testCase } = rehearsal(llmFunction);
testCase("don't say yellow", {
input: {
/* input variables */
},
expect: [not(includesString('yellow'))],
});
testCase('potato/tomato', {
input: {
/* input variables */
},
expect: [or(includesString('potato'), includesString('tomato'))],
});
testCase('the cake is a lie', {
input: {
/* input variables */
},
expect: [and(includesString('cake'), includesString('lie'))],
});
Built-in expectations
- `includesString` - checks if the LLM response contains a given string
- `matchesRegex` - checks if the LLM response matches a given regular expression
- `not` - negates an expectation
- `and` - composes multiple expectations with AND logic
- `or` - composes multiple expectations with OR logic
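To illustrate the semantics of these combinators, here is a minimal self-contained sketch in plain TypeScript. This is an illustration, not the library's actual source; in particular, the case-insensitive matching in `includesString` is an assumption based on the examples above (`includesString('paris')` matching "Paris").

```typescript
// Shapes mirroring this README's examples (assumed, not the library's source)
type Output = { text: string };
type ExpectationResult = { pass: boolean; message?: string };
type Expectation = (output: Output) => ExpectationResult;

// Case-insensitive substring check (case-insensitivity is an assumption)
const includesString =
  (needle: string): Expectation =>
  (output) =>
    output.text.toLowerCase().includes(needle.toLowerCase())
      ? { pass: true }
      : { pass: false, message: `Expected output to include "${needle}"` };

const matchesRegex =
  (re: RegExp): Expectation =>
  (output) =>
    re.test(output.text)
      ? { pass: true }
      : { pass: false, message: `Expected output to match ${re}` };

// Inverts the result of an expectation
const not =
  (e: Expectation): Expectation =>
  (output) =>
    e(output).pass
      ? { pass: false, message: 'Expected negated expectation to fail' }
      : { pass: true };

// Passes only if every expectation passes; reports the first failure
const and =
  (...es: Expectation[]): Expectation =>
  (output) => {
    const failed = es.map((e) => e(output)).find((r) => !r.pass);
    return failed ?? { pass: true };
  };

// Passes if at least one expectation passes
const or =
  (...es: Expectation[]): Expectation =>
  (output) =>
    es.some((e) => e(output).pass)
      ? { pass: true }
      : { pass: false, message: 'Expected at least one expectation to pass' };
```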
Coming soon:
- `includesWord` - check for separate words, not just substrings
- `askGPT` - perform evaluation through a GPT prompt
Custom expectations
Custom expectations are easy to create:
import { createExpectation } from 'llm-rehearsal';
const { isLongerThan } = createExpectation(
'isLongerThan',
(count: number) => (output) => {
return output.text.length > count
? { pass: true }
: {
pass: false,
message: `Expected output text to be > ${count} characters, but instead is ${output.text.length}`,
};
},
);
// use it like the built-in expectations
testCase('long output', {
input: {
/* input variables */
},
expect: [isLongerThan(9000)],
});
// custom expectations can also be composed with boolean logic:
testCase('long output with sandwich in it', {
input: {
/* input variables */
},
expect: [and(isLongerThan(9000), includesString('sandwich'))],
});
If your function returns more than just text (such as metadata or the results of intermediate steps), you can create type-safe expectations:
import { rehearsal, expectations } from 'llm-rehearsal';
// notice that `createExpectation` is returned by the rehearsal() function,
// and is typed according to the input/output of the LLM function
const { testCase, createExpectation } = rehearsal(
async (input: { country: string }) => {
// your custom code to call LLM here
const { textResponse, documents } = await callLLMChain({
prompt: `What is the capital of ${input.country}?`,
});
return { text: textResponse, documents }; // we return more than just `text`
},
);
const { usesDocuments } = createExpectation('usesDocuments', () => (output) => {
return output.documents.length > 0 // output is properly typed
? { pass: true }
: { pass: false, message: 'Expected documents to be returned, found none' };
});
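The factory pattern behind `createExpectation` can be sketched in plain TypeScript. This is a self-contained illustration of the assumed semantics (a named checker factory whose name can surface in test reports), not the library's implementation:

```typescript
type Result = { pass: boolean; message?: string };

// Hypothetical createExpectation-style factory (assumed semantics):
// wraps a parameterized checker and attaches the name for reporting.
function createExpectation<Out, Args extends unknown[]>(
  name: string,
  make: (...args: Args) => (output: Out) => Result,
) {
  const factory = (...args: Args) => {
    const check = make(...args);
    // attach the name so a runner could print it in results
    return Object.assign((output: Out) => check(output), { expectationName: name });
  };
  // returned under its name so it can be destructured, as in the README
  return { [name]: factory } as { [key: string]: typeof factory };
}

// usage mirroring the isLongerThan example above
const { isLongerThan } = createExpectation(
  'isLongerThan',
  (count: number) => (output: { text: string }) =>
    output.text.length > count
      ? { pass: true }
      : { pass: false, message: `Expected more than ${count} characters` },
);
```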
Labels for expectations
To make test results more readable, an expectation can be given a label:
testCase('my test case', {
input: {},
expect: [
[includesString('banana'), 'include banana'],
[matchesRegex(/^hello/), 'starts with "hello"'],
// also works with composed expectations:
[
not(
or(
includesString('hamburger'),
includesString('fries'),
includesString('hotdog'),
includesString('chicken nuggets'),
includesString('burritos'),
),
),
'no fastfood',
],
],
});
Describe
Just like in most testing libraries, you can group test cases using `describe`:
import { rehearsal, expectations, describe } from 'llm-rehearsal';
const { includesString } = expectations;
const { testCase, run } = rehearsal(async (input: { country: string }) => {
// your custom code to call LLM here
const textResponse = await callLLM({
prompt: `What is the capital of ${input.country}?`,
});
return { text: textResponse };
});
describe('Countries', () => {
testCase('France', {
input: { country: 'France' },
expect: [includesString('paris')],
});
testCase('Germany', {
input: { country: 'Germany' },
expect: [includesString('berlin')],
});
});
Note: `describe` does not support `only`. This should be supported in the future.
Only
To isolate a test case and run only this one (or only a few), use `testCase.only`:
testCase('France', {
input: { country: 'France' },
expect: [includesString('paris')],
});
testCase.only('Germany', {
input: { country: 'Germany' },
expect: [includesString('berlin')],
});
This will run only the Germany test case. Multiple test cases can be marked "only" to run a selected set.
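The filtering behaviour described above can be sketched as follows. This is an illustration of the assumed semantics, not the library's source: when at least one case is marked "only", the runner keeps just those cases.

```typescript
// Minimal sketch of the assumed "only" selection rule
type Case = { name: string; only?: boolean };

function selectCases(cases: Case[]): Case[] {
  const onlyCases = cases.filter((c) => c.only);
  // if any case is marked only, run just those; otherwise run everything
  return onlyCases.length > 0 ? onlyCases : cases;
}
```

With this rule, marking several cases "only" runs exactly that selected set.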
Local development
To install a local build of Rehearsal, the recommended method is to use Yalc. Make sure to install yalc globally.
- Build the library:
yarn build
- Publish to the yalc local store (does not leave your computer):
yarn publish-local
- On the consuming side (the NodeJS project where you want to install Rehearsal):
yalc install llm-rehearsal
Note
Keep in mind that Yalc copies the package to its store, and then copies it again when it is installed on the consuming side. After a new build, you'll need to run `yarn publish-local` in this repository and also `yalc update` on the consuming side.