fold-r-pp

v1.0.1

FOLD-R-PP

The implementation details of the FOLD-R++ algorithm and how to use it are described here. The goal of the FOLD-R++ algorithm is to learn an answer set program for a classification task. Answer set programs are logic programs that permit negation of predicates and follow the stable model semantics for interpretation. The rules generated are essentially default rules, and default rules (with exceptions) closely model human thinking.
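As a textbook illustration of what a default rule with an exception looks like (this example is not produced by FOLD-R++, but it has exactly the shape of the rules the algorithm learns): birds normally fly, and penguins are the abnormal birds that do not.

```
flies(X) :- bird(X), not ab(X).  % birds fly by default
ab(X) :- penguin(X).             % exception: penguins are abnormal birds
bird(X) :- penguin(X).           % every penguin is a bird
```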

Installation

Prerequisites

The FOLD-R++ algorithm is developed in pure Python 3, and NumPy is the only dependency:

python3 -m pip install numpy

Instruction

Data preparation

The FOLD-R++ algorithm takes tabular data as input; the first line of the table should contain the feature name of each column. FOLD-R++ does not require the data to be encoded before training: it can handle numerical, categorical, and even mixed-type features (a single column containing both categorical and numerical values) directly. However, the numerical features must be identified before loading the data; otherwise they are treated as categorical features (only literals with = and != would be generated).
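Since numeric columns must be listed by hand, a quick heuristic can help build a first draft of that list. The sketch below is not part of foldrpp; it simply flags columns whose every non-missing value parses as a number (a genuinely mixed-type column would be flagged as categorical and would still need to be listed manually):

```python
import csv
import io

def numeric_columns(rows, header):
    """Return names of columns whose every non-missing value parses as a number."""
    numeric = set(header)
    for row in rows:
        for name, value in zip(header, row):
            if value in ('', '?'):       # treat common missing markers as unknown
                continue
            try:
                float(value)
            except ValueError:
                numeric.discard(name)    # one non-numeric value => categorical
    return [name for name in header if name in numeric]

# Tiny made-up CSV in the kidney-dataset style ('?' marks a missing value).
raw = "age,htn\n48,yes\n?,no\n53,yes\n"
reader = csv.reader(io.StringIO(raw))
header, rows = next(reader), list(reader)
print(numeric_columns(rows, header))     # ['age']
```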

There are many UCI example datasets that have been used to pre-populate the data directory. Code for preparing these datasets has already been added to datasets.py.

For example, the UCI kidney dataset can be loaded with the following code:

attrs = ['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv',
         'wbcc', 'rbcc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']
nums = ['age', 'bp', 'sg', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc']
model = Classifier(attrs=attrs, numeric=nums, label='label', pos='ckd')

data = model.load_data('data/kidney/kidney.csv')
data_train, data_test = split_data(data, ratio=0.8, rand=True)

X_train, Y_train = split_xy(data_train)
X_test,  Y_test = split_xy(data_test)

attrs lists all the features needed, nums lists all the numerical features, label is the name of the output classification label, pos indicates the positive value of the label, and model is a classifier object initialized with the configuration of the kidney dataset. Note: for binary classification tasks, the label value with more examples should be selected as the label's positive value.
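The note about choosing the positive value can be checked with a quick count before configuring the classifier. This is a generic sketch, not part of foldrpp, and the labels list below is made-up data:

```python
from collections import Counter

# Hypothetical label column from a loaded dataset; 'ckd' is the majority
# value, so it is the recommended choice for the pos parameter.
labels = ['ckd', 'ckd', 'notckd', 'ckd', 'notckd']
pos_value, count = Counter(labels).most_common(1)[0]
print(pos_value)   # ckd
```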

Training

The FOLD-R++ algorithm generates an explainable model, represented by an answer set program, for classification tasks. Here's a training example for the kidney dataset:

model.fit(X_train, Y_train, ratio=0.5)

Note that the hyperparameter ratio of the fit function can be set by the user and ranges between 0 and 1; the default value is 0.5. It represents the ratio of training examples that are part of the exception to the examples implied by only the default conclusion part of the rule. We recommend experimenting with different values to produce the ruleset with the best F1 score; a range between 0.2 and 0.5 is recommended for experimentation.
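One way to run that experiment is a simple sweep over candidate ratios, scoring each resulting ruleset on held-out data. The F1 computation below is a generic sketch (not part of foldrpp), and the commented loop assumes the model/fit/predict API shown above:

```python
def f1_score(y_true, y_pred, pos='ckd'):
    """Generic F1 for a binary task; pos is the positive label."""
    tp = sum(t == pos and p == pos for t, p in zip(y_true, y_pred))
    fp = sum(t != pos and p == pos for t, p in zip(y_true, y_pred))
    fn = sum(t == pos and p != pos for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical sweep over the recommended range, assuming the API above:
# for ratio in (0.2, 0.3, 0.4, 0.5):
#     model.fit(X_train, Y_train, ratio=ratio)
#     print(ratio, f1_score(Y_test, model.predict(X_test)))

print(round(f1_score(['ckd', 'ckd', 'notckd'], ['ckd', 'notckd', 'notckd']), 4))  # 0.6667
```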

The rules generated by foldrpp are stored in the model object, organized in a nested intermediate representation. The nested rules are automatically flattened and decoded to conform to the syntax of answer set programs by calling the print_asp function:

model.print_asp()

An answer set program, compatible with the s(CASP) answer set programming system, is printed as shown below. The s(CASP) system directly executes predicate answer set programs in a query-driven manner.

% the answer set program generated by foldr++:
label(X,'ckd'):-hemo(X,N14),N14=<12.0.
label(X,'ckd'):-pcv(X,'?'),not pc(X,'normal').
label(X,'ckd'):-sc(X,N11),N11>1.2.
label(X,'ckd'):-sg(X,N2),N2=<1.015.
% acc 0.95 p 1.0 r 0.9149 f1 0.9556
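To see how such a ruleset reads procedurally, here is a hypothetical Python transcription of the four rules above. The sample is a dict keyed by feature name, and negation-as-failure is simplified to a plain inequality (under s(CASP) semantics, not pc(X,'normal') also succeeds when pc is unknown), so this is an illustration of the rules' logic, not foldrpp's evaluator:

```python
def is_ckd(s):
    """Hypothetical transcription of the printed kidney ruleset ('?' = missing)."""
    if s['hemo'] != '?' and float(s['hemo']) <= 12.0:   # rule 1: low hemoglobin
        return True
    if s['pcv'] == '?' and s['pc'] != 'normal':         # rule 2: pcv missing, pc abnormal
        return True
    if s['sc'] != '?' and float(s['sc']) > 1.2:         # rule 3: high serum creatinine
        return True
    if s['sg'] != '?' and float(s['sg']) <= 1.015:      # rule 4: low specific gravity
        return True
    return False

print(is_ckd({'hemo': 11.2, 'pcv': 41, 'pc': 'normal', 'sc': 1.0, 'sg': 1.02}))  # True
```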

Testing in Python

Given X_test, a list of test data samples, the Python predict function will predict the classification outcome for each of these data samples.

Y_test_hat = model.predict(X_test)

The classify function can also be used to classify a single data sample.

y_test_hat = model.classify(x_test)

The code for the above examples can be found in main.py. Further examples, with more datasets and more functions, can be found in example.py.

Save model and Load model

save_model_to_file(model, 'example.model')
model2 = load_model_from_file('example.model')
save_asp_to_file(model2, 'example.lp')

A trained model can be saved to a file with the save_model_to_file function, and the load_model_from_file function loads a model back from a file. The generated ASP program can be saved to a file with the save_asp_to_file function.
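foldrpp's actual serialization format is whatever the library implements; the sketch below only shows a plausible pickle-based shape for such helpers. The function names mirror the API above, but the bodies are assumptions:

```python
import os
import pickle
import tempfile

def save_model_to_file(model, path):
    """Serialize a model object to disk (assumed pickle-based; a sketch)."""
    with open(path, 'wb') as f:
        pickle.dump(model, f)

def load_model_from_file(path):
    """Load a previously saved model object back into memory."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round-trip demo with a plain dict standing in for a trained model object.
path = os.path.join(tempfile.gettempdir(), 'example.model')
save_model_to_file({'rules': ["label(X,'ckd'):-sc(X,N11),N11>1.2."]}, path)
print(load_model_from_file(path)['rules'][0])
```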

Justification and Rebuttal

FOLD-R++ provides justification and rebuttal for predictions, in a simple format, through the explain function; the parameter all_flag indicates whether to list all the answer sets.

model.explain(X_test[i], all_flag=True)

Here is an example for an instance from the cars dataset. The generated answer set program is:

% cars dataset (1728, 6)
ab2(X):-doors(X,'2'),persons(X,'more').
ab3(X):-buying(X,'low'),not maint(X,'vhigh'),not ab2(X).
ab4(X):-not persons(X,'more').
ab5(X):-doors(X,'2'),not ab4(X).
ab6(X):-buying(X,'med'),not maint(X,'vhigh'),not maint(X,'high'),not ab5(X).
label(X,'negative'):-buying(X,'vhigh'),maint(X,'vhigh').
label(X,'negative'):-lugboot(X,'small'),not safety(X,'high'),not ab3(X),not ab6(X).
label(X,'negative'):-persons(X,'2').
label(X,'negative'):-safety(X,'low').
% acc 0.9509 p 1.0 r 0.9267 f1 0.962
% foldr++ costs:  0:00:00.035228  

And the generated justification for an instance predicted as positive:

answer  1 :
[T]label(X,'negative'):-[T]safety(X,'low').
{'safety: low'} 

answer  2 :
[F]ab3(X):-[F]buying(X,'low'),not [F]maint(X,'vhigh'),not [U]ab2(X).
[F]ab6(X):-[F]buying(X,'med'),not [F]maint(X,'vhigh'),not [F]maint(X,'high'),not [U]ab5(X).
[T]label(X,'negative'):-[T]lugboot(X,'small'),not [F]safety(X,'high'),not [F]ab3(X),not [F]ab6(X).
{'safety: low', 'buying: vhigh', 'lugboot: small', 'maint: med'}  

Two answers have been generated for the current instance because all_flag was set to True when calling the explain function; only one answer is generated if all_flag is False. In the generated answers, each literal is tagged with a label: [T] means true, [F] means false, and [U] means unnecessary to evaluate. The smallest set of features of the instance is listed for each answer.
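If you want to post-process these explanations, the tags are easy to pull out mechanically. This is a small sketch, not part of foldrpp, that extracts (tag, predicate) pairs from one explain-style line:

```python
import re

def tagged_literals(line):
    """Extract (tag, predicate) pairs from an explain-style line; tags are T/F/U."""
    return re.findall(r"\[([TFU])\]([a-z][a-z_0-9]*)", line)

line = "[T]label(X,'negative'):-[T]lugboot(X,'small'),not [F]safety(X,'high'),not [F]ab3(X)."
print(tagged_literals(line))
```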

For an instance predicted as negative, there is no answer set. Instead, the explanation lists the rebuttals for all the possible rules, and the parameter all_flag is ignored:

rebuttal  1 :
[F]label(X,'negative'):-[F]persons(X,'2').
{'persons: 4'} 

rebuttal  2 :
[F]label(X,'negative'):-[F]safety(X,'low').
{'safety: high'} 

rebuttal  3 :
[F]label(X,'negative'):-[T]buying(X,'vhigh'),[F]maint(X,'vhigh').
{'buying: vhigh', 'maint: high'} 

rebuttal  4 :
[F]label(X,'negative'):-[F]lugboot(X,'small'),not [T]safety(X,'high'),not [U]ab3(X),not [U]ab6(X).
{'safety: high', 'lugboot: med'}  

Justification by using s(CASP)

The installation of the s(CASP) system is necessary for this part; the examples above do not need it.

Classification and its justification can be conducted with the s(CASP) system. However, each data sample needs to be converted into predicate format that the s(CASP) system expects. The load_data_pred function can be used for this conversion; it returns the data predicates string list. The parameter numerics lists all the numerical features.

nums = ['Age', 'Number_of_Siblings_Spouses', 'Number_Of_Parents_Children', 'Fare']
X_pred = load_data_pred('data/titanic/test.csv', numerics=nums)

Here is an example of the answer set program generated for the titanic dataset by FOLD-R++, along with a test data sample converted into the predicate format.

survived(X,'0'):-class(X,'3'),not sex(X,'male'),fare(X,N4),N4>23.25,not ab7(X),not ab8(X).
survived(X,'0'):-sex(X,'male'),not ab2(X),not ab4(X),not ab6(X).
... ...
ab7(X):-number_of_parents_children(X,N3),N3=<0.0.
ab8(X):-fare(X,N4),N4>31.275,fare(X,N4),N4=<31.387.
... ...

id(1).
sex(1,'male').
age(1,34.5).
number_of_siblings_spouses(1,0.0).
number_of_parents_children(1,0.0).
fare(1,7.8292).
class(1,'3').
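The conversion that load_data_pred performs can be approximated as follows. This is a sketch, not the library's code, under the assumption visible in the facts above: predicate names are lowercased, numeric features are printed unquoted, and categorical ones are quoted:

```python
def row_to_predicates(idx, row, numerics):
    """Render one data row as s(CASP)-style facts, mirroring the example above."""
    facts = [f"id({idx})."]
    for name, value in row.items():
        if name in numerics:
            facts.append(f"{name.lower()}({idx},{float(value)}).")   # unquoted number
        else:
            facts.append(f"{name.lower()}({idx},'{value}').")        # quoted constant
    return facts

row = {'Sex': 'male', 'Age': '34.5', 'Fare': '7.8292', 'Class': '3'}
for fact in row_to_predicates(1, row, numerics={'Age', 'Fare'}):
    print(fact)
```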

An easier way to get a justification from the s(CASP) system is to call the scasp_query function. It sends the generated ASP rules, the converted data, and a query to the s(CASP) system for justification. A previously specified natural-language translation template can make the justification easier to understand, but it is not necessary. The template indicates the English string corresponding to a given predicate that models a feature. Here is a (self-explanatory) example of a translation template:

#pred sex(X,Y) :: 'person @(X) is @(Y)'.
#pred age(X,Y) :: 'person @(X) is of age @(Y)'.
#pred number_of_sibling_spouses(X,Y) :: 'person @(X) had @(Y) siblings or spouses'.
... ...
#pred ab2(X) :: 'abnormal case 2 holds for @(X)'.
#pred ab3(X) :: 'abnormal case 3 holds for @(X)'.
... ...
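The substitution such a template performs is straightforward: the quoted string is taken as the output and each @(Var) placeholder is replaced by the corresponding argument of the fact. The sketch below is illustrative only (s(CASP)'s own template engine is more general):

```python
import re

def translate(template, args):
    """Fill a '#pred' translation template with concrete arguments (a sketch)."""
    text = re.search(r"::\s*'([^']*)'", template).group(1)  # grab the quoted string
    for var, value in args.items():
        text = text.replace(f"@({var})", str(value))        # substitute placeholders
    return text

tpl = "#pred age(X,Y) :: 'person @(X) is of age @(Y)'."
print(translate(tpl, {'X': 1, 'Y': 34.5}))   # person 1 is of age 34.5
```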

The template file can be loaded into the model object with the load_translation function. Then, the justification is generated by calling scasp_query. If the input data is in predicate format, the parameter pred needs to be set to True.

load_translation(model, 'data/titanic/template.txt')
print(scasp_query(model, x, pred=False))

Here is the justification for a passenger in the Titanic example above (note that survived(1,0) means that the passenger with id 1 perished, denoted by 0):

% QUERY:I would like to know if
'goal' holds (for 0).

ANSWER:	1 (in 2.049 ms)

JUSTIFICATION_TREE:
'goal' holds (for 0), because
    'survived' holds (for 0, and 0), because
	person 0 paid 7.8292 for the ticket, and
	person 0 is of age 34.5.
The global constraints hold.

MODEL:
{ goal(0),  survived(0,0),  not sex(0,female),  not ab2(0),  not fare(0,Var0 | {Var0 \= 7.8292}),  fare(0,7.8292),  not ab4(0),  not class(0,1),  not ab6(0),  not age(0,Var1 | {Var1 \= 34.5}),  age(0,34.5) }

s(CASP)

All the resources of s(CASP) can be found at https://gitlab.software.imdea.org/ciao-lang/sCASP.