[PYTHON] Skope-rules, an algorithm for learning interpretable rules -- can also be used from Microsoft Research's InterpretML

I found a rule-learning library called skope-rules, a scikit-learn-contrib project. As a test, running its sample code on everyone's favorite Iris dataset produces rules like the following.

Rules for iris virginica
('petal_length > 4.75 and petal_width > 1.75', (0.9743589743589743, 0.9047619047619048, 1))
('petal_width > 1.6500000357627869', (0.9514316093263462, 0.9218081435472739, 3))
('petal_length > 5.049999952316284', (0.9696691176470589, 0.8007518796992481, 2))

There is one rule per line: for example, the first rule, petal_length > 4.75 and petal_width > 1.75, has a precision of about 0.97 and a recall of about 0.90, and the last number is how many times the rule appeared in the decision trees it was extracted from. Quite readable.
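For reference, a minimal version of that sample code would look roughly like the following sketch (the parameter values here are my own choices rather than anything prescribed by the README):

from sklearn.datasets import load_iris
from skrules import SkopeRules

iris = load_iris()
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Binary target: virginica vs. the rest
X, y = iris.data, iris.target == list(iris.target_names).index('virginica')

clf = SkopeRules(feature_names=feature_names,
                 n_estimators=30,          # number of bagged trees (assumed value)
                 precision_min=0.3,        # minimum precision for a rule to survive
                 recall_min=0.1,           # minimum recall for a rule to survive
                 max_depth_duplication=2)  # depth used when grouping similar rules
clf.fit(X, y)

print('Rules for iris virginica')
for rule in clf.rules_[:3]:
    print(rule)  # (rule string, (precision, recall, number of occurrences))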

Use of rule learning

Now, you might ask: shouldn't we just classify with those original decision trees directly? Indeed, if your environment can run inference with complicated tree ensembles, you should use XGBoost or LightGBM for accuracy. But there are sites where that is not an option:

- Complicated decision trees cannot be deployed on edge devices
- Behavior that occasionally causes unexpected mispredictions is not acceptable
- The same dataset and verification environment as at inference time cannot be prepared at training time

These are cases where simple, interpretable rules matter more than raw accuracy. And when there are many features and trial-and-error by human guesswork is impractical, I think this kind of rule learning is a good fit for mining.

skope-rules algorithm

For an overview of the algorithm, the diagram in the README explains it well, so I quote it here.

(Figure: algorithm overview, from skope-rules/schema.png)

In short, the flow is: train many decision trees, decompose them to obtain a large number of candidate rules, filter out inaccurate ones and near-duplicates, and output what remains. The purpose of this article is to read these details from the code. Fortunately, the main body is a single file of less than 700 lines, so the flow is easy to follow.

Bagging estimator

First, classification trees and regression trees are each trained with [bagging](http://ibisforest.org/index.php?%E3%83%90%E3%82%AE%E3%83%B3%E3%82%B0), and, while varying the depth setting, a large number of original decision trees are created. Skope-rules deals with binary classification, but when sample weights are passed it also treats the problem as a pseudo-regression; if no weights are passed, the regression trees are simply fit to the 0/1 labels. How much training regression trees in addition to classification trees contributes to the final rules is unclear, but since inaccurate rules get dropped later anyway, the impression is that it is simply trying to pile up as many rule candidates as possible.

In code:

for max_depth in self._max_depths:
    bagging_clf = BaggingClassifier(
        base_estimator=DecisionTreeClassifier(
            max_depth=max_depth,
...
    bagging_reg = BaggingRegressor(
        base_estimator=DecisionTreeRegressor(
            max_depth=max_depth,

Quoted from skope-rules/skrules/skope_rules.py (formatted for readability)

So it uses scikit-learn's decision tree classes directly. As a result, their drawbacks are inherited as-is: categorical variables and missing values cannot be handled. Support for categorical variables seems to be in progress in scikit-learn itself, so this may be resolved eventually. https://github.com/scikit-learn/scikit-learn/pull/12866

-> Set of logical rules

Each decision tree is traversed from the root to its leaves, and the branching conditions along each path are joined with AND to form a rule.

def recurse(node, base_name):
    if tree_.feature[node] != _tree.TREE_UNDEFINED:
        name = feature_name[node]
        symbol = '<='
        symbol2 = '>'
        threshold = tree_.threshold[node]
        text = base_name + ["{} {} {}".format(name, symbol, threshold)]
        recurse(tree_.children_left[node], text)
        text = base_name + ["{} {} {}".format(name, symbol2, threshold)]
        recurse(tree_.children_right[node], text)
    else:
        rule = str.join(' and ', base_name)
        rule = (rule if rule != '' else ' == '.join([feature_names[0]] * 2))
        rules.append(rule)

skope-rules/skrules/skope_rules.py

Since the traversal is done with a recursive function, this is a depth-first search.
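As a self-contained sketch of the same idea (a simplified rewrite, not the library's own code), extracting AND-joined rules from a fitted scikit-learn tree can be done like this:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, _tree

iris = load_iris()
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
tree_ = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target == 2).tree_

rules = []

def recurse(node, conditions):
    if tree_.feature[node] != _tree.TREE_UNDEFINED:
        # Internal node: branch on feature <= threshold (left) or > threshold (right)
        name = feature_names[tree_.feature[node]]
        threshold = tree_.threshold[node]
        recurse(tree_.children_left[node], conditions + ['{} <= {}'.format(name, threshold)])
        recurse(tree_.children_right[node], conditions + ['{} > {}'.format(name, threshold)])
    else:
        # Leaf node: the conditions accumulated along the path form one rule
        rules.append(' and '.join(conditions))

recurse(0, [])
print(rules)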

-> High performing rules

For each rule, precision and recall are calculated on the out-of-bag samples of the decision tree it came from, and only rules that clear the configured minimum values are kept.

What is interesting here is that, rather than using the original decision tree to decide which samples a rule matches,

def _eval_rule_perf(self, rule, X, y):
    detected_index = list(X.query(rule).index)

skope-rules/skrules/skope_rules.py

it simply feeds the rule string to pandas.DataFrame.query. Because skope-rules keeps each rule as a string such as petal_length > 4.75 and petal_width > 1.75, the rule can be used as a query as-is. On the other hand, the string has to be split apart again later when enumerating which features a rule uses, so the implementation feels a bit unrefined.
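To see how handy the string representation is, here is a rough sketch of this kind of evaluation (my own simplification, not the library's exact code):

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data,
                 columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
y = pd.Series(iris.target == 2, index=X.index)  # virginica vs. the rest

rule = 'petal_length > 4.75 and petal_width > 1.75'

# Samples matched by the rule, selected directly with DataFrame.query
detected_index = X.query(rule).index

precision = y[detected_index].mean()        # fraction of matched samples that are positive
recall = y[detected_index].sum() / y.sum()  # fraction of positives that the rule matches
print(precision, recall)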

Semantic Rule Deduplication

The rules that survive the previous step are grouped by similarity, and only the rule with the best F1 score in each group is kept. Without this step, many rules differing only in slightly shifted thresholds would be output.

Whether two rules are similar is judged as follows:

def split_with_best_feature(rules, depth, exceptions=[]):
...
    for rule in rules:
        if (most_represented_term + ' <=') in rule[0]:
            rules_splitted[0].append(rule)
        elif (most_represented_term + ' >') in rule[0]:
            rules_splitted[1].append(rule)
        else:
            rules_splitted[2].append(rule)
    new_exceptions = exceptions + [most_represented_term]
    return [split_with_best_feature(ruleset, depth-1, exceptions=new_exceptions)
            for ruleset in rules_splitted]

skope-rules/skrules/skope_rules.py

In other words, looking at the most frequently used feature and the direction of its inequality, rules that fall on the same side down to the configured depth are sorted into the same group.
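The final selection then amounts to keeping the best rule, by F1 score, from each group. A rough sketch of that step (simplified from the library code, with the grouping output flattened into a plain list of groups) would be:

def f1_score(rule):
    # Each rule is (rule_string, (precision, recall, n_occurrences))
    precision, recall = rule[1][0], rule[1][1]
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def keep_best_per_group(groups):
    # groups: list of lists of similar rules, as produced by the grouping above
    return [max(group, key=f1_score) for group in groups if group]

groups = [[('petal_width > 1.75', (0.97, 0.90, 1)),
           ('petal_width > 1.65', (0.95, 0.92, 3))],
          [('petal_length > 5.05', (0.97, 0.80, 2))]]
print(keep_best_per_group(groups))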

Try calling it from InterpretML

Now, skope-rules is bundled with InterpretML, a package developed by Microsoft Research, so let's call it from there as well. The class DecisionListClassifier is a wrapper around skope-rules.

from interpret.glassbox import DecisionListClassifier
from interpret import show
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
dlc = DecisionListClassifier()
dlc.fit(pd.DataFrame(iris.data, columns=iris.feature_names),
        iris.target == iris.target_names.tolist().index('virginica'))
show(dlc.explain_global())

(Figure: DecisionListClassifier.png)

As you would expect from a package dedicated to interpretability, the output looks nice.

Miscellaneous thoughts

Recently I have been interested in the interpretability of machine learning models. For the current state of the field, see "Explanation of the basis for judgment of machine learning models (Ver.2)". There are various reasons for wanting interpretability; the interpret/README of InterpretML lists, for example:

- Debugging: why did the model get this prediction wrong?
- Fairness: is the model discriminatory?
- Human-AI cooperation: how can people understand and trust the model?
- Compliance: does the model satisfy legal requirements?
- High-risk areas: healthcare, finance, justice, ...

Which of these matters most will vary with the job, but my personal motivation is that I want to absorb the knowledge the model has discovered, mix in my own thoughts and intuition, and then rearrange or repurpose it. For that purpose, I think it is better to learn a "readable model that needs no explanation" from the start rather than wrestling with a complicated model afterwards. How many methods are there for that? Interpretable Machine Learning, Chapter 4 "Interpretable Models", lists linear regression, logistic regression, generalized linear models, generalized additive models, decision trees, RuleFit, naive Bayes, and k-nearest neighbors. RuleFit, like skope-rules, extracts rules from decision trees, except that the rules are then used as features of a linear regression. Of course, there should be a trade-off between a model's accuracy and its interpretability; the figure from "Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI" explains this well, so I quote it.

(Figure: Trade-off between model interpretability and performance)

So there must be a barren region even if you seek perfect interpretability, and in domains such as images and natural language, where deep learning has succeeded, the opponent is human cognitive ability itself, so it is not surprising that models become complicated. Still, I have the wishful thought that a better-balanced method is lying dormant in physics- or engineering-flavored domains, which have traditionally captured phenomena with beautiful formulas. About ten years ago, "[Artificial intelligence that discovered the laws of physics on its own](https://wired.jp/2009/04/15/%e3%80%8c%e7%89%a9%e7%90%86%e6%b3%95%e5%89%87%e3%82%92%e8%87%aa%e5%8a%9b%e3%81%a7%e7%99%ba%e8%a6%8b%e3%80%8d%e3%81%97%e3%81%9f%e4%ba%ba%e5%b7%a5%e7%9f%a5%e8%83%bd/)" made genetic programming a hot topic; I wish I could learn small computation graphs like that. If it is used for feature generation, as in "Feature generation by genetic programming", it can be combined with various other methods, so I expect there is a promising path in that direction. That is all for today.
