Iris classification with scikit-learn

Here we use the well-known Iris species dataset to illustrate how SHAP can explain the output of many different model types, from k-nearest neighbors to neural networks. The dataset is tiny, with only 150 samples: 120 random samples are used for training and 30 for testing. Because this is a small dataset with only a few features, we use the entire training dataset as the background. On problems with more features we would instead pass only a summary of the training data, such as its median or a weighted k-medians summary. Although we have few samples, the prediction problem is fairly easy, and all of the methods achieve perfect or near-perfect accuracy. What is interesting is that the different methods sometimes rely on different sets of features for their predictions.

Load the data

[1]:
import time

import numpy as np
import sklearn
import sklearn.linear_model  # used by the logistic regression cell below
import sklearn.neighbors  # used by the k-nearest neighbors cell below
import sklearn.svm  # used by the SVM cells below
from sklearn.model_selection import train_test_split

import shap

X_train, X_test, Y_train, Y_test = train_test_split(*shap.datasets.iris(), test_size=0.2, random_state=0)

# rather than use the whole training set to estimate expected values, we could summarize with
# a set of weighted kmeans, each weighted by the number of points they represent. But this dataset
# is so small we don't worry about it
# X_train_summary = shap.kmeans(X_train, 50)


def print_accuracy(f):
    print(f"Accuracy = {100 * np.sum(f(X_test) == Y_test) / len(Y_test)}%")
    time.sleep(0.5)  # to let the print get out before any progress bars


shap.initjs()
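
As the commented-out line above suggests, on problems with more features we would summarize the background set rather than pass all of X_train. Below is a minimal sketch of the two helpers the library itself suggests in its warnings; K=10 is an arbitrary choice for illustration, and `some_model` is a placeholder for any fitted classifier:

[ ]:
# Optional on Iris: summarize the background before building a KernelExplainer.
X_train_summary = shap.kmeans(X_train, 10)  # 10 weighted k-means centroids
# X_train_summary = shap.sample(X_train, 10)  # or: a random subsample of 10 rows
# explainer = shap.KernelExplainer(some_model.predict_proba, X_train_summary)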

K-nearest neighbors

[2]:
knn = sklearn.neighbors.KNeighborsClassifier()
knn.fit(X_train, Y_train)

print_accuracy(knn.predict)
Accuracy = 96.66666666666667%

Explain a single prediction from the test set

[3]:
explainer = shap.KernelExplainer(knn.predict_proba, X_train)
shap_values = explainer.shap_values(X_test.iloc[0, :])
shap.force_plot(explainer.expected_value[0], shap_values[:, 0], X_test.iloc[0, :])
Using 120 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
[3]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
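
When the interactive plot cannot render, the raw numbers behind it are easy to inspect. A minimal sketch, assuming (as the indexing above does) that `shap_values` for a single sample is an array of shape (n_features, n_classes):

[ ]:
# Inspect the attributions behind the force plot for the first test sample.
print("shap_values shape:", shap_values.shape)
for name, value in zip(X_test.columns, shap_values[:, 0]):
    print(f"{name}: {value:+.4f}")  # contribution to the class-0 probability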

Explain all the predictions in the test set

[4]:
shap_values = explainer.shap_values(X_test)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)
[4]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
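
For a static overview of the same information, shap's summary plot ranks the features by their overall impact on the class-0 probability:

[ ]:
# A static alternative to the interactive force plot above.
shap.summary_plot(shap_values[..., 0], X_test)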

Support vector machine with a linear kernel

[5]:
svc_linear = sklearn.svm.SVC(kernel="linear", probability=True)
svc_linear.fit(X_train, Y_train)
print_accuracy(svc_linear.predict)

# explain all the predictions in the test set
explainer = shap.KernelExplainer(svc_linear.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
# this is multiclass so we only visualize the contributions to first class (hence index 0)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)
Accuracy = 100.0%
Using 120 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
[5]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Support vector machine with a radial basis function kernel

[6]:
svc_linear = sklearn.svm.SVC(kernel="rbf", probability=True)
svc_linear.fit(X_train, Y_train)
print_accuracy(svc_linear.predict)

# explain all the predictions in the test set
explainer = shap.KernelExplainer(svc_rbf.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
# this is multiclass so we only visualize the contributions to first class (hence index 0)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)
Accuracy = 100.0%
Using 120 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
[6]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Logistic regression

[7]:
linear_lr = sklearn.linear_model.LogisticRegression(solver="newton-cg")
linear_lr.fit(X_train, Y_train)
print_accuracy(linear_lr.predict)

# explain all the predictions in the test set
explainer = shap.KernelExplainer(linear_lr.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)
Accuracy = 100.0%
Using 120 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
[7]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
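
Because this model is linear, its coefficients offer a rough sanity check on the attributions: for a linear model explained on the margin scale with independent features, the SHAP value of feature i is coef_i * (x_i - E[x_i]). Here we explain probabilities through a softmax, so the match is only approximate, but features with large-magnitude coefficients should still dominate. A minimal sketch:

[ ]:
# Class-0 coefficients of the fitted logistic regression, for comparison
# with the SHAP attributions above (approximate: probabilities pass through
# a softmax, while the exact linear relationship holds on the margin scale).
for name, coef in zip(X_test.columns, linear_lr.coef_[0]):
    print(f"{name}: coef = {coef:+.3f}")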

Decision tree

[8]:
import sklearn.tree

dtree = sklearn.tree.DecisionTreeClassifier(min_samples_split=2)
dtree.fit(X_train, Y_train)
print_accuracy(dtree.predict)

# explain all the predictions in the test set
explainer = shap.KernelExplainer(dtree.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)
Accuracy = 100.0%
Using 120 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
[8]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Random forest

[9]:
from sklearn.ensemble import RandomForestClassifier

rforest = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
rforest.fit(X_train, Y_train)
print_accuracy(rforest.predict)

# explain all the predictions in the test set
explainer = shap.KernelExplainer(rforest.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)
Accuracy = 100.0%
Using 120 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
[9]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
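
KernelExplainer works for any model, but tree models like the decision tree above and this forest also admit shap's TreeExplainer, which computes exact SHAP values quickly and needs no background summarization. A minimal sketch; the layout of the result (a list of per-class arrays or a single 3-D array) depends on the shap version:

[ ]:
# Tree models support the exact, fast TreeExplainer as an alternative.
tree_explainer = shap.TreeExplainer(rforest)
tree_shap_values = tree_explainer.shap_values(X_test)
print(np.shape(tree_shap_values))  # samples x features x classes (or a per-class list)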

Neural network

[10]:
from sklearn.neural_network import MLPClassifier

nn = MLPClassifier(solver="lbfgs", alpha=1e-1, hidden_layer_sizes=(5, 2), random_state=0)
nn.fit(X_train, Y_train)
print_accuracy(nn.predict)

# explain all the predictions in the test set
explainer = shap.KernelExplainer(nn.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)
Accuracy = 96.66666666666667%
Using 120 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
[10]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
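
The introduction noted that the different methods sometimes rely on different sets of features. One way to see this directly is to compare each feature's mean absolute SHAP value across all of the models trained above. A minimal sketch; it re-runs KernelExplainer once per model with default settings, so expect it to take a little while:

[ ]:
# Which features does each model rely on? Mean |SHAP| per feature, for class 0.
models = {
    "knn": knn,
    "svc_linear": svc_linear,
    "svc_rbf": svc_rbf,
    "logistic regression": linear_lr,
    "decision tree": dtree,
    "random forest": rforest,
    "neural net": nn,
}
for label, model in models.items():
    e = shap.KernelExplainer(model.predict_proba, X_train)
    sv = e.shap_values(X_test, silent=True)  # silence the per-model progress bars
    print(label, np.round(np.abs(sv[..., 0]).mean(axis=0), 3))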