Benchmark XGBoost explanations

This notebook compares several different explanation methods applied to an XGBoost model. The methods are compared across many different evaluation metrics. Explanation error is the primary metric we sort by, but we also compare many other metrics, since no single metric fully captures the performance of an attribution explanation method.

For a more detailed explanation of each metric used here, see the docstrings of the various classes.
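
For instance, the docstrings can be read straight from a notebook with Python's built-in help (a minimal sketch using two of the benchmark classes that appear below):

help(shap.benchmark.ExplanationError)
help(shap.benchmark.SequentialMasker)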

Benchmarking explainers on a California housing gradient boosting regression model

Build the model and explanations

[1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import xgboost
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

import shap
import shap.benchmark

warnings.filterwarnings("ignore")
[2]:
model = GradientBoostingRegressor(subsample=0.3)
X, y = shap.datasets.california(n_points=1000)
X = X.values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the model (eval_set-based early stopping is an XGBoost fit option and
# does not apply to sklearn's GradientBoostingRegressor)
model.fit(X_train, y_train)

# define the benchmark evaluation sample set
X_eval = X_test[:]
y_eval = y_test[:]

# use an independent masker, plus a partition masker that respects a
# hierarchical clustering of the features
masker = shap.maskers.Independent(X_train)
pmasker = shap.maskers.Partition(X_train)

# build the explainers
explainers = [
    ("Permutation", shap.PermutationExplainer(model.predict, masker)),
    ("Permutation part.", shap.PermutationExplainer(model.predict, pmasker)),
    ("Partition", shap.PartitionExplainer(model.predict, pmasker)),
    ("Tree", shap.TreeExplainer(model)),
    ("Tree approx.", shap.TreeExplainer(model, approximate=True)),
    ("Exact", shap.ExactExplainer(model.predict, masker)),
    ("Random", shap.explainers.other.Random(model.predict, masker)),
]

# dry run to get all the code warmed up for valid runtime measurements
for name, exp in explainers:
    exp(X_eval[:1])

# explain with all the explainers
attributions = [(name, exp(X_eval)) for name, exp in explainers]
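
As a quick sanity check, each entry in attributions pairs a method name with a shap.Explanation object, whose .values array has one row per evaluation sample and one column per feature:

name, attr = attributions[0]
print(name, attr.values.shape)  # e.g. (250, 8) for this California housing split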

Run the benchmarks

[3]:
results = {}

# we run explanation error first as the primary metric
smasker = shap.benchmark.ExplanationError(masker, model.predict, X_eval)
results["explanation error"] = [smasker(v, name=n) for n, v in attributions]

# next compute time
ct = shap.benchmark.ComputeTime()
results["compute time"] = [ct(v, name=n) for n, v in attributions]

# then removal and addition of feature metrics based on model output
for mask_type, ordering in [
    ("keep", "positive"),
    ("remove", "positive"),
    ("keep", "negative"),
    ("remove", "negative"),
]:
    smasker = shap.benchmark.SequentialMasker(mask_type, ordering, masker, model.predict, X_eval)
    results[mask_type + " " + ordering] = [smasker(v, name=n) for n, v in attributions]

# then removal and addition of feature metrics based on model loss; the Fixed
# masker passes the label y through to the loss function unmasked
cmasker = shap.maskers.Composite(masker, shap.maskers.Fixed())
for mask_type, ordering in [("keep", "absolute"), ("remove", "absolute")]:
    smasker = shap.benchmark.SequentialMasker(
        mask_type,
        ordering,
        cmasker,
        lambda X, y: (y - model.predict(X)) ** 2,
        X_eval,
        y_eval,
    )
    results[mask_type + " " + ordering] = [smasker(v, name=n) for n, v in attributions]

Show the scores of all the metrics for all the explainers

This multi-metric benchmark plot sorts the methods by their score on the first metric, and rescales each metric's scores to be relative, so that the best score appears at the top and the worst at the bottom.

[4]:
shap.plots.benchmark(sum(results.values(), []))
[figure: benchmark scores of all explainers across all metrics]

Show overall performance again, excluding Random

Since the random scores are much worse than those of any reasonable explanation method, we plot the same figure again without the Random method, so that the smaller performance differences between the real methods become visible.
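
If you want to inspect the scores behind the plot directly, the raw results can be printed as well (a minimal sketch: .method is the attribute used in the filter below, and .value is assumed to hold a result's scalar score):

for res in results["explanation error"]:
    print(res.method, res.value)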

[5]:
shap.plots.benchmark(filter(lambda x: x.method != "Random", sum(results.values(), [])))
[figure: benchmark scores with the Random method excluded]

Show detailed plots for each metric type

If we plot the scores one metric at a time, we can see a more detailed comparison of the methods. Some metrics only have a score (explanation error and compute time), while others have a full performance curve, where the score is the area under (or over) that curve.
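
A single curve can also be drawn by hand (a minimal sketch: curve_x and curve_y are assumed attribute names for the performance curve of a sequential-masking result, mirroring what shap.plots.benchmark renders; the axis labels are illustrative):

res = results["keep positive"][0]
plt.plot(res.curve_x, res.curve_y)
plt.xlabel("fraction of features kept")
plt.ylabel("mean model output")
plt.title(res.method)
plt.show()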

[6]:
num_plot_rows = len(results) // 2 + len(results) % 2
fig, ax = plt.subplots(num_plot_rows, 2, figsize=(12, 5 * num_plot_rows))

for i, k in enumerate(results):
    plt.subplot(num_plot_rows, 2, i + 1)
    shap.plots.benchmark(results[k], show=False)
# if the number of metrics is odd, hide the unused final subplot
if i % 2 == 0:
    ax[-1, -1].axis("off")
plt.tight_layout()
plt.show()
[figure: one detailed benchmark plot per metric]

Benchmarking explainers on a census income XGBoost classification model

Build the model and explanations

[7]:
# build the model (in XGBoost >= 2.0, early_stopping_rounds is set on the
# estimator rather than passed to .fit())
model = xgboost.XGBClassifier(n_estimators=1000, subsample=0.3, early_stopping_rounds=10)
X, y = shap.datasets.adult(n_points=1000)
X = X.values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(
    X_train,
    y_train,
    eval_set=[(X_test, y_test)],
    verbose=False,
)


def logit_predict(X):
    # explain the model's raw log-odds (margin) output
    return model.predict(X, output_margin=True)


def loss_predict(X, y):
    # per-sample negative log-likelihood of the true label
    probs = model.predict_proba(X)
    return [-np.log(probs[i, int(y[i])]) for i in range(len(y))]


# define the benchmark evaluation sample set (limited to 1000 samples for the sake of time)
X_eval = X_test[:1000]
y_eval = y_test[:1000]

# use an independent masker, plus a partition masker that respects a
# hierarchical clustering of the features
masker = shap.maskers.Independent(X_train)
pmasker = shap.maskers.Partition(X_train)

# build the explainers
explainers = [
    ("Permutation", shap.PermutationExplainer(logit_predict, masker)),
    ("Permutation part.", shap.PermutationExplainer(logit_predict, pmasker)),
    ("Partition", shap.PartitionExplainer(logit_predict, pmasker)),
    ("Tree", shap.TreeExplainer(model)),
    ("Tree approx.", shap.TreeExplainer(model, approximate=True)),
    ("Random", shap.explainers.other.Random(logit_predict, masker)),
    ("Exact", shap.ExactExplainer(logit_predict, masker)),
]

# dry run to get all the code warmed up for valid runtime measurements
for name, exp in explainers:
    exp(X_eval[:1])

# explain with all the explainers
attributions = [(name, exp(X_eval)) for name, exp in explainers]
PartitionExplainer explainer: 251it [00:30,  5.34it/s]

Run the benchmarks

[8]:
results = {}

# we run explanation error first as the primary metric
smasker = shap.benchmark.ExplanationError(masker, logit_predict, X_eval)
results["explanation error"] = [smasker(v, name=n) for n, v in attributions]

# next compute time
ct = shap.benchmark.ComputeTime()
results["compute time"] = [ct(v, name=n) for n, v in attributions]

# then removal and addition of feature metrics based on model output
for mask_type, ordering in [
    ("keep", "positive"),
    ("remove", "positive"),
    ("keep", "negative"),
    ("remove", "negative"),
]:
    smasker = shap.benchmark.SequentialMasker(mask_type, ordering, masker, logit_predict, X_eval)
    results[mask_type + " " + ordering] = [smasker(v, name=n) for n, v in attributions]

# then removal and addition of feature metrics based on model loss
cmasker = shap.maskers.Composite(masker, shap.maskers.Fixed())
for mask_type, ordering in [("keep", "absolute"), ("remove", "absolute")]:
    smasker = shap.benchmark.SequentialMasker(mask_type, ordering, cmasker, loss_predict, X_eval, y_eval)
    results[mask_type + " " + ordering] = [smasker(v, name=n) for n, v in attributions]

Show the overall area-under-the-curve scores of all the metrics for all the explainers

[9]:
shap.plots.benchmark(sum(results.values(), []))
[figure: benchmark scores of all explainers across all metrics]

Show overall performance again, excluding Random

[10]:
shap.plots.benchmark(filter(lambda x: x.method != "Random", sum(results.values(), [])))
[figure: benchmark scores with the Random method excluded]

Show detailed plots for each metric type

[11]:
num_plot_rows = len(results) // 2 + len(results) % 2
fig, ax = plt.subplots(num_plot_rows, 2, figsize=(12, 5 * num_plot_rows))

for i, k in enumerate(results):
    plt.subplot(num_plot_rows, 2, i + 1)
    shap.plots.benchmark(results[k], show=False)
# if the number of metrics is odd, hide the unused final subplot
if i % 2 == 0:
    ax[-1, -1].axis("off")
plt.tight_layout()
plt.show()
[figure: one detailed benchmark plot per metric]

Have ideas for more helpful examples? Pull requests that add to this documentation notebook are encouraged!