Understanding Tree SHAP for Simple Models

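The SHAP value for a feature is the change in model output caused by conditioning on that feature when features are introduced one at a time, averaged over all possible feature orderings. This is easy to state but challenging to compute. This notebook therefore gives a few simple examples in which we can see how this plays out for very small trees. For arbitrarily large trees, guessing these values by looking at the tree is very hard.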
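To make the "average over all orderings" definition concrete, here is a minimal brute-force sketch. It is not how `shap.TreeExplainer` computes its values; the helper `brute_force_shap`, the stand-in function `single_split`, and the toy `background` rows are all hypothetical names introduced for illustration. It assumes the value of a set of known features is the mean model output over the background rows after overwriting those features with the values from `x`.

```python
from itertools import permutations


def brute_force_shap(f, x, background):
    """Average each feature's marginal contribution over all feature orderings.

    Assumption: the value of a coalition is the mean of f over the background
    rows after overwriting the coalition's features with the values from x.
    """
    n_features = len(x)

    def coalition_value(known):
        total = 0.0
        for row in background:
            mixed = [x[j] if j in known else row[j] for j in range(n_features)]
            total += f(mixed)
        return total / len(background)

    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        known = set()
        previous = coalition_value(known)
        for j in order:
            # Introduce feature j and record how much the expected output moves.
            known.add(j)
            current = coalition_value(known)
            phi[j] += current - previous
            previous = current
    return [total / len(orderings) for total in phi]


# Hypothetical stand-in for a one-split tree: the output depends on feature 0 only.
def single_split(row):
    return 1.0 if row[0] > 0.5 else 0.0


# Toy background: half of the rows have x0 = 1, the other features are always 0.
background = [[1, 0, 0, 0]] * 50 + [[0, 0, 0, 0]] * 50

print(brute_force_shap(single_split, [1, 1, 1, 1], background))
print(brute_force_shap(single_split, [0, 0, 0, 0], background))
```

Under these assumptions the first call gives 0.5 for feature 0 and 0 for the rest, and the second gives -0.5 for feature 0. The loop visits every ordering, so the cost grows factorially with the number of features, which is why this approach is only practical for tiny examples.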

[1]:

import graphviz
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_graphviz

import shap

Single split example

[2]:

# build data
N = 100
M = 4
X = np.zeros((N, M))
X.shape
y = np.zeros(N)
X[: N // 2, 0] = 1
y[: N // 2] = 1

# fit model
single_split_model = DecisionTreeRegressor(max_depth=1)
single_split_model.fit(X, y)

# draw model
dot_data = export_graphviz(
    single_split_model,
    out_file=None,
    filled=True,
    rounded=True,
    special_characters=True,
)
graph = graphviz.Source(dot_data)
graph

[2]:

../../../_images/example_notebooks_tabular_examples_tree_based_models_Understanding_Tree_SHAP_for_Simple_Models_3_0.svg

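Explain the model

Note that the bias term is the expected output of the model over the training dataset (0.5). The SHAP value for features not used in the model is always 0, while for \(x_0\) it is simply the difference between the model output and the expected value.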

[3]:

xs = [np.ones(M), np.zeros(M)]
df = pd.DataFrame()
for idx, x in enumerate(xs):
    index = pd.MultiIndex.from_product([[f"Example {idx}"], ["x", "shap_values"]])
    df = pd.concat(
        [
            df,
            pd.DataFrame(
                [x, shap.TreeExplainer(single_split_model).shap_values(x)],
                index=index,
                columns=["x0", "x1", "x2", "x3"],
            ),
        ]
    )
df

[3]:

		x0	x1	x2	x3
Example 0	x	1.0	1.0	1.0	1.0
Example 0	shap_values	0.5	0.0	0.0	0.0
Example 1	x	0.0	0.0	0.0	0.0
Example 1	shap_values	-0.5	0.0	0.0	0.0

Two feature AND example

In this example we use two features. The target is 1 if \(x_{0} = 1\) AND \(x_{1} = 1\), and 0 otherwise. Hence we call this the AND model.

[4]:

# build data
N = 100
M = 4
X = np.zeros((N, M))
X.shape
y = np.zeros(N)
X[: 1 * N // 4, 1] = 1
X[: N // 2, 0] = 1
X[N // 2 : 3 * N // 4, 1] = 1
y[: 1 * N // 4] = 1

# fit model
and_model = DecisionTreeRegressor(max_depth=2)
and_model.fit(X, y)

# draw model
dot_data = export_graphviz(and_model, out_file=None, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph

[4]:

../../../_images/example_notebooks_tabular_examples_tree_based_models_Understanding_Tree_SHAP_for_Simple_Models_8_0.svg

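Explain the model

Note that the bias term is the expected output of the model over the training dataset (0.25). The SHAP values for the unused features \(x_2\) and \(x_3\) are always 0. For \(x_0\) and \(x_1\), the SHAP value is the difference between the model output and the expected value (0.25), split equally between them (since they contribute equally to the AND function).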

[5]:

xs = np.array([np.ones(M), np.zeros(M)])
# np.array([np.ones(M), np.zeros(M), np.array([1, 0, 1, 0]), np.array([0, 1, 0, 0])])   # you can also check these examples
df = pd.DataFrame()
for idx, x in enumerate(xs):
    index = pd.MultiIndex.from_product([[f"Example {idx}"], ["x", "shap_values"]])
    df = pd.concat(
        [
            df,
            pd.DataFrame(
                [x, shap.TreeExplainer(and_model).shap_values(x)],
                index=index,
                columns=["x0", "x1", "x2", "x3"],
            ),
        ]
    )
df

[5]:

		x0	x1	x2	x3
Example 0	x	1.000	1.000	1.0	1.0
Example 0	shap_values	0.375	0.375	0.0	0.0
Example 1	x	0.000	0.000	0.0	0.0
Example 1	shap_values	-0.125	-0.125	0.0	0.0

[6]:

y.mean()

[6]:

0.25

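Here is how the SHAP values for Example 0 (the all-ones input) arise: the bias term (y.mean()) is 0.25 and the model output is 1. That leaves 1 - 0.25 = 0.75 to distribute among the relevant features. Since only \(x_0\) and \(x_1\) contribute to the output (and to the same extent), the 0.75 is split evenly between them, giving 0.375 each.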
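The same 0.375 can be derived by averaging over the two possible orderings of \(x_0\) and \(x_1\) for the all-ones input. As a sketch, let \(v(S)\) (notation introduced here for illustration) be the expected model output when only the features in \(S\) are known to equal 1. In the training data half of the rows with \(x_0 = 1\) also have \(x_1 = 1\), and vice versa, so \(v(\varnothing) = 0.25\), \(v(\{x_0\}) = v(\{x_1\}) = 0.5\) and \(v(\{x_0, x_1\}) = 1\).

\[
\phi_0 = \tfrac{1}{2}\Big[\big(v(\{x_0\}) - v(\varnothing)\big) + \big(v(\{x_0, x_1\}) - v(\{x_1\})\big)\Big]
       = \tfrac{1}{2}\big[(0.5 - 0.25) + (1 - 0.5)\big] = 0.375
\]

By symmetry \(\phi_1 = 0.375\), and \(\phi_0 + \phi_1 = 0.75 = 1 - 0.25\), the gap between the model output and the bias term.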

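Two feature OR example

We make a small variation on the example above. The target is 1 if \(x_{0} = 1\) OR \(x_{1} = 1\), and 0 otherwise. Can you guess the SHAP values without scrolling down?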

[7]:

# build data
N = 100
M = 4
X = np.zeros((N, M))
X.shape
y = np.zeros(N)
X[: N // 2, 0] = 1
X[: 1 * N // 4, 1] = 1
X[N // 2 : 3 * N // 4, 1] = 1
y[: N // 2] = 1
y[N // 2 : 3 * N // 4] = 1

# fit model
or_model = DecisionTreeRegressor(max_depth=2)
or_model.fit(X, y)

# draw model
dot_data = export_graphviz(or_model, out_file=None, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph

[7]:

../../../_images/example_notebooks_tabular_examples_tree_based_models_Understanding_Tree_SHAP_for_Simple_Models_15_0.svg

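Explain the model

Note that the bias term is the expected output of the model over the training dataset (0.75). The SHAP value for features not used in the model is always 0, while for \(x_0\) and \(x_1\) it is the difference between the model output and the expected value, split equally between them (since they contribute equally to the OR function).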

[8]:

xs = np.array([np.ones(M), np.zeros(M)])
# np.array([np.ones(M), np.zeros(M), np.array([1, 0, 1, 0]), np.array([0, 1, 0, 0])])   # you can also check these examples
df = pd.DataFrame()
for idx, x in enumerate(xs):
    index = pd.MultiIndex.from_product([[f"Example {idx}"], ["x", "shap_values"]])
    df = pd.concat(
        [
            df,
            pd.DataFrame(
                [x, shap.TreeExplainer(or_model).shap_values(x)],
                index=index,
                columns=["x0", "x1", "x2", "x3"],
            ),
        ]
    )
df

[8]:

		x0	x1	x2	x3
Example 0	x	1.000	1.000	1.0	1.0
Example 0	shap_values	0.125	0.125	0.0	0.0
Example 1	x	0.000	0.000	0.0	0.0
Example 1	shap_values	-0.375	-0.375	0.0	0.0

Two feature XOR example

[9]:

# build data
N = 100
M = 4
X = np.zeros((N, M))
X.shape
y = np.zeros(N)
X[: N // 2, 0] = 1
X[: 1 * N // 4, 1] = 1
X[N // 2 : 3 * N // 4, 1] = 1
y[1 * N // 4 : N // 2] = 1
y[N // 2 : 3 * N // 4] = 1

# fit model
xor_model = DecisionTreeRegressor(max_depth=2)
xor_model.fit(X, y)

# draw model
dot_data = export_graphviz(xor_model, out_file=None, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph

[9]:

../../../_images/example_notebooks_tabular_examples_tree_based_models_Understanding_Tree_SHAP_for_Simple_Models_19_0.svg

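Explain the model

Note that the bias term is the expected output of the model over the training dataset (0.5). The SHAP value for features not used in the model is always 0, while for \(x_0\) and \(x_1\) it is the difference between the model output and the expected value, split equally between them (since they contribute equally to the XOR function).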

[10]:

xs = np.array([np.ones(M), np.zeros(M)])
# np.array([np.ones(M), np.zeros(M), np.array([1, 0, 1, 0]), np.array([0, 1, 0, 0])])   # you can also check these examples
df = pd.DataFrame()
for idx, x in enumerate(xs):
    index = pd.MultiIndex.from_product([[f"Example {idx}"], ["x", "shap_values"]])
    df = pd.concat(
        [
            df,
            pd.DataFrame(
                [x, shap.TreeExplainer(xor_model).shap_values(x)],
                index=index,
                columns=["x0", "x1", "x2", "x3"],
            ),
        ]
    )
df

[10]:

		x0	x1	x2	x3
Example 0	x	1.00	1.00	1.0	1.0
Example 0	shap_values	-0.25	-0.25	0.0	0.0
Example 1	x	0.00	0.00	0.0	0.0
Example 1	shap_values	-0.25	-0.25	0.0	0.0
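The AND, OR and XOR tables can be checked by hand with the two-feature form of the ordering average. The sketch below is an illustration only: `two_feature_shap` and `gates` are hypothetical names, and it assumes \(x_0\) and \(x_1\) are independent and each equally likely to be 0 or 1, which matches the four equally sized groups in the training data built above. It does not call `shap`.

```python
def two_feature_shap(gate, x0, x1):
    """Shapley values for two binary features with a uniform, independent background."""
    values = (0, 1)
    # Expected output with nothing known, one feature known, and both known.
    v_none = sum(gate(a, b) for a in values for b in values) / 4
    v_0 = sum(gate(x0, b) for b in values) / 2
    v_1 = sum(gate(a, x1) for a in values) / 2
    v_both = gate(x0, x1)
    # Average the marginal contribution over the two orderings.
    phi_0 = 0.5 * ((v_0 - v_none) + (v_both - v_1))
    phi_1 = 0.5 * ((v_1 - v_none) + (v_both - v_0))
    return phi_0, phi_1


gates = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

for name, gate in gates.items():
    print(name, two_feature_shap(gate, 1, 1), two_feature_shap(gate, 0, 0))
```

For XOR, knowing a single feature leaves the expected output at 0.5, so the whole change of -0.5 appears only when the second feature is introduced. Averaging over the two orderings gives each feature -0.25.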

Two feature AND + feature boost example

[11]:

# build data
N = 100
M = 4
X = np.zeros((N, M))
X.shape
y = np.zeros(N)
X[: N // 2, 0] = 1
X[: 1 * N // 4, 1] = 1
X[N // 2 : 3 * N // 4, 1] = 1
y[: 1 * N // 4] = 1
y[: N // 2] += 1

# fit model
and_fb_model = DecisionTreeRegressor(max_depth=2)
and_fb_model.fit(X, y)

# draw model
dot_data = export_graphviz(and_fb_model, out_file=None, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph

[11]:

../../../_images/example_notebooks_tabular_examples_tree_based_models_Understanding_Tree_SHAP_for_Simple_Models_23_0.svg

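Explain the model

Note that the bias term is the expected output of the model over the training dataset (0.75). The SHAP value for features not used in the model is always 0, while for \(x_0\) and \(x_1\) it is the difference between the model output and the expected value, split equally between them (since they contribute equally to the AND function), plus an extra contribution for \(x_0\), because it has an effect of \(1.0\) all by itself (+0.5 if it is on and -0.5 if it is off).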

[12]:

xs = np.array([np.ones(M), np.zeros(M)])
# np.array([np.ones(M), np.zeros(M), np.array([1, 0, 1, 0]), np.array([0, 1, 0, 0])])   # you can also check these examples
df = pd.DataFrame()
for idx, x in enumerate(xs):
    index = pd.MultiIndex.from_product([[f"Example {idx}"], ["x", "shap_values"]])
    df = pd.concat(
        [
            df,
            pd.DataFrame(
                [x, shap.TreeExplainer(and_fb_model).shap_values(x)],
                index=index,
                columns=["x0", "x1", "x2", "x3"],
            ),
        ]
    )
df

[12]:

		x0	x1	x2	x3
Example 0	x	1.000	1.000	1.0	1.0
Example 0	shap_values	0.875	0.375	0.0	0.0
Example 1	x	0.000	0.000	0.0	0.0
Example 1	shap_values	-0.625	-0.125	0.0	0.0
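One way to read this table is through the additivity of Shapley values: if the fitted tree behaves like the AND model plus a separate "+1 when \(x_0 = 1\)" term, its SHAP values should be the sum of the SHAP values of the two parts. The sketch below checks that arithmetic using numbers copied from the tables in this notebook. The decomposition itself is an assumption about what the tree learned, and the variable names are hypothetical.

```python
# SHAP values copied from the AND table (all-ones and all-zeros inputs).
and_shap = {"ones": [0.375, 0.375, 0.0, 0.0], "zeros": [-0.125, -0.125, 0.0, 0.0]}

# Assumed contribution of the boost term alone: its mean over the data is 0.5,
# so x0 receives +0.5 when it is on and -0.5 when it is off.
boost_shap = {"ones": [0.5, 0.0, 0.0, 0.0], "zeros": [-0.5, 0.0, 0.0, 0.0]}

base_value = 0.25 + 0.5  # mean of the AND target plus mean of the boost term
model_output = {"ones": 2.0, "zeros": 0.0}

combined = {}
for key in ("ones", "zeros"):
    combined[key] = [a + b for a, b in zip(and_shap[key], boost_shap[key])]
    # Local accuracy: the base value plus the SHAP values recovers the model output.
    assert abs(base_value + sum(combined[key]) - model_output[key]) < 1e-12
    print(key, combined[key])
```

The combined values agree with the `shap_values` rows in the table above, and in both cases the base value of 0.75 plus the SHAP values adds up to the model output.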