text

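This notebook is designed to demonstrate (and so document) how to use the shap.plots.text function. It uses a distilled PyTorch BERT model from the transformers package to do sentiment analysis of IMDB movie reviews.

Note that the prediction function we define takes a list of strings and returns the logit value for the positive class.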

[9]:
import nlp
import numpy as np
import scipy as sp
import torch
import transformers

import shap

# load a BERT sentiment analysis model
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = transformers.DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).cuda()


# define a prediction function
def f(x):
    tv = torch.tensor([tokenizer.encode(v, padding="max_length", max_length=500, truncation=True) for v in x]).cuda()
    outputs = model(tv)[0].detach().cpu().numpy()
    scores = (np.exp(outputs).T / np.exp(outputs).sum(-1)).T
    val = sp.special.logit(scores[:, 1])  # use one vs rest logit units
    return val


# build an explainer using a token masker
explainer = shap.Explainer(f, tokenizer)

# explain the model's predictions on IMDB reviews
imdb_train = nlp.load_dataset("imdb")["train"]
shap_values = explainer(imdb_train[:10], fixed_context=1)
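
To make the output units concrete, here is a minimal standalone sketch of the conversion that `f` applies to the model's raw outputs: a softmax followed by the logit of the positive-class probability. The helper names and the example numbers are illustrative assumptions, not part of the notebook, and the sketch uses only the standard library so it runs without a GPU or a model download.

```python
import math


def softmax(raw_outputs):
    """Convert raw class scores into probabilities (numerically stable)."""
    shift = max(raw_outputs)
    exps = [math.exp(v - shift) for v in raw_outputs]
    total = sum(exps)
    return [e / total for e in exps]


def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))


# Hypothetical raw outputs for one review: [negative class, positive class]
raw = [-1.2, 2.3]
probs = softmax(raw)
positive_logit = logit(probs[1])

# With exactly two classes, the one-vs-rest logit of the positive class
# equals the difference between the two raw scores.
print(round(positive_logit, 6), round(raw[1] - raw[0], 6))
```

Explaining log-odds rather than probabilities keeps the explained quantity unbounded, which is presumably why the notebook's comment asks for "one vs rest logit units".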

Single instance text plot

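When we pass a single instance to the text plot, we get the importance of each token overlaid on the original text that corresponds to that token. Red regions correspond to parts of the text that increase the output of the model when they are included, while blue regions decrease the output of the model when they are included. In the context of the sentiment analysis model here, red corresponds to a more positive review and blue to a more negative review.

Note that importance values returned for text models are often hierarchical and follow the structure of the text. Nonlinear interactions between groups of tokens are often saved and can be used during the plotting process. If the Explanation object passed to the text plot has a .hierarchical_values attribute, then small groups of tokens with strong nonlinear effects among them will be automatically merged together to form coherent chunks. When the .hierarchical_values attribute is present, it also means that the explainer may not have completely enumerated all possible token perturbations and so has treated chunks of the text as essentially a single unit. This happens because we often want to explain a text model while evaluating it fewer times than the number of tokens in the document. Whenever a region of the input text is not split by the explainer, it is shown by the text plot as a single unit.

The force plot above the text is designed to provide an overview of how all the parts of the text combine to produce the model's output. See the force plot notebook for more details, but the general structure of the plot is that positive red features "push" the model output higher while negative blue features "push" the model output lower. The force plot provides more quantitative information than the text coloring. Hovering over a chunk of text will highlight the portion of the force plot that corresponds to that chunk, and hovering over a portion of the force plot will highlight the corresponding chunk of text.

Note that clicking on any chunk of text will show the sum of the SHAP values attributed to the tokens in that chunk (clicking again will hide the value).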
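
The arithmetic behind the force plot and the per-chunk sums can be sketched in a few lines. The chunks and numbers below are made up for illustration, and plain dictionaries stand in for a real Explanation object; the only property relied on is that SHAP values add up to the difference between the model output and the base value.

```python
# Hypothetical explanation of one short review (all values are made up).
base_value = -2.0  # starting point of the force plot
chunks = [
    {"tokens": ["This", "is"], "values": [-0.1, -0.07]},
    {"tokens": ["truly"], "values": [0.4]},
    {"tokens": ["impressive", "."], "values": [2.2, -0.03]},
]

# Clicking a chunk in the text plot shows the sum over its tokens.
chunk_totals = [sum(c["values"]) for c in chunks]

# The force plot stacks these contributions on top of the base value,
# so together they reach the model output f(x) for this input.
f_x = base_value + sum(chunk_totals)

for c, total in zip(chunks, chunk_totals):
    print(f"{total:+.3f} / {len(c['tokens'])}  {' '.join(c['tokens'])}")
print(f"f(x) = {f_x:.3f}")
```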

[10]:
# plot the explanation for a single review (index 3)
shap.plots.text(shap_values[3])
[Interactive text plot output: a force plot with base value -2.171297 and f(x) = 3.633372, shown above the review text, where each token chunk is colored by its SHAP value.]

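Multiple instance text plots

When we pass a multi-row explanation object to the text plot, we get the single instance plots for each input instance, scaled so that they have consistent, comparable x-axis and color ranges.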
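
One way to picture the shared color range is a single symmetric limit computed over every instance before any of them is drawn. The sketch below is an assumption about the idea only, not the plotting code that shap uses, and the values are hypothetical.

```python
# Hypothetical per-token SHAP values for three reviews of different lengths.
instances = [
    [0.5, -1.5, 0.2],
    [2.0, -0.4],
    [-0.1, 0.3, 0.1, -0.2],
]

# One symmetric color limit shared by every instance.
color_limit = max(abs(v) for values in instances for v in values)


def color_intensity(value, limit):
    """Map a SHAP value to [-1, 1]; negative means blue, positive means red."""
    return value / limit if limit > 0 else 0.0


scaled = [[color_intensity(v, color_limit) for v in values] for values in instances]
print(color_limit, scaled)
```

Because every instance is divided by the same limit, a pale token in one review and a pale token in another represent effects of similar size.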

[11]:
# plot the explanations for the first three reviews
shap.plots.text(shap_values[:3])
Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray

Instance 0
[Interactive text plot output: a force plot with base value -2.165315 and f(x) = -3.729354, shown above the review text, where each token chunk is colored by its SHAP value.]

Instance 1
[Interactive text plot output: a force plot with base value -0.722620 and f(x) = -4.128328, shown above the review text, where each token chunk is colored by its SHAP value.]

Instance 2
[Interactive text plot output: a force plot with base value -2.184386 and f(x) = 4.346902, shown above the review text, where each token chunk is colored by its SHAP value.]

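Summarizing text explanations

While plotting several instance-level explanations using the text plot can be very informative, sometimes you want a global summary of the impact of tokens over a large set of instances. See the Explanation object documentation for more details, but you can easily summarize the importance of tokens in a dataset by collapsing a multi-row explanation object over all its rows (in this case by summing). Doing this treats every text input token type as a feature, so the collapsed Explanation object will have as many columns as there were unique tokens in the original multi-row explanation object. If there are hierarchical values present in the Explanation object, then any large groups are divided up and each token in the group is given an equal share of the overall group importance value.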
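
The collapse described above can be sketched with plain Python data structures. The chunk lists, values, and the helper `split_chunk` are hypothetical stand-ins for what an Explanation object holds; in particular, applying the absolute value after the equal split is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical explanations: each instance is a list of (tokens, total_value) chunks.
explanations = [
    [(["a", "great", "film"], 0.9), (["but"], -0.4)],
    [(["but"], 0.6), (["a"], -0.1)],
]


def split_chunk(tokens, total_value):
    """Give every token in a merged chunk an equal share of the chunk's value."""
    share = total_value / len(tokens)
    return [(tok, share) for tok in tokens]


# Treat each unique token as one feature and sum absolute values over all rows.
summed_abs = defaultdict(float)
for instance in explanations:
    for tokens, total in instance:
        for tok, share in split_chunk(tokens, total):
            summed_abs[tok] += abs(share)

# Rank tokens by their summed importance, largest first.
ranking = sorted(summed_abs.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

The result has one entry per unique token, mirroring the "one column per unique token" shape of the collapsed Explanation object.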

[12]:
shap.plots.bar(shap_values.abs.sum(0))
../../../_images/example_notebooks_api_examples_plots_text_7_0.png

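Note that how you summarize the importance of features can make a big difference. In the plot above, the "a" token is very important both because it has an impact on the model and because it is very common. Below we instead summarize the instances using the max function to see the largest impact of a token in any instance.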
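
A toy comparison shows why the two summaries can rank tokens differently. The tokens and numbers are made up: a common token with a small effect wins under the sum, while a rare token with a large effect wins under the max.

```python
# Hypothetical absolute SHAP values, one entry per instance containing the token.
abs_values = {
    "a": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2],  # common, small effect each time
    "superb": [0.9],                      # rare, large effect
}

# Summing rewards frequency; taking the max ignores it.
by_sum = {tok: sum(vals) for tok, vals in abs_values.items()}
by_max = {tok: max(vals) for tok, vals in abs_values.items()}

top_by_sum = max(by_sum, key=by_sum.get)
top_by_max = max(by_max, key=by_max.get)
print(top_by_sum, top_by_max)
```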

[13]:
shap.plots.bar(shap_values.abs.max(0))
../../../_images/example_notebooks_api_examples_plots_text_9_0.png

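You can also slice out a single token from all the instances by using that token as an input name (note that the gray values to the left of the input names are the original text that the token was generated from).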

[14]:
shap.plots.bar(shap_values[:, "but"])
../../../_images/example_notebooks_api_examples_plots_text_11_0.png

Text-to-text visualization

[16]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import shap

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-es").cuda()

s = ["In this picture, there are four persons: my father, my mother, my brother and my sister."]

explainer = shap.Explainer(model, tokenizer)

shap_values = explainer(s)

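The text-to-text visualization contains the input text to the model on the left side and the output text on the right side (in the default layout). When you hover over a token on the right (output) side, the importance of each input token is overlaid on it and is shown by the background color of the token. Red regions correspond to parts of the text that increase the output of the model when they are included, while blue regions decrease the output of the model when they are included. The explanation for a particular output token can be anchored by clicking on that output token (it can be un-anchored by clicking again).

Note that, as in the single output plots described above, importance values returned for text models are often hierarchical and follow the structure of the text. Small groups of tokens with strong nonlinear effects among them will be automatically merged together to form coherent chunks. Similarly, the explainer may not have completely enumerated all possible token perturbations and so has treated chunks of the text as essentially a single unit. This preprocessing is done for each output token, and the merging behavior can differ between output tokens, since the interaction effects might be different for each one. Once an output token is anchored, the merged chunks can be viewed by hovering over the input text. All the tokens of a merged chunk are shown in bold.

Once the output text is anchored, the input tokens can be clicked on to view the exact SHAP value (hovering over an input token also brings up a tooltip with the value). Auto-merged tokens show the total value divided over the number of tokens in that chunk.

Hovering over the input text shows the SHAP value of that token for each output token. This is again shown by the background color of the output token. It can be anchored by clicking on the input token.

Note: The color scaling for all tokens (input and output) is consistent, and the brightest red is assigned to the maximum SHAP value of input tokens for any output token.

Note: The layout of the two pieces of text can be changed by using the "Layout" drop-down menu.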
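
The hover behavior described above amounts to reading a column or a row of an input-by-output matrix of SHAP values. The sketch below uses two token pairs from the example sentence with made-up values; the function names and the nested-list layout are assumptions for illustration, not shap's internal representation.

```python
input_tokens = ["my", "father"]
output_tokens = ["mi", "padre"]

# shap_matrix[i][j]: hypothetical effect of input token i on output token j
shap_matrix = [
    [1.8, 0.1],
    [0.2, 2.4],
]


def explain_output(j):
    """Hovering an output token: one value per input token (a column)."""
    return {tok: shap_matrix[i][j] for i, tok in enumerate(input_tokens)}


def explain_input(i):
    """Hovering an input token: one value per output token (a row)."""
    return {tok: shap_matrix[i][j] for j, tok in enumerate(output_tokens)}


# A single scale for all tokens: the largest value anywhere gets the brightest color.
global_max = max(abs(v) for row in shap_matrix for v in row)
intensity = [[v / global_max for v in row] for row in shap_matrix]

print(explain_output(1))
print(explain_input(0))
```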

[17]:
shap.plots.text(shap_values)

Instance 0
[Interactive text-to-text plot output. Visualization type: Input/Output - Heatmap.]
Input text: In this picture, there are four persons: my father, my mother, my brother and my sister.
Output text: En este cuadro, hay cuatro personas: mi padre, mi madre, mi hermano y mi hermana.

Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!