Explaining Image Captioning (Image to Text) using an Open Source Image Captioning Model and the Partition Explainer
This notebook demonstrates how to use SHAP to explain the output of an image captioning model, i.e. given an image, the model produces a caption for it.
Here we use a pre-trained open source model from https://github.com/ruotianluo/ImageCaptioning.pytorch to generate image captions. All pre-trained models are listed at https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/master/MODEL_ZOO.md. In particular, this notebook uses the model trained with ResNet101 features, linked under the "FC+new_self_critical" model and metrics at https://drive.google.com/open?id=1OsB_jLDorJnzKz6xsOfk1n493P3hwOP0.
Limitations
To explain image captions, we partition the image along its axes (i.e. into superpixels of halves, quarters, eighths, ...); an alternative approach / future improvement would be to segment the image semantically rather than using axis-aligned partitions, and to generate SHAP explanations from the segments instead of superpixels (a small sketch of this axis-aligned splitting follows this list). https://github.com/shap/shap/issues/1738
We use a transformer language model (e.g. distilbart) for alignment scoring between the caption of the original image and the captions of masked images, under the assumption that this external model is a good surrogate for the captioning model's own language head. Using the captioning model's own language head would remove this assumption and the extra dependency. (For an example, see the text2text notebooks.) For more details, see the "Load Language Model and Tokenizer" section below. https://github.com/shap/shap/issues/1739
The more evaluations used to generate the explanation, the longer SHAP takes to run. However, increasing the number of evaluations increases the granularity of the explanation (300-500 evaluations usually produce a detailed map, but fewer or more evaluations are often also reasonable). For more details, see the "Create an explainer object using wrapped model and image masker" section below.
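The sketch below is illustrative only (it is not SHAP's internal implementation): it shows what axis-aligned partitioning means, namely that the image is recursively split along its longer axis into halves, quarters, eighths, and so on, producing the rectangular superpixels that the Partition explainer masks in and out.
# Illustrative sketch only (not SHAP's internal code): recursive axis-aligned splitting
# of an image into the rectangular superpixels used by the Partition explainer.
def axis_aligned_partitions(height, width, depth, top=0, left=0):
    """Yield (top, left, height, width) rectangles from recursive axis-aligned splits."""
    if depth == 0:
        yield (top, left, height, width)
        return
    if height >= width:  # split the longer axis in half
        half = height // 2
        yield from axis_aligned_partitions(half, width, depth - 1, top, left)
        yield from axis_aligned_partitions(height - half, width, depth - 1, top + half, left)
    else:
        half = width // 2
        yield from axis_aligned_partitions(height, half, depth - 1, top, left)
        yield from axis_aligned_partitions(height, width - half, depth - 1, top, left + half)

# e.g. depth=3 splits a 224x224 image into 8 rectangular superpixels
print(list(axis_aligned_partitions(224, 224, 3)))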
Setting up the Open Source Model
Note: It is important to follow the setup instructions given below exactly to ensure the notebook runs.
Clone the https://github.com/ruotianluo/ImageCaptioning.pytorch repository. In a terminal, type: 'git clone https://github.com/ruotianluo/ImageCaptioning.pytorch'.
Change the PREFIX variable below so that it holds the absolute path of your ImageCaptioning.pytorch folder. This step is important to ensure all file paths are resolved correctly.
Download the following files and place them in the given folders:
"model-best.pth": download from https://drive.google.com/drive/folders/1OsB_jLDorJnzKz6xsOfk1n493P3hwOP0 and place it in the cloned directory.
"infos_fc_nsc-best.pkl": download from https://drive.google.com/drive/folders/1OsB_jLDorJnzKz6xsOfk1n493P3hwOP0 and place it in the cloned directory.
"resnet101": download from https://drive.google.com/drive/folders/0B7fNdx_jAqhtbVYzOURMdDNHSGM and place it in the 'data/imagenet_weights' folder under the cloned directory. If the 'imagenet_weights' folder does not exist under 'data', create it.
In a terminal, navigate to the cloned folder and type "python -m pip install -e ." (or type "!python -m pip install -e ." in a jupyter notebook cell) to install the module.
Restart the kernel and clear outputs before running this notebook.
Optional: after running the cell in the "Load Example Data" section below (which loads the example data and creates the './test_images/' folder), try the command "python tools/eval.py --model model-best.pth --infos_path infos_fc_nsc-best.pkl --image_folder test_images --num_images 10" in a terminal to verify that the installation succeeded. If it fails, install any missing packages. For example, if the 'lmdbdict' package is missing, try installing it with "pip install git+https://github.com/ruotianluo/lmdbdict.git". If captions appear in the terminal output, the installation was successful.
##### Note: If testing these commands in a jupyter notebook, add ! in front of the command. For example: "!python -m pip install -e ."
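For convenience, the clone and install steps above can also be run from notebook cells, as in the minimal sketch below (the pre-trained model files still have to be downloaded manually from the Google Drive links above).
# Consolidated setup commands for notebook cells (clone location assumed to be the current directory)
!git clone https://github.com/ruotianluo/ImageCaptioning.pytorch
%cd ImageCaptioning.pytorch
!python -m pip install -e .
# model-best.pth, infos_fc_nsc-best.pkl and the resnet101 weights must still be downloaded
# manually from the Google Drive links in the steps above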
Load Example Data
[1]:
import os
import shap
from shap.utils.image import (
add_sample_images,
is_empty,
load_image,
make_dir,
save_image,
)
[3]:
# change PREFIX to have absolute path of cloned directory of ImageCaptioning.pytorch
PREFIX = r"<place full path to the cloned directory of ImageCaptioning.pytorch>/ImageCaptioning.pytorch"
os.chdir(PREFIX)
# directory of images to be explained
DIR = "./test_images/"
# creates or empties directory if it already exists
make_dir(DIR)
add_sample_images(DIR)
# directory for saving masked images
DIR_MASKED = "./masked_images/"
[4]:
import gc
import sys
# to suppress verbose output from open source model
from contextlib import contextmanager
import captioning.models as models
import captioning.modules.losses as losses
import captioning.utils.eval_utils as eval_utils
import captioning.utils.misc as utils
import numpy as np
import torch
from captioning.data.dataloader import DataLoader
from captioning.data.dataloaderraw import DataLoaderRaw
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
@contextmanager
def suppress_stdout():
with open(os.devnull, "w") as devnull:
old_stdout = sys.stdout
sys.stdout = devnull
old_stderr = sys.stderr
sys.stderr = devnull
try:
yield
finally:
sys.stdout = old_stdout
sys.stderr = old_stderr
cider or coco-caption missing
Getting Captions Using the Open Source Model
[5]:
class ImageCaptioningPyTorchModel:
"""Wrapper class to get image captions using Resnet model from setup above.
Note: This class is being used instead of tools/eval.py to get predictions (captions).
To get more context for this class, please refer to tools/eval.py file.
"""
def __init__(self, model_path, infos_path, cnn_model="resnet101", device="cuda"):
"""Initializing the class by loading torch model and vocabulary at path given and using Resnet weights stored in data/imagenet_weights.
        This is done to speed up the process of getting image captions and to avoid loading the model every time captions are needed.
Parameters
----------
model_path : pre-trained model path
infos_path : pre-trained infos (vocab) path
cnn_model : resnet model weights to use; options: "resnet101" (default), "resnet152"
device : "cpu" or "cuda" (default)
"""
# load infos
with open(infos_path, "rb") as f:
infos = utils.pickle_load(f)
opt = infos["opt"]
# setup the model
opt.model = model_path
opt.cnn_model = cnn_model
opt.device = device
opt.vocab = infos["vocab"] # ix -> word mapping
model = models.setup(opt)
del infos
del opt.vocab
model.load_state_dict(torch.load(opt.model, map_location="cpu"))
model.to(opt.device)
model.eval()
crit = losses.LanguageModelCriterion()
# setup class variables for call function
self.opt = opt
self.model = model
self.crit = crit
self.infos_path = infos_path
# free memory
torch.cuda.empty_cache()
gc.collect()
def __call__(self, image_folder, batch_size):
"""Function to get captions for images placed in image_folder.
Parameters
----------
image_folder: folder of images for which captions are needed
batch_size : number of images to be evaluated at once
Output
-------
captions : list of captions for images in image_folder (will return a string if there is only one image in folder)
"""
# setting eval options
opt = self.opt
opt.batch_size = batch_size
opt.image_folder = image_folder
opt.coco_json = ""
opt.dataset = opt.input_json
opt.verbose_loss = 0
opt.verbose = False
opt.dump_path = 0
opt.dump_images = 0
opt.num_images = -1
opt.language_eval = 0
# loading vocab
with open(self.infos_path, "rb") as f:
infos = utils.pickle_load(f)
opt.vocab = infos["vocab"]
# creating Data Loader instance to load images
if len(opt.image_folder) == 0:
loader = DataLoader(opt)
else:
loader = DataLoaderRaw(
{
"folder_path": opt.image_folder,
"coco_json": opt.coco_json,
"batch_size": opt.batch_size,
"cnn_model": opt.cnn_model,
}
)
# when evaluating using provided pretrained model, vocab may be different from what is in cocotalk.json.
# hence, setting vocab from infos file.
loader.dataset.ix_to_word = opt.vocab
del infos
del opt.vocab
# getting caption predictions
_, split_predictions, _ = eval_utils.eval_split(self.model, self.crit, loader, vars(opt))
captions = []
for line in split_predictions:
captions.append(line["caption"])
# free memory
del loader
torch.cuda.empty_cache()
gc.collect()
return captions if len(captions) > 1 else captions[0]
[6]:
# create instance of ImageCaptioningPyTorchModel
osmodel = ImageCaptioningPyTorchModel(
model_path="model-best.pth",
infos_path="infos_fc_nsc-best.pkl",
cnn_model="resnet101",
device="cpu",
)
# create function to get caption using model created above
def get_caption(model, image_folder, batch_size):
return model(image_folder, batch_size)
Load Data
'./test_images/' is the folder of images to be explained. The './test_images/' directory has already been created for you, and the sample images needed to reproduce the examples shown in this notebook have been placed in it.
Note: Replace or add the images you want to explain (test) in the './test_images/' folder.
[7]:
# checks if test images folder exists and if it has any files
if not is_empty(DIR):
X = []
print("Loading data...")
files = [f for f in os.listdir(DIR) if os.path.isfile(os.path.join(DIR, f))]
for file in files:
path_to_image = os.path.join(DIR, file)
print("Loading image:", file)
X.append(load_image(path_to_image))
with suppress_stdout():
captions = get_caption(osmodel, "test_images", 5)
if len(X) > 1:
print("\nCaptions are...", *captions, sep="\n")
else:
print("\nCaption is...", captions)
print("\nNumber of images in test dataset:", len(X))
Loading data...
Loading image: 1.jpg
Loading image: 2.jpg
Loading image: 3.jpg
Loading image: 4.jpg
Captions are...
a woman sitting on a bench
a bird sitting on top of a tree branch
a group of horses standing next to a fence
a group of people playing with a soccer ball
Number of images in test dataset: 4
Load Language Model and Tokenizer
The transformer language model 'distilbart' and its tokenizer are used here to tokenize the image captions. This makes the image-to-text scenario similar to a multi-class problem. 'distilbart' is used for alignment scoring between the original image caption and the captions being generated for masked images, i.e. how does the probability of the original caption change when conditioned on the caption of the masked image? (In other words, we force 'distilbart' to always generate the original caption for the masked images and, as part of that process, record how the logits of each tokenized word in the caption change.)
Note: We use 'distilbart' here because during experimentation we found that it produced the most meaningful explanations for the images. We compared it against other language models such as 'openaigpt' and 'distilgpt2'. Feel free to explore other language models of your choice and compare the results.
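The sketch below illustrates the alignment-scoring idea in isolation. It is a simplified stand-in for shap.models.TeacherForcingLogits (the captions shown are hypothetical examples): distilbart is conditioned on the caption of a masked image, the original caption is teacher-forced as the target, and the resulting per-token log-probabilities are the scores SHAP tracks as different image regions are masked.
# Simplified sketch of alignment scoring (a stand-in for shap.models.TeacherForcingLogits)
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

sketch_tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-6")
sketch_model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-xsum-12-6")

original_caption = "a woman sitting on a bench"   # caption of the unmasked image
masked_caption = "a woman sitting on a chair"     # hypothetical caption of a masked image

context = sketch_tokenizer(masked_caption, return_tensors="pt")    # conditioning context
target = sketch_tokenizer(original_caption, return_tensors="pt")   # teacher-forced target
with torch.no_grad():
    out = sketch_model(input_ids=context.input_ids, labels=target.input_ids)

# log-probability of each token of the original caption given the masked-image caption;
# SHAP aggregates how these scores change as different image regions are masked out
logprobs = torch.log_softmax(out.logits, dim=-1)
token_scores = logprobs.gather(-1, target.input_ids.unsqueeze(-1)).squeeze(-1)
print(token_scores)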
[8]:
# load transformer language model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-xsum-12-6").cuda()
Create an explainer object using wrapped model and image masker
Various options for the explainer object to experiment with:
mask_value: The image masker uses an inpainting technique by default for masking (i.e. mask_value = "inpaint_ns"). There are other masking options available for blurring/inpainting, such as "inpaint_telea" and "blur(kernel_xsize, kernel_xsize)". Note: different masking options can produce different explanations.
max_evals: Number of evaluations of the underlying model used to obtain the SHAP values. The recommended number of evaluations is 300-500 to get explanations with meaningful superpixel granularity. More evaluations give finer granularity but also increase the runtime. The default is set to 300 evaluations.
batch_size: Number of masked images to be evaluated at once. The default size is set to 50.
fixed_context: Masking technique used to build the partition tree, with options of '0', '1' or 'None'. "fixed_context = None" is the best option for producing meaningful results, but it is relatively slower than fixed_context = 0 or 1 because it generates a full partition tree. The default option is set to 'None'. (A small sketch of how these options map onto the SHAP API follows this list.)
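As a rough sketch of how these options plug into the SHAP API (the run_masker helper defined in the next cell wires them together for every test image; the image shape used here is just an illustrative placeholder):
# Illustrative only: how the options above map onto the SHAP API
masker = shap.maskers.Image("inpaint_telea", (224, 224, 3))   # mask_value and image shape (placeholder shape)
# wrapped_model = shap.models.TeacherForcingLogits(f, similarity_model=model, similarity_tokenizer=tokenizer)
# explainer = shap.Explainer(wrapped_model, masker)
# shap_values = explainer(images, max_evals=500, batch_size=50, fixed_context=None)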
[9]:
# setting values for logging/tracking variables
make_dir(DIR_MASKED)
image_counter = 0
mask_counter = 0
# define function f which takes input (masked image) and returns caption for it
def f(x):
global mask_counter
# emptying masked images directory
make_dir(DIR_MASKED)
# saving masked array of RGB values as an image in masked_images directory
path_to_image = os.path.join(DIR_MASKED, f"{image_counter}_{mask_counter}.png")
save_image(x, path_to_image)
# getting caption of masked image
with suppress_stdout():
caption = get_caption(osmodel, "masked_images", 5)
mask_counter += 1
return caption
# function to take a list of images and parameters such as masking option, max evals etc. and return shap_values_objects
def run_masker(X, mask_value="inpaint_ns", max_evals=300, batch_size=50, fixed_context=None):
"""Function to take a list of images and parameters such max evals etc. and return shap explanations (shap_values) for test images(X).
Paramaters
----------
X : list of images which need to be explained
mask_value : various masking options for blurring/inpainting such as "inpaint_ns", "inpaint_telea" and "blur(pixel_size, pixel_size)"
max_evals : number of evaluations done of the underlying model to get SHAP values
batch_size : number of masked images to be evaluated at once
    fixed_context : masking technique used to build partition tree with options of '0', '1' or 'None'
Output
------
shap_values_list: list of shap_values objects generated for the images
"""
global image_counter
global mask_counter
shap_values_list = []
for index in range(len(X)):
# define a masker that is used to mask out partitions of the input image based on mask_value option
masker = shap.maskers.Image(mask_value, X[index].shape)
# wrap model with TeacherForcingLogits class
wrapped_model = shap.models.TeacherForcingLogits(f, similarity_model=model, similarity_tokenizer=tokenizer)
# build a partition explainer with wrapped_model and image masker
explainer = shap.Explainer(wrapped_model, masker)
# compute SHAP values - here we use max_evals no. of evaluations of the underlying model to estimate SHAP values
shap_values = explainer(
np.array(X[index : index + 1]),
max_evals=max_evals,
batch_size=batch_size,
fixed_context=fixed_context,
)
shap_values_list.append(shap_values)
# output plot
shap_values.output_names[0] = [word.replace("Ġ", "") for word in shap_values.output_names[0]]
shap.image_plot(shap_values)
# setting values for next iterations
mask_counter = 0
image_counter += 1
return shap_values_list
SHAP explanations for test images
[25]:
# SHAP explanation using masking option "blur(pixel_size, pixel_size)" for blurring
shap_values = run_masker(X, mask_value="blur(56,56)")
Partition explainer: 2it [36:16, 1088.29s/it]
Partition explainer: 2it [05:47, 173.60s/it]
Partition explainer: 2it [05:44, 172.26s/it]
Partition explainer: 2it [05:42, 171.46s/it]
[26]:
# SHAP explanation using masking option "inpaint_telea" for inpainting
shap_values = run_masker(X[3:4], mask_value="inpaint_telea")
Partition explainer: 2it [05:38, 169.49s/it]