Explaining Image Captioning (Image to Text) using an Open Source Image Captioning Model and the Partition Explainer

This notebook demonstrates how to use SHAP to explain the output of an image captioning model, i.e. given an image, the model produces a caption for that image.

Here we use a pre-trained open source model from https://github.com/ruotianluo/ImageCaptioning.pytorch to get image captions. All of the pre-trained models can be found at https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/master/MODEL_ZOO.md. In particular, this notebook uses the model trained with ResNet101 features, linked under the "FC+new_self_critical" model and metrics at https://drive.google.com/open?id=1OsB_jLDorJnzKz6xsOfk1n493P3hwOP0.

Limitations

  1. To explain image captions, we partition the image along its axes (i.e. super-pixels formed by halves, quarters, eighths of the image...); an alternative approach / future improvement could be to segment the image semantically rather than using axis-aligned partitions, and to generate the SHAP explanations from those segments instead of super-pixels. https://github.com/shap/shap/issues/1738

  2. We use a transformer language model (e.g. distilbart) to do alignment scoring between a given image and the caption of its masked version, assuming this external model is a good surrogate for the captioning model's own language head. This assumption, and the external dependency, can be removed by using the captioning model's own language head instead (see the text2text notebook examples, for instance). Refer to the "Load Language Model and Tokenizer" section below for more details. https://github.com/shap/shap/issues/1739

  3. The more evaluations used to generate the explanation, the longer SHAP takes to run. Increasing the number of evaluations does, however, increase the granularity of the explanation (300-500 evaluations usually produce detailed maps, though fewer or more evaluations are often still reasonable); see the sketch after this list. Refer to the "Create an Explainer Object using Wrapped Model and Image Masker" section below for more details.
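A minimal sketch (not part of the original notebook) of this evaluation/granularity trade-off, assuming an `explainer` object and image list `X` built exactly as in the sections below:

# hedged sketch: the same explainer call with two evaluation budgets;
# a larger max_evals yields a finer super-pixel map but runs proportionally longer
shap_values_coarse = explainer(np.array(X[0:1]), max_evals=100, batch_size=50, fixed_context=None)
shap_values_fine = explainer(np.array(X[0:1]), max_evals=500, batch_size=50, fixed_context=None)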

Setup for the Open Source Model

Note: It is important to follow the setup instructions below exactly to ensure the notebook runs.

  1. Clone the https://github.com/ruotianluo/ImageCaptioning.pytorch repository. In a terminal, type: 'git clone https://github.com/ruotianluo/ImageCaptioning.pytorch'.

  2. Change the PREFIX variable below to the absolute path of your ImageCaptioning.pytorch folder. This is an important step to ensure all file paths are resolved correctly.

  3. Download the following files and place them in the folders given:

    1. "model-best.pth": download from https://drive.google.com/drive/folders/1OsB_jLDorJnzKz6xsOfk1n493P3hwOP0 and place it in the cloned directory.

    2. "infos_fc_nsc-best.pkl": download from https://drive.google.com/drive/folders/1OsB_jLDorJnzKz6xsOfk1n493P3hwOP0 and place it in the cloned directory.

    3. "resnet101": download from https://drive.google.com/drive/folders/0B7fNdx_jAqhtbVYzOURMdDNHSGM and place it in the 'data/imagenet_weights' folder inside the cloned directory. If the 'imagenet_weights' folder does not exist under 'data', create it.

  4. In a terminal, navigate to the cloned folder and type "python -m pip install -e ." (or type "!python -m pip install -e ." in a jupyter notebook cell) to install the module.

  5. Restart and clear the kernel output before running this notebook.

  6. Optional: after running the cells in the "Load Example Data" section below (which load the example data and create the './test_images/' folder), verify the installation by running "python tools/eval.py --model model-best.pth --infos_path infos_fc_nsc-best.pkl --image_folder test_images --num_images 10" in a terminal. If it fails, install any missing packages. For example, if the 'lmdbdict' package is missing, try installing it with "pip install git+https://github.com/ruotianluo/lmdbdict.git". The installation is successful if captions are printed in the terminal output. (A small file-presence sanity check is also sketched after these setup steps.)

    ##### Note: If testing these commands in a jupyter notebook, prepend ! to the command, e.g. "!python -m pip install -e ."
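Before proceeding, a minimal sanity check (not part of the original instructions; the weight filename 'resnet101.pth' is an assumption) can confirm the downloaded files are where the notebook expects them:

import os

# hedged sketch: verify the downloaded files are in the expected locations
# (run from inside the cloned ImageCaptioning.pytorch directory)
expected = [
    "model-best.pth",
    "infos_fc_nsc-best.pkl",
    "data/imagenet_weights/resnet101.pth",  # assumed filename for the ResNet weights
]
for path in expected:
    print(path, "found" if os.path.exists(path) else "MISSING")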

Load Example Data

[1]:
import os

import shap
from shap.utils.image import (
    add_sample_images,
    is_empty,
    load_image,
    make_dir,
    save_image,
)
[3]:
# change PREFIX to have absolute path of cloned directory of ImageCaptioning.pytorch
PREFIX = r"<place full path to the cloned directory of ImageCaptioning.pytorch>/ImageCaptioning.pytorch"
os.chdir(PREFIX)

# directory of images to be explained
DIR = "./test_images/"
# creates or empties directory if it already exists
make_dir(DIR)
add_sample_images(DIR)

# directory for saving masked images
DIR_MASKED = "./masked_images/"
[4]:
import gc
import sys

# to suppress verbose output from open source model
from contextlib import contextmanager

import captioning.models as models
import captioning.modules.losses as losses
import captioning.utils.eval_utils as eval_utils
import captioning.utils.misc as utils
import numpy as np
import torch
from captioning.data.dataloader import DataLoader
from captioning.data.dataloaderraw import DataLoaderRaw
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


@contextmanager
def suppress_stdout():
    with open(os.devnull, "w") as devnull:
        old_stdout = sys.stdout
        sys.stdout = devnull
        old_stderr = sys.stderr
        sys.stderr = devnull
        try:
            yield
        finally:
            sys.stdout = old_stdout
            sys.stderr = old_stderr
cider or coco-caption missing

Get Captions Using the Open Source Model

[5]:
class ImageCaptioningPyTorchModel:
    """Wrapper class to get image captions using Resnet model from setup above.
    Note: This class is being used instead of tools/eval.py to get predictions (captions).
    To get more context for this class, please refer to tools/eval.py file.
    """

    def __init__(self, model_path, infos_path, cnn_model="resnet101", device="cuda"):
        """Initializing the class by loading torch model and vocabulary at path given and using Resnet weights stored in data/imagenet_weights.
        This is done to speeden the process of getting image captions and avoid loading the model every time captions are needed.

        Parameters
        ----------
        model_path  : pre-trained model path
        infos_path  : pre-trained infos (vocab) path
        cnn_model   : resnet model weights to use; options: "resnet101" (default), "resnet152"
        device      : "cpu" or "cuda" (default)

        """
        # load infos
        with open(infos_path, "rb") as f:
            infos = utils.pickle_load(f)
        opt = infos["opt"]

        # setup the model
        opt.model = model_path
        opt.cnn_model = cnn_model
        opt.device = device
        opt.vocab = infos["vocab"]  # ix -> word mapping
        model = models.setup(opt)
        del infos
        del opt.vocab
        model.load_state_dict(torch.load(opt.model, map_location="cpu"))
        model.to(opt.device)
        model.eval()
        crit = losses.LanguageModelCriterion()

        # setup class variables for call function
        self.opt = opt
        self.model = model
        self.crit = crit
        self.infos_path = infos_path

        # free memory
        torch.cuda.empty_cache()
        gc.collect()

    def __call__(self, image_folder, batch_size):
        """Function to get captions for images placed in image_folder.

        Parameters
        ----------
        image_folder: folder of images for which captions are needed
        batch_size  : number of images to be evaluated at once
        Output
        -------
        captions    : list of captions for images in image_folder (will return a string if there is only one image in folder)

        """
        # setting eval options
        opt = self.opt
        opt.batch_size = batch_size
        opt.image_folder = image_folder
        opt.coco_json = ""
        opt.dataset = opt.input_json
        opt.verbose_loss = 0
        opt.verbose = False
        opt.dump_path = 0
        opt.dump_images = 0
        opt.num_images = -1
        opt.language_eval = 0

        # loading vocab
        with open(self.infos_path, "rb") as f:
            infos = utils.pickle_load(f)
        opt.vocab = infos["vocab"]

        # creating Data Loader instance to load images
        if len(opt.image_folder) == 0:
            loader = DataLoader(opt)
        else:
            loader = DataLoaderRaw(
                {
                    "folder_path": opt.image_folder,
                    "coco_json": opt.coco_json,
                    "batch_size": opt.batch_size,
                    "cnn_model": opt.cnn_model,
                }
            )

        # when evaluating using provided pretrained model, vocab may be different from what is in cocotalk.json.
        # hence, setting vocab from infos file.
        loader.dataset.ix_to_word = opt.vocab
        del infos
        del opt.vocab

        # getting caption predictions
        _, split_predictions, _ = eval_utils.eval_split(self.model, self.crit, loader, vars(opt))
        captions = []
        for line in split_predictions:
            captions.append(line["caption"])

        # free memory
        del loader
        torch.cuda.empty_cache()
        gc.collect()

        return captions if len(captions) > 1 else captions[0]
[6]:
# create instance of ImageCaptioningPyTorchModel
osmodel = ImageCaptioningPyTorchModel(
    model_path="model-best.pth",
    infos_path="infos_fc_nsc-best.pkl",
    cnn_model="resnet101",
    device="cpu",
)


# create function to get caption using model created above
def get_caption(model, image_folder, batch_size):
    return model(image_folder, batch_size)

Load Data

'./test_images/' is the folder of images that will be explained. The './test_images/' directory has already been created for you, and the sample images needed to replicate the examples shown in this notebook have been placed in it.

Note: Replace or add the images you want to explain (test) in the './test_images/' folder.

[7]:
# checks if test images folder exists and if it has any files
if not is_empty(DIR):
    X = []
    print("Loading data...")
    files = [f for f in os.listdir(DIR) if os.path.isfile(os.path.join(DIR, f))]
    for file in files:
        path_to_image = os.path.join(DIR, file)
        print("Loading image:", file)
        X.append(load_image(path_to_image))
    with suppress_stdout():
        captions = get_caption(osmodel, "test_images", 5)
    if len(X) > 1:
        print("\nCaptions are...", *captions, sep="\n")
    else:
        print("\nCaption is...", captions)
    print("\nNumber of images in test dataset:", len(X))
Loading data...
Loading image: 1.jpg
Loading image: 2.jpg
Loading image: 3.jpg
Loading image: 4.jpg

Captions are...
a woman sitting on a bench
a bird sitting on top of a tree branch
a group of horses standing next to a fence
a group of people playing with a soccer ball

Number of images in test dataset: 4

Load Language Model and Tokenizer

A transformer language model ('distilbart') and its tokenizer are used here to tokenize the image captions. This turns the image-to-text scenario into something resembling a multi-class problem. 'distilbart' is used for alignment scoring between the original image caption and the captions generated for masked images, i.e. how does the probability of the original caption change when the caption of the masked image is given as context? (In other words, we force 'distilbart' to always generate the original caption for the masked images and, as part of that process, obtain the change in logits for each tokenized word of the caption.)

Note: We use 'distilbart' here because, during experimentation, we found it produced the most meaningful explanations for the images. We compared it against other language models such as 'openaigpt' and 'distilgpt2'. Feel free to explore other language models of your choice and compare the results.

[8]:
# load transformer language model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-xsum-12-6").cuda()
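If a GPU is not available, a minimal variation (an assumption; the original cell hard-codes `.cuda()`) is to pick the device at runtime:

# hedged sketch: fall back to CPU when CUDA is unavailable
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-xsum-12-6").to(device)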

Create an Explainer Object using Wrapped Model and Image Masker

Various options for the explainer object to experiment with:

  1. mask_value: The image masker uses an inpainting technique by default for masking (i.e. mask_value = "inpaint_ns"). There are other masking options available for blurring/inpainting, such as "inpaint_telea" and "blur(kernel_xsize, kernel_xsize)". Note: different masking options can produce different explanations.

  2. max_evals: Number of evaluations of the underlying model performed to obtain the SHAP values. The recommended number of evaluations is 300-500 to get explanations with meaningful super-pixel granularity. More evaluations mean higher granularity but also a longer run time. The default is 300 evaluations.

  3. batch_size: Number of masked images to evaluate at once. The default size is 50.

  4. fixed_context: Masking technique used to build the partition tree, with options '0', '1' or 'None'. "fixed_context = None" is the best option for generating meaningful results, but it is relatively slower than fixed_context = 0 or 1 because it generates a full partition tree. The default is 'None'. A short usage sketch follows this list.
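A usage sketch of these options (assuming the `run_masker` helper defined in the cell below; the specific values are illustrative, not prescriptive):

# hedged sketch: Telea inpainting with a larger evaluation budget and
# fixed_context=1, which skips the full partition tree and runs faster
shap_values_telea = run_masker(X[0:1], mask_value="inpaint_telea", max_evals=400, fixed_context=1)
# blur masking with the defaults (300 evals, batch size 50, full partition tree)
shap_values_blur = run_masker(X[0:1], mask_value="blur(56,56)")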

[9]:
# setting values for logging/tracking variables
make_dir(DIR_MASKED)
image_counter = 0
mask_counter = 0


# define function f which takes input (masked image) and returns caption for it
def f(x):
    global mask_counter

    # emptying masked images directory
    make_dir(DIR_MASKED)

    # saving masked array of RGB values as an image in masked_images directory
    path_to_image = os.path.join(DIR_MASKED, f"{image_counter}_{mask_counter}.png")
    save_image(x, path_to_image)

    # getting caption of masked image
    with suppress_stdout():
        caption = get_caption(osmodel, "masked_images", 5)
    mask_counter += 1

    return caption


# function to take a list of images and parameters such as masking option, max evals etc. and return shap_values_objects
def run_masker(X, mask_value="inpaint_ns", max_evals=300, batch_size=50, fixed_context=None):
    """Function to take a list of images and parameters such max evals etc. and return shap explanations (shap_values) for test images(X).
    Paramaters
    ----------
    X               : list of images which need to be explained
    mask_value      : various masking options for blurring/inpainting such as "inpaint_ns", "inpaint_telea" and "blur(pixel_size, pixel_size)"
    max_evals       : number of evaluations done of the underlying model to get SHAP values
    batch_size      : number of masked images to be evaluated at once
    fixed_context   : masking technique used to build the partition tree with options of '0', '1' or 'None'
    Output
    ------
    shap_values_list: list of shap_values objects generated for the images
    """
    global image_counter
    global mask_counter
    shap_values_list = []

    for index in range(len(X)):
        # define a masker that is used to mask out partitions of the input image based on mask_value option
        masker = shap.maskers.Image(mask_value, X[index].shape)

        # wrap model with TeacherForcingLogits class
        wrapped_model = shap.models.TeacherForcingLogits(f, similarity_model=model, similarity_tokenizer=tokenizer)

        # build a partition explainer with wrapped_model and image masker
        explainer = shap.Explainer(wrapped_model, masker)

        # compute SHAP values - here we use max_evals no. of evaluations of the underlying model to estimate SHAP values
        shap_values = explainer(
            np.array(X[index : index + 1]),
            max_evals=max_evals,
            batch_size=batch_size,
            fixed_context=fixed_context,
        )
        shap_values_list.append(shap_values)

        # output plot
        shap_values.output_names[0] = [word.replace("Ġ", "") for word in shap_values.output_names[0]]
        shap.image_plot(shap_values)

        # setting values for next iterations
        mask_counter = 0
        image_counter += 1

    return shap_values_list

SHAP Explanation for Test Images

[25]:
# SHAP explanation using masking option "blur(pixel_size, pixel_size)" for blurring
shap_values = run_masker(X, mask_value="blur(56,56)")
Partition explainer: 2it [36:16, 1088.29s/it]
[SHAP image plot for test image 1]
Partition explainer: 2it [05:47, 173.60s/it]
[SHAP image plot for test image 2]
Partition explainer: 2it [05:44, 172.26s/it]
[SHAP image plot for test image 3]
Partition explainer: 2it [05:42, 171.46s/it]
[SHAP image plot for test image 4]
[26]:
# SHAP explanation using masking option "inpaint_telea" for inpainting
shap_values = run_masker(X[3:4], mask_value="inpaint_telea")
Partition explainer: 2it [05:38, 169.49s/it]
[SHAP image plot for the fourth test image with "inpaint_telea" masking]