Explaining image captioning (image to text) using Azure Cognitive Services and Partition Explainer
This notebook demonstrates how to use SHAP to explain the output of an image captioning model, i.e., given an image, the model produces a caption for it.
Here, we use the Azure Cognitive Services Computer Vision (COGS CV) image understanding ("Analyze Image") feature https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/#features to get image captions.
Limitations
To explain an image caption, we partition the image along its axes (i.e., axis-aligned partitions of halves, quarters, eighths, ... as superpixels). An alternative approach/future improvement could be to segment the image semantically rather than with axis-aligned partitions, and to generate SHAP explanations using the segments instead of superpixels. https://github.com/shap/shap/issues/1738
We use a transformer language model (e.g. distilbart) to score the alignment between the caption of the given image and the captions of its masked versions, under the assumption that this external model is a good surrogate for the original caption model's language head. Using the caption model's own language head would eliminate this assumption and remove the dependency (e.g., see the text2text notebook example). Refer to the "Load language model and tokenizer" section below for more details. https://github.com/shap/shap/issues/1739
Azure Cognitive Services is used here to get image captions. To get explanations faster, use the paid service, since its API calls are not rate-limited. Pricing details can be found here: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/
Azure Cognitive Services places certain restrictions on image size and file format. API details can be found here: https://westcentralus.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-1-ga/operations/56f91f2e778daf14a499f21b. Refer to the "Load data" section for more details.
Large images slow down the generation of SHAP explanations, so any image with either dimension larger than 500 pixels is resized. Refer to the "Load data" section for more details.
The more evaluations used to generate an explanation, the longer SHAP takes to run; however, more evaluations also increase the granularity of the explanation (300-500 evaluations usually produce a detailed map, though fewer or more can also be reasonable). Refer to the "Create an explainer object using the wrapped model and image masker" section below for more details.
[9]:
import json
import os
from collections import defaultdict
import numpy as np
import requests
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import shap
from shap.utils.image import (
add_sample_images,
check_valid_image,
display_grid_plot,
is_empty,
load_image,
make_dir,
resize_image,
save_image,
)
API details
To use Azure COGS CV and run this notebook, obtain the API key and endpoint specific to your Azure COGS CV subscription and substitute them for the <> placeholders in the code below. The paid service is recommended over the free one so that API calls are not rate-limited and explanations are produced quickly.
[10]:
# place your Azure COGS CV subscription API Key and Endpoint below
API_KEY = "<your COGS API access key>"
ENDPOINT = "<endpoint specific to your subscription>"
ANALYZE_URL = ENDPOINT + "/vision/v3.1/analyze"
[11]:
def get_caption(path_to_image):
"""Function to get image caption when path to image file is given.
Note: API_KEY and ANALYZE_URL need to be defined before calling this function.
Parameters
----------
path_to_image : path of image file to be analyzed
    Returns
    -------
image caption
"""
headers = {
"Ocp-Apim-Subscription-Key": API_KEY,
"Content-Type": "application/octet-stream",
}
params = {
"visualFeatures": "Description",
"language": "en",
}
    with open(path_to_image, "rb") as image_file:
        payload = image_file.read()
# get image caption using requests
response = requests.post(ANALYZE_URL, headers=headers, params=params, data=payload)
results = json.loads(response.content)
# return the first caption's text in results description
caption = results["description"]["captions"][0]["text"]
return caption
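For example, once API_KEY and ENDPOINT are filled in above, a single call returns the caption string (a quick sanity check; the path below is a placeholder for any local image):
[ ]:
# sanity-check sketch: caption a single local image (placeholder path)
print(get_caption("./test_images/1.jpg"))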
Load data
"./test_images/" is the folder of images to be explained. The "./test_images/" directory has been created for you, and the sample images needed to reproduce the examples shown in this notebook have been placed in it.
Azure COGS CV image requirements:
To get explanations for image captions, place the images to be explained in a folder named "test_images" in the current notebook working directory.
Azure COGS CV accepts images in the following file formats: JPEG (JPG), PNG, GIF, BMP, JFIF.
Azure COGS CV has a size limit of < 4 MB per image and a minimum dimension of 50x50 pixels. Hence, large image files are resized in the code below, both to speed up SHAP explanations and to let Azure COGS CV produce captions. If either dimension (pixel_size, pixel_size) of an image is greater than 500, the image is resized so that its larger dimension becomes 500 pixels and the other dimension is scaled to preserve the original aspect ratio (sketched below).
Note: the caption of a resized image may differ from that of the original image. If explanations for the original images are needed, switch the "reshape" variable below to "False". Be aware, however, that this can significantly slow down the explanation process, or Azure COGS CV may fail to produce a caption at all (in which case SHAP cannot generate an explanation for that image).
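The resizing rule can be sketched as follows (an illustrative sketch using Pillow; it is not the actual shap.utils.image.resize_image implementation used below, and sketch_resize is a hypothetical helper name):
[ ]:
# illustrative sketch of the resizing rule described above
from PIL import Image


def sketch_resize(path, max_side=500):
    img = Image.open(path)
    w, h = img.size
    if max(w, h) <= max_side:
        return img  # already within the 500-pixel limit; leave unchanged
    scale = max_side / max(w, h)  # shrink so the longest side becomes 500 px
    # the other dimension is scaled by the same factor, preserving the aspect ratio
    return img.resize((int(w * scale), int(h * scale)))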
[12]:
# directory of images to be explained
DIR = "./test_images/"
# creates or empties directory if it already exists
make_dir(DIR)
add_sample_images(DIR)
# directory for saving resized images
DIR_RESHAPED = "./reshaped_images/"
make_dir(DIR_RESHAPED)
# directory for saving masked images
DIR_MASKED = "./masked_images/"
make_dir(DIR_MASKED)
Note: replace or add the images you would like to explain (test) in the "./test_images/" folder.
[13]:
# check if 'test_images' folder exists and if it has any files
if not is_empty(DIR):
X = []
reshape = True
files = [f for f in os.listdir(DIR) if os.path.isfile(os.path.join(DIR, f))]
for file in files:
path_to_image = os.path.join(DIR, file)
        # check if the file has one of the acceptable extensions: JPEG (JPG), PNG, GIF, BMP, JFIF
if check_valid_image(file):
print("\nLoading image:", file)
print("Image caption:", get_caption(path_to_image))
image = load_image(path_to_image)
print("Image size:", image.shape)
# reshaping large image files
if reshape:
image, reshaped_file = resize_image(path_to_image, DIR_RESHAPED)
if reshaped_file:
print("Reshaped image caption:", get_caption(reshaped_file))
X.append(image)
else:
print("\nSkipping image due to invalid file extension:", file)
print("\nNumber of images in test dataset:", len(X))
# delete DIR_RESHAPED if empty
if not os.listdir(DIR_RESHAPED):
os.rmdir(DIR_RESHAPED)
Loading image: 1.jpg
Image caption: a woman wearing glasses
Image size: (224, 224, 3)
Loading image: 2.jpg
Image caption: a bird on a branch
Image size: (224, 224, 3)
Loading image: 3.jpg
Image caption: a group of horses standing on grass
Image size: (224, 224, 3)
Loading image: 4.jpg
Image caption: a basketball player in a uniform
Image size: (224, 224, 3)
Number of images in test dataset: 4
Load language model and tokenizer
The transformer language model "distilbart" and its tokenizer are used here to tokenize the image caption. This makes the image-to-text scenario similar to a multi-class problem. "distilbart" is used to score the alignment between the original image caption and the captions generated for masked versions of the image, i.e., how does the probability of the original image caption change when conditioned on the context of a masked image's caption? (In other words, we teacher-force "distilbart" to always generate the original caption for the masked images, and in the process obtain the change in logits for each tokenized word of the caption.)
Note: we use "distilbart" here because, during experimentation, we found it provided the most meaningful explanations for images; we compared it with other language models such as "openaigpt" and "distilgpt2". Feel free to explore other language models of your choice and compare the results.
[14]:
# load transformer language model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-xsum-12-6").cuda()
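To make the teacher-forcing idea above concrete, here is a minimal sketch of how the per-token log-probabilities of an original caption can be scored against a masked image's caption with this model (illustrative only; shap.models.TeacherForcingLogits used below performs this internally, and the two caption strings are hypothetical):
[ ]:
# minimal sketch of teacher-forced alignment scoring (illustrative only;
# shap.models.TeacherForcingLogits below does this for every masked image)
import torch

original_caption = "a bird on a branch"  # hypothetical caption of the full image
masked_caption = "a bird"  # hypothetical caption of a masked image

enc = tokenizer(masked_caption, return_tensors="pt").to(model.device)
labels = tokenizer(original_caption, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    # teacher forcing: condition on the masked caption, force-decode the original one
    logits = model(input_ids=enc.input_ids, labels=labels).logits
log_probs = torch.log_softmax(logits, dim=-1)
# log-probability of each original-caption token given the masked caption's context
token_scores = log_probs[0, torch.arange(labels.shape[1]), labels[0]]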
Create an explainer object using the wrapped model and image masker
Various options to experiment with for the explainer object:
mask_value: the image masker uses an inpainting technique by default for masking (i.e., mask_value = "inpaint_ns"). There are other masking options available for blurring/inpainting, such as "inpaint_telea" and "blur(kernel_xsize, kernel_ysize)"; see the sketch after this list. Note: different masking options can produce different explanations.
max_evals: the number of evaluations of the underlying model performed to obtain SHAP values. The recommended number of evaluations is 300-500 for explanations with meaningful superpixel granularity. More evaluations give higher granularity but also a longer run time. The default is set to 300 evaluations.
batch_size: the number of masked images evaluated at once. The default size is set to 50.
fixed_context: the masking technique used to build the partition tree, with options of 0, 1, or None. fixed_context = None is the best option for producing meaningful results, but it is relatively slower than fixed_context = 0 or 1 because it generates a full partition tree. The default option is set to None.
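For example, each masking option can be constructed directly as an image masker (a sketch; run_masker below builds the masker for you, and the 56x56 blur kernel is an arbitrary illustrative choice):
[ ]:
# sketch of the available masking options for a 224x224 RGB image
shape = (224, 224, 3)
masker_default = shap.maskers.Image("inpaint_ns", shape)  # inpainting (default)
masker_telea = shap.maskers.Image("inpaint_telea", shape)  # alternate inpainting
masker_blur = shap.maskers.Image("blur(56,56)", shape)  # blur with a 56x56 kernel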
[15]:
# setting values for logging/tracking variables
make_dir(DIR_MASKED)
image_counter = 0
mask_counter = 0
masked_captions = defaultdict(list)
masked_files = defaultdict(list)
# define function f which takes input (masked image) and returns caption for it
def f(x):
""" "
Function to return caption for masked image(x).
"""
global mask_counter
# saving masked array of RGB values as an image in masked_images directory
path_to_image = os.path.join(DIR_MASKED, f"{image_counter}_{mask_counter}.png")
save_image(x, path_to_image)
# get caption for masked image
caption = get_caption(path_to_image)
masked_captions[image_counter].append(caption)
masked_files[image_counter].append(path_to_image)
mask_counter += 1
return caption
# function to take a list of images and parameters such as masking option, max evals etc. and return shap_values objects
def run_masker(
X,
mask_value="inpaint_ns",
max_evals=300,
batch_size=50,
fixed_context=None,
show_grid_plot=False,
limit_grid=20,
):
"""Function to take a list of images and parameters such max evals etc. and return shap explanations (shap_values) for test images(X).
Paramaters
----------
X : list of images which need to be explained
mask_value : various masking options for blurring/inpainting such as "inpaint_ns", "inpaint_telea" and "blur(pixel_size, pixel_size)"
max_evals : number of evaluations done of the underlying model to get SHAP values
batch_size : number of masked images to be evaluated at once
fixed_context : masking technqiue used to build partition tree with options of '0', '1' or 'None'
show_grid_plot : if set to True, shows grid plot of all masked images and their captions used to generate SHAP values (default: False)
limit_grid : limit number of masked images shown (default:20). Change to "all" to show all masked_images.
Output
------
shap_values_list: list of shap_values objects generated for the images
"""
global image_counter
global mask_counter
shap_values_list = []
for index in range(len(X)):
# define a masker that is used to mask out partitions of the input image based on mask_value option
masker = shap.maskers.Image(mask_value, X[index].shape)
# wrap model with TeacherForcingLogits class
wrapped_model = shap.models.TeacherForcingLogits(f, similarity_model=model, similarity_tokenizer=tokenizer)
# build a partition explainer with wrapped_model and image masker
explainer = shap.Explainer(wrapped_model, masker)
# compute SHAP values - here we use max_evals no. of evaluations of the underlying model to estimate SHAP values
shap_values = explainer(
np.array(X[index : index + 1]),
max_evals=max_evals,
batch_size=batch_size,
fixed_context=fixed_context,
)
shap_values_list.append(shap_values)
# output plot
shap_values.output_names[0] = [word.replace("Ġ", "") for word in shap_values.output_names[0]]
shap.image_plot(shap_values)
# show grid plot of masked images and their captions
if show_grid_plot:
if limit_grid == "all":
display_grid_plot(masked_captions[image_counter], masked_files[image_counter])
elif isinstance(limit_grid, int) and limit_grid < len(masked_captions[image_counter]):
display_grid_plot(
masked_captions[image_counter][0:limit_grid],
masked_files[image_counter][0:limit_grid],
)
else:
print("Enter a valid number for limit_grid parameter.")
# setting values for next iterations
mask_counter = 0
image_counter += 1
return shap_values_list
SHAP explanations for test images
[22]:
# run masker with test images dataset (X) and get SHAP explanations for their captions
shap_values = run_masker(X)
Partition explainer: 2it [03:40, 110.24s/it]
Partition explainer: 2it [02:56, 88.21s/it]
Partition explainer: 2it [03:22, 101.31s/it]
Partition explainer: 2it [03:18, 99.35s/it]
[28]:
# SHAP explanation using alternate masking option for inpainting "inpaint_telea"
# displays grid plot of masked images and their captions
# change limit_grid = "all" to show all masked images instead of limiting to 24 masked images
shap_values = run_masker(X[2:3], mask_value="inpaint_telea", show_grid_plot=True, limit_grid=24)
Partition explainer: 2it [03:51, 115.99s/it]
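run_masker returns a list of shap Explanation objects (one per image), which can be inspected directly (a sketch, assuming the cell above has run):
[ ]:
# sketch: inspect the explanation generated for the image above
explanation = shap_values[0]
print(explanation.output_names[0])  # tokenized words of the caption being explained
print(explanation.values.shape)  # SHAP values over pixels for each output token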