Using CLIP embeddings to improve multimodal RAG with GPT-4 Vision

Apr 10, 2024

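Multimodal RAG integrates additional modalities into traditional text-based RAG. The extra context grounds the textual data and improves an LLM's question answering.

Following the approach from the clothing matchmaker cookbook, we embed images directly for similarity search, bypassing the lossy step of captioning images as text, to improve retrieval accuracy.

Using CLIP-based embeddings also allows fine-tuning on specific data and updating the database with previously unseen images.

The technique is demonstrated by searching an enterprise knowledge base with user-provided images of technology in order to return relevant information.

Installation

First, let's install the relevant packages.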

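#installations
# Note: the CLIP model code is installed from the OpenAI GitHub repository below;
# the unrelated PyPI package named "clip" is not needed.
%pip install torch
%pip install pillow
%pip install faiss-cpu
%pip install numpy
%pip install git+https://github.com/openai/CLIP.git
%pip install openai
%pip install matplotlib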

Then, let's import all the packages we need.

# model imports
import faiss
import json
import torch
from openai import OpenAI
import torch.nn as nn
from torch.utils.data import DataLoader
import clip
client = OpenAI()

# helper imports
from tqdm import tqdm
import os
import numpy as np
import pickle
from typing import List, Union, Tuple

# visualisation imports
from PIL import Image
import matplotlib.pyplot as plt
import base64

Now, let's load the CLIP model.

# Load the model on a device. Inference runs on the CPU here; use a GPU instead if you have one.
device = "cpu"
model, preprocess = clip.load("ViT-B/32",device=device)

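We will now:

1. Create the image embedding database
2. Set up a query to the vision model
3. Perform the semantic search
4. Pass a user query to the image

Creating the image embedding database

Next, we will create our image embedding knowledge base from a directory of images. This is the knowledge base of technology that we search through to provide information about the image the user uploads.

We pass in the directory in which our images are stored (as JPEGs) and loop through each image to create its embedding.

We also have a description.json file. It has one entry for every image in the knowledge base, with two keys: 'image_path' and 'description'. It maps each image to a useful description of that image, which helps answer the user's question.

First, let's write a function to get all the image paths in a given directory. We then use it to get all the JPEG files from a directory called 'image_database'.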
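The loading code later in this notebook reads description.json one line at a time, which suggests a JSON Lines layout (one JSON object per line). The sketch below illustrates that layout under this assumption; the paths and descriptions are made-up placeholders, not entries from the real file.

```python
import json

# Hypothetical sample content: one JSON object per line (JSON Lines).
# The paths and descriptions are placeholders, not real entries.
sample_lines = [
    '{"image_path": "image_database/example_a.jpeg", "description": "Placeholder description A."}',
    '{"image_path": "image_database/example_b.jpeg", "description": "Placeholder description B."}',
]

# Parse each line independently, mirroring how the file is read later on.
entries = [json.loads(line) for line in sample_lines]

for entry in entries:
    print(entry["image_path"], "->", entry["description"])
```

Note that the 'image_path' values need to match the paths produced by the directory listing exactly (including the directory prefix) for the later lookup to succeed.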

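def get_image_paths(directory: str, number: int = None) -> List[str]:
    """Return the paths of all .jpeg files in `directory`, in sorted order.

    If `number` is given, return a one-element list holding only the path at
    that zero-based position instead.
    """
    image_paths = []
    count = 0
    # Sort the listing so positions are stable across calls: index ids from the
    # vector database are mapped back to files through this same ordering.
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.jpeg'):
            image_paths.append(os.path.join(directory, filename))
            if number is not None and count == number:
                return [image_paths[-1]]
            count += 1
    return image_paths
direc = 'image_database/'
image_paths = get_image_paths(direc)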

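Next, we will write a function that gets the image embeddings from the CLIP model for a list of paths.

We first preprocess each image using the preprocess function we obtained earlier. It performs a few operations to make sure the input to the CLIP model has the right format and dimensions, including resizing, normalization and colour channel adjustment.

We then stack the preprocessed images together so that we can pass them through the model in one go rather than in a loop. Finally, we return the model output, which is an array of embeddings.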

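def get_features_from_image_path(image_paths):
  # Preprocess every image into the tensor format CLIP expects
  images = [preprocess(Image.open(image_path).convert("RGB")) for image_path in image_paths]
  # Stack into one batch tensor of shape (N, 3, H, W) on the model's device
  image_input = torch.stack(images).to(device)
  # Gradients are not needed for inference
  with torch.no_grad():
    image_features = model.encode_image(image_input).float()
  # Return on the CPU so the embeddings can be handed to the vector database
  return image_features.cpu()
image_features = get_features_from_image_path(image_paths)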
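Stacking every image into a single batch is fine for a small directory, but it may exhaust memory for a large one. One possible workaround is to embed the paths in fixed-size chunks. The `batched` helper below is hypothetical (it is not part of the original notebook) and only shows the chunking step.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical file names, used only to show the chunking behaviour.
example_paths = [f"image_database/img_{i}.jpeg" for i in range(5)]
batches = list(batched(example_paths, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Each chunk could then be passed to get_features_from_image_path and the resulting tensors joined with torch.cat.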

We can now create our vector database.

index = faiss.IndexFlatIP(image_features.shape[1])
index.add(image_features.cpu().numpy())  # FAISS expects a float32 NumPy array
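IndexFlatIP ranks results by raw inner product, which depends on vector length as well as direction. If cosine similarity is what you want, a common approach is to L2-normalize both the stored embeddings and the query first. The notebook as written does not normalize, so treat the following as an optional sketch; it uses a plain NumPy matrix product as a stand-in for the FAISS index, and `l2_normalize` is a hypothetical helper.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


# Toy 2-D "embeddings": row 1 points the same way as the query,
# row 0 points elsewhere but has a much larger magnitude.
database = np.array([[10.0, 10.0], [0.0, 1.0]], dtype=np.float32)
query = np.array([[0.0, 2.0]], dtype=np.float32)

raw_scores = query @ database.T                                  # inner product
cosine_scores = l2_normalize(query) @ l2_normalize(database).T   # cosine similarity

# The raw inner product prefers the long vector; cosine prefers the aligned one.
print(raw_scores.argmax(), cosine_scores.argmax())
```

With FAISS, the same normalized float32 arrays would be passed to index.add and index.search; FAISS also offers faiss.normalize_L2 for in-place normalization.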

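We also load our JSON file for the image-to-description mapping and build a list of its entries. In addition, we create a helper function that searches this list for a given image, so that we can get that image's description.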

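data = []
image_path = 'train1.jpeg'  # the example image the user will "upload"
# description.json is read one line at a time: one JSON object per line
with open('description.json', 'r') as file:
    for line in file:
        data.append(json.loads(line))
def find_entry(data, key, value):
    # Return the first entry whose `key` equals `value`, or None if there is none
    for entry in data:
        if entry.get(key) == value:
            return entry
    return None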
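find_entry scans the whole list on every call, which is fine for a small knowledge base. For a larger one, a dictionary keyed by image path is a possible alternative. The names `entry_by_path` and `find_entry_fast` below are hypothetical, and the entries are placeholders in the same shape as the parsed description file.

```python
from typing import Dict, List, Optional

# Hypothetical entries in the same shape as the parsed description file.
data_example: List[dict] = [
    {"image_path": "image_database/example_a.jpeg", "description": "Placeholder A."},
    {"image_path": "image_database/example_b.jpeg", "description": "Placeholder B."},
]

# Build the lookup table once; later lookups avoid scanning the whole list.
entry_by_path: Dict[str, dict] = {entry["image_path"]: entry for entry in data_example}


def find_entry_fast(lookup: Dict[str, dict], image_path: str) -> Optional[dict]:
    """Return the entry for `image_path`, or None when it is missing."""
    return lookup.get(image_path)


print(find_entry_fast(entry_by_path, "image_database/example_b.jpeg"))
print(find_entry_fast(entry_by_path, "image_database/missing.jpeg"))
```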

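Let's display an example image; this will be the image the user uploads. It shows a piece of technology unveiled at CES 2024: the DELTA Pro Ultra Whole House Battery Generator.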

im = Image.open(image_path)
plt.imshow(im)
plt.show()

Delta Pro

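Querying the vision model

Now let's see what GPT-4 Vision, which has not seen this technology before, labels it as.

First, we need a function that encodes our image in base64, since this is the format in which we pass it to the vision model. We then create a generic image_query function that lets us query the LLM with an image input.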

def encode_image(image_path):
    with open(image_path, 'rb') as image_file:
        encoded_image = base64.b64encode(image_file.read())
        return encoded_image.decode('utf-8')

def image_query(query, image_path):
    response = client.chat.completions.create(
        model='gpt-4-vision-preview',
        messages=[
            {
            "role": "user",
            "content": [
                {
                "type": "text",
                "text": query,
                },
                {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_path)}",
                },
                }
            ],
            }
        ],
        max_tokens=300,
    )
    # Extract relevant features from the response
    return response.choices[0].message.content
image_query('Write a short label of what is shown in this image?', image_path)

'Autonomous Delivery Robot'

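As we can see, the model does its best to infer a label from what it was trained on, but it gets it wrong because it has not seen anything similar in its training data. The image is also ambiguous, which makes it hard to extrapolate and deduce what it shows.

Performing semantic search

Now, let's perform a similarity search to find the two most similar images in our knowledge base. We get the embedding of the user-supplied image_path, then retrieve the indices and distances of the most similar images in our database. Because the index is an inner-product index (IndexFlatIP), the "distances" it returns are really similarity scores: a larger value means more similar. We therefore sort by score in descending order.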
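A side note on the encoding step: image_query hardcodes image/jpeg in the data URL, which matches this notebook's JPEG-only database. If other formats were ever needed, one option would be to guess the MIME type from the file name. The `to_data_url` helper below is hypothetical and uses placeholder bytes instead of a real image.

```python
import base64
import mimetypes


def to_data_url(file_name: str, raw_bytes: bytes) -> str:
    """Build a base64 data URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(file_name)
    if mime_type is None:
        mime_type = "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes stand in for real image content.
url = to_data_url("train1.jpeg", b"\xff\xd8\xff")
print(url)
```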

image_search_embedding = get_features_from_image_path([image_path])
# FAISS expects a float32 NumPy array of shape (n_queries, dim)
query_vector = image_search_embedding.reshape(1, -1).cpu().numpy()
distances, indices = index.search(query_vector, 2) #2 signifies the number of topmost similar images to bring back
distances = distances[0]
indices = indices[0]
indices_distances = list(zip(indices, distances))
# Larger inner product means more similar, so sort in descending order
indices_distances.sort(key=lambda x: x[1], reverse=True)
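When more neighbours are requested than the index holds, FAISS pads the result with an index of -1. The sketch below shows one way to pair ids with scores, drop that padding, and sort best-first. `rank_hits` is a hypothetical helper, and the input values are invented for illustration.

```python
from typing import List, Sequence, Tuple


def rank_hits(indices: Sequence[int], scores: Sequence[float]) -> List[Tuple[int, float]]:
    """Pair ids with scores, drop -1 padding, and sort best-first."""
    hits = [(int(i), float(s)) for i, s in zip(indices, scores) if i != -1]
    # Inner-product scores: larger means more similar, so sort descending.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Hypothetical search output for k=3 against an index holding only 2 vectors;
# the last score is a placeholder for the padded slot.
print(rank_hits([1, 0, -1], [0.5, 0.75, -3.4e38]))  # [(0, 0.75), (1, 0.5)]
```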

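We need the indices because we use them to look up the image at that position in our image directory and feed it to the vision model for RAG.

Let's see what the search brought back (we display the images in order of similarity):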

#display similar images
for idx, distance in indices_distances:
    print(idx)
    path = get_image_paths(direc, idx)[0]
    im = Image.open(path)
    plt.imshow(im)
    plt.show()

Delta Pro2

Delta Pro3

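We can see that it brought back two images containing the DELTA Pro Ultra Whole House Battery Generator. One of the images has some potentially distracting background, but the search still found the right image.

User querying the most similar image

Now, for our most similar image, we want to pass it and its description to GPT-4 Vision along with a user query, so that users can ask about technology they may have bought. This is where the vision model is useful: you can ask general questions that the model was not explicitly trained on, and, given the retrieved image and description as context, it can answer them accurately.

In the example below, we ask about the capacity of the item in question.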

similar_path = get_image_paths(direc, indices_distances[0][0])[0]
element = find_entry(data, 'image_path', similar_path)

user_query = 'What is the capacity of this item?'
prompt = f"""
Below is a user query, I want you to answer the query using the description and image provided.

user query:
{user_query}

description:
{element['description']}
"""
image_query(prompt, similar_path)

'The portable home battery DELTA Pro has a base capacity of 3.6kWh. This capacity can be expanded up to 25kWh with additional batteries. The image showcases the DELTA Pro, which has an impressive 3600W power capacity for AC output as well.'

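We can see that it is able to answer the question. This is only possible because we matched the image directly and then retrieved the relevant description as context.

Conclusion

In this notebook, we covered how to use the CLIP model to create an image embedding database, how to perform a semantic search over it, and finally how to pass a user query to the vision model to answer a question.

This usage pattern applies across many different domains and is easy to extend. For example, you can fine-tune CLIP, improve the retrieval process as you would in any RAG system, and prompt engineer GPT-4 Vision.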
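One fragile spot in the flow above is the lookup: if find_entry returns None (for example, because a path in description.json does not match the directory listing), element['description'] raises an error. A possible guard is to assemble the prompt in a small function with a fallback. `build_prompt` and its fallback text are hypothetical, not part of the original notebook.

```python
from typing import Optional


def build_prompt(user_query: str, entry: Optional[dict]) -> str:
    """Assemble the RAG prompt, tolerating a missing or incomplete entry."""
    description = (entry or {}).get("description") or "No description available."
    return (
        "Below is a user query, I want you to answer the query using the "
        "description and image provided.\n\n"
        f"user query:\n{user_query}\n\n"
        f"description:\n{description}\n"
    )


# With no matching entry, the prompt still builds and states that clearly.
print(build_prompt("What is the capacity of this item?", None))
```

The returned string could be passed to image_query in place of the inline f-string.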