Using the CLIP multimodal model library in Python

CLIP (Contrastive Language–Image Pretraining) is a multimodal model proposed by OpenAI. It maps images and text into the same embedding space, enabling tasks such as image–text matching, zero-shot classification, and image–text retrieval.

Although OpenAI did not publish a standalone official Python library named clip, OpenAI's reference CLIP implementation and community projects such as open_clip and CLIP-as-service are all widely used. The following mainly introduces:

  • The official OpenAI CLIP
  • The community edition open_clip (supports more models)

1. Install the official OpenAI CLIP

pip install git+https://github.com/openai/CLIP.git

Dependencies: torch, numpy, Pillow (PIL)

2. Quick use example

import clip
import torch
from PIL import Image

# Load the model and preprocessing pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess the image (replace the path with your own image)
image = preprocess(Image.open("your_image.jpg")).unsqueeze(0).to(device)

# Tokenize the text descriptions
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

# Extract features and compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)

3. Model Options

Commonly used models include (see the snippet after this list for how to query the full set):

  • "ViT-B/32": Fastest, most commonly used
  • "ViT-B/16": Larger and more accurate
  • "RN50", "RN101": Based on ResNet

4. Text encoding

text = ["a photo of a banana", "a dog", "a car"]
tokens = clip.tokenize(text).to(device)

with torch.no_grad():
    text_features = model.encode_text(tokens)
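
For the "ViT-B/32" model each prompt is mapped to a 512-dimensional vector, and the features are usually L2-normalized before any similarity comparison. A quick check, continuing from the snippet above:

print(text_features.shape)  # torch.Size([3, 512]) for ViT-B/32

# L2-normalize so that dot products become cosine similarities
text_features = text_features / text_features.norm(dim=-1, keepdim=True)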

5. Image encoding

from PIL import Image

image = Image.open("your_image.jpg")  # replace with your image path
image_input = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
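
To encode several images at once, stack the preprocessed tensors into a batch before calling encode_image. A minimal sketch, assuming model, preprocess, and device from above and using placeholder file names:

import torch
from PIL import Image

paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder file names
batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

with torch.no_grad():
    batch_features = model.encode_image(batch)  # one embedding per image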

6. Similarity comparison

import torch.nn.functional as F

# Cosine similarity between the image and text embeddings
similarity = F.cosine_similarity(image_features, text_features)
print(similarity)
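
For reference, the logits_per_image returned by model(image, text) in section 2 are essentially the same cosine similarities scaled by a learned temperature, model.logit_scale. A rough sketch of that computation, reusing the features from the sections above:

image_norm = image_features / image_features.norm(dim=-1, keepdim=True)
text_norm = text_features / text_features.norm(dim=-1, keepdim=True)

# logit_scale is a learned log-temperature; exp() is roughly 100 in the released models
logits_per_image = model.logit_scale.exp() * image_norm @ text_norm.T
probs = logits_per_image.softmax(dim=-1)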

7. Zero-shot image classification

labels = ["a dog", "a cat", "a car"]
text_inputs = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    image_features = model.encode_image(image)

# Normalize the features
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity scores
logits = image_features @ text_features.T
pred = logits.argmax(dim=-1).item()

print(f"Predicted label: {labels[pred]}")

8. Compare with other libraries

Characteristic | CLIP | BLIP / Flamingo | BERT / GPT
Image–text alignment | Yes | Yes | No
Multimodal capability | Strong (image + text) | Stronger (supports generation) | Weak
Zero-shot capability | Strong | Strong | None
Typical tasks | Image–text search, matching, classification | Caption generation, Q&A, VQA | Language tasks

9. More powerful: open_clip

open_clip is a stronger, community-maintained version that supports many more pretrained models (such as those trained by LAION):

pip install open_clip_torch

import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
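
With the model, preprocess, and tokenizer created above, encoding and zero-shot scoring look almost identical to the official package. A minimal sketch, using a placeholder image path:

import torch
from PIL import Image

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder file name
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)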

10. Summary

Function | Method
Load the model | clip.load()
Text encoding | model.encode_text()
Image encoding | model.encode_image()
Image–text similarity | model(image, text) or cosine similarity
Zero-shot image classification | Embed the text descriptions and pick the label with the highest similarity
Supported models | "ViT-B/32", "ViT-B/16", etc.

CLIP is a representative modern multimodal AI model that can be widely used in scenarios such as image retrieval, image–text classification, visual question answering, and cross-modal search. Because it also performs well in zero-shot settings, it is a powerful tool for building general image–text understanding systems.

This concludes this article on using the CLIP multimodal model library in Python.