CLIP (Contrastive Language–Image Pretraining) is a multimodal model proposed by OpenAI. It maps images and text into the same embedding space, enabling tasks such as image-text matching, zero-shot classification, and image-text retrieval.
Although OpenAI never published an official Python package named `clip` on PyPI, implementations such as OpenAI's own CLIP repository, the community `open_clip`, and CLIP-as-service are all widely used. This article mainly covers:

- The official OpenAI CLIP
- The community `open_clip` (which supports more models)
1. Install the official OpenAI CLIP

```bash
pip install git+https://github.com/openai/CLIP.git
```

Dependencies: torch, numpy, Pillow (PIL).
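As a quick sanity check after installation, you can confirm that the packages import and whether a GPU is visible (a minimal sketch):

```python
import torch
import clip  # the package installed from the OpenAI CLIP repository

# Report the installed torch version and whether CUDA is available
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```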
2. Quick usage example

```python
import clip
import torch
from PIL import Image

# Load the model and its preprocessing transform
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess the image
image = preprocess(Image.open("")).unsqueeze(0).to(device)  # fill in your image path

# Write the candidate text descriptions
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

# Extract features and compute image-text similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
```
3. Model options

Supported models include (you can also list them programmatically, as shown below):

- "ViT-B/32": fastest, most commonly used
- "ViT-B/16": larger and more accurate
- "RN50", "RN101": based on ResNet
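The library can list every checkpoint name that clip.load() accepts:

```python
import clip

# Prints the names accepted by clip.load(), e.g. "RN50", "RN101", "ViT-B/32", "ViT-B/16", ...
print(clip.available_models())
```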
4. Text encoding

```python
text = ["a photo of a banana", "a dog", "a car"]
tokens = clip.tokenize(text).to(device)

with torch.no_grad():
    text_features = model.encode_text(tokens)
```
5. Image encoding

```python
from PIL import Image

image = Image.open("")  # fill in your image path
image_input = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
```
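If you need to embed many images, it is usually faster to preprocess them into one batch and encode them in a single call. A minimal sketch (the file names here are hypothetical):

```python
import torch
from PIL import Image

paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # hypothetical file names

# Stack the preprocessed images into a single (N, 3, 224, 224) batch
batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

with torch.no_grad():
    batch_features = model.encode_image(batch)  # shape (N, 512) for ViT-B/32
```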
6. Similarity comparison

```python
import torch.nn.functional as F

# Cosine similarity between the image embedding and each text embedding
similarity = F.cosine_similarity(image_features, text_features)
print(similarity)
```
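Equivalently, you can L2-normalize both embeddings and take a matrix product, which is the same operation the zero-shot classification in the next section uses and scales better when scoring one image against many texts (a minimal sketch reusing the tensors from the previous sections):

```python
# L2-normalize so that the dot product equals cosine similarity
img = image_features / image_features.norm(dim=-1, keepdim=True)
txt = text_features / text_features.norm(dim=-1, keepdim=True)

# (1, N) matrix of cosine similarities between the image and every text
similarity_matrix = img @ txt.T
print(similarity_matrix)
```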
7. Zero-shot image classification

```python
labels = ["a dog", "a cat", "a car"]
text_inputs = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    image_features = model.encode_image(image)

# Normalize the embeddings
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity scores: pick the label with the highest score
logits = image_features @ text_features.T
pred = logits.argmax(dim=-1).item()
print(f"Predicted label: {labels[pred]}")
```
8. Comparison with other models

Feature | CLIP | BLIP / Flamingo | BERT / GPT
---|---|---|---
Image-text alignment | Yes | Yes | No
Multimodal capability | Strong (image + text) | Stronger (supports generation) | Weak
Zero-shot capability | Strong | Strong | None
Typical tasks | Image-text retrieval, matching, classification | Caption generation, Q&A, VQA | Language tasks
9. More powerful: open_clip

open_clip is a community-maintained implementation that supports many more pretrained models (for example, the checkpoints released by LAION):

```bash
pip install open_clip_torch
```

```python
import open_clip

# create_model_and_transforms returns (model, preprocess_train, preprocess_val)
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
```
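Once loaded, the open_clip model is used much like the OpenAI version. A minimal sketch (the image path is a placeholder):

```python
import torch
from PIL import Image

image = preprocess(Image.open("")).unsqueeze(0)  # fill in your image path
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize and score, exactly as with the OpenAI model
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```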
10. Summary

Function | Method
---|---
Load the model | clip.load()
Text encoding | model.encode_text()
Image encoding | model.encode_image()
Image-text similarity | model(image, text) or cosine similarity
Zero-shot image classification | embed the text descriptions, then pick the label with the highest similarity
Supported models | "ViT-B/32", "ViT-B/16", etc.
CLIP is a representative modern multimodal AI model and is widely used in scenarios such as image retrieval, image-text classification, visual question answering, and cross-modal search. It also performs well under zero-shot conditions, making it a powerful tool for building general image-text understanding systems.
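As a final illustration of the retrieval use case mentioned above, here is a minimal text-to-image search sketch built only from the calls introduced earlier (the image paths are hypothetical):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical image collection to search over
paths = ["photo1.jpg", "photo2.jpg", "photo3.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(clip.tokenize(["a dog playing in the snow"]).to(device))

# Normalize and rank the images by cosine similarity to the query
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
scores = (text_features @ image_features.T).squeeze(0)

for idx in scores.argsort(descending=True).tolist():
    print(paths[idx], float(scores[idx]))
```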