Using the CLIP multimodal model library in Python

CLIP (Contrastive Language–Image Pretraining) is a multimodal model proposed by OpenAI. It maps images and text into the same embedding space, enabling tasks such as image–text matching, zero-shot classification, and image–text retrieval.

Although OpenAI did not publish a standalone official Python library named clip, OpenAI's reference CLIP implementation and community projects such as open_clip and CLIP-as-service are all widely used. The following mainly introduces:

  • The official OpenAI CLIP
  • The community edition open_clip (supports more models)

1. Install the official OpenAI CLIP

pip install git+https://github.com/openai/CLIP.git

Dependencies: torch, numpy, Pillow (PIL)

2. Quick use example

import clip
import torch
from PIL import Image

# Load the model and preprocessing pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess the image (replace the path with your own image)
image = preprocess(Image.open("your_image.jpg")).unsqueeze(0).to(device)

# Tokenize the text descriptions
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

# Extract features and compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)

3. Model Options

Commonly used models include (see the snippet after this list for how to query the full set):

  • "ViT-B/32": Fastest, most commonly used
  • "ViT-B/16": Larger and more accurate
  • "RN50", "RN101": Based on ResNet

4. Text encoding

text = ["a photo of a banana", "a dog", "a car"]
tokens = clip.tokenize(text).to(device)

with torch.no_grad():
    text_features = model.encode_text(tokens)
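
For the "ViT-B/32" model each prompt is mapped to a 512-dimensional vector, and the features are usually L2-normalized before any similarity comparison. A quick check, continuing from the snippet above:

print(text_features.shape)  # torch.Size([3, 512]) for ViT-B/32

# L2-normalize so that dot products become cosine similarities
text_features = text_features / text_features.norm(dim=-1, keepdim=True)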

5. Image encoding

from PIL import Image

image = Image.open("your_image.jpg")  # replace with your image path
image_input = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
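
To encode several images at once, stack the preprocessed tensors into a batch before calling encode_image. A minimal sketch, assuming model, preprocess, and device from above and using placeholder file names:

import torch
from PIL import Image

paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder file names
batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

with torch.no_grad():
    batch_features = model.encode_image(batch)  # one embedding per image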

6. Similarity comparison

import torch.nn.functional as F

# Cosine similarity between the image and text embeddings
similarity = F.cosine_similarity(image_features, text_features)
print(similarity)
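
For reference, the logits_per_image returned by model(image, text) in section 2 are essentially the same cosine similarities scaled by a learned temperature, model.logit_scale. A rough sketch of that computation, reusing the features from the sections above:

image_norm = image_features / image_features.norm(dim=-1, keepdim=True)
text_norm = text_features / text_features.norm(dim=-1, keepdim=True)

# logit_scale is a learned log-temperature; exp() is roughly 100 in the released models
logits_per_image = model.logit_scale.exp() * image_norm @ text_norm.T
probs = logits_per_image.softmax(dim=-1)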

7. Zero-shot image classification

labels = ["a dog", "a cat", "a car"]
text_inputs = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    image_features = model.encode_image(image)

# Normalize the features
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity scores
logits = image_features @ text_features.T
pred = logits.argmax(dim=-1).item()

print(f"Predicted label: {labels[pred]}")

8. Compare with other libraries

Characteristic | CLIP | BLIP / Flamingo | BERT / GPT
Image–text alignment | Yes | Yes | No
Multimodal capability | Strong (image + text) | Stronger (supports generation) | Weak
Zero-shot capability | Strong | Strong | None
Typical tasks | Image–text search, matching, classification | Caption generation, Q&A, VQA | Language tasks

9. More powerful: open_clip

open_clip is a stronger, community-maintained version that supports many more pretrained models (such as those trained by LAION):

pip install open_clip_torch

import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
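
With the model, preprocess, and tokenizer created above, encoding and zero-shot scoring look almost identical to the official package. A minimal sketch, using a placeholder image path:

import torch
from PIL import Image

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder file name
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)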

10. Summary

Function | Method
Load the model | clip.load()
Text encoding | model.encode_text()
Image encoding | model.encode_image()
Image–text similarity | model(image, text) or cosine similarity
Zero-shot image classification | Embed the text descriptions and pick the label with the highest similarity
Supported models | "ViT-B/32", "ViT-B/16", etc.

CLIP is a representative modern multimodal AI model that can be widely used in scenarios such as image retrieval, image–text classification, visual question answering, and cross-modal search. Because it also performs well in zero-shot settings, it is a powerful tool for building general image–text understanding systems.

This concludes this article on using the CLIP multimodal model library in Python.