ONNX Pipeline Models : CLIP Multi-Modal Embedding
ONNX pipeline models provide multi-modal embedding models that accept both image and text as input and produce embeddings as output. The pipeline models also include the necessary pre-processing.
CLIP Multi-Modal Embedding Pipeline
CLIP models are multi-modal, which allows you to generate embeddings for both text and image input data. The main advantage of these models is image-text similarity: you can compare the vectors produced for text snippets and a given image to determine which text best describes the image. The pipeline generator produces two ONNX pipeline models for a pretrained CLIP model, distinguished by their suffixes: the pipeline model for images is suffixed with _img and the model for text is suffixed with _txt. The same models, with their suffixes, are loaded into the database when using export2db. For CLIP-related tasks such as image-text similarity, both models must be used at inference time.
- Input: CLIP models consist of two pipelines: an image embedding pipeline and a text embedding pipeline. The image pipeline takes images as described in the Image Embedding Pipeline section of ONNX Pipeline Models : Image Embedding, and the text pipeline takes text as described in ONNX Pipeline Models : Text Embedding (text embedding support was introduced in OML4Py 2.0).
- Pre-processing: The image pipeline for CLIP models utilizes the same pre-processing strategy as described in the Image Embedding Pipeline section of ONNX Pipeline Models : Image Embedding. That is, an image processor that matches the model's configuration is utilized to prepare the images. The text pipeline utilizes a specific tokenizer: the CLIPTokenizer (transformers.models.clip.CLIPTokenizer).
- Post-processing: As a post-processing step, normalization is added to both the image and text pipelines.
- Output: Both models produce vectors of the same shape, which can then be compared using a similarity measure such as cosine similarity, as illustrated in the sketch after this list.
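For illustration, the following minimal sketch (not part of the generated pipelines) shows how the two kinds of embeddings could be compared once they have been produced. It assumes the vectors have already been obtained from the _img and _txt models and that NumPy is available; the values and the small vector dimension are made up purely for the example.

import numpy as np

# Hypothetical embeddings: one image vector and three candidate text vectors,
# as they might be returned by the clip_img and clip_txt pipelines.
# Real CLIP embeddings have a much larger dimension; 4 is used only to keep
# the example short.
image_emb = np.array([0.1, 0.7, 0.2, 0.6])
captions = ["a photo of a cat", "a photo of a car", "a photo of a dog"]
text_embs = np.array([
    [0.1, 0.6, 0.3, 0.6],   # embedding for "a photo of a cat"
    [0.9, 0.1, 0.1, 0.1],   # embedding for "a photo of a car"
    [0.2, 0.2, 0.8, 0.4],   # embedding for "a photo of a dog"
])

# Cosine similarity; the explicit normalization keeps the example correct even
# if the vectors are not already unit-length.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(image_emb, t) for t in text_embs]
best = int(np.argmax(scores))
print(f"Best matching caption: {captions[best]} (score {scores[best]:.3f})")

Because both pipelines normalize their outputs, the cosine similarity of the actual model outputs reduces to a simple dot product.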
CLIP Multi-Modal Embedding Examples
- Exporting a pre-configured CLIP model to a file:
The following example will produce two pipelines called clip_img.onnx and clip_txt.onnx, which can be used for image and text embeddings respectively.
from oml.utils import ONNXPipeline

pipeline = ONNXPipeline("openai/clip-vit-large-patch14")
pipeline.export2file("clip")
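After the export, the two files can be inspected with ONNX Runtime to confirm their input and output signatures before using them for inference. The sketch below is a minimal check, assuming the onnxruntime package is installed and that clip_img.onnx and clip_txt.onnx are in the current directory; tensor names and shapes are determined by the generated pipelines, so they are read from the session rather than hard-coded.

import onnxruntime as ort

# Load the two generated pipelines; file names follow the _img/_txt suffix convention.
for path in ("clip_img.onnx", "clip_txt.onnx"):
    sess = ort.InferenceSession(path)
    print(path)
    # Print the tensors each pipeline expects and produces.
    for i in sess.get_inputs():
        print("  input :", i.name, i.shape, i.type)
    for o in sess.get_outputs():
        print("  output:", o.name, o.shape, o.type)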
- Exporting a pre-configured CLIP model to the database:
This example will produce two in-database models called clip_img and clip_txt, which can be used for image and text embeddings respectively.
from oml.utils import ONNXPipeline
import oml

pipeline = ONNXPipeline("openai/clip-vit-large-patch14")
oml.connect("pyquser", "pyquser", dsn="pydsn")
pipeline.export2db("clip")
- Exporting a non pre-configured model with a template to a file:
This example works for CLIP models that are not pre-configured. It creates two files called clip_16_img.onnx and clip_16_txt.onnx.
from oml.utils import ONNXPipeline, ONNXPipelineConfig

config = ONNXPipelineConfig.from_template("multimodal_clip")
pipeline = ONNXPipeline("openai/clip-vit-base-patch16", config=config)
pipeline.export2file("clip_16")
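If you are unsure whether a given CLIP model is pre-configured, or which template names are available, ONNXPipelineConfig can be queried. The sketch below assumes that the show_preconfigured() and show_templates() helpers are available in your OML4Py version; treat the calls as illustrative rather than guaranteed.

from oml.utils import ONNXPipelineConfig

# List models that ship with a ready-made configuration (assumed helper).
print(ONNXPipelineConfig.show_preconfigured())

# List template names, such as "multimodal_clip", usable with from_template (assumed helper).
print(ONNXPipelineConfig.show_templates())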
Parent topic: Import Pretrained Models in ONNX Format