What’s the most important and difficult part of a successful machine learning project? Take a second to ponder that question. There are probably lots of valid answers, but in my experience it’s the data. Is there enough of it? Where is it stored? Is it clean or noisy? Does it have all of the information we need? I’ve found that the amount of high-quality data you have and the cleverness required of your machine learning team are inversely related.
It’s not surprising that a lot of effort has been put into approaches to ML that need less and less data. We’ll focus on supervised learning, where the data consists of things and labels: emails and whether they’re spam, infrared images of concrete and whether defects are present, credit card transactions and whether they’re fraudulent.
You’ll see terms like “few-shot” learning where you’ve got just a handful of labeled examples. There’s also one-shot learning where you have just one example: here’s a single spam email, now go through my mailbox and weed out all of the spam, but don’t throw away any good emails. If that sounds like a tall order, it is.
But this blog post considers an even harder problem - zero-shot learning, where you have no examples! That may sound impossible, but it isn’t, or this would be a very short blog. We’ll focus on image classification (automatically determining whether an image contains a particular thing, like a dog or cracked concrete) and object detection (putting a box around the dog or the crack in the image), and we’ll see shortly that the trick to zero-shot learning is to use language to describe what you’re looking for.
Almost all of ML boils down to converting things - text documents, images, patient records - into vectors of numbers (called embeddings) so that similar things have similar vectors. That’s often accomplished by training a deep neural network, typically some form of Convolutional Neural Network (CNN) for image-based tasks, that learns to create useful embeddings from images and to classify or detect using those embeddings, all at the same time.
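To make “similar things have similar vectors” concrete, here’s a tiny sketch that compares embeddings with cosine similarity. The four-dimensional vectors are made up purely for illustration; real embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real ones are typically 512+ dimensions).
dog_photo_1 = np.array([0.9, 0.1, 0.3, 0.0])
dog_photo_2 = np.array([0.8, 0.2, 0.4, 0.1])
spreadsheet = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(dog_photo_1, dog_photo_2))  # high: similar things
print(cosine_similarity(dog_photo_1, spreadsheet))  # low: dissimilar things
```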
Building those networks from scratch is very data intensive. For example, the ImageNet dataset contains more than 14M images manually labeled with more than 21K categories. All of that manual effort was a game changer in computer vision. It became commonplace to train CNNs on subsets of ImageNet, yielding networks with good general-purpose embeddings, and then fine-tune those embeddings for specific tasks using much less annotated data from the target domain. After all, if a CNN can tell the difference between a dingo, a barber chair, and a bulletproof vest (among 997 other things), its embeddings are probably extracting lots of generalizable and useful information from naturally occurring images. A few tweaks using limited data were often enough to get very powerful domain-specific image classifiers.
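If you want to see what that recipe looks like in practice, here’s a hedged sketch using PyTorch and a torchvision ResNet-50 pretrained on ImageNet (my choice of backbone and the two-class “defect” task are just illustrative assumptions): freeze the general-purpose layers and train only a small new classification head on your limited domain data.

```python
import torch.nn as nn
from torchvision import models

# Start from a ResNet-50 whose weights were learned on ImageNet.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the backbone so the general-purpose embeddings stay put.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-way ImageNet head with a small head for our task,
# e.g. "defect" vs. "no defect"; only this new layer gets trained.
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)

# From here, train model.fc on a (much smaller) labeled domain dataset
# with a standard loss such as nn.CrossEntropyLoss().
```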
So where are we in our story? The hard part is getting a neural network that turns images (or image patches) into embeddings such that the target objects have similar embeddings and other objects have different embeddings. We’d like to be able to use language to describe objects and get embeddings. Enter CLIP - Contrastive Language-Image Pre-training - from OpenAI.
The team at OpenAI gathered 400M image/text pairs from the Internet. There was a lot of cleverness (remember what I said about needing to be clever if you are trying to get away with very little data at training time) in many aspects of CLIP’s development. Perhaps the most clever of their tricks had to do with how CLIP was trained.
Note that the text associated with an image was not a label. It could be a picture on Instagram of a cat sunning itself with the caption “Eli being lazy on the deck”. To flexibly associate text with images, CLIP grabbed batches of n image/text pairs and trained a network to predict which image went with which bit of text. For each image there are n possibilities, but each image has one and only one text string that is the “correct” answer. Note that this approach does not require a human to provide any ground truth. You just assume that the text says something relevant about the image and let the magic of a massive dataset and a giant neural network do their thing.
Remember that the trick is to learn embeddings. CLIP has two encoders that map things to embeddings, one for text and one for images. For a batch of n image/text pairs there will be n image embeddings and n text embeddings, and there are n² pairs of embeddings. Out of those pairs, CLIP learns to pick the n pairs that truly go together. It does that using two objectives (the Contrastive part of CLIP), maximizing the similarity of the two halves of the n correct pairs while minimizing the similarity of the halves of the remaining n² − n incorrect pairs.
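If you like seeing ideas as code, here’s a simplified sketch of that contrastive objective. It assumes we already have the n image embeddings and n text embeddings for a batch; the real CLIP training recipe adds details like a learned temperature and very large batches.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Simplified CLIP-style loss for a batch of n matched image/text pairs.

    image_emb, text_emb: (n, d) tensors where row i of each came from the same pair.
    """
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # n x n matrix: entry (i, j) is the similarity of image i and text j.
    logits = image_emb @ text_emb.t() / temperature

    # The "correct" pairs sit on the diagonal: image i goes with text i.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Cross-entropy pushes diagonal similarities up and off-diagonal ones down,
    # in both directions (image -> text and text -> image).
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```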
The final part of the story is how CLIP uses what it learned from 400M image/text pairs to do zero-shot image classification and object detection. I’ll gloss over some of the details, but recall that CLIP has two encoders, one for text and one for images. If I hand the text encoder “an orange cat sleeping on a deck” it will produce an embedding T. If I then hand the image encoder a picture, it will produce an embedding I. CLIP learned to classify whether a particular T and I “go together”, and we can leverage that same machinery to tell whether the current image is one of “an orange cat sleeping on a deck”.
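In practice you don’t have to build any of this yourself. Here’s a minimal sketch of zero-shot classification using the publicly released openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers library; the image file name and candidate descriptions are hypothetical.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("deck_photo.jpg")  # hypothetical local image
texts = ["an orange cat sleeping on a deck", "an empty wooden deck"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image[i, j] scores image i against text j; softmax turns the scores
# into a probability over the candidate descriptions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```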
It’s worth considering how remarkable that is. First, the space of things that I can visually identify is open ended. Even though ImageNet had 21K categories, orange-cat-sleeping-on-a-deck was not one of them. Traditional methods would require me to collect and label a set of images, maybe of orange cats sleeping on decks, or maybe of sleeping cats and of decks separately, plus some code to figure out whether both appear in an input image. Instead, with CLIP you just describe what you want to find visually, with no data collection required! The hard data collection part was done by the OpenAI team when they scraped 400M text/image pairs.
It’s important to know that CLIP requires a little “prompt engineering”. For example, the text string “dog” is not as effective as “a photo of a dog”. That’s because text in the wild associated with images is rarely a single word. Similarly, when classifying satellite images it helps to use “a satellite image of a lake”. Finally, using multiple related prompts and letting them vote can help, such as “a photo of a large cat” and “a photo of a small cat”.
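Here’s a sketch of that prompt-ensembling idea: embed several related prompts, average the embeddings into one target embedding, and compare it to the image embedding. It again assumes the openai/clip-vit-base-patch32 checkpoint, and the prompts and file name are illustrative.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Several related prompts that all describe the same target class.
prompts = ["a photo of a large cat", "a photo of a small cat", "a photo of a cat"]
image = Image.open("mystery_photo.jpg")  # hypothetical local image

with torch.no_grad():
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

# Average the per-prompt embeddings into a single "cat" embedding, then
# compare it to the image embedding with cosine similarity.
text_emb = F.normalize(F.normalize(text_emb, dim=-1).mean(dim=0), dim=0)
image_emb = F.normalize(image_emb, dim=-1)[0]
print("cat score:", float(image_emb @ text_emb))
```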
The last technical detail to cover is how CLIP can be used for object detection, or finding bounding boxes around objects rather than just saying that they occur somewhere in the image. One simple trick is to turn the image into a set of non-overlapping patches. Each of the patches is then processed as if it were a complete image in the image classification framework. Does this chunk of the image look like it contains an orange cat or not? Then sets of adjacent patches that look like orange cat parts are framed by the minimal rectangle that encloses them. That’s it. Now you’ve got a zero-shot object detector!
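Here’s a rough sketch of that patch trick, again with the Hugging Face CLIP checkpoint. To keep it short, it scores every non-overlapping patch against the query and a generic “background” prompt and returns the minimal rectangle around all patches above a threshold, skipping the adjacency grouping; the grid size, threshold, and file name are illustrative choices, and purpose-built zero-shot detectors do considerably more than this.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def detect(image: Image.Image, query: str, grid: int = 4, threshold: float = 0.55):
    """Score each non-overlapping patch against `query` vs. a generic background
    prompt, then return the minimal box enclosing the patches that match."""
    w, h = image.size
    pw, ph = w // grid, h // grid
    boxes, patches = [], []
    for row in range(grid):
        for col in range(grid):
            box = (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)
            boxes.append(box)
            patches.append(image.crop(box))

    texts = [query, "a photo of the background"]
    inputs = processor(text=texts, images=patches, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Probability that each patch matches the query rather than "background".
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[:, 0]

    hits = [b for b, p in zip(boxes, probs.tolist()) if p > threshold]
    if not hits:
        return None
    # Minimal rectangle enclosing all matching patches.
    return (min(b[0] for b in hits), min(b[1] for b in hits),
            max(b[2] for b in hits), max(b[3] for b in hits))

# Example (hypothetical image file):
# print(detect(Image.open("backyard.jpg"), "an orange cat sleeping on a deck"))
```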
As amazing as CLIP is, there are some cautions. Consider one of the examples we used earlier when describing supervised learning: infrared images of concrete and whether defects are present. Because CLIP was trained from images scraped from the Internet with associated text, it is really good at identifying interesting combinations of objects and features seen in that kind of content. You can expect CLIP to do well with classes like “concrete” or “concrete structure” or “cracked concrete”. But I’d be surprised if there were enough infrared images of concrete or other materials with and without defects to yield a good zero-shot classifier for “infrared images of concrete and whether defects are present”.
Also, CLIP is not the only game in town. Many of the modern LLMs, like the ChatGPT series, boast zero-shot object detection capabilities. Informal tests by me and others suggest that they are not very good as of this writing. But like all things in ML and AI, that can change very quickly.
—Dr. Tim Oates, Co-Founder & Chief Data Scientist at Synaptiq
Photo by Steven Wright on Unsplash
Synaptiq is an AI and data science consultancy based in Portland, Oregon. We collaborate with our clients to develop human-centered products and solutions. We uphold a strong commitment to ethics and innovation.
Contact us if you have a problem to solve, a process to refine, or a question to ask.
You can learn more about our story through our past projects, our blog, or our podcast.