Image Instance Retrieval: Overview of state-of-the-art (31:55)
July 5, 2017
August 22, 2016
All Captioned Videos Brains, Minds and Machines Summer Course 2016
Vijay Chandrasekhar, A*STAR
Overview of current computer vision systems for retrieving similar images from a large database, describing the image processing stages leading to a representation of image features suitable for matching, the application of deep learning methods to the extraction of feature descriptors, and future challenges for large scale image and video retrieval systems.
VIJAY CHANDRASEKHAR: Good afternoon, everyone. So let me start by introducing myself. So I got my undergrad and masters at CMU. So I think I saw some CMU people here. So that's great. And I got my PhD at Stanford in 2013. And since then I moved to Singapore.
So I'm currently having an adjunct position at the Nanyang Technology University, which is one of the two big universities in Singapore. I'm also associated with the Institute for Infocomm Research. So can I have a quick show of hands? So how many computer vision people do we have in this room?
OK, a handful, so that's good. OK, good, so I'll keep it as high level as possible so that you guys can appreciate some of this. So I'll start by introducing A-STAR. So if you guys are looking for a fun summer internship like halfway across the globe, A-STAR is a good option. So think of us as the National Science Foundation of the US, so we're like a big funding agency. It's I think a couple of billion a year at this point, which is huge for a small country like Singapore.
We have lots of research institutes, so across the biomedical side and the science and engineering side. So we have about 5,000 scientists. They are from all over the world. And we also have a commercialisation arm. We work closely with industry. So think of us as like for people familiar with the German research system, like think of us like more Fraunhofer like less Max Planck.
So we work on more translational work, engaged with a lot of different industry partners. So anyways that's a quick, so this is the outline of the talk. So let's start with what's image retrieval. So so far, I think we've had a few talks on computer vision. So I thought we can deep dive on image retrieval, which is a slightly different problem than image classification.
So we'll start with an overview of that. So we'll talk about what image retrieval looked like pre-deep learning, what happened post-deep learning, and then some of the challenges ahead. There's a panel discussion on industrial related stuff. So I will talk briefly about some of the opportunities in this space.
So basic image retrieval, so image instance retrieval is different from the image classification problem. So image classification problem, think of it as like cat or dog, right? Image instance retrieval is like, find me that specific cat, or, for example, in this particular example, find me this very specific building from a large database. So it's finding the exact instance of the object, rather than just the general class it belongs to. And it's a pretty challenging problem, because there are lots of like things you have to be sort of invariant to.
There are changes in lighting conditions, scale, rotation, blur, lighting effects, et cetera. And also like especially for outdoor environments, the environment changes can be quite significant across two same pictures. Lots of applications for visual search, we've seen a lot of companies in this space, that are involved in comparison shopping, augmented TV, like wine recognition apps. The visual search is slowly sort of making its way up many industries.
So let's think of like text retrieval, right? Before we jump into image retrieval. So here's the American Declaration of Independence. So this has words like, for example, self-evident, like liberty, truth, happiness, right? So when you search the Declaration of Independence on Google, or any of these words, this document pops up, right? So essentially what Google does behind the scenes is converts this documents, sort of a bag of words, right? Where any time someone searches for self-evident, liberty, truths, like there's an inverted list, right, which keeps track of like, OK, these are all the documents that contain those particular words.
In this particular case, like liberty, the Declaration of Independence would show up as one of the top documents. So we want to do the same for image retrieval as well. So we're representing an image as a bag of visual words. So what are these words? So these words are like highly localized patches within the image.
So here you see like specific structures from The Golden Gate Bridge, which I'll go into how we extract them in a bit. So we basically extract these text patches. So here is a high level pipeline. So it might be useful to just get a good understanding of it.
So, typically, you'd start off with a color image. You convert it to grayscale, you apply a bunch of filters, so here are like horizontal derivatives, vertical derivatives, mixed derivatives, so that gives you a bunch of images. You compute a blob response as a function of these, you apply these filters at different scales and then you basically get the entire scale space of these images. So this is a determinant of Hessian interest point detector.
Then you find maxima in this scale space, the maximum meaning you have to be higher than all your neighbors on your same level and above and below, so a 27-pixel neighborhood is looked at, you find a maximum. And then so these are examples of patches that you get at different scales. And then you sort of orient each one of these along a dominant gradient. You get something that looks like that.
And then for each one of these patches, you basically go compute a gradient field, and then compute a bunch of interesting statistics on it. So a good example would be SIFT and SURF, which most people are probably familiar with. So SIFT and SURF extract interesting gradient statistics from the patch. So in this particular example, the patch is divided into smaller sub-patches, and in each one of these patches, you extract information like the average gradient magnitude, or the average sum of like x and y gradient.
That sort of gives you a descriptor, which describes that particular patch. So this is your visual word similar to what your textual words are. But these descriptors tend to be sort of very high dimensional. So you end up sort of quantizing them.
So for people familiar with quantization techniques, think of this as just K-means. So but then the space is very, very large, like if you have millions of code words, you don't want to do flat K-means. You tend to typically do it in a hierarchical fashion, which means you have, for example, here, three cluster centers, and then you sort of hierarchically cluster this data.
And each one of these leaf nodes or intermediate nodes represent a visual word, analogous to the textual words we talked about. So that's what you do to describe the image. So what happens at query time? That's exactly similar to what Google does.
In comes an image. You quantize it to a bunch of words, and then all the items in the database rank against it. So the image which has the highest number of visual words in common will end up being the most relevant image, for example, in this particular case, the first image of the CD cover.
So there are two steps in the process. The first step gives you an approximate result. What's different in computer vision is often you need to find the exact image. You want very high precision. So you want your answer to be absolutely right, which is precision 1. The way you get close to precision 1 is by applying geometric consistency checks.
So these are algorithms like RANSAC, where you start matching these point features within pairs of images. So for example here are feature matches from these two CDs, but there's obviously no valid geometric transform between these images. So that would not pass a geometric check.
Here again, no, and then you keep walking down the list till you find the right answer, which has a valid geometric transform between the two. So image retrieval can be broadly classified as two steps, like find an approximate set of matches, and then find the exact one by applying some geometric transforms. So these are the kind of data sets people work with.
So if you think of like image classification, everyone looks at ImageNet for retrieval. So you look at data sets like the Oxford buildings data set, there's a Stanford mobile visual search data set, which I was involved in collecting like a few years ago, there's the University of Kentucky benchmark, and there's like INRIA holidays. And these data sets are typically scene-centric or object-centric. So object-centric refers to like indoor objects, scenes are like outdoor scenes.
So if you are working on this problem, these are the data sets you would work with. There was a simple retrieval pipeline, which showed how the text retrieval systems and these image retrieval systems are analogous to each other. So I'll go into some of these image retrieval building blocks, of like a more advanced retrieval system. And then, so the first one I'll talk about is pre-deep learning.
And then we'll see what happened post-deep learning and how like both of them are still relevant today. So pre-deep learning, so I think there is a quick call-out here, as there is actually an MPEG standard on computer vision. So MPEG, these are the guys who did all our video codecs, the audio codecs, so they actually have an excellent standard on computer vision, where many universities and companies participated in this for a period of like three or four years. So they have, if you want to get up and running, you can download this piece of software and use it, sort of benchmark yourself.
So here's what an image feature extraction pipeline looks like. So you start off with the query image. You extract some interest points. So this was similar to the interest point detector I showed before. You extract some descriptors, then there's some feature selection. There's some compression.
And then you have two descriptors. You have a global descriptor and a local descriptor. And then that sort of gives you a representation of the image. So I'll quickly walk through some of these blocks, just to give you an idea of what it looks like. So local feature selection, so typically like all features are not equal, right?
So when you try to match images, you quickly realize that features that are very small tend to be more noisy. Features that are too big tend to be unstable as well. Features that have a higher peak response tend to be more repeatable, so you can apply statistical methods to figure out like which features are more important than others.
And these are not individual dimensions of a single descriptor as much as like different features, that you collect. So each image produces probably like a couple of thousand features, and you want to rank these features to figure out which ones are more important than others. So that feature selection step helps you do that, then feature descriptor compression. So what happens here is, so these feature descriptors tend to be very high dimensional.
So if you look at the SIFT descriptor, which is like the benchmark, it's almost 1,000 bits per descriptor. So even if you have a few thousand of these descriptors, you end up having like hundreds of kilobytes, or even up to a megabyte, for a single image. So then that's the amount of data that is larger than the size of the image itself. So how do you compress some of these descriptors?
So here is one of the state of the art techniques, which sort of compresses distributions of data. So most descriptors are based on gradient distributions, because they capture the patch statistics in a very robust fashion. So here, you have a gradient angle distribution which is just the statistics of the gradient angle across the patch.
So this data lies in a very specific portion of the space, in this place it lies in, for example, what we call the An simplex in 3D, which is just a bunch of points which sum up to one. You can do some clever things. You can sort of like-- sorry, so you can-- so feature descriptors based on histograms tend to lie in this probability simplex.
So now you can sort of start placing points and quantizing points on this particular simplex. So there is this work I'd like to point you guys to, which talks about how to compress distributions. And you can reduce the size of data by an order of magnitude without losing performance. So that's image compression.
So, global descriptor, so think of global descriptors as like, typically databases are very, very large. You typically have millions or tens of millions of images. Or you want to get search results back typically in a few hundred milliseconds. So what you want to do is, typically, compare your-- you want to extract a really good descriptor from your image, your query image, and then compare it to descriptors from the database. And, typically, you want them to be binary, so you can make really, really fast comparisons.
A really simple global descriptor is one based on bag of words. So think of like you have a dictionary, like this is hierarchical K-means, and you sort of just keep track of the counts in each bin. So that's a good example of a global descriptor.
But that is a very simple one. People have proposed more sophisticated ones, so, for example, most state of the art global descriptors are based on extracting residual statistics in each one of these bins. So an example of that is, you start off with a bunch of descriptors. You start quantizing them into these different bins.
You compute the residual vector for each bin. And you aggregate these statistics, and you have a representation for each bin. You concatenate all of these, and you get a so-called global descriptor, which represents the image.
So a high level overview, you start off with a query image. You extract a bunch of local features. You quantize them to a very small codebook. Then you aggregate the residuals in each code word. Then, typically, these are still very high dimensional. So you reduce the dimensionality, you binarize them.
And then you have a global descriptor that represents the image. And this is a powerful representation, typically because you can then compare them directly if they're binary and retrieve very similar images from a large database.
AUDIENCE: And where does the codebook come from?
VIJAY CHANDRASEKHAR: Yeah, the codebook is trained from a large number of like image patches. So you'd extract a lot of features and you would train them. And, typically, these codebooks tend to be small. So you quantize them and then aggregate some statistics in each one of these bins.
So this is like a typical global descriptor pipeline. There are like many, many variants of it. So this was a general image feature extraction pipeline, before deep learning, right? So a quick recap, you need some interest point detector, something like a difference of Gaussian detector, determinant of Hessian, there are many more. So this gives you sort of invariance to scale, invariance to rotation, some invariance lighting conditions.
Then you need some sort of feature descriptor. So this is what describes each individual patch in the image. You need some sort of feature selection, to discard the good-- I mean, to find the good ones from the bad ones, which are typically based on statistical methods. These features tend to be really high dimensional, so they need to be compressed.
So you typically use techniques from the compression literature for compressing them. And then you have also a global descriptor, which represents the entire image. So these are some of the techniques, like VLAD, Fischer vectors, REVV, these are examples of global descriptors that are obtained by aggregating sort of residual statistics.
So that was the general image feature extraction pipeline, before deep learning. So what happened after deep learning? So deep learning, so deep learning's back. So we've talked about deep learning in many of these tutorials, lots of data, lots of computing power, and essentially lots of incremental innovation in the last few years have driven this trend up. Obviously, there is remarkable performance on every single task, like image classification, face recognition, pedestrian detection, pose estimation, et cetera.
So each one of these tasks, like deep learning is pretty much at the top of the table now. But there's something different for image retrieval, and I'll go into it next. So networks are getting deeper. So like, for example, the AlexNet was at seven layers, the Oxford net, which is very popular, has like 16. Google's Inception network was 19, and then Microsoft's deep networks have hundreds to even like thousands of layers, using a residual trick.
So that's for the classification problem, right? So the first problem is, how do we extract CNN descriptors for the image retrieval problem, right? So if you went all the way, so if you look at a deep network, typically the data on the left, the first layer is just the image itself. And then your last layer is like the class label, cat. So as you go deeper and deeper the layers become more and more semantic, right?
So for the image classification problem, you typically just care about the final layer. But I can't use the final layer for image retrieval, because let's say I had two different cats. They would get mapped to the exact same representation using a deep network. While I want to find a specific cat from a large database, so it's a different problem.
So how do you extract CNN descriptors for the image instance retrieval problem? So what people do is they end up using an intermediate layer in the network. So these are four different data sets, like holidays, Oxford, UK badge, graphics. So on the x-axis is the layer in the network. So Pool 5, FC-6, as you're going deeper and deeper, on the y-axis, you have a performance measure.
So typically, for most of these data sets, what works really well is just extracting an intermediate layer from the network, and then using that for image retrieval. And that seems to work pretty well. That's the first thing that people have started doing after like deep learning happened. Second, people have realized there's very little invariance with CNN descriptors, right?
Even though they're trained with lots of data augmentation at training time, for the image retrieval problem, invariances are very important. Like you have to get the same object, even if it's scaled, rotated, or under different lighting conditions. So here's an example of results with the CNN. So you're rotating the query.
On the y-axis, you have some retrieval measure like mean average precision. And you have the first fully connected layer of the Oxford net. And as you rotate the query, performance drops pretty rapidly, which is bad. But if you look at techniques from the previous methods, so this is the Fischer vector with difference of Gaussian interest points, this stays relatively flat, because with previous techniques, your scale and rotation invariance came from the interest point detector.
So the interest point detector found patches at different scales, different rotations, and that sort of gave you invariance. So you had invariance, while even as, for example, you had invariance to rotation and you had invariance to scale as well, to a certain degree. So which is why the performance of these previous methods stayed roughly flat, like the Fischer vector of a difference of Gaussian interest points.
But then the peak performance for the CNN is better, so you need like the best of both worlds, like I want better performance from the CNN, but I also sort of want invariance, which existed in the previous set of techniques. So how do you get invariance? I think we've used techniques proposed by Poggio in his recent work on iTheory. In a broad nutshell, like iTheory says that, if you extract descriptors on groups, rather than on an individual image, you can sort of be invariant to that group of transformation.
So if I go extract a descriptor based off of an orbit, the orbit corresponds to some transformation group, and if I extract a descriptor which is invariant, for example, based on a distribution, I would be invariant to that particular transformation group. So CNNs inherently are invariant to local translations, because you can think of these convolutions as templates. And by sort of max pooling in locally, you are sort of gaining translation invariance.
The computing moments like the max, that pooling operation gives you essentially invariance to local transformations. But what it doesn't have is it has very limited invariance to scale and rotation. So here's some work which we did recently with Tommy on how to incorporate scale and rotation invariance for the image retrieval problem. So this is for the image retrieval problem, which is slightly different from the image classification problem.
And this technique involves using a bunch of transformations and then sort of nesting them and then computing invariant representations based on higher and higher moments for computing representations that are completely invariant to a group of transformations. So I won't go into the detail here, but then here's another technique in deep learning, which is very popular for the instance retrieval problem. So for the image classification problem, what you would do is, you first, for any problem today, what you would do is first go train your network on the hardest possible problem ever, that you can find, right?
So you would start off by training on ImageNet. And then, let's say, you had full classification. You would sort of take that network, and then fine-tune for your specific problem, right? Image retrieval, what do you do, is, I mean, there's no notion of classes. So you typically fine-tune based on triplet networks.
So what you do is, you first train your network in the hardest possible problem, for example, ImageNet. Then bring it back for your specific database, and you tell this network, look, you define a bunch of matching pairs and non-matching pairs. So the matching pairs are images that correspond to the same object, non-matching pairs are pairs that correspond to different objects. And you say, network, bring the matching pairs closer and the non-matching pairs like further apart.
So the triplet networks learns and fine-tunes the parameters and adapts it for like a given data set. So that's fine-tuning. There's R-CNN for image object detection. People have proposed something similar for the image retrieval problem as well. So R-MAC is what R-CNN is for object detection. R-MAC is this equivalent for image retrieval, where people do clever pooling across a bunch of regions, to then obtain a representation.
So a bunch of recent papers on this, one from ICLR, just this year. Then people have also started, so with image retrieval there's still a lot of benefits from the previous pipeline. With image retrieval, there's a lot of good stuff before deep learning came along as well. Good stuff, like interest point detectors, are still very much relevant today, because they give you precise scale and rotation invariance.
A lot of these global descriptors are really powerful as well, because they aggregate a bunch of local descriptors, which are then hashed, so people have started combining these two as well. The beauty of these end-to-end learning systems is you can throw all kinds of crazy stuff into the network. So here's an example of where people have thrown the entire global descriptor pipeline, which is the aggregation stuff I talked about, into the CNN pipeline, where you learn everything end-to-end with backdrop.
And that gives you a more powerful representation, rather than learning some of these things independently.
AUDIENCE: VLAD stacks the global descriptors that you chose, right?
VIJAY CHANDRASEKHAR: Yes, they all belong to a class of global descriptors. So VLAD is one specific one, which aggregates these residuals that I talked about. So hashing, so people have applied like RBMs to it. So, typically, these descriptors tend to be extremely high dimensional. Like typically, it can go up to like tens of thousands.
So how do you come up with extremely compact representations? That's very important for retrieval, because you're searching like very large databases. So people have used like stacked RBMs, give like the best performance, for coming up with extremely compact representations for some of these very high dimensional descriptors. So a few references here.
So that was a quick whirlwind tour of deep learning for image retrieval. So if you guys are interested, you can always go and look up some of all those important references that I found. So challenges ahead, what are like some really interesting problems to work on? So still like image retrieval. It works very well when things are like objects are rigid, planar, textured, right?
Like that's where like still state of the art pipelines nail it, right? But textureless objects are still pretty hard. There's still plenty of databases where you have 50% like MAP. 3D objects are hard, reflective surfaces are hard, transparent objects are hard, non-rigid objects are hard. So these are still hard, especially if you want to achieve like very high precision retrieval.
And for lots of image retrieval applications, like high precision is really important. Like I want precision one. I want to determine, I just don't want the wrong answer. I just want just the right answer. And when that's the case, like all these scenarios are still very challenging.
So so far we've seen image retrieval. So video is the next frontier. So if you think of like all the products in the world, there probably aren't more than a million products. But it's very easy to get billions of video frames, right? So how do you retrieve from very large databases? So video is the next frontier. And for most of these large video databases, the performance drops pretty quickly as for the size of the database grows.
So how do you make your retrieval more robust, especially when database size increases? So that's another interesting problem to work on. For streaming mobile augmented reality, there's still lots of interesting challenges there. So think of this as the Pokemon Go scenario, right? If you played Pokemon go right now, there's no visual search in the loop.
Like the Pokemon Go character just sticks on some surface, and then runs around, right? But like a few years down the road, hopefully, you want Pokemon Go to also understand the environment around you, recognize all the objects there, from a large database, and then do interesting overlays. So that's actually a lot of data. How do you build systems like that? That's still quite challenging, because if data has to go back and forth between a server, you won't have real-time performance.
There's very little work on mathematical modeling of retrieval systems. With any of these retrieval systems, there's a lot of parameters, right? Typically if you have codebooks, you typically have a bunch of parameters associated with the codebook, like you can change knobs and the number of features in the query image, versus how many features extract on the database side.
How many images down the list should I check, to achieve a certain precision level? What's my performance as the size of the database grows? So I think it's a little difficult to go run each one of these experiments, as you change parameters. What would be nice to have is some mathematical modeling so that gives you intuition on how to choose like hyperparameters.
This is a pretty exciting area of topic. There's very little work on this. So there's some that came out a few years ago. I think for image retrieval, like all the work before, pre-deep learning, is just as relevant as the ones that came after. So how do you combine the best of both worlds.
You want to have some of the best of like SIFT, like the scale, rotation invariance that came from the interest point detector. But you also want the more powerful representations that come from CNNs. How do you sort of combine them, et cetera. So that's also interesting to work on.
So hashing is an interesting problem to work on. So these are like some state of the art results. So on the x-axis is the size of the descriptor in bits. On the y-axis, you have recall, right? So, ideally, as you want, even at 64 bits, so 64 bits is a lot, right? So smaller the representation, the faster you can search.
Like 64 bits can represent like 2 to the power of 64 items in the universe, right? So that's a lot. But then if you look at these curves, performance starts dropping very rapidly as the size of the hash decreases. So how do we come up with representations that are really compact, but have really good performance? So that's also a fun problem to work on.
So anyway that's like some of the interesting challenges in image retrieval. I think there are lots of interesting problems to work on, lots of open problems as well. And finally, this is the last bullet I had, so opportunities ahead. So I think we have a panel discussion afterwards. So maybe there'll be some interesting food for thought here.
So deep learning has been called a $500 billion opportunity in the last 10 years. So I'm always a bit skeptical of some of these reports. But for what it's worth, that's a couple of graphs there. So there are two graphs there. So there's hardware and there's software and services. So hardware is going to be about $5 billion a year 10 years from now, and then software is going to be a hundred, right?
And I think what's really interesting about AI is that it's sort of touching every single sector. Like people are using these algorithms, these tools, for improving productivity, for increasing revenue across all these sectors. So it's not unimaginable that this is a half a trillion opportunity in the next 10 years.
So what does the deep learning landscape today look like? So roughly, like if you think about it like going from the bottom, so you have hardware. So pretty much Nvidia today dominates hardware. So they've built some of the best hardware out there for deep learning.
What they've done really well is they provide SDKs, and they've accelerated every piece of software out there on their hardware, making their ecosystem extremely sticky. So they have CUDA DNN, cuBLAS, et cetera, so they keep optimizing all the software to run on their platform. And it's amazing how a lot of fundamental work that was done in FFTs and all that, it's all back now again, because if you can sort of reduce training time by even a factor of 2, that's still significant, because when you're training over, let's say, a few hundred thousand hours of data, it still takes weeks.
So if I can improve even my FFT by a factor of 2, using some technique that was proposed in the 1960s, I can bring that training down to like a week from two weeks. So Nvidia is providing a lot of these libraries with a tremendous amount of software in the open source. So pretty much everyone at this point has open sourced their deep learning stack. TensorFlow is very popular today, Caffe has done a great job. Microsoft has CNTK, like Facebook uses Torch and Tiano, Twitter had something as well, I forget the name.
But pretty much everyone's deep learning stack is in the open source. So you have the hardware and then you have these platforms, and then on top you have all kinds of interesting applications which pretty much build on these same building blocks. A lot of these building blocks are also available in the open source today. So there is building blocks for vision, speech, like behavior, et cetera.
So that's sort of what the landscape looks like. I think maybe during the panel discussion, we can talk about sort of interesting companies that have been sort of built at every step. So I think, at the hardware level, we just saw a big exit, a company called Nirvana, which got acquired by Intel for about $400 million. There are plenty of exits at the platform level as well. Like we have MetaMine here today.
There's also like Hinton's company, which got acquired, there was Deep Mine, which got acquired. But I think for starting one of those platform companies, I think you had to start it like a few years ago, because now, what was considered like very esoteric like two or three years ago, is now pretty much, a lot of it's common knowledge. And then there are obviously all kinds of interesting end-to-end applications.
So here's an example of what we call a full stack company, where you build an end-to-end system. So Mobileye, which was I think one of Tommy's students, is now a billion company. They have a camera system that recognizes pedestrians real time. It's used for like assistive driving, autonomous driving, et cetera. So all kinds of interesting companies to be built across the stack.
Hopefully, let's end with this, I just wanted to end with saying that AI is not going to kill us. This is crazy talk. Thing is, that's the last slide I had. Any questions?