Characterizing models of visual intelligence

Date Posted:  December 16, 2020
Date Recorded:  December 12, 2020
CBMM Speaker(s):  David Mayo
  • SVRHM Workshop 2020

PRESENTER: Our next speaker is David Mayo. David is currently a research specialist at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and at the Center for Brains, Minds, and Machines, where he is advised by Professor Boris Katz and Andrei Barbu. David will be presenting his work on characterizing models of visual intelligence.

DAVID MAYO: Hi, I'm David. I'm actually now an incoming PhD student at MIT. I want us all to understand how good our object detector models are and how well they perform in the real world. More specifically, how similar are they to brains? And what part of the visual system have we been able to capture with what our models currently do?

So these aren't idle questions, because we really have to understand how good our models are in order to make progress in deploying safe systems and in using them in the real world. At the same time, better understanding the relationship between brains and our models can really help guide the future development of our models and the directions we should be moving in.

So if you just look at raw performance on many of our benchmarks today, you would think that computer vision is well on its way to being solved. Performance is really not that far from human level, and in some very recent cases it may even have surpassed people, leaving us behind entirely on specific tasks.

But let's take a look at this paradox. What we do is take ImageNet images and show them both to our models and to people. And what we're comparing is just the accuracy, the overall recognition performance across many images of objects.
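As a rough illustration of that kind of comparison, here is a minimal sketch of measuring a model's top-1 accuracy on a folder of labeled images; the model choice and paths are placeholders, and the folder's class indices are assumed to match the model's output indices.

```python
# Minimal sketch: top-1 accuracy of a pretrained classifier on a labeled
# image folder. Paths and the model choice are illustrative placeholders.
import torch
from torchvision import datasets, models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# NOTE: ImageFolder assigns class indices alphabetically by folder name;
# they must line up with the model's output indices for this to be meaningful.
dataset = datasets.ImageFolder("path/to/images", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

model = models.resnet50(pretrained=True).eval()

correct, total = 0, 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"top-1 accuracy: {correct / total:.3f}")
```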

And we can see that these accuracy numbers are pretty similar and very human-like for ImageNet. But when we run these models in the real world, and this is an example of running a detector over a Mr. Bean movie, we see that objects blink in and out of existence between frames, and many objects are not being picked up at all.

For example, there's a bucket and some boats in here that are just not getting picked up, even though they are in the object detector's set of classes. And we can see that there's a clear drop-off in performance between what we'd expect from these ImageNet benchmarks and what's actually going on when we try to use our object recognition models.

So to try to address this and quantify this generalization gap, my work has been building a data set called ObjectNet. When we use ObjectNet, which is a set of images captured to look much more like the real visual world and to control for a set of specific visual features, we can see that, in terms of overall accuracy, there is a large performance gap between what people are able to do and what our machines are currently able to do.

So I'll play this quick clip to give you an idea of how we collected our data set. Basically, we have thousands of Mechanical Turk workers on the internet who go through a system that I've developed to capture images of objects in different locations, on different backgrounds, and in different rotations. These objects are all collected in the workers' own homes and are naturalistic, regular, everyday objects they use.

By doing this scavenger hunt and starting with a labels-first approach, we end up collecting a data set of images that looks very different from other computer vision data sets. In particular, this is an example from ImageNet for an exercise weight. These weights often appear on white backgrounds or in the gym, and they're often correlated with people and with how they're being used.

They're usually shown from canonical perspectives and viewpoints, and they're taken for artistic purposes, which makes sense because these are images web-scraped from Flickr.

When we control for the rotation of objects, the viewpoint they're imaged from, and the background they were taken on (these objects were photographed in the bathroom, the kitchen, or the family room, entirely unrelated to the object class), it becomes a much more realistic-looking data set of what weights might look like in your home or in the real world. And it becomes much harder for object recognition models.

Here's another example. This is a bottle opener in ImageNet, usually seen highly correlated with bottles and with people, as well as in canonical views, such as just sitting out on a table. And these are ObjectNet bottle openers, which appear in all kinds of different viewpoints and rotations, have a lot more variability within the class, and are generally much more like the real visual world.

So as you can see, there's a striking difference in even just a couple of classes. And this striking difference, controlling for just a few parameters, adds up to a large generalization gap. This top-1 accuracy curve, the light blue line for ImageNet, which in recent years (if we extend the x-axis out even further) has reached and maybe even surpassed human-level performance, is much lower when we run these models on our ObjectNet real-world images with visual controls. It's about a 40% to 45% drop, and that gap has been preserved over time.

Zooming in a little further and looking at specific classes, there are actually many classes in ObjectNet where these models perform well and many where they don't perform well at all. Interestingly, I've plotted these bar plots, where each bar represents the accuracy of many models aggregated together.

And interestingly, the object classes where models were already doing well have improved significantly, so that's where a lot of the gains have been, while the object classes where models never did well have seen very little performance improvement. We've actually seen this trend across even more fine-grained factors as well, like rotation and viewpoint.

It's also very unpredictable what makes object categories difficult or easy. A random selection of some of the worst categories would be milk, a coffee/French press, a bench, and a dishrag. And some of the easiest ones, which are also difficult to explain, are things like a plunger, a safety pin, or a hairdryer.

Further work has built on ObjectNet recently. A recent paper on measuring robustness to natural distribution shifts has run many, many models on both ImageNet and ObjectNet and compared them.

And here there are blue dots, orange dots, and green dots, and these represent different classes of models. We have vanilla models, we have some models that were trained for additional robustness, and then, importantly, these green dots are models that were trained with massively more data.

So overall, running many models, there's still this large generalization gap between ImageNet performance and ObjectNet performance. And the few models that have actually managed to break this trend, improving more on ObjectNet than we'd expect just from their increase in ImageNet score, are models that were trained with massively more data. But this data is on the order of hundreds of millions, or even billions, of images, and the relative improvement you get for that massive increase in data is fairly small.

So using ObjectNet, we can see that our models are far from human-like. But accuracy is really a fairly coarse metric: everything in the model, and even in the brain, has to go right in order to get to that final output answer. So we can instead look at neural recordings and try to compare those directly to the activations of our models.

To do this, my recent work, in collaboration with Colin Conwell from Harvard, has been taking data from the Allen Brain Institute, which includes two-photon calcium imaging of individual mouse neurons for about 256 mice and 120 images.

These same images are presented to both the models and the mouse brains. We then take the activations from the models and the neural responses from the mouse brains, do a little bit of processing to dimensionality-reduce the model activations, and combine the neurons across the different mice.

And then the question we want to answer is: how predictive are these activations from our models of the actual neural responses? This builds on the Brain-Score work. Here, we're looking at a metric based on regressing from our model activations to the neurons and measuring predictivity.
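As a hedged sketch of that kind of predictivity metric (not the exact pipeline from the talk), assuming you already have a matrix of model activations and a matrix of trial-averaged neural responses for the same images:

```python
# Sketch of a neural-predictivity metric: reduce model activations with PCA,
# then use cross-validated ridge regression to predict each neuron's response
# and report the mean R^2. File names and hyperparameters are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

# model_acts: (n_images, n_units) activations from one model layer
# neural_resp: (n_images, n_neurons) trial-averaged responses to the same images
model_acts = np.load("model_activations.npy")
neural_resp = np.load("mouse_neurons.npy")

# Dimensionality-reduce the activations before regression.
acts_pcs = PCA(n_components=40).fit_transform(model_acts)

scores = []
for n in range(neural_resp.shape[1]):
    preds = cross_val_predict(Ridge(alpha=1.0), acts_pcs, neural_resp[:, n], cv=5)
    scores.append(r2_score(neural_resp[:, n], preds))

print("mean neural predictivity (R^2):", np.mean(scores))
```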

And when we do that, the first key result is that our trained models are much more brain-like than entirely randomly initialized models. Here on the left are ImageNet models that have been trained; each represents a different model. And on the y-axis we have an r-squared score, so how predictive the activations are of the mouse brain neurons.

Then we also have our entirely randomly initialized models. The blue-green lines here connect the same model architecture randomly initialized versus ImageNet-trained, and in every case, the ImageNet-trained models become more brain-like. But much like what we see with ObjectNet, we still have a long way to go before we can fully explain what's going on inside the brain.

Another key result we can get out of this, which is difficult to see with accuracy alone, is suggestions for how we can improve our models to make them more brain-like. What we've seen by ranking the many models we've run on these metrics is that having more layers is correlated with having a higher brain score, a higher mouse brain score.

Also, models that are narrower as opposed to wider, and models that have fewer parameters, score better and are more brain-like.
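As a rough illustration of how such architecture-versus-brain-score trends can be quantified (not the exact analysis from the talk), one could rank models by a property like depth and compute a rank correlation against their brain scores; all numbers below are placeholders.

```python
# Sketch: correlate an architectural property (here, depth) with brain score
# across models using a rank correlation. Values are illustrative placeholders.
from scipy.stats import spearmanr

model_depths = [18, 34, 50, 101, 152]          # e.g., ResNet family depths
brain_scores = [0.21, 0.24, 0.26, 0.27, 0.28]  # hypothetical mouse brain scores

rho, p = spearmanr(model_depths, brain_scores)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```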

Additionally, we found that our mouse brain score metric is highly correlated with the primate brain score, showing that we now have another animal model, with a set of many more neurons, that we can use to compare our deep learning models to brains. We think this is particularly interesting, and we think a lot of this could be accounted for by earlier visual processing, where the mouse brain is a bit more similar.

So all of these metrics compare full models to full brains. But what if models aren't capturing everything that the ventral visual system does? It turns out that by masking the input, which means showing people images and then disrupting their processing with a visual mask afterwards, we can compare models against partial processing in human brains.

So by tuning how long you have to process an image, that is, how long you see it for, we can dial in how much processing your ventral visual system can do. Of course, this mapping isn't perfect, because the mask doesn't fully interrupt all of the visual processing in your brain. But the disruption will leave a signature in terms of a decrease in your accuracy, or in the specific error patterns that humans make when they're limited in time.

One exciting hypothesis we're interested in is: since our models are feed-forward, do they better match humans who have been time-limited? If a human has only 150 milliseconds or so to process an image, that really only leaves enough time for one pass through the ventral visual system.

To set up these kinds of experiments, look more closely at humans, and compare them to our models, we've designed an experiment where our test subjects first look at a fixation cross. The fixation is there so that the object that appears immediately afterwards requires no [INAUDIBLE] for them to be able to recognize it. And the object is resized based on the subject's distance from the screen.

That way, it's inside their fovea's field of view. Then an object appears, followed by a backward mask, that is, a mask shown right after the image. What's important here is the timing, shown in red. We've been running these experiments with durations of about 60 milliseconds to 230 milliseconds to see what happens when humans are time-limited and only have feed-forward processing to rely on to recognize objects.
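For concreteness, here is a minimal sketch of a single backward-masking trial, assuming a PsychoPy-style setup; the stimulus file names and the 100-millisecond target duration are illustrative, not the actual experiment code.

```python
# Sketch of one backward-masking trial, assuming a PsychoPy-style setup.
# Stimulus files and durations are illustrative; a real experiment would
# lock presentation to screen refreshes rather than rely on core.wait.
from psychopy import core, visual

win = visual.Window(size=(1024, 768), color=[0, 0, 0], units="pix")  # mid-gray

fixation = visual.TextStim(win, text="+")
image = visual.ImageStim(win, image="target_object.png")
mask = visual.ImageStim(win, image="noise_mask.png")

# Fixation cross so the object appears at the fovea without an eye movement.
fixation.draw(); win.flip(); core.wait(0.5)

# Target image for a controlled duration (e.g., somewhere in 60-230 ms).
image.draw(); win.flip(); core.wait(0.1)

# Backward mask to interrupt further processing.
mask.draw(); win.flip(); core.wait(0.5)

win.close()
```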

These are very preliminary results; we've run this on many of our lab members. So far, what we've found is that there seems to be a large discontinuity when we set this up as a binary choice problem. There's a category A and a category B, and you pick which of the two categories applies to the object you just saw, which makes this dashed line here the 50% chance level for random guessing.

And we've seen that we can find this discontinuity where, at a certain presentation time, people are suddenly able to recognize objects, and if we go too fast, they're really not able to. We've zoomed in and focused some more experiments on this discontinuity region, where they're first able to start doing recognition.

Then we start to look at the error patterns that humans make. We set up an experiment with the time limited to that short duration at which people are just able to recognize objects, in a one-out-of-30-category test. These are 30 categories selected from ImageNet that also overlap with ObjectNet classes. The correct labels for the images are on the y-axis, and these are the responses we got back.
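As a small illustration of how such error patterns can be tallied, here's a sketch of building a confusion matrix from trial responses; the category names and responses below are placeholders, not actual data from the experiment.

```python
# Sketch: tallying error patterns from an n-way forced-choice task as a
# confusion matrix. Labels and responses are illustrative placeholders.
import numpy as np

categories = ["speaker", "microwave oven", "match", "candle", "ladle", "frying pan"]
true_labels = ["speaker", "speaker", "match", "candle", "ladle"]
responses = ["microwave oven", "speaker", "candle", "candle", "frying pan"]

idx = {c: i for i, c in enumerate(categories)}
confusion = np.zeros((len(categories), len(categories)), dtype=int)
for t, r in zip(true_labels, responses):
    confusion[idx[t], idx[r]] += 1  # rows: correct label, columns: response

print(confusion)
```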

It's interesting to decipher some of the error patterns that are going on. A few examples: speakers are mistaken for microwave ovens, which we think is likely because of the square shape, and it often happens with interesting lighting.

Matches and candles are often mistaken for each other, which could be because subjects saw something on fire, or because of the brightness or general color. There may also be some shape confusions among things with handles, like ladles and frying pans.

So we're really interested in zooming in and looking at, now that we can control for different presentation times, whether the error patterns differ across times. Potentially, are feed-forward humans a bit more like feed-forward models? And can we compare deeper models, or unroll a ResNet to deeper layers, and see if the error patterns change as we use deeper models, and compare that to people given additional processing time?

We're also interested in human vision specifically, trying to characterize these different discontinuities in accuracy. There's certainly one here, but it's not clear how quickly accuracy saturates. And there are also a number of interesting edge-case images, some that are very easy and some that require much more processing time.

We're using ObjectNet for some of these experiments now, particularly because ObjectNet has so many examples that are difficult for humans to recognize. It has some images you can stare at for a few seconds before you're finally able to put together what you actually saw.

So we clearly need models that are more human-like and more robust in the real world. To encourage that, we're announcing an ObjectNet competition. This was developed by the great team at MIT, in collaboration with the team at IBM.

The competition is set up like this. We want teams to come up with better data sets they can use to train their models, better models that can actually generalize to the real world, or other kinds of innovations, and to train those models on their own servers. They can then package these models up in Docker containers and submit them to us through EvalAI.

And with this system, the test images remain entirely hidden. We've collected a new 6,000-image ObjectNet data set that has never been released before. It looks just like ObjectNet, but was collected a bit more recently.

Models can be submitted to our competition once per day inside these Docker containers, and the evaluation will occur entirely on our side. That way the images remain hidden, and we can also limit how frequently models are submitted, which keeps ObjectNet truly a test set and prevents people from overfitting to it or using it for too much fine-tuning.

We've also created three different competition tracks. We have the traditional one, which is what we report a lot of our results on: training a model entirely on ImageNet and then testing it on the 113 ObjectNet classes that overlap with ImageNet. But we also have two tracks for training on any data set: bring your own data, and then evaluate either on those 113 overlapping ObjectNet classes, if you're using those categories.

Or you can attempt to use our full ObjectNet set, which has 313 categories of all kinds of different objects in it. And with this, we're releasing starter code in both PyTorch and TensorFlow, where it's very easy, once you've trained a model, to stick in your model weights and update the model description files and the data set transforms for however your model was trained.
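As a hedged illustration (not the actual starter code), plugging your trained weights and input transforms into an evaluation script might look roughly like this in PyTorch; the architecture, file names, and normalization values are placeholders.

```python
# Illustrative sketch of dropping trained weights and input transforms into an
# evaluation script. This is not the official starter code; the architecture,
# file names, and normalization values are placeholders.
import torch
from torchvision import models, transforms
from PIL import Image

# Use whatever transforms your model was trained with.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Swap in your own architecture and trained weights.
model = models.resnet50()
model.load_state_dict(torch.load("my_model_weights.pth", map_location="cpu"))
model.eval()

def predict(image_path):
    image = transform(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(image).argmax(dim=1).item()

print(predict("example_objectnet_image.png"))
```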

Then you upload this to us, and we'll evaluate it and let you know how well your models are doing on real-world data.

So this competition opens this coming Monday. It was announced recently on the IBM blog, and it will end during CVPR in June 2021, which gives teams about six months to work on this problem.

If you're interested in participating, please check out objectnet.dev. There will be links to the code repositories and to EvalAI, where you can read the docs and learn more about submitting.