THINGS: A large-scale global initiative to study the cognitive, computational, and neural mechanisms of object recognition in biological and artificial intelligence
Date Posted:
December 16, 2020
Date Recorded:
December 12, 2020
Speaker(s):
Martin Hebart
Description:
Martin Hebart, Max Planck Institute for Human Cognitive and Brain Sciences
PRESENTER 1: Martin obtained his PhD from the Bernstein Center for Computational Neuroscience and the Berlin School of Mind and Brain with John-Dylan Haynes. He then completed postdoctoral training with Hans [INAUDIBLE] at the University Medical Center in Hamburg and later at NIMH with Chris Baker, supported by a prestigious Humboldt fellowship.
He is now an independent research group leader at the Max Planck Institute for Human Cognitive and Brain Sciences in Leipzig, Germany. And we are excited to hear his talk today, titled THINGS: a large-scale global initiative to study the cognitive, computational, and neural mechanisms of object recognition in biological and artificial intelligence. Take it away, Martin.
PROFESSOR 2: Thanks a lot, Brandon. Yeah, and thanks again also for this fantastic workshop. I'm really, really looking forward to all of this. Before I start, I just briefly would like to mention-- like, dedicate this talk to Leslie Ungerleider, who sadly passed away yesterday, which left many of us in shock. I want to mention that I believe she was a giant in the field of neuroscience and an inspiration to many of us but also a mentor who cared deeply for students and postdocs. And I think she will be sorely missed by the community.
I would like to come back to my talk about the initiative that we called THINGS, which we started several years ago. Now many of us here at this workshop are interested in understanding the similarities and differences in object recognition between, on the one hand, brain and behavior and, on the other hand, artificial intelligence. And I think there are two important reasons for this.
So number one is that we want to learn from in silico models, which can help us understand recognition in the brain and then perhaps ultimately allow us to treat disease. On the other hand, we are interested in building better models, and we would like to learn, perhaps, from the mistakes that [INAUDIBLE] models make and from how problems are solved in humans or other biological intelligence. Now our lab is mostly working on understanding visual recognition. So the focus of my talk will be on understanding object recognition.
So what are the typical approaches that we use for relating brain, behavior, and artificial intelligence? Well, typically, what we do-- and this is just one of the many approaches that exist-- is that we take a deep convolutional neural network and pass images of different object categories through it. We then extract the activations from, let's say, the penultimate layer or any layer that we're interested in studying. And this gives us a bunch of activation vectors for different object images. And now we can do the same with the brain.
We can present participants or animals with different stimuli and then focus on specific parts of the brain. Or we could look at behavior. But in this particular case, we would be extracting something similar-- in this case from biological units-- for a bunch of different stimuli. And then in the last step, we would be comparing these extracted representations between the artificial units on the one hand and the biological units on the other hand, for example, by using regularized regression, representational similarity analysis, canonical correlation analysis, et cetera.
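To make this kind of model-brain comparison concrete, here is a minimal sketch using representational similarity analysis, one of the methods mentioned above. The arrays are random placeholders, not the actual THINGS data, and the layer and voxel counts are illustrative assumptions.

```python
# Minimal RSA sketch: compare a network layer's activation vectors with brain
# response patterns for the same stimuli via their dissimilarity structure.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

n_stimuli = 100
model_activations = np.random.rand(n_stimuli, 4096)  # e.g. penultimate-layer units
brain_patterns = np.random.rand(n_stimuli, 500)      # e.g. voxels in a region of interest

# Representational dissimilarity matrices (condensed form): one entry per
# stimulus pair, using correlation distance.
model_rdm = pdist(model_activations, metric="correlation")
brain_rdm = pdist(brain_patterns, metric="correlation")

# Second-order comparison: rank-correlate the two RDMs.
rho, p = spearmanr(model_rdm, brain_rdm)
print(f"model-brain RSA correlation: rho = {rho:.3f}")
```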
OK, so now a lot of focus has been placed on understanding this link, and a lot of focus has been put on the model architectures themselves that we are trying to understand. So people are comparing, let's say, different deep convolutional network architectures and trying to figure out which best explains the brain. And other people have focused a lot on trying to improve the link by introducing different methods that allow us to better capture these similarities and differences. But one aspect which I think is really important, and which people may want to focus on a little bit more, is the stimuli that we are using and presenting both to deep convolutional networks and to the brain.
And now what we ultimately are interested in understanding, I think, in object recognition is something that can be characterized as a real world object space. So this is just like a two dimensional depiction of what our object representations may look like in the brain or in deep convolutional neural networks. But what we're really interested in understanding is the real world object space in the brain.
Now in order to capture this, there are different approaches that we can take, and I just want to focus here on one dimension that has been particularly important, I think, which is the number of different object categories. Very traditional experimental approaches have used a very small number of carefully controlled stimuli, and with this approach we've learned a lot about representations in the brain. But at the same time, as you can see in this depiction, we can only cover a very small part of this representational space in each given experiment.
Now there are other approaches which use a broader stimulus set and which are also known as condition-rich designs. These designs allow us to capture perhaps a broader part of the representational space. But at the same time, it's possible that we are limited to the specific parts of this representational space that we think are important or useful. For example, if we only look at animate or inanimate stimuli, then we might be missing a lot of other parts of the representational space that we might care about.
Well, finally, of course what we could also do is make use of large and broad existing machine learning databases, which would allow us to immediately compare the results to deep neural network representations-- in the case of ImageNet, up to 1,000 different object classes. But I guess as many of you know, many of the individual images in ImageNet are of very low quality, and I think we would have to dig pretty deep to find a subset that is actually useful for experiments in humans or in other animals and whose images would all have, let's say, the same aspect ratio, be of sufficient quality, not have watermarks or text on them, et cetera. And I think that's an issue we can address, and I know that this issue has been addressed in the past. For example, there's the BOLD5000 dataset, which [INAUDIBLE] very nicely.
But to understand real-world object recognition, I would say that what is important is how well we actually cover this real-world object space. For example, in ImageNet, many of the classes were selected randomly. So this leaves open whether there are important parts of this object space that are not covered. And of course, this likely doesn't matter for studying low-level or mid-level vision, since a broad set of images will likely cover those low-level or mid-level representations already.
But if we only capture a part of this representational space for high-level recognition [INAUDIBLE] for object categories, then it's possible that we're missing the big picture. Another potential downside of using such image sets is the range of categories used, which may over-represent certain image classes relative to their importance for humans. And even worse, humans might be picking up on these biases, which could affect how they look at those images.
So what is then the ideal way of studying, of covering this representational space? Well, I think it's probably something that looks a bit like this. And this would give us a very systematic view of object representations.
Well, what I've done here is I've painted a systematic and very regular grid on these object spaces. And I think the only way we can really address this is by introducing a second dimension here, which I would term experimental control or systematicity. And this might then allow us to really reach this goal of having very broad and also representative coverage.
And this is not only important for relating humans to artificial intelligence. It actually also matters for the study of animal models and their relationship to humans. But it also matters for each of these individual parts-- so even if you're only interested in studying, let's say, objects using functional MRI in humans, or even if you're interested in studying monkey behavior. So this all motivated us to develop the THINGS database.
So how did we go about and do this? How can we actually get like a broad coverage that is in some way representative? And of course, there's going to be biases in the way that we approach it as well. But at least then we chose like a systematic approach for getting at this.
First what we did is we created a large set of different object concepts where we had a compilation of roughly 7,000 concrete nouns in the American English language. And then what we did is we presented participants on Amazon Mechanical Turk with example images for this very, very broad set of objects. And then we asked them, well, what is the object that you're seeing?
For example, if we showed them a picture of, let's say, a carp, then they might respond with fish, which would suggest that the category carp is perhaps not so important for their visual discrimination between different objects. For that reason, we would then drop it. And this approach left us with a set of 1,854 unique object concepts that were covered.
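In spirit, this naming-based filtering step looks something like the toy sketch below. The responses, the 50% threshold, and the exact keep/drop rule are made up for illustration and are not the published criteria.

```python
# Toy sketch of the concept-filtering idea: keep a candidate noun only if
# workers' free naming responses for its example images predominantly match
# the noun itself; otherwise drop it (e.g. "carp" tends to be named "fish").
from collections import Counter

naming_responses = {
    "carp": ["fish", "fish", "fish", "carp", "fish"],
    "aardvark": ["aardvark", "aardvark", "anteater", "aardvark"],
}

def keep_concept(concept, responses, threshold=0.5):
    counts = Counter(responses)
    modal_name, modal_count = counts.most_common(1)[0]
    # Keep the concept if its own name is the modal response above threshold.
    return modal_name == concept and modal_count / len(responses) >= threshold

kept = [c for c, r in naming_responses.items() if keep_concept(c, r)]
print(kept)  # ['aardvark'] -- "carp" is dropped because people say "fish"
```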
In the next step, we took these object categories and carried out a large-scale web search for images of these objects. And we thought this would actually be rather easy. But it turns out that for a large number of these object categories, there are very, very few high-quality images available on the internet. And this whole process took us more than a year, to accumulate at least 12 high-quality examples per category, with natural backgrounds, that are, to a certain degree, controlled for their visual appearance.
In the end, we ended up with 26,107 images in the final database. And just to show you some examples, here are pictures of burritos. Here are pictures of cows. And you can see that they also vary quite nicely in their visual statistics. And here are pictures of vending machines.
If you're interested in using the THINGS database, you can find the OSF [INAUDIBLE] here, where it's all freely available. So as the next step, we were actually interested in collecting brain data using this THINGS database. And of course, there are a lot of different images that we could present to participants, which doesn't allow us to do this in a single experimental session.
So what we ended up doing is we collected brain data in humans using fMRI over the course of 12 different sessions, to address where in the brain objects are represented. And we used MEG, also over the course of 12 sessions, to investigate when these objects are represented. And to stabilize the head position between these different sessions and get really comparable results, we used individualized head casts.
In fMRI, we ended up with three participants. We used 720 of the object categories with a total of around 9,000 individual images. And in MEG, we had four participants and a total of around 22,000 different images.
I just want to give you a quick preview of the quality of the results that we can expect to find. So this is work that's currently being conducted by Oliver Contier. He's a PhD student in our lab. And what you see here are maps of the noise ceiling, where everything that's in very light colors means that we can actually explain a lot of variance in the data.
And what you can see is that in all three participants, you can find very nice noise ceiling estimates, going up to almost perfect noise ceilings, actually. You can also see individual differences. So for example, in participant one, we have much, much more widespread activations, but this participant was also truly a trooper and didn't move at all in the scanner over the course of all 12 sessions.
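A generic way to estimate such a per-voxel noise ceiling is a split-half reliability with Spearman-Brown correction. The sketch below uses synthetic data and is not necessarily the exact estimator used for the THINGS-fMRI analyses.

```python
# Rough sketch of a per-voxel noise-ceiling estimate: split the repeated
# presentations into two halves, correlate the mean responses, and apply the
# Spearman-Brown correction.
import numpy as np

def split_half_noise_ceiling(responses):
    """responses: array of shape (n_repeats, n_stimuli) for one voxel."""
    n_rep = responses.shape[0]
    half1 = responses[: n_rep // 2].mean(axis=0)
    half2 = responses[n_rep // 2 :].mean(axis=0)
    r = np.corrcoef(half1, half2)[0, 1]
    # Spearman-Brown correction for having split the data in half.
    return 2 * r / (1 + r)

# Example with synthetic data: 12 repeats of 720 stimuli for one voxel.
rng = np.random.default_rng(0)
signal = rng.normal(size=720)
responses = signal + rng.normal(scale=2.0, size=(12, 720))
print(f"estimated noise ceiling: {split_half_noise_ceiling(responses):.2f}")
```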
So this is very interesting. I'm really looking forward to these results. Now with MEG-- and this is work done in collaboration with Lina Teichmann at NIMH-- we are already one step further. And what we look at here is the representation of different object dimensions that may matter for the representation across time. And we can see that this gives us a very, very detailed and very fine-grained picture of how these objects are represented over the course of individual milliseconds.
So this is again a very exciting development. But we're not the only people who are collecting or who have collected data using THINGS. Indeed, there's a wide range of different groups who are currently collecting brain data, or planning on collecting brain data soon, for animals and humans using EEG, humans using intracranial recordings, ECoG, you name it.
At the same time, there's also [INAUDIBLE], who's working on developing a deep neural network using THINGS. So many of these datasets will, over the course of the next few months or even years, become available to the general public, and we hope that this will serve as a benchmark not only to test deep neural network models but also for relating different species and different measurement modalities to each other. If you are interested in contributing to THINGS, I would be very happy if you contacted us, and then perhaps we could coordinate this effort.
Now as the next step, I would just briefly like to mention some of the research that we have done using THINGS. And one of the central questions, which I think is a long-standing issue in cognitive science, is: what are the core dimensions underlying our mental representation of objects? For example, a dalmatian dog can be characterized by a number of different features, such as that it has four legs, has a tail, is black and white, has spots, et cetera.
But of course, there are a lot of dimensions that may not matter so much for discriminating it from other objects-- for example, the fact that it appears in Disney movies. So the question that we asked here is: what is the core set of dimensions that allows us to distinguish objects from each other? Now to address this issue, we used a three-pronged approach. First, we used a triplet odd-one-out task, where participants would tell us which object was the least similar to the other two.
We used online crowdsourcing and collected 1.46 million responses using this task. And then we developed a computational model, which we termed sparse positive similarity embedding (SPoSE), which mimics the process that we think is going on when people carry out this task. So we compute similarity using a set of object dimensions, predict the choice, compare this choice to human behavior, and then update the weights using a number of loss terms.
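The modeling loop described here can be sketched roughly as follows, in the spirit of the approach but heavily simplified: each object gets a non-negative embedding, similarity is a dot product, and the predicted odd-one-out follows a softmax over the three pairwise similarities. The embedding size, learning rate, sparsity weight, and the way choices are coded are illustrative assumptions, and the actual loss terms differ in detail.

```python
# Simplified sketch of a sparse, positive triplet embedding model.
import torch

n_objects, n_dims = 1854, 49
emb = torch.zeros(n_objects, n_dims).uniform_(0, 0.1).requires_grad_()
optimizer = torch.optim.Adam([emb], lr=0.001)
l1_weight = 0.001

def triplet_loss(i, j, k):
    """i, j, k: index tensors for the three objects in each triplet; the
    modeled choice is the *pair* judged most similar (the third object is
    the odd one out)."""
    sims = torch.stack([
        (emb[i] * emb[j]).sum(-1),  # pair (i, j) most similar -> k is odd one out
        (emb[i] * emb[k]).sum(-1),  # pair (i, k) most similar -> j is odd one out
        (emb[j] * emb[k]).sum(-1),  # pair (j, k) most similar -> i is odd one out
    ], dim=-1)
    # Assume responses are coded so that (i, j) was always the chosen pair.
    target = torch.zeros(i.shape[0], dtype=torch.long)
    ce = torch.nn.functional.cross_entropy(sims, target)
    return ce + l1_weight * emb.abs().mean()  # cross-entropy + sparsity penalty

# One illustrative update on a random batch of triplets.
i, j, k = (torch.randint(0, n_objects, (256,)) for _ in range(3))
loss = triplet_loss(i, j, k)
loss.backward()
optimizer.step()
with torch.no_grad():
    emb.clamp_(min=0)  # enforce the positivity constraint after each step
print(float(loss))
```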
How well is this model doing? What you can see here is individual choice behavior for individual trials, and the noise ceiling denotes the best possible performance any model could achieve. And what you can see is that this model is actually doing a pretty good job. We are very, very close to the noise ceiling in terms of explaining individual trial behavior in humans.
Now if you're interested in the similarity between different objects, here is the predicted similarity for a subset of 48 different objects. And here is the true similarity that we measured, where we got a complete measure of all similarities. And you can see that there is a very, very nice correspondence.
So now this is interesting. This gives us a model that allows us to capture behavior very well. But can we actually also learn something about the representations that humans have?
What are the core dimensions? It turns out that the restrictions that we imposed on our model, the sparsity constraint and the positivity constraint, tend to leave us with very interpretable dimensions. So now what we can do is we can just inspect these dimensions and see what they look like.
Here are just a few example dimensions. Some of them are highly categorical-- for example, something that's animal-related. But others are related to colors, to value, even to shape-- something circular, something with a fine-grained pattern, or something related to fire-- many, many different things.
Now we can turn this whole process around, and we can characterize individual objects using these different dimensions. For example, a peacock would be characterized as being animal-related, plant-related, or green, colorful, something valuable, special, and having a repetitive pattern. A rocket could be characterized as being artificial, something transportation-related, sky or flying-related, shiny because of the fire, and maybe something fire-related.
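Reading off an object's profile in this way amounts to sorting its embedding weights. The labels and values in the tiny sketch below are invented purely for illustration and are not the actual learned dimensions.

```python
# Tiny sketch of "turning the process around": read off the highest-weighted
# dimensions for a single object from a (hypothetical) learned embedding.
import numpy as np

dimension_labels = ["animal-related", "transportation-related", "colorful",
                    "fire-related", "valuable/special", "repetitive pattern"]
peacock = np.array([0.9, 0.0, 0.8, 0.0, 0.6, 0.7])  # hypothetical weights

top = np.argsort(peacock)[::-1][:3]
for d in top:
    print(f"{dimension_labels[d]}: {peacock[d]:.2f}")
```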
So this is interesting. And one thing that we're asking ourselves, and it's something I would like to show you in the context of this workshop, is to address the question, well, what actually makes deep neural networks similar to humans, and what actually are the differences? Of course, we are able to compare deep neural networks directly by using these different methods like regression or representation similarity analysis.
But this only gives us a number for how well they correspond. It doesn't necessarily tell us what the similarities are-- what these representations have in common-- or what the differences are. Now using these dimensions, Lukas Muttenthaler, who's a grad student in our group, used this approach for comparing the dimensions in humans to similar dimensions extracted from VGG-16.
These are very recent results, and what you can see here on the left is for a given dimension what human behavior would look like. And then we try to identify the most similar dimension extracted from VGG-16. And you can see that this indeed looks very, very similar.
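One way to picture this matching step is as a simple correlation search over dimensions: correlate each human behavioral dimension (a weight per object) with every candidate network-derived dimension across the shared objects and keep the best match. The array shapes and names below are placeholder assumptions, not the actual analysis pipeline.

```python
# Sketch of matching a human behavioral dimension to its most similar
# DNN-derived dimension by correlating weights across objects.
import numpy as np

n_objects = 1854
human_dims = np.random.rand(n_objects, 49)  # human embedding: objects x dimensions
dnn_dims = np.random.rand(n_objects, 100)   # VGG-16-derived embedding (assumed shape)

def best_matching_dimension(human_dim, dnn_dims):
    """Return the index and correlation of the DNN dimension most correlated
    with a single human dimension across objects."""
    corrs = np.array([np.corrcoef(human_dim, dnn_dims[:, d])[0, 1]
                      for d in range(dnn_dims.shape[1])])
    best = int(np.argmax(corrs))
    return best, corrs[best]

idx, r = best_matching_dimension(human_dims[:, 0], dnn_dims)
print(f"most similar VGG-16 dimension: {idx} (r = {r:.2f})")
```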
Now importantly, what this also allows us to do is look at: OK, these different dimensions are correlated, but what are the differences in these representations? And what you can see here is that, for example, in humans, this dimension continues to represent green things or plants, whereas in VGG-16 it seems to be doing something different. And I think that's something interesting that we can then explore further, to try to understand why VGG-16 is not representing these things similarly to humans.
Just to give you another example, here is something that I think everyone would agree is representing the color red, both in human behavior and in VGG-16. But again, if you look at the differences, something that in some cases obviously belongs to the class of red things is not identified in VGG-16, and vice versa: VGG-16 seems to respond to things that indeed contain something red but that for humans would not be dominated by the color red or its attribution. So again, I think that's something interesting that's worth addressing.
We have an ongoing collection of metadata for THINGS. This is something that we've done in collaboration with Wilma Bainbridge, where we've collected THINGS memorability: we asked around 14,000 participants on Amazon Mechanical Turk to give us memory ratings, or carry out a memory task, which gives us a memorability score for all of the 26,000 images in THINGS. And I think it would be very, very interesting to relate this to deep neural network activations but also to brain activity in humans and in different species.
We are currently carrying out rating tasks with Laura [? Stransky ?] and Jonas [INAUDIBLE] in our lab, where people are doing image naming for all 26,000 images. We have property ratings-- for example, size. We also get image segmentation, and [INAUDIBLE] is currently collecting feature production norms extracted from GPT-3 for all of these 1,854 different concepts, which, again, I think would be very interesting and very useful because it could give us a model of the semantic representation of all these different objects. So with that, I would like to thank the Vision and Computational Cognition group and the machine learning core at NIMH, with whom we've conducted a lot of the behavioral research I just spoke about, and the Baker Lab, where I've collected the MEG and fMRI data--