It's about time. Modelling human visual inference with deep recurrent neural networks.
December 16, 2020
December 12, 2020
Tim Kietzmann, Donders Institute for Brain, Cognition and Behaviour
All Captioned Videos SVRHM Workshop 2020
APURVA RATAN MURTY: He obtained his PhD from the University of Osnabruck, where he was-- and he was also a visiting student with a friend at Vanderbilt. He was a postdoc with Professor Konig at the University of Osnabruck before becoming a research associate at the MRC cognition and brain sciences unit at the University of Cambridge. And he's currently an assistant professor at the Donders Institute for Brain, Cognition, and Behavior at Radboud University.
His work fits nicely at the nexus of human and machine visual intelligence. We are excited to hear all about this in his talk, "It's all about time. Modeling human visual inference with deep recurrent neural networks."
TIM KIETZMANN: Great. Yeah, thanks for the introduction. Thanks for the invitation. "It's about time," I mean the talk is about time and about how we use recurrent connectivity in deep neural networks to kind of better understand visual inference in the human brain.
So when I start-- so I should say that I thought a lot back and forth, gave Arturo a hard time not telling him what the title of the talk was, because I went back and forth between whether I should talk about ecoset, which is a new data set to train deep nets on, which are new object categories, or this. Now, given the last two talks being about data sets, maybe ecoset would have been a good choice, but now it's something a bit different maybe. Maybe that's good, too.
So when I show you this image, almost immediately everyone is able to extract meaning from it. You see the pixels, but your brain within a fraction of a second can tell that this is a little girl holding maybe her newborn sibling, and that she has maybe mixed feelings about holding a baby, or having a sibling. And that's remarkable.
And our visual system is just generally very good at these visual tasks. It's very fast. It's very versatile. And this is the community that I think appreciates this most of all communities. This is maybe a bit boring introduction, but OK. So the visual system is fast and versatile. It's very reliable. It's very energy efficient. And it's very data efficient.
So in my lab what we try to understand, like many of us here, is, how does the brain do that? How can it be so quick and yet so robust and so efficient? And in addition to trying to understand how the brain does that, we also would like to know, can we learn from looking at the brain and improve machine vision at the same time? Which means that we're in this midway between computer vision and cognitive neuroscience where we try to improve our understanding on the one hand, and on the other hand try to improve machine vision.
And deep nets really offer themselves for this sort of task between the two [INAUDIBLE], between the two worlds, because they can work as computational models of inference in the brain. And at the same time, many of them are image computable, which means they give us a handle on computing representations from pixels, which is what machine vision is ultimately about. So we think that deep nets are really ideal for this sort of dual task.
I think many of us have started using deep nets since they've basically revolutionized computer vision, having been around for a long time. But AlexNet in 2012 coming onto the scene gave a lot of people-- left a lot of computer vision researchers in shock that they could do this. And it didn't take long for neuroscientists to also appreciate deep nets as models of brain function.
What I'm showing you here is data from humans and macaques. I don't want to go into details, but it basically says that computer vision models trained on the task of object categorization happen to be quite good models of what the human brain is doing if we look at neuroimaging data, or also at macaques and firing rates. So today DNNs are really the best image-computable models that we have for predicting the primate ventral stream, which doesn't say they're perfect. It just says that currently this is the avenue that we take at understanding human vision, and they're currently the best models for doing so.
It didn't take long from these early papers who started to show that there are, in fact, some similarities between deep nets and how the brain solves the task of object recognition for many other labs and many people to join in. And now the field is full of computer vision DNNs.
And the reason for that is, in part, that it's computational convenience. We use these models, these feed-forward models because they're there. We can download AlexNet. We can download VGG. We can run our models on it and compare it. And that's great. And it's very useful. And in terms of CO2 emission, that makes a lot of sense.
But we must be aware that with many of them, most of them are feed-forward architectures trying to categorize. And we often predict time average data. So one thing that we think is missing from the equation here for understanding the brain, and maybe improving computer vision is the temporal domain, or recurrence.
So if you want to understand the brain, ideally, you would like these three things, which I took from Marieke's and Niko's paper from 2014. We'd like to know what, when, and where, which means we'd like to know for the ventral stream or basically everywhere in the brain, we'd like to know, what is the distinction that is currently being made? What is the representation like? How does it change across time? And how does it change across space?
And so ideally we would like to have models that do the same. We'd like models that don't only predict temporal average data, because the code can change through time. We want models that basically do this. We want models that reproduce the whole representational trajectory as information enters the retina and goes from one ventral stream area to the next.
But also within a given ventral stream area or visual area, we'd like to know how the code changes through time in that very area. So we need both. We need variance in space. And we need variance in time. And all of that, ideally, should be in one big deep neural network model that then is a good model of what the brain does, if it follows the same representational trajectories as the brain.
So the problem is with many of the models that we took and that we took out of convenience that I said already, they're feed-forward. So dynamics are kind of out of the question there. But the problem is that a lot of these models that we typically use to make inference about the brain are doing two steps at a time, basically. They assume a given architecture. That could be a feed-forward architecture. And they assume a given task. Let's say object categorization, but that's two assumptions at a time.
And in this project that's published last year, we kind of took a step back and said, well, let's try and let's play with different architectures. And let's try to enforce the architectures to be as brain-like as we could possibly get them. And let's not even train them to be doing a given task. Let's train them to be brain-like.
And then there will be some architectures who do really well in this task. And some architectures who are not very good at this task. And the ones who are not even good at this task if we force them to are maybe not good candidates for being a good model of the brain. That's the rationale, basically.
So how do we do that? How do we try to enforce brain-like representations in deep nets? This is an extension to what's known as representational distance learning, which we call dynamic representational distance learning. And the idea is super simple. The idea is that you show objects to a brain, and you extract for a given region at a given point in time the representational pattern, let's say, for seeing an elephant or a pineapple here to the-- can you see my cursor, Arturo?
ARTURO DEZA: Can you move it again?
TIM KIETZMANN: Now you can.
ARTURO DEZA: Yeah. We can see. We can see.
TIM KIETZMANN: OK, good. So basically you show an elephant and a pineapple to a brain. And you extract the distance between the activation patterns at a given point in time and space. And let's assume that distance would be 0.8.
You can do the same with a deep net. And let's assume the distance was 0.2. And the realization is that you can treat this as an error. You could say if this were a model that's more brain-like, it should be 0.8. It shouldn't be 0.2.
And you can use this to define a loss function and to drive learning in these networks. The point is that if you have enough images, in this case, we had 92, you can compute these distances for all pairs of objects. So in this case, we get a bit less than 5,000 distances. You can enforce this whole geometry of what things are treated the same way or differently in the brain, and you can do that in the deep net.
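As a rough sketch of the idea described here (not code from the talk): the distances for all stimulus pairs are computed for the model and compared against the brain's distances, and the mismatch is used as a loss. The use of correlation distance, the mean squared error, and all function names are assumptions made for illustration; the talk does not specify them.

```python
import math
from itertools import combinations

def correlation_distance(a, b):
    """1 - Pearson correlation between two activation patterns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def pairwise_distances(patterns):
    """One distance per unordered stimulus pair (upper triangle of the RDM)."""
    return [correlation_distance(patterns[i], patterns[j])
            for i, j in combinations(range(len(patterns)), 2)]

def rdl_loss(model_patterns, brain_distances):
    """Mean squared mismatch between model and brain distances (assumed loss)."""
    model_distances = pairwise_distances(model_patterns)
    squared = [(m - b) ** 2 for m, b in zip(model_distances, brain_distances)]
    return sum(squared) / len(squared)

# Number of unordered pairs for the 92 stimuli mentioned in the talk.
n_pairs = 92 * 91 // 2

# Toy model patterns for three stimuli (hypothetical values).
toy_patterns = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
toy_model_distances = pairwise_distances(toy_patterns)
```

With 92 stimuli this gives 4,186 pairs, which matches the "bit less than 5,000" distances mentioned above. In an actual training setup the loss would be differentiated with respect to the network weights; this sketch only shows how the error signal is formed.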
All that you observe in the brain you can enforce in the deep net. And that's really promising, because it doesn't only allow us to test different neural network structures for the ability to mirror the brain. It also enables us to directly inject, if you want, neural data into these networks. And maybe that will help us get them more robust. Who knows? It's one of the avenues of research in the lab.
So let's do this. Let's look at MEG data where the code changes in different regions across time. And let's take two networks and force them this way to be brain-like and see which one is better. And we'll do two things. So we'll-- the first model will be known as ramping feed-forward, or we called it-- it's not known for it-- we called it ramping feed-forward.
And it's a feed-forward network. But each unit has a connection to itself, which means it can slowly integrate evidence over time. It can ramp up its activity over time. This gives feed-forward networks some non-linear dynamics, but still information is only flowing from the bottom to the top of the network.
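To make the ramping behaviour concrete, here is a minimal sketch of a single unit with a self-connection (not code from the talk). The ReLU nonlinearity, the scalar weights, and the names `w_in` and `w_self` are assumptions for illustration; `w_self` plays the role of the learnable ramp-up parameter.

```python
def ramping_unit(inputs, w_in=1.0, w_self=0.5):
    """Activity over time of one unit with a feed-forward weight and a self-connection."""
    h = 0.0
    trace = []
    for x in inputs:
        # Bottom-up drive plus the unit's own activity from the previous step.
        h = max(0.0, w_in * x + w_self * h)
        trace.append(h)
    return trace

# A constant input, as for a static image shown for five time steps.
trace = ramping_unit([1.0] * 5)
```

With a constant input the activity rises step by step towards a plateau, so the unit integrates evidence over time even though nothing flows laterally or from the top down.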
And then we have what Niko termed BLT models. It's not bacon, lettuce, and tomato, but bottom up, lateral, and top down connectivity. And those are basically unrolled recurrent neural networks-- unrolled convolutional networks.
I have so much to talk about today. So I'll really just go through this very, very briefly. And I'm happy to answer questions and everything else also in the paper.
What you see here basically is going to be a movie. This is V4 and LO. This is what we extract from the human brain. This is going to be a movie that shows you how the representations change across time, how distances between different objects that you see in the left change over time.
And this is what the recurrent model predicts the dynamics should be. The model has never-- none of the models that we test here have seen these stimuli, but they've seen these categories of stimuli before during training. But this is their prediction of what the response to these stimuli should be.
And we train them on-- we do our homework. We train them on 1/2 of the data. We test them on the other 1/2 and so on and so forth.
So to get a visual idea of which model is good at doing this task and which is not I'll just play this movie. And what you'll see is that there's a lot of stuff happening in the brain. Even though you're within one region, the code changes quite dramatically over time. And this loops around.
And you can see that the recurrent model here in the center is not perfect, but it's quite closely able to at least track the large scale organizational changes in V4, whereas the feed-forward model, even though it has non-linear dynamics that it can learn, so it can learn the ramp-up parameter, only very initially shows a bit of nonlinear change. And then it kind of settles on a good average.
And we can, of course, quantify this. And for all regions that we tested across the ventral stream, the recurrent models are very much better able to follow the dynamics that we observe in the brain compared to these parameter-matched feed-forward models. So really from this what we take is that even if you want to-- this is 150 milliseconds already. So even in the earliest processing parts of the response, recurrent models will be better able to capture what's happening in the brain.
What I promised you in the start was, though, that maybe if we understand something about the brain we can get models to be better at the task. So let's look at that next. Let's look at computational benefits of recurrence.
And this is work that was led by a PhD student, Courtney Spoerer. And it's published in PLOS CB. So let's assume we have a feed-forward model. This is now a computer vision type task. This is trained on ImageNet, actually.
Let's assume you have a feed-forward model, just a convolutional network. And now we can add lateral connectivity to it. So units get-- we unroll it through time, and units can now get information from surrounding units.
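A minimal one-dimensional sketch of what unrolling a layer with lateral connections through time can look like (not code from the talk). The 1D layout, the zero padding, the ReLU, and the choice that the first time step is purely feed-forward are assumptions for illustration, as are all names.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def weighted_neighbours(signal, kernel):
    """Weighted sum over each unit's neighbourhood, zero-padded to keep the size."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                total += w * signal[j]
        out.append(total)
    return out

def bl_layer_unrolled(image, bottom_up_kernel, lateral_kernel, n_steps):
    """Activity of one layer at each time step for a static input."""
    drive = weighted_neighbours(image, bottom_up_kernel)  # identical at every step
    h = relu(drive)  # assumed: the first step has no lateral input yet
    states = [h]
    for _ in range(n_steps - 1):
        # Each unit receives the previous activity of its surrounding units.
        lateral = weighted_neighbours(h, lateral_kernel)
        h = relu([d + l for d, l in zip(drive, lateral)])
        states.append(h)
    return states

states = bl_layer_unrolled([0.0, 1.0, 0.0], [1.0], [0.5, 0.0, 0.5], n_steps=3)
```

The lateral kernel is an extra set of weights on top of the bottom-up kernel, which is why the parameter-matched control models discussed next are needed.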
Adding lateral connections adds a whole lot of parameters. So if that model were better than the base model, you could say, well, it's just more parameters, so really, it's not that surprising. So we enter three more control models, B-K, B-F, B-D. BK is just larger kernel sizes. BF is more feature maps per layer. And BD is just more layers.
The point is that B has 11 million parameters. B-K and B-F have 14 million parameters. And B-D and BL have 30 million parameters. So these models are on the same order of magnitude in terms of parameters. All quite closely match what BL has in terms of parameters.
So what we'll do now is we'll take these models for a spin. We'll train them on ImageNet. And this is the top-1 accuracy that you see here. And you can see that BL is-- note that this does start at 0. But so BL is better at doing this task compared to B-D, B-F. B-K seems to be overfit to the task because it has more parameters, but it does worse than B. So overall, we think that adding this lateral connectivity helped solve ImageNet, or perform better at ImageNet.
Now, what people tell us when they see this is, yeah, that's great, but now if this expands through time, I have to wait a bit longer for my results. Maybe I don't want that.
And so what we plot here on the right is that we can cut the recurrent model short. After each time point, we can ask for a response. And now what we do is we compute the entropy of the probability distribution at the output of the model at every time point. And if the entropy dips below a threshold, we think the model is certain enough. And then it gives a response. And that way we can get reaction times from the model. If it's certain enough, it gives a response.
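The entropy rule described here can be sketched as follows (not code from the talk). The softmax readout, the use of natural-log entropy, the fallback to the last time step when the threshold is never reached, and the function names are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats (the unit is an assumption)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reaction_time(logits_per_step, threshold):
    """Return (time step, predicted class) for the first step, counted from 1,
    whose output entropy is below the threshold. Falls back to the final step."""
    if not logits_per_step:
        raise ValueError("need at least one time step")
    for t, logits in enumerate(logits_per_step, start=1):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            return t, probs.index(max(probs))
    return len(logits_per_step), probs.index(max(probs))

# Hypothetical outputs of a 3-class model that grows more certain over time.
toy_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [6.0, 0.0, 0.0]]
rt_lenient, choice_lenient = reaction_time(toy_logits, threshold=1.0)
rt_strict, choice_strict = reaction_time(toy_logits, threshold=0.1)
```

A lenient threshold lets the model answer early, a strict one makes it wait for later time steps, which is the speed-accuracy trade-off that comes from lowering the threshold.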
And now what we can do is we can change this threshold going from 4 down to 0. And you can see that if we change, if we lower the threshold, the model will get better and better and better. But it will also take longer to compute.
The interesting point here is that these three models, which are B, B-F, and B-D, in terms of floating point operations, they're round about the same as the recurrent model. So the recurrent model isn't slower. It's just as good as the feed-forward models at the fixed number of floating point operations. You can just choose to let it run for longer, and then it gets better. That's the point here.
So what we can do in addition with this sort of model that now gives us reaction times, true reaction times, though the time steps of the model are now the reaction times, is we can compare it to human reaction time data. This is data collected by Ian Charest, who is in Birmingham. It was collected at the CBU, still in Cambridge.
And now we can ask, how well can-- if we show these stimuli-- and this is an animate-- sorry, I should have said this is an animate-- this is a speeded animate classification task, which means you see an image. And you need to say is it animate or not as fast as possible.
Now we can train our models on the same task. We train them on, let's say, ImageNet. And then we train a new readout that says is it animate or not. We can now compare the model reaction times to human reaction times.
This is what I'm showing you here. In the top row, you have how well human reaction time patterns correlate with human reaction time patterns. So that's what we call human consistency.
And then red, you see how well our recurrent models work, again, not having seen or being fit to these data, but being trained on a separate set of data to do this animacy detection task. And we do cross-validate to fit the threshold, the entropy threshold, to get reaction times out.
And compared to all the models that we tested, VGG, ResNet, DenseNet, Inception, and so on and so forth, and our control models, the recurrent models clearly outperformed these sometimes very deep feed-forward networks in their ability to predict human data.
You may ask, how did you even-- how did we even get feed-forward models to give us reaction times, because they basically compute once, and then you get that. We basically treated each layer as a time step. So we trained readouts for each layer. And then we do the same entropy game as with the recurrent model.
But you go to different depths into the network. We call that a reaction time. So I'm aware that this is a bit of a fire hose talk, but I want to talk about this third project, too, which is why I'll take a quick break here and kind of reminisce on what we've done so far.
So we've trained recurrent models. And we enforce them to be brain-like. And we found that adding lateral and top-down connectivity made these recurrent models a bit more brain-like in terms of following the same representational dynamics as what we observed in human observers in MEG.
And then I've shown you that adding lateral connections to feed-forward models increased their performance on ImageNet and ecoset, which is a different data set. This is similar to ImageNet in scale. And we've shown that we can use this entropy trick to get reaction times out of these recurrent models. And it turns out that if you train these recurrent models on ImageNet or ecoset, then their predicted reaction times are currently the best models that we have for predicting this behavioral data set by Ian Charest.
So now in the remaining minutes I want to talk about this new project, which isn't published anywhere. So this is-- I wouldn't-- I wish I could say it's hot off the press. It's not even in press anywhere. It's just the sort of thing we're doing right now.
So I call it closing the loop, because what I haven't done yet is following the normative approach in terms of modeling neuroimaging data. Earlier we enforced models to be like the brain, but we haven't trained models on a given task and then tested how brain-like they were, which is what Brain-Score and these data sets, these benchmarks, do as well.
So the rationale is really we train a deep recurrent network to categorize visual input. And then we test how well the internal representations agree with neuroimaging data. We use RSA for that.
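A minimal sketch of the comparison step (not code from the talk): the off-diagonal entries of the model RDM and the brain RDM are rank-correlated. Using a Spearman correlation over the upper triangle is a common RSA choice and is an assumption here; the talk only says that RSA is used. All names and the toy RDMs are hypothetical.

```python
import math

def upper_triangle(rdm):
    """Off-diagonal entries above the diagonal, one per stimulus pair."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]

def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rsa_score(model_rdm, brain_rdm):
    """Spearman correlation between the two RDMs' upper triangles."""
    return pearson(ranks(upper_triangle(model_rdm)),
                   ranks(upper_triangle(brain_rdm)))

# Toy 3-stimulus RDMs: same ordering of distances, different scale.
model_rdm = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
brain_rdm = [[0, 10, 20], [10, 0, 40], [20, 40, 0]]
score = rsa_score(model_rdm, brain_rdm)
```

Because only the rank order of distances matters, the two toy RDMs agree perfectly despite their different scales. In a searchlight analysis the same score would be computed once per brain location, per model layer, and per time step.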
Now, this data set that we're testing on is peculiar. It's, I think, the greatest neuroimaging data set out there. It's the Natural Scenes Dataset, which is spearheaded by Thomas Naselaris and Kendrick Kay. It's 7-Tesla fMRI data. It's eight participants. It's 73,000 different stimuli that they showed to these participants over tons of sessions. And it's 10,000 images with three repetitions per participant.
And so this is super high SNR. And you get RDMs that are basically 10,000 by 10,000. They give you a huge variety of different images and brain responses to this. And this project is headed by Ian Charest who's putting all the pieces together, taking the NSD Dataset and giving it another twist that I'm going to talk about in a second.
And what we're going to test is the BL network that I just showed you a bit earlier. It's a model that has feed-forward and lateral connectivity. And we'll train it on ecoset, which I would have loved to talk about, but didn't have time today. It's a new data set that will hopefully come out soon. It's 1 and 1/2 million images, mirroring the 565 most concrete and most frequent basic level categories in the English language.
So what are we going to do with all this? We're going to take these thousands of images. We're going to show it to participants in the 7-Tesla. And we'll extract RDMs from their [INAUDIBLE] responses. We'll show the same images to a pre-trained model. And we'll get RDMs. Since this is the recurrent model, we'll get different time points and different layers.
And what I'm going to show you next is the results of a searchlight approach where we took each time point in each layer in the model. And we searched through the whole brain and marked how well the RDMs of the model agreed with what we found in the brain. So I hope this wasn't too fast.
So I'm going to show you a movie now. So this is the representation of the agreement between the recurrent CNN and 7-Tesla data on this NSD Dataset. And it's going to show eight time points per layer before jumping onto the next layer. And you can see marked in gray what the current layer is. So let's look at this movie.
So you can see that across time, the spatial arrangement is sort of similar, but it kind of changes in intensity. But as we go across the layers, I hope it will be clear that different brain regions will light up. It was early brain regions in the early few layers. Now we're onto high level regions. And in the final layer of the model we end up with this beautiful pattern where the last time step of the model, last time step in the last layer of the model agrees well with the regions highlighted in red here, which are clearly not early visual areas, but they are higher level visual areas.
The question is now, what are these areas? And what are they doing? How can we understand this? I mean, it just shows you that the later layers in the network agree well with these regions. But we don't know why.
So here comes Ian with this great idea. So this is what we've done so far. This is just basically a snapshot of the final image that I showed you in the movie. This is the agreement of the recurrent model and the NSD Dataset at the final time step and final layer. So we basically showed an image to the model, and now this is the agreement.
What we can do in addition is we can get an image caption. So for this image-- I don't know how visible it is for you, but it's basically a young boy sitting on a bed with a lamp beside it. This is an image caption.
And this gives you an idea of the semantics. This isn't talking about low level elements in the scene. This isn't saying there's a bright spot to the top right and a red curtain in the top right that has certain vertical patterns. So this is a very high level semantic description, language or linguistic description of what happens in the scene.
And what Ian did was he took it, and threw it into the Google Universal Sentence Encoder, which is a linguistic embedding space called GUSE. And we can do this with all of the NSD Dataset scenes, which also have captions. And we can get an idea of where in this linguistic embedding space these different concepts are. And we can again compute an RDM of the similarity matrix and go searching for these regions in the brain as we just did with our deep neural network representations for the visual case.
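A minimal sketch of turning caption embeddings into an RDM (not code from the talk). The vectors below are made-up stand-ins, not outputs of the Universal Sentence Encoder, and the use of cosine distance is an assumption for illustration.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def embedding_rdm(embeddings):
    """Square matrix of pairwise cosine distances between caption embeddings."""
    n = len(embeddings)
    return [[0.0 if i == j else cosine_distance(embeddings[i], embeddings[j])
             for j in range(n)]
            for i in range(n)]

# Hypothetical 2D embeddings for three captions; a real sentence encoder
# would return one high-dimensional vector per caption.
toy_embeddings = [
    [1.0, 0.0],  # "a young boy sitting on a bed"
    [2.0, 0.0],  # "a child sitting on a bed"
    [0.0, 1.0],  # "a plane on a runway"
]
caption_rdm = embedding_rdm(toy_embeddings)
```

The resulting matrix has the same shape as an RDM computed from brain or network responses to the same images, so it can be compared against the brain data in exactly the same way.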
And interestingly enough, if you do that, if you use GUSE and run the same analysis on the brain data, pretty much the same areas light up, which tells you that-- and that's interesting, because this is data that's collected from people seeing these images in the scanner. So it's a visual paradigm, but it's a prediction to the right. It's a prediction that's derived from image captions that someone else somewhere wrote about this very image. So it goes across observers. It goes across modalities, because it's from image to language. And it happens to agree well with the same regions that we observe to agree well with the image-computable recurrent deep net.
OK. I'll wrap up here. Many of the projects of the lab deal with recurrence. And we think recurrent connectivity is key for modeling neural dynamics, for modeling behavior, for better usage of parameters in computer vision tasks, and maybe, as I showed you last, for mapping from pixels to actual cross-domain semantic information.
With that, I thank the members of the lab and my collaborators and every one of you for your attention. Thanks.