Using neural decoding to study object and action recognition in the human brain (1:03:53)
April 20, 2016
July 15, 2015
CBMM Summer Lecture Series
Leyla Isik, post-doctoral researcher at MIT and Boston Children's Hospital, explains how to use neural decoding to study object and action recognition in the human brain. By decoding the information contained in MEG signals measured in human observers viewing images of visual scenes and objects, Dr. Isik shows how object representations in the brain that are invariant to size and position develop in stages over 150 ms. Action representations in the brain, generated while viewing videos of humans performing different actions, are extracted within 200 ms and are immediately invariant to changes in actor and viewpoint.
Leyla Isik's website

Isik, L., Meyers, E., Leibo, J. Z. & Poggio, T. (2014). The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology, 111(1), 91-102.

Isik, L., Tacchetti, A. & Poggio, T. (2018). A fast, invariant representation for human action in the visual system. Journal of Neurophysiology, 119(2), 631-640.
LEYLA ISIK: My name is Leyla Isik. And I'm going to be talking today about what I did during my PhD in Tommy Poggio's lab. And so the title of my talk is Using Neural Decoding to Study Object and Action Recognition in the Human Brain.
So neural decoding refers to the application of machine learning to analyze neural data. So just a brief overview of what I'm going to talk about. I'm going to give a background on invariant object and action recognition, which is the topic that I used neural decoding to study during my PhD. I'm going to talk about the tools and methods I used. So first, magnetoencephalography, or MEG, is the type of neuroimaging I used. And then I'm going to give a brief overview of basic machine learning concepts for those who aren't familiar. And then I'm going to get into the research, so I'm going to talk about how we used MEG decoding to study the dynamics of invariant object recognition and action recognition. And please feel free to interrupt me at any point if you have questions.
All right. So this is a video of a scene probably similar to the one you just experienced. It's my lab mates and I getting food before meeting. And what you can notice watching it, or what anyone navigating the scene would have to do, is be able to recognize objects, people, and their actions. So more specifically, you can do object detection where you pick out the different objects and you can add labels to them. Like you can say, these are people, that's food, that's the table, et cetera. And if you see the people, you can then recognize their actions. So you can see some people were talking, picking up food, eating, et cetera, and this seems effortless for you.
However, it's a very hard problem to solve computationally, as evidenced by the fact that even state-of-the-art computer vision systems still don't match human performance on most of these tasks. My interests lie in studying how the human brain solves this problem so that we can develop better computer vision algorithms to do the same thing.
And so what makes this problem so challenging are transformations that change the visual appearance of objects, such as changes in the object's size, position, viewpoint, et cetera. So this is a clip from that last video of somebody in the scene eating. This is a second video clip of the same person eating. And this is a third clip of that person drinking. So what you might notice is that if you were to just compare these videos on a pixel-by-pixel basis, the last two videos look a lot more similar than the first two, but all of you can easily tell that the first two videos are of someone eating and the last one is him drinking.
So in other words, you're able to generalize across all the different transformations between the first two videos and still make fine discriminations between actions like eat and drink. And like I said, this might be something-- this is something you probably do without even thinking about it, but it's a very challenging computational problem.
So what do we know about how the human brain solves this problem? So this is a diagram of visual cortex. In the back in blue is V1, which is primary visual cortex. It is the root-- the first area of cortex where visual input is received. And from there the visual signals split roughly into two different pathways. The ventral stream, which is shown in purple along the bottom, and the dorsal stream in green. And roughly speaking, people like to designate these as the "what" and "where" pathways. So people think the ventral stream is involved in object recognition and the dorsal stream is involved in things that require spatial information-- action recognition, those sorts of things.
However, this is a very simplified view of visual cortex. Another way to look at it is this block diagram-- I know you can't read this; I'm just putting it up to give you a sense of how complex the system actually is. So each of these blocks is a different visual area. You can see the dorsal stream areas are on the left and the ventral stream is on the right. And there's crosstalk between all the different areas, there's feedback, all those sorts of things. So this is why it's so challenging to study this problem.
But again, roughly speaking, we like to think of this, at least the early visual responses, as being hierarchical, meaning that there are many different visual layers and the output of one layer serves as input to the next. And so like I said, there's primary visual cortex, where the cells respond to oriented lines and edges. And then the top layer-- I'm talking about the ventral stream now, sorry-- is IT, the inferior temporal cortex, and there you have cells that are selective for whole objects. So a cell might fire in response to a face but not a car, and its response is invariant, meaning that it's able to generalize across different transformations. So it fires in response to that face regardless of what viewpoint, size, et cetera, you show it at, whereas cells in V1 are very selective and prefer a bar at a specific orientation and that's it.
So the question really is, how do you get from simple representations of lines and edges to whole, complex, object representations? So to study this I used magnetoencephalography or MEG. It's a noninvasive neuroimaging technique. So a lot of you probably either work on or have heard about fMRI in one of these previous lectures, so this is complementary to that.
So MEG works by detecting the magnetic fields that are induced when many neurons fire synchronously. And we're talking about on the order of tens of millions of neurons that need to fire for you to pick up the signal. And because it's a direct measure of neural firing, it has millisecond temporal resolution. However, these magnetic fields are extremely weak-- several orders of magnitude weaker than the Earth's magnetic field. For example, the magnetic fields that we're measuring are on the order of 10 to the negative 12th tesla, while the Earth's magnetic field is 10 to the negative fifth tesla.
So to measure them-- this is a picture of the MEG scanner downstairs-- the subject will sit in this magnetically shielded room to try and block out as much noise as possible. And this cap of sensors will sit around their head. And that cap has 306 sensors that are distributed all across the head.
And so the subject will sit there while we measure their brain activity and while they view images on the screen. And so every millisecond we get a new reading in each of the 306 sensors of their brain activity. So unlike fMRI, it has very good temporal resolution. However, it tends to have worse spatial resolution than fMRI, so that's the main trade-off between the two different types of methods.
The way we analyze this data is with a decoding analysis. So I mentioned that that's the application of machine learning to the neural data. So like I said, the subject sits in the scanner, looks at images like this one on the screen. We record the activity from the 306 sensors distributed across their scalp. And then at each time point I can take that vector of their 306 sensor responses and I can put it into a machine learning classifier. And this classifier will give me a prediction, once I train it, about what image the subject was viewing based only on their brain activity. So if the classifier were to predict face, that would be correct, but if the classifier were to predict car, that would be incorrect. And we can use the accuracy of this prediction to assess what information is present in the neural signals.
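The step just described-- take a 306-dimensional sensor vector, train a linear classifier on labeled trials, predict the stimulus for a held-out trial-- can be sketched in a few lines. This is a hypothetical, minimal version: the data is synthetic, and a nearest-centroid classifier (a simple linear classifier) stands in for the one actually used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors = 306  # one reading per MEG sensor at a given time point

# Hypothetical synthetic data: each class ("face" vs "car") has its own
# mean sensor pattern, plus trial-by-trial noise. Real MEG trials would
# replace this.
patterns = {"face": rng.normal(0, 1, n_sensors),
            "car": rng.normal(0, 1, n_sensors)}

def make_trials(label, n=50, noise=1.0):
    return patterns[label] + rng.normal(0, noise, (n, n_sensors))

train = {lab: make_trials(lab) for lab in patterns}

# A minimal linear decoder: classify a new trial by the nearest class
# centroid in 306-dimensional sensor space.
centroids = {lab: x.mean(axis=0) for lab, x in train.items()}

def predict(trial):
    return min(centroids, key=lambda lab: np.linalg.norm(trial - centroids[lab]))

test_trial = make_trials("face", n=1)[0]   # a held-out trial
print(predict(test_trial))
```

The accuracy of `predict` on many held-out trials is exactly the quantity plotted in the decoding results later in the talk.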
Interestingly, one sensor can pick up information from anywhere in the brain, but the strength of the signals it measures drops off with the square of the distance. So the reason MEG has such bad spatial resolution is because you're only measuring about 300 sensors, but in general you're trying to recover the activity in, say, 10,000 voxels in the brain.
And so the problem of going from 300 to 10,000 is very ill-posed-- there are infinitely many solutions-- though there are some tricks to get around this. So people often say that the spatial resolution of MEG is on the order of centimeters. And in order to have a magnetic field that's strong enough to be detected, around 50 million neurons need to be firing.
All right, so a bit of background on machine learning for those who may not be familiar. It's going to be very basic, so I apologize to those of you who have done it before, but hopefully it gives you some basis to understand the rest of this talk and other talks you might hear that apply machine learning to biological data.
So machine learning is a branch of computer science and artificial intelligence that uses algorithms to recognize patterns in data and make predictions about that data. So like I said, you can either do a method where you try and tease out patterns from data, or you can make new predictions. And it's really permeated our culture right now. Is everything OK?
CREW: Yeah, let me just grab this phone, sorry.
LEYLA ISIK: Yeah, no problem. So can anyone name a popular example application of machine learning that you use in your everyday life? [INAUDIBLE]
LEYLA ISIK: Exactly, yeah-- so how they recommend videos for you to watch on YouTube.
LEYLA ISIK: Oh.
AUDIENCE: Preferences regarding Google searches, based on what you've searched for before.
LEYLA ISIK: Exactly, so what Google search results you get. Are you going to say something?
AUDIENCE: Yeah, I was going to say on Facebook, like, tag someone's face?
LEYLA ISIK: Yeah, exactly, so now Facebook-- and that's actually, even more specifically, a computer-vision application. It can recognize people's faces and tag them for you.
AUDIENCE: [INAUDIBLE] say something [INAUDIBLE]. What annoying ads they would put.
LEYLA ISIK: Yeah, annoying ads they put on your screen.
AUDIENCE: Facebook friend suggestions.
LEYLA ISIK: Yeah, Facebook friend suggestions, exactly. So a lot of these are what are more broadly called recommender systems. Except for the face-tagging-- that's computer vision, but that's a big one. And so it's something that's in your phones, online, that you experience every day. And it has really gotten extremely popular.
There are two main types of machine learning algorithms-- supervised and unsupervised. And what supervised means is that you have labeled data and you use those labels in your algorithm. I think that covers most of the examples you all named. Take, for example, the Facebook tagger.
So Facebook has access to millions of people's photos online and can use your previous tags to say, oh, this is Leyla's face. So when I put a new photo up, it makes a prediction based on the previously labeled data. That's known as supervised, whereas unsupervised is what you might do if you don't have labeled data and you just want to extract some sort of pattern or information.
So just to give you a quick idea of a simple algorithm of both the unsupervised and supervised types, and then we'll go on with the rest of the talk. One popular unsupervised learning algorithm is known as k-means. It's a clustering algorithm, which means it tries to break your data into different clusters based on its input features. And the reason it's called k-means is because you specify a number, "k," and that's how many clusters it breaks your data into.
So this is an unsupervised learning example, so you don't need to have any labels, and it will just extract the information from the features you give it and cluster it based on what you say. So for example, if the features we're looking at now are space-- x and y-coordinates. And I told you I wanted to break this data into two clusters, how would it divide?
LEYLA ISIK: Yeah, exactly, just down the middle like that. But what if I said now three clusters?
LEYLA ISIK: Yeah, I would chop off this last one. So the results of this algorithm depend on not only the input data, but also you need to give it some information. So in this case, you're providing the information k, but there are even ways to choose that k in a totally unsupervised way. So it will just extract some sort of underlying structure for you.
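The k-means procedure just walked through on the whiteboard can be sketched directly. All of the data here is synthetic (two clouds of x/y points like the ones in the example), and the deterministic initialization is a simplification-- real implementations typically use random or k-means++ initialization.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Minimal k-means sketch: alternately assign each point to its nearest
    centroid, then move each centroid to the mean of its assigned points."""
    # Deterministic init for the sketch: k points spread across the data.
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centroids = points[idx].astype(float)
    for _ in range(n_iter):
        # distance of every point to every centroid -> nearest-cluster labels
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated clouds of (x, y) points, like the whiteboard example
rng = np.random.default_rng(1)
data = np.vstack([rng.normal([0, 0], 0.5, (30, 2)),
                  rng.normal([5, 5], 0.5, (30, 2))])
labels, centroids = kmeans(data, k=2)
```

Changing `k=2` to `k=3` is the "now three clusters" case: the same data gets carved into three groups with no labels ever provided.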
A popular type of supervised learning is called a Support Vector Machine, or SVM. So again, this requires labeled training data. So in the MEG example I showed you-- if I were to show my subject images of faces and cars, I would know when they were looking at a face or a car, and I could train my algorithm on a subset of this data to try and classify a new test point.
And the way an SVM works is it finds a hyperplane-- a high-dimensional line, essentially-- to separate two classes of data. So if we start with that same data again, but this time I give it labels. I say the green ones are one label, the black ones are a second label. I can train a linear algorithm to separate these. And what it does is find the line that is maximally distant from the closest points in each cluster. And this distance is called the margin, so sometimes people refer to this as a large-margin classifier.
You can also train more and more complex classifiers. So for example, you can do a nonlinear separation, and that might look something like this, say. So again, it's the same idea, where it tries to maximize the margin between the two closest points, but this time, it fits a non-linear function.
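To make the large-margin idea concrete, here is a toy linear SVM trained by gradient descent on the hinge loss. It's a sketch of the objective, not the solver a real library would use (scikit-learn's `SVC`, for instance, wraps a dedicated optimizer), and the labeled data is synthetic.

```python
import numpy as np

def train_linear_svm(X, y, lr=0.05, lam=0.01, epochs=500):
    """Minimize lam*||w||^2 + mean(hinge loss): the margin-maximizing
    objective behind the SVM. Labels y must be +1/-1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1   # points inside the margin, or misclassified
        # gradient of the regularized hinge objective
        gw = 2 * lam * w - (y[mask, None] * X[mask]).sum(axis=0) / len(X)
        gb = -y[mask].sum() / len(X)
        w -= lr * gw
        b -= lr * gb
    return w, b

rng = np.random.default_rng(0)
# Two linearly separable labeled classes, like the green and black points
X = np.vstack([rng.normal([0, 0], 0.6, (40, 2)),
               rng.normal([4, 4], 0.6, (40, 2))])
y = np.array([1] * 40 + [-1] * 40)
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)   # which side of the learned line each point is on
```

Only the points near the boundary (the "support vectors") ever contribute to the gradient, which is where the method gets its name; a non-linear SVM applies the same objective after mapping the points through a kernel.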
LEYLA ISIK: Yeah?
AUDIENCE: Is this still machine learning?
LEYLA ISIK: Yeah.
AUDIENCE: All right.
LEYLA ISIK: Yeah, yeah.
LEYLA ISIK: Yeah, yeah, so this is a supervised machine learning example. It's called the Support Vector Machine, or SVM. You might hear about it, and I think it gives a good idea, conceptually, of how most machine learning works. And most machine learning that you hear about, both in neuroscience and biology and in what Google and Facebook do, is supervised. So it's something, conceptually, like this, where they have labeled data. They try and separate it based on those labels and use that separation.
So now I have this line. If I put a new point on here, I can try and predict which class it belongs to. So we want to evaluate our algorithm based on a test data point. Or if you're Google, you want to figure out whether or not your predictor works based on how many ad clicks a new person gets. So what you're trying to do is make a prediction, here.
And so these are the lines we fit based on our training data. But we evaluate on new test data. So these dark green and black points are the training points. Let's say now I give you these light green and gray points, and these are test data, and you want to try and classify them.
So these two algorithms are just toy linear and non-linear examples, but does anyone have a sense of how these two algorithms would perform on these new test points?
AUDIENCE: The linear will perform pretty well. The non-linear will lose some of the points on the outside.
LEYLA ISIK: Exactly, so you always want to-- you can fit an arbitrarily complicated function to separate these two points. But the reason you don't want to do that is because you need to be able to generalize to new data. And the more complicated you get, the less likely you are to be able to do that. So in general, another intuition for machine learning is people try and find the simplest solution that separates their data so that it will generalize well.
All right, and so the last thing is what do I mean when I say it uses the 'features' to separate them? So we were talking about spatial features before. I was just doing it based on how close the points were to each other. But this could be any property of the data that you want to use to classify it.
So for example, we could take this shape example. Say I wanted to cluster these into two shape categories. What would they look like, do you think?
AUDIENCE: Square and circles.
LEYLA ISIK: Yeah, what if I said three?
LEYLA ISIK: Exactly, circles, squares, and rectangles. So you can use shape in this example. But in our case, we're using the neural data as our features. So we're actually putting the MEG data into the machine learning algorithm to get out the prediction.
So like I said, I record activity from 306 sensors. These are traces of different sensor activity, just to give you an idea of what the MEG data looks like. And I use a five-millisecond sliding window, which means that at every five-millisecond time point, I average the activity in each sensor.
So I get out that vector that I mentioned. I put it into a linear machine learning classifier, and then, for a new test point, I will get out a predicted label. Yeah?
AUDIENCE: So is each stimulus also being shown for five milliseconds?
LEYLA ISIK: No-- so I'll get to the stimuli, but in the first set of experiments, they're being shown for 50 milliseconds. So this time-binning parameter is something you need to choose. I went with something smaller than 50 milliseconds because MEG has such good temporal resolution, and I was interested in the dynamics even within the time the stimulus was on the screen.
So the question was, what if there's a delay between two signals-- they match really well, but one is offset by five milliseconds? In the case I'm talking about here, we would not pick that up. But later, I'll talk about a case where we can look for that.
So here, I am training and testing on exactly the same time point. So unless two signals match at exactly the same time, I won't get good accuracy. Are there questions? So since MEG has such good timing resolution, that was the question: is the data consistent enough to give us reproducible results within a five-millisecond window?
And then I can repeat this at each five-millisecond time point. So at each five milliseconds, I'm training and testing a totally different classifier. I mean, you can think of it like that support vector machine example, where it just fits a line to separate the data points based on their labels. And so at every five milliseconds, I get a new value for the accuracy of that prediction.
And we can use this accuracy as a measure of what information is present in the neural signals. And like I said, I'm using a simple linear machine learning classifier. It's actually very similar to the SVM I just talked about.
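Putting the pieces together, the per-time-point decoding loop can be sketched like this. The data here is synthetic-- a class-specific sensor pattern that appears partway through the trial, standing in for the post-stimulus response-- and a nearest-centroid decoder stands in for the study's linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 40, 306, 60   # e.g. 60 five-ms bins = 300 ms

# Hypothetical data: two stimulus classes whose sensor patterns differ
# only from bin `onset` onward (a stand-in for the post-stimulus signal).
onset = 12
pattern = rng.normal(0, 1, n_sensors)
X = rng.normal(0, 1, (2 * n_trials, n_sensors, n_times))
y = np.array([0] * n_trials + [1] * n_trials)
X[y == 1, :, onset:] += pattern[:, None]

def accuracy_at(t, train_idx, test_idx):
    """Train a nearest-centroid (linear) decoder on one time bin and
    report its accuracy on held-out trials at that same bin."""
    c0 = X[train_idx][y[train_idx] == 0, :, t].mean(axis=0)
    c1 = X[train_idx][y[train_idx] == 1, :, t].mean(axis=0)
    Xt = X[test_idx][:, :, t]
    pred = (np.linalg.norm(Xt - c1, axis=1) <
            np.linalg.norm(Xt - c0, axis=1)).astype(int)
    return (pred == y[test_idx]).mean()

train_idx = np.arange(0, 2 * n_trials, 2)   # even trials train,
test_idx = np.arange(1, 2 * n_trials, 2)    # odd trials test

# A totally separate classifier at every time bin -> accuracy over time
acc = np.array([accuracy_at(t, train_idx, test_idx) for t in range(n_times)])
```

`acc` is the kind of curve shown in the results that follow: at chance before the signal appears, then rising once the stimulus information is present in the sensors.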
All right, so for this first part of the research project, where we studied the dynamics of invariant object recognition, people hadn't really applied this technique much to MEG data. So we were just curious. I mean, you have to get millions of neurons to fire at the same time. They have to be very well time locked, so we were even wondering if we could decode different visual stimuli in the MEG data. And then we wanted to ask what the dynamics were of these signals. And then finally, we wanted to see if we could decode invariant information, so not just faces and cars in one position, but can we generalize across all the transformations I talked about at the beginning of the talk?
All right, so the first part: can we decode the visual stimuli from MEG data? So just as a first pass, we put a subject in the MEG scanner and showed them these 25 scene images. So there are five beach examples, five cities, five forests, five highways, and five mountains. And we flashed these images, all for 50 milliseconds.
And so like I said, we are training and testing our classifier at each time point so we can get a plot of the classification accuracy, so how well our prediction works across time. And time zero is when the stimulus goes on. So before that, decoding accuracy is at 4%, which is chance, because it's one over the 25 images that we showed.
But then, beginning around 60 milliseconds after stimulus-onset, decoding accuracy goes up, and we can predict with 40% accuracy what image the person was viewing based on totally new MEG data. And the blue line at the bottom indicates when decoding is significant with a P less than 0.05. And so, we were pretty impressed by this. I mean, this is typically thought of as very noisy data. It's very low spatial resolution. But still, we can tell very reliably which scene the person is looking at.
Yeah, so is 40% accuracy good? There are kind of two approaches to decoding. One is to maximize your performance. But like I said, we're using a very simple linear classifier, and I'm not pre-processing the data much, so I think it's good. My threshold is generally: is it significantly above chance? Because I'm just interested in whether the neural information is there.
Some people are doing this for engineering performance. So if you wanted to design a brain machine interface, 40% would not be very good, because you would want it to be pretty close to 100%. So there are different goals and different methods that people apply for each.
So the question was, if you put the image on for more than 50 milliseconds, would you see better accuracy? I've never directly compared that. But what's interesting is that the accuracy goes up and down, and even if you leave the image on the screen for longer, you still see that the accuracy will go down. So if I were to leave the image on for a second, what happens is your visual system habituates to that image. At some point, the neurons stop firing in the same way. What other people have seen is that even if they put the image on for a second, after about 500 milliseconds the signal decays, like it does here as well.
Oh, great question-- so are the subjects doing a task? Here, no-- well, they're doing a task, but it's not related to this. They're instructed to fixate centrally and report the color of the fixation cross, for a reason I'll explain in a bit. So this is all passive viewing. The person isn't even consciously saying, oh, that's a beach, that's a mountain. This all just happens automatically in your brain.
So then we wanted to try a range of different stimulus sets. So next, we did these black letters on a white background. And again, the results look very similar. Around 60 or 70 milliseconds, decoding accuracy goes up and stays up for a few hundred milliseconds.
And then finally, we did these grayscale objects, which are rendered 3D models of different round objects, so it might be hard to see from where you are, but there's like a hand, a basketball, a bowling ball, etc. And again, these look very similar. So it seems like we can reliably decode a wide range of stimuli. And the dynamics of this seems to be very similar, where around 60 milliseconds, you can start decoding.
So the question is if they were behaviorally responding, trying to classify the different objects, how long would that take? So even just the motor response takes a couple of hundred milliseconds. And so that's why I think-- so a lot of people ask, well, can't you just study this all with behavioral timing? And I think the margin of error on that is worse than we can get with the MEG. So I think this allows us to ask very fine-grain timing questions.
So then we can ask, what are the dynamics? So this gets back to your previous question about what if things are just time-shifted, say? So what I was doing in all the results I showed you was training and testing on the exact same time point.
But what I can do instead is train on one time point and use that trained classifier to test on all other time points. So then I can get a matrix of train times versus test times. And red indicates high-accuracy signals and blue indicates low-accuracy signals.
So this is the experiment I just showed you with those grayscale objects. And what you see is that decoding accuracy's only high along the diagonal, when you train and test at the same time. And what's, I think, even more striking, is that this time window is so narrow. It's about 50 milliseconds wide.
So the neural signals-- even though you can decode for 400 milliseconds, that representation is changing so rapidly that if you try and test 50 milliseconds later, it's already different and your classifier doesn't work. So the question was, are you decoding from the whole brain? Yes, but I'm doing some feature selection to down-sample the data. That doesn't seem to affect this, though, but it does help remove some noise from the data.
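The train-time-by-test-time matrix can be sketched the same way. The synthetic data here is built so that the class-discriminating sensor pattern is different at every time bin-- the scenario the MEG results suggest-- so a decoder trained at one time should fail to generalize to other times, giving high accuracy only along the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_sensors, n_times = 40, 100, 20

# Hypothetical data: a *different* discriminating pattern at each time bin
patterns = rng.normal(0, 1, (n_times, n_sensors))
X = rng.normal(0, 1, (2 * n_trials, n_sensors, n_times))
y = np.array([0] * n_trials + [1] * n_trials)
for t in range(n_times):
    X[y == 1, :, t] += patterns[t]

train_idx = np.arange(0, 2 * n_trials, 2)
test_idx = np.arange(1, 2 * n_trials, 2)

def centroids_at(t):
    c0 = X[train_idx][y[train_idx] == 0, :, t].mean(axis=0)
    c1 = X[train_idx][y[train_idx] == 1, :, t].mean(axis=0)
    return c0, c1

# Temporal generalization: train a decoder at t_train, test it at t_test
gen = np.zeros((n_times, n_times))
for t_train in range(n_times):
    c0, c1 = centroids_at(t_train)
    for t_test in range(n_times):
        Xt = X[test_idx][:, :, t_test]
        pred = (np.linalg.norm(Xt - c1, axis=1) <
                np.linalg.norm(Xt - c0, axis=1)).astype(int)
        gen[t_train, t_test] = (pred == y[test_idx]).mean()
```

Plotted as an image, `gen` reproduces the qualitative picture described above: a narrow high-accuracy diagonal, with chance-level accuracy everywhere off the diagonal.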
So the question was-- so this is pretty low-level, what we're decoding, right? Presumably this can all happen in primary visual cortex because an edge detector can tell the difference between a hand and a bowling ball, right? They look very different at the low-level.
So Tyler's question was what if I was doing some higher-level, more conceptual task? So yes, I would expect an overall time delay. And I would expect that this window would get a little wider. And I'll show some data like that in a bit, as well.
AUDIENCE: Are you training and decoding all within one participant?
LEYLA ISIK: Yeah, so I'm doing the decoding all within one participant. I will show you data for multiple participants later, but that's all averaging the final decoding accuracy. Because there's no good way to align two MEG subjects-- I mean, it's hard even to align two people's brains when you have great spatial resolution. But if you don't have good spatial resolution, it's even harder. So the process of training on one subject and testing on another just doesn't work that well. But I think that's another interesting engineering question that people are working on. If you did want to implement this in BMIs, or something like that, you would want some sort of algorithm that could be more flexible than this.
All right, so the last part is, we were wondering if we could then decode invariant information, and what that can tell us about the underlying computational steps. So like I said, primary visual cortex can solve all of these problems-- it can tell whether it's a hand or a bowling ball that you're viewing.
So we wanted to ask, well, can these representations generalize to give some sort of higher-order picture of things? So to do this, I took six of the objects I presented in the last study-- bowling ball, basketball, football-- I picked them all to be roughly round. And I presented them at three different positions, so the top, center, and lower half of the visual field, and three different scales-- 6, 4, and 2 degrees of visual angle. So for those who don't study vision, one degree of visual angle, if you stick your thumb out, is roughly the width of your thumb. So just to give you an idea of how big these are.
And then we wanted to ask, if I trained my classifier with data from images presented at one position and tested at a second position, can the neural signals generalize across that transformation? In other words, would I have an invariant representation? So here I'm going to show, in blue, an example where I train on images all presented on the lower half of the visual field and test on centered images. And I'm just demonstrating with this bowling ball, but we trained on the classification between all six images at the lower half of the visual field.
And these are the results. So what you see is that, again, yes, we can decode. But what's interesting is that now it's more like after 100 milliseconds that decoding accuracy goes up. So overall, it takes longer for this computation to be carried out than it does in the case without any generalization.
And so we can repeat this for all six position comparisons since we have three different positions. So just to give you a frame of reference, that gray line is where the non-invariant signals first arose. So you do see a good, like, 50-millisecond time shift.
So you can repeat this for all six position comparisons. And you see that we can decode across all of these transformations, but it does take longer. So it seems like the brain is taking more time to build up a representation that's invariant to position.
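The cross-transformation test has the same shape in code: train a decoder at one position, test it at another. This sketch uses hypothetical "identity" and "position" components in sensor space-- an assumption made purely so the toy data has something position-invariant to find.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, n_trials = 200, 30

# Hypothetical sensor responses: each object evokes a shared "identity"
# component plus a component specific to the position it was shown at.
identity = {obj: rng.normal(0, 1, n_sensors) for obj in ["ball", "hand"]}
position = {pos: rng.normal(0, 1, n_sensors) for pos in ["lower", "center"]}

def trials(obj, pos, n=n_trials):
    return identity[obj] + position[pos] + rng.normal(0, 1.0, (n, n_sensors))

# Train a nearest-centroid decoder on trials from ONE position...
c = {obj: trials(obj, "lower").mean(axis=0) for obj in identity}

def predict(trial):
    return min(c, key=lambda obj: np.linalg.norm(trial - c[obj]))

# ...and test it on trials from a DIFFERENT position. Above-chance
# accuracy here means the representation generalizes across the
# transformation -- i.e., it is position-invariant.
acc = np.mean([predict(t) == obj
               for obj in identity
               for t in trials(obj, "center")])
```

The position-specific component is identical for both objects, so it cancels out of the centroid comparison; only the shared identity component drives the prediction, which is exactly what "invariant decoding" is probing for.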
AUDIENCE: And those are all the same [INAUDIBLE].
LEYLA ISIK: Yes, and the thickness of the bar-- it's kind of hard to see on this projector-- is how many subjects it's significant for at that time point. So I should say, this is the average of eight subjects. But it's the same significance criteria as before. So even though the accuracy is overall less, it's still significantly above chance.
AUDIENCE: So you didn't necessarily get significance for every [INAUDIBLE].
LEYLA ISIK: I did, not all at the same time, necessarily.
LEYLA ISIK: I think there was only one subject and one condition where we didn't get significant decoding in this case. So we can do the same thing for the size comparison. So we can train on one size, test on a second, and do that for all six size comparisons. And again, you see that this comes on later.
But what you might notice here is that there's a pretty striking difference between the different conditions. So this blue and red condition, which is training on the largest and testing on the second largest, and vice versa, comes up a lot sooner and has a lot higher accuracy than the other conditions. So we wanted to investigate this further, so what I can do is look at the time when decoding first becomes significantly above a chance for each subject, and plot that for each of these 12 different conditions.
So in other words, that's the onset latency-- I'm calling it my decoding onset. So I can first do that for the non-invariant case, where I train and test on the same position. And these are like what you saw before: around 70 or 80 milliseconds, you can decode significantly above chance. Then these are the six size-invariant comparisons and the six position-invariant comparisons.
So looking at this, you might notice a couple of things. First, it seems like the non-invariant, overall, come on first, followed by size and position invariant. But what I think is even more interesting is if you look within the size and position cases, you see some latency differences, as well.
So for example, these are the blue and red conditions that I pointed out last time-- training on the largest, testing on the second largest. And this is the case where you train on the largest and test on the smallest. And what it seems like is that smaller transformations-- the difference between six and four degrees-- come on faster than the largest transformation, which is trying to generalize between six degrees and two degrees, the largest and smallest. That might make sense intuitively, but I think it's pretty interesting that we can see it in these timing values. So it seems very much like the onset time is directly related to the amount of transformation, or shift, you have. And so it seems like invariance arises in stages, with invariance to smaller transformations occurring before invariance to larger transformations.
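The latency comparison boils down to an onset measure like the following sketch. The sigmoid accuracy curves are toy stand-ins for real decoding results, and the fixed accuracy threshold stands in for the permutation-based p < 0.05 criterion used in the actual study.

```python
import numpy as np

def decoding_onset(acc, threshold, n_consecutive=3):
    """First time bin at which the accuracy curve rises above threshold
    and stays there for n_consecutive bins (a simple onset criterion)."""
    above = acc > threshold
    for t in range(len(above) - n_consecutive + 1):
        if above[t:t + n_consecutive].all():
            return t
    return None

# Toy accuracy curves: a non-invariant condition whose rise is centered
# early, and a position-invariant condition whose rise is centered later
t = np.arange(60)
acc_plain = 0.5 + 0.4 / (1 + np.exp(-(t - 14)))
acc_invariant = 0.5 + 0.3 / (1 + np.exp(-(t - 24)))

onset_plain = decoding_onset(acc_plain, 0.6)
onset_invariant = decoding_onset(acc_invariant, 0.6)
```

Comparing `onset_plain` and `onset_invariant` across conditions and subjects is the kind of analysis behind the latency differences described above: the invariant condition reaches its threshold at a later bin.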
So most people in the Poggio Lab don't do neuroimaging. They do machine learning or computer vision. And we have this model of visual cortex in the lab. And this seems very consistent with that and other hierarchical feedforward models of computer vision.
So I'll go over the model that we have in our lab, but you should know that this is similar if you guys have heard about deep learning, or deep neural networks. This works in a conceptually very similar way to those networks, so I think it's a nice proof of concept, biologically, for those types of networks.
So these models were inspired by Hubel and Wiesel's findings in visual cortex. And what they found is that you have simple cells, which I told you about earlier, which are, essentially, edge detectors. So they fire in response to a line in a given orientation.
But then you have complex cells, which seem to pool over those simple cells, so a complex cell will take the responses of three different simple cells and pool over all of them. So now you have a cell that fires in response not only to the line here, but to the line in any position. And that's roughly how they predicted you build up invariance. So we have a model that does the same thing. So it takes an input image, and it has all these different edge detectors that it convolves the image with, so you get different edge maps for different orientations.
And then there are complex cells that pool over the simple cells. So the simple cells do template matching, and this is thought to build up selectivity. And the complex cells perform pooling. So like I said, taking a max over all the responses of the simple cells. So for example, this red complex cell pools over all of these simple cells so it fires in response to this feature, anywhere here. And then this is repeated at each layer until you have a global max pooling, so a feature that's invariant to all positions and scales.
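The simple-cell/complex-cell scheme described above can be sketched in a few lines of NumPy. The toy image and the 2x1 vertical template are hypothetical, not the model's actual filters; the point is only that the S layer does template matching and the C layer max-pools locally:

```python
import numpy as np

def simple_cells(image, template):
    """S layer: template matching, a valid-mode 2-D cross-correlation."""
    th, tw = template.shape
    h, w = image.shape
    out = np.empty((h - th + 1, w - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + th, j:j + tw] * template)
    return out

def complex_cells(s_map, pool=2):
    """C layer: max over local neighborhoods, so the response tolerates
    small position shifts of the preferred feature."""
    h, w = s_map.shape
    return np.array([[s_map[i:i + pool, j:j + pool].max()
                      for j in range(0, w - pool + 1, pool)]
                     for i in range(0, h - pool + 1, pool)])

# A toy vertical-bar image and a crude vertical "edge" template
img = np.zeros((6, 6))
img[:, 2] = 1.0                  # vertical bar at column 2
edge = np.array([[1.0], [1.0]])  # 2x1 vertical template (an assumption)
c = complex_cells(simple_cells(img, edge))
```

Shifting the bar by one pixel leaves the pooled complex-cell response unchanged, which is exactly the local position invariance being built.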
And why this is consistent with our data is that you first see invariance to small local regions. So this cell, at this layer, is invariant to this feature anywhere in this small region. By the time you get to the top layer, you're invariant to all scale and position shifts. So the fact that we also see invariance arise in stages, I thought was pretty neat.
So Tyler's question was: our data says that invariance increases with time, whereas with this model, and when you're looking at higher spatial resolution data, you typically say that invariance increases from layer to layer to layer. So here, I am equating early times with early layers, which is true in a purely feedforward network, but not necessarily true otherwise.
But I think it's interesting to study this from a timing perspective, because if you want to understand these computations, it's important to know what order they occur in, right? And I think no other methods can give us that right now, other than MEG and EEG. So even though electrophysiology has very high spatial and temporal resolution, you can only record from one or two brain regions at a time. So I think there are pros and cons to the fact that we are recording whole-brain activity. I mean, I think there are some cases where you can narrow it down, based on knowing that V1 responds to oriented lines and edges, but I agree that it does leave some ambiguity.
So the question was why do we use a feedforward model instead of a recurrent model? And the reason is a lot of people think the earliest visual response is mostly feedforward, and it's just simpler. So my approach is, how much can we explain with this simple feedforward model, how much can we push its performance? And then I think it makes more sense to start thinking about recurrence. Otherwise, I think it gets-- one, really complicated to model the biology. Two, it can get really computationally intractable.
So that's a great question. So I've done that, also. You can explicitly just choose the sensors that are in the back of the helmet over visual cortex. And that seems to work very similarly, but I liked this sort of data-driven way of choosing the sensors. So one, this is a pretty simple case, so you know that it's these back sensors. And in fact, if you look at a map of the sensors that are chosen, it is all the very occipital back sensors over visual cortex. But there are more complicated cases where that's not the case, so I think it's nice to do both approaches.
All right, so just to summarize this object recognition part, we saw that we could read out visual stimuli with high accuracy from MEG data, as early as 60 milliseconds after stimulus onset, which is quite fast, and invariant to size and position after 100 milliseconds. We saw that the representation, even though you could decode for several hundred milliseconds, was highly dynamic, so it was changing within, say, a 50-millisecond window. And then finally, we saw that size and position invariance seems to develop in stages, with invariance to smaller transformations occurring before invariance to larger transformations.
So we can take this data and kind of start to make a temporal map of early vision in humans. So the stimulus is on for 50 milliseconds, and then around 60 to 80 milliseconds, we build up a representation of objects that is not invariant. Around 80 to 120 milliseconds, you get some local invariance. And then around 150 milliseconds, you seem to be invariant to a wide range of sizes and position shifts.
And what's nice is this is very consistent with the timing information we know from the macaque literature. So in macaque electrophysiology studies, you can invasively record from different brain regions. And so like I said, this has high spatial and temporal resolution, but doesn't allow you to do the whole-brain type of studies that we can do here. But what's nice is that the timing seems very consistent with macaque V1, where signals occur around 60 milliseconds, and macaque IT, which neural signals reach around 100 milliseconds.
So the question was if I show an image of a cow and then show the word cow, you would expect to see some sort of invariant representation somewhere. I haven't done that experiment, but it's in the pipeline. But I think that's really interesting. I think it would be quite a bit later than this, is my guess.
LEYLA ISIK: Um, so the question is when. I don't know. I mean, the other issue is that I think that would be one case where you would see off-diagonal activity, because you have to read the word 'cow.' It doesn't have as clear of an onset as when you flash an image.
All right, so in the second part of my thesis, I looked at how we can decode people's actions from videos. And the reason I wanted to move in this direction is because I wanted to go more towards real-world stimuli. So when you view objects in the real world, they aren't static. They aren't on a gray background. And so we thought recognizing actions from videos would be a good next step towards more realistic input.
And I think that action recognition is a gateway to higher-level visual and social questions that I'm ultimately interested in my research. So for example, if you wanted to try to understand people's social interactions, emotions, if you wanted to understand a narrative in a story you were watching, that would be very interesting, I think. So the next thing we did was try to look at action recognition.
So what we know about action recognition is that humans can quickly recognize actions, even from very impoverished stimuli. So Johansson did these experiments where he put lights on people's joints and had them do different actions in a dark room. What he noticed was that even from just these moving dot, point-light stimuli, people could recognize actions within a couple hundred milliseconds.
And since then, people have done neuroimaging and electrophysiology studies to isolate the superior temporal sulcus, which is an area that gets input from both the ventral and dorsal streams, as being implicated in recognizing this biological motion, and also as having some slightly invariant representations, as well. So we know that neurons in this region can recognize actions regardless of what actor's performing them, and there is some small viewpoint invariance, as well. But like with object recognition, we don't have a clear picture of the whole temporal course of how this information evolves.
The other thing we wanted to do was see if we could decode action, and whether we could do it invariant to viewpoint and actor. And then in this work, we are using these MEG insights, in collaboration with a lab mate, to develop a new computational model that's based on the one I just explained, but to recognize actions from videos. And then we wanted to look at the effects of both form and motion on action recognition. So I showed you examples of how people can recognize actions just from the motion of dots, so stimuli with very little form. But there are also examples where you can recognize actions from static frames, and so there's a big question in the literature about how these two types of information interact.
So when we were starting this project, we wanted to know what data set we should show the subjects. So we turned to the computer vision literature, because there's a lot more action recognition data sets there. But they seem to roughly come in two different categories. So there are these very large and uncontrolled data sets that mostly come from YouTube clips.
So this is the UCF data set, and it has hundreds of categories of people knitting, mixing batter, writing on the board. And they're YouTube clips, so it's very large, very unconstrained. We wanted something a little bit more controlled to start with, but the other type of data set tends to be very small, with people performing different actions on a fixed background.
And the other thing-- so this is the Weizmann data set, which is a very simple, but popular, computer vision data set. And so these are two different people walking on the same background. But what you might notice is that the camera is very zoomed out, so it's hard to see the people. And therefore, it's pretty trivial to generalize between person one and person two, because they both just look like small figures walking.
So since we weren't really satisfied with many of the existing data sets, we decided to film our own. So we filmed a data set with five actors performing five actions, which were run, walk, jump, eat, and drink. And we had everyone perform all actions on a treadmill. And the reason we did this was so that we could avoid the problem of having a very zoomed-out camera view, and people moving in and out of the frame. I should say that the treadmill was static, except for when they were running and walking.
We had them hold the same objects in each video, so regardless of what action they were performing, they had an apple and a water bottle in their hands, so that drink wouldn't always be the video where you picked out a water bottle. And then we filmed it from five different viewpoints. So we kept the background fixed, and we moved the actor and the treadmill to five different views.
So to give you an idea of what these videos look like, this is me running, someone else walking, jumping, eating, and drinking. And Sarah probably recognizes all these people. And so as you might notice, we had everyone do all these actions, and the background stays fixed between them. And then the other thing we did was film it at five viewpoints: zero degrees, which is shown here on the left, straight on; 45 degrees; a 90-degree profile; 135 degrees; and a 180-degree back view.
And so, like with the object case, we're just curious. Can you even decode action from these videos? And if so, can you do it invariant to actor and to viewpoint?
So to see if we could decode action, we put people in the MEG scanner and showed them all 125 videos-- so five actions, five actors, five views. And we tried to just ask, what action are you seeing in that video? So here, because it was a harder task, we had them doing an action recognition task, just because that would boost performance a little. I have a feeling it would work if they were passively viewing, but we haven't tried it.
So these are the results. So again, the stimulus goes on at time zero, and before that, decoding accuracy's at chance, which is 20%. And then as early as 200 milliseconds, we can already decode what action people are viewing. So that's quite fast. It's slightly longer, but on the same order as the invariant object recognition.
AUDIENCE: What time did you say?
LEYLA ISIK: 200.
LEYLA ISIK: Yeah, the question was what time? So it starts around 200 milliseconds. All right, but again, what we're really interested in is this generalization, the invariance case. So we first looked at the actor invariant case. So we train our classifier on four actors at all views, and then test on data from the subject watching the fifth actor. So the classifier has never seen any data from the subject viewing this actor.
And again, just like in the case I showed at the beginning of the talk, the trick here is to be able to find discrimination between me running and walking, but still be able to generalize between me and Andrea, to recognize that both Andrea and I are walking. You can do the same thing with viewpoint. So we can train the classifier on all actors at four different views, and test on the fifth view that the classifier's never seen.
And we looped through this so that we hold out each of the five views or actors. So this is the actor invariant decoding case. And what you might notice is that it looks very similar to the case where there's no generalization. So again, around 200 milliseconds, you can tell, invariant to actor, what action the subject was viewing. And it looks the same with the view invariant case.
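The leave-one-actor-out scheme she describes can be sketched as follows. The `nearest_mean` toy classifier and the synthetic feature data are my stand-ins for the correlation classifier and the real MEG sensor data; only the hold-out structure is the point:

```python
import numpy as np

def nearest_mean(X_train, y_train, X_test):
    """Toy classifier: label each test trial with the class whose mean
    feature vector is nearest (a stand-in for the correlation
    classifier actually used)."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def leave_one_actor_out(X, actions, actors):
    """Train on all trials from every actor but one, test on the
    held-out actor, and average accuracy over the held-out splits."""
    accs = []
    for held in np.unique(actors):
        tr, te = actors != held, actors == held
        pred = nearest_mean(X[tr], actions[tr], X[te])
        accs.append(np.mean(pred == actions[te]))
    return float(np.mean(accs))

# Synthetic stand-in data: 2 actions x 3 actors, 10 trials per cell,
# with features that separate by action but not by actor
rng = np.random.default_rng(0)
actions = np.repeat([0, 1], 30)
actors = np.tile(np.repeat([0, 1, 2], 10), 2)
X = rng.normal(size=(60, 5)) + actions[:, None] * 3.0
acc = leave_one_actor_out(X, actions, actors)
```

Swapping the `actors` array for a view label gives the view-invariant version of the same loop.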
So it was maybe not surprising that the non-invariant representation was so fast. But we were quite surprised, particularly that you could generalize across viewpoints for this and still recognize action. And this is time relative to when the video starts. So these are two-second-long videos, and after six frames, you can already tell what action someone's viewing. So we thought, maybe it's the case that if you show the classifier data from the subject watching four views and test on a fifth, that's not a very hard generalization. So maybe it just has enough information to interpolate, and that's why these results look similar.
So we tried a new set of experiments, which we called the "extreme" view invariant experiments, where you train on one view-- we picked the front view-- and test on a second view, the side view. And so now, instead of getting input from four different views, the classifier only gets one view.
So this is the result for the within-view decoding. So the case without any generalization, where I train and test with data from the same view. So I'm either training on the zero-degree case and testing on the zero-degree case, or training on the 90 and testing on the 90. And again, around 200 milliseconds, we can decode.
But what's interesting is the case where you have the generalization, where you train on one view and test on a second view-- any guesses? So we have one guess for less time, and a lot of people are saying it's the same.
We actually thought it would be longer, because we thought this generalization, like in the object case, would take more time-- there, it took more time for a larger transformation, and this is an even larger transformation. But you guys are right, it's the same. And we thought this was pretty surprising-- not only do you get a representation for action 200 milliseconds after your video starts, but it's already invariant to viewpoint.
The question is, do we only interpret actions on an invariant level? So I am not doing exactly the same thing as the object case-- I am not training on one video and testing on different instances of the same video. I'm keeping all five different actors in there.
So my interpretation is that in order to recognize action, even within one view, you need to do some generalization-- we're starting all these videos at different random time points. So we chunked these videos totally randomly, so jump could start at the bottom of a jump, or the top of a jump. So I think there's already some generalization you need to do to account for that. And I think that generalization gets you to the same point, where you can also be invariant to view and actor.
All right, so next, we wanted to see if we could use this insight to build a hierarchical computational model to recognize actions. So this was done with my lab mate, Andrea. So it's again, one of these hierarchical models inspired by Hubel and Wiesel's findings in early visual cortex. But now we have video, so everything is not just x and y, but also time.
So instead of an input image, we have this input video. So I'm showing the different frames from left to right. And you, again, have simple cells, but instead of just being an oriented line, it's a small chunk of video where the line moves across the screen with time.
And we can convolve our video with those input templates. And then we can do pooling to build invariance, but instead of just pooling in x and y, we're also pooling in time. And we can do the same thing again.
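A minimal sketch of what pooling over time adds, with toy pooling sizes (my assumption); the only point is that the max now runs over local x, y, AND t neighborhoods:

```python
import numpy as np

def pool_xyt(s_maps, pt=2, py=2, px=2):
    """C-layer pooling for video: take the max over local neighborhoods
    in time as well as x and y, so a response tolerates small spatial
    shifts AND small temporal misalignments."""
    t, h, w = s_maps.shape
    out = np.empty((t // pt, h // py, w // px))
    for k in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = s_maps[k * pt:(k + 1) * pt,
                                      i * py:(i + 1) * py,
                                      j * px:(j + 1) * px].max()
    return out

# One strong simple-cell response somewhere in a 4-frame toy "video"
vid = np.zeros((4, 4, 4))
vid[1, 2, 3] = 1.0
c = pool_xyt(vid)  # the response survives pooling, at a coarser location
```

A response one frame earlier or later would land in the same pooled cell, which is the temporal tolerance being built.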
But in addition to extending this to videos, the other thing we wanted to do was recognize these actions invariant to viewpoint. So we also have to pool over view somewhere. One thing that people have done in previous models is just add a view invariance module on top. But because we saw from the MEG data that the view invariance happens at the same time as the action recognition-- the same layer as the other computations-- we implemented, in this pooling stage, a pooling across viewpoints.
Oh, and so I should mention that these S1 templates are hard-coded to mimic the cells that are found in V1 and MT. But S2 templates are sampled randomly from our training videos, and we pass those video chunks through this model to get our templates. I'll explain that a bit more in a second.
So like I said, we wanted to pool over viewpoints in the same model layer. And we tested one model, which is a structured model, where the C2 cells pool over the S2 templates from videos of the same action. So these come from our training videos. So we can use our training labels to say, for example, these are all the videos from walks, so pool over their templates. And these are all the videos from run, so pool over their templates to build this cell.
So to give you another sense of what this looks like, this run cell would get this chunk of torso as its template, so it compares all input videos to this chunk of torso. But we're explicitly telling it-- take that torso at different viewpoints and compare that. And we can compare that to a model that has the exact same templates, but pooled randomly. So we don't enforce this structured pooling across views. And that would look something like this, where it takes the same set of templates overall, but the wiring to each cell is not structured like that.
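The structured-versus-random contrast comes down to how S2 template responses are wired into C2 cells. A sketch with made-up response values; only the grouping is the point being illustrated:

```python
import numpy as np

def pooled_response(template_responses, groups):
    """Pool S2 template responses into C2 cells: each C2 unit takes the
    max over one group of templates. Structured pooling groups templates
    of the same action seen from different views; the control wires the
    same templates into random groups instead."""
    return np.array([template_responses[g].max() for g in groups])

# Made-up responses of 6 S2 templates to some input video
resp = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.1])
# Structured wiring: templates 1 and 3 are, say, "walk" seen from two
# different views; the rest belong to another hypothetical group
structured = [np.array([1, 3]), np.array([0, 2, 4, 5])]
c2 = pooled_response(resp, structured)
```

With structured wiring, a strong match to the "walk" template at either view drives the same C2 cell, which is what lets the model generalize across viewpoints.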
So we can train the model on four actors at one view, and either test on the same or a different view. Sorry, it might be a bit hard to see, but the black is the model with structured pooling, so our experimental model, and the white is the control model. And what you see is that if you train and test on one view, both models do quite well-- they get around 80% accuracy.
But if you now do the case where you train on one view and test on a second, performance for both drops. But what you see is that this model with structured pooling performs at around 40% accuracy versus 30% accuracy-- so significantly better than the case with just random template pooling.
So some of you might be wondering, this still isn't super high accuracy, right? It drops from 80% to 40%. That's because this is a pretty hard problem. So unlike most computer vision algorithms that train on millions of examples of labeled data, here, we are just training on four actors at one view. This model has never seen anyone run at this second view before. So I actually think that this is pretty good, that it can recognize this at all.
So even with the MEG data, subjects have seen people running at all views before they get in the scanner. So it's not quite a fair comparison of the direct accuracy.
AUDIENCE: Are you still assuming it's all feedforward for this [INAUDIBLE]?
LEYLA ISIK: So the question is, are we still assuming it's all feedforward? And the answer is yes. So like I said, there are some models that first do some sort of computation, and then later pool that information back and use it to calculate the viewpoint invariance. And we didn't do that. One, because we saw that the representation occurred so quickly, so it's very plausible that it's restricted to the feedforward-only part of vision. And two, the viewpoint invariance also occurs very early. So that's why we, again, wanted to implement it in this feedforward model.
I think if you were to look at more complex videos-- I think this is definitely not how the brain is working, right? It's a very simplified model. But our question is how much can this one mechanism-- we're really only testing one simple pooling mechanism-- explain our data?
So no one had done viewpoint invariance from videos, but somebody had done viewpoint invariant face recognition, and there, it was later. So what happened was, it was the same object recognition model I showed you, and they just stuck a viewpoint invariance module that pooled at the top layer. So the prediction from that model would be that viewpoint invariant face recognition occurs later than normal face recognition, which seems to be true from the biological data, but doesn't seem to be as true in the case of actions. And again, I think that's because to recognize actions, you can't just match lines and edges. You have to do some more complex integration of form and motion information.
The question is, is mimicking biology the best way to solve this problem? So I think state-of-the-art computer vision systems right now are these deep neural networks, which very loosely mimic the biology, in that they are hierarchies. But I think there are a lot of ways in which they aren't really biologically plausible. So like I said before, they have to be trained on millions of labeled examples.
And while babies do get lots of natural input, there's not so much supervision where someone says, a million times, this is a dog. This is a dog. This is a dog, right? And so while this may not be the best performing model right now, I think pushing this and trying to get new insights from biology is how we will get better, more human-like performance.
So the question is, can you exploit the temporal dynamics of your videos-- the fact that you know which direction time is moving, both in the videos and the neural data-- to get better accuracy? So the answer is yes. With the MEG data, we haven't done that much-- I'm just training and testing at one time point. I think I would get much better decoding if I used the temporal information in a smarter way than that.
With the modeling, one way that people are trying to make these deep neural networks more biologically plausible is by training them not with millions of labeled examples, but with short video segments. And if you see a short video segment, you can assume that the objects in the first frame are the same as the objects in the frame two seconds later, and use that as supervision, instead of the actual label 'dog.' So instead of that, if you show a video of a dog, it can say, OK, I think this object is the same as this object, and it's different than this object that I see 30 minutes later.
And actually, what's really interesting is that these are really new models, and they seem to be doing quite well, almost as well as the supervised cases. So I think that's a really exciting avenue for AI. And I think it's something people have thought happens in the brain for a while, so you can test it there, also.
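The temporal-continuity supervision idea can be sketched as a pair generator; this is a sketch of the general idea, not any specific published model:

```python
def continuity_pairs(videos, gap=1):
    """Generate (video_a, frame_a, video_b, frame_b, same) tuples: frames
    `gap` apart within one video are positives ("probably the same
    object"), and frames from different videos are negatives. This free
    signal stands in for explicit category labels."""
    pairs = []
    for vi, frames in enumerate(videos):
        for t in range(len(frames) - gap):
            pairs.append((vi, t, vi, t + gap, True))   # temporal positive
        other = (vi + 1) % len(videos)                 # any other video
        pairs.append((vi, 0, other, 0, False))         # cross-video negative
    return pairs

# Two toy 3-frame videos give 4 positive and 2 negative pairs
pairs = continuity_pairs([[0, 1, 2], [3, 4, 5]])
```

A network can then be trained to map positive pairs to similar representations and negative pairs to dissimilar ones, with no human labeling.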
The question was, can you get signals from other parts of cortex in this amount of time? So in the 60 milliseconds, we're pretty sure that's just V1. Here, that's definitely not the case. And yes, I think that's definitely within this time scale. I've done some source localization, which is where you try to infer from the MEG data where in the brain the signals are coming from.
The methods are far from perfect, so I don't typically like to show that data, because I have mixed faith in it. But it seems like you're mostly getting ventral and dorsal stream activity there. But I think it's hard to say. And same thing with the model-- we tend to be very visual cortex-centric, but I think it's important to start thinking about, especially as you move to more complex tasks, how the rest of the brain plays a role in this. Sarah?
AUDIENCE: Have you thought about matching your data with fMRI data?
LEYLA ISIK: So the question was, have we thought about matching this with fMRI data? And yes, I think that really is how to push this forward. And during my postdoc, that's something I'm doing-- looking at fMRI, electrophysiology, and MEG on the same stimulus set, and trying to find commonalities between all three. Because ideally, you want high spatial resolution, high temporal resolution, and whole-brain coverage. And you just don't have that in any one method.
So actually, it's even simpler than an SVM. It's a correlation coefficient classifier. So it takes all of your training data. And it's also very similar to k-means. So it finds the mean feature vector for run, and the mean feature vector for walk. And then it takes a new data point and asks, which of these mean feature vectors is it most correlated with? It gets that class label.
AUDIENCE: For each [INAUDIBLE]?
LEYLA ISIK: For each sensor, so it correlates the vector of sensor activity with the mean vector of sensor activities for each class. Super simple-- all linear classifiers tend to work equally well. You can get some improvement if you do some things that regularize in smart ways. But this is much faster, so I do that.
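The correlation-coefficient classifier as described is simple enough to write out in full. This sketch assumes trials are rows and sensors are columns of the data matrices; the toy "sensor patterns" are made up for illustration:

```python
import numpy as np

def correlation_classifier(X_train, y_train, X_test):
    """Find the mean sensor-activity vector for each class, then give
    each test trial the label of the class mean it is most correlated
    with (Pearson correlation)."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Center each vector, then take normalized dot products
    Xc = X_test - X_test.mean(axis=1, keepdims=True)
    Mc = means - means.mean(axis=1, keepdims=True)
    corr = (Xc @ Mc.T) / (np.linalg.norm(Xc, axis=1)[:, None]
                          * np.linalg.norm(Mc, axis=1)[None, :])
    return classes[corr.argmax(axis=1)]

# Two made-up "sensor patterns", two training trials per class
train = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
                  [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
test = np.array([[1.0, 0.1, 0.0], [0.0, 0.0, 1.2]])
pred = correlation_classifier(train, labels, test)
```

Because correlation ignores overall scaling, a test trial that is a brighter or dimmer version of a class pattern still gets that class's label.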
The reason I'm using a linear classifier instead of a more complicated classifier is because I really want to see what information is present in the neural signals. So I really want to let the features drive the classification, rather than putting some fancy classifier on top of the neural data. Because a fancy enough classifier can solve the problem based on the image itself.
And so then finally, we wanted to look at the effects of form and motion on action recognition. So to do that, we did two different experiments. In one, I showed subjects just static frames in the MEG. I handpicked the frames for certain actions because things like eat and drink are kind of ambiguous when the hands are down here. So every frame was chosen to be very clear about what action it was. And this is the same case where you train on one view and test on a second view.
So here are the results. The blue is the within-view case and the red is the across-view. So what you might notice is that they both look quite a bit worse than the case where you have whole videos. And I think that makes sense. If you look at the behavioral performance, it also drops. People are still well above chance. They're doing this task at 75%.
But they were well in the 90s in the case of full videos. And the reason is there are some cases where, from certain views, you can be occluded. And so from just a single frame, it can be challenging to tell what action they're viewing. So I think this indicates that, especially for the invariance case, the motion information is really important to recognizing the action.
AUDIENCE: Is the onset also later?
LEYLA ISIK: So for the across-view case, yes. It's a bit hard to say, though, because once the accuracy drops a certain amount, the onset gets later. So if the slope of my line goes down because my accuracy is going down, then so does my onset point, where it crosses the threshold for significance. I think it'd be interesting to just do the behavioral experiment and see if people's reaction times get slower, but I haven't.
That's a good question-- you might expect that the video would take longer, because maybe you need to get to a certain informative frame before you recognize the action. And that doesn't seem to be the case. So that's what we thought-- maybe it's not until the person is here that your brain says, drink. But that doesn't seem to be what's happening. It really seems like the motion information helps you do it better and possibly faster.
So we can do the same thing. So I mentioned these Johansson point-light walkers. In 1973, the way he did this was to actually stick lights on the person, and have them run and walk in a dark room. But today, what we can do is ask people on Mechanical Turk to label the points of their different joints. So these are actually the same videos from our data set of somebody running, somebody drinking, and somebody eating.
So these are the results. And again, what you see is that within-view case is pretty good, but the across-view seems to be quite a bit worse. So it seems like not only is motion important for this invariant recognition, but so is form. So we think both of these play a pretty integral role in recognizing actions.
And when most people study action recognition in neuroscience, they do it from these impoverished stimuli, which I think have a lot of merit, because it lets you look at more controlled scenarios. But I think we really need to push towards understanding more realistic stimuli, because clearly, something different is happening when you look at videos.
Right, so to summarize, actions can be decoded as early as 200 milliseconds after the stimulus is shown. And this early representation is invariant to both actor and viewpoint. And with a feedforward hierarchical model, we can provide a mechanistic explanation for how this viewpoint invariant action recognition is happening so quickly in the brain. And it seems like both form and motion are also important for this invariant recognition.
So if we go back to this timeline of object recognition or visual recognition in the human brain, it seems like very shortly after you've computed an invariant representation for objects, you also have an invariant representation for actions. So I think this work really helped provide a timeline for these phenomena that we didn't have before in humans.
So just to summarize the talk overall, we showed that size- and position-invariant object recognition develops in stages, between 80 to 180 milliseconds, and that the size of the transformation determined how long it took for that invariant information to arise. In contrast, we showed that early action recognition signals were already view and actor invariant.
And what I think is most neat about this work is the same class of hierarchical feedforward models can explain the mechanisms behind both the object recognition and the action recognition, even though they don't necessarily seem immediately consistent with each other. So going forward, when you're watching this video, you're not actually saying, that's a person. They're walking. That's a sandwich. That's a table.
You're probably telling a story and trying to think of what's happening. So you might say, these are people getting food before a talk, or these are hungry grad students. Or you might be trying to infer, or even just subconsciously trying to recognize, what people's goals and emotions are. And I think this work provides a framework and a set of tools to start to get at these higher-level questions, which is what I'm interested in studying next.