The visual alphabet 2.0 (32:57)
July 31, 2019
CBMM Summer Lecture Series
Carlos Ponce, Professor of Neuroscience at Washington University School of Medicine, describes a recent study that explores how neurons in inferotemporal (IT) cortex of the monkey visual system encode complex patterns for recognition. The video begins with a reflection on the discovery of orientation selectivity in the cat visual cortex by David Hubel and Torsten Wiesel, who won the Nobel Prize for their work. Neurons in IT cortex of the monkey visual system play a role in the recognition of complex objects. Dr. Ponce and his colleagues developed a novel technique for discovering the complex visual patterns that yield the strongest responses of these cells. The method combines a generative adversarial network (GAN) with a genetic algorithm to drive the evolution of rich synthetic images of objects with complex combinations of shapes, colors, and textures that are preferred by these neurons. The images sometimes resemble familiar objects and provide insight into the vocabulary of features encoded in IT cortex to support visual recognition.
Slides: Carlos Ponce's Slides (pdf)
CARLOS R. PONCE: I'm Carlos Ponce. I'm going to introduce myself, but first, I want to say thank you all for having me, and thank you to Mandana for the invitation to CBMM. I think it's really quite a delight to be able to talk to students about some of this work, and hopefully, you will enjoy it.
So, I was asked to tell you a little bit about myself. I was born in Mexico. I first came to the US when I was 13 years old. I learned English in high school, in Salt Lake City, Utah, where I also ended up going to college, where I majored in biology. And after that, I came to Harvard Medical School to get my MD/PhD, and I specialized in neurobiology; that's where I started understanding some of the problems that I'm about to describe to you. And then afterwards, I did post-doctoral work at Harvard and also here, at CBMM, which is how I got to know so many of the wonderful folks here.
So, last year, I started my own laboratory at Washington University in St. Louis. And as you guys think about applying for your PhDs or your MD/PhDs, I'm going to strongly encourage you to keep this wonderful institution in mind and feel free to ask me questions or send me emails about it.
For now, today, what I want to do is tell you about this research project that I started working on while I was a post-doc here at CBMM, and that I've continued to branch off in interesting directions in my own lab. And the project really revolves around the issue of visual recognition in the primate visual system. So we want to understand how neurons in the brain of an animal like this allow it to determine what it's looking at.
And to start, basically-- I know you guys are at the end of your program, so you probably know a lot about this stuff, but just to put us all on the same page: what we know is that object recognition is realized by a set of cortical areas in the brain that extend from areas V1, V2, V3, and V4 all the way to inferotemporal cortex, which is itself divided into posterior, central, and anterior parts. And all of this has been described as the ventral stream. And that's what gives rise to visual recognition.
And we know that these areas contain neurons that respond to specific patterns of shapes and color, in the receptive field, by emitting different rates of action potentials, of electrical impulses. And so, the main hypothesis is that with these responses, what neurons are trying to do is to signal, or rather represent, the presence of special features in the world. And some of these features can be simple, like edges, or they can be more complicated, like textures, or even things like faces, hands, or, some people think, even more semantically sophisticated concepts, like animate versus inanimate.
And the problem is that even though you have, in the primate visual system, billions of these neurons that can give rise to visual processing, the world's still way too complicated. There's way too many things, too many features, that these neurons might be able to learn to give rise to visual recognition. And so the visual system has to be particularly efficient about how it's going to allocate its neurons to learn features of the world. And those features should be important to the brain that hosts them. So if you are a neuron in a monkey's brain, you had better be learning about what's a predator, what's a conspecific.
And so the idea that the visual system then allocates its attention, its focus, to a specific set of stereotypical features is what a lot of people have described in the past as the visual alphabet. And this is a concept that I first read about in work by Leslie Ungerleider and Andrew Bell, and I really like it, because I think it encapsulates what we think is happening in the brain. And so the idea is that, just like in English, you can take 26 symbols and, with the use of grammar, allow them to express any thought. So we think that maybe the brain creates a subset of key neuronal representations that could allow the decoding of any visual scene that comes to your retina.
So there's been some really good progress in the past in trying to identify what neurons in, for example, the inferotemporal cortex are encoding, in trying to define this visual alphabet. And these are some examples from work done by Kenji Tanaka, one of my favorite scientists, and as you can see, he suspects that these are a lot of the-- 16, at least, of the specific symbols that the brain has learned to encode. And he believes that there could be as few as 1,500, just based on some features of the anatomy of inferotemporal cortex.
So the question before us, then, is how do we know what any particular feature-- what any particular neuron-- has come to represent about the world? And to tell you this, I think what I wanted to do is to go back to the very beginning, and revisit some of the classic experiments by David Hubel and Torsten Wiesel that really gave rise to the field of visual neuroscience. And just out of curiosity, how many of you guys know about Hubel and Wiesel? OK, great. So about half of you. So, Hubel and Wiesel were the founders of the first neurobiology program in the world, as I understand it, at Harvard Medical School, and they also earned the Nobel Prize, in 1981, for their work on visual processing.
And what they set out to do is that they wanted to understand what neurons in the primary visual cortex of the cat were encoding about the world. And so, their setup, their experimental design, which is basically what all of us have continued to use, is that we have an animal face a screen, or some other way in which we can convey visual information, while we introduce electrodes into the brain in such a way that we can record the electrical signals, the spikes and action potentials, coming from cells in these different parts of the brain.
So the goal, what they were trying to do, is first of all, identify the part of the visual field that the cell would respond to, which is called the receptive field. And they wanted to know which specific shapes made the cell fire more. They knew, because of work that their mentor, Steve Kuffler, had done in the cat retina, that spots were actually very good at driving the responses of cells in the retina. So what they wanted to do is to see if it would also work in primary visual cortex. They relate the story in their book, Brain and Visual Perception, which is quite a delight to read, and I encourage you guys to read it. As I said, they decided that the very first thing they were going to try was the spots to stimulate V1 cells, because it worked so well for their mentor.
And so they were using something called a modified ophthalmoscope, that basically looked like this-- very cool, old-school, 1950s technology. So the cat goes here, and then a signal gets-- an image gets projected directly onto the cat's retina. And the problem is that at first, nothing was working. They say that they couldn't get the cells to fire reliably, and they thought that their early failures were a matter of finding the right stimulus. And what really changed is that-- well, first of all, they were working really hard, day and night. They would take shifts to try and understand what the cells were trying to encode. And then suddenly, they found that there was one slide with a dark spot on it. And when they inserted it, the cell just began to fire, like a machine gun.
And what they realized is that it's not that the cell was responding to the spot; it was actually responding to the edge of the slide as they were introducing it. It created a very light shadow, and the cell began to respond to it. So what they found is that if they take an edge, put it in the receptive field of the cell at just the right orientation, the cell would fire more action potentials. And if you changed it to a different orientation, it failed to do so. So what they had found was orientation selectivity. And this is the basis of visual neuroscience. They found that these V1 neurons were responding to lines and edges, and therefore, we can interpret that as meaning that these cells are representing instances in the visual world where there are different kinds of orientations.
Now, this was the beginning of everything that you're about to see, that I'm going to tell you about. And I think it's one of the most-- one of the coolest stories in visual neuroscience, not just because of the progress that it made in revealing something about the visual brain, but also, because I think it provides a blueprint about how a lot of us can do good science. And so one of them, for example, is that they were explorers. They understood at the time that not enough information was known about the brain, and so they couldn't come to very narrow hypotheses just yet, they had to keep their minds open about what they would find in the brain.
The second thing is that they relied on previous successes-- on Kuffler's spots. And even though that wasn't what gave them their final success, it served as a springboard. So relying on what's worked before is a really good idea. And the other thing is that they succeeded because of a little luck and what they called bullheaded persistence. And here-- this is a quote from their book. And I actually thought that I would read you the rest of it, because it's really a good lesson that I learned early on.
And what they said was, "People hearing the story of how we stumbled on orientation selectivity might conclude that the discovery was a matter of luck. While never denying the importance of luck, we would rather say that it was more a matter of bullheaded persistence, a refusal to give up when we seemed to be getting nowhere. If something is there and you try hard enough, you may find it; without that persistence, you certainly won't." And I think that's a really good lesson for pretty much succeeding in science, so I wanted to give you that full expression.
Nevertheless, in their experiments they also laid out the problem that a lot of us would inherit in trying to understand the rest of the brain. As we continue to explore cells in the brain, how do we not miss that proverbial slide edge? Will we be as good as that? And the problem gets even worse as you move down the ventral stream. Because when you start in V1, neurons have small receptive fields. But as you move further and further down, towards inferotemporal cortex, the receptive field size gets larger and larger. And so that means that a lot of these cells can be tested with, and will respond to, very complex objects, like actual pictures.
And so, for example, Bob Desimone and others have found that in IT, these neurons can respond to pictures of monkeys, human faces, hands, places, artificial objects, you name it. And we know that they respond to them because-- and these experiments are by Bob Desimone and Charlie Gross-- they found that if you take a picture and then just begin to strip down the details, the responses also begin to go down, for both hands and faces. So these cells like complex pictures.
And the problem is that the more complex your picture is, the harder it is to summarize it in a way that makes you understand what the cells really care about. So for example, when you're dealing with edges, there's only a small set of transformations that you can apply to change them. With a line, you can change its orientation, you can turn it into a Gabor and change its spatial frequency, or change the color. But there isn't a simple parametric transformation that would allow you to take something like a face and convert it into a place or a hand.
And so, what do we do when we try to understand cells in inferotemporal cortex that respond to pictures? Basically, what we do is we'll take a lot of images and we'll divide them in a way that we can understand intuitively. So we'll say, all right, these are pictures of faces and places and artificial objects-- these are human-designed labels. And if the cells respond more to faces or to places or artificial objects, we'll have, like, face cells and we'll have place cells. And that's OK; that's allowed us a lot of good understanding about the brain. The problem is that if you go to many labs, everybody's got a different idea of what labels they should use to try and understand this part of the brain. Nobody agrees on how you can take a given picture and assign it to any one category.
And ultimately, neurons, in the monkey's brain, at least, don't really care that much about category. So even if you find cells that respond a lot to faces, you'll still find that if you present to them the right picture of a fruit or a clock, the cell will also fire, suggesting that the cell is not really obeying the categorical lines. It's rather responding to a pattern that is present in faces and also can occur in pictures of other categories. So again, we get stuck with the idea of finding the right stimulus. There's way too many things to try, limited amount of time to try it.
So what do we do? Hubel and Wiesel said, work hard, do the search. And we agree with that, but why should we do the search? Why don't we let the actual cell find the picture that it wants to see in the world? Let them do some of the work, for a change. And this basic idea is something that Charles Connor had basically started experimenting with in 2008, and in this particular paper, they were using evolutionary search algorithms with IT cells. And the idea is that they have a shape generator.
In this case, they used an OpenGL library that allowed them to create splines, which are basically bounded curves. So you can have 50 random shapes that you can present to the cell, examine how the cell responds to each one of them, and then take the most successful ones and have them vary and just sort of change shape. And then you can do that iteratively, over time, until you end up with a population of shapes that make the cell fire a lot. And that gets you closer to what the cell really wants to see.
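A minimal sketch of this kind of evolutionary stimulus search: here a "shape" is just a toy parameter vector (standing in for spline control points), and the score function standing in for the cell's firing rate is supplied by the caller. The population size, mutation scale, and selection rule are assumptions for illustration, not the values Connor's group used.

```python
import random

def random_shape(n_params=8):
    # A "shape" here is just a parameter vector (e.g. spline control points).
    return [random.uniform(-1, 1) for _ in range(n_params)]

def mutate(shape, sigma=0.1):
    # Vary a successful shape by jittering its parameters.
    return [p + random.gauss(0, sigma) for p in shape]

def evolve(score_response, pop_size=50, n_keep=10, n_gens=20):
    """Keep the shapes that drive the cell best, refill the population
    with mutated copies of them, and repeat."""
    population = [random_shape() for _ in range(pop_size)]
    for _ in range(n_gens):
        ranked = sorted(population, key=score_response, reverse=True)
        survivors = ranked[:n_keep]
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop_size - n_keep)]
    return max(population, key=score_response)
```

With a real cell, `score_response` would be the recorded firing rate for the rendered shape; here any function over the parameter vector works.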
So we thought, all right, this is a really good idea. So why don't we try to do it with an even larger shape-generator space? And so the idea was, what if you could take a map of the observable universe, like this one, and make it about shapes, in such a way that a cell could really just travel and go to the places where it wants to create-- where it can create the shapes that it has imprinted on and has learned to represent, whether they're realistic or not? And this, basically, is what the talk is going to be about: how we use this particular model, generative adversarial networks, as a way for the cell to create any of the shapes that it's encoded, so we don't have to guess what they are.
Any questions, so far? All right. So, I need to tell you about generative adversarial networks. Now, this is MIT, I suspect you guys have learned a little bit about them? Some yeses? Excellent, all right. I'll gloss over them a little bit, but if you have any specific questions, I can give you more. To explain generative adversarial networks, I have to go to a different part of visual recognition, and this is in machine learning. The machine learning community has also been trying very hard to design models that can, like the brain, classify objects in the world. So for example, there's the ImageNet Large Scale Visual Recognition Challenge, which happens every year, where teams compete by submitting models that can classify objects in pictures.
And back in 2012, a team led by Geoff Hinton, in Toronto, entered AlexNet, which is a convolutional neural network that reduced the error rate from 26% to 16%-- it was unparalleled at the time. It made a lot of news, and it launched a new revolution in visual recognition and in machine learning. So a convolutional neural network, really, is just a very deep neural network that has a hierarchically arranged set of simple operations, like dot products and looking for the maximum of different values. And they're arranged almost like the brain. And they take in, as an input, an image, and what they output is a sparse vector, a short vector, that can be interpreted as a probability distribution that tells you if, for example, there was a deer or a cat present in the picture. That's a very simple description of a convolutional neural network.
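In spirit, that forward pass is just stacked dot products, max operations, and a final normalization into probabilities. A toy NumPy version, far smaller than AlexNet, with random untrained weights and a hypothetical 3-class readout:

```python
import numpy as np

def conv2d(img, kernel):
    # "Valid" convolution: a dot product of the kernel with each image patch.
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool(x, size=2):
    # Keep the maximum value in each size-by-size block.
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img = rng.random((8, 8))                                        # toy grayscale "image"
feat = maxpool(np.maximum(conv2d(img, rng.random((3, 3))), 0))  # conv -> ReLU -> pool
probs = softmax(feat.flatten() @ rng.random((9, 3)))            # linear readout -> 3 "classes"
```

A trained network stacks many such layers and learns the kernels and readout weights; the shapes and operations are the point here, not the (random) values.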
But it's important to understand-- well, first of all, AlexNet has exactly the same form; we'll come back to its architecture a little bit more. But it's important to understand the convolutional neural network, because a generative adversarial network is sort of the opposite: you can enter a small vector, a short vector, into the model, and what it emits is a big picture.
So this could be like 100 or a few thousand elements, and it could give out a 256x256 color picture. So these generative adversarial networks-- you guys have probably seen a little bit of them in the media, but they're really just very good models. They can take sets of complex objects, like pictures, and sort of abstract specific information about them in such a way that they can create a probability distribution that allows you to sample, and come up with new examples, of what the network feels best characterizes that population.
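Stripped of all the learned structure, the generator interface is just "short code in, big picture out". A stand-in with a fixed random linear map shows the shapes involved (a real GAN learns its weights from data, so unlike this toy its outputs look like natural images rather than noise):

```python
import numpy as np

class ToyGenerator:
    """Stand-in for a GAN generator: maps a short latent code to an image.

    A real generator learns its weights from data; this one just fixes a
    random linear map so the input/output shapes match the description."""
    def __init__(self, code_dim=100, img_size=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(code_dim, img_size * img_size * 3))
        self.img_size = img_size

    def generate(self, code):
        flat = np.tanh(code @ self.W)   # squash activations to [-1, 1]
        img = (flat + 1) / 2            # rescale to [0, 1] pixel values
        return img.reshape(self.img_size, self.img_size, 3)
```

The dimensions here (100-element code, 64x64 output) are placeholders; the GAN used later in the talk takes a 4,096-element code.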
You guys have probably seen these pictures about-- this is a GAN that was created by NVIDIA. And as you can see, these are its interpretations of what faces look like. Of course, none of these people exist, these are all computer generated.
AUDIENCE: That's crazy.
CARLOS R. PONCE: Yeah, there's now actually an interesting website that you can test yourself to see whether you can distinguish a regular face from a GAN generated face. And as I understand it, some of those rates are about 75%, is the best that people can do. Wired just had an article, and I quizzed myself, and I failed. You can often tell the difference by little artifacts, like around the hair, do you see these high frequency artifacts? This is not a very good resolution, and you can use tricks like that. But otherwise, they are quite compelling.
And so, a lot of these GANs are being used in the video game industry; they're being used for a lot of applications in the private sector. But the cool thing about them is that some of them can give you a very, very good way to just kind of traverse all of shape space. So for example, this is BigGAN, which was just released by Google this year. And as you can see, you can actually interpolate between real-world objects and many unrealistic, but still fascinating, kinds of images. And so it's a pretty good way to parameterize the visual world.
Well, we thought this would be a perfect thing, then, to use with the brain. And this was the basic idea that our colleagues, Will Xiao and Gabriel Kreiman, both of whom are here at CBMM, came up with. They felt that, all right, can we take one of these GANs and connect it, put it in the same pipeline, with the responses of a cell in the monkey's brain in such a way that we can let it build, and again, search that space to come up with the picture that it wants to see? So the rest of the talk is a paper that we just published a couple of months ago. And these are the co-authors, they're all here at CBMM-- actually, Till has gone on to a very successful career in the private sector, in machine learning. But I was a post-doctoral fellow when we began this project.
So, what Will and Gabriel did is they identified a GAN that had been trained by Anh Nguyen and Jeff Clune at the University of Wyoming. This GAN just takes a 4,096-element vector and it puts out a picture. And this GAN had been trained to invert representations in AlexNet, layer FC6, and that's an important detail to remember when we talk more about the results. But when you enter just random numbers into this GAN, you end up with some weird textures, like this. But if you enter non-random vectors, you can come up with images that are fairly realistic and interpretable.
So what we wanted to do is test them with actual brains. So we recorded from monkeys, six different monkeys, that had arrays implanted in inferotemporal cortex, in different parts. In my lab now, we're doing it along the entire ventral stream. And the way that the experiments worked, on any given day, was like this. We began by taking a set of-- like Ed Connor did, a set of random codes, and we generated just random textures in black and white. And then we showed these to the neurons, with the monkeys, again, just kind of fixating. We ended up with response rates to each one of these pictures. And then we used these response rates to keep the best codes and then recombine and mutate the other ones. So this is what's commonly known as a genetic algorithm, because we're taking these input codes and then just mixing them, based on their fitness.
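One selection step of this kind of genetic algorithm over GAN codes can be sketched like this. The specific operators shown, uniform crossover and Gaussian mutation, plus the population and survivor counts, are illustrative assumptions; the experiment's actual settings may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def next_generation(codes, rates, n_keep=10, mut_sigma=0.2):
    """Keep the codes whose images drove the neuron best, then refill
    the population by recombining and mutating those survivors."""
    order = np.argsort(rates)[::-1]        # best firing rates first
    survivors = codes[order[:n_keep]]
    children = []
    for _ in range(len(codes) - n_keep):
        a, b = survivors[rng.integers(n_keep, size=2)]
        mask = rng.random(codes.shape[1]) < 0.5    # uniform crossover
        children.append(np.where(mask, a, b)
                        + rng.normal(0, mut_sigma, codes.shape[1]))
    return np.vstack([survivors, np.array(children)])
```

In the closed loop, each new generation of codes is passed through the GAN to render images, the images are shown to the monkey, and the recorded firing rates become the fitness values for the next call.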
That allows them to create a new set of pictures. And then we can show these to the neurons. So we can just do this for an hour or two, and while this is happening, the monkeys are just simply fixating. We present reference images-- natural-world images-- that allow us to keep track of the stability of the experiment, make sure that we're not losing the cell or anything like that, while the synthetic images are presented and evolve over time. And the first times that we started doing this, it was pretty amazing, because here, I'm showing you the responses of a posterior inferotemporal cortex cell. This is the mean firing rate, per generation. This is the response rate to the natural images, like this, and these are the responses to the synthetic images over time, over the different generations.
And as you can see, this cell began to respond to the synthetic images in ways that we'd never seen before with natural images. It was as if the cell was just sort of screaming, yes, yes, this is exactly what I've been trying to tell you! There is a little bit of adaptation over time, so we fit a curve to these changes in firing rate, and we plotted them here. We replicated this across multiple animals. And as you can see, this is the change in response to the synthetic images during the experiment, and this is the change in response to the control images. We found that pretty much every cell showed us large changes in its firing rate, meaning that something was evolving that these cells really cared about.
Not only that, we find that in a lot of these cells, a lot of the time-- most of the time-- the synthetic images that were evolving were eliciting stronger responses than natural images. And that's important, because of what I'm about to show you. So, let me show you the pictures that evolved. To illustrate them: for every one of these generations, we had about 30 synthetic images. So, we're just going to average all of those images, and I'm going to present each one as a frame. And this is going to be the average of the top five images, actually, for this particular IT neuron.
So, at first, this is generation 1, 2, 3, 4, at about every three minutes or so. And at first, what seems to happen is that a black spot gets captured, and it begins to slowly get elaborated in such a way that-- I don't know exactly-- to me, it sort of looks like a bit of an eye-type thing. But then suddenly, something else begins to kind of appear up in that corner. And over the next few generations, that spot gets color, it becomes a little bit more elaborated, until you end up with this. Just to compare, this is the response rate to these images right here. So this is a single cell in the macaque's brain from which we basically pulled out a representation of an image that it has selected from the world. I'm not sure if you guys see anything there.
Here's another example. Here's the evolution; here's what it ended up with. Same general region. So, as these images are evolving, a lot of us felt compelled to try and interpret them, just by looking at them. And that's not something we're really allowed to do, because this is not a psychology test; it's not a Rorschach test. So we had to come up with objective ways to try and interpret what these cells were evolving. And we did it a couple of ways. The first way is that you can take the same cell that gives rise to a picture like this, and if you have a very good monkey that likes to work long enough, you can actually, then, try and show as many natural pictures as possible.
And so, in this experiment by Peter Schade, for example, they showed over 2,500 natural images, and then we asked what firing rates-- you know, we sorted them based on the firing rate elicited by each one of the pictures. And this is actually the sort. So you can see that these objects were eliciting higher responses than the rest of these ones down here. And when you zoom in, you find that these are the pictures that this cell liked-- even though it's responding more to this particular kind of synthetic picture. So this is something that fits its template a lot better than this. But you can see now, by analogy, what it is that the cell is looking for. And these are pictures of mammals, monkeys, humans.
And all of them seem to have certain specific features. So for example, you have this black spot surrounded by this convex shape. You have a color, right, and a certain texture. You get this overall gist of an object, of a bounded object, in all of these cases. And importantly-- and this is something that I've started to notice now in my lab, too-- the cells are doing some sort of segmentation, like they're really trying hard to get rid of stuff in the background. They love to evolve things on this white background. The better the evolution goes, the less stuff you have in the background.
So, that was one way to try and interpret what these synthetic images are. The other way is that, as I mentioned, this GAN had been trained to invert representations in AlexNet-- specifically, in layer FC6. So one of the things you can do now is take this picture, along with 100,000 other pictures of the natural world, feed them into AlexNet, and then ask it, basically, what images in the world are closest to this particular picture? And what we found was that the pictures that, according to AlexNet, were closest to this were, again, pictures of mammals like this. Now, the cool thing is that all of these pictures are labeled, because of the competition.
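That nearest-neighbor step can be sketched as a cosine-similarity ranking in feature space. The feature vectors below are made up for illustration; in the study they would come from a network layer such as AlexNet's FC6.

```python
import numpy as np

def nearest_images(synth_feat, natural_feats, k=5):
    """Rank natural images by cosine similarity of their feature
    vectors to the evolved synthetic image's feature vector."""
    def unit(x):
        # Normalize so the dot product equals cosine similarity.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = unit(natural_feats) @ unit(synth_feat)
    top = np.argsort(sims)[::-1][:k]      # indices of the k closest images
    return top, sims[top]
```

Because each natural image carries an ImageNet-style label, the indices returned this way can be turned directly into the label histograms described next.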
And so therefore, we can actually create histograms telling us what the semantic words humans would use to describe these images. And when you look at the whole population of IT cells in this particular monkey, it was macaques that actually elicited the highest similarity to the synthetic images. And that makes a lot of sense to us, because macaques are highly social animals. So if they were to encode anything important about the world, it would probably be related to their conspecifics. Even things like this, by the way, the Windsor tie, these are pictures of humans wearing ties. So quite consistent.
So, if it's true that these approximations are real to the cells, then we can also take these images and present them back to the neuron. And when we do that, we find that indeed-- so, for this particular evolution, these were the nearest matches, these are intermediate, these are far. So we presented these back to the cells, and we found that indeed, neurons responded better to the pictures that were closest to the synthetic image, as per this neural network.
So, this was interesting to us, because now we feel like we're getting a little bit closer to this final story of the visual alphabet. But again, as I mentioned, we have to do a lot of work to try and interpret them. Right now, some of these explanations about what it is that these cells are evolving are a bit correlational. And they really could go a lot of different ways, but we're actually thinking that these representations must be important to the animal's survival. They must have some sort of cognitive relevance. And I think there are some clues that this is the case.
For example, some of these cells-- so, here I'm showing you four evolutions for two different cells, and this is the receptive field, in red. So this is the part of the image that makes the cell fire. And when you actually try to identify the natural images that are closest to this, per the cell, they tended to be pictures of the animal care staff that visit the animals pretty much daily-- so, people wearing masks. And so this might be evidence that these cells really are imprinting on information that is important, based on the daily experience of the animal itself.
And then, finally, the issue is that a lot of these shapes we can kind of relate to natural objects in the world. These are some evolutions that have been happening in my lab recently. You can still see, even in something like this, a bit of a face-ish type image, but who knows what it really is? Here's another one. But then, some of these guys-- I mean, I don't know what I would relate this to; maybe you can think of it as coming from the representation of an arm. It's hard to say. And then for some of them, again, the cell's just firing like crazy when it sees these things, but it's getting much, much harder to relate them to anything.
And I think that's interesting, because it suggests that we have been missing things that the brain encodes when we just simply use pictures. There's something else, a different language that these cells are using, and the interesting challenge is to try and get at them. But I said that I would come back to the Tanaka alphabet. And so indeed, even though they used a different kind of approach to identify what they think are some of these representations, I think that we're getting pretty close. I do think that we are advancing some of their work, using different technology. And so I'm hoping that over the next few years, in my lab, one of the things we're trying to do is find as many of these representations as possible. We want to explain and kind of relate these representations to things that monkeys actually care about, and we're using a lot of behavioral tasks to get to that point.
And then, finally, we want to know if these cells-- if these images are, in fact, related by some sort of parameter space, just like Hubel and Wiesel related lines through orientation transformations. So, yeah, I think that's pretty much the whole story. So just, in conclusion, we're using generative adversarial networks and genetic algorithms to get at these neural representations. They are very successful at eliciting responses from the cells. They are not photorealistic, which means that photorealism is not a necessity for the cell, although the style of the pictures may change a little bit as we try different kinds of GANs, like BigGAN and some of the other ones. But we think that they are important to the monkey, and we're working on relating that.
So that's the end of the talk. I just want to say thanks to my collaborators here at CBMM, with whom we started these projects, and now we're continuing them in my own lab. I want to thank CBMM for the opportunity to present this to you guys. And you guys, for being here. And again, as you guys think about applying to PhD programs, Wash U is fantastic, academically. And the best part about living in the Midwest is you can actually live in your own place-- it's fantastically affordable. I love that. So, yeah, that's it. Thank you very much.