As simple as possible, but not simpler: features of the neural code for visual recognition

Date Posted:  December 16, 2020
Date Recorded:  December 12, 2020
CBMM Speaker(s):  Carlos Ponce
  • SVRHM Workshop 2020

PRESENTER 1: Carlos Ponce received his M.D.-Ph.D. from Harvard and was then a postdoc with Margaret Livingstone and Gabriel Kreiman here at the Center for Brains, Minds, and Machines. He is currently an assistant professor of neuroscience at the Washington University School of Medicine in St. Louis.

His lab is interested in how different brain regions interact to solve motion processing and visual object recognition, and he asks this question using a combination of electrophysiology, causal perturbation techniques, and computational modeling. His talk is titled "As simple as possible, but not simpler: features of the neural code for visual recognition." Carlos.

CARLOS PONCE: All right. Well, thank you very much for the invitation. I have to say this is one of the funnest workshops I've attended, and the poster session started great. I didn't realize how much I would like being able to zoom into a poster, not having to reach over somebody's shoulder to look at it. So thanks for putting this together.

OK, so my talk today is about visual processing in the primate brain, specifically focusing on the macaque monkey and on object recognition. We study the macaque because their brains are so similar to ours and because we can access them using electrophysiology. The main problem that I think the macaque's brain, and in fact any visual processing system, has to confront is that the world contains too much information, and not all of it is that useful.

So for example, if you were to cross the street, you might be interested in the identity of the people in the street or the location of moving vehicles, and you might be less interested in incidental features like the color of a t-shirt or the particular shape that a tire makes against the asphalt. So the question that we're interested in is, what information are neurons in the visual cortex parsing and representing from scenes?

When I think about this question, I always go back to this paper by Fred Attneave, where he laid out a really useful concept. This is a paper from 1954 called "Some Informational Aspects of Visual Perception," which I think was directly inspired by Claude Shannon's information theory, which had been published just a few years earlier. In this paper, Attneave invited us to consider the problem of redundancy in visual images. Here he presented an image of what he called an ink bottle on the corner of a desk.

And what he said is that even though this image comprises 4,000 cells, it would be possible to reconstruct it from a blank page just by guessing the color of each individual cell, and he claimed it could be done with as few as 20 errors, simply by guessing the color in each cell and making some assumptions about continuity and symmetry in the image. I don't know if that's actually possible, but he did suggest that you could compress the image quite a bit.

This is the useful concept that I was talking about: the idea that images have parts where information is concentrated, such as contours. And he laid out the idea that you can take something as complicated as a sleeping cat and convey it using as few as 38 points of maximum curvature. Now, at the time, these results were all based on perception.

But they raised the possibility that if the visual world is indeed redundant, it can be compressed. And this was before actual neural data was available. As it turns out, just a few years later, Hubel and Wiesel did their fundamental studies demonstrating that in posterior visual cortex, particularly V1 and V2, there were neurons that indeed showed preferences for straight and curved contours. And by preferences, we mean stimuli that elicit more spikes. And so they identified orientation tuning and end-stopping in visual cortex.

These kinds of cells have since been found in a bunch of other species, from the cat to the macaque. They even occur in neural network models that are trained to be sparse and efficient, and of course in deep networks as well. So this combination of efficiency, information-concentrating features, and these neuronal responses really raises the question in my mind: besides straight and curved contours, what other information-concentrating features do neurons encode, especially once you get out of primary visual cortex and begin to explore some of the areas that Layla was talking about, for example inferotemporal cortex, where we also record?

So to think about this question, there are two other inspirations that we follow. One is Horace Barlow, who was also inspired by information theory and published a paper in 1961 called "Possible Principles Underlying Transformations of Sensory Messages," where he postulated a number of criteria, one of which was that these features probably had to be of significance to the animal. They had to allow the animal to perform its behavior better.

The other, more recent set of work came from Shimon Ullman, who postulated the idea that objects could be represented by a hierarchy of fragments extracted during learning from observed examples. He specifically described them as pictorial features of intermediate complexity that could be used to solve certain object classification tasks. So this is what we want to focus on in this talk: we want to understand what information-concentrating features are encoded by neurons in this visual recognition pathway.

We want to measure how complex they are relative to natural objects, classic image stimuli, and even representations in artificial neural networks. And following Horace Barlow's idea, we want to find out if they're of ecological significance to the animals. So let's do that.

So first of all, how do you identify these features? There are a number of ways to do it. What I'm talking about is basically feature visualization, and we want to do it in the primate brain.

I'm going to talk about one approach that I think is particularly effective. This is based on a paper that we published last year, while I was still in Margaret Livingstone's lab and working with Gabriel Kreiman. And this is the idea of using generative adversarial networks as part of the electrophysiological process, where we record from neurons.

And with this audience, I don't really have to spend a lot of time describing what these GANs are, but for the purposes of the talk, all we need to know is that these are deep neural networks that take a vector as input, of a few hundred or maybe a few thousand elements depending on your GAN, and then, through a set of convolutional layers, output an image of over half a million pixels, in color. And they can be really good at representing natural objects.
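
[To make the latent-to-image contract concrete, here is a toy sketch of a generator of this kind. The class, its dimensions, and the layer stack are purely illustrative assumptions, scaled down from the half-million-pixel outputs described above; this is not the specific network used in the talk.]

```python
import torch

class ToyGenerator(torch.nn.Module):
    """Toy GAN generator: latent vector in, small color image out."""
    def __init__(self, latent_dim=4096):
        super().__init__()
        self.fc = torch.nn.Linear(latent_dim, 128 * 8 * 8)
        self.deconv = torch.nn.Sequential(
            torch.nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8 -> 16
            torch.nn.ReLU(),
            torch.nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16 -> 32
            torch.nn.ReLU(),
            torch.nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32 -> 64
            torch.nn.Tanh(),
        )

    def forward(self, z):                       # z: (batch, latent_dim)
        x = self.fc(z).view(-1, 128, 8, 8)
        return self.deconv(x)                   # (batch, 3, 64, 64) color image
```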

And they're in fact so expressive because they're very good at representing not just class-dependent information but information in pretty much any natural scene. For example, this one by [INAUDIBLE] shows images that are pretty stylized, but you can recognize those cheeseburgers and water jugs. And there are more recent ones, including BigGAN from Google, that can also do a really good job of representing images and of giving you very gradual, continuous transformations between images.

So we started using this one, a 2016 version. And the way that we did it in this initial set of papers was to combine these GANs with electrophysiology. In a typical experiment, we would begin by taking a bunch of random input vectors, actually based on [INAUDIBLE] textures, and input them into the GAN.

That would create a bunch of random images that we could then present to a monkey, an awake, behaving macaque; record from its neurons; count the responses per picture; and then use a genetic algorithm to take those input vectors and the responses and come up with new candidate vectors that were meant to increase the firing rate of the neuron, in the same way that [INAUDIBLE] did. And when we tried that, we found that it was an effective approach. Here I'm showing you the responses from one experimental session of one neuron in inferotemporal cortex.
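
[Here is a minimal, runnable sketch of one plausible version of that closed loop. The population size, selection rule, crossover, and mutation noise are assumptions, and the GAN-plus-neuron stage is replaced by a simulated "neuron" whose rate grows as a code aligns with a hidden preferred vector; the actual experiments used a real generator and real spike counts.]

```python
import numpy as np

rng = np.random.default_rng(0)
POP, DIM = 40, 4096          # population size and latent dimensionality (assumed)

# Stand-in for the GAN + recorded neuron: firing rate is just the projection
# of the latent code onto a hidden "preferred" direction.
preferred = rng.standard_normal(DIM)
def simulated_rates(codes):
    return codes @ preferred / np.sqrt(DIM)

def evolve(codes, rates, n_keep=10, sigma=0.1):
    """One genetic-algorithm step: keep the top codes, recombine, mutate."""
    order = np.argsort(rates)[::-1]
    parents = codes[order[:n_keep]]
    children = []
    for _ in range(POP - n_keep):
        a, b = parents[rng.integers(n_keep, size=2)]
        mask = rng.random(DIM) < 0.5                     # uniform crossover
        children.append(np.where(mask, a, b) + sigma * rng.standard_normal(DIM))
    return np.vstack([parents, np.asarray(children)])

codes = rng.standard_normal((POP, DIM))
for generation in range(50):
    rates = simulated_rates(codes)   # in the experiment: GAN image -> monkey -> spikes
    codes = evolve(codes, rates)
print(f"final mean rate: {simulated_rates(codes).mean():.2f}")
```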

In black are the responses per generation to the synthetic images, and in green are responses to a fixed set of reference images that we use to track the stability of the experiment. As you can see, this neuron showed an increasing firing rate, again reminiscent of what it means to identify the preferred feature of a given neuron. And what we found was that these images comprised fragments that could also be identified in the preferred database images of the same neurons: for example, features corresponding to things you can find in faces, features corresponding to objects with a dark-light contrast, and even objects that weren't always so easily semantically describable.

So we thought, OK, this could be one way to get at these information-concentrating features. They're certainly meaningful to the neuron, even if they're not always semantically describable. We think that's an exciting place to be.

Now, for the rest of this talk, I want to refer to these GAN-derived images by a nickname. We're going to call them prototypes, because we think they're abstractions that neurons learn from the natural world. All right, so now we can ask questions.

How complex are these prototypes relative to natural objects, classic image stimuli, and artificial neural networks? And do they matter to the monkey? So that's what we set out to do, and this is the work that we've been doing in my lab for the past year and a half. These are the members of my lab, and I'll be talking about work done by Olivia, Mary, James, and [INAUDIBLE].

OK, so first, to test this hypothesis, we wanted to examine prototypes encoded across the visual system, particularly the ventral stream. So we implanted microelectrode arrays in two different monkeys, starting from the border of V1 and V2, then V4, as well as inferotemporal cortex. Then Olivia and [INAUDIBLE] ran the experiments, and we basically found that it worked across the ventral stream.

You can see one particular example here, the increase in firing rate for the synthetic images. If we take this change in firing rate over the course of the experiment and compare it to that of the reference images, that's what's being plotted on these axes: the response changes to the synthetic images versus those to the reference images. And as you can see, it worked well across both monkeys and across visual areas.

The first thing we noticed during the experiments was that neurons in more anterior parts of visual cortex took longer to generate their prototypes compared to neurons in earlier visual regions. We measured this by the number of generations needed to reach half maximum of the converged response. This, we think, is due to the fact that these areas are representing less common and more specific kinds of information.
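
[As a rough illustration of that convergence measure, here is one plausible way to compute generations-to-half-maximum from a per-generation response curve. Taking the converged level as the mean of the last few generations is an assumption, not necessarily the lab's exact procedure.]

```python
import numpy as np

def generations_to_half_max(mean_rates, tail=5):
    """Index of the first generation reaching half of the converged response.

    mean_rates: per-generation mean firing rates; the converged level is
    taken here (an assumption) as the mean of the last `tail` generations.
    """
    rates = np.asarray(mean_rates, dtype=float)
    baseline, converged = rates[0], rates[-tail:].mean()
    half = baseline + 0.5 * (converged - baseline)
    return int(np.argmax(rates >= half))

print(generations_to_half_max([5, 6, 8, 12, 18, 22, 24, 25, 25, 24]))  # -> 4
```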

So for example, here I'm showing you prototypes that were sampled from areas V1 and V2 in both monkeys, from area V4, and from inferotemporal cortex. I think the clearest similarity by eye is that the early visual cortex representations tended to correspond to a lot of lines and stripes, things that the generator populates its space with very easily. No matter where you go in the latent space of a generative adversarial network, it won't be hard to find contours.

What I've superimposed on these are the estimates of the center of the receptive field of the individual neurons. In contrast to V1, though, IT tended to have objects that were rounder, that tended to have eye-like dots, and even things reminiscent of arms, as well as fruit, in what we called our banana neuron. So, more specific. And it turns out, as [INAUDIBLE] showed in some experiments, that convolutional neural networks actually show the same pattern.

So we took AlexNet and used it in the exact same approach as in our experiments. Same search algorithm, everything was basically the same. And we found that indeed, the deeper layers took longer to generate their preferred prototypes compared to early layers.

We've been looking at different ways to interpret the content across different areas, but what I really want to focus on right now is complexity. One of the things that we noticed in these evolutions is that the prototypes seemed simpler, more concentrated. They really were like little image fragments or patches compared to natural images.

So we wanted to ask how simple these prototypes are. To do that, James Johnson, a postdoc, measured the compressibility of these prototypes using a simple discrete cosine transform, the kind used for image compression, which, as I think all of you know, is basically a Fourier transform using real values. It transforms the images into a set of weights corresponding to cosine functions.

So you can take images that have different levels of simplicity, like an object segmented from the background versus a texture, and look at the discrete cosine transforms for them. These are the weights on specific cosine functions.

Using these transforms, you can compute a compression ratio, which you can think of as the fraction of coefficients needed to retain a given percentage of an image's energy, for thresholds between 50% and 99%, averaged across those levels. That gives you a value you can use intuitively: an object like a hand without a background is more compressible, at 0.01, compared to an extended texture, at 0.33.
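
[Here is a minimal sketch of that compression-ratio measurement as I understand it from the description above, for a grayscale array; the exact energy thresholds, their spacing, and the averaging are assumptions.]

```python
import numpy as np
from scipy.fft import dctn

def compression_ratio(img, levels=np.arange(0.50, 1.00, 0.01)):
    """Mean fraction of DCT coefficients needed to retain each energy level."""
    coeffs = dctn(np.asarray(img, dtype=float), norm='ortho')
    energy = np.sort(coeffs.ravel() ** 2)[::-1]          # largest first
    cum = np.cumsum(energy) / energy.sum()
    fractions = [(np.searchsorted(cum, lvl) + 1) / cum.size for lvl in levels]
    return float(np.mean(fractions))

# Lower values mean more compressible, as with the hand (0.01) vs. texture (0.33).
```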

So now we wanted to ask, what is this value for the prototypes? One problem, though, in interpreting whatever number we get is that maybe the GANs already compress images quite well. I mean, they've learned to represent a set of natural images efficiently.

So to control for this, we took each individual vector that had been curated by a neuron and simply shuffled the elements of that vector, so that it's a plausible vector that could have been reached by the neuron but wasn't. We're going to call those random prototypes.
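
[That control amounts to a per-vector permutation, roughly like the following sketch; the latent code here is a random placeholder for an evolved code.]

```python
import numpy as np

rng = np.random.default_rng(1)

def random_prototype_code(code):
    """Shuffle the elements of an evolved latent code, preserving its values."""
    return rng.permutation(code)

evolved = rng.standard_normal(4096)          # placeholder for a neuron's code
shuffled = random_prototype_code(evolved)    # same statistics, different image
```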

And what we can do now is visualize them. They look more like this: more high frequency, a little more spatially extended. Here they are compared to each other, prototypes versus these random prototypes.

And now we can measure how compressible they are. Here I'm showing you distributions of the prototype compressibility for both animals, in white and black, compared to the random prototypes. As you can see, the distribution is shifted towards 0.

That means a lower compression ratio: the prototypes are more compressible. In fact, we find that they're about 20% to 30% more compressible than random prototypes created by this generator.

What's also interesting is that we then decided to contextualize this number by taking some of the iconic, important stimulus sets in visual neuroscience, going back through [INAUDIBLE]'s and Charles [INAUDIBLE]'s work. These were stimuli designed to elucidate one particular feature of neuronal function, like curvature. And we find that they are indeed simpler in terms of their compression ratios. They're meant to be. But I think it provides an interesting point of comparison as well.

Then we also examined the complexity of the representations in neural networks, and we found that they were actually less compressible, more like irregular textures.

Most of them floated around 0.2. So we thought that was an interesting finding, which I'll come back to in just a second. But we wanted one more way to measure the relative complexity of these prototypes compared to other things, and one of the tools we used was the mean-shift segmentation algorithm, which again is an off-the-shelf algorithm.

You can use it to segment images: you can take an image like this and, through the algorithm, come up with a segmentation, which in this case decided the picture is made of seven parts. We applied it to natural images, and as you can see, the number of parts it identifies for segmented objects is smaller than for textures. And when we apply it to the prototypes versus the shuffled prototypes, we find that the prototypes indeed have fewer parts compared to what the GAN generates on its own.
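
[A minimal sketch of this kind of part counting, using scikit-learn's mean shift on pixel position-plus-color features; the feature space, the spatial weighting, and the bandwidth choice are assumptions rather than the lab's exact pipeline.]

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def count_parts(img, spatial_weight=0.5):
    """Cluster (x, y, r, g, b) pixel features with mean shift; return cluster count."""
    h, w, _ = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.column_stack([
        spatial_weight * xs.ravel(),
        spatial_weight * ys.ravel(),
        img.reshape(-1, 3),
    ])
    bw = estimate_bandwidth(feats, quantile=0.1, n_samples=500)
    labels = MeanShift(bandwidth=bw, bin_seeding=True).fit(feats).labels_
    return len(np.unique(labels))

# Toy example: a light square on a dark background should come out as few parts.
img = np.zeros((32, 32, 3)); img[8:24, 8:24] = 200.0
print(count_parts(img))
```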

So the takeaway we're getting from this is that these neurons really are encoding prototypes that are simpler than what the generator produces on its own: in natural scenes, they seem to be responding more to segmented objects than to textures. And there's an interesting comparison here, as we've been reading: this generator was developed by [INAUDIBLE] with [INAUDIBLE] and Jeff [INAUDIBLE].

And what they're finding is that when you train networks to be more adversarially robust, less likely to be fooled by random pixel masks designed to change the classification, these visual networks tend to rely more heavily on shape compared to texture. They tend to detect smoother, pixel-wise more coherent patterns, so they discard a lot of the high-frequency stuff, and they represent more lower-level features. So we think that's an interesting point of comparison that we might want to pursue.

And then finally, the second thing that we wanted to understand, and I'm just looking at the time here, is the idea that Horace Barlow proposed: that prototypes should have particular significance to the animal. This means that we need some behavioral correlate that we can predict based on the prototypes. Now, there are a lot of different kinds of behaviors that one can teach a monkey to do. It's one of the best things about working with monkeys.

But what we really wanted to do was focus on a set of representations that are already there. We didn't train the animal to produce them; we're discovering an architecture of the ecological world that the monkey cares about. And so we also wanted a behavior that we discover rather than train.

One of the things that's interesting about working with macaques is that they love to just watch TV like the rest of us. So here we have monkey Alpha, and he's sipping juice right here as we just present images. I can give you an example of what that looks like.

So we can present an image here of a monkey holding a banana, with another little monkey in the back. And very easily, right away, you can see that what the monkey is looking at is the face of the baby, what the mother monkey is holding, and also, I guess, trying to check the sex of the monkey. So this is a very easy, intuitive way to understand what the animal cares about.

So what we decided to do was identify the parts of the images that the monkeys care about based on where they look, the hypothesis, which Olivia helped develop, being that these prototypes may be representing features that are salient in visual scenes. To do that, we showed thousands of images to the monkey and started to see if they had a relationship with the prototypes.

So we started with 5,300 images; you can collect these data very quickly. Then we found the image patches fixated by the monkeys: we took the first fixations on each image.

We used each fixation as a center point from which to extract a patch, sized based on the monkey's [INAUDIBLE] size. We could then feed these patches into AlexNet and compare them in the same space as the prototypes. So here's what this looks like in layer 6.

And as you can see, this is the distribution of points. Now, to contextualize the distance between the prototypes and the viewed patches, we also did a couple of other things. First of all, we fed in the random prototypes, the shuffled prototypes.

And then we also did a control where we took the eye movement behaviors and simply swapped the picture underneath, to make sure that the monkey's behavior actually mattered; we extracted patches from those swapped images. And when we input the prototypes, the shuffled prototypes, the patches viewed by the monkey, and the patches not viewed by the monkey into AlexNet, we find that indeed, the prototypes are more similar to the image fragments that the monkeys wanted to see.
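
[A minimal sketch of that comparison in a deep feature space, using torchvision's AlexNet. The layer choice (the first fully connected layer, often called fc6), the preprocessing, and the distance metric are assumptions, and `prototypes` and `viewed_patches` in the usage comment are hypothetical lists of PIL images.]

```python
import torch
from torchvision import models, transforms

# Pretrained AlexNet as a fixed feature extractor (eval mode disables dropout).
alexnet = models.alexnet(pretrained=True).eval()
prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def fc6_features(pil_images):
    """Activations at the first fully connected layer for a list of PIL images."""
    x = torch.stack([prep(im) for im in pil_images])
    with torch.no_grad():
        h = alexnet.avgpool(alexnet.features(x)).flatten(1)
        return alexnet.classifier[1](h)          # classifier[1] is the fc6 Linear

# e.g. compare mean pairwise feature distances between image groups:
# d = torch.cdist(fc6_features(prototypes), fc6_features(viewed_patches)).mean()
```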

They were, as a group, more related to natural scenes than the random prototypes. And now here I'm showing you the actual results. These are very small effect sizes, with correlations between 0.05 and 0.1.

But given that this is based on [INAUDIBLE] behaviors and just a few hundred prototypes, and that the error bars are smaller than the actual symbols, we think it's pretty reliable, and I think a good start to trying to understand these representations in visual cortex. So that's it. The overall summary is that we are finding that these neurons encode motifs that have a certain intermediate complexity.

We're calling them prototypes, but we can think of them as something like Ullman's fragments. They are less complex than standard neural network representations, and they are predictive of the monkey's looking behavior.

And I think it creates an interesting discussion point, something worth pursuing though we're not pursuing it just yet: the idea that convolutional neural networks that have simpler representations and are trained with the visual ecology of primates may ultimately be better models of the visual brain. And with that, that's all I have. Thank you very much.