From Zero to CNNs to Brains
Date Posted:
August 13, 2020
Date Recorded:
August 12, 2020
CBMM Speaker(s): Andrei Barbu, David Mayo
Speaker(s): Colin Conwell
Brains, Minds and Machines Summer Course 2020
ANDREI BARBU: So we wanted to keep with the theme of brains, minds, and machines, and have a tutorial that tries to walk you from knowing really nothing about machine learning, to doing some very basic things like linear classification, to the intuition behind modern convolutional networks and their limitations, which is something David will be talking about.
And then finally, Colin will talk about how knowledge from the brain, in this case the brain of mice, can be used in order to perhaps one day improve the performance of our networks, and definitely to try to understand what's going on inside the brain of mice. There are a few links here. We're going to share the slides and the various Colab worksheets we're using. But there's no need to follow along for now.
So machine vision is probably one of the most popular areas of AI and of machine learning these days. And amongst the many different tasks in machine vision, object detection stands out as sort of the flagship task that people are interested in. But that often gets reduced to the task of putting bounding boxes around objects of different categories. Of course, you or I have much richer knowledge of the world than simply knowing that there's a bounding box around a dog or a chair. But for better or for worse, that's how we, as a field, have defined object detection for now.
And object detectors are getting deployed in the real world. You may cross the street while an object detector watches you. That object detector may save your life. These days, they're deployed in actual cars. And the car may stop because that object detector detected you. Of course, with deployments also come failures. So in this case, there was a very tragic failure of computer vision about two, three years ago from Tesla, where someone was driving in an autonomous car.
And the object detector did not pick up a white-ish truck on a blue sky that had no clouds on it. And the car just slammed into the truck. You or I would have seen it, but the driver was relying on their autopilot and didn't have the reaction time once they possibly noticed the truck in order to do anything about it. So computer vision is extremely useful, but it still has some significant limitations compared to the abilities that humans or even animals have when it comes to seeing objects in our environment.
And so I wanted to start from the most basic fundamental building blocks of machine learning, and build up to what is inside one of these modern CNN-based object detectors. They all, more or less, look similar to one another. And then David will tell you when things fail. And Colin will compare them to the brains of rodents.
So at the heart of everything is the [INAUDIBLE] classification. You may get a collection of points, and these two sets of points cluster very nicely. A machine may see all of the points without knowing which labels correspond to which points. At training time it will receive a very clear map like this, and at test time it will have to decide which points fall into which cluster.
And one of the most straightforward ways to do this, that's been known for many decades now, is linear classification. In other words, you would try to find a linear decision boundary, where you say, everything to the left of my line is a circle and everything to the right of my line is, in this case, a cross. Of course that doesn't always work. In this case, that's not a very good decision boundary, and neither is this one.
But there are, for these sets of points at least, good decision boundaries, like these two. And there are some that we'd prefer: if you just look at the data, it looks like the red one is more robust than the dashed one in the center. And so what we're going to do is turn this problem into something that we can actually solve in PyTorch, and come up with a linear classifier to tell these points apart.
Now of course, points come at different distances from this line. Some points will be very close to it, and others will be very far. So you would really like to have just two answers: am I on the left side of the line or on the right side? And one way to do that is to basically squish the distance between the line and every point with a sigmoid.
So that points on the left of the line get squashed toward 0 and points on the right get squashed toward 1, regardless of exactly how far away they are. And that's exactly what we're about to do in a moment.
So we can think of this as an optimization problem that's two-dimensional. In other words, that central line, or any line, has an offset and it has a rotation. And I want to find the offset and rotation of my line that classify my points correctly. And if you try to operationalize the notion of classifying the points correctly, you can measure, how many points in my training set am I correctly classifying into each of the two categories?
And some combinations of offsets and orientations of your line are going to do a poor job and misclassify the vast majority of points. And others are going to do a great job. And so you get this energy landscape where you have peaks, which in this case are bad, and troughs, where you have good answers. And very often you have different local minima-- in other words, multiple answers that both do well. In some cases, one may do better than the other, but both may be acceptable solutions.
And the big hammer that we have for adjusting the parameters of these lines correctly is called gradient descent. And the thing that it does is, it lets you look locally at the shape of the landscape right underneath you. So if you imagine that you're at one of these points on this hill, and you look down at your feet, you can't see the whole landscape.
But just by looking down at your feet, you can see that the hill is tilted in a certain direction, and that if you were to take a step in that direction, you would be further down the hill. That's all that gradient descent is: it allows you to compute the gradient, in other words, the slope of that hill, and then take many small steps downhill. [INAUDIBLE] talked about that.
One of the things that you can do with this kind of setup, though, is separate points that at first may not appear to be linearly separable. So there's no straight line that will separate the black dots from the open circles here. But if I reproject these points into some other space enough times, eventually, the hope is that I'll be able to warp the space sufficiently so that I find a straight line that separates them. That's the intuition behind why we're going to end up applying multiple transformations to our points, and eventually, have a linear classifier on top of everything that tries to divide them into multiple categories.
So just to show you what this looks like in PyTorch, there's-- and again, we're going to share these Colab worksheets with you. There's some basic preliminary setup that you have to do, importing some basic libraries. We won't go over that too much. I just entered the coordinates of the points from that earlier slide.
And we can just show them inline. If you haven't used Colab before, or Jupyter notebooks, which Colab is built on top of, it's definitely a tool that's worth playing around with. It's very powerful to have inline graphs, to be able to share notes, and to have your code interwoven with your experimental setup.
So to get to the actual meat of the linear classifier, we're going to set up a very simple model. This is probably the simplest model you can possibly set up in PyTorch: a linear layer that takes two inputs and has one output. The two inputs are the x and y coordinates of every point. And the one output, well, we want to know essentially how far away this point is from our straight line. We're going to have a criterion, a loss that defines the shape of that landscape that we looked at a moment ago. In this case, we're going to use the binary cross-entropy loss. All this is going to measure is how many points are being misclassified as part of one class as opposed to the other.
So the optimal value for this is going to be some line that cleanly separates the two. We're going to set up some optimizer. In this case, it's stochastic gradient descent. And we're going to have some learning rate. For SGD, the learning rate corresponds to how large of a step you want to take. And if you remember that picture with the hills and the valleys, well, if your steps are very large, you may step over a valley into another hill.
And so very often, what people will do is start with the large step sizes at the beginning to get out of a very bad place in the optimization landscape, and then slowly anneal the learning rate downward. And these days, we have much more powerful optimizers that will do that for you, like Adam. But we're going to stick to just stochastic gradient descent here. We're going to just pick some learning rate.
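As a rough sketch of the setup being described here (variable names and the learning rate are illustrative, not the exact notebook code):

    import torch
    import torch.nn as nn

    model = nn.Linear(2, 1)      # two inputs (x and y coordinates), one output (signed distance to the line)
    criterion = nn.BCELoss()     # binary cross-entropy on the sigmoid-squashed outputs
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # the learning rate sets how large a step we take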
We can also easily plot the current model that we have. So I initialized this layer. And whenever you initialize a model, it'll have some random values inside of it, and PyTorch tries to intelligently choose distributions for those random values. You can see this totally random line is not a good classifier at all. It misclassifies about half the points, which is about as bad as you can get.
But we can try to adjust this line by running our optimizer. All these two lines do is reshape our points so that they're 1D arrays; they just concatenate the points from the two classes together. And we create another array that has the targets, the labels that we want to assign these points-- again a 1D array, where we have 0s for one category and 1s for the other category.
And if we want to run our model, all we have to do is pass the inputs to the model. Just call it like the normal function and you get your outputs. Pass it through the sigmoid, like we said a moment ago, to not care about the distance to this line, merely whether we're on the left or the right. Put it through your criterion, which is, again, our binary cross-entropy loss. And give the criterion the true label so that we know this is the prediction that we're making and this is the target. And we will know how high up in the energy landscape we are.
Now, all this does is compute where we are in that energy landscape. The other thing that we want to know is, what does the landscape under our feet look like? So which direction do we have to go in, in order to descend the hill, as opposed to ascend the hill or stay at the same location? And that's what the next two lines are doing.
For various optimization reasons that we won't go into, you have to zero out some state for your optimizer. And then you call the backward function. And what this backward function does is-- this is really the magic of PyTorch, this tool called automatic differentiation, that takes the function that you ran, takes this model-- and for that matter, the criterion and the model combined together-- and essentially analyzes it and propagates information backwards through it in order to figure out what the shape of the local landscape is under your feet.
The easiest way to think about this is, this function could do finite differencing. In other words, it could take your model and change the input just a little bit in different random directions, and see what the hill looks like if you had taken a tiny step up or down the hill or left or right. And based on that, it will tell you which direction you should go in in order to go down the hill. That's not actually what happens.
What happens is there's a little bit of calculus involved. And there are many resources for automatic differentiation online, so I won't go into them here. But that's the basic intuition. And then we use the direction that you've computed there in your optimizer, and you take a step down your hill. We can run this optimizer. This model runs very, very quickly.
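Putting those steps together, a minimal sketch of the training loop might look like this, assuming class0_points and class1_points are (N, 2) tensors holding the two clusters (hypothetical names, not the notebook's):

    # Stack the points into one array and build 0/1 targets for the two classes.
    inputs = torch.cat([class0_points, class1_points])
    targets = torch.cat([torch.zeros(len(class0_points), 1),
                         torch.ones(len(class1_points), 1)])

    for step in range(100):
        optimizer.zero_grad()                    # clear gradient state from the previous step
        outputs = torch.sigmoid(model(inputs))   # squash distances to the line into (0, 1)
        loss = criterion(outputs, targets)       # how high up the energy landscape are we?
        loss.backward()                          # automatic differentiation: the slope under our feet
        optimizer.step()                         # take one small step downhill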
You can see the loss goes down, not particularly quickly. We only ran it for 100 steps. But already you can see the line looks very different. And already, some of the points on the left are part of one class, and some of the points on the right are part of another class. You can see that if the line were to sort of move its way toward the middle, we would be in pretty good shape here.
It's also very good to be able to look at the parameters of your model. So we can actually investigate what parameters our linear layer has here. In this case, it's just a little matrix that you feed a 2D point into: it does a multiplication between this matrix and your vector, and then it adds a bias, adds this value to it.
So it's a little bit of high school algebra. You can take these values and figure out how to reproject this line. Essentially you're asking, what's the decision boundary that corresponds to my model? And that's one of the most popular visualizations for what your neural network is doing. As long as you're not in too many dimensions, find some interesting 2D or 3D region, and go back and plot what the decision boundary looks like, which for a linear function, is going to look, obviously, like a straight line.
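For a 2D model like this one, that algebra is short; a sketch, assuming matplotlib is imported as plt:

    import numpy as np

    w = model.weight.detach().numpy()[0]   # [w0, w1]
    b = model.bias.detach().numpy()[0]

    xs = np.linspace(-5, 5, 100)           # an illustrative x range
    ys = -(w[0] * xs + b) / w[1]           # solve w0*x + w1*y + b = 0 for y
    # plt.plot(xs, ys) then draws the decision boundary over the scatter plot.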
But as your network becomes more complicated, you're going to get the effect that we saw in the slides, where your decision boundary is all of a sudden going to take some interesting shapes, enclosed shapes, et cetera. And we can even watch this optimization happen in real time. We could run the function.
Oh, apparently Colab decided to disconnect me halfway through. But we could run the function and watch that line slowly move. And you can play with this at home if you want to try out different learning rates and what effect they might have, as well as different kinds of initializations.
So that just tells us about linear classifiers. That's literally the simplest thing we can possibly do. But of course, we would like to talk about object detection. And at the heart of modern object detectors is not just this linear operation; before it comes an operation called a convolution.
And that comes from signal processing, where you have a signal. In this case, you could imagine like an audio signal. And this is amplitude. And I have a filter, in this case, a box filter. And I slowly slide my filter along my signal, and I get a response for the filter to my signal.
We can play the same game with images. In this case, we have a kernel, which is the filter, an input image, and an output image. We take the kernel and slide it across every position in the input image. At every location, we multiply the values in the kernel against the values in the image and sum them all together. And that produces a single pixel in the output image.
If you just repeat this, you can perform a lot of interesting image transformations, including the most popular one that probably exists in every image processing tool in the world, which is blurring. If you want to blur an image, the way to do it is to use what's called a Gaussian blur. In other words, you create one of these kernels that has a lot of energy in the middle, and the energy tapers off toward the sides.
It's just a matrix. There's nothing special about it. But one way to think about this is that a blur involves taking some amount of information from the central pixel, and then allowing all the other pixels to have some amount of influence, kind of spreading that information out. And indeed, you blur an image if you convolve it with a Gaussian kernel.
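As a sketch of that operation in PyTorch, assuming 'image' is a grayscale tensor of shape (1, 1, H, W):

    import torch
    import torch.nn.functional as F

    # A 5x5 Gaussian kernel: lots of energy in the middle, tapering off toward the sides.
    coords = torch.arange(5, dtype=torch.float32) - 2
    g = torch.exp(-coords**2 / 2)
    kernel = g[:, None] * g[None, :]
    kernel = (kernel / kernel.sum()).view(1, 1, 5, 5)   # normalize, then shape for conv2d

    blurred = F.conv2d(image, kernel, padding=2)        # slide the kernel over every position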
But you can put whatever you feel like into these matrices, into these kernels. And going back, again, 30, 40 years, even further, one of the most basic operations that people discovered is the Sobel operator. This is a 3 by 3 matrix, or a pair of 3 by 3 matrices, that have a 0 response if they're applied to a uniform image, because if the values on the left and the values on the right are the same, they'll just cancel out when you sum them together.
But if there is more energy on this side and less energy on this side, or vice versa, that won't cancel out. You'll get a high response for this kernel. In other words, this kernel is looking for vertical edges, and this kernel is looking for horizontal edges. And that's when they're going to have maximal response on any 3 by 3 region of an image that they're being convolved with.
And if all you do is take your image, convolve it with the kernel on the left and the kernel on the right, and compute the magnitude of the response-- because you end up with two images at that point-- what you have is a Sobel edge detector. You take the image on the left, perform that operation, and get the image on the right. And you can imagine turning this into an edge detector. Or you can even imagine turning it into an object detector.
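A rough sketch of that edge-magnitude computation, under the same assumptions as the blur example above:

    import torch
    import torch.nn.functional as F

    # The two Sobel kernels: one responds to vertical edges, the other to horizontal edges.
    sobel_x = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]]).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(2, 3)

    gx = F.conv2d(image, sobel_x, padding=1)
    gy = F.conv2d(image, sobel_y, padding=1)
    edges = torch.sqrt(gx**2 + gy**2)        # magnitude of the response at every pixel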
Imagine for a second you're sitting in a factory, and you have to verify whether the parts on the left are correct as they pass by your sensor. Well, if you take a single good part and compute this map on the right, all you have to do is multiply every image with the map on the right. And whenever the multiplication turns out to have a high value, whenever the dot product turns out to be high, that means the image you're looking at is a lot like the image you had before. And when that's not true, well, maybe something went wrong with your part. That's probably the world's simplest object detector that you can imagine.
Well, people did this. But of course, you have a problem, because parts can move. They can be rotated. They can have parts to them, or subparts. And so the next big thing that people said is-- well, I'm skipping over a lot of work here, so this is a very selective history of object detection. But the next big thing was in the 2000s or so, where you take those edge maps that you saw earlier and you bin them. And in each bin, instead of storing the edges themselves, you just store the statistics of the bin.
So you say, what does the histogram of the orientation of the energy in different directions look like? And what that does is it buys you a certain amount of slack, because if this wheel is slightly rotated, it doesn't matter. Now I'm not asking, does the edge of the wheel perfectly match the edge of my filter? All I'm asking is, in this region where the edge of the wheel is, are the statistics similar to those of my filter?
And people ran with this idea. This is called the histogram of oriented gradients. They created what's called the deformable part detector, where you just apply this idea over and over again. So this is a deformable part detector for humans, where you can see that this is kind of the head. So you're saying that the edges around the head are sort of oriented in a roof-like shape. And there are arms. And the face kind of has edges all over the place.
And then once you apply this filter and you have some confidence that this might be a person, you can go on and apply even more fine-grained filters. So we can see now, instead of having really large bins for every part of the human, we have smaller and smaller bins. And of course, humans are deformable. So people said, well, how about we carve up the filter into pieces, allow those pieces to move around, and ask, what's the best location for the head filter for this human? And I don't want the head to be below the feet, so I'm going to put in a deformation map that only allows the head to move around a little bit. And you have a filter for the head and the arms and the legs.
This was the state-of-the-art in the 2010s or so. But it turns out that this is already a kind of deep network. We have multiple sets of filters we're applying. We convolve an image with some filters. And we're doing, essentially, a max pooling, where we're asking where the arms fit best on an image.
And so while modern object detectors are a bit more inscrutable than the ones that preceded them-- because you just have these rather complicated network diagrams, and it's difficult to look at the parameters that are learned compared to how interpretable they were in the past, particularly as the networks get deeper-- the intuition is exactly the same. You have an image. You have convolutions that are applied to this image, which are your filters.
You perform this max pooling in regions, where you say, I don't care where this filter succeeded. I just want to know, what was its maximum response? And you just repeat this over and over and over again. And this is the architecture of AlexNet. This is the first popular deep network that did well on ImageNet, convinced people to keep pushing on it, and is sort of the backbone of almost every modern object detector. They don't look very different compared to this.
And of course, at the end of all of this, we have that linear layer that I talked about a moment ago. So the basic story is, you take images as input. You propose some regions that are likely to have objects. You feed that image through a network that has lots and lots of layers. And you end up with this linear classifier at the top, because all of these layers did what that warping of the space did earlier on in the linear case, which is disentangle the space and make the images, or rather the representations of the images, linearly separable.
And in the Colab sheet I included a quick walk-through of AlexNet. There are many tutorials on it. It's worth having a look. But once you play around with the linear classifiers, it's really just more of the same. And PyTorch takes care of the heavy lifting for you.
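If you want to poke at AlexNet yourself, torchvision ships a pretrained copy; a quick sketch, with a random tensor standing in for a preprocessed image:

    import torch
    import torchvision.models as models

    alexnet = models.alexnet(pretrained=True)   # five conv layers, max pooling, then the linear classifier on top
    alexnet.eval()

    with torch.no_grad():
        x = torch.randn(1, 3, 224, 224)         # stand-in for a preprocessed 224x224 RGB image
        scores = alexnet(x)                     # 1,000 ImageNet class scores
        print(scores.argmax(dim=1))             # the predicted class index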
And this really made a huge difference in performance. This is performance on ImageNet. Since 2014, it's gotten much, much better. But unfortunately, the problem isn't solved. And that's what David will talk about next. If you feed in images that look unlike the training set, so unlike the images that were used to set the parameters of this network, the same way that we set the parameters of that linear function, these models can often get very confused.
So in this case, I put this image through a state-of-the-art captioning system published at the end of 2019. And it said, there's a man and a child sitting on a plane, because it's never seen images of babies trying to drive cars. And I'll leave it to David to show you how bad that problem is and what solutions there might be to it.
DAVID MAYO: OK, so I think Andrei gave a great overview of intro to models and intro to vision. And so now we're going to take a little closer look at what some of those actual data points in his graphs look like. And they take the form of images from various image data sets. Much like the rest of the history of machine learning, an important component has been building up better data sets. Large data sets have given us large amounts of training data and allowed us to actually create models that perform well at classification.
So here's a quick example in PyTorch of how we actually load many modern object recognition data sets into a Jupyter notebook. So the first thing you need to do to load the data is to basically create a transform. And the transform just says, how do I convert from an image in PNG or JPEG format into an image that I can plug into a model? This involves resizing it, cropping it, and subtracting out the mean based on the statistics of the entire data set. And there are many other preprocessing operations you can do here.
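A typical transform of the kind being described might look like this (the mean and std values are the standard ImageNet statistics; the exact notebook code may differ):

    import torchvision.transforms as transforms

    transform = transforms.Compose([
        transforms.Resize(256),                            # resize the shorter side
        transforms.CenterCrop(224),                        # crop to the model's input size
        transforms.ToTensor(),                             # PNG/JPEG pixels -> float tensor
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # subtract the per-channel mean
                             std=[0.229, 0.224, 0.225]),   # divide by the per-channel std
    ])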
Once you have that transform, you can load your data and apply the transform to it. And so these are several different data sets-- CIFAR-10, ImageNet, COCO, and ObjectNet, which is a data set that I created. And I'm going to walk through a quick random sampling of images from each of those data sets so you get a sense of what they look like.
This object here is called a DataLoader, which is very useful in PyTorch and allows us to quickly load batches of data in parallel. So CIFAR-10 here was an earlier data set. As you can see, it's a little lower resolution. But you very quickly get a sense of what the data set's characteristics are.
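As a sketch of loading one of these data sets and batching it (CIFAR-10 shown because it downloads easily; it reuses the transform defined above, and the other data sets follow the same pattern):

    import torchvision
    from torch.utils.data import DataLoader

    cifar10 = torchvision.datasets.CIFAR10(root='./data', train=False,
                                           download=True, transform=transform)
    loader = DataLoader(cifar10, batch_size=64, shuffle=False, num_workers=2)

    images, labels = next(iter(loader))   # one batch of images and their labels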
For example, there are many dogs here, and cats and animals, and the objects are all centered and fairly large in the field of view. ImageNet was created by web scraping Flickr, and so many of these images look a little more like stock photos or are taken to be a little more artistic-- for example, this picture of a diver holding up the sea creature or a [INAUDIBLE].
COCO was created mainly for doing segmentation, but it has much more cluttered scenes, because they have many instances of objects inside their scenes. Although they have a fairly narrow set of objects, they do have a much more complex scene to recognize. So it's easy, looking at each one of these-- if I gave you an image, you'd be able to tell me which data set that image came from.
So we created this new data set called ObjectNet, which gives you entirely new variation in your test data, including things like the rotation of the object, the room that the object was photographed in being decorrelated from the object itself, and the angle that you see the object from. And it's really important to have each of these things decorrelated in order to really stress test your models. Because models turn out to be really great at this ImageNet task, where many images have correlations, such as kitchen chairs always occurring next to kitchen tables, or things like that.
So by decorrelating each one of these, we create a data set that looks really different. It's simple, kind of like CIFAR-10, in that all the objects are centered in the frame, large, and easily human-recognizable. But it provides a different kind of difficulty than COCO does, in that we've decorrelated things instead of adding clutter.
I've also included in this tutorial validation accuracy code that you can try out yourself. This is fairly simple. It just loads in data in batches using these different data set loaders, and then lets you get top-1 and top-5 accuracy, which is what we typically use in vision research-- basically allowing models either one guess or five guesses as to what the correct answer is for what's in the image.
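One way such a validation loop might compute top-1 and top-5 accuracy (a sketch, not the exact tutorial code):

    import torch

    def validate(model, loader, device='cpu'):
        model.eval()
        top1 = top5 = total = 0
        with torch.no_grad():
            for images, labels in loader:
                images, labels = images.to(device), labels.to(device)
                scores = model(images)
                _, pred5 = scores.topk(5, dim=1)           # the model's five best guesses
                correct = pred5.eq(labels.unsqueeze(1))    # which guesses match the true label
                top1 += correct[:, 0].sum().item()         # the best guess was right
                top5 += correct.any(dim=1).sum().item()    # any of the five was right
                total += labels.size(0)
        return top1 / total, top5 / total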
I'll just jump ahead here to the actual computed results, which compare the ImageNet data set and ObjectNet, the data set I created. And what we found through building this and testing our models is that there's a large absolute performance drop when you actually control for a number of difficult parameters, such as rotation and viewpoint. And this large performance gap shows that our models still have a long way to go to reach human-level performance. So Colin is going to talk a little bit more about how we can compare these models, when they're looking at images, to the mouse brain, maybe in hopes of being able to bridge this gap and build more biological, human-level, or at least mouse-level neural networks.
COLIN CONWELL: So if you'd like to follow along with this tutorial, there's a very simple hyperlink you can go to here-- bit.ly/neuralcheese, which will lead you to a Google Colab notebook called Deep Mouse Trap. So the first thing that we're going to do here is we're actually going to load some tools that we've made for facilitating this process of neural comparison from GitHub.
So we'll run this cell first. If you are not particularly familiar with Google Colab, notice that there is a directory structure over here. So this is what we'll be creating. We'll be creating a directory, which we'll then be changing directory into, so we have all of our tools available to us.
So we've talked so far about deep neural networks, how to build them and how to test them. And one way of testing them is to actually see how well they do at the tasks they're assigned to be doing. But we can also test them by seeing how well these biologically inspired models can actually predict the biology. So what we'll be looking at today is what's called a neurophysiology data set.
So this is a data set that consists of electrical and optical recordings from an actual animal brain, in this case, the rodent visual cortex. And we'll be looking at how we can use the representations and knowledge learned by neural networks to actually predict what's going on in the brain, and explain it to some degree. And simultaneously while doing that, we can think about how well we could build the next generation of models that could predict the brain, and in so doing, models that will also perform better at real world tasks.
So we're going to go ahead and load some resources here. The first thing that we obviously want to do with any sort of comparison to the actual biological brain is to actually get some biological data. So that's the first thing that we'll be doing here. The files that we'll be loading are two files, one that contains a bunch of metadata for each of the actual neurons that we're recording from in the actual mouse brain, and also the actual responses from those neurons. We have some information about where those neurons are located, what those neurons do, and also how they've responded to a set of natural images, which we'll look at in one second.
So the first thing that we have is information about the neuron. So some things to take note of here, some major information that we need about the neuron is where it came from. So for example, there are different parts of rodent visual cortex. You've all probably heard of the primary visual cortex in the mammalian brain, so that's the first part of the neocortex, into which visual information flows after hitting the eyes.
And that also is recapitulated in what's called primary visual cortex in the mouse brain. So once again, we have some imaging depth. So this tells us where, basically, in terms of the actual cortical depth, the neuron is being sampled from, and information about the area as well. There's a bunch of metadata here, which if you're interested, there's some more information from the Allen Brain Observatory from whence this data set came, about these metadata variables.
So after we get our overarching data about each individual neuron, we then also need to look at the neural response arrays. So in this case, we'll be loading a data frame that is, in dimension terms, the number of neurons by 119. Now, what do these 119 columns correspond to? Well, they correspond to the mean response of each of the neurons in our data set to the 119 images in our stimulus set, which I'll load now.
What do these images look like? Well, they are 119 grayscale images of various natural scenes, sometimes containing animals, sometimes not. So we've loaded our neurophysiological data. And now the question is, how can we use deep neural networks to predict the responses of the neurons in this data?
The first thing that we have to do in order to facilitate this process is to actually extract from the network the representations for these images in our stimulus set. So this is a process called deep feature extraction, or just feature extraction. And what we're going to do here is we're going to import some tools. And first, we are going to choose a model to use as our basic feature model.
So we have various options here. We have a lot of object recognition models, including AlexNet [? retrained ?] on ImageNet. We have also some randomly initialized models, so some models that are fully constructed but have never learned anything. We have various other models that do things like object detection and segmentation. And we also have included in this data set something called taxonomy models, which are the same architecture of models, but trained on different computer vision tasks. And in an empirical or scientific context, this is valuable in that, in many cases, we want to look at the difference between what differences in model architecture do and what differences in model training do. And we can do that using the taxonomy models.
So for today's purposes, we're going to be using ResNet-18 trained on ImageNet. Because we're using an ImageNet model and because we have a NumPy array as the basis of our stimulus set, we're going to use image transforms that basically convert our NumPy array into an appropriate input for an ImageNet model. And we're going to extract some information about this model, including a command that we can then pass to the Python interpreter in order to give us our pretrained ResNet-18 model. And then we'll load that model accordingly. If you are operating on a GPU, this will put the model on the GPU. If you're not, it won't.
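A minimal sketch of that model-loading step, using torchvision's pretrained weights rather than the repo's own helper commands:

    import torch
    import torchvision.models as models

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    resnet18 = models.resnet18(pretrained=True).to(device)   # moved to the GPU only if one is available
    resnet18.eval()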
So in order to pass our stimulus set into our neural network, we have to first make a PyTorch variable out of it. And we do that by first transforming each of the images in our stimulus set according to the image transforms we loaded above. Then we put those into an array and make that a PyTorch variable. We then pass those into the neural network via this feature mapping function. And all of the functions that have been defined here, you can find over here in the GitHub repo.
So for example, right now we're doing feature extraction, and there's a Python file that shows you the operations behind the feature extraction. This can take a while. It's somewhat computationally expensive to do this. And the main PyTorch tool that allows us to look at the internals of a neural network is something called a hook.
So what a hook does is it's basically a little key that we can use to open up the black box of the neural network and look at what's going on inside. And what's going on inside at any given moment for a neural network is obviously a bunch of matrix multiplications. So what we get out of the feature extraction process is a matrix. And in this case, ideally, what you want returned to you for a neural comparison scenario is a matrix in which you have, in the first dimension, however many images are in your stimulus set, and in the second dimension, the actual activations-- the flattened activations from whatever model layer you're looking at at a given moment.
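A hedged sketch of hook-based feature extraction (the repo's actual helpers differ; 'resnet18' and 'image_batch' are assumed to exist from the steps above):

    import torch
    import torch.nn as nn

    feature_maps = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # Flatten each layer's activations to (n_images, n_activations).
            feature_maps[name] = output.detach().flatten(start_dim=1)
        return hook

    handles = [module.register_forward_hook(make_hook(name))
               for name, module in resnet18.named_modules()
               if isinstance(module, nn.Conv2d)]

    with torch.no_grad():
        resnet18(image_batch)   # a single forward pass fills feature_maps via the hooks

    for h in handles:
        h.remove()              # detach the hooks when we're done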
So in this case, if we're looking at the first convolutional layer of ResNet-18, we have 119 images, and for each image we have about 800,000 activations. Now, what this should make immediately clear is that this is a very computationally expensive process. And so what we're going to want to do, for the purposes of this demonstration, but also for the purposes of analysis, is whittle this data down a bit. So for this tutorial, we're going to subset from this dictionary of feature maps every third convolutional layer of ResNet-18, which will give us about six convolutional layers in total.
And while we don't have too much time to go into detail today, what we first and foremost need to do before trying any sort of modeling with these features, is we have to reduce their dimensions. So in this case, we're going to be using a dimensionality reduction technique called sparse random projection. We're not going to worry too much about this for the purposes of the demonstration today because we don't have too much time. But you can, of course, find the process here in the SRP extraction Python file.
So this feature extraction process will take a moment. And while this feature extraction process is occurring, I'll start talking about the next section here, because we're a little pressed for time. So once we have our dimensionality reduced feature maps, what we're going to want to do is we're going to then want to use those feature maps in a regression scenario. So we're going to want to try and predict the actual representations in the brain using now the representations from our neural network. So notice now that once I've reduced the dimensionality of my feature maps-- I'll actually show this in vivo-- we now have a much more manageable array, which is 119 observations, so 119 stimuli, by 4,096 sparse random projections of the original 800,000 activations from the first convolutional layer.
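A sketch of that reduction step using scikit-learn's implementation ('conv1' is the name of ResNet-18's first convolution in the hook example above; the repo's own SRP code may differ):

    from sklearn.random_projection import SparseRandomProjection

    srp = SparseRandomProjection(n_components=4096, random_state=0)
    conv1_features = feature_maps['conv1'].cpu().numpy()   # roughly (119, 800,000)
    conv1_reduced = srp.fit_transform(conv1_features)      # (119, 4096)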
So with our features now reduced, let's actually try and predict some real physiological responses. So once again, because this is a computationally expensive process, we have a lot of neurons and we have a lot of activations, we're going to subset from our original 6,000 or so neurons a smaller data set of about 500 neurons. In fact, to make this run even faster, we're going to do 250 neurons.
So now we're going to basically pass these into a regression, where, as our y or our predicted outcome, we have the brain's 119 responses-- that is, the 119 mean responses to the 119 stimuli. And as our predictors, we have the 4,096 dimensionality-reduced feature activations for those images from the deep net. So we're going to pass those into a ridge regression. And we're going to let that compute.
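A minimal sketch of that regression, one neuron at a time, using scikit-learn's Ridge ('conv1_reduced' comes from the step above, and 'neural_responses' is assumed to be an array of shape (n_neurons, 119)):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_predict

    X = conv1_reduced                              # (119 stimuli, 4096 reduced features)
    scores = []
    for neuron in range(neural_responses.shape[0]):
        y = neural_responses[neuron]               # this neuron's 119 mean responses
        y_hat = cross_val_predict(Ridge(alpha=1.0), X, y, cv=5)
        r = np.corrcoef(y, y_hat)[0, 1]            # correlate predicted and actual responses
        scores.append(r ** 2)                      # square it for an R-squared-style score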
Now, while this computes, I'll just give a little bit of intuition about the kinds of information that we can extract from doing this sort of thing. So when you do a large-scale analysis of this kind, in which you pass many models into these functions and get their predictive power over the brain out of them, you can start building hierarchies of models that tell you a little bit about the design features you might want to incorporate into novel neural networks in order to better predict the brain.
So one thing that falls out of this, if we look at some actual results that we've calculated in the last few months, is that neural networks that do better at ImageNet do better at predicting the brain. So here on the x-axis, you have the top-1 accuracy for ImageNet, and on the y-axis you have a normalized R-squared score, which is about how well a given neural network does at predicting the brain, at least within a certain range. And you see that as neural networks get better at classifying objects, so, too, do they get better at predicting the brain.
Now, what else do we see? We see that deeper neural networks-- so on the x-axis here, we have the number of layers in the neural network. On the y-axis, we have the same R squared scores. We can see that, as we increase the number of layers in our network, we also are able to predict the brain better, usually with those later layers in the network.
This is also true for the number of features in any given layer. So if a neural network is sort of saturated with convolutional layers and has more features accordingly, we also get more predictive power for the brain out of that network. So these are the sorts of intuitions that you can derive from a large scale data analytic scenario, in which you're passing many models into a function that tells you how well they're predicting the brain.
So our function here has run. We've now run a sort of preliminary neural regression, as you might call it. It saved a bunch of our results to an output folder here, as you can see. So we have a bunch of CSVs, one associated with each layer of the ResNet that we've computed. We can load those back in, and we can now basically parse our results by the neural area each of our neurons was recorded from. And we can see how well, in general, our models are predicting those neurons across the brain.
And with that, I'll conclude, because I believe we're out of time. But we're happy, all three of us-- Andrei, David, and myself are happy to take any questions you may have. And we hope you'll use this tool kit and use some of the code that we've made available today to do your own research at the summer school or in various exploits moving forward. So please feel free to visit this GitHub repo, and also to use this Colab tutorial in the future.
PRESENTER: Great, thank you all very much. We do have a couple of questions. The first one is from Manuel [INAUDIBLE]. It's a question for David Mayo. Can you talk more about how you go about creating a new data set? Did you take the photos yourself? What are some of the interesting operational details or tips?
DAVID MAYO: Hi, Manuel. Yeah, sorry, I didn't get to cover that too much. So that whole project was called ObjectNet. And the data set was captured using a platform that I built, where Mechanical Turk workers actually used their smartphone. And we walked them through this process of capturing an image according to a label that we wanted. So we came up with a label, like capture an image of a chair upside down in your bedroom from above.
And then the smartphone, basically, has a web app that walks you through the process with an overlay, where you have to rotate your phone to the correct angle and be in the right place. And there's a bunch of different checks. And then you capture this picture and upload it back to us. So this project had about 6,000 people at home taking pictures of random objects in their house. If you're interested, check out objectnet.dev. I think it's maybe also in Andrei's slides and in my notebook for more info.
PRESENTER: Great, thanks. The next one is from Sasha [INAUDIBLE]. What purpose could convolution serve other than edge detection?
ANDREI BARBU: Well, I guess I'll talk about that. It's whatever operation you would like to apply. So people, for example, use convolutions inside speech recognizers in order to process signals. You can have 2D or 3D convolutions if you want to, say, find regions rather than just find edges. But at the end of the day, it's just some arbitrary function that you compute that allows you to apply that function over some signal, whether it's a 1D signal, 2D signal, or 3D signal.
If you look inside an image processing book, you'll see that a lot of the basic image processing technology is based around convolutions, and that your camera is probably carrying out 100 different steps or so to turn an image that it took from the photons that hit the CCD, to something that looks pleasant to your eyes. That includes everything from sharpening the image, which is just the opposite of blurring it, to despeckling it, to, say, attempting to remove bad pixels. Even, say, changing the hue of the image can be set up to be a kind of convolution if you want to. Essentially, almost every operation that you do in Photoshop, at the end of the day, is being done with a convolution, in addition to the fact that you can just learn some arbitrary function that might not be interpretable to humans.
PRESENTER: Great, thanks. We have a follow up from Sasha and Sophie. How is the score of prediction interpreted, as well as calculated?
COLIN CONWELL: I think this is relevant to the neural prediction component. So I believe we had two questions. One was about the SRP process, the sparse random projection process. And the other was about the prediction and the score calculation. The score calculation is actually very simple in this case. So for any given regression, we have as our outcome the actual neural response array-- so in this case, for one cell, that response array is 119 numbers. And we're trying to predict those 119 numbers.
And so we pass in the features from our neural network. We get 119 predictions, one prediction for each of the numbers for each of the responses from our brain data. And then we basically just correlate the predicted values from the neural network with the actual values in the brain. We square that, and that gives us an R squared score. So the score calculation there is very straightforward, no tricks or anything else involved.
Now, for the sparse random projections, the dimensionality reduction technique, this is probably most easily explained by putting it in contrast to principal components analysis, which is a much more popular and widely used method of dimensionality reduction. What principal components analysis is trying to do is find a lower-dimensional set of dimensions that capture the most variance, where each dimension is orthogonal to the dimensions previously used.
Now, sparse random projection is a bit different, in that it's not looking for orthogonality in the dimensions of variance onto which you're projecting. In this case, it's literally just taking random projections, so random dimensions, and it's projecting the data onto those random dimensions, as opposed to orthogonal dimensions. So those are the main, I'd say, principles of this, of the sparse random projection that we're using for dimensionality reduction.
And the reason that we use that, actually, is that, generally speaking, principal components analysis is way too computationally expensive to perform on activations from deep nets, especially some of the larger, or I'd say wider, deep nets like VGG-16, in which the features from the first convolutional layer can number in the millions. So sparse random projection is a way to make the process of dimensionality reduction more feasible in neural comparison scenarios.
PRESENTER: Great, thanks, Colin. We have a question from [INAUDIBLE] directed at you as well. Can you clarify what you meant by predicting the brain? What are the input and output?
COLIN CONWELL: Yes, OK, well-- yes, I should have probably clarified this at the beginning. The actual brain data that we're predicting in this case are neural responses. So when you engage in some sort of physiological experiment, what you're doing is you are assessing the responses of biological neurons in the brain to certain physical stimuli-- in this case, images. So the mouse in this case is in an electrophysiological or an optical physiological rig, basically meaning that we're either scanning its brain or we're putting electrodes down into its brain and recording responses that way.
And what we're doing is we're seeing, when the mouse is looking at a certain image, how are the neurons in its brain responding? And what that gives us, basically, for each image, is a number. So for example, this neuron, with the cell specimen ID of 0119, for the first image has, in this case-- now, the specific units here require a much more complicated tutorial. But the specific unit in this case is called dF/F, or the change in the fluorescence of the neuron.
Basically, it's looking at how brightly that neuron shone-- the neurons are genetically modified to fluoresce-- in response to this stimulus. And we have that same signature for every stimulus in our stimulus set and for every neuron that we've recorded from. So when I say predicting the brain, what I mean is that we're actually trying, for a given individual neuron, to predict these 119 values, using the responses of the neural network to these same 119 images. Does that clarify?
PRESENTER: Hopefully it does. Otherwise, [INAUDIBLE], please follow up with another question. Our next one is from Dev [INAUDIBLE]. If we run a modified version of this on an SNN, which more closely resembles the brain, what would you hypothesize for that? Would it be reasonable to expect significantly better results?
COLIN CONWELL: Sorry, could you clarify what an SNN is? I'm not entirely sure, actually.
PRESENTER: Dev, if you could follow up on that?
COLIN CONWELL: Sparse neural network, I'm assuming?
PRESENTER: He's saying a spiking neural network.
COLIN CONWELL: Yes, this is an active question. And this depends, I think, fundamentally on how you're modeling the data. So the data that I've shown you today is a sort of abstraction of the real physiological data, in the sense that this is the mean activity of a neuron. And when you're working with something like optical physiology or electrophysiology, you actually have the spikes available to you in the data set.
So my assumption would be that, if you're trying to actually model the individual spikes in a given neuron, that a network that actually incorporates into its design, into its responses, some sort of spiking scenario, would do a bit better. There have been some attempts at using spiking neural networks so far to actually model the brain. Now the problem is, generally speaking, that spiking neural networks are having trouble getting off the ground, in terms of the real world tasks that we would hope a neural network could do.
And when we see something, like in this scenario, ImageNet accurate models are doing much better at predicting the brain, the ideal would be that you also have a spiking neural network that can simultaneously perform a real world task, like object recognition, and also predict the brain. And I imagine if that you had a very highly performant spiking neural network, it would do a much better job at predicting the individual response profiles of neurons than one of these convolutional networks, which is very much an abstraction of the real biological processes going on.
PRESENTER: Great, thanks. The next one's from Miguel [INAUDIBLE]. How do you know which layers of the ANN to select to compare with the actual biological neural response, based on the type of computation expected in a given biological layer compared with an ANN layer?
COLIN CONWELL: So I think this is actually more of an empirical question, personally for me, than it is a theoretical question. So in this scenario, it's unclear exactly how the various different computations in a neural network actually map onto the computations that the brain is performing. For example, do you sample from a convolutional layer or do you sample from the ReLU layer that typically immediately succeeds a convolutional layer?
Now, the brain is doing some nonlinear thresholding. That seems to be clear. But it's not exactly clear where it's happening and exactly how it's happening. So whether the pre-nonlinear thresholded convolutions are better than the post nonlinear thresholded convolutions is, I think, an empirical question. What we've done in work like this is we've actually tried to figure out which layers, in general, are doing better at predicting the brain, where those layers are in the general depth of the network.
But I would say that, theoretically, there's probably an answer to this, but I don't have it. And therefore, that incentivizes me to try a more empirical approach to answering that question. But it's certainly still an open question. Andrei and David, if you want to hop in, please feel free.
ANDREI BARBU: Yeah, I think that's one of the interesting questions. Our networks don't look a whole lot like the brain. Networks that do object detection with 500 layers are not a whole lot like any piece of cortex that anyone has. And so I think, aside from not knowing whether you want to sample at the ReLU layer or anything like that, there's the question of, how many layers do you need from your network in order to be able to explain something about a single layer in an actual mouse brain?
But separately, there's a different approach that you could take and be more agnostic to this, and just try to compare the activation in an entire network to the activation that you recorded from your mouse, and then try to ask, what subset of this network might best explain what's going on? And sort of try to whittle away at your network until you're left with something that maximally explains your neural data. That's also something that we were trying to do in our paper. And certainly, we are not unique in attempting to do that.
PRESENTER: Great, thanks. The next question is from [INAUDIBLE]. It's a two-part. So you are basically using the convolutional architecture to extract relevant features from the input brain data. So are you using default weight from ImageNet training? If yes, then is it really extensible? And also, if you have 119 classes, have you tried using normal machine learning techniques such as Random Forest or XGBoost? Is their performance comparable?
ANDREI BARBU: Just to clarify, we don't use the neural networks to extract features from the brain. The networks are trained to do object detection, and they learn to extract whatever features from images are useful for the purpose of object detection. Then we pass images through those networks and record the activations inside those networks.
And we take those activations and we try to compare them against the activations that are recorded from, say, the brain of a mouse. And we do that by trying to find some linear transformation between them, as well as a few other techniques that Colin didn't have time to go into that are also in his worksheet. It is a good question to ask, whether ImageNet is ecologically valid for a mouse or not. Maybe their vision is totally differently tuned from ours, and they certainly probably are not detecting a whole lot of buses or airplanes or anything like that.
Their visual acuity is also very different from ours. And their vision is foveated differently from ours. That being said, we can still explain a large amount of what's going on inside their brains. And it seems like the determining factor is performance on ImageNet, rather than small changes that we make to the networks. So it could be that, if we had a better, more ecologically valid data set, we could actually get higher accuracy.