Have We Missed Half of What the Neocortex Does? Allocentric Location as the Basis of Perception
December 15, 2017
All Captioned Videos Brains, Minds and Machines Seminar Series
Jeff Hawkins, Co-Founder, Numenta
Abstract: In this talk I will describe a theory that sensory regions of the neocortex process two inputs. One input is the well-known sensory data arriving via thalamic relay cells. We propose the second input is a representation of allocentric location. The allocentric location represents where the sensed feature is relative to the object being sensed, in an object-centric reference frame. As the sensors move, cortical columns learn complete models of objects by integrating sensory features and location representations over time. Lateral projections allow columns to rapidly reach a consensus of what object is being sensed. We propose that the representation of allocentric location is derived locally, in layer 6 of each column, using the same tiling principles as grid cells in the entorhinal cortex. Because individual cortical columns are able to model complete complex objects, cortical regions are far more powerful than currently believed. The inclusion of allocentric location offers the possibility of rapid progress in understanding the function of numerous aspects of cortical anatomy.
I will be discussing material from these two papers. Others can be found at
A Theory of How Columns in the Neocortex Enable Learning the Structure of the World
Why Neurons Have Thousands of Synapses, A Theory of Sequence Memory in the Neocortex
Speaker Biography: Jeff Hawkins is a scientist and co-founder at Numenta, an independent research company focused on neocortical theory. His research focuses on how the cortex learns predictive models of the world through sensation and movement. In 2002, he founded the Redwood Neuroscience Institute, where he served as Director for three years. The institute is currently located at U.C. Berkeley. Previously, he co-founded two companies, Palm and Handspring, where he designed products such as the PalmPilot and Treo smartphone. In 2004 he wrote “On Intelligence”, a book about cortical theory.
Hawkins earned his B.S. in electrical engineering from Cornell University in 1979. He was elected to the National Academy of Engineering in 2003.
PRESENTER: I'm very glad to introduce again Jeff Hawkins. I say again because he has been one of the long-term supporters of the Center for Brains, Minds and Machines and our vision of an engineering of intelligence based on the science of intelligence. That is, cognitive science and neuroscience.
So I introduced him the first time as a speaker of our then Intelligence Initiative seminar series in 2010. So that's seven years ago. He's the founder, as everybody knows, of Palm Computing and Handspring. And as such, he has been a legend in Silicon Valley for quite some time.
In 2003, he was elected as a member of the National Academy of Engineering for the creation of the handheld computing paradigm and the creation of the first commercially-successful example of a handheld computing device.
He has a deep connection with MIT. In its infinite wisdom, the MIT Computer Science admission office-- representing, I note, the other side of [INAUDIBLE] Street, not this one-- rejected Jeff's application to the AI Lab, and so made it possible for him to invent handheld computers and for us to have iPads and the like.
So Jeff wrote a book, which in the meantime is a classic book on intelligence-- that's 2004-- describing his memory prediction framework theory of the brain. He then started to maintain the belief that it's time for computer science to learn from the brain and for making computers more similar to the brain.
Jeff and I agreed then on the belief that the time had come for a new attack on the problem of AI, and that neuroscience would provide important cues. He wrote the initiative. This was the intelligence initiative, the precursor of CBMM.
The initiative is exciting. Over the last 40 years, I had seen many intelligence initiatives come and go, but the positioning and thought behind I squared-- that was the term-- intelligence initiative is the best I've seen. MIT is the ideal location for an initiative like this.
And since then, companies such as Mobileye and especially DeepMind, which were then just tiny startups when they participated in the MIT symposium Brains, Minds and Machines, which we organized in 2011, have achieved a lot of success in AI by using two main algorithms-- reinforcement learning and deep learning. And both of these algorithms were initially inspired long ago by cognitive science and neuroscience.
So because of this, when I'm asked what will be the next breakthrough in AI, of course I answer that I don't know, but that it is reasonable that it will also come from neuroscience. And it may well come from looking in more detail at the anatomy and function of the layers in each cortical area.
And this is what Jeff would speak about. The title is Have We Missed Half of What the Neocortex Does? Allocentric location as the Basis for Perception. Please join me in welcoming Jeff Hawkins.
JEFF HAWKINS: Thank you, Tommy. That was very generous. And it's nice to be back here. I do view MIT as really setting the agenda in the field that I like to participate. And I almost completely forgot about the fact that my application for a graduate program here was rejected many years ago. That's good. So I don't hold anything against you guys.
Anyway, so yes, this [INAUDIBLE] my talk. And I won't explain it other than I'll just jump right into it here. I just figured a few words about my company, because it's a bit unusual. Numenta is a small business in northern California. We're really like a private research lab.
There's 12 people. We're almost completely dedicated to neocortical theory, and scientists and engineers. We have a rather ambitious goal, which is to reverse engineer the neocortex. I'm not embarrassed to say that. It's an ambitious goal. It's achievable. We should all be working on it in one way or the other.
And our approach is a very detailed biological approach. We want to understand how the neurons and the circuitry, as we see it in the mammalian neocortex, what it does and what its function is.
We're not interested in ideas inspired by the brain. That can come after you understand how the brain works. So we really stick to the biology.
We test this empirically with collaborations in experimental labs and via simulation. And that's what I'm going to talk about today. We have a second goal which relates to what Tommy just mentioned here, and it's definitely second in our case, which is to enable technology based on cortical theory.
So I'm still a believer that the way we're ultimately going to get to truly intelligent machines is we're going to-- the fastest path there is to understand how the brain works. And to that end, we have a very active open source community. All of our stuff is very open, all of our source code. You can reproduce all of our experiments. And we believe this ultimately, this endeavor, whether its us or other people, will be the basis for machine intelligence as we will see it in the future.
OK, I just want to remind-- I know everyone here is in neuroscience, and you all know this, but I just-- I find it's a good idea just to review a few basics before I delve into this. Mammals have a neocortex. Non-mammals don't. In the human, it's about 70% of the volume of your brain.
This is my model. I carry it with me all the time. It's about this big in area, and it's about 2 and 1/2 millimeters thick. And what's most remarkable about the neocortex is the consistency of the microarchitecture you see everywhere you look. It's not 100% consistent, but it's remarkably consistent.
And so instead of focusing on the small differences, we really are focusing on the common elements we see everywhere. And so all the different regions of the cortex do different things. It appears-- and this was first proposed by Vernon Mountcastle many years ago-- that cortex is cortex.
And the way we see and the way we hear and the way we feel and the way we do language somehow is all based on the same sort of underlying fundamental architecture, which is just a remarkable thing to think about. But it appears to be true.
And Vernon Mountcastle also basically proposed, he says, well, the way to think about the neocortex is just think about one little section of it that goes through that 2 and 1/2 millimeters. He called it a column. And he says basically, in that column, you're going to have that central function.
So the goal is to really understand what a column, a single perhaps a millimeter square by 2 and 1/2 millimeters of cortex does. And if you can figure that out, you've got most of it figured out. So that's what we're going to talk about today, a cortical column.
Now, if you open up a basic textbook, introduction to neuroscience type of thing, you'll see a picture like this. And they'll say, oh, there's a bunch of layers in the cortex. Input arrives into layer four. Layer 4 projects to layer 2/3. Layer 2/3 is the output. Goes to the next region, and then layer 2/3 projects to layer 5, and that projects to layer 6. That information flows through the cortical columns.
It's actually not bad, but it's leaving out quite a bit. By my count right now, we deal with relatively about 12 different cellular layers. Layer 3 is easily divided into two. Layer 5 is three different cell types. These may not be visible layers. It doesn't mean the cells are actually stratified. But they're cells of different anatomy or morphology or physiology that can be uniquely identified.
Layer 6 is a very complicated layer. It has these two, layer 6a and 6b, or sort of these very interesting layers. And it's got a bunch of other cells down below there. If you just follow, for example, the same as we did on the left there, the feedforward circuitry gets complicated, too.
So there are actually two inputs to every cortical column, especially not the primary ones. Sometimes you have connections directly from other cortical regions, and sometimes they go through the thalamus into there. So there's two sort of feedforward inputs.
They do arrive at layer 4, among other places. But they only form about 10% of the synapses on layer 4 cells. About 50% of the synapses on layer 4 cells are shown on this blue [INAUDIBLE] through this very kind of unusual bi-directional connection between layer 6a.
So if you're going to understand what layer 4 is doing, you can't ignore what layer 6a is doing. Because it's providing about half the input there. Indeed, layer 4 projects to layer 3. That's the output layer. Goes direct to other cortical regions.
But layer 3 also projects down to layer 5, and here you see a very similar type of circuit. Between layer 6b and one of the layer 5's, you have a similar sort of parallel structure going on there, where there's a very characteristic bi-directional connection.
Then that projects to upper layer 5. Or at least in some species, it's upper layer 5, but it's the layer 5 thick-tufted cells. And that becomes the second output of the cortical column, and that is the one that goes through the thalamus. So it is like these two sort of inputs and two outputs, and there's this complicated circuitry going on between.
Now, there's a lot unknown about the cortical anatomy. I'm not going to go through it. But we can summarize a few things here. We can say cortical columns are complex. They're very complex. At least 12 or more excitatory cellular layers. There's two feedforward pathways. There's at least two feedback pathways. I didn't show them here. And there's numerous connections up and down the column and between columns.
And then of course, there's an entire inhibitory circuit, which is at least as many cell types and equally complex. So this is a very complex system here.
Now, the function of this thing is also going to be complex. It's not going to be simple. So anybody who says, oh, it's a filter, it's changing this or changing that, that doesn't seem to be the case. We should expect this thing to do a lot.
And in some sense, we're looking at-- and this is the thing that makes us think. This is the source of everything. In fact, whatever a column does has to apply to everything the cortex does, because this is the circuitry of the cortex.
So we might think about, oh, how is this going to touch, or how am I going to see with this? But it's also going to explain how we do language, and it also has to say something about how we do neuroscience and how we build buildings, and so on. So it's something really remarkable.
Now, I have two thoughts about this, before I get into the details of my talk. One is I just want you to remind yourself, this is one of the most important scientific problems of all time. It's worth stating that. It's worth remembering that. It's up there with discovery of genetics. It's really kind of the core of who we are as humanity.
And it's the only structure that knows things. This is the only structure that discovers things. And of course, it defines us as a species. So it's a really very important thing to work upon.
Now, I'm going to-- I've been working on this problem for a long time, like many of you. And what we've been doing is we've been sort of teasing apart pieces of it and trying to understand a piece, and then we find another piece, and we try to fit those two pieces together and so on.
And lately, we've had some success in getting those pieces. We started putting them together in interesting ways. And actually in the last month-- less than a month-- we discovered another piece, even after I set up this talk. And also, a whole bunch of stuff fit together really, really well.
And so I'm going to tell you about that. It goes beyond the abstract I mentioned today in the talk. At the end of my talk, I'm going to give you explicit proposals about what many of these layers are doing. I'm going to be filling in a diagram here explaining what's going on here, at least our hypothesis for that.
It won't be everything, but it's going to be an interesting foundation, and I'm going to make the case for that. Now, to do that in the time I have allowed, I have to move quickly through a whole series of concepts.
And typically, when you give a scientific talk, you explain one concept, and you explain how you did it, and what didn't work, and your experience, and blah, blah, blah. I don't have time for that. I want you to understand that everything I present you here is not just made up. It was a lot of work, a lot of testing, a lot of-- it took a long time.
And I have a lot of confidence in it, but I can't present the data to explain that, why I have that confidence. So I just want you to at least give me the benefit of the doubt that later, when you ask me questions, I can go into any detail about this stuff in great detail. But I'm trying to tell a story here today, and I want to get to that end picture.
Now, the way I'm going to tell the story is the way we discovered it, the way we went about our work. It may not be the best way, but it's the way I know. So I'm going to start at the beginning.
The beginning, all of our work was based on a single observation. The observation is the cortex is constantly making predictions of its inputs. Every time I feel something, I have an expectation what I'm going to feel. And that expectation is a very detailed prediction. As I move my hand along this lectern, if even the slightest little dip here, I would notice it, it would catch my attention. Or if it felt a little funny, if it felt like jello, or cold or something.
So I have this-- that tells me if I notice changes, I must have had an expectation what it was going to be. And the same thing as I move my eyes. I'm constantly predicting what I'm going to see, or trying to. And the same with audition. You're constantly trying to predict what I'm going to say or what you're going to hear.
So we asked ourselves the question, OK, our research paradigm has been how do networks of neurons, as seen in the neocortex, learn predictive models of the world? It's not that the cortex is only building-- doing predictions, but it seems to be a fundamental component of what the cortex does. And if we tease apart prediction, we might understand what some of the functional components underlying that are. So that's what we went about.
Now this question-- this research question-- can be broken into two parts. If you think about the patterns that are coming into the brain, you've got the sensory streams, millions of sensory bits coming into the brain that are changing all the time.
Why are they changing? Two fundamental reasons. Either the world itself is changing-- and I'll call that extrinsic sequences, like you're listening to a melody. And you're learning the sequence, and it's the pattern in time that matters. That's one form.
The second form is when you move yourself. And you're doing this constantly. Every time-- you move your eyes several times a second. Every time you touch something, every time you walk around a room, there's a flood of changes coming in.
And it's been known for a very long time, back to Helmholtz, that you can't really understand the world of those sensory inputs if you're not accounting for the behaviors that go with them. So it's the sensory motor sequences that are leading to those. And so that's part of the problem. So we started with the first one, and then we tackled the second one.
So on the first one, we had a paper that came out in March of 2016 called Why Neurons Have Thousands of Synapses, a Theory of Sequence Memory in the Neocortex. And in there, the big idea is we suggested that every pyramidal cell is actually a prediction machine. And the vast majority of the synapses on the pyramidal cell are actually used for prediction. I'm going to walk through that.
Then we showed if you took a cellular layer, like you might say one of the layers in one cortical column, that a network of those neurons would learn a type of sequence memory, a very powerful sequence memory-- a predictive memory. And we also had them introduce the [INAUDIBLE] sparse activations to understand that. So that's in that paper.
Then we just had a paper come out in October of this year called A Theory of How Columns in the Neocortex Enable Learning the Structure of the World. In that paper, the big idea is we deduce that every column, every-- you think of it-- we will talk mostly about primary and secondary sensory columns. But ultimately, I think it will be every column.
We deduced that it must have a sense of an allocentric location. And I use the word "allocentric" in a very broad term. It just means other. I'm not using it in the term specifically as people who study, like, grid cells do and so on, like that. But really you can think of, when I say allocentric-- this is tripping some people up today-- you can think of it as object-centered.
So when I touch this little clicker here, when my finger feels something, I'm arguing that the column that's receiving the input from my finger is also figuring out where it is on this object. And we'll get into that.
So that was the big idea there. And then as the sensors move over objects and through the world, [INAUDIBLE] can learn models of complete objects. And I'll walk you through that.
And then the third part here is our current research, and this has not been published. It's very new. We asked the question, well, how could columns compute this allocentric or object-centric location? We had the idea that, well, let's look at grid cells and place cells, because they solve a similar problem.
And after we studied this for a while, we've come to believe that cortical columns contain analogs of grid cells and head direction cells, that they're solving the same basic problem that the entorhinal cortex is using to map environments. It's been [INAUDIBLE], and it's now using to map physical structures and objects. And it's a very parallel process. And when we understood that, now we're starting to understand the function of numerous layers and connections.
So I'm going to go through this in order. I'm going to very quickly go through these points and end up down here with the specific functions of layers and [INAUDIBLE]. So I'm going to go pretty quickly.
So let's start with one slide on the pyramidal neuron and the prediction system. This is your typical pyramidal neuron. It has thousands of synapses, anywhere from 5,000 to 30,000 excitatory synapses. Only 10%-- or less than 10%, typically-- are proximal, can actually drive that cell to fire.
90% of them are on either the distal basal dendrites or the apical dendrites. And typically, they're completely unable to make the cell fire. A lot of great research has been done to show that dendrites are active processing elements.
So if you have somewhere around 15 active synapses that could come active at relatively close in time and space-- so they have to be within, like, 40 microns on a dendrite segment-- that it can generate a dendritic spike. The dendritic spike can go to the soma. Generally, it does not cause the cell to fire. It depolarizes the cell.
So it raises its voltage, but not enough to generate a spike. That can be a sustained depolarization, hundreds of milliseconds up to a couple of seconds. We are going to argue that that is a predictive signal.
So the proximal synapses-- this is our theory. The proximal synapses cause somatic spikes. They define the classic receptive field of the neuron. But the distal synapses cause dendritic spikes, and they put the cell into a depolarized state or predictive state.
What's the benefit of a cell being depolarized? Our models and our network models rely on that fact. What happens is that a depolarized neuron will fire a little bit sooner than another neuron, if they both have the same receptive field. They both have the same basic feedforward receptive field.
The one that's going to be depolarized will generate its first spike a little bit quicker, and it's going to inhibit its neighbors in a very fast inhibitory circuit. And it turns out, a typical pyramidal neuron can recognize hundreds of unique patterns, 100 unique contexts in which it's going to predict its input.
This is how we model it. All of our simulations, we use-- this is a picture of our software model for this thing. Basically, in green there, that's the proximal synapses. And then we have the basal synapses labeled here with context. It's an array of coincidence detectors. And then the apical dendrites are similar. These are like threshold detectors.
So this is our model of the neuron. It has multiple states. I won't get into it. I also should point out, the learning model here [INAUDIBLE] we rely on synaptogenesis. So we're not changing weights of synapses. We're actually growing new synapses in our model, in a very clever way that matches biology. But I'm not going to get into it.
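The neuron model described above can be sketched in a few lines. This is only an illustrative toy, not Numenta's software: the class `ToyNeuron`, the function `firing_neurons`, the input names, and the proximal threshold of 3 are all assumptions made for the example. The one number taken from the talk is the roughly 15 coincident synapses needed for a dendritic spike. Learning by synaptogenesis and the apical dendrites are left out.

```python
from dataclasses import dataclass, field

# "Somewhere around 15" coincident active synapses trigger a dendritic spike (figure from the talk).
DENDRITIC_SPIKE_THRESHOLD = 15


@dataclass
class ToyNeuron:
    name: str
    proximal: frozenset            # feedforward inputs: the classic receptive field
    distal_segments: list = field(default_factory=list)  # each segment: a set of cell ids
    proximal_threshold: int = 3    # assumed value, for illustration only

    def feedforward_match(self, active_inputs):
        """True if enough proximal synapses are active to drive a somatic spike."""
        return len(self.proximal & active_inputs) >= self.proximal_threshold

    def is_predictive(self, active_cells):
        """True if any distal segment sees enough coincident activity (depolarized state)."""
        return any(len(segment & active_cells) >= DENDRITIC_SPIKE_THRESHOLD
                   for segment in self.distal_segments)


def firing_neurons(neurons, active_inputs, active_cells):
    """Among neurons sharing a feedforward match, depolarized ones fire first and
    inhibit the rest; if none was depolarized, all matching neurons fire."""
    matching = [n for n in neurons if n.feedforward_match(active_inputs)]
    predicted = [n for n in matching if n.is_predictive(active_cells)]
    return [n.name for n in (predicted or matching)]


receptive_field = frozenset({"in1", "in2", "in3"})
context = set(range(100, 120))            # 20 hypothetical presynaptic cell ids
a = ToyNeuron("a", receptive_field, [context])
b = ToyNeuron("b", receptive_field)

print(firing_neurons([a, b], receptive_field, set()))                 # no context: both fire
print(firing_neurons([a, b], receptive_field, set(range(100, 116))))  # 16 of a's 20 synapses active
```

With no context both neurons respond to the same feedforward input; once 16 of the 20 context cells are active, only the depolarized neuron fires.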
Now, what are the properties of sparse activations? We have to cover this, because you won't understand anything else until you cover this. And maybe you know this already, but I don't know.
So let's take, for example, one layer of cells. It doesn't really matter which. We're just going to take a bunch of cells and say it's like one layer in our cortical column. Let's say it's 5,000 neurons. And typically, what we see is a very sparse activation.
So let's say 2% of our neurons are going to be active at any point in time. So we have 100 active neurons. Now, at any point in time, there's 100. And then a moment later there's another 100, and a moment later there's another 100.
So first question we're going to ask is what is the representational capacity of a layer of cells? How many different ways can I pick 100 out of 5,000? Well, you're all not surprised, it's very, very big. What you may not know, you can type this into any browser and just say 5,000 choose 100, and it'll tell you.
And in this case, it's 3 times 10 to the [INAUDIBLE]. That's infinite, as far as we're concerned. And we don't have to worry about that. We can pick them all day long.
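The "5,000 choose 100" count can be checked directly. A minimal sketch using Python's standard `math.comb`; the variable names are chosen for this example only.

```python
import math

n_cells = 5000    # neurons in the layer
n_active = 100    # 2% of them active at any moment

capacity = math.comb(n_cells, n_active)   # number of distinct 100-cell patterns
exponent = len(str(capacity)) - 1         # power of ten of the leading digit

print(f"number of patterns is on the order of 10**{exponent}")
```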
The second thing is, if you randomly choose two sets of patterns, two activation patterns, what's the likely-- what's sort of the distribution of the overlap? How many cells would they have in common? In this case, it's about two. But then you can say, well, what's the chance that it's going to have 10 cells, 20 cells or 30 cells in common?
It turns out that it's very, very unlikely. It very quickly drops off to, like, never, even though technically, it could be. So you can pick random what we call SDRs, or sparse activations, all day long, and they almost all overlap by just a few. So they're very, very orthogonal, in that sense.
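The overlap between two random patterns follows a hypergeometric distribution, so the claim can be worked out exactly. The sketch below assumes both patterns are uniformly random sets of 100 cells out of 5,000; the function names are made up for the example.

```python
from math import comb

N_CELLS, N_ACTIVE = 5000, 100


def overlap_pmf(b, n=N_CELLS, a=N_ACTIVE):
    """P(two random patterns of `a` active cells out of `n` share exactly `b` cells)."""
    return comb(a, b) * comb(n - a, a - b) / comb(n, a)


def overlap_tail(b_min, n=N_CELLS, a=N_ACTIVE):
    """P(the two patterns share at least `b_min` cells)."""
    return sum(overlap_pmf(b, n, a) for b in range(b_min, a + 1))


mean_overlap = sum(b * overlap_pmf(b) for b in range(N_ACTIVE + 1))
print(f"expected overlap: {mean_overlap:.2f} cells")
for b_min in (10, 20, 30):
    print(f"P(overlap >= {b_min}) = {overlap_tail(b_min):.3g}")
```

The expected overlap is 100 × 100 / 5,000 = 2 cells, matching the "about two" in the talk, and the tail probabilities fall off steeply as the required overlap grows.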
Now, we can take advantage of this, because a neuron-- what it means is a neuron only has to form a few synapses. It doesn't have to form connections to all the cells that are active if it wants to recognize a pattern.
So in this case, I say I want this neuron to recognize-- I have 100 cells active. These are the gray cells. It only has connections on one of its dendrites to 10 of those or 20 of those, and it can reliably recognize that pattern. Technically, it could have a lot of false positives, but it just won't. Just never going to happen.
The second thing we can do now-- this is perhaps something you haven't seen before, but maybe you have-- is we can ask ourselves the question, what happens if I form a union of patterns? So instead of just invoking one pattern in this layer of cells, I'm going to invoke 10 patterns. That's 1,000 active cells, or 20% of the cells being active.
Well, you could say, wow, this cell's going to be in trouble now, because it's still looking at only 10 of those synapses, and it could have a false positive. But if you do the math, it's still extremely unlikely.
So this cell, by connecting to 20 synapses in the whole population here, can reliably pick out that pattern, even though there's a whole bunch of other patterns going on. And you can do unions much greater than that.
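"Doing the math" for the union case might look like the following. This is a sketch under stated assumptions: the active cells are treated as a uniformly random subset, the dendrite has 20 synapses, and it needs 15 of them active to respond. The threshold of 15 is an assumption borrowed from the dendritic spike figure earlier in the talk, and the function name is hypothetical.

```python
from math import comb


def false_match_probability(n_cells, n_active, n_synapses, threshold):
    """P(at least `threshold` of a segment's synapses fall on active cells by chance),
    assuming the active cells are a uniformly random subset of the layer."""
    total = comb(n_cells, n_active)
    hits = sum(comb(n_synapses, k) * comb(n_cells - n_synapses, n_active - k)
               for k in range(threshold, n_synapses + 1))
    return hits / total


# One pattern active: 100 of 5,000 cells.
single = false_match_probability(5000, 100, n_synapses=20, threshold=15)
# Union of 10 patterns: up to 1,000 of 5,000 cells (20% density).
union = false_match_probability(5000, 1000, n_synapses=20, threshold=15)

print(f"single pattern: {single:.3g}")
print(f"union of 10:    {union:.3g}")
```

The union raises the chance of a false match by many orders of magnitude, yet under these assumptions it stays far too small to matter in practice.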
We're going to rely on this property. Because what we think is going on, every cellular layer in a column is representing things, and often there's uncertainty. And when there's uncertainty, it's going to use a union. And it's going to say, oh, I don't know. Could be xyz, and so on.
And what it means is that the networks don't get confused as they try to resolve that uncertainty. As they bounce back and forth, they're going to essentially narrow down to the only consistent answer under-- I'll explain some of this. But the point is that we think unions are happening everywhere.
And so the density of the cell activity basically represents uncertainty. And when you really got something, you know what's going on, it can be very sparse.
OK, then we said, OK, take a bunch of those pyramidal neurons for sparse activation, put them in a layer like this, and we add a few more things. We're going to basically define-- we're going to put cells into the mini-column. So you might say 10 cells per mini-column.
And the mini-column-- it doesn't have to be a physical structure. All we're asking is that the cells in the mini-column have a common feedforward receptive field property. This is classic Hubel and Wiesel, many, many years ago. All the cells that are sort of vertically in line have the same sort of receptive field property.
You don't have to see the mini-columns. You just have to have that property. You add the cells-- so those cells in the mini-column are going to respond to the same feedforward pattern, but they're going to form connections horizontally that are unique. And so [INAUDIBLE].
Here's what would happen in two time periods, time 0 and time 1, if I had no predictive state, and an input comes in, it's going to activate all the cells in the mini-column. Because they're all equally getting this thing, and they look similar.
In the condition where there's a predicted state-- and I represented those by little red circles here-- this means that these cells are predicting they're going to be active. They're depolarized. The same input comes in, but it's only going to select one of those cells. The one that was predicted is going to fire first, do a very fast inhibition, and basically form a sparser pattern.
The next moment after this, what will happen is the active patterns will then predict another [INAUDIBLE] cells. And so you can go through these sparse activations in time-- prediction, activation, prediction and activation-- and that's the basis of sequence memory.
We had built this for years, and we tested this, and we applied it commercially. We understand it very well. I'll just mention a few things. It's very high-capacity, and this is important to remember.
A slightly bigger network than this we've shown can learn up to a million transitions, meaning that's like 10,000 songs of 100 notes each. It's really high-capacity. It's surprising.
They can learn high-order sequences. So imagine you trained on two sequences, ABCD and XBCY. If you show it ABC, it predicts D. If you show it XBC, it predicts Y. It doesn't get confused by there being the C. Similarly, if I just show it the B and the C, it's going to predict both D and Y, because that's all it can do at that point in time. Cause it does all these things automatically.
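The mini-column mechanism and the ABCD / XBCY example can be illustrated with a toy. This is a deliberately simplified sketch, not the published algorithm: it uses one mini-column per symbol, a single presynaptic cell as context instead of a sparse pattern on a dendritic segment, and during learning it takes only the previous winner cell as context. The class `ToySequenceMemory` and all of its names are assumptions made for the example.

```python
from collections import defaultdict


class ToySequenceMemory:
    """One mini-column per symbol; cells in a column differ only in lateral context."""

    def __init__(self, cells_per_column=4):
        self.cells_per_column = cells_per_column
        # presynaptic[cell] = cells whose activity puts `cell` into the predictive state
        self.presynaptic = defaultdict(set)
        self.reset()

    def reset(self):
        """Clear the temporal context (call between sequences)."""
        self.active, self.winners = set(), set()

    def predicted_symbols(self):
        cells = [c for c, pre in self.presynaptic.items() if pre & self.active]
        return sorted({symbol for symbol, _ in cells})

    def step(self, symbol, learn=False):
        # Simplification: while learning, only the previous winner cells act as context.
        context = self.winners if learn else self.active
        column = [(symbol, i) for i in range(self.cells_per_column)]
        predicted = [c for c in column if self.presynaptic[c] & context]
        if predicted:
            # Predicted cells fire first and inhibit the rest of the column.
            active, winners = set(predicted), set(predicted)
        else:
            # No prediction: every cell in the column fires (a "burst").
            active = set(column)
            winner = min(column, key=lambda c: len(self.presynaptic[c]))
            winners = {winner}
            if learn:
                self.presynaptic[winner] |= self.winners
        self.active, self.winners = active, winners
        return self.predicted_symbols()


def predict_after(memory, prefix):
    """Present `prefix` without learning and return the predicted next symbols."""
    memory.reset()
    prediction = []
    for symbol in prefix:
        prediction = memory.step(symbol)
    return prediction


memory = ToySequenceMemory()
for sequence in ("ABCD", "XBCY"):
    memory.reset()
    for symbol in sequence:
        memory.step(symbol, learn=True)

print(predict_after(memory, "ABC"))  # ['D']
print(predict_after(memory, "XBC"))  # ['Y']
print(predict_after(memory, "BC"))   # ['D', 'Y']
```

The same symbol C is represented by different cells depending on what came before, which is why the two sequences stay separate. Starting from B alone, the column bursts, both contexts remain possible, and the toy predicts the union of D and Y.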
It's extremely robust to noise and failure. You can knock out 40% of anything, and it still performs well. And it has very desirable learning properties. It's all local learning, very simple rules. I won't get into all of that. It solves many biological constraints.
There are many people who have implemented this by now, and it's being used in some commercial applications. But it is a biological model, first and foremost.
OK. We're done with the first section. Now the second section. We asked how are we going to learn predictive models of sensory motor sequences?
Our first idea, we said, OK, let's start with the same cellular layer. And can we turn it into a sensory motor layer? And we said, well, here's the basic idea. What if we just added a motor-related context. So instead of the context just being the previous state, we could have a motor-related context.
And we were inspired because we said, look, we know that 50% of the inputs to the layer 4 cells come from layer 6a. So that's an idea. Let's go for that. And we asked ourselves, well, what would that motor-related context be?
Well, this is the hypothesis. By adding a motor-related context, the cellular layer can pick its input as the [INAUDIBLE]. And then we said what is the correct motor-related context?
We started working on this several years ago. We tried different things, and they kind of worked, but they didn't work really very well. They didn't scale well, and so on. But just a little bit under two years ago, we had an insight about it. And this gets to that allocentric [INAUDIBLE].
So let me use my coffee cup as my prop. I'm going to use this a lot during this talk. So you can just basically ask yourself a very simple question. Imagine I'm not looking at this coffee cup. I'm just touching it. I'm familiar with it. This is my coffee cup from my office.
And I'm holding it in my hand, and I'm about to move my finger. And can I predict what I'm going to feel? Yes, I can. I know what I'm going to feel. I know I'm going to feel this edge here. I also know if I touch down here, I'm going to get this sort of rough thing here, because this cup has a rough bottom. It also has this little doodad here.
So as I touch my finger, I make the predictions. Before I touch it, I know what I'm going to feel. Now, how could I know what I'm [INAUDIBLE]? I have to know-- first of all, the cortex has to know that this is a cup. It has to know it. And it has to know where it's going to touch the cup. It has to know that.
If I'm going to predict what I'm going to feel, it must know where-- and that thing it's going to know is where on the cup it's going to touch. It's not relative to my body. It's relative to the cup. I need to know the allocentric location, otherwise I can't possibly make that prediction. That's deduction.
And the predictions are going to be at a fairly fine granular level. Every part of my skin touching this cup is predicting what it's going to feel. And there's a lot of them. It's not like some global prediction. It's a very local prediction. So we realize that that is a requirement. And that's where this idea for the allocentric location comes from.
So my answer now is hey, let's take-- if we have an allocentric location, the location of the cup-- and how could we derive that? I didn't know. What does it look like? We didn't know. We just assumed we had it.
So in the beginning, we just did experiments. We sort of randomly made up stuff. And then we also realized we really wanted a second layer to the network. The second layer was what you would typically call a pooling layer. That's a term that a lot of people use.
If you don't know what it means, in this case, what I mean by it is the second layer, we're going to essentially pick a sparse activation of cells up there. And it's going to stay constant while the lower layer is changing. The upper layer, those cells up there are going to learn to respond to a series of individual sparse activations in the lower layer.
So if you think about the lower layer, it's sort of representing a feature-- the sensory feature-- at a location. And if you have some-- you're basically modeling an object as a set of features at locations. It's kind of like a CAD file.
Well it kind of makes sense. What else could you do, modeling an object? And what's interesting here is that the output layer of this object layer is going to be stable over movements of the sensor. And the input layer will be changing with each movement of the sensor.
You have a stable representation of the object as you move. And it doesn't matter which order you move, how you touch the object, as long as you know the allocentric location, that magic signal. We don't know how to do that yet, but that's the [INAUDIBLE].
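As a rough illustration of the "features at locations" idea, here is a minimal sketch. The object data, the coordinates, and the `pool` helper are all made up for illustration and are not the network described in the talk. It only shows that a pooled set of (location, feature) pairs comes out the same regardless of the order in which the object is touched.

```python
# Hypothetical toy data: an object is a set of (location, feature) pairs,
# with locations expressed in the object's own reference frame.
CUP = frozenset({
    ((0, 0, 10), "lip"),
    ((0, 0, 0), "rough bottom"),
    ((4, 0, 5), "handle"),
})

def pool(sensations):
    """Accumulate (location, feature) pairs; the result ignores touch order."""
    return frozenset(sensations)

# Two different exploration orders over the same object.
touches_a = [((0, 0, 10), "lip"), ((4, 0, 5), "handle"), ((0, 0, 0), "rough bottom")]
touches_b = list(reversed(touches_a))

# The pooled representation is identical for both orders.
same_object = pool(touches_a) == pool(touches_b) == CUP
print(same_object)
```

A plain set stands in for the stable output layer here; the real model uses sparse activations of neurons rather than explicit sets.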
So we modeled this, and we did a lot of work with this. So with an allocentric location input, a column can learn models of complete objects-- or this two-layer network can-- by using essentially different locations on the object over time. So it's an integration [INAUDIBLE], you can both learn models of objects, and you can infer. I'll show you that.
Now, the next thing we realized is if you had a series of columns near each other-- imagine they were representing three tips of your finger-- and it's going to touch that coffee cup three fingers at a time, well, each finger is going to have its own location on the object. Each finger is going to have its own sensory input, but those are unique.
But they're all going to be basically trying to model the same object. And if they're confused, they may not know what the object is. But the output layers of these are going to agree, because they're going to be basically representing the same thing.
And so if you formed an associative link on the [INAUDIBLE] layer, they can vote together, and they can help resolve ambiguity. And that's the basic idea.
So each column has partial knowledge of an object as its equivalent sensory thing is moving. And these long-range connections in the object layer allow the columns to vote, and inference will be much faster when you're using multiple columns than with one column.
It's just like reaching into a dark box. If I use one finger to figure out what I'm touching, it takes a while, but if I grab it with my hand, I'll get it. Or if I was looking at the world through a straw, I'd have to move my straw around a bit. But if I open my eyes and see the whole thing, then I can do it very quickly.
So this is just a little cartoon animation just to illustrate some of this. It's not too terribly accurate. It's just for illustration purposes.
So imagine this finger is going to touch this cup in three locations. And I have one column, which has an input layer and an output layer. As I move towards this spot I'm going to touch, I have a predicted location signal. That basically invokes a union of possible sensations I might find at that location.
When I actually touch it, a sensory feature comes in, it selects one of those sensations, it projects up to the output layer, and this thing says I know three objects that meet this. The coffee cup, the can, and the tennis ball all meet that, so I'm forming a union representation up there.
Then I go to the new location. I get a new location. But it basically makes [INAUDIBLE] about what it might sense. You actually get a proper sense and say, oh, I have this feature at this location. I pass it up to the output layer, and I eliminate the tennis ball because that's inconsistent with feeling a lip or an edge. And then I go to the final sensation here, new location, new sensory feature, pass it up, and I can eliminate the Coke can or the soda can because it's inconsistent.
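The elimination process in this animation can be sketched as set filtering. This is a toy under stated assumptions: the object definitions, location names, and the `infer` function are hypothetical, and the actual model uses sparse neural representations rather than explicit sets.

```python
# Hypothetical learned models: each object is a set of (location, feature) pairs.
OBJECTS = {
    "coffee cup": {("side", "curved surface"), ("top", "lip"), ("bottom", "rough")},
    "soda can": {("side", "curved surface"), ("top", "lip"), ("bottom", "smooth")},
    "tennis ball": {("side", "curved surface")},
}

def infer(touches, objects=OBJECTS):
    """Keep only the objects consistent with every (location, feature) sensed so far."""
    candidates = set(objects)
    history = []
    for location, feature in touches:
        candidates = {name for name in candidates
                      if (location, feature) in objects[name]}
        history.append(sorted(candidates))
    return history

history = infer([("side", "curved surface"), ("top", "lip"), ("bottom", "rough")])
for step, remaining in enumerate(history, start=1):
    print(step, remaining)
```

The first touch leaves all three objects in the union, the lip rules out the tennis ball, and the rough bottom rules out the can.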
If I do this with three fingers at the same time, the hand grasps it, I get three different locations, three different features. In this case, we're showing them the same. They pass it up. In the output layer we can say, oh, well column one says it could be the coffee cup or a ball. The other two are saying it could be the coffee cup or the can. They just quickly associate with each other, and you eliminate, and you're down to the only thing that's possible for the three of them, which is the coffee cup. So you do that very quickly.
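The voting step can be pictured as an intersection of each column's candidate set. This is a simplified sketch: the `vote` function and the candidate sets are illustrative assumptions, whereas the model in the talk does this through associative lateral connections between output layers.

```python
def vote(column_candidates):
    """Return the objects that every column considers possible."""
    return set.intersection(*column_candidates)

# One simultaneous grasp with three fingers, one candidate set per column.
column_candidates = [
    {"coffee cup", "tennis ball"},  # column one
    {"coffee cup", "soda can"},     # column two
    {"coffee cup", "soda can"},     # column three
]

consensus = vote(column_candidates)
print(consensus)
```

With a single sensation per column, the only object all three columns agree on is the coffee cup.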
We tried this out then on a more sophisticated problem. We started with this Yale-CMU-Berkeley benchmark, which is about 80 objects. They'll actually send them to you [INAUDIBLE], or you could just use the 3D CAD files. So we figured since some of them are perishable food items, we would go for the 3D CAD files.
And then we built a robotic simulator, a virtual hand using a game engine. We built sensory arrays on each of the fingers, [INAUDIBLE]. And we built a multi-column array representing its fingers. We use 4,096 neurons per layer per column. So if it's three fingers, we've got about 24,000 neurons, each with thousands of synapses, and not surprisingly, because it's a simulation, that worked very well.
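The neuron count quoted here can be checked with a line of arithmetic, assuming two layers per column (the input layer and the output layer described earlier); that layer count is an inference from the talk, not a stated figure.

```python
NEURONS_PER_LAYER = 4096  # from the talk
LAYERS_PER_COLUMN = 2     # assumption: one input layer plus one output layer
COLUMNS = 3               # one column per finger

total_neurons = NEURONS_PER_LAYER * LAYERS_PER_COLUMN * COLUMNS
print(total_neurons)  # 24576, roughly the 24,000 quoted
```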
But just a few things to talk about here. Imagine we did it with one finger, and the one finger is touching it in different places. In one touch, you can't really tell what the object is. So this is a confusion matrix, where the actual object is on this side, and the vertical axis is what it thought it might have been. And you can see obviously the right answer is the diagonal. But in this case, there's a lot of confusion.
And after the second touch, things started narrowing down quite a bit. After six touches, you were doing really, really well. And after 10 touches, you're already guaranteed to get it. Now there's a lot of variability in this because if you touch unique features on the object, you can narrow it down quicker than if you touch non-unique features. But this gives you the general idea. We also did a lot of experiments looking at basically the number of columns or the number, if you want to think about them as fingers. But we can do this abstractly.
And of course, what we'd expect is that the fewer columns we're using, the more touches you have to-- or the more sensations you have to have to recognize this thing. And if you have more, then it quickly settles down to a basic-- you can do it in one sensation. And it gets harder depending on some other parameters. There's a lot of parameters. You can make this harder or easier. But the point is we show that characteristic.
All right, so that was that big idea there. But then we really said, OK, we got to get to the heart of this allocentric location thing. What's going on? What does that mean? And as I said, we thought of-- we said, let's go look at the entorhinal cortex to see what was going on there. Now, I know there's a bunch of hippocampal people here. And we were talking about this this morning. There's various reasons why we chose [INAUDIBLE] and to [INAUDIBLE] cortex.
So think about [INAUDIBLE]. I won't get into it, but don't get mad at me if I don't touch your favorite topic. So we ended up here-- this wasn't our initial hypothesis. Our hypothesis is that a cortical column has analogs of grid cells. And very recently, we realized they had to have analogs of head direction cells. That was the last missing piece that I didn't know about until just a few weeks ago. So let's talk about what goes on in the entorhinal cortex.
And I wouldn't claim to be an expert in this, but we have run this by some experts. And they said it's OK. You can say this, Jeff. So we're going to go there. The entorhinal cortex, one of the things it does is it allows an animal-- typically we study rats-- to basically build maps of its environment to know where it is and maybe be able to make predictions, and know where things are, the foundation of other navigation problems.
And grid cells, I won't go into all the details. We all know about them, and some of the details are really important, but we won't get into them. They allow us to encode location. So the way to think about this, if you look at [INAUDIBLE] rooms, they're actually the same shape. I should go back here. They're the same shape, but they're different in some salient feature. And so the rat perceives them as different rooms. And you would too if you were in there.
And what we want to do is to have a representation of where the location of those rooms are. Now the way grid cells do this thing-- and I'll just tell you a few things. First of all, every point in this room can be associated with a sparse activation of the grid cells. So you have a bunch of grid cells. They're in these grid cell modules. But if you just looked at which cells are active and which cells are not active, it's a sparse representation.
And I've shown here three locations in these rooms. Every location of these rooms has an associated pattern. What's interesting about it is the locations in the room are unique to the room. So the actual coding of these locations in room one will be very different than the coding in room two. This is actually essential to the whole theory. So a means that location in that room. That's a sparse activation. And x means that location in [INAUDIBLE] room, and r means that location in that room. Those are very different things.
And of course, one of the most important things here is that this location is updated by movement. So even in complete dark, if the rat is in that room, and it moves, and it walks forward, it updates its location information. And one of the clever things is the path integration property: if I want to go from here to there, I can go this way and then turn this way, and I get the same representation as if I just went straight or went around in a circle.
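A toy sketch can make these two properties concrete: codes that are unique to a room, and movement updates that do not depend on the path taken. Everything here is an assumption for illustration: the square (rather than hexagonal) tiling, the module periods, the per-room anchors, and the `move` function. It is not the entorhinal mechanism itself.

```python
PERIODS = (5, 7, 11)  # hypothetical module scales, in arbitrary step units

# Hypothetical per-room anchoring: each room starts every module at a different phase.
ROOM_ANCHORS = {
    "room 1": ((0, 0), (0, 0), (0, 0)),
    "room 2": ((2, 3), (4, 1), (7, 9)),
}

def move(code, dx, dy):
    """Path integration: shift each module's phase by the movement, modulo its period."""
    return tuple(((x + dx) % p, (y + dy) % p)
                 for (x, y), p in zip(code, PERIODS))

start = ROOM_ANCHORS["room 1"]
direct = move(start, 3, 4)                            # straight to the target
two_legs = move(move(start, 3, 0), 0, 4)              # this way, then that way
detour = move(move(move(start, -2, 6), 4, 1), 1, -3)  # a roundabout route

print(direct == two_legs == detour)                   # same code for the same place
print(move(ROOM_ANCHORS["room 2"], 3, 4) != direct)   # same movement, different room
```

Because each update is just addition modulo the module period, any route with the same net displacement lands on the same code, and that holds in a room the code has never been learned for.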
And what's clever about this, it works even in novel environments that it's never been in before. So it may never have been in room three, but it will have that path integration property there, even in the dark. So that's kind of clever. Now, the rat needs to know, or you-- it's fun to do this in the dark yourself. I do it at night. It actually is fun to try to see how good at this you are. You need to know the orientation of, in this case, the animal's head to the room.
And so there's these things called head direction cells. These are not driven by magnetic fields or something like that. They are basically a set of cells which indicate the direction of the head. The anchoring of those head direction cells is unique per room. So it doesn't really-- from room to room, it's not always aligned along an edge. But [INAUDIBLE], it was consistent.
And the orientation is also updated by movement. So think why you need this. You need it-- first of all, you're going to need to know the orientation or the head direction if you're going to know where you're going to end up after you move. So if I walk forward two steps, well it depends which way I was facing, where I'm going to be. Also, if I want to predict what I'm going to see or sense, I have to know where I am and which direction I'm facing because I could be in the same location here.
As the animal moves, both of these are updated simultaneously. You have to update the orientation-- I'm going to use the word orientation because I'm trying to generalize it. Orientation and the location both get updated. I might be updating just my orientation, or I might be just updating my location, or I might be doing both, as I move around a curve like that. So location and orientation are both necessary to learn the structure of rooms and predict sensory input in that case.
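The dependence of location on orientation can be shown with a small dead-reckoning sketch. The state layout, the four allowed headings, and the function names are assumptions chosen to keep the arithmetic exact; they are not taken from the talk.

```python
# Hypothetical unit step for each allowed heading, in degrees.
HEADINGS = {0: (1, 0), 90: (0, 1), 180: (-1, 0), 270: (0, -1)}

def turn(state, degrees):
    """Update orientation only; the location stays the same."""
    location, heading = state
    return location, (heading + degrees) % 360

def forward(state, steps):
    """Update location only; where you end up depends on the current heading."""
    (x, y), heading = state
    dx, dy = HEADINGS[heading]
    return (x + dx * steps, y + dy * steps), heading

start = ((0, 0), 0)
facing_east = forward(start, 2)             # two steps forward, facing 0 degrees
facing_north = forward(turn(start, 90), 2)  # same two steps after a quarter turn
print(facing_east, facing_north)
```

The same "walk forward two steps" command ends at different locations depending on the heading, which is why both signals have to be tracked.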
So we think the same thing is going on. The cortical column is trying to model external objects in the world. You can define a location associated with individual objects. So my coffee cup is like a room, and the points on it are going to be both unique to the coffee cup and unique to the location on the coffee cup. And the same thing with the pen. And it's going to have to be updated by movement.
And in this case, the movement, in the case of my finger, is the movement of my finger relative to the cup. And so we have to have that. The second thing-- and I only realized this recently-- you also, to solve the problems of modeling of objects and modeling of structures, you need to have an equivalent of an orientation. So I've tried to show [INAUDIBLE] here as a sensor on the tip of your finger, both sensing at point A, but from different orientations.
So you can look at it this way. I'm touching the lip of this cup. And as I rotate my finger like this, the sensation on my finger is changing, but the location I'm sensing on the cup is not. [INAUDIBLE] is a feature of the cup. And I'm not actually sensing the feature. I'm sensing the feature at an orientation. I can't-- this feature is actually this lip of this cup in the frame of the cup. But the sensation I get changes as I move the orientation of my finger relative to the object.
So we need to have something like that too, an orientation of the sensor patch to the object. Now, I should state now that I'm going to give this whole theory in terms of touch. But the whole thing applies to vision too, and I believe it applies to audition as well. It's a little hard to think about that. But there's nothing. We're not doing anything specific here. We're really trying to talk about generic properties of sensory patches relative to things.
Anyway, we're going to argue that this is anchored to the object in the same way that it is over there. And this orientation has to be updated by movement. So our basic idea is the following. Location and orientation are both necessary. That is, location and orientation, my sensor patch where there's part of my retina. Where is it? It's where it's sensing, not where the sensor is, [INAUDIBLE] both necessary to learn the structure of objects and to predict sensory input, and to infer.
I view this as a deduced requirement. And therefore, I don't feel it's speculative. But you may not agree with that. So now with this knowledge, we went back and we did the following, OK? We started with putting these pieces together in ways that are interesting. And this is where I'm going to lay out the basics of the theory here. And this is my most complex slide, so if I lose you here, sorry, but I'll bring you back in a moment. I'm hoping-- I think everyone who's really smart [INAUDIBLE] figuring this out. You're probably ahead of me already.
I'm just going to up front say without any further justification that layer 6a is representing orientation of the sensory patch, and layer 6b is representing the location. There's reasons for this. I'll get to it in a second. These are both going to be motor updated. They're going to be path integration type of-- and then it's grid cell like and head direction cell like. And they're going to have properties similar to those cells in entorhinal cortex.
Now, let's follow the circuitry as information in your basic feedforward pathway here. You've got a sensation, which is arriving at layer 4, and that's paired with this bidirectional connection, this very characteristic connection between layer 6a and layer 4. And what I'm going to argue there is that layer 4 is representing a sensation at an orientation. Now again, if I didn't know the orientation, I'd just have a bunch of cells that look like edge detectors, or something like that.
But in the context of an orientation, I'll get a sparse pattern. And it's the sparse pattern that represents sensation at an orientation. This is our sequence memory layer that I started with. It can learn sequences, but it can also learn sensory motor sequences. And so it forms this unique representation of sensation at orientation.
Now the next layer is going to be a pooling layer. Imagine if I were pooling the input as I rotated at the same location, like this. It takes a while to sink this into your head. Well, what you end up with is a stable representation of the underlying feature independent of the orientation of the sensor. So I would end up with a representation of whatever the thing is that I'm actually sensing at that point independent of whether it's this way, this way, this way. If I went through that motion, that's what would happen.
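One way to picture pooling over orientations is a lookup from (sensation, orientation) pairs to a single feature label. This is a deliberately crude stand-in: the dictionary, the sensation names, and the `learn` and `infer` functions are hypothetical, and the talk's layers use sparse neural activity rather than labels.

```python
def learn(pooling, feature, observations):
    """Associate every (sensation, orientation) pair seen on one feature with that feature."""
    for sensation, orientation in observations:
        pooling[(sensation, orientation)] = feature

def infer(pooling, sensation, orientation):
    """Return the orientation-independent feature, or None if the pair is unknown."""
    return pooling.get((sensation, orientation))

pooling = {}
# Rotating the finger on the lip changes the raw sensation but not the feature.
learn(pooling, "lip", [("edge across fingertip", 0), ("edge along fingertip", 90)])
# The same raw sensation at a different orientation can belong to another feature.
learn(pooling, "handle", [("edge along fingertip", 0)])

print(infer(pooling, "edge across fingertip", 0))  # lip
print(infer(pooling, "edge along fingertip", 90))  # lip
print(infer(pooling, "edge along fingertip", 0))   # handle
```

The output stays "lip" across the rotation, while the orientation context keeps an identical raw sensation from being confused with a different feature.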
[INAUDIBLE] layer, and this represents the feature that is being sensed at that point. At the moment, there is no concept of object. I'm not locating [INAUDIBLE] object. I'm just representing what I'm sensing with my finger. Layer 3 then projects to layer 5. As we saw, that's a classic projection layer. And we're going to repeat the same circuit. We're going to have the location information predicting the layer 5b. And that's going to represent a feature at a location. And this is another sequence memory.
Now we really have the feature at location. Our earlier experiments didn't deal with this, right? And they had some problems. But now because I've added the second thing up above, now I really have the feature at a location. This feature at location is a very [INAUDIBLE] representation. It's independent of the orientation of my sensor. And if I pool over that, in the upper layer here, which I'm labeling layer 5a, which really would be the layer 5 thick-tufted cells. In some species, it's above, and in some species, it's below. But just pretend it's this one above here.
That pooling layer would then be stable over objects. It would [INAUDIBLE] actual object. So we have this two-stage sensory motor inference engine. Now, if you think about it, earlier I talked about how you could share. You could share information between columns. The only two things that are worth sharing here are the object layer and the feature layer. Those are the two things that neighboring columns might also be doing in common. And everything else in here should not be projecting to the other columns because it's unique to this column.
And sure enough, the two primary output layers of a cortical column are always identified as layer 3 and layer 5 thick-tufted cells. And those basically represent the feature that you're sensing independent of the object and the object that you're sensing. Now, those actually can be shared to multiple columns, and those become the feedforward input to the next regions. It's worth noting that a column-- oh listen, I'll get to the second part.
A column therefore is a two-stage sensory motor model for learning and inferring structure. This is a [INAUDIBLE] properties, the thing about touching. And it's important to remember, a column usually cannot infer either the feature or the object with a single sensation. It's just not going to be possible. You have two choices. You can take the single column, and you can integrate over time by sensing, moving, sensing, moving, sensing, moving, or your eyes could be looking out through a straw, and sense, move, sense, move, or you can vote with neighboring columns.
And both of those strategies are employed in the brain. The column, to be trained, has to move over the object, but the column, to infer, can rely on voting with its neighbors. As I said earlier, this system is most obvious for touch because it's easiest to think about these columns as being separate sensory patches that are moving [INAUDIBLE] between each other. But it also applies to vision fairly straightforwardly, and it would suggest that other sensing modalities would work in the same way.
We spent some time earlier this week trying to map these onto whisking in mice, and I think that can be done. And, of course, as we said at the beginning of this talk, because this architecture, these structures, if there's any truth to this, if there is, this architecture is just about the cortex. So it suggests that we infer, and learn, and manipulate abstract concepts in the same way, the same way that we manipulate objects in the world. So the theory is that evolution discovered a way of navigating and knowing, mapping our environment. It had to do this a long time ago because all animals move, and they have to figure out where they are and how to get home.
And then there's another theory that's been published that the entorhinal cortex-- so there's a three-layer structure and two parts. And I forget the scientist who proposed this initially. But the proposal is that the neocortex was actually formed by folding those two halves on top of one another into a six-layer structure. So we think what basically happened is evolution preserved much of what's going on in the entorhinal cortex-- not exactly. There's differences. But it preserved that, and now is learning how to model objects in the world.
And in the human brain, what happened, it's now continued that, and it's using that same mechanism to model [INAUDIBLE]. And so it would suggest that, just suggested that when we think about things, whether it's mathematics, or physics, and brains, or neuroscience, or politics, or whatever, we're going to be using a similar type of thing. And what's interesting about this is, is this space, is this idea of location and orientation, they're dimensionless.
They're defined by behavior and they're not metric. It's not x, y, and z. There's this very unusual way of representing these things. And if behaviors weren't physical behaviors, but were mental behaviors, like mathematical transforms or something like that, you could apply behaviors to abstract spaces, and it should [INAUDIBLE] this might be the core of high-level thought.
OK, I want to have one more thing here. It's suggested we might want to rethink some thoughts about hierarchy that we've all had for a long, long time. This is a cartoon drawing, but it captures some of the basic essence of it. We think about [INAUDIBLE] arriving at a primary sensory region, labeled region one here. And we extract some simple features, and then we converge onto the next region. We extract some complex features, and then somewhere up the hierarchy, we actually start representing objects in their entirety.
This proposal I have today is quite different. It says that every region has columns. Every column is actually learning complete models of the world. I'm not joking. A single column can learn thousands of things. And I've only talked about what six of the layers do. There's a lot more to be done. But the idea that these things are actually very powerful modeling things, you have a huge array of basically models. And they're all modeling the same stuff in the world.
Now a couple of things here, I want to make it really clear. I'm not saying that the classic view is wrong. I'm adding some new thoughts to it that we hadn't really thought about before. One is, you say, well, what's the difference between all these columns? Well, [INAUDIBLE] odd things about the cortex when we talk about how regions project to each other. They never do it that way. They always project at least two, at least three regions above.
It's like, if the LGN is projecting to v1, it also projects to v2 and v4. And people say yeah, but the connections aren't really strong. Well, they might be diverging. The point is, there's nothing here that requires a strict hierarchy. And so a secondary region could be looking over the same sensory array, but at a wider area. Now, why would it be doing that?
Imagine I'm going to recognize the letter E. And I'm going to argue that I can do that in v1, that every column in [INAUDIBLE] can recognize the letter E. And if that E was really, really small, right at the edge of my abilities, it's only going to be recognizable in v1, because in the other regions, it just doesn't exist.
It's too fuzzy. But if it gets a little bit bigger, then it might be recognized by the columns in both v1 and v2. But if it gets really big, then [INAUDIBLE] can't do it anymore. It's just too big an area, and I can't move over that. And so you could be representing things at different scales here, but they're complete objects and they're overlapping.
Now, what if I had two sensory arrays going on at the same time? So I have now a vision and a touch array. And we're going to basically grasp the cup and see the cup at the same time. Well, you'd be invoking models of the cup in many cortical columns because there would be columns in the retina that are sensing the cup, and there's columns in the somatosensory regions that are sensing the cup. So multiple columns are trying to infer that this is a cup.
They all have models of the cup. Some are derived visually. Some are derived tactilely. But they all [INAUDIBLE]. Now, interestingly, if they all have models of the cups and they're all sensing similar features, it's possible that they vote in various ways here. And one of the things we see in the cortex, there's a lot of projections which don't make sense in a hierarchical fashion. You see projections from s2 going to v2. Well, that doesn't make sense in a hierarchical [INAUDIBLE] here.
They can be voting on cups, the object, they can be voting on features, they can go up and down the hierarchy, they can go across the callosum. And it's interesting. You can form various-- as long as you go to the right layers, you can form very sparse connections to different parts of the brain, and it works. You don't have to have a lot of connections at each column. You could just send one connection, a few over here. It's odd the way it works.
But anyway, you can have this. All these connections will help vote, so the auditory system, the tactile system will be helping the vision system. The vision system will help the somatosensory system. So little non-hierarchical connections will allow columns to vote on [INAUDIBLE] such as object and [INAUDIBLE]. And that's the thing we see up here.
OK, so I'm almost done. The summary of the talk is that we start with our goal, which is to understand the function and operation of the laminar circuits in the neocortex. Our methodology is to study how cortical columns make predictions of their inputs. We then propose the pyramidal neuron model, which is [INAUDIBLE] prediction. We say every pyramidal neuron is basically using 90% of its synapses for prediction, and each neuron predicts its activity in hundreds of contexts, and that prediction is manifest as a depolarization.
We then said a single layer of neurons forms a predictive memory of high-order sequences. This has been well documented. As long as you have sparse activations, mini-columns, fast inhibition, and lateral connections, that can be learned. And we said we'd define a two-layer network, which forms a predictive memory of sensory motor sequences, if I have some motor-derived context and a pooling layer. And of course, we proposed next that that motor-derived context is an allocentric location, an object-centric location.
And then we further went beyond that to say, OK, cortical columns need equivalents of the location and orientation of the sensor relative to the object. And those are [INAUDIBLE] grid [INAUDIBLE] cells. And this begins to define a framework for a cortical column. It's certainly not [INAUDIBLE], but it's a potential framework that would tie a bunch of things together that make sense. Columns [INAUDIBLE] models of objects as features at locations using a two-stage sensory motor inference model.
And I went through the details that matter a lot, but that's the basic idea. And then the sum total of this is the neocortex contains thousands of parallel models that are all modeling the world, with surprisingly high capacity, that resolve uncertainty by associative linking and/or movements of the sensors. There's a couple of things that I should point out we didn't do, very big ones.
Objects have behaviors. Now, I should point out that everything I've talked about so far is really about the what pathway, we haven't been talking about the whole cortex. We've been talking about how the what pathway would model structure, and so on. And when I talk about behaviors [INAUDIBLE], what pathway, I'm talking about behaviors of the objects themselves.
So my laptop has a behavior. The lid can open and shut. And I know that. Also, if I touch keys, they move. I know that. This thing has behaviors too. If I push this button, something happens. Objects have their own set of behaviors. We have to add that into this model because it's not just the shape of an object. It can change.
And the way that I think we're going to model behaviors, if you think about it, the model of the objects are features at locations. Those features can move in the object's space-- that would happen if I'm opening the laptop lid-- or the features can change at the particular location. So if I bring out my cell phone, and it's on, and I touch something on the screen, new features appear at the same locations that they appeared before. So the whole [INAUDIBLE] of modeling behavior of objects is how features move and change at locations.
We have to do that. We haven't done that yet. We need a detailed model of the hierarchy including the thalamus. I didn't talk about the thalamus. We spent a lot of time today talking about the thalamus. We have a hypothesis, what it's doing, why we need it. But we have to finish that out. And also, as I already mentioned, we still have to build the complementary where pathway; this is not a model of it. We haven't described anything about how we generate behaviors, and why I might move, and how I would reach something. I haven't talked about that at all. I just talked about how would a what pathway column learn the structure of objects [INAUDIBLE].
I want to put in a little plug here [INAUDIBLE] collaborations. There are many testable predictions in this model, in some sense a green field, because we're proposing that cortical columns, even primary ones, are doing a hell of a lot more than most people think. And so we spent a lot of time this week talking to various labs about how we could do that. And we welcome that. We're welcome to have discussions. And we can talk on the phone, or here today, and so on.
And we're always interested in hosting visiting scholars and interns. We have a couple right now. And so if you want to come spend some time in sunny California, even for a short period of time-- we have people who come just for a couple of days and want to get immersed in what we do. We like having visitors like them.
This is the team we have on the left. There's 12 people. I want to call out specifically [INAUDIBLE] Ahmed, who is with me right here. He's been with me. We've been partners for 12 years, and he's critical to the whole thing. And Marcus Lewis is one of our scientists, and he really helped understand the interaction between layer 4 and layer 6a, and layer 5 and layer 6b.
I didn't really talk about his work here, but it's underlying everything we're doing. And he has some [INAUDIBLE] insights into that. So I hope I didn't speak too quickly. But that's the end of my talk. Thank you.