Invariance and equivariance in brains and machines
Date Posted:
August 13, 2024
Date Recorded:
May 7, 2024
Speaker(s):
Bruno Olshausen, UC Berkeley
Brains, Minds and Machines Seminar Series
Description:
Abstract: The goal of building machines that can perceive and act in the world as humans and other animals do has been a focus of AI research efforts for over half a century. Over this same period, neuroscience has sought to achieve a mechanistic understanding of the brain processes underlying perception and action. It stands to reason that these parallel efforts could inform one another. However, recent advances in deep learning and transformers have, for the most part, not translated into new neuroscientific insights; and other than deriving loose inspiration from neuroscience, AI has mostly pursued its own course, which now deviates strongly from the brain. Here I propose an approach to building both invariant and equivariant representations in vision that is rooted in observations of animal behavior and informed by both neurobiological mechanisms (recurrence, dendritic nonlinearities, phase coding) and mathematical principles (group theory, residue numbers). What emerges from this approach is a neural circuit for factorization that can learn about shapes and their transformations from image data, and a model of the grid-cell system based on high-dimensional encodings of residue numbers. These models provide efficient solutions to long-studied problems that are well-suited for implementation in neuromorphic hardware or as a basis for forming hypotheses about visual cortex and entorhinal cortex.
Bio: Professor Bruno Olshausen is a Professor in the Helen Wills Neuroscience Institute, the School of Optometry, and has a below-the-line affiliated appointment in EECS. He holds B.S. and M.S. degrees in Electrical Engineering from Stanford University, and a Ph.D. in Computation and Neural Systems from the California Institute of Technology. He did his postdoctoral work in the Department of Psychology at Cornell University and at the Center for Biological and Computational Learning at the Massachusetts Institute of Technology. From 1996-2005 he was on the faculty in the Center for Neuroscience at UC Davis, and in 2005 he moved to UC Berkeley. He also directs the Redwood Center for Theoretical Neuroscience, a multidisciplinary research group focusing on building mathematical and computational models of brain function (see http://redwood.berkeley.edu ).
Olshausen's research focuses on understanding the information processing strategies employed by the visual system for tasks such as object recognition and scene analysis. Computer scientists have long sought to emulate the abilities of the visual system in digital computers, but achieving performance anywhere close to that exhibited by biological vision systems has proven elusive. Dr. Olshausen's approach is based on studying the response properties of neurons in the brain and attempting to construct mathematical models that can describe what neurons are doing in terms of a functional theory of vision. The aim of this work is not only to advance our understanding of the brain but also to devise new algorithms for image analysis and recognition based on how brains work.
MODERATOR: Welcome, everybody. It is my great pleasure to introduce Bruno Olshausen, whose research has inspired me since back when I was a grad student. Bruno has been working at the interface of computation and neuroscience for an extremely long time, bridging the two and asking what I've always found to be very clear and courageous questions, using methods from one field and bringing them back to the other.
So: what fraction of the activity in visual cortex do we actually understand? How can we study that quantitatively? What learning rules help give rise to sparse codes? What's really distinctive is the depth he's brought to each of those areas separately and to the intersection of the two. So it's really delightful to have him here speaking to us about invariance and equivariance in brains and machines.
[APPLAUSE]
BRUNO OLSHAUSEN: Thanks so much for that kind introduction. It's great to be back at MIT. As some of you know, I was a postdoc here for a brief while-- I had the great fortune to spend about six months in Tommy Poggio's group before starting my faculty job at Davis, back when Max Riesenhuber and Emanuela Bricolo were here and Peter Dayan was across the hallway. It was a very influential period for me, with many great memories from that time.
I'll first start by introducing the group I work with at UC Berkeley, the Redwood Center for Theoretical Neuroscience, which is basically a group of faculty, students, and postdocs, many of us with backgrounds in physics, math, and computation, trying to bring ideas from these different fields to bear on understanding computation in the brain.
And I'll just use this to introduce the people who are spearheading the work I'm going to be telling you about today. Sophia Sanborn and Christian Shewmake in particular helped me craft a proposal to NSF on learning of Lie groups, which we eventually got, and that's funding a lot of the work that I'm going to be telling you about now, together with Nina Miolane and Stella Yu.
And then I'm also going to be telling you about amazing work by Chris Kymn on residue numbers, which was done together with Pentti Kanerva. Konrad Kording just happens to be in the picture because he happened to be visiting the day the photograph was taken. In case any of you are wondering, he's not a member of our group-- but that would be a great thing too.
So since this is a talk about brains and machines, I thought I'd start out by maybe going back more than 100 years ago when people were trying to make a similar kind of fusion between not brains and machines, but flying animals and machines. And people at that time not only took great inspiration from birds, but actually tried to learn and adopt some of the principles by which birds fly.
So they understood the principle of lift and wings, and Otto Lilienthal in particular researched that a lot. He and other people were successful at building gliders-- that's what's being shown here.
So they could glide off a hill and glide in a straight path. But one thing they could not do was turn, because as soon as they banked their glider, it would just fall out of the sky. And so this was really one of the great insights-- and innovations-- of the Wright brothers, which they learned by observing birds in flight.
One of the things that Wilbur Wright noticed about birds is that when they turn, they twist their wings. And he reasoned that this twisting of the wings might have something to do with braking them-- applying more friction-- so the bird wouldn't simply fall out of the sky when it turns.
So he tested that idea by building a kite, and it worked in the kite. They eventually incorporated it into their airplane by attaching pulleys that could twist the wings when it turned, and it worked. That was a really key innovation that allowed them to succeed where other people had failed.
And what's really wonderful about this story is that not only were they able to figure out how to build a flying machine, they also learned something about birds at the same time-- so this is why birds twist their wings. The other thing to note about this example, I think, is that what allowed them to make this connection between biology and the synthetic approach-- what they were trying to engineer-- was the fact that they were confronted by the same problem that biology had solved.
They were trying to solve the same problem that biology had come up with a solution to, and so they could draw this inspiration and find these links between the two approaches. And so the question is, are we in that situation today?
With all this fervor about AI, and the advances in AI and neural networks, and advances in neuroscience, do these two fields have something to learn from each other? Certainly they do, but right now, it seems like we're on very different paths. On the left is the NERSC supercomputer at Lawrence Berkeley Lab, which I'd say is really kind of what AI is doing right now.
There are many data centers around the country, very similar to the NERSC supercomputer, consuming megawatts of power. Facebook, I think, is trying to acquire 350,000 H100 GPUs, each of which consumes about 700 watts-- so that's in the range of 250 megawatts or so. Very power hungry, very data intensive, very large scale computing infrastructure.
Contrast that with what's over on the right, which is what biology built. Biology built something with a very small form factor and very low energy consumption, fully autonomous, which moves around in a three-dimensional environment and is capable of sophisticated pattern recognition. So they have these high-resolution eyes at the front here, with large lenses, which allow them to form high-resolution images at the back of the eye.
And at the back, there's a one-dimensional strip of photoreceptors, which moves back and forth inside the head. You can see that here in this infant jumping spider, where the exoskeleton is still translucent-- these tubes scanning back and forth to build up maybe some kind of a perception of the scene. And then these eyes on the side of its head have very low resolution, sort of fisheye lenses, that obtain a 360-degree view of the environment.
They detect something moving. They don't build a web like other spiders-- they rely on vision to find prey. They orient their head towards it and scan it with those eyes.
They can recognize conspecifics for courtship, do prey capture, distance estimation, navigation, 3D geometric reasoning-- all of this inside this very tiny brain.
So this is the problem that biology solved. And I think you'd agree there's an explanatory gap here in our scientific understanding of how you could do something like this in such a tiny brain. And this is how vision began.
These animals were doing this around the time of the Cambrian explosion, 500 million years ago. It feels kind of strange and foreign to us, but it's important to remember this is how vision began. And I would argue many of the core algorithms of vision that we seek are really in these kinds of animals. These are the core problems that vision solved.
So maybe here's one way to picture what's going on: we can think about the evolution of these ideas and algorithms in a two-dimensional space. It's certainly higher-dimensional than this, but here are two dimensions that are useful: functional competence on the vertical axis and computational efficiency on the horizontal axis.
So we've been marching up this direction of functional competence. And this is something I'm guilty of myself, as I think to myself, I just want to get this system to work.
As soon as I get it to work, I can always worry about making it more computationally efficient later-- there's always a way to do that-- but I'm just going to try to get it to work for now. And I think this is one way of thinking about what's going on: we're just trying to get these models to work better and better and better, not paying so much attention to computational efficiency, and paying the price of that.
Moving along the horizontal axis, there are many efforts at neuromorphic computing trying to adopt biology's designs, using spiking neural network chips like Loihi, which have amazingly power-efficient computational ability. But on the other hand, it's been hard for them to get traction and match the performance that people get out of these large-scale models moving up on the left side.
And so what biology did, though, is to just move directly along the diagonal here. And so what I'm going to argue here is that maybe the design of the algorithms that we're using-- the data representations-- is intimately intertwined with the physics of computation. So currently, the computational paradigm, by design, is divorced from the physical implementation.
We can think about the abstract structure of the algorithm independently of what it's implemented on. It could be rocks, it could be silicon. It doesn't really matter.
But many people in the electronics industry are coming to the conclusion that we can't do that anymore. As we push against Moore's law and try to make transistors smaller and lower power, they behave more stochastically. And so we now need to think about algorithms and data representations that can perform well in this setting, where you have noisy, non-deterministic computing elements.
So it really defines a fundamental shift in how you think about the problem. It's not necessarily the case that you can just march up the left and then move to the right. You might have to start over.
And so this is important, and thinking about these things together is really crucial. OK. So I'll tell you about the approach we've been taking in my lab, which begins from observations about animal behavior.
And this is important, really, because it defines the problem landscape. It defines the problem that you're trying to solve. So in the case of vision, how do animals use their visual system, or why did vision evolve? And then this sort of starts to define the set of problems we're trying to solve.
The next is to look at biological structure, because this tells us something about the computational primitives that are being used by the system to solve these problems. It gives us hints about the computations involved. And finally, mathematical structure builds the computational foundations-- it's what allows us to engineer these systems.
So I'm going to give you some examples of each of these three parts of the approach. Let's begin with animal behavior-- just a few examples here. Let's go back to my favorite example, the jumping spider, which has just amazing capabilities in vision.
And one of the behaviors that people have noticed that they exhibit in the wild is when they do prey capture-- when they see a prey item that they want to capture or jump at, if it's too far away, then they have to navigate there by foot. So this is an example. The spider sees this prey item up here.
And it's too far to jump, so it has to climb down here and get within range. One of the things they notice is that on the way, as it marches toward the prey item, it turns and does this kind of reorienting, as though to check whether it's still there. And what they notice is that the angle it turns at is always the correct angle given its new position with respect to the prey item.
So what they did to test this in a more controlled environment was to put the jumping spider on a track here in the center, with the prey item somewhere out here on this ring. It starts at the center, and there are walls on this track, so the prey item is occluded as it marches down the track.
And then they look at when it does this reorienting turn, what angle does it turn at, given the new position of the prey item with respect to the animal? And indeed, what you see here is a scatter plot of the predicted angle sort of by geometry or trigonometry versus what they actually turn at, and it's doing a pretty good job.
So it suggests that they're doing some kind of geometric reasoning inside their head-- that they know where they are with respect to the prey item, and can maintain this internal representation. They're not just tracking it in their visual field, although that would be interesting too-- they can't in this particular case.
So how do they do that? Well, one of the most remarkable recent discoveries in neuroscience, I think, is the discovery of these head direction cells in the ellipsoid body of the fly. Flies face a very similar problem: where am I going, where am I in this world?
Any animal faces this very fundamental problem. Head direction cells have been known about for a long time, initially from rats, but these were discovered somewhat recently in the fly by Vivek Jayaraman's group at Janelia, in 2015 or so. The ellipsoid body is a nucleus in the central complex deep inside the fly's brain-- just blowing it up here-- which is anatomically a ring- or donut-shaped structure.
And these neurons seem to form a ring attractor and hold a bump of activity, which reflects an internal representation of where the fly is heading in the environment. So this is just absolutely stunning-- truly remarkable that you're seeing this representation-- this very explicit representation being built up like a compass-- a neural compass inside the fly's head. There are no sensors for heading.
This is not something that's using magnetic sensors or the like. It has to combine many different sources of sensory information-- its own motor signals and so forth-- fuse them together, and compute, in a very non-trivial kind of computation, its heading with respect to the environment in an allocentric reference frame.
And it holds this activity even in the dark. So there it is: you peer inside, and these animals seem to maintain some kind of internal representation, and they have to do this geometric reasoning and form these explicit geometric representations about where they are in their world.
What about in humans? Back to the human visual system: if we look inside our eye-- this is not so much about navigation, but just pattern recognition-- these are the cones. This is what the cones look like at the back of your eye.
Austin Roorda at Berkeley can do this using adaptive optics-- look inside the living human eye as you're looking at something. First of all, all these different cones here have different wavelength selectivities, and they're shown in false color over here.
So the L, M, and S cones, selective to different wavelengths, form a mosaic. And now, as you look at something and hold your eyes still-- for example, when you read the letters on the lowest row of the Snellen eye chart-- they're projected as an image on the back of your eye, and that's how big they look.
An E on that lowest row of the Snellen chart-- the row you can read with 20/20 vision-- is that size with respect to the cones at the back of your retina. And you're trying to hold your eyes still-- you're fixating-- but nevertheless, the image is moving due to these drift motions.
It's predominantly smooth drift interrupted by sort of corrective saccades. It's just being played over and over again. So the amazing thing here is that when you look at that, you do not see the E moving.
And moreover, you can read the E, despite the fact that it falls on a completely different set of photoreceptors at each point in time. If you just integrated over that, you would create a big motion smear. And if it were a yellow E or a blue E or whatever, you can somehow interpolate among all these different photoreceptors.
It doesn't look sort of splotched or kind of textured in color. It looks like one solid color. So somehow, you can take this time varying signal from the retina and factorize it into a representation of shape, a representation of color, and the motion. You can discount the motion that's causing all that and form this invariant representation from the image.
There are many other examples of this, but one of the most striking is in three-dimensional perception. If we just look at these dots moving on the screen, they very compellingly define a cube that's rotating in three dimensions. The remarkable thing here is that you perceive not only a cube, but also its 3D axis of rotation.
So you're perceiving a three-dimensional transformation in the group SO(3), and at the same time, you're perceiving a solid three-dimensional shape. But there is no cube there and there is no 3D motion. All there is is a bunch of dots moving on the screen.
So what you're remarkably adept at-- you could take that stuff and just scramble it, as I just showed you-- is taking this collection of moving dots and ascertaining that there's a three-dimensional object there, extracting the invariant part, which is the cube, and the equivariant part, which is the transformation acting on it.
So those are just a handful of observations. We looked across the animal kingdom-- insects, spiders, humans-- and this is what we seem to do in vision. We're very good at geometry, and at extracting information about geometric transformations from time-varying stimuli.
So we can pose these as two computational problems that we want to try to figure out. One is the problem of equivariance: how do you form neural representations of these transformations going on in the world, or of you with respect to the world? How do you form these equivariant representations of those transformations?
And the other problem is how we extract the flip side of that, which is the invariant part-- the part that's not changing. There's one world out there that you're moving through, which is giving rise to all these optic flow patterns. How do you extract that one world-- that invariance-- by discounting the equivariant part?
So the animal behavior defines the questions. What about-- how is it computed? So we can look inside the brain and see, well, what does the brain actually compute with.
And by the way, just an advertisement: I'll push this book, which I'm just a huge fan of. If you want to know the beginnings of how the brain computes, there's this marvelous book-- which I think is just a gift to our field-- by Peter Sterling and Simon Laughlin.
It's not so much about computation per se-- it's more about signaling mechanisms-- but it gives you this incredible respect for the cleverness and sophistication biology has put into trading off signal-to-noise ratio with volume, with energy, and all these different factors, and for how neurons are really incredibly well-engineered signaling devices. And the same is certainly going to be true, I would guess, for the computational realm.
And so one first observation is: how does a real neuron compute? If there's anything that we've learned in neuroscience in the 50 or 60 years or more since Rosenblatt's model, the perceptron-- which, by the way, predominates nearly everything in deep learning today; it's the backbone of that--
it's that real neurons in the brain are not perceptrons. In fact, it would be hard to make a real neuron behave like a perceptron because of all the active processes in the dendritic tree. The very nature of the compartmental model of a neuron makes it more divisive in nature: the minute you open a channel, it pokes a hole in the membrane-- it changes the membrane resistance.
So these are definitely nonlinear devices in the way that they integrate signals on their dendritic trees. So any given pyramidal cell has maybe 1,000 inputs coming into it. It's not simply summing them together. Where you come in as an input matters.
It matters very much in terms of what signals you're going to combine with. And so some people, like Bartlett Mel, have argued that a better model of what a neuron does is a sigma-pi model-- something like a sum of products, or a sum of nonlinear combinations of inputs on the dendritic tree.
That's great news, because this is a much richer computation. You can get much more out of this. And Bartlett has really, I think, pioneered a lot of the studies showing why it's much more advantageous computationally to use these as elements in a network.
But I think one of the lessons here is that multiplying signals together is something that biology can do easily. It's a natural primitive that the system gives you. So it's not something we should shy away from
and say, wow, that sounds too complicated-- I shouldn't multiply signals together. It's a natural primitive of the system. Carver Mead at Caltech, in his pioneering of neuromorphic engineering with analog VLSI, was often fond of saying, "listen to the silicon."
So listen to the silicon: what are the natural computational primitives of the physics of silicon, of semiconductors? And similarly, we have to listen to the biophysics: what are the natural computational primitives in neurons? This is certainly one of them, I think. Another important primitive of neural computation in the brain is recurrent computation.
So this is just showing a cross section of the different layers of cortex. And Douglas and Martin's canonical microcircuit diagram to the right here. And so to me, one of the most striking and amazing properties of this circuit is this very strong and powerful recurrent feedback loop in layers 2 and 3.
So these neurons in layers 2 and 3-- the superficial layers-- are interconnected by horizontal connections. And these are recurrent connections: if neuron A projects to neuron B, then neuron B projects back to neuron A, and to many other neurons in addition. So the information is reverberating in this system.
It's not like we have some kind of clean feedforward chain of processing where information goes from A to B, to C, to D and so forth. It really is more like a dynamical system: the information comes in, circulates around, and mixes in a very rich and interesting way.
So this is another thing to think about. Recurrence is something to be embraced, not something to be avoided in neural computation. It's something that biology does naturally, and there's tons of it going on in the brain.
And a third property-- and there are many others we could point to-- is phase coding. We see in different parts of the brain, especially in the hippocampus, this property of phase precession, where the time at which a hippocampal neuron fires an action potential, with respect to the phase of the ongoing theta rhythm, gives you information about the rat's position in the environment relative to that cell's place field.
And this was marvelously demonstrated even at the LFP level: you can decode where the animal is in the environment-- you can decode something like place cells-- out of the LFP, the local field potential, which is a very macroscopic signal.
This is data from Buzsaki's lab, where he uses these depth electrodes-- polytrodes. And this analysis of the LFP is work from Fred Sommer and Gautam Agarwal.
They basically find the center frequency of these oscillations and then demodulate the theta rhythm with respect to that center frequency. So from these traveling wave patterns in the hippocampus, they extract the relative phase of the traveling waves. And when you do sparse coding or ICA on the phase-- not just the raw signals, but the relative phase of these traveling waves-- then you can beautifully decode signals, something like place cells, which tell you about the rat's position in the environment.
So phase coding seems to be something that is going on in brains. Finally, let's turn to mathematical structure. And the problem we're going to focus on here is the problem of extracting both the equivariances and the invariances that we talked about before.
So let's just think about what happens when a pattern moves across a sensor array, like in the retina. Up here, we see the letter E on, let's say, a 10 by 10 pixel array. Think of this as a photoreceptor array-- the sensor array.
At that particular time, the activation of these receptors can be thought of as a point in a 100-dimensional space. There are 100 measurements, one for each axis of the space, and that pattern of activation on the sensor is a point in this 100-dimensional space.
If this now-- if this pattern moves to a different location on that sensor array, it corresponds to a different point in this 100-dimensional space. And so now, the interesting question is, what happens when the pattern is halfway in between? That corresponds to yet another point in this 100-dimensional space.
And the question is, is it halfway along a line between these two points? I'll let you think about that. And the answer is no. And there's a very easy way of saying why.
So the set of points traced out as the E translates across the image does not lie along a linear manifold-- it lies on a nonlinear, curved manifold in the space. And the reason why, you can see, is very simple: the point halfway along the line in between would correspond simply to the superposition of these two patterns at half contrast.
The pattern in between is clearly not that, so it has to be off the line. What it's actually doing is moving along a circle in this space-- in this 100-dimensional space, it's moving along an arc of a circle. So that's interesting, but how do we describe it mathematically? This is where Lie groups come in, because we can get from any point on this manifold to another point through a matrix transformation.
So we can multiply by a certain matrix, which would take us from one point to the next. And the matrix that takes us from, for example, this point over here to that point is different than the matrix that takes us from that point over to that point there. But what the theory of Lie groups tells us is that all of these matrices that move us along that manifold belong to the same group.
And it's called a Lie group after the mathematician Sophus Lie-- often mispronounced as "lie" groups. That's what I used to call them. In fact, when we got this grant, NSF said we had to change the way we phrased it, because they were afraid people would think we're working on lie detection.
But yes, it's a common mispronunciation-- it's "Lee." So Sophus Lie showed that all of these matrices belong to one group and can be parameterized in terms of this matrix exponential, e to the tA.
So there's one matrix, A, which tells us how to move along this manifold when we exponentiate it, multiplied by just one scalar in the exponential-- and that single scalar t tells us how far we're going to move along the manifold.
So if we dial in t by different amounts, we're dialing in different matrices here, and we can move smoothly along this manifold just by advancing that number t. I think that's an example of mathematical structure that's going to help us in solving this problem.
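As a concrete illustration (a minimal sketch of my own, not from the talk): for one-dimensional circular translation, a generator A can be built via the Fourier transform, and exponentiating tA slides a pattern smoothly along the manifold. The specific construction below is an assumption chosen to make the idea runnable.

```python
import numpy as np
from scipy.linalg import expm

# Minimal sketch: a generator A for circular translation of a length-N signal.
# Construction via the DFT is an illustrative assumption, not the talk's code.
N = 9                                               # odd N keeps A purely real
omega = 2 * np.pi * np.fft.fftfreq(N)               # angular frequency of each Fourier mode
F = np.fft.fft(np.eye(N)) / np.sqrt(N)              # unitary DFT matrix
A = np.real(F.conj().T @ np.diag(1j * omega) @ F)   # the Lie-group generator (a derivative operator)

x0 = np.zeros(N)
x0[2] = 1.0                                         # the "pattern": an impulse at position 2

# exp(tA) x0 moves the pattern along the (curved) manifold of translations;
# integer t gives a circular shift, fractional t an interpolated in-between pattern.
for t in [0.0, 0.5, 1.0, 3.0]:
    xt = expm(t * A) @ x0
    print(t, np.round(xt, 2))
```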
The other observation is that this is what happens when the letter E moves across the image. What happens if we have a different letter, like the letter A, moving across the sensor array? Well, that traces out a different manifold in the space, because obviously it lies on a different set of points.
But the important thing is that the transport operator-- the Lie group action that moves us along that manifold-- is exactly the same as the one that moves us along this manifold here. So they're not different. It's the same matrix A.
And that introduces this idea of factorization. Somehow, I want to factor out this x0-- what the pattern is-- away from the Lie group action that's acting upon that point. The two of these combined give rise to this vast variety of different patterns that can occur on the sensor array, but there's a much simpler explanation, which is to say: there's a set of patterns, and there's a set of group actions acting on them.
So that's what we're going to try to do here. So this is just one way-- so just turning to a standard data set that many people use as training wheels in the field for testing these models, we can think of the MNIST data set, for example-- this variety of handwritten digits as being composed of a set of discrete object classes.
So that's going along the vertical direction here. So these would be discrete patterns O parametrized by alpha. So alpha is going to tell us like which, which pattern we are. And then another axis going this way, which is the group action-- the transformation that's acting upon these, these different patterns.
So we-- and so there's some matrix A that's-- or some set of matrices A that is moving us from left to right here. And so here's the factorization problem-- is that given this product space of all these different patterns and group actions acting on this vast variety, we're saying we can reduce that basically to this-- oops, I'm sorry.
We can reduce this to this product of a transformation and acting on a set of patterns. So there's the problem. So that's where we're going here, is that we have to somehow solve this factorization problem that given the image, the hypothesis is that there's a simpler explanation of all this variety in terms of a set of transformations and a set of and a set of objects that are there in the world.
So I'm going to tell you about three specific works that are trying to get at this idea. The first attacks the problem of learning: how do we learn about these Lie groups? For certain kinds of actions, like translation or rotation, we know what they are mathematically.
We could just plug them in. But maybe for other types of data, they're a little less well posed mathematically-- like in MNIST, for example, those different style variations might be something we want to learn.
Oh, I forgot to mention: going back to the style variations, here I am at MIT, the home of Freeman and Tenenbaum, who did very foundational work about 20 years ago or so on separating style and content using bilinear models. It's a very similar idea to what they were proposing back then.
And here, we're trying to cast this using the framework of Lie groups to help do this factorization. So one problem is learning these group actions and the patterns in the space. That's the work I'm going to tell you about first, by Ho Yin Chau, Yubei Chen, and Frank Qiu.
And then the next is the problem of factorization. So let's say I know what the group actions are, I know what the patterns are. I still have to solve a factorization problem. That's a very difficult computational problem.
And there's a recent innovation that Paxon Frady came up with, which he calls a resonator network. They have a paper just coming out in Nature Machine Intelligence describing how you can use this for visual scene analysis. So this focuses on the problem of factorization, not so much on the problem of learning.
The learning work up here focuses on the problem of learning, but not so much on the factorization problem, as we'll see in a second. And then for the equivariant part-- for representing the transformation-- there's a very efficient way of doing this, which was alluded to many years ago by Ila Fiete: using residue numbers. And Chris Kymn has made some remarkable progress on that, which I'm going to tell you about.
So let's first turn to the problem of learning. I give you a set of images-- I here is just a vectorized representation of the image, like that 10 by 10 image I showed you earlier, so a 100-dimensional vector representing that image.
The challenge is to factorize it into a transformation and an object, plus maybe some residual, because we don't expect to describe everything perfectly. And the way we're going to model the shape is through a sparse coding model.
So we have some dictionary phi. The columns of phi are the different shapes we're going to-- different templates, if you will, that we're going to model. And alpha is a set of coefficients which tells us which shape that is or some linear combination of shapes, but we expect alpha to be sparse.
And then we're going to model the transformations in terms of a matrix exponential. And I forgot to mention: this matrix exponential hides some complexity, because e to a matrix is not simply element-wise e to each element of the matrix.
It's a much more complicated operation-- you would never want to have to compute it explicitly. It's a mathematically elegant way of expressing things, but it hides a lot of computational complexity.
And so one way of making that more approachable is to re-express that Lie group in terms of what's called its irreducible representation. In this case, that essentially amounts to diagonalizing the matrix A. So we take that A matrix and diagonalize it in terms of its eigenvectors-- that's what W is. W is the matrix of eigenvectors,
and sigma is the diagonal component of that matrix. Then it turns out we can just move the orthonormal matrices outside the exponential-- that's a special case where you can do that.
And then we're just left with the matrix exponential of a diagonal matrix. And that is something very simple: it's just the element-wise exponentiation of each element of the diagonal.
And so we can then re-express that e to the sigma s-- that diagonal-- as this other diagonal matrix R sub s. And what you're going to find is that for many groups of interest for transformations in images, this matrix A, when it's diagonalized, has eigenvalues that are imaginary. That's what's being shown here.
This is the diagonal part of A, and the W-- I'll show in a second what's learned from images. And when we exponentiate that diagonal part, we just get a set of complex phasors along the diagonal. So we can write the total generative model down here.
We're saying: any given image that I give you, I want to be able to explain in terms of a product of two things-- a transformation, which is these three terms here, and a shape. And the things I need to learn are the irreducible representation-- the matrix W-- corresponding to that group.
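To make the verbal description concrete, here is a sketch of the generative model in the form just described (the exact notation, and the placement of W versus its transpose, are my reconstruction of the slide rather than a verbatim quote):

```latex
% Sketch of the generative model (notation reconstructed from the description above):
%   an image is explained as a transformation acting on a sparse combination of shape templates.
\begin{aligned}
  I &\approx e^{A s}\,\Phi\,\alpha \;+\; \varepsilon
      &&\text{(transformation $\times$ shape, plus residual)}\\
  e^{A s} &= W\, e^{\Sigma s}\, W^{\top},
      \qquad A = W \Sigma W^{\top},\ \ \Sigma\ \text{diagonal with imaginary eigenvalues}\\
  e^{\Sigma s} &= R_s = \operatorname{diag}\!\big(e^{i\omega_1 s},\,\dots,\,e^{i\omega_n s}\big)
      &&\text{(complex phasors on the diagonal).}
\end{aligned}
```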
So I need to learn W from the data. If I don't know that a priori, I'm going to start out W random. I don't know what those are. And I'm going to start out phi random. I don't know what those are.
So given a bunch of data, learn what phi is and learn what W is. Then, given any particular image, I have to infer the transformation s and the particular instantiation alpha-- the pattern that was shown. We're going to start with two simple data sets where we know what the patterns and the transformations are, just to prove that it actually works.
One is just translated digits, and the other is rotated and scaled digits. When we give it translated digits, what it learns as the irreducible representation-- the matrix W-- is the Fourier transform.
And this is the right answer. We know this is the right answer because the Fourier transform is what diagonalizes shift. If you want to translate, you can do that very easily in the Fourier domain just by doing element-wise phase shifting on the Fourier components.
So the Fourier domain is the right domain to be in if you want to do shift by multiplying by a diagonal matrix. We didn't tell it that. We just gave it a bunch of patterns shifted by different amounts-- we didn't tell it what the patterns are.
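A quick numerical illustration of that point (my own minimal demo, not from the talk): a circular shift of a signal is just an element-wise phase shift of its Fourier coefficients.

```python
import numpy as np

# Circular shift as an element-wise phase shift in the Fourier domain.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
N, s = len(x), 2                                     # shift by 2 samples
phase_ramp = np.exp(-2j * np.pi * np.fft.fftfreq(N) * s)
x_shifted = np.real(np.fft.ifft(np.fft.fft(x) * phase_ramp))
print(np.round(x_shifted, 6))                        # [5. 6. 1. 2. 3. 4.], i.e. np.roll(x, 2)
```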
So it just has to figure out from all of this data that, look, there's an arc connecting these patterns, there's another arc connecting those patterns, and the group action along these different arcs is the same-- with one W matrix, I can connect all these different points in the right way.
And the set of patterns that it learns is shown here-- these are the templates it learns from the data. Then if you give it shifted versions of any given pattern, it can very compactly describe them in terms of a set of patterns and a set of transformations.
The same thing for rotation and scaling: it learns a more groovy transform here, which you can basically think of as a Fourier transform in a log-polar coordinate system.
Again, we didn't tell it that. We just gave it patterns that are rotated and scaled, and it figured out the irreducible representation. And again, it learns the patterns.
And once it does that, it can take these patterns and rotate and scale them just by moving two different scalar variables. Now, the interesting example is when we give it the full MNIST data set. In the previous examples, I just gave it some templates-- 10 particular exemplars out of the MNIST data set. And you have a question? Yeah-- oh.
AUDIENCE: You don't have to take questions now if you don't want.
BRUNO OLSHAUSEN: Yeah, please, go ahead.
AUDIENCE: Maybe you're going to address this, but what about either out-of-plane rotations or compositions?
BRUNO OLSHAUSEN: Compositions I'll address in a second. Out-of-plane rotations, hopefully next year. So this is early days, and we're just trying to exercise the framework.
But that's definitely where we want to go, is think about this as the training wheels for doing something more three-dimensional-- absolutely on the patterns.
AUDIENCE: Are you centering the learned patterns, or is it actually [INAUDIBLE]?
BRUNO OLSHAUSEN: In this case, with the rotation and scale, we have to center them. Yes-- in the previous case, they're obviously translating, and you cannot, at least with the group SO(2) we're using here, do all four of those conjointly and have it still be commutative. In their paper, they show a way you can do these jointly, but you have to do another transformation in between them. But that's an interesting question-- something I'm trying to think about-- how you can do them all conjointly in a better way.
So here's just the full MNIST data set. So we're not translating or rotating it by hand or anything. We're just letting the natural style variations in MNIST do their thing.
And here's what it learns for the irreducible representation, and here's what it pulls out as the templates within MNIST. Notice it's made a slight mistake here: it's included the 0 twice.
And it's left out the 1, because the way it describes a 1 is by taking a 0 and squeezing it. So it learned these two deformation axes: one a squeezing dimension-- horizontal squeezing-- and the other a shear operation acting on the digits.
It's just pulling that out naturally, in a very compact way-- I think that's the important thing about this. There are other, previous approaches using VAEs to do this kind of thing-- to disentangle transformations from shapes.
But here, all there is is two matrices: one matrix for the shapes-- 10 shapes-- and another matrix for the irreducible representation of the Lie group action. That's it. And the bilinear model is key.
Having these things multiply together in the generative model is crucial. If we didn't have that, it wouldn't work. So having this built in a priori is crucial to having the model take on this simple form.
So now, the next point in building this: we want to show that this is a problem we can reduce to vector factorization. The way we had it, there were a bunch of different matrices in the way. By a simple manipulation-- pre-multiplying both sides by W transpose, and then redefining I tilde as W transpose times the image and phi tilde as W transpose phi-- we can pull out just the diagonal elements.
R is a diagonal matrix, and if we multiply a diagonal matrix times a vector, it's basically just an element-wise Hadamard product between two vectors. So z sub s here is just the diagonal elements of R, and this is just whatever our shape model is.
So we can re-express the model as saying: it really amounts to a problem of vector factorization. Given this image, I want to factorize it into the z of s-- the s that caused it-- and the pattern alpha that caused it; in other words, factorize it into its equivariant part and its invariant part.
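In symbols, the reduction just described looks roughly like this (again my reconstruction of the slide, taking W as orthonormal):

```latex
% Reduction of the generative model to a vector factorization (reconstructed):
\begin{aligned}
  \tilde{I} &= W^{\top} I, \qquad \tilde{\Phi} = W^{\top}\Phi, \qquad
  z_s = \operatorname{diag}(R_s) = \big(e^{i\omega_1 s},\,\dots,\,e^{i\omega_n s}\big)\\
  \tilde{I} &\;\approx\; z_s \,\odot\, \big(\tilde{\Phi}\,\alpha\big)
  \qquad \text{(Hadamard product: equivariant factor $z_s$ times invariant factor $\tilde{\Phi}\alpha$).}
\end{aligned}
```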
And the reason why this is not such a straightforward problem-- it looks simple, because you say, well, maybe I can just do gradient descent with respect to s and alpha. Normally that's what we do in sparse coding for alpha, but we can't do it with respect to s, because there are many local minima in this space.
The way to see that is: if you try to take those two E's-- a shifted E and a template E-- and align them, then as you do that, you're going to get many false matches in between, and those give rise to all these different local minima. So a straightforward gradient descent strategy is not going to work-- it requires some kind of a search.
I'm going to get to that in a second. But first, let's go back and think about that vector z of s-- that complex-valued vector z. The way to think about it is that we start out with some base vector of complex phasors, and then we multiply each of these phases by the number x that we're representing.
Here we have it in terms of x-- x is what I was calling s previously. So we're going to multiply each of those phases by the value x.
And that's the way the shift variable is being encoded: it's one number, but we're encoding it by pushing it into a high-dimensional space. One way of thinking about it is that, for these different phasors with different phase values phi, as I advance the value of x, the phasors are spinning around at different rates.
If I did this with just two of those phasors, that would trace out a curve along a torus-- as I advance the value of x, I'm moving along, in this case, a two-dimensional torus. But if I have n phasors, I'm moving on an n-dimensional torus.
So that's the way this shift variable is being encoded. And it turns out this has a similarity kernel which, when you pick those phases randomly, is the sinc function. But more importantly, shift now becomes an equivariant operation, meaning that if we simply bind two of these complex-valued vectors multiplicatively-- if we multiply the z for one shift variable and the z for another-- we get the z corresponding to the sum of those shifts.
So this is going to make it very convenient for shifting or transforming patterns: we have an algebraic structure where we can add variables by simply multiplying the corresponding vector representations. And we can also superimpose them: if we don't know which z vector is currently in play, we can simply add them together and consider them all simultaneously in superposition.
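Here is a small numerical sketch of that encoding (my own illustration; the dimension and the uniform phase distribution are assumptions): multiplying two encodings element-wise adds the encoded scalars, and the similarity kernel falls off roughly like a sinc.

```python
import numpy as np

# Fractional power encoding sketch: a scalar x is encoded as a vector of complex
# phasors whose phases spin at random rates (phi drawn once at random).
rng = np.random.default_rng(0)
n = 1024
phi = rng.uniform(-np.pi, np.pi, n)            # random base phases (spin rates)

def encode(x):
    return np.exp(1j * phi * x)                # z(x): a point on an n-dimensional torus

# Equivariance under binding: element-wise multiplication adds the encoded scalars.
z = encode(1.5) * encode(2.0)
print(np.allclose(z, encode(3.5)))             # True

# Similarity kernel: <z(0), z(x)> / n falls off roughly like sinc(x).
for x in [0.0, 0.5, 1.0, 3.0]:
    print(x, np.real(np.vdot(encode(0.0), encode(x))) / n)
```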
Or by taking a weighted sum, we can even form a probability distribution over these different transformations. So this is what was done in the work of Paxon Frady working together with Alpha Renner, which is work that's, I think, out now. So what they do is they start off with an image which has now multiple objects in it, and they encode this image by simply taking a weighted sum-- a superposition of these complex valued vectors.
This vector u here is encoding the horizontal position, vector v is encoding the vertical position, and vector w is encoding the color. You simply take a Hadamard product of these at each point in the image-- which color, which horizontal and vertical position it is-- multiply those together element-wise, weight by the pixel value, and then sum all of these together, and that forms a scene vector s.
This scene vector s is now a product of the shape, the position, and the color of any given object in that image. And that factorization can be solved very efficiently by a resonator network. Basically, the way to think about it is as an iterative algorithm where at each step it's making a guess about what each factor is.
You take your current estimate for x here, you divide out the other factors, you project it into the space of vectors that x could possibly be,
you threshold it, and then you simply do this iteratively for each of the different factors that you want to estimate. And very quickly-- certainly within 50 iterations or so-- it can settle on a solution. That's what you see happening here: it's simply going through the image and trying to figure out what the different shapes, colors, and positions in that image are.
And that's what it's basically showing you here in this simulation. The main point is: if you think about the huge combinatorial space of different patterns you could create with seven colors, 50 horizontal positions, 50 vertical positions and 26 different letters, that's a humongous number of different objects.
And if you consider the number of images you could create by having three of those, or two, or one of them in the image, it's a gigantic space that the network searches over very efficiently by virtue of doing this factorization. So the point is that we've made progress on the factorization problem: it's now a tractable problem that we can solve with a certain kind of recurrent neural network that uses element-wise multiplication between vectors.
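To make the iteration concrete, here is a toy resonator-network sketch in the spirit of what was just described (the codebooks, sizes, phasor cleanup rule, and readout are my own illustrative assumptions, not the published implementation):

```python
import numpy as np

# Toy resonator network: iteratively factorize s = a * b * c (element-wise product)
# into one codevector per factor, using divide-out followed by codebook cleanup.
rng = np.random.default_rng(1)
n, K = 2048, 20                                         # vector dimension, codebook size

def random_phasors(shape):
    return np.exp(1j * rng.uniform(-np.pi, np.pi, shape))

A, B, C = (random_phasors((K, n)) for _ in range(3))    # rows are codevectors
s = A[3] * B[7] * C[12]                                 # the bound "scene" vector

def cleanup(u, X):
    """Project u onto codebook X, then renormalize each element to unit modulus."""
    return np.exp(1j * np.angle(X.T @ (X.conj() @ u)))

# Initialize each factor estimate with the superposition of its whole codebook.
ea, eb, ec = (cleanup(X.sum(0), X) for X in (A, B, C))

for _ in range(50):
    ea = cleanup(s * np.conj(eb) * np.conj(ec), A)      # divide out the other factors,
    eb = cleanup(s * np.conj(ea) * np.conj(ec), B)      # then clean up against the codebook
    ec = cleanup(s * np.conj(ea) * np.conj(eb), C)

# Read out which codevector each estimate has converged to (expect [3, 7, 12]).
print([int(np.argmax(np.abs(X.conj() @ e))) for X, e in ((A, ea), (B, eb), (C, ec))])
```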
And the final point to address is maybe two questions. One is: how are we going to represent complex phasors in the brain? And the other is: looking back here, in this example we have 50 horizontal positions
and 50 vertical positions to search over. This doesn't seem like it's going to scale up very well to a very large scene if I have to have a different vector for each possible position. But there's a much more efficient way of doing it, which is alluded to by the grid cells we see in entorhinal cortex.
And this is something that Ila Fiete pointed out many years ago: they may operate as a residue number system. This is a way of taking a very large range of numbers that you want to represent-- such as the different positions an object can occupy in an image-- and reducing it to a set of numbers which have a much smaller dynamic range.
And at the same time, the periodic structure of this maps perfectly, it turns out, onto complex phasors, which could implement this kind of structure. So, just a brief tutorial on what a residue number system is. This is the basic idea:
if you want to represent the number 41, for example, we can represent it in terms of its remainders with respect to a set of different base numbers, or moduli. So we're going to take 41 and represent it by its remainder with respect to 3.
41 divided by 3 has a remainder of 2. 41 divided by 5 has a remainder of 1. 41 divided by 7 has a remainder of 6, and so forth. So now we take the number 41 and represent it in terms of this set of remainders.
And we can represent it uniquely over a range of 105, which is the product of the different bases we're dividing by. The big point here is that if you pick these numbers to be coprime, then the range scales exponentially in the number of bases.
And the other very strong advantage of a residue number system is that addition and multiplication operate element-wise, so there is no carry. When you add two residue numbers together, you just add the residue components, and they simply circle around in the space.
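A tiny sketch of that arithmetic (my own illustration, using the moduli 3, 5, 7 from the example, with a brute-force decoder standing in for the Chinese Remainder Theorem):

```python
# Residue number system with moduli (3, 5, 7); unique range = 3 * 5 * 7 = 105.
moduli = (3, 5, 7)

def to_residues(x):
    return tuple(x % m for m in moduli)              # e.g. 41 -> (2, 1, 6)

def add(r1, r2):
    # Addition is element-wise and carry-free: each component wraps around its own modulus.
    return tuple((a + b) % m for a, b, m in zip(r1, r2, moduli))

def from_residues(r):
    # Decode by brute force over the range (the Chinese Remainder Theorem gives a closed form).
    return next(x for x in range(3 * 5 * 7) if to_residues(x) == tuple(r))

print(to_residues(41))                               # (2, 1, 6)
print(from_residues(add(to_residues(41), to_residues(30))))   # 71
```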
It's a circular, modulo space, so everything just wraps around, and there's no need for a carry. And you get a unique answer, corresponding to the sum or the product, computed simply element-wise. And so what Chris Kymn worked out is a way that you can now do this with complex phasors.
How do you implement residue numbers with complex phasors? It turns out that if you pick the phases of these base vectors from a discrete probability distribution, then you get a similarity kernel with this modulo property-- it simply wraps around in the space in the way that you want.
Then we can take the vectors for the different residue components and bind them together multiplicatively to get one vector, which corresponds to the number you're trying to represent. So what we have here is a number with a range of 105,
and we're composing it by combining three numbers which have much smaller ranges-- just a range of 3, a range of 5, and a range of 7. The point is that this makes the search problem much more efficient, because we have a much smaller space of possibilities to search over: the space is basically 3 plus 5 plus 7 rather than 3 times 5 times 7.
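Here is a minimal sketch of that construction (my own illustration; the discrete phase choice-- integer multiples of 2*pi/m-- and the dimension are assumptions): each modulus gets its own phasor code that wraps with period m, and binding the three codes multiplicatively gives a representation with period 3 * 5 * 7 = 105.

```python
import numpy as np

# Phasor encoding of residue components: phases are integer multiples of 2*pi/m,
# so each component's code wraps around with period m; binding them combines moduli.
rng = np.random.default_rng(2)
n = 1024

def residue_encoder(m):
    k = rng.integers(0, m, n)                          # discrete phase "rates" for this modulus
    return lambda x: np.exp(1j * 2 * np.pi * k * x / m)

moduli = (3, 5, 7)
encoders = [residue_encoder(m) for m in moduli]

def encode(x):
    # Bind the per-modulus codes multiplicatively; the result is periodic with period 105.
    v = np.ones(n, dtype=complex)
    for enc in encoders:
        v *= enc(x)
    return v

print(np.allclose(encode(41), encode(41 + 105)))       # True: wraps with period 3*5*7
print(round(float(np.real(np.vdot(encode(41), encode(42)))) / n, 2))  # near 0: distinct numbers decorrelate
```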
And I think I'm running out of time, so I'm going to have to move quickly-- this is just showing the exponential scaling that you get. It's also extremely robust to noise, which I think is an important property for a neural representation.
You can add huge amounts of phase noise to these phase vectors and it still settles on the right answer. And there's a beautiful way to tile two-dimensional space with these vectors using the Mercedes-Benz frame, to get a kind of grid-cell-type representation-- which I'm really excited to talk to Ila about, but she's not here today. I'm going to meet with her tomorrow, hopefully.
And so I think I'm just going to end there. There's much more I could tell you about, but the point I'm trying to make is about bridging this link.
Trying to build machines that capture what's going on in biology requires, first of all, starting with observations of animal behavior, which define the problem landscape-- the problems we're trying to solve. Biological structure-- looking inside brains-- tells us about the computational primitives that we should be thinking about using.
And finally, mathematical structure-- things like Lie groups and residue number systems, very powerful ideas that we can exploit from mathematics-- provides the computational foundations that allow us to engineer these systems. Well, thanks for your patience, and I'm happy to take questions.
[APPLAUSE]