TOMASO POGGIO: All right. So let me start-- this is a paper by Geman, or rather the Brown group, who started to speak quite a few years ago about compositionality and vision. And just to read the first sentence or two: it's the evident ability of humans to represent entities as hierarchies of parts, and the connection with language and so on and so forth. And we'll hear much more about it from Josh, I assume, maybe Max as well. And you're welcome, of course, to pitch in your point of view.
My definition of compositionality is much more naive. It's really about functions. So you have a function of a vector x going from, say, R^n to R. And this function may be a composition of several functions. In fact, it's kind of interesting that if you think about any computable function, and I mean a computable function is a function that can be computed by a Turing machine, then you are into the theory of recursive functions. And a recursive function is a computable function and can be defined in terms of a procedure that uses a few primitive functions, like addition and multiplication, and out of them creates every function. So from this point of view, every function is compositional. That comes from the math. That is, every computable function.
These are compositions of elementary procedures, OK? And you can choose whether it's addition and multiplication, or I assume you can choose addition and the exponential function. And out of it, you can define the log, and from addition, you can define multiplication and so on and so forth.
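The remark about getting multiplication out of addition and the exponential can be made concrete. What follows is only a minimal sketch, assuming positive real inputs and Python's standard `math` module; the helper names `multiply` and `power` are ours and do not come from the talk.

```python
import math

def multiply(a: float, b: float) -> float:
    """Multiply two positive numbers using only exp, its inverse log, and addition."""
    if a <= 0 or b <= 0:
        raise ValueError("this sketch only handles positive inputs")
    return math.exp(math.log(a) + math.log(b))

def power(a: float, k: int) -> float:
    """Raise a positive number to a non-negative integer power by composing multiply."""
    result = 1.0
    for _ in range(k):
        result = multiply(result, a)
    return result

print(multiply(3.0, 4.0))  # close to 12, up to floating-point rounding
print(power(2.0, 10))      # close to 1024, up to floating-point rounding
```

Each new function here is just a composition of the previous ones, which is the sense of compositionality being used in this part of the talk.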
So OK. Now let me speak about the microstructure of compositionality. So the general point is that I will use graphs like the one on the top. Something like this, for instance, is a function of many variables, a vector in R^n, that outputs some number in R.
But then I could have, for instance, a function like the one on the right there, which would be a function of several variables followed by another function of one variable. So this would be something like a function followed by another function of one variable, like H1, H2, H3. H3 is a function of x1 to xn, so from R^n to R. And then this is a function of one variable only, right?
And so depending on this compositionality, there are some interesting questions about approximation. And how can you learn or approximate from a set of examples? Input, output data, so inputs like a vector x, output y-- how can you approximate a function you don't know?
And we think of a neural network as an approximator. A way of approximating functions is this, for instance, with sigmoidal units or rectifier units. This is now a network. It's not a function anymore. It's a network.
Each unit is a rectifier unit: it combines linearly various inputs, passes the result through the rectifier, and gives an output. So this is a one layer network. This is, for instance, three layers.
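A one-layer rectifier network of the kind just described can be sketched in a few lines. This is an illustrative sketch in plain Python, not code from the talk; the function names and the tiny example weights are our own assumptions.

```python
def relu(z: float) -> float:
    """Rectifier: pass positive values through, clamp negative values to zero."""
    return max(0.0, z)

def one_layer_network(x, weights, biases, out_weights):
    """One hidden layer: each unit combines the inputs linearly, applies the
    rectifier, and the outputs are combined linearly into a single number."""
    hidden = [
        relu(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
        for w, b in zip(weights, biases)
    ]
    return sum(c * h for c, h in zip(out_weights, hidden))

# Two units are enough to represent |x| exactly: relu(x) + relu(-x).
abs_weights = [[1.0], [-1.0]]
abs_biases = [0.0, 0.0]
abs_out = [1.0, 1.0]
print(one_layer_network([-3.0], abs_weights, abs_biases, abs_out))  # 3.0
```

Stacking several such layers, with the output of one layer used as the input of the next, gives the multi-layer case.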
Now if you have a function like this, and you ask how complex the network has to be-- first of all, can a network like this approximate it? The answer is yes. A network with one layer and [INAUDIBLE], which is pretty much arbitrary, can approximate any continuous function within a [INAUDIBLE] epsilon in the [INAUDIBLE], so the maximum difference, provided you have enough units.
Now here comes the problem. The number of units that you require-- so let's call f the function, so the upper graph on the left, and say N is the network. And in the [INAUDIBLE] I would like the difference between these to be less than epsilon on a finite domain, a bounded domain. In order to have this, N has to have a number of units which is on the order of 1 over epsilon to the power n divided by m. Here n is the number of inputs, the dimensionality, and m is the number of derivatives of the function, how smooth it is.
Now this 1 over epsilon to the n is what is called the curse of dimensionality, and it appears all over. This is the number of units I need. It's the number of examples I need to learn it, and so on. And as you can see, if epsilon, say, is 10%, 1 over epsilon is 10. And if n is 100 or, say, 1,000, which means an image which is not very big-- it's about 30 by 30-- you have 10 to the 1,000 units. So exponentials are bad. That's the curse of dimensionality.
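The arithmetic behind "10 to the 1,000 units" is easy to check. The sketch below works in log base 10 because the number itself overflows ordinary floating point. It takes the scaling quoted in the talk, (1/epsilon)^(n/m), at face value and ignores constant factors; the function name is ours.

```python
import math

def log10_shallow_units(eps: float, n: int, m: int) -> float:
    """log10 of (1/eps)**(n/m), the order of the unit count quoted for a shallow network."""
    return (n / m) * math.log10(1.0 / eps)

# eps = 10%, n = 1000 inputs, m = 1 derivative: about 10**1000 units.
print(log10_shallow_units(0.1, 1000, 1))
# A smoother target (m = 10) lowers the exponent but it stays large: about 10**100.
print(log10_shallow_units(0.1, 1000, 10))
```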
OK. However, there are some situations, like this one, where you have a function that we assume has this particular structure: it is a function of many variables n, but it is really made up of functions of functions, so compositional, each of which is a function of two variables. So this is a graph of the function. You can imagine that each node here is a function. Here is another one. Each one of these functions at each level is just a function of two variables.
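The binary-tree structure described here can be written down directly. This is a minimal sketch under two simplifying assumptions of ours: the number of inputs is a power of two, and the same two-variable function `h` is used at every node (in the talk each node may be a different function).

```python
def compose_binary_tree(inputs, h):
    """Evaluate a compositional function whose graph is a binary tree:
    neighbouring pairs are combined by h, level by level, until one value is left."""
    level = list(inputs)
    n = len(level)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two number of inputs")
    while len(level) > 1:
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Eight inputs, three levels; every node only ever sees two numbers.
print(compose_binary_tree([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))  # 36
print(compose_binary_tree([1, 2, 3, 4, 5, 6, 7, 8], max))                 # 8
```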
OK, in this case, you can prove that the number of units that you need is-- so the previous one is well known; this is a new result-- 1 over epsilon to the 2 over m, times n. So we avoid the curse of dimensionality.
Now the number of units-- OK-- depends exponentially on the dimensionality of the constituent functions, which in this example is two, and on the overall dimensionality only linearly.
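To see the difference between the two scalings, one can plug small numbers into both expressions. The sketch below uses the orders of magnitude as stated in the talk, (1/epsilon)^(n/m) for a shallow network and n times (1/epsilon)^(2/m) for a deep network matched to a binary-tree function, with all constant factors dropped; the function names are ours.

```python
def shallow_units(eps: float, n: int, m: int) -> float:
    """Order of the unit count for a shallow network: (1/eps)**(n/m)."""
    return (1.0 / eps) ** (n / m)

def deep_units(eps: float, n: int, m: int) -> float:
    """Order of the unit count for a deep network on a binary-tree
    compositional function: n * (1/eps)**(2/m)."""
    return n * (1.0 / eps) ** (2 / m)

# eps = 10%, n = 8 inputs, m = 2 derivatives.
print(shallow_units(0.1, 8, 2))  # about 10,000
print(deep_units(0.1, 8, 2))     # about 80
```

The exponent in the deep case contains only the dimensionality of the constituent functions (two), while n appears as a plain multiplicative factor.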
You have some more general results of this type using both smoothness and dimensionality of the constituent functions, but the point is that something like this-- which, by the way, is more or less how we think visual cortex works. This is like neurons not looking at the whole image at once, but only locally at a small receptive field.
That's another one. And then neurons at the higher level, looking at an image created by the previous neurons, but also locally. So if the function that you want to approximate has that structure, then deeper networks, but not shallow ones, can avoid the curse of dimensionality.
So question is-- let me see. These are some of the theorems that we have. That's this-- these are for function in Sobolev spaces, but it's pretty general. And there is a more general version that is this one, which is combining both smoothness and dimensionality.
But anyway, you can avoid the curse of dimensionality if the function you try to learn is compositional in this way. Now, it seems intuitive, and there are some numerical results that we have done with, for instance, this is on [INAUDIBLE]. It seems that shallow networks do much worse than convolutional ones, as we predict.
But let me come to--
AUDIENCE: Times up.
TOMASO POGGIO: Yeah. So there are some connections with the old perceptrons and stuff, but I think the question we'll be debating is, you can say, OK, fine. You showed us that you can beat the curse of dimensionality for compositional functions, but why are compositional functions important or relevant? Why do they come up?
And there are a few arguments. I imagine that Max would say physics is the reason why the functions that come up are compositional. And Josh would say it's neuroscience or some prior. And I'll probably say it's mathematics. OK.
TOMASO POGGIO: Yeah, sure.
AUDIENCE: Tommy, could I ask a quick question?
TOMASO POGGIO: Yeah.
AUDIENCE: So in a convnet there's a particular notion of locality, which is the driving force behind this compositional architecture, right? There's some spatial locality, right? So the assumption there is that the functions you're trying to approximate are compositional in precisely that sense. Is that the only notion of compositionality that--
TOMASO POGGIO: I claim that this binary tree that I showed, this is a simplified model, [INAUDIBLE].
AUDIENCE: Great. No, I understood that part. What I'm asking is that-- are there other notions of locality that give rise to compositionality in, let's say, visual perception that go beyond just spatial locality, right? Because presumably if you think about theories of object recognition, like geons, for example, recognition by components, there's a notion of compositionality there. But there the inputs into the partition that gives rise to compositionality is not at the spatial level, right? It's at some more subordinate level.
TOMASO POGGIO: Yeah. So I assume that could be put in functional terms.
TOMASO POGGIO: It's a question of how you translate, in my case, it could be directly [INAUDIBLE] that gives you some parts or something that then your function [INAUDIBLE]. So yes.
AUDIENCE: Yeah. Perfect. Yeah.
TOMASO POGGIO: That's not the complete spatial locality.
MAX TEGMARK: Great. So why compositionality? Is it math, physics, neuroscience, something else? Let me give you the physics pitch, or to argue at least that physics has something to do with this.
So I think taking a step back, there are two questions which I find just fascinating in this theory. One is, why does cheap learning work so well, right? And the other one is why does deep learning work so well? And let me explain what I mean. Let's start with the first one here.
So if you look at functions in general, functions can be very simple. It could be like a little box where you put in a number, turn a crank, and its square comes out. But they can also be complicated. If you had the world's best function that takes as input a chess position and outputs the next move, well, you shouldn't be here. You should be off winning the world championship in computer chess, right?
Or if you have the world's best function that takes a one-megapixel image with a million pixel values as input and outputs a caption, you can win all sorts of nice computer science competitions with this, right?
So we're talking about the question of how well one can approximate these sorts of functions now with things like neural networks. And what I find kind of fascinating is that we don't just completely suck at this, if you think about it. Because look at this function, for example. If this is a megapixel image, even if it's black and white with only one black or white pixel in each place, how many images are there? 2 to the power of one million. That's more images than there are atoms in our universe, right?
And to specify an arbitrary function of such an image, even if you're just classifying it into cat versus dog, you need to write down for each possible image the probability that it's a cat. So 2 to the one million parameters in that function. Even if you could write one parameter on each atom, the specification of the function wouldn't fit in our universe. So clearly, it should be impossible to make a cat classifier, right?
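The size of these numbers is easy to verify. The sketch below counts binary megapixel images in powers of ten; the figure of roughly 10^80 atoms in the observable universe is a commonly quoted order-of-magnitude estimate that we are assuming here, not a number from the talk.

```python
import math

N_PIXELS = 1_000_000   # one megapixel, one bit per pixel
LOG10_ATOMS = 80       # assumption: common rough estimate for the observable universe

# log10 of 2**N_PIXELS, i.e. roughly the number of decimal digits of the image count.
log10_images = N_PIXELS * math.log10(2)

print(round(log10_images))          # about 301030, so 2**1000000 is about 10**301030
print(log10_images > LOG10_ATOMS)   # True: far more images than atoms
```

A lookup table with one entry per image would therefore need on the order of 10^301030 parameters, which is the sense in which an arbitrary classifier "wouldn't fit in our universe".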
No. Magically, it works. Why? I'll get back-- you have these networks where you can send in all sorts of images, and then you can classify them quite fine with just a million parameters or so. And your brains, with maybe 10 to the 14 parameters tops, do a great job on these sorts of things. And we have deep learning networks with much fewer parameters than that that can even evaluate the value function of Go positions, and stuff like this.
One might say, well, of course it can do it, because neural networks don't have a lot of parameters in them. But that's cheating. The whole point was to understand why any kind of way of parameterizing the function like this or in some other way can do it with a less than astronomical number of parameters.
You might start to suspect that physics has something to do with this if you just look at an image and reflect. I said that there might be-- that there are like 2 to the million possible images. Well, almost all of those images-- what do they look like? Almost all of them look like pure noise.
That's not what you have in your photo album. Typical images from that set look like what would come out if you ran an encryption algorithm or something-- so somehow, physics presents us with a very, very small fraction of all actually possible images. And somehow, that enormous simplicity is what enables us to build algorithms and functions that do useful things with them.
So I wrote this paper on the topic with Henry Lin, who unfortunately couldn't be here today. Hi, Henry, if you're watching this on the web. And first of all, we point out that if you take some probability distribution that you have-- maybe you are doing Bayes' theorem or something-- you basically just take logarithms of all your probabilities. Then Bayes' theorem turns into something which looks a lot like a Boltzmann equation in physics where you're taking the exponential of another function. So the probability distribution is the exponential of something which looks like what we in physics often call the energy function, or the Hamiltonian.
And if you look at the Hamiltonian that describes part of the standard model of particle physics, or some approximation like classical physics or whatever, what you also find is that those functions are also extremely simple. They could have been crazy complicated and physics would have been an epic failure. But they're also very, very simple.
And how are they simple exactly? Well, let me give a little example. Suppose you have 10 variables, and I want to specify some arbitrary function of these. Suppose, for example, these are little tiny magnets. Each one could point up or down, so we'll have, say, 0 or 1, or maybe minus or plus 1, to encode what each of these is doing. And we want to write down what is the total energy of this system, and maybe we have a very--
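The step from Bayes' theorem to a Boltzmann-like form can be shown with a two-class toy example. This is a sketch under our own assumptions: the class names, priors, and likelihood values are made up for illustration, and the "Hamiltonian" is simply defined as the negative log of prior times likelihood.

```python
import math

def hamiltonian(prior: float, likelihood: float) -> float:
    """H_y(x) = -ln p(y) - ln p(x | y), an 'energy' for class y given data x."""
    return -math.log(prior) - math.log(likelihood)

def posterior_from_energies(energies: dict) -> dict:
    """Boltzmann form of Bayes' theorem: p(y | x) = exp(-H_y) / sum over y' of exp(-H_y')."""
    partition = sum(math.exp(-h) for h in energies.values())
    return {y: math.exp(-h) / partition for y, h in energies.items()}

# Hypothetical numbers: equal priors, and the image is four times likelier under "cat".
priors = {"cat": 0.5, "dog": 0.5}
likelihoods = {"cat": 0.08, "dog": 0.02}

energies = {y: hamiltonian(priors[y], likelihoods[y]) for y in priors}
posterior = posterior_from_energies(energies)
print(posterior)  # cat close to 0.8, dog close to 0.2
```

The normalising sum plays the role of the partition function, and the result matches the posterior one would get from Bayes' theorem applied directly.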
So an arbitrary function of-- or maybe these are continuous variables. An arbitrary function of 10 variables-- well, we'd need infinitely many parameters to specify. Suppose the function is a polynomial. That's way simpler. It's a quadratic polynomial, and you need only 66 parameters now.
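The count of 66 can be checked by counting monomials. A polynomial of degree at most d in n variables has C(n + d, d) coefficients, which is a standard combinatorial fact; the function name below is ours.

```python
import math

def polynomial_parameter_count(n_variables: int, degree: int) -> int:
    """Number of coefficients of a general polynomial of at most the given degree:
    the number of monomials, which is C(n_variables + degree, degree)."""
    return math.comb(n_variables + degree, degree)

# Quadratic in 10 variables: 1 constant + 10 linear + 55 quadratic terms.
print(polynomial_parameter_count(10, 2))  # 66
```

The same formula applies to any other degree, so the count grows polynomially in the number of variables rather than exponentially.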
And, in fact, it turns out that the Hamiltonian of the standard model of particle physics is the best we've gotten after thousands of years of physics, but it's a polynomial and it's of degree 4. How crazy is that? Why is it so simple?
And moreover, the only thing that's fourth power really is the Higgs particle stuff. A lot of other things, like electromagnetism, are just quadratic functions. So that's one simple thing we see in physics. For some reason, nature loves low order polynomials.
Another thing we noticed in physics is that the-- it's local. Information can't travel faster than the speed of light, and that means that things, given very little time, only talk to their neighbors. So if this is a physics function, it's only allowed to have things in it like this times this. It can't depend on that times something far away, so-called action at a distance.
We don't know why that is. It's just an experimental fact. But if you insist that the function here be local also, then suddenly the parameter count drops even more, as you can see in my little example here.
Yet another simple thing we see in physics is symmetry. The laws of physics over here seem to be exactly the same as the laws of physics over there, or over there, or for that matter also at a later time. Or if you rotate things, the laws of physics also seem the same.
So in this case if you just have translational symmetry, then the interaction strength between these two guys must be the same as between those two and those two. Now we have only two parameters left. So you see, even in this poor example how things get very, very simple.
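One way to picture an energy function that is quadratic, local, and translation symmetric is a chain of magnets with a single shared coupling and a single shared field. This reading of the "two parameters" is our assumption, since the slide is not available; the sketch is the textbook one-dimensional Ising form.

```python
def chain_energy(spins, coupling: float, field: float) -> float:
    """Energy of a chain of +1/-1 spins with nearest-neighbour interactions only.
    Locality: only adjacent pairs interact. Translation symmetry: every pair
    shares the same coupling and every spin feels the same field."""
    neighbour_sum = sum(a * b for a, b in zip(spins, spins[1:]))
    return -coupling * neighbour_sum - field * sum(spins)

aligned = [1] * 10
alternating = [1, -1] * 5
print(chain_energy(aligned, coupling=1.0, field=0.5))      # -14.0
print(chain_energy(alternating, coupling=1.0, field=0.5))  # 9.0
```

Whatever the length of the chain, the function is specified by just `coupling` and `field`.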
In the actual real world, we can so far predict every single experiment ever made in the history of science and physics, in principle, with just 32 parameters, because the laws of physics have this locality, symmetry, low polynomial degree, and so on, which, as we'll talk about later, is also very linked to compositionality. So that's what I meant when I asked why does cheap learning work so well. By cheap, I mean how can we approximate things we care about, functions we care about, with very small numbers of parameters. It's kind of on the cheap.
What about the deep part? Well, we heard from Tommy already that there are these beautiful theorems, including the ones that Tommy showed, that say that even if you can approximate something very well and cheaply with a deep neural net, if you force yourself to flatten it and try to do it all on one layer, you need suddenly exponentially more parameters. I wanted to share with you a very, very simple example of that that we proved a little theorem about in our paper, Henry and I. Two minutes? Yeah?
Just multiplying together numbers sounds like a very easy task, right? And sure enough, it's very easy to do also with a neural network. If you have, for example, 512 numbers and you want to multiply them together, we proved that you can do it with arbitrarily good accuracy by using 36 neurons if you put them in nine layers.
So that's not surprising. It's easy in the way you can kind of multiply things together in pairs, and then you multiply the pairs together and all done.
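The "multiply in pairs" scheme and its depth can be sketched directly. This is plain Python rather than a neural network; it only illustrates why 512 inputs lead to nine levels. The padding with 1.0 for odd-length levels is our own convenience.

```python
def product_in_pairs(values):
    """Multiply numbers by repeatedly multiplying neighbouring pairs.
    Returns the product and the number of levels (layers) that were needed."""
    level = list(values)
    if not level:
        raise ValueError("need at least one number")
    layers = 0
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(1.0)  # pad so that every element has a partner
        level = [level[i] * level[i + 1] for i in range(0, len(level), 2)]
        layers += 1
    return level[0], layers

numbers = [1.0] * 511 + [2.0]
print(product_in_pairs(numbers))  # (2.0, 9): 512 inputs need log2(512) = 9 levels
```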
If you force yourself to do it all at once with a neural network, we were able to prove that you need 2 to the power n neurons to do it. You can do it also like this, but you can't do it any cheaper.
So that's 2 to the 512-- you need more neurons than there are atoms in our universe again, just for this trivial task. So that just emphasizes, again, what Tommy said.
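The basic building block, multiplying two numbers with a handful of neurons, can be illustrated with a Taylor-expansion trick. The sketch below assumes a softplus nonlinearity and four hidden units; it is one standard way to build such a gate and may differ in detail from the construction in the paper mentioned above. The function names and the `scale` parameter are ours.

```python
import math

def softplus(x: float) -> float:
    """Smooth nonlinearity ln(1 + e^x); its second derivative at 0 is 1/4."""
    return math.log1p(math.exp(x))

def multiply_gate(u: float, v: float, scale: float = 0.01) -> float:
    """Approximate u * v with four softplus units.
    softplus(s) + softplus(-s) = 2 ln 2 + s**2 / 4 + higher-order terms, so the
    combination below isolates ((u + v)**2 - (u - v)**2) / 4 = u * v once the
    inputs are scaled down enough for the higher-order terms to be negligible."""
    s = scale * (u + v)
    d = scale * (u - v)
    hidden = softplus(s) + softplus(-s) - softplus(d) - softplus(-d)
    return hidden / (scale * scale)

print(multiply_gate(2.0, 3.0))   # close to 6
print(multiply_gate(-2.0, 3.0))  # close to -6
```

Shrinking `scale` reduces the approximation error, which is the sense in which the accuracy can be made arbitrarily good. Gates like this one can then be arranged in pairs, level by level.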
Multiplication, of course, is a beautiful example of something which is compositional.
TOMASO POGGIO: It's actually the prototypical--
MAX TEGMARK: Yeah. If you want to multiply these things, you don't have to multiply the first by the second and then the result by the third. You can multiply everything in pairs and then multiply the pairs and so on.
So to summarize what I wanted to say is what I find very intriguing is if we start and ask, of all possible functions you can imagine as a mathematician-- that's this space set here-- then the set that we can approximate well with feasibly large neural nets is this puny, puny subset, actually exponentially small subset there, OK? If you ask out of all these functions which are functions that physics makes you actually care about, that's also a tiny, tiny subset of all functions.
And what we seem to find-- that's what we were arguing in our paper-- is that basically this puny fraction of all functions and that puny fraction are actually the same, and that's very fortunate. And I hope we can argue now in our debate why that is. Is it that somehow we evolved our brains to tap into the simplicity that was in nature, and that's why evolution selected neural nets, or is there some other reason? I don't pretend to have the answers, but I think we'll have a fun debate about it. Thank you.
JOSH TENENBAUM: So I wasn't really sure what to prepare, because I wasn't really sure what we were debating. So what are we debating? Maybe we could have a brief intro. Are we debating why our minds and brains are compositional? Is that what we're debating?
TOMASO POGGIO: Yeah.
JOSH TENENBAUM: Yeah. Or why does it seem so essential?
MAX TEGMARK: Who is our moderator?
JOSH TENENBAUM: Sam, maybe. Sam is whispering all sorts of useful things in my ear. Or no, oh, Kenny. What are we-- yeah.
TOMASO POGGIO: Why so many problems, say in vision--
JOSH TENENBAUM: Yeah.
TOMASO POGGIO: You know, problems in which [INAUDIBLE].
JOSH TENENBAUM: Yes.
TOMASO POGGIO: Why should they be compositional?
JOSH TENENBAUM: OK.
TOMASO POGGIO: If they are, then we know why they do well.
JOSH TENENBAUM: Right.
TOMASO POGGIO: That's the argument, right? Why they should be?
JOSH TENENBAUM: So why should the problems have this compositional structure to them. Yeah. Well, OK. I'll give some perspective and see how that meets up. I think-- this is one of these debates where we agree upon probably maybe more things than we disagree on. This one I think will be good.
TOMASO POGGIO: Yes.
JOSH TENENBAUM: OK. I'll try to give the cognitive science perspective. Maybe this also counts as a neuroscience perspective. And it's really just sort of some background. I think everybody here has seen me talk about a lot of this stuff before, so I'm not going to actually talk about the research, just sort of the background.
So I think compositionality makes all interesting computation possible. Here's a slide a lot of you have seen me show before, which is just-- I like to think about, what are the aspects of intelligence that we don't yet have good models of that span engineering and the brain? So things that neural networks don't do well.
I think Tommy's made a good case that things that-- like convolutional nets can really be well understood. They're best understood in terms of ideas of compositional functions for pattern recognition. So good pattern recognition from limited training sets, and all of our training sets are limited, requires compositionality.
But so does all of this other stuff that minds do, right, which I like to lump under modeling the world. So explaining what we see, imagining new things, solving problems-- like that, right-- requires understanding parts and their relations and their functions, like this part and that part and that part, right? I mean, it's just so self-evident that every-- all the interesting things our minds do. Building new models-- all of these require having systems of knowledge that have parts with meaningful relations to each other, and the power comes from combining and recombining these parts in increasingly new and powerful ways.
Same is true for communication. And it's just it's a very old idea. So in the Geman paper, he goes back to Laplace. Whoa, that's really weird. OK, it looks better on that slide.
So Ken Craik is one of the relatively unsung heroes of cognitive science and AI. How many people have heard of him? OK. How many people have heard of him other than from a talk in which I've shown the slide before? OK.
So in his book, The Nature of Explanation-- he was kind of a contemporary of Turing's and also died tragically young. But he emphasizes basically-- this book is called The Nature of Explanation, but it's really his philosophy of mind in-- one of the first computational philosophies of mind. But it was before computers, because he was writing in the early 1940s, so before computers as we understood it.
And he says one of the most fundamental properties of thought is its power of predicting events, enabling us, for example, to design bridges with a sufficient factor of safety instead of building them haphazard and waiting to see whether they collapse. And in general, if the organism carries a small-scale model of external reality and of its own possible actions within its head, then it's able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilize knowledge of past events, and so on, in every way to react in a much fuller, safer, and more competent manner, OK?
He then makes various analogies to engineering and modern technology, like all these instruments which have extended the scope of our sense organs, telescopes and microscopes, wireless, calculating machines, typewriters, motor cars, ships, airplanes. He doesn't mention computers, because that didn't really exist yet, right? But is it not possible, therefore, that our brains themselves utilize comparable mechanisms to achieve these same ends? That there are mechanisms that can parallel phenomena in the external world in our heads as a calculating machine can parallel the development of strains in a bridge?
So I think this is some deep connection between what you're talking about-- that's the cognitive scientist's view of your last point-- that if the key thing that our minds are designed to do is not just to recognize patterns, not just to compute and compose functions, but in particular to build models of the world, if there's some basic kinds of compositionality in the world, in the physics of the world, then minds are going to exploit that if they're well-designed. I think that's the basic principle.
This history-- this has come up in all sorts of other ways, whether it's Helmholtz's analysis by synthesis in perception, or Neisser's first book on cognitive psychology, which emphasizes, across language, vision, and action planning, hierarchies of composable functions as a sort of unifying framework for the mind.
A lot of the work in our group-- we're using these probabilistic programs, which are basically ways of updating and combining some of these ideas, but embedding them in a probabilistic inference framework to deal with uncertainty from small amounts of data and all that sort of thing.
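For readers who have not seen one, a probabilistic program is a generative procedure built out of composable random choices, with inference run backwards from an observation. The toy below is purely illustrative and is not one of the group's actual models: the block-tower scenario, the uniform priors, and the exhaustive-enumeration inference are all our own assumptions.

```python
from fractions import Fraction
from itertools import product

def tower_posterior(observed_height: int, max_blocks: int = 4, block_heights=(1, 2)):
    """Generative story: pick a number of blocks uniformly from 1..max_blocks, then
    pick each block's height uniformly from block_heights. Given only the total
    height, return the exact posterior over the number of blocks."""
    weights = {}
    for n_blocks in range(1, max_blocks + 1):
        prior = Fraction(1, max_blocks)
        per_configuration = Fraction(1, len(block_heights)) ** n_blocks
        matches = sum(
            1
            for heights in product(block_heights, repeat=n_blocks)
            if sum(heights) == observed_height
        )
        weight = prior * per_configuration * matches
        if weight:
            weights[n_blocks] = weight
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observation is impossible under this model")
    return {n: w / total for n, w in weights.items()}

print(tower_posterior(5))  # {3: Fraction(3, 5), 4: Fraction(2, 5)}
```

The model is compositional in the sense that the scene is assembled from parts (blocks), and uncertainty about the parts is resolved by conditioning on what was observed.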
I think some of the challenges, though, come where this meets up with the brain, right? So we have good-- we have at least some plausible ways that the brain could do probabilistic inference, whether it's Bayesian inference or statistical learning or something. But at the moment, when we look at how minds are supposed to capture compositionality, it looks like symbolic languages of thought, whether it's Lisp or Prolog or anything else like that. And understanding how these meet up, I think that's one of our really great challenges.
So we can build compositional models at the cognitive level of the mind in terms of probabilistic programs where the language is compositional, and that explains in engineering terms how our minds use compositionality to very richly model the world. But I think if we want to introduce some topics for debate here, one of the key challenges is in how any of these things actually work in the brain, right? Tommy has often been pushing me-- it's one of the great things that comes from CBMM is getting Tommy's push on this, right? Like, how could this work in the brain?
I'll talk-- and this is just the last slide I'll show. These are just some particular examples that we could talk about if people want later on in the debate. Again, I'm mostly referring to things that are pretty much out here in our CBMM conversation already.
But if I say, for example, that common sense scene understanding involves something like a physics engine in the head, or a game engine in the head, Tommy will say, well, how does that work in neurons? I don't see how they could do that, right? The vector space language of neural networks, right, for the most part, it can support some basic kinds of composition like the ones Tommy's talked about, just like nested hierarchies of functions and repeated functional or network motifs like convolution. But we don't see how to build something like all the symbolic structure of a game engine in neurons.
So that puts out a challenge. Either there are these things or they aren't, and if so-- I think that the evidence from cognitive science, from behavior, and from computational models of cognition, whether it's in scene understanding or our ability to learn new concepts from just one example-- people have seen me talk about that a lot-- or in language. I mean, the classic example, right? How we put together words into phrases and phrases into sentences to express increasingly more sophisticated kinds of meaning.
This kind of symbolic compositionality that we see in these sorts of languages, or in probabilistic programs-- the evidence for that is overwhelming, at least in engineering terms. It's like the only good way we have to capture this kind of thing. And yet when we look in the brain, we don't see those things.
So how does that work? Do we need some new mathematics, for example, of how neurons can capture not only the kinds of compositionality that Tommy was talking about, but also the kinds of compositionality that we have in symbolic programming languages? Or is there some way that, say, the kinds of compositionality that Tommy's talking about, or other kinds of, say, recurrent neural networks can somehow meet these challenges?
We could talk about if people want specific versions of these things and ways in which if you look at, say, for example, neural network models that don't have the kind of compositionality I'm talking about fail so far to meet these challenges. Doesn't mean that they couldn't.
There's this-- we have this forthcoming BBS article with Sam and Tomer and Brenden Lake called, "Building Machines That Learn and Think Like People." And I think many of you maybe have seen it. It's on arXiv. But it tries to make the case in these examples, domains, why you need some kind of compositionality, but to leave it totally open how, for example, that could come into any machine architecture. So neural nets-- we think that people are starting to think about ways to get compositionality into neural nets, and I think that's a really interesting discussion we want to have.
Maybe if I do have another minute, I'll just give one example because it's maybe the most basic one. Again, some people here who've been with CBMM for a while have heard me talk about this, but it's-- you might not think this is about compositionality, but it's one place where the very basic physics that Max is talking about meets up with what I'm talking about. And there's definitely real tractable neuroscience to be done in the middle, and that's this issue of objects.
So in cognitive development, people talk about the Spelke object. I don't think Liz Spelke is here, right? Spelke object refers to what I think you could legitimately call the first human concept in the sense that in humans-- in the youngest humans that you can look at, this is the first thing we can identify in the earliest ages-- pretty much as early as you can look-- that really counts as a concept, plays an actually totally key role in human conceptualization and later thinking. And it's a very basic kind of symbol in a compositional language.
Namely, this is the concept of an object. I don't mean like a particular kind of object, like laptop, or table, or chair, like an object category, right? But just an individual object. Like this is one-- I should realize that I shouldn't do that. But this is a thing, and a Spelke object basically is an independently movable chunk of matter.
And it's just an interesting fact of physics at the grain of space and time that we've evolved in-- and probably it's not a coincidence. I'd love you or others here to talk about this, right? I think it's not a coincidence. But at the grain of space and time that we interact with the world and have for basically our whole interesting evolutionary existence, there are these chunks of matter that move together, right? Some of them are inanimate objects. Some of them are bodies, right? And sometimes it's bodies moving and making these chunks of matter.
And it's one of the great discoveries, I think, of cognitive science from Liz and Rene [INAUDIBLE] and others that infants' brains are basically genetically predisposed to see the world in these terms, right? Not as just voxels or pixels or just stuff, but as discrete objects that don't wink in and out of existence, that move coherently through space and time, and that can be moved and made and so on.
So this basic ability to see the world in terms of discrete objects, to track them over time even when you can't see them, to think-- to act on them, that's one of the most basic, symbolic capacities of the human mind. And when we learn to represent scenes and to think about scenes, it's basically compositional models composed out of objects and their parts. Again, that's the same thing Geman was talking about.
So I think to me the question I would most want to understand of all the mysteries that we don't know how it works in the brain, the thing I would most want to understand next is how does the brain represent these discrete objects, right? The many objects that are here in this room-- I can look out and see all of you. I know you're all sitting on chairs. This room is full of objects, and I can start to think and plan about them.
How does that work in the brain? We really don't know, right? But I think if we can learn about that, if we could make progress on that, then we'd start to have some of the building blocks of how the brain represents symbols more generally and composes parts into wholes. And I think the insights there would also scale to these many other sorts of problems, whether it's intuitive physics-- physics engines are again defined on top of objects, discrete objects and their relations-- learning new object concepts out of parts and relations, and also language. OK. I'll stop there.
AUDIENCE: Bravi, bravi. You all stuck to time.
AUDIENCE: Josh, do you really think that it's a mystery how--
TOMASO POGGIO: Well, let's see. Where is a chair to sit in?
AUDIENCE: I'm not going to-- because I have to go do another thing.
MAX TEGMARK: So do you want us all three to stand here or should we sit?
JOSH TENENBAUM: We can move our chairs up into the front.
AUDIENCE: Yeah, yeah. Take your chair.
TOMASO POGGIO: If we find our chair. Here's one.
JOSH TENENBAUM: Thank you.
AUDIENCE: Where are you going to sit?
MAX TEGMARK: I'll just stand next to them.
AUDIENCE: Josh, do you think that it's a mystery how neurons can be compositional? I mean, it seems like Tommy gave a perfectly good story about how neurons could do compositionality.
TOMASO POGGIO: I think we should share this question.
AUDIENCE: I'm just going to start off the question. OK, so I'm going to ask the first question, which is, it seems like there's a perfectly good story there about how a neural network could do at least simple forms of compositionality, but it seems like the challenge there-- and this is what I was getting at when I asked Tommy a question earlier, which is you have to have-- it seems like a necessary ingredient is that the compositionality in the function that you're trying to approximate is isomorphic to the architecture of your network. And for the examples that-- so, for example, convolutional neural networks reify a particular form of compositionality in terms of local--
JOSH TENENBAUM: Like translation invariance, yeah.
AUDIENCE: But when you get to--
TOMASO POGGIO: [INAUDIBLE] translation invariance helps but it's not critical.
AUDIENCE: OK, but what--
TOMASO POGGIO: In other words, you can get-- you can avoid the curse of dimensionality without translation invariance.
AUDIENCE: My point is more general, which is that you need some isomorphism between the function you're trying to estimate and your internal representation by, for example, a neural network. And it seems like the critical challenge in higher level cognition, and maybe even in lower level cognition, is that that structure is constantly changing, right? If you think about, for example, understanding language or objects, there's not some fixed tree structure that we're trying to estimate. We have-- it's constantly changing. And so how do you design a network that's flexible enough to accommodate that structure, the structural diversity?
JOSH TENENBAUM: I think I get what you're saying. So you're trying to put more of a finer point on some of the contrast. Like, whereas if Tommy's saying, OK, let's build a network that just in a single bottom up pass can compute a very useful function. Like, is this particular object category present in the scene or in this window? That's like a circuit you want to build and just wire up, and then it's just ready to run, right?
Whereas the kind of things I'm talking about when I say, OK, I have this scene model, that my working scene model in which there is-- I can look out over there and see people, and I can't see most of the chairs that you're sitting on but I know they're there. And if someone says time to clean up and stack up the chairs, I'll say, OK. Everyone get up and let's stack the chairs in the corner. And I'm working with a set of objects-- some plans on top of those things.
That's a much more flexible working scene representation, and it's not something that's going to be reified into the structure of the network. So the sharper question is, how do neurons represent these very flexible symbolic structures? Or similarly-- or like the parse tree of a sentence. Every sentence has a different structure, and how do you quickly assemble these and use these to compute meaning. It's a good question.
MAX TEGMARK: Yeah. And just then one more comment on the question you asked there about convnets and so on. There, I think, we can see very clearly also how, again, physics takes part of the blame. Because the fact that we have translational symmetry in the world, that the laws of physics look the same if I move sideways, is precisely why you want the convolutional neural net. You want to process the information coming from there in exactly the same way as from there and from there and so on. It very much cuts down on the parameter count.
So we don't know, in turn, why physics has translational symmetry. It's one of those profound things we just notice. Oops. I mean, [INAUDIBLE] she discovered this incredibly beautiful result that translational symmetry is also the reason why momentum is conserved in physics.
So ultimately, that is related to why convnets are useful. It's pretty profound. We don't know why ultimately it all came about, but it's all the same thing.
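The parameter saving from translational symmetry mentioned above can be made concrete with a rough count. The following is a minimal sketch, not anything shown in the session: the image size, kernel size, and function names are hypothetical choices for illustration, and biases are ignored.

```python
def dense_params(height, width, out_units):
    """Weights in a fully connected layer that reads every pixel (biases ignored)."""
    return height * width * out_units


def conv_params(kernel_size, in_channels, out_channels):
    """Weights in a convolutional layer: one small kernel reused at every position."""
    return kernel_size * kernel_size * in_channels * out_channels


# Hypothetical sizes, chosen only for illustration.
H, W = 32, 32
dense = dense_params(H, W, out_units=H * W)  # one output unit per pixel location
conv = conv_params(kernel_size=3, in_channels=1, out_channels=1)

print("dense weights:", dense)
print("conv weights:", conv)
```

Because the same 3x3 kernel is applied at every location, the convolutional count does not grow with the image size, whereas the dense count grows with the square of the number of pixels.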
Second, why is it that a convnet can often use these local kernels? So why does my retina first only combine together nearby pixels? That has to do with locality in physics, right? Again, in other things, first you want to look at the local interaction, but then gradually put a coarser and coarser grain. We, again, don't know why that is, but for some reason the world is that simple. And clearly a lot of our--
AUDIENCE: The world's not that simple. That's what makes it so interesting. That's why we-- imagine watching a person fishing--
MAX TEGMARK: Yeah.
AUDIENCE: --right? Now that fishing line may go really far out into the water and it hooks a fish, and now he's doing something local over here on his fishing rod. And something very far away, far beyond the classical receptive field of any visual neuron, something's happening out in the water, right? And we understand that. We parse that in a way that these two things are deeply connected, right?
MAX TEGMARK: Oh, yeah.
MAX TEGMARK: Let me qualify what I said. I wasn't saying that that's all there is to it. Absolutely not. But the one thing that's always true for the brain, regardless of whether you're fishing, or reading, or surfing, is that it's very good at it as a first step in the retina and in V1 and so on to do these local convolutions.
And the reason it's always useful is because we always live in a world where it has those properties of symmetry and locality. After that, it becomes very context dependent. And sometimes you have these very long range correlations, and we don't understand how our brain even quite handles that. So that's not all there is to it.
TOMASO POGGIO: So I think that physics has quite a bit to do with locality and locality at different levels. We spoke about coarse graining and why locality's preserved, not only at one level of resolution but at different levels. On the other hand, the function that has to be compositional is not only the x, the image that comes from the world, but also the function is the mapping from the image to the output. For instance, a class.
So it depends also on the question you ask. And I think you can make up questions, like, is there a [INAUDIBLE], which are intrinsically local. But you can make up questions like, you know the cover of the Perceptrons book?
TOMASO POGGIO: It's a spiral like this--
JOSH TENENBAUM: Inside-- is it closed or open or--
TOMASO POGGIO: Yeah, the question is, is it closed or open? And it turns out this is there because it's a non-local computation in the framework of perceptrons. It requires an infinite-order perceptron.
And I think this would probably be very difficult to do also for deep networks. It's not compositional, the question you're asking. So it's a bit more complicated, because the physics determines the input but not the output.
One could argue that we tend to ask questions that we can answer. And maybe--
AUDIENCE: We can answer that question.
TOMASO POGGIO: We can answer-- well, we can answer, but not immediately, right? It's not this question that you look at-- none of the [INAUDIBLE] so far has tried to do that.
JOSH TENENBAUM: We have other ways to answer the question. Our brains do.
TOMASO POGGIO: Right, and getting to that point, I think the first thing I said about functions is that ultimately all functions that are computable--
JOSH TENENBAUM: Yeah.
TOMASO POGGIO: --are recursive.
JOSH TENENBAUM: Yeah.
TOMASO POGGIO: And to me recursive is almost the same as compositional. And it's the program.
JOSH TENENBAUM: Yeah.
TOMASO POGGIO: So from that point of view--
JOSH TENENBAUM: Yeah.
TOMASO POGGIO: --you don't need fixed networks. You can deal with the variability--
JOSH TENENBAUM: Oh, that's right. Yeah. But understanding how the brain implements these more variable--
TOMASO POGGIO: Sure, sure.
JOSH TENENBAUM: I think in my urge to be all controversial and everything since I heard this was a debate, I maybe jumped over-- I jumped over a point that actually-- of where the things I'm talking about might connect very much to what you're talking about. And this relates to some issues of compositionality we've talked about informally and said, oh, we should actually write a paper on this, but we haven't yet.
So as you were pointing out in a deep network, like for an ImageNet classifier or something, one of the really interesting things isn't just the convolutional structure which maybe reflects translational invariance as Max talked about, but just the fact that it's a deep function as opposed to a shallow one. And that basically when you train a standard deep network to, say, classify thousands of image categories, most of the-- basically you're solving a whole bunch of different problems, like thousands of different classification problems, in which you share all of the structure except for the last layer. But you share all the convolutional layers and then a few fully connected layers.
So that's a real interesting-- like, why is that a good way to do things?
TOMASO POGGIO: Actually, that's very interesting, and you brought up my--
JOSH TENENBAUM: So I have a hypothesis, actually. This is something I was talking about with a few people at [INAUDIBLE], and it connects to physics. So let me put it out there and see what you guys think about it, all right?
I mean, again, it's not so different than what we've all talked about for a while. But suppose you think that what your network is trying to do in like an analysis by synthesis point of view, or like in some of the models that say [INAUDIBLE] building, is basically it's trying to learn to approximately invert a generative model like a graphics engine.
So then that means-- and again, this is related to the ideas that Dan Yamins has had, and I think a bunch of people like Ankit Patel. Surya is thinking about some version of this. Just want to mention everybody's name in case this goes well. Probably a bunch of other people too. Geoff Hinton has put out some version of it for a long time. OK.
But the idea that, OK, vision is inverting a graphics engine or something, so then the-- that means that if I'm going to try to recognize this as a chair or table or any other thousands of object categories, I have to do a whole bunch of things to go from pixels of images to the object category label. And-- or think about the generative model, like there is-- I choose an object category. Then I choose a particular instance of that category.
It has a certain shape, and then I put it somewhere in 3-D space with some orientation and turn on some lights. I'm making a graphics model. Make some lights, put a camera somewhere, maybe put some stuff in the background and render an image.
And if vision is in some sense inverting that process, it's got to invert all the physics of image formation-- how light bounces off objects, surface of objects into your eyes, the properties of shape. And the interesting thing is that for all different object categories, that's-- physics is almost completely shared except for the first stages of the generative model and the last stage of the recognition model, right? The way light bounces off surfaces, both the geometry of image formation but also the physics of light and surfaces, it's all the same for all objects.
So it makes sense that if you're building a model to invert that process that it could reuse a lot of structure all the way up to the very last part where, OK, there are some differences between chairs and tables. But in the grand scheme of things, they are so much more similar; the generative processes that make a chair image and a table image share so much structure, right?
MAX TEGMARK: Yeah. So with the risk of disappointing you both since you were trying to provoke a debate here, let me say that I completely agree with what you said. And I think it's actually very profound. In fact, Henry and I also proved in that paper that if you have any image that's generated by a multi-stage generative process like the one you nicely described there, we showed using information theory that there's an optimal way of undoing it, which is one step at a time.
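The idea of undoing a multi-stage generative process one step at a time can be illustrated with a toy. This is only a sketch under strong assumptions: the three stages below are hypothetical, deterministic, and exactly invertible, which real image formation is not, and the information-theoretic result mentioned above is not reproduced here.

```python
# Each stage is a (forward, inverse) pair. The stage names are hypothetical
# stand-ins for steps such as "choose shape", "place in space", "render".
STAGES = [
    (lambda v: v * 2.0, lambda v: v / 2.0),  # stage 1: scale
    (lambda v: v + 3.0, lambda v: v - 3.0),  # stage 2: shift
    (lambda v: v * 4.0, lambda v: v / 4.0),  # stage 3: scale again
]


def generate(latent):
    """Run the generative stages in order, from latent cause to observation."""
    value = latent
    for forward, _ in STAGES:
        value = forward(value)
    return value


def invert(observation):
    """Undo the stages one step at a time, last stage first."""
    value = observation
    for _, inverse in reversed(STAGES):
        value = inverse(value)
    return value


observation = generate(1.5)
recovered = invert(observation)
print(observation, recovered)
```

The point of the toy is only structural: the recognition direction mirrors the generative direction stage by stage, so intermediate results from the early inversion steps can be reused regardless of which latent cause is finally read out.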
JOSH TENENBAUM: Right. I remember that. Yeah.
MAX TEGMARK: Yeah.
JOSH TENENBAUM: Exactly. You guys showed that too.
MAX TEGMARK: And that fits so perfectly with what you said, because in that case it seems smart for the brain to tap into that rather than just trying to go all at once from what's on my retina to, say, oh, here's a really cute dog. Do it one step at a time. And here is an object, then here's a texture, and this and that.
JOSH TENENBAUM: Yeah.
MAX TEGMARK: Because then you do that once and for all, and then to figure out whether it's a dog or cat or all those other things you can share. As you said, that the--
JOSH TENENBAUM: And the conjecture is-- I'd love to actually take this a few steps forward in concrete reality, right? Like, conjecture would be that we could look over all the different networks that people have proposed or evolved to do this object recognition problem, and see if we could make really precise or somewhat precise correspondences between the motifs that seem to work well at different levels of these compositional hierarchies, and actually trying to do what you suggested in your paper, which is a kind of step by step, working backwards through the generative process. I think that that might be profitable.
TOMASO POGGIO: Yeah. I think one way this works, helps a lot, is a network like some of the ones we have discussed, but anything which is [INAUDIBLE]. Is a function of K, and then up here is function of H, right? So the overall thing is a composition of K and H.
Then it turns out that the generalization error, the error we are making on new data, so the prediction, the difference between the expected and empirical as a [INAUDIBLE]. And there is some measure of complexity of the function H. There's some measure of complexity of function K. But K is shared among different tasks. For instance, in ImageNet, K could be shared across all the thousand classes.
JOSH TENENBAUM: Like the inverse rendering problem as opposed to the object.
TOMASO POGGIO: That would correspond to, perhaps, rendering. So you mention that you are classifying simultaneously 1,000 classes. So what you gain [INAUDIBLE] is a factor square root of 1,000, so about 30 in that part of-- which can be very large. And this comes simply by the fact there is sharing between classes. So you don't get if you [INAUDIBLE] 1,000 on one.
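The structure being described, a shared function K followed by a task-specific function H, can be sketched in a few lines. This is an illustrative toy only: the feature extractor, the class names, and the weights are hypothetical, and the generalization bound itself is not reproduced. The last line simply evaluates the square root of 1,000 quoted above.

```python
import math


def trunk(x):
    """Shared stage K: a hypothetical fixed feature extractor used by every task."""
    return [x[0] + x[1], x[0] * x[1]]


def make_head(weights):
    """Task-specific stage H: a linear readout on top of the shared features."""
    def head(features):
        return sum(w * f for w, f in zip(weights, features))
    return head


# Two hypothetical classes; an ImageNet-style setup would have 1,000 heads.
heads = {"chair": make_head([1.0, 0.0]), "table": make_head([0.0, 1.0])}


def classify_scores(x):
    """Compute K once, then apply every head H to the same features."""
    features = trunk(x)
    return {name: head(features) for name, head in heads.items()}


scores = classify_scores([2.0, 3.0])
gain = math.sqrt(1000)  # the "about 30" factor mentioned in the discussion
print(scores, round(gain, 1))
```

Only the heads differ between classes, so whatever is learned about K from one class is available to all the others.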
JOSH TENENBAUM: OK. So you have that result now. Yeah that's right. Great. OK.
AUDIENCE: Maybe I can inject a little bit more controversy in here by disagreeing with all of you. I mean, I get how the laws of physics are conserved across all scenes, right? But it doesn't seem like they're conserved in a way that would facilitate the kind of sharing that you're talking about. Because, let's say that there's-- conservation of mass is preserved across all scenes, but the way in which that could manifest visually could be dramatically different in different scenes, right? But you have lots of nice examples.
JOSH TENENBAUM: But that's not what I'm talking about, though. I'm talking about specifically the aspect of physics of how light bounces off surfaces into the eye, and that's the rendering problem base. You have to invert that.
Again, if you think that object categories are based on that form follows function. There's 3-D shapes, and what you really need to get somehow either implicitly or explicitly you have to get back to 3-D from 2D, then it's that process which is shared across all objects and all scenes.
AUDIENCE: Oh, so you're saying that all of those layers up until the last one--
JOSH TENENBAUM: Yeah.
AUDIENCE: --is just the rendering-- not anything about the physical structure of the world.
JOSH TENENBAUM: And again, this is compatible with some of Dan Yamins's and Jim [INAUDIBLE] stuff where they show that from these higher levels you can decode properties of the geometry of the scene, right? I think--
AUDIENCE: But then it doesn't explain the mystery of why you could do so well with single layer--
JOSH TENENBAUM: My hypothesis-- another version of this hypothesis that it's not all layers up to the end, but most of them. And then at the end, there's maybe a little bit that tries to capture some of the shared structure of hierarchies, of object categories, right? So in ImageNet there's like 150 kinds of dogs or something. And so there's going to be some shared features that are going to be useful for all of the different dogs, and some-- also telling the differences between them.
So probably a lot of what's going on in the higher layers of connections is just trying to exploit that shared structure, and also tease out what the differences between those 100 different kinds of dogs are.
MAX TEGMARK: To just try to inject a bit more discord, at least emphasise our ignorance-- so if we take that as a given then that step one is to take the raw visual input, try and reconstruct surfaces, textures, motions, et cetera, so we know. OK, we go [INAUDIBLE] et cetera. Then we get the ventral stream, and we can go all the way into the fusiform face area, and say, oh, there's Nancy. And then we combine the dorsal stream and things about objects.
JOSH TENENBAUM: [INAUDIBLE]
AUDIENCE: That's what it's for.
MAX TEGMARK: And try to say things about where the objects are and what they do. But it seems like at that point, our story sort of fizzles out and we really have no clue.
JOSH TENENBAUM: OK, good. Now we can go back to what I was talking about.
MAX TEGMARK: So what happens after that-- there is like-- in these Larson cartoons of physics derivations, is step five-- then, the miracle happens. So you love to think about--
JOSH TENENBAUM: Miracle.
MAX TEGMARK: [INAUDIBLE] physics or real world physics, how the brain actually represents a world model. What do you think happens after that?
JOSH TENENBAUM: Well, that's what I'm talking about is the challenge. I think that-- I mean, I don't think it's just in the ventral stream, right? When we study physics-- like the work that Jason Fischer did with Nancy and a little bit with me-- trying to find the brain's physics engine. And there we're talking about some parietal and premotor areas, right?
And I think there-- those are also action planning areas. So there's compositionality-- the kind of compositionality that I'm talking about when we construct scene representations to think about their physical properties and understand how to act on them, there's also the compositionality of action plans, goals and sub goals, how I can manipulate tools.
I think if we want to understand where we go next, we might want to be also-- there are other parts of the brain where there are basic kinds of compositionality that we could start to study, right? Or if we talk about where Spelke objects are, again, it's going to be probably more like dorsal stream stuff that's going to be important.
So though we can point to other parts of the brain where we see some of the building blocks of, say, compositional physical scene representations, I do think we don't know. I think we're going to need-- like what Sam was talking about and what I was trying to talk about. I think we need ideas about flexible, symbolic structures, how they could be represented in neural systems. It's much harder to study. Yeah?
AUDIENCE: So you all talked about these convolutional deep networks, but there are other types of neural network models. Can you maybe just talk about the kinds of composition principles that, like, recurrent networks or some of these newer networks that have memory or at least neural Turing machine networks. Can you maybe talk a little bit about the kinds of compositionality that those kinds of networks may represent, even at the level of the way they're wired, or the way-- the kinds of things that they can represent.
TOMASO POGGIO: So I would say a fully connected network is-- does not completely suffer from the curse of dimensionality, because it's still much less. For example, keep the number of units constant across two [INAUDIBLE], than what a real function approximation of a non-compositional function needs. So you may get some gain, but one of the predictions is that you'd expect great performance by hierarchical local networks, like convolutional networks, and not so impressive from dense networks, densely connected.
On the other hand, recurrent networks are really compositional in the sense that they iterate the same function n times. And depending on the flexibility of f, they're really a Turing machine if you think about it. It's back to what I was saying. The f could be a relatively simple function, gets an input x, gives an output y-- can be multivariable, does not matter, plus some parameters, a state. You have a finite state machine.
And the individual f labeled by n could be different from each other, because of parameters. It's like the write and read tape of a Turing machine. So there you do something potentially complex using simple steps, which is what a computer is.
MAX TEGMARK: Yeah, so--
TOMASO POGGIO: So recurrent networks, in principle-- if you run them for a finite time, they are finite state automata; if in principle you run them for infinite time, they are universal Turing machines.
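The picture of a recurrent computation as the same simple function f applied repeatedly to a state and an input can be sketched with a two-state machine. This is a minimal illustration, not a neural network: the step function and the parity task are hypothetical choices, picked because parity needs only one bit of state.

```python
def step(state, symbol):
    """The shared function f: flip the one-bit state whenever the input bit is 1."""
    return state ^ symbol


def run(bits, state=0):
    """Apply the same step function once per input symbol, carrying the state along."""
    for bit in bits:
        state = step(state, bit)
    return state


parity = run([1, 0, 1, 1])
print(parity)
```

The loop reuses one small function for inputs of any length, which is the sense in which something potentially complex is done using simple steps; a finite number of states makes this a finite state machine rather than a Turing machine.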
MAX TEGMARK: Yeah, I agree. And I think we can see an evolutionary reason for why the brain would prefer to be recurrent, because it's cheaper that way. If you have a computation that you force yourself to do as a feedforward network, suppose it involves 1,000 multiplications and you have this little multiplication module. Well, you need 1,000 of them. Whereas it's obviously smarter to just have one multiplication module and use it many times. You don't have to build so much.
If I'm looking at all of you and I'm identifying exactly who's here, it will be very annoying if I had to have 62 fusiform face areas. But thanks to the fact that evolution gave me the ability to have a recurrent neural network, I only need really one. Or I guess one in each hemisphere. And I can just reuse it. [TICKING NOISE]
AUDIENCE: Tom, do you think that the-- even though it's true that recurrent neural networks can implement something like a Turing machine, or approximate it-- is that the right model of computation for connecting it to the kinds of ideas that Josh was talking about? Because when Josh talks about the probabilistic language of thought, probabilistic programs, the way that that's actually implemented are variants of Lisp and Scheme and so on.
JOSH TENENBAUM: Like what we call source code.
AUDIENCE: And it's these kinds of order calculations are very relevant there. Because how long would the tape of the Turing machine have to be to implement a probabilistic program of the sort that Josh would use to do even a very simple cognitive computation? It would have to be an extremely complex Turing machine. And so I guess the question arises, is that the right direction to build little recurrent neural networks that try to approximate these things as Turing machines?
TOMASO POGGIO: No. I mean, this-- I was speaking from the point of view of theoretical equivalence. I think you never program directly a Turing machine with read and write commands. You are--
AUDIENCE: But isn't that what they're doing, basically? That's what the--
TOMASO POGGIO: Well, maybe that is what evolution did. So it may have created it, like we did, and now it starts to create more powerful symbolic languages, so that you don't have to program in machine language, you don't have to write Turing instructions, so to speak.
AUDIENCE: Right. So then the question arises, what are the high level programming languages that are-- because right now, a Nature paper was just published a few months ago that was essentially a fancy Turing machine. But there's no high level-- there's-- the whole-- they're actually very proud of banishing the sort of high level psychological homunculus from their designs. It can just--
JOSH TENENBAUM: You know, they could learn algorithms. I think that you're talking about the DeepMind differentiable neural computer. I mean, it's very cool because it does really learn algorithms, I think. They're just, by the standards of any algorithm we would write in computer science, really lousy algorithms. Like they can find the shortest path in graphs, but only for if it's like up to length four for a relatively small graph, about some percentage of about half the time, or something. I mean, I'm not trying to diminish. I'm saying, it's just by what we normally mean by an algorithm in computer science, that's a lousy algorithm. It's really impressive that you can get a system that can learn to do that purely from experience. And it also requires many tens of thousands of examples.
I think that's again, that's reflective of the fact that in some sense, they're doing neural programming, but it's in a very low level language. And the kinds of compositionality you have in high level programming languages, or things like natural language, is just a much richer, more powerful kind of compositionality, I think. So it's still a mystery how you could either get that in a trainable neural network system or how that works in the brain.
Another thing I wanted to just say, to go back to maybe Max's question, I'm not an expert on neural Turing machines-- that Max-- neural Turing machines or differentiable neural computers. But I think it is really interesting to look just very pragmatically at the recent history of recurrent neural networks, because what you've seen is people actually putting in more kinds of compositionality that are traditional ones from computer science, not having them emerge but actually putting them in and getting power out of them. So the neural Turing machine or the differentiable neural computer are examples of that.
Like again, other people here probably are much more expert than I am. But the neural Turing machine making explicit the distinction between the tape and the processor, basically, like separating out a read/write accessible memory. In earlier kinds of recurrent neural networks, people made a big deal of the fact that you didn't have that kind of separation. And they even said the brain doesn't have that separation. And then people realized, actually if we put that separation in, again a kind of composition, then it's a more powerful machine. And then I guess again-- again, someone else can probably clarify this better than I can-- but the difference between the DNC and the NTM was that they also-- there are a number of differences, but one of them was these memory acts, this sort of trace of memory accesses. Do you guys know what I'm talking about? What's that called? The thing that keeps track of which memory location you accessed when and in what order. It's basically a symbolic data structure that allows it to get things like a stack or queue or something. That's really important to learning the algorithms that it has, and it makes it much more powerful than the neural Turing machine.
But again, it's basically-- it's not a distributed-- it's not like the traditional thing we thought we were looking for in neural network. It's not a distributed vector space representation. It's basically a symbolic trace of computation.
Another example of putting stuff into a neural network compositional structure, which is very powerful, has to do with recent neural networks trying to build neural trainable physics engines. So Pete Battaglia, who did the work in our group, along with Jess Hamrick, with this physics engine model of intuitive physics, like actually using an explicit symbolic physics engine, he's been at DeepMind for a couple of years. And a project that he's been working on that he just talked about at this last NIPS, basically tries to build a trainable neural physics engine, but does it not by learning it-- it learns things, like for example, balls bouncing in a box or it can model planets orbiting around each other or even a string draping over something, so some pretty interesting physical scenarios.
But it does it not by learning a distributed vector space representation of these states, which people have tried to do before. Like Hinton had tried to do this. A number of people tried to do this. And it was really basically a failure. It's very hard to get a recurrent neural net that doesn't explicitly decompose into objects to learn to model multi-body systems. But Pete showed that by-- he put that in. He put in objects and relations symbolically, but then used a distributed neural net representation to learn the dynamics, like how forces worked and how they interacted. So that was much more powerful. Michael Chang at MIT, an undergrad, has built a similar kind of thing and also showed that.
So I think of that as a nice triumph of compositionality and engineering. But it wasn't symbols and compositional structure emerging somehow from training a neural network. They put it in. And as a result, it did much, much, much better. So that doesn't help to solve the problem of how it works in the brain. It just motivates us more to look for how is it somehow put into the brain. Where is that discrete symbolic object structure in the brain?
MAX TEGMARK: Well, when you say putting things in, there are two ways in which things get put into our brain. One is you're a baby, you have some basic structure put in according to the blueprints in your DNA, and then it gets-- the synapses adjust based on training data. But evolution has, of course, also been an iterative process, which has put things in. It just happened on a much slower time scale. So if it turns out that what you're saying is correct, then you can get much better efficiency by reprogramming modular--
JOSH TENENBAUM: Object-oriented representation, basically.
MAX TEGMARK: Then I don't think that should count as cheating.
JOSH TENENBAUM: It's not cheating. It's not cheating. I'm just saying it-- it's not cheating. It's just it further motivates us, I think, to ask, OK, well, it's proven its value on the engineering side now, and more reason to say, OK, well, how might it work in real neural networks?
TOMASO POGGIO: I wanted to ask a question, it's a little bit philosophical, maybe. Why compositionality? For instance, if we for a moment, let's go back to the visual system. And suppose you are looking at this locality in iterated visual hierarchies. You may argue, this is evolution discovering these, because this is what physics imposes as a constraint. But you could also argue, this is because in the brain, neurons cannot make very long range connections. It's much better in terms of wiring and development to have local connections. And so it's very natural to have local receptive fields. But of course, you have to look at the whole image.
So if you go this way, you say this compositionality, at least that form of, comes from constraints of neurons, ultimately. And then if you go on, maybe question we ask, things we worry about, everything is because our brain is compositional. It's not because the world is.
MAX TEGMARK: I think that's a great question, Tommy. So you're asking, is it because the world has a structure or because our brain is made of neurons? I would say both, in the sense that evolution gave us brains made out of neurons, in particular, because they are local in the way that could be matched to the physics. That's why our evolution did not give us brains made out of some completely other kind or--
TOMASO POGGIO: But it could have.
AUDIENCE: Max, if there was no wiring cost, would you have discovered different laws of physics?
TOMASO POGGIO: Yeah. It's this kind of--
JOSH TENENBAUM: I don't know if-- so I don't know if, Yarden, you want to weigh in on this at all. So for example, Yarden Katz over here is, some of you might know, he long ago was a grad student in our group, where he worked on some symbolic compositional models of high level concepts and intuitive theories. And then a long path took him through molecular biology, and now he's working on yeast. And I won't steal the thunder of your research program. But he's kind of doing, I think of it as like cognitive science of yeast, trying to understand how yeast can compute some of the basic kinds of things, like perceive the world, learn, make decisions. Now they don't have a nervous system, but they do computation. And they're not the only one, obviously.
Think about this. Like, single cells do computation. They don't have-- this is like what goes on, what's the computation that goes on inside a single cell. It's a different physical machine than neurons and their connections. But I mean, to some extent, it's maybe less compositional, because there's a soup of chemicals. On the other hand, there's lots of very discrete parts. And I don't know if-- I guess what I just want to say is it seems like biology has figured out different ways to do computation. It's amazing how much even a one-celled organism can do. And if we're going to try to make grand theories of how physics constrains compositionality in biology, we might also want to think not just about brains in cells, but the kinds of computations and possibly the kinds of compositionality that might go on inside single cells. Do you want to comment on that?
AUDIENCE: I actually had a different comment. But I'll comment on this. Yeah, I agree. I think that intracellularly there will be a lot of interesting computations that have state and memory, using things that look a lot more like symbolic state, like for example, chromatin states or phosphorylation sites on a protein, which are kind of like binary states. And so these are very different motifs of computation from synaptic strengthening, which people have been fixating on in--
JOSH TENENBAUM: Would you think of it as compositional, or would that not be right?
AUDIENCE: I think that's a separate axis, but I think that representing, having binary states that you can with perfect fidelity control is an interesting motif for a computation that you can't always get so easily with synaptic strengthening.
MAX TEGMARK: I would love to read a book entitled, The World According to Yeast, to understand what aspects of the world they pay, the yeast pays attention to and cares about. And it may very well be that because the computational demands of a yeast cell, the kind of problems they're trying to answer are quite different, that it's optimal for them to not have the spaghetti-like neurons, but to have some sort of computation like they have, where it's still the recurrent network, but if you have a cell, by expressing some gene, you can communicate with everybody else, not just with some. I think it's a really fun, very fundamental question, how a particular class of physics problems dictates a best architecture. I have never had any interesting conversations over yeast cells. We should go for a beer tonight.
But I imagine they have a very different world model than we do, where it's a lot more local, a lot more stochastic and random, and maybe not so much long-term planning. So you could imagine that that will dictate a different kind of optimal--
JOSH TENENBAUM: They actually have to plan much more long term than we do, as I learned from Yarden. It's a different topic, but--
AUDIENCE: There's actually an interesting tradition in neurobiology, with people like Dan [INAUDIBLE] in the '70s, of linking bacterial chemotaxis with computation in neurons. But he was looking at computation intracellularly. And so that was before, I think, we got so caught up with synaptic plasticity. So people appreciated the connection between intracellular somatic computation in a single-celled organism, like a bacterium, and somatic computation in neurons. So I think there are some deep connections there.
JOSH TENENBAUM: Did you say you had a different question you wanted to ask us?
AUDIENCE: Oh, yeah. I had a different comment. I think that maybe going back to the paper you mentioned that you had with Sam, the BBS paper, it seems like compositionality is not really the key feature, but maybe something like the interaction between compositionality and some form of abstraction. So in your paper, you guys discuss cases, sorts of compositional problems where neural networks fail. Like if you think of one of these Atari games, where you have to plan ahead of time, and you slightly change the objective function, like instead of getting the maximum number of points, you want to lose as quickly as possible or get to the next level, but just barely, then those are compositional problems where deep learning kind of fails. And I think that in those problems, you have an interaction between compositionality and some form of abstraction, like you need a notion of agents and objects and agents that have beliefs and desires and plans and all that. So maybe that's the key thing, it's this interaction between those two things, rather than just compositionality.
AUDIENCE: Yeah. I mean, I think that there's a common response to these kinds of arguments, which is that you build an expressive enough neural network and it will learn all these things about agents and objects and so on. But it's all going to be implicit in a distributed representation. And it's going to learn exactly what it needs to perform the task that you were training it on. And that's a pretty sensible and compelling argument, I think. But the problem is that it runs into challenges when you want a network to generalize in non-trivial ways. And that was sort of the point of that section, which is that it's not enough to simply represent something implicitly in a distributed representation, that the whole reason that you need modularity, and hence, some form of compositionality is that it needs to be computationally accessible. The information needs to be able to be extracted and used for a different task. And that's non-trivial in a distributed representation. There's some kind of fundamental trade-off here, that I think kind of gets to the equations that Tommy was writing there, which is that you didn't talk much about M. M is the number of derivatives. So that's something about the smoothness of the function. And when you use distributed representations, that makes it easier to capture maybe complex but smooth functions. But the trade-off is that the more distributed you get, maybe you can capture more of the smooth functions, but then you lose some of the modularity. I don't know if that resonates with how you think about these things.
TOMASO POGGIO: Yeah. I mean, you may have compositional functions. The overall function has a smoothness in Sobolev space that is the minimum of the smoothness of the constituent functions. So you may win big if you approximate separately the individual functions instead of the whole function. Because the individual function of part one may have much higher smoothness than the rest and much [INAUDIBLE].
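As a rough numerical illustration of the point above, the following sketch compares how many units it might take to approximate a compositional function as a whole versus node by node. It assumes the scaling rates usually quoted for this setting (on the order of eps^(-n/m) for a generic function of n variables with m derivatives, and a sum of eps^(-2/m_i) terms over the two-input nodes of a binary tree). Constants are ignored, and the function names and smoothness values are hypothetical, chosen only for illustration.

```python
# Illustrative only: constants are dropped and the rates are assumed, not derived here.

def whole_function_units(eps, n, smoothness_per_node):
    """Units to approximate the whole n-variable function directly.

    The overall smoothness is taken to be the minimum over the constituents.
    """
    m = min(smoothness_per_node)
    return eps ** (-n / m)


def node_by_node_units(eps, smoothness_per_node):
    """Units to approximate each two-input constituent function separately."""
    return sum(eps ** (-2 / m_i) for m_i in smoothness_per_node)


# Hypothetical binary tree over n = 8 inputs, which has 7 two-input nodes.
# One node is rough (m = 2); the others are much smoother (m = 8).
node_smoothness = [2, 8, 8, 8, 8, 8, 8]
eps = 0.1

whole = whole_function_units(eps, n=8, smoothness_per_node=node_smoothness)
parts = node_by_node_units(eps, node_smoothness)
print(f"whole function: ~{whole:.3g} units")
print(f"node by node:   ~{parts:.3g} units")
```

Under these assumptions the single rough node sets the cost of the whole-function approximation, while in the node-by-node count it only affects its own term.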
JOSH TENENBAUM: I think, yeah-- I don't think I have the right vocabulary to really answer your question. Because we have these words like compositionality and abstraction that are just vague, but really important notions. Another one I want to use is symbolic. And somehow I think I want to try to say that you can have compositional systems that are circuit-like, like we can build Boolean circuits, or you can build the kind of circuits that maybe Tommy was showing up there that are compositional, compositions of functions. You can also have compositionality in symbolic systems. And there is something about compositionality in symbolic systems which enables a kind of abstraction that is central to human thought that we were trying to get at in our article. So yeah, I think we need better ways of understanding what all those concepts actually mean formally.
AUDIENCE: Is there a question out there?
AUDIENCE: I wanted to bring up a question that was related to a couple of ideas that came up earlier. One is this idea of inverting a generative process by sharing the first several layers, the first several steps of this hierarchical process. And the other was this idea of contrasting a high level programming language versus something like a machine language, and ask about-- so far it sounds like we've been talking largely in terms of two primary languages. One is this very, very low level neural network-based representation, which might be somewhat analogous to a machine language, if we're going to hold to the computer metaphor. And the other is this higher level, more flexible language of thought idea that we were talking about, in terms of these certain probabilistic programming languages.
But-- and again, this is not really an area of expertise for me, so this interpretation may be totally wrong, or the metaphor may not quite map. But it seems like in the history of development of computer science, it's not just the idea that you can have this isomorphism between the physical implementation of a computer chip and the symbolic representation of machine code of those chips as ones and zeros that's important, but also this idea of bootstrapping ever increasingly complex programming languages on top of those programming languages, so that it's not that you go straight from implementing machine code to implementing something like a modern probabilistic programming language, but instead, you actually need to bootstrap just a slightly more complex language on top of that, something like BASIC, and then only using something like BASIC can you then bootstrap a more complex language, like C, to then bootstrap something like maybe a probabilistic programming language.
So do you think that there's something in the nature of-- how many, at what level of programming language complexity could you be able to see the influence of the constraints from physical systems, like the ones that Max was describing, where you can say, ah, well this language is now complex enough that I can actually represent the idea of locality, in the sense that we refer to in the physical domain, or I forget some of the other points that we were talking about. But to what extent do you feel like you would need something in between these two languages that you're talking about to represent that?
JOSH TENENBAUM: Yeah, that was like an uber question, I think. No, it's good. It packed a lot of things in there. So I'll just try to address what I think was the thing you were getting at at the end, or one version of that.
So right, something that I learned from Vikash Mansinghka, in interacting with him, was that there are lots of different kinds of programming languages for all sorts of different reasons. And there's the ones that we engage with as humans, like source code when we're writing code, and then there's the ones that are the closest to the machine circuits that are being implemented. And then there's lots of things in between. And in the design of modern compilers, again, I don't know anything about this but I saw one or two lectures in the context of probabilistic programming meetings by compiler people. And they would say, well, OK, a modern compiler might go through 10 or 20 transformations from the level of source code all the way down to the level of the finite state machine that actually runs on the computer. And I guess you call all of those intermediate languages, and they have value. So at each step along the way, you're making some different optimization or trade-off in space resources or time complexity or efficiency for this kind of part of the computation or that one, or maybe it's about power efficiency or whatever.
And yeah, I think if we're reverse engineering the brain, it's likely that we're going to find that there are more than two different levels of compositional languages, in some sense, that there's a compositionality to neural circuits that, like Max and Tommy were talking about, there's a compositionality to the cognitive level that it's easiest for us to think about in language and write down in a symbolic language like LISP, and there are probably going to be others, too, multiple levels between them that we would like to understand.
There was a panel, a similar sort of panel at Woods Hole this summer. It was kind of a joint panel between our summer school-- and it's actually really more the other summer school, the Methods in Computational Neuroscience one. But it was about do we need new kinds of math to understand the brain and the mind? I don't remember who was on that panel, if you guys were on it or not. Surya, I remember Surya Ganguli and I were on that panel. And I made a remark that I think we need to understand better in our field compilers and how compilers work and what are basically programs that can automatically transform from one language to another to optimize different kinds of trade-offs in resources and algorithmic and other kinds of efficiency. He thought maybe I was making a joke, or he took it as a joke, he turned it into a joke, that like, OK, next year we're going to come back and Computational Neuroscience is going to be all about compilers.
But I honestly think it's not just a joke. I'm serious. I think if we understood better, as a field, what kinds of things are going on in programming languages and compilers, we'd have more ideas about how to link up the high level symbolic language of the mind to the lower level language of neurons.
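To make the idea of a chain of intermediate languages concrete, here is a minimal sketch of a toy compiler for arithmetic expressions. It is only an illustration under stated assumptions, not a description of how any real compiler (or brain) works: the pass names `parse`, `fold_constants`, `lower`, and `run` are hypothetical, and the "machine" is a small stack machine invented for this example. Each pass rewrites the program into a lower-level form, and one pass makes an optimization along the way.

```python
import ast
import operator

# Only + and * are supported in this toy language.
OPS = {ast.Add: ("ADD", operator.add), ast.Mult: ("MUL", operator.mul)}


def parse(source):
    """Source text -> syntax tree (the first intermediate representation)."""
    return ast.parse(source, mode="eval").body


def fold_constants(node):
    """Tree -> smaller tree: pre-compute sub-expressions made only of constants."""
    if isinstance(node, ast.BinOp):
        left, right = fold_constants(node.left), fold_constants(node.right)
        if isinstance(left, ast.Constant) and isinstance(right, ast.Constant):
            _, fn = OPS[type(node.op)]
            return ast.Constant(value=fn(left.value, right.value))
        return ast.BinOp(left=left, op=node.op, right=right)
    return node


def lower(node):
    """Tree -> flat list of stack-machine instructions."""
    if isinstance(node, ast.Constant):
        return [("PUSH", node.value)]
    if isinstance(node, ast.Name):
        return [("LOAD", node.id)]
    if isinstance(node, ast.BinOp):
        name, _ = OPS[type(node.op)]
        return lower(node.left) + lower(node.right) + [(name, None)]
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def run(program, env):
    """Execute the instruction list on a simple stack machine."""
    stack = []
    binary = {name: fn for name, fn in OPS.values()}
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(env[arg])
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(binary[op](left, right))
    return stack.pop()


tree = parse("(2 + 3) * x + 4 * 5")
program = lower(fold_constants(tree))
result = run(program, {"x": 10})
print(program)
print(result)
```

The same computation exists here as text, as a tree, as a simplified tree, and as a flat instruction list. Someone who only observed the instruction list would see `PUSH 5` and never the original `2 + 3`, which is one way to picture why identifying "which transformation you are measuring" is hard.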
AUDIENCE: How do we know when we've found something? Like, I'm a neuroscientist recording brain activity. How do I know-- if I'm measuring a compiler that's doing 20 transformations, how would I know that what I'm looking at is now--
JOSH TENENBAUM: We should ask [INAUDIBLE] and Eric Jonas.
AUDIENCE: Yeah. I'm measuring transformation 17 right now.
MAX TEGMARK: Well, if you look at the premotor cortex and the motor cortex as one, it feels like there's some compilation-style thing going on. If I decided I'm going to pet this very cute dog, it was a very high level command. And then somehow that very rapidly got translated into also the very detailed fine motor movements in my fingers and so on. And I have no idea how it happened.
JOSH TENENBAUM: Or when you practice a skill. When you decide to learn to play tennis or something, you start off by making very conscious high level plans about how you're going to serve or whatever, and then it just gets compiled up.
AUDIENCE: So how do you do that in neurons? I mean, I could see in motor, there's a motor hierarchy, for example. But that's just one example. How would you know if you're measuring-- how do you map a neural measurement to a representation in a probabilistic programming language in some systematic way?
JOSH TENENBAUM: Well, how would you do that in a computer? I mean, just to take an example where at least it's in principle understandable with current technology. Yeah. Again, because I don't really understand enough about compilers, I don't know how to answer that question in a meaningful way. I think if we studied compilers, we'd have at least one set of hypotheses.
Anybody have any ideas on that? If Vikash was here, we could ask him.
AUDIENCE: [INAUDIBLE] generating their reaction to generate [INAUDIBLE], i.e. to generating an output. Like, what is the output of your program and what is the motor output or whatever output [INAUDIBLE]?
JOSH TENENBAUM: It's easier to study in the motor system, though, where the representations are closer to output ones, I think it might be easier. Let's take goals and sub goals. So one basic way that people have long, going back to Ulric Neisser and, well, well before, thought about action as being programmed. It's like, OK, you have a high level goal and then you put together sub goals and maybe they have sub goals. And then at some point, it grounds out at some just low level controller that your brain and body implement fairly automatically. So that seems like the kind of thing that we could start to understand how a high level goal is compiled down into a plan that you can actually implement.
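The goal and sub-goal picture described above can be sketched as a recursive expansion that stops at primitive actions. This is a minimal illustration only: the plan library, the goal names, and the function `compile_goal` are all hypothetical, and nothing here is a claim about how the motor system actually represents plans.

```python
# Hypothetical plan library: each goal maps to an ordered list of sub-goals.
# Anything that is not a key is treated as a primitive the low-level controller can run.
PLAN_LIBRARY = {
    "serve_ball": ["toss_ball", "swing_racket"],
    "toss_ball": ["raise_arm", "release_ball"],
    "swing_racket": ["rotate_shoulder", "extend_elbow", "snap_wrist"],
}


def compile_goal(goal, library=PLAN_LIBRARY):
    """Recursively expand a goal until only primitive actions remain."""
    if goal not in library:
        return [goal]  # grounds out at a low-level controller
    actions = []
    for sub_goal in library[goal]:
        actions.extend(compile_goal(sub_goal, library))
    return actions


plan = compile_goal("serve_ball")
print(plan)
```

The flat list that comes out no longer shows the intermediate sub-goals, which is the sense in which a high level goal gets compiled down into something directly executable.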
And again, you're saying, well, how would you know? Well, you'd see probably some cell assembly.
AUDIENCE: Yeah. But how do you see cells? You have to have some notion of what--
JOSH TENENBAUM: So if we could--
AUDIENCE: Even to find a receptive field, you need to have some stimulus space to find. Is our stimulus space-- what is our stimulus space?
JOSH TENENBAUM: Well, again, it's not-- here I'm suggesting looking at the actions. I mean, it's one of the reasons Brendan Lake and I started studying these handwritten characters. Because I think there, both in perception and production, there's a very basic kind of compositionality. The strokes, basically the strokes and the smaller gestures of smooth motor gestures that make up these strokes. So I think we can see that perceptually. We've found various ways to study it behaviorally and computationally. But we'd like to go right now-- Brendan started doing some fMRI on premotor representations of handwritten characters. I think probably you could study these things in monkeys, as well. I think [INAUDIBLE], one of our CBMM partners, has suggested some such ideas.
So I think, again, it's not the most satisfying answer, but I'm saying, here's a place we could go and look and start to see these kinds of very basic sorts of compositionality.
MAX TEGMARK: There's another interesting aspect of compiling that, I think, could be worth studying, once we figure out how to study it. You mentioned tennis, for example. When you're a beginner at anything, you do it very consciously. You're really focusing on it, you're paying attention. When you practice for a long time, it feels like you've compiled it down into unconscious modules. You can just do it. Maybe it's somewhere deep in your cerebellum how you serve and stuff like that. So you really don't even think about it. And it's well known that if you're a really skilled athlete, the more you pay attention to what you're doing, the worse you'll actually even perform. So you want to kind of run it in machine language.
And I'm also curious how that slower transition happens when you go from doing things very slowly and deliberately. Like you all remember the first time you tried to ride a bicycle or walk, maybe, to how it is once you've compiled it down.
JOSH TENENBAUM: Or when kids learn to write. So with Brendan and Eliza Kosoy, who's also one of our CBMM people, we're starting to study how kids learn to write characters and when they have these sorts-- and yeah, they're much less good than adults, and they're much slower and they take a long time, and it's like they have to really think about each stroke that they're drawing.
So yeah. Again, I'm just saying, that's a place where we can start to study these things. Totally agree.
Should we wrap up--
AUDIENCE: Let's thank our speakers again.