The Road to Intelligence
December 4, 2014
A panel discussion with
Geoffrey E. Hinton, Bob Desimone, Laura Schulz, Josh Tenenbaum and Patrick H. Winston
Tomaso Poggio and Shimon Ullman
TOMASO POGGIO: I'm Tomaso Poggio. I'm the director of the Center for Brains, Minds and Machines. And I'm happy to present to you this panel discussion about what seems to be, and should be, a hot topic in science and technology today.
These are the rules of the evening-- the rules of engagement tonight. First, give the background of the panel, why we're doing this, and then ask some questions that we hope the panel will address. And then before starting the discussion, I'll introduce each of the panelists.
Patrick will speak first for five minutes. He'll be the first one. And then each one of the others will follow. And then after this set of five minutes talks, which will take roughly half an hour, we'll start the real discussion among the panelists. We're trying to have roughly an hour of discussion, or so, depending how it goes. I'll cut it early if it gets ugly.
All right, so the background is that the Center's mission is to work on understanding and replicating intelligence, understanding the capabilities of human brains, understanding how the brain generates the mind. We plan to do that through a research program which is based on the realization that the last five years or so have been a magic time for the history of the science of intelligence. We are starting to have systems that are as intelligent as we are, or better. Different [INAUDIBLE] domains of intelligence-- there's chess playing, there's Jeopardy, there's driving a car with vision.
But none of these systems is intelligent in the sense that you can ask the system to answer questions about a scene like this-- questions about what is there, who is there, what are they speaking about-- or to make a little story about what's happening there. And so our goal, our plan, is to make progress in the basic science of how we can do this-- ask how we can answer these questions, how our brain can answer these questions.
And we try to do this by measuring our progress against a set of-- you can call them SuperTuring tests. SuperTuring, because they are Turing tests about vision. Answer questions about images and videos-- simple questions like what is there, who is there. More complex questions like, what is going to happen in this scene? What is this girl thinking about this boy? All questions that any one of us can answer, or try to answer.
And these are questions, I think, for which we don't have any idea how to build machines that can do that, despite the progress in the last few years, and the progress that will undoubtedly happen within companies like Google in the next five years. And we want answers to these questions that are not only at the level of human performance, but also at the level of what the neurons in the brain are doing. That's a more difficult question than a standard Turing test of vision.
Now, motivated by this-- must say, Apple is not Microsoft. But still, it does not work. Now, these are questions for the panel. Not the only ones, but some. Panelists will look at them. It's questions about, if we want to understand intelligence and replicate it in machines-- is the road to first understand the brain and human intelligence? Or can this be done on a different path, without worrying about neuroscience and [INAUDIBLE]?
And there are some specific questions that we can ask, since we have Geoff Hinton there. You know, can we just develop more complex neural networks, replacing sigmoids with more complex functions, maybe capsules, and get these systems at some point suddenly to become intelligent? This is what--
Well, you know, this is what Elon Musk and Steve [INAUDIBLE] are worrying about. But of course, stakes like this have been predicted forever-- that in 15 years, we'll have intelligent machines. This prediction has been made for 100 years or so. And it's still 15 years away.
And more generally, what do we need to do in order to get to understanding intelligence? In particular, a question that's important for me is, does understanding intelligence mean we have to have a theory of it? Or is it enough for some hackers to get some systems, say deep learning networks, to work very well without any understanding of what's going on? OK.
Let me introduce now-- no, no, wait. Nobody gets up. So I'll introduce each panelist in order of appearance. I'll be brief, because I expect most of you to know our panelists. They are MIT people.
So Patrick Winston is a Ford Professor of Artificial Intelligence and Computer Science at MIT. He joined the faculty in 1970. Wow. And is a member, of course, of CSAIL, the Computer Science and Artificial Intelligence Lab. He served as director of the Artificial Intelligence Lab from '72 to '97. And he started to serve as a director well before finishing his PhD thesis on machine learning.
Served on many MIT committees, taught many courses. Has a great talk-- everybody should go listen to, during IAP, on how to speak. And he studies how vision, language, and motor faculties account for intelligence. 17 books. He's a member of CBMM's management. And, among many other [INAUDIBLE], he received the Meritorious Public Service Award.
All right, so Josh Tenenbaum is a professor at MIT in our Department of Brain and Cognitive Sciences. Also a member of CSAIL. Received his PhD in '99. He was at Stanford from '99 to 2002. He studies learning and inference in humans and machines. [INAUDIBLE] human intelligence and bringing computers closer to human capabilities. He's teaching a very good and very popular course in machine learning and cognition, and is a member of CBMM's management team. He's a recipient of the New Investigator Award from the Society for Mathematical Psychology, a Distinguished Scientific Award for Early Career Contribution [INAUDIBLE].
Laura Schulz is a professor in the Department of Brain and Cognitive Sciences. Her lab studies babies and children and how they [INAUDIBLE] human cognition is constructed during early childhood, how it develops. She's a member of CBMM and among her honors, she was an MIT MacVicar Fellow, and the recipient of the Troland Award of the National Academy of Sciences in 2012.
Bob Desimone is my boss. He's the director of the McGovern Institute. And he's the Doris and Don Berkey Professor in our department, Brain and Cognitive Sciences. Before joining McGovern, he was director of the Intramural Research Program at NIH. He studies the brain mechanisms that allow us to focus our attention on a specific task. He's one of-- he is the person, actually, who first found the physiological mechanism underlying attention-- visual attention. And he has shown that relevant information is selectively amplified in certain brain regions. He's a member of CBMM. He's a member of the National Academy of Sciences and the American Academy of Arts and Sciences. Recipient of numerous awards, including the Troland Prize of the National Academy of Sciences.
Shimon Ullman is the Samy and Ruth Cohn Professor of Computer Science at the Weizmann Institute. An adjunct professor at MIT. He received his PhD at MIT in Electrical Engineering and Computer Science in 1977. He was the first graduate student of David Marr. And then he went from MIT to the Weizmann to study visual processing in humans and machines. And he has made a number of very key insights in this field. He's a member of CBMM. He was awarded the 2008 Rumelhart Prize for Theoretical Contributions in Cognitive Science. He's also a co-founder of Orbotech, one of the first high tech companies in Israel, and a member of the Israel Council for Higher Education.
Geoffrey Hinton-- I will not say much, because he is well known, even if he's not at MIT. And he has been introduced at least twice over the last 24 hours. So I think I met him the first time in San Diego, which must have been 1979 or 8 with [INAUDIBLE] and David Marr.
He has been a charismatic [INAUDIBLE] in the field of machine learning since the '70s. Two years ago, his post docs did very well on this computer vision benchmark, ImageNet. And so Geoffrey agreed to be acquired by Google. And by doing that, he started an epidemic. Over the last 24 months, I've seen a lot of machine learning researchers falling to Google, Facebook, Baidu, Alibaba, Microsoft, IBM-- being bought or hired, left and right, by them.
So he was a professor in the Computer Science Department at the University of Toronto, where he is now an emeritus university professor. And he's also a distinguished researcher at Google. Among many of his honors, he's an honorary foreign member of the American Academy of Arts and Sciences, a fellow of the Royal Society. He was awarded the Rumelhart Prize. And he is not a member of CBMM. But if he survives this panel, maybe we'll make you an honorary member [INAUDIBLE].
OK. So now, Patrick, you can come here. And after you speak for five minutes, no more, you can sit there.
PATRICK WINSTON: Well, I think it a great privilege to be on the panel, partly because it guarantees a seat. I must say, beyond that, that I'm also pleased to be the first speaker, because noting the other speakers, it's the only real way I could be [INAUDIBLE] of having an opportunity to say anything. But [INAUDIBLE] know us both understand that.
Now, these are interesting times, at least for me, because when I started in artificial intelligence, there were well-meaning, but foolish, people like Hubert Dreyfus, who said that the enterprise was impossible. And now we have well-meaning, and perhaps foolish, people who say it's not only possible, it's dangerous. And I think that's a wild exaggeration. On my list, it's only fourth or fifth, well behind things like, oh, engineered pandemic, nuclear war, environmental collapse, and things of that sort.
But before I go any further, I've been commanded to speak for five minutes at most. So I'd better get to the bottom line. And my bottom line is that artificial intelligence [INAUDIBLE] it goes through [INAUDIBLE]. And that's because what the enterprise is focused on is getting a large community of people with different skills and talents to combine together to produce implementable models of what goes on in here. So it's not just models of what goes on in here, and it's not just implemented systems that do stuff. But it's implementable models of what goes on in here.
So when I think about what's happening now, especially during the past couple of days-- I listened to Geoff's extraordinarily exciting talk, how to think about all the [INAUDIBLE] stuff that's being done. And I tend to think about these sorts of things from the perspective of what a model is. I think what MIT is all about, MIT is about models. We build models of everything here. We build models of differential equations, we build models with a story, we build models in every kind of thing that we engage in. What's a model? It's a mechanism, it's a surrogate, for something that helps us understand it, that helps us to make predictions, that helps us understand the past, that helps us to control situations, and to build stuff. And from my perspective, incidentally, we don't really understand how something works unless we can build it.
So that's one reason why I think the CBMM enterprise has to have [INAUDIBLE] talked about aimed at building systems that can do the kinds of things that go on up here, but in ways that are consistent with and faithful to what we know about what goes on in there.
So to be more specific about how I think I will be thinking about the sorts of things that have been happening in recent months, I have to go back and focus on what kinds of ways a model can be impressive. And one way a model can be impressive, of course, is if it's an engineering success. And, without a doubt, the announcements made by Stanford and Google in the last couple of weeks have been amazing engineering successes. It's hard to imagine how they could be more impressive.
Another way, however, that a model can be impressive is if it illuminates something about the problem itself. And in some respects, the Google and Stanford announcements have been more interesting to me in the things that they don't do quite right than in the things that they do do right. Because the things that they get wrong say something about the nature of the problem.
The next way in which a model can be interesting is if it's behaviorally faithful-- if it does the same thing that a biological system does, with the same kind of information that a biological system has. And in this, what I would like to see would be equipping a small child with a little camera pasted on its head so it could take 10,000,000 pictures of what it sees in the first year of life, or maybe 18 months, and then maybe see if those pictures could be used by a deep learning type of system to name the kinds of things that deep learning can name. I suppose some tenure-hungry professor might do that, but would probably be arrested in the effort.
But biological faithfulness is not necessarily interesting, because it can be biologically faithful in the same sense that [INAUDIBLE] is biologically faithful. It might act like [INAUDIBLE] psychiatrist, but not impress us very much when we see what it does on the inside.
So the next dimension in which I think a model can be interesting is if it is what I like to think of as constraint congruent. That is to say, it does what it does because it uses the same kind of information that a biological system would be using to achieve its biological-faithful behavior.
It's easier to explain what I mean by that by citing the well-known example of something that performed well, but not in a way that was constraint congruent with what people do. It was that famous [INAUDIBLE] in which a [INAUDIBLE] system was trained to recognize tanks and performed beautifully well. But I think it was Seymour Papert who noted that all the pictures of the tanks were taken in the morning when it was rather dark and overcast, and all the pictures without the tanks were taken in the afternoon when it was bright and sunny. So although it was performing extraordinarily well, it was basically an illumination detector, and not using the same kind of constraints that people were using when they did the same job.
Just two more. Going beyond constraint congruent, I like to think in terms of being algorithmically aligned. And by that I mean a system is interesting to me if it is plausibly doing the same kind of algorithmic work that goes on in my head. So a system might do that because it's been template matched. It might do that because [INAUDIBLE] capsules. It might do that because [INAUDIBLE] a variety of kinds of things, [INAUDIBLE]. If it's doing things at that algorithmic level that are aligned with what I think might be happening in here, then I think that is an interesting model.
And finally, of course, there's the substance of what we're talking about now, which is a model is interesting if it is neurally plausible. So the thing I find exciting about the past couple of days is that we're engaging with a neurologically-plausible world. [INAUDIBLE] There's a kind of attention to structure that I think is essential to getting it up to the level that I'm personally interested in, which has to do with ways in which we humans are different from other species. Other than that, I think the key difference is that we humans understand stories, and can tell them, and can recombine them to make new stories, creative stories, and do the kinds of things that the Neanderthals couldn't.
So when I think about what's happened in the past couple of days-- whenever Geoff talks, I think about it for a year-- what I'll be thinking about in the next year is how the kinds of things that Geoff has talked about can be brought together with the kinds of things that I think about, which are sequences, and actions, and assemblies of [INAUDIBLE]. And make ourselves creative.
So that concludes what I would like to say about the [INAUDIBLE] extraordinarily exciting time to be here. And I look forward to all that we're going to be able to do in the Center in the next decade.
TOMASO POGGIO: Thank you, Patrick. Josh.
JOSH TENENBAUM: Tommy said no slides, but I [INAUDIBLE]. So it's a great privilege and honor and I'm really excited [INAUDIBLE]. [INAUDIBLE] what the Center is trying to accomplish. It's remarkable, and impressive, and inspiring, the kind of development that Geoff talked about, [INAUDIBLE] yesterday, in terms of what, in some ways, are old and simple ideas, made much more powerful with new technology on the data side. On the computer science side, how much progress in deep learning has revolutionized certain [INAUDIBLE] of vision, and other kinds of pattern recognition problems, [INAUDIBLE]. How much attention it's given to AI, and to the connection between AI and the brain [INAUDIBLE] science. So that's super exciting.
Somehow my notes got wiped out. I would say, also-- like many in the Center, and, I think, like Geoff-- I think that there are many ways into [INAUDIBLE] intelligence. But something like [INAUDIBLE] understanding and [INAUDIBLE]. I would say that-- and, again, like Geoff-- very much like [INAUDIBLE], that's going to be going beyond some of the kinds of problems and approaches that the current, sort of very heavily data-driven approaches in machine vision and art [INAUDIBLE] tackle problems in [INAUDIBLE] understanding that are fundamentally more challenging in ways that brains-- and maybe [INAUDIBLE] human brains [INAUDIBLE]-- are [INAUDIBLE].
So for me, that means these are problems very much like ones that Tommy talked about that our Center has been [INAUDIBLE]. We started off, actually, with problems of object recognition that are much more challenging, in terms of the kinds of invariance, code invariance, dealing with occlusion, very sparse data compared to things that we're currently seeing in data-heavy machine [INAUDIBLE].
So, for example-- this isn't something that Geoff talked about, but relatedly-- DeepFace. Right? It's a nice example of applying deep learning to face recognition. And it can get 97% correct-- human-level performance-- on a certain class of faces, which are basically the faces that people think are worth uploading and labeling with people's names onto the web and Facebook. So that means, basically, nice, fully visible, frontal views with slight, small, maybe 30 degree [INAUDIBLE].
But look around at all the faces in this room-- people you know and recognize. Or if you go back to [INAUDIBLE], the image that Tommy showed --since I'm not allowed to show slides, I'll refer to his slides-- along with Joel and others in there. Only one of those faces, I think, would even be an acceptable input to the DeepFace system. And all the other ones are some heavy profile view. Which, you know, if you know the people-- I know Joel. I don't know the other people there. You have no trouble recognizing even somebody-- you know, think of all the times when you've seen somebody you know, where you just-- walking down the hall from the back, or from somewhere up on the stairs-- and you just glimpse a particular shadow of their chin is enough.
So we've been interested in all those kinds of problems of hard recognition. Our abilities to understand scene [INAUDIBLE] include those kinds of problems. But as some of you might know if you've seen talks I've given here recently, for example, to understand the intuitive physics of a scene, or even the intuitive psychology of it-- so again, without slides, I'll just take the scene in front of me [INAUDIBLE] table.
We know that not only are there some bottles and cups and other funny objects-- dongles and such, which I don't even know the name for, [INAUDIBLE]-- and we know that there are these things here. We also understand that they're all stably supported by the table. Although, you saw when it rocked a little bit, like this, you might get a little bit nervous. If I put these things out here, you might get a little bit more nervous. Right?
Because you understand, intuitively, a rich analysis of the forces that are involved here-- the masses, the role of gravity, the role of the [INAUDIBLE], what role friction is playing [INAUDIBLE]. You know that if I put this thing on its side, it's going to get a little bit, maybe, more-- there we go-- a little more precarious. And so on. You know what would happen if I were to remove the table. These things would fall. You know that just going like this and sort of sticking this on the side isn't going to make it stick there. If I let go of it, it's going to fall. But you can catch it by doing that sort of thing.
But if you want to talk about things like Turing tests, there's literally an endless number of things you can ask about, comment about, jump into physics [INAUDIBLE]. The same kind of thing extends to [INAUDIBLE], but I won't really talk about that.
And referring to things like these fun things, also, in common with Geoff's talk the idea of one-shot learning, you want to think about the hard problems of learning whether it's object concepts or physical concepts or other things-- even very young children are able to learn a new thing like-- think about the first time you learned what a dongle was. I've only recently converted to Macs. But at some point we all got used to the fact that we couldn't just directly plug our cable into this. We had to carry around the other funny things.
And, you know, one dongle is enough to know that this is a dongle, too, I think. This is a dongle. This is a dongle. This is a dongle. They're all sort of different dongles. This is maybe like a dongle without a cord-- but that's not really a dongle. It's like an adapter or something.
That ability to generalize from one [INAUDIBLE] a new concept to another one, integrating both [INAUDIBLE] shape and the function-- that's a very sophisticated kind of problem, which can just happen in an individual [INAUDIBLE]. So those are problems we want to work on.
Also like [INAUDIBLE], in the spirit of today's talk, really-- I think that the kind of intelligence that underlies our ability to do all those things is a kind of [INAUDIBLE]. Again, if you've heard me talk recently, it's also a kind of inverse graphics that [INAUDIBLE] in our group recently. I think the only places, really, where I start to differ in my approach to intelligence-- and I think it's really quite close to Geoff in a lot of ways-- is I want to take inverse graphics literally, even more literally, literally than I think Geoff wants to. Or at least, the only thing we're going to spar about is who's taking inverse graphics more literally, literally.
For us, what we've been doing is actually taking graphics engines and [INAUDIBLE] and exploiting the idea into a computational model that can be tested behaviorally and ultimately, even, hopefully, neurally. The idea that your brain has in it an actual graphics engine, something like the kind that is used in computer graphics-- not just pose transformation, but all the rest of it, too-- tools for modelling objects, something like CAD tools, things which are now part of most kinds of computer graphics engines, particularly the ones that are used in real-time video games, like physics [INAUDIBLE].
These days, graphics and physical modeling almost go together. It's a toolkit of modeling [INAUDIBLE] familiar with if you've seen Pixar movies, or in various kinds of real-time video games, touch physics games-- which many of us play on our phones. We think that your brain has that kind of game engine there.
And it extends, also, to include a psychology. Your ability to analyze other people's behaviors, of other [INAUDIBLE] in terms of their mental states-- their beliefs and desires. It's a kind of inverse planning, in the same way your ability to analyze a physical scene is a kind of inverse graphic physics engine [INAUDIBLE]. If you think of agents as being guided by some kind of planning model, the sort of thing that people in robotics develop, then you can kind of do a [INAUDIBLE] version of that, too.
So that's, in a sense, that's my view of inverse graphics. The hard challenges to me, then-- also relating to Geoff's [INAUDIBLE]-- is to say, well, how might this work in the brain? Honestly, I have no idea how to implement a physics engine or a graphics engine in neurons. Where I differ from Geoff is I-- and I don't know if this is differing or not-- but actually, the problem we saw today was, to me, very inspiring-- much more psychologically and cognitively-inspired. I thought it was beautiful that the whole first half of the talk was psychology, and there was very little neuroscience. Which is nothing against neuroscience-- nothing against Tommy's boss, or anything.
But, again, in the spirit of the Marr approach, I just think that there is-- particularly if we're trying to understand human intelligence-- just actually understand it-- we understand the relevant constraints, up to this point, much better coming from the psychology and the cognitive side. But then the goal for me of participating with Stanford on this whole project is very much to work down in the Marr sort of tradition to [INAUDIBLE].
And here, again, I think I'm mostly in agreement with Geoff, that psychology, for the hard problems of [INAUDIBLE] understanding-- not the things that deep learning, just data-driven deep learning from yesterday's [INAUDIBLE]-- but for this kind of stuff, psychology is really the [INAUDIBLE] source of the [INAUDIBLE]. And so I don't worry so much if I have to start off with about how physics engines and graphics engines can be implemented in neural circuits if I understand how [INAUDIBLE] the software, like the kind of software in here. [INAUDIBLE].
The other place where we might differ-- and again, here, maybe because [INAUDIBLE] CBMM and even more so since, where I've spent a lot of time talking with Liz Spelke over here and getting those nativist hooks into my brain even deeper than [INAUDIBLE]. It's not so easy for me to see how a physics engine or a graphics engine could be learned by gradient descent, or stochastic gradient descent, or anything like the kinds of good engineering, reliable learning algorithms that you can [INAUDIBLE] in an hour on a MacBook, or even two days in MATLAB. We don't know how to build a [INAUDIBLE] that's going to learn a physics engine. Maybe Geoff will show us how, and that will be wonderful.
But here I'm not so worried. Because again, I've lived-- and I think many others have shown us over the last couple of decades-- evolution, I believe, has built a lot of relevant stuff into our brain. Maybe just a few more steps beyond where Geoff currently is there, in terms of what might have actually been built [INAUDIBLE]. But in the grand scheme of things, I think we're pretty much coming from the same perspective.
And it'll be exciting, both within CBMM, and also with our fellow travelers-- I think Google, not just Geoff, but Google. When Tommy was describing CBMM as that-- or, sorry, maybe what Patrick was saying-- bringing together all these people from different perspectives to try to reverse engineer the brain. I'm really excited what Google and DeepMind are doing. And I think they are in other instances doing that sort of thing. And I think, in the long run, what we're doing is going to look a lot more like what [INAUDIBLE] was trying to do than actually what a lot of other approaches are out there. So I'll turn it over to [INAUDIBLE].
LAURA SCHULZ: It's a real privilege to be here and to be on this panel. And I think that the main thing I'm going to say, and the main thing I'm sort of qualified to speak about, is actually this moniker itself-- this idea of deep learning. Because deep learning holds out a kind of promissory note that a more innocuous name like, say, backpropagation, doesn't hold out.
And, in particular, it implies, at least evocatively, two different kinds of promises-- one, that deep learning can tell us deep things about learning, that we can use these models to understand specific and precise things about human behavior. And, of course, the other promise is that these models will be capable of implementing learning about deep things-- the kind of learning that we think is distinctive and important to human intelligence.
And so I want to take up the implied promises-- and, of course, it's not committed to that. It could just refer to deep layers of networks. But if you're going to use a title like this, then you're going to evoke these concepts. And my question, I guess, is how much we should look to deep learning to make good on these kinds of promises.
So thinking about the first idea first, I really am genuinely curious. You know, how can we use these models? What can we learn about learning in these ingeniously-designed networks with lots of hidden layers that were working on large data sets, that can make predictions about human behavior? How can I be informed by these models as a cognitive scientist, as a researcher, in their successes and failures, to design experiments, or to make particular predictions? Is that the kind of promise that's held out by these models?
Or are they in some sense incredibly sophisticated, but largely black boxes that are going to be able to do a lot of things that we want them to do, and that can maybe echo, or even exceed, human performance-- that may or may not give rise to the kinds of experiments or interventions or predictions that tell us something more about what it's like for human beings to reason about the world. So that's one question I have.
The kind of question I think about more often is how much these systems are capable of actually deep learning. And that isn't to take anything away from the exciting successes in things like image processing and speech recognition and natural language processing, because those are clearly phenomenal successes. So at the risk of being terribly reductive about phenomenal technological successes, it's still tempting to think that what these systems are doing is what they are designed to do, which is match input to desired output. And that is a huge kind of accomplishment. But human learners, even our very youngest human learners, do much more than make matches. They're much more than classifiers and parsers of the world.
And I think that the kind of problem here is not just the problem that children can maybe learn in many fewer trials with much sparser data, or do one-shot learning-- although I think that that is true. But clearly these kinds of approaches are getting much closer there. I don't think, actually, human children can probably learn their numerals in 25 labeled examples. I think they're probably worse at that, in some ways, than these machines. I think that maybe that kind of problem-- I know. It's hard to win. But I don't think that's the only problem.
And I don't think the problem is just-- also, the places I agree with Josh, with Liz-- there is good evidence to suggest that humans and other animals probably have a lot more innate constraints that they come into the world with that encode abstract features of objects and forces, or agents and goals-- although, again, I think that's true.
I also don't think the problem is that human learners probably do something more and different, at least in many cases, than stochastic gradient descent over large data sets. I think there's pretty good evidence that human learners, even young children, can entertain maybe one or two hypotheses and actively search for evidence to distinguish them and disambiguate them and engage in processes of discovery that are not maybe well captured by systems like gradient descent.
But I think that the real problem is that, if you really want to take seriously what's deep about human learning, then it's our ability to go beyond the data-- and go beyond the data, even with really elegant ways of filtering and analyzing and synthesizing that data. What's deep about human learning is not just that we can represent a scene like a group of blocks, and [INAUDIBLE] some difficulties with high order parity, mentally rotate those blocks or visualize it in different ways. It's really the fact that we can look at a bunch of blocks and see things that aren't there. We can see the future, or the past, or counterfactuals. We can see bridges, and castles, and graveyards, and riots in a bunch of bricks.
And those problems are ill-posed, right? The problems of invariance, the problems of [INAUDIBLE] are much better-posed kinds of problems. And the problems I'm gesturing at are really a problem of how human beings generate massive variance-- how it is that we generate altogether new hypotheses. How we look at scenes and make up new things to see, new things to learn, new things to do with those kinds of scenes.
And there's a certain sense in which those abilities are trivial, in the sense that any child can do it. But I don't think it's necessarily true that any other animal can do it. And I'm not sure that any deep network can do it. I'm not sure it's a problem of more layers, or more data, or faster processing. I think it might be a different kind of problem.
So one of my questions is how far these deep networks can go in getting at the kinds of things that are really deep and uniquely human about human learning, and that help us understand, really, the problem of thinking.
BOB DESIMONE: You'll get the neuroscience now. And Shimon, you can give the rebuttal.
So Tommy asked me to comment on the idea of whether we'll have more intelligent machines in the future by paying attention to neuroscientists. And I got stuck right away on this idea of what's an intelligent machine. I was imagining, like someday in the future, a salesman from a company that's owned by Google or Alibaba or something, goes to Toyota and says, I have a really intelligent machine for building your cars. It passed all of Tommy Poggio's Turing tests. You really want this machine in your factory. And so they put this machine in there. And it seems it's working great.
But the next day, it's doing nothing. And so the owner's upset. And the salesman will say, well yeah, the machine, it's not going to work. It's bored. It's wondering, what is its purpose in life, doing these repetitive things all day. And it was programmed in mainland China, it doesn't really trust these Japanese managers.
So, really, we don't want intelligent machines in the sense that they emulate human intelligence, which is irrational, self-destructive, prejudiced. What we want, is we want-- our idea of an intelligent machine is to implement some idealized form of some very narrow aspect of what we do, so it will be helpful to us in some way. It's sort of like we want our machines to be like idiot savants-- really good in some restricted domain.
And then, think about what those domains are. Tommy's example of the scene, where people are around having drinks and eating and so on. And so, yeah, it'd be useful to have machines that understood what was going on in a scene like that-- that I can relate to more as a neuroscientist, and also as someone studying the visual system.
And I was thinking, so what does a neuroscientist have to offer about the neural mechanisms involved in something like that? And I'm seriously thinking about-- back to the days, when I was a graduate student doing neurophysiology in monkeys in Charlie Gross' lab. And this is the days before people collected all the data with computers, and you don't really understand what's going on in the experiment, anyhow, until MATLAB tells you what happened.
But back in those days, you actually listened to neurons over loudspeakers. And you actually waved your hands and you put your face out in front of monkeys and listened to what the neurons did. And the thing, the overwhelming memory that I have from that time, was that the neurons that you recorded from-- a lot of these higher-order areas-- each one of them seemed like they were really intelligent.
If you recorded from a neuron that was a face neuron-- you could play peek-a-boo with the neuron. You would go-- blop, blop, blop. Or you'd draw something on a piece of paper, and it would look, you know, it's like, well, is this a person? I don't know. Show it to the neuron. And you'd show it to the neuron. And the neuron would go, errr. Oh yeah, that's a good face.
And then, even if it wasn't completely invariant-- I was thinking again of, like, the hierarchy. It's like there's an area where-- the face, you had to be in profile. And so you'd say, well, is this a profile? Is that? What if you did this? Is that a profile? And you could do those things. And the neuron would tell you what it thought about your attempt to create that kind of stimulus.
So what's the point of this rambling about the past? The point was that I was listening to Geoff's ideas about the hierarchy, which, at every stage of the hierarchy, there are certain forms of invariances. And what seems to be the case in the biological visual system, is we also have this hierarchy with limited domains of intelligence at multiple levels of the hierarchy. And the key thing is that all of the outputs of all the levels of the hierarchy are made available to the rest of the brain. Not only when I think about the hierarchy-- we think about, there's this process, and it'll go to the top.
And I was thinking, you want to label. The label is, OK, I'm looking at Tommy. But, really, you have access to the information about Tommy's pose, and how is Tommy moving, and so on and so on. And those are being extracted at all the various levels of the biological hierarchy, and made available simultaneously with things like the categorical labels and so on. And I think that our understanding of each of those levels of the hierarchy, I think, is clearly advancing in biological vision. I can't help but imagine that it's going to be useful in creating our intelligent scene recognition systems, which will be kind of like our idiot savant intelligent machines. Thank you.
SHIMON ULLMAN: OK, so I'd like to make a few comments on the road to intelligence, using, again, the vision as the domain example. But I think that the issues are quite general, and not limited specifically to vision. But I think that in vision, probably more than in other areas-- as we noted already here-- there has been really quite a remarkable success in the recent years, accomplishing things that were almost unimagined a few years before that, relying heavily on advances in theoretical machine learning, and on some clever, useful, practical algorithms.
But I think that these successes rely, even in relatively simple tasks, on very extensive training by a large number of very often richly annotated examples that you have to give, feed, spoon feed the system in order to succeed. ImageNet, for example, is a database of something like 15,000 visual categories, I think, with a few thousand images per category. So there are millions and millions of images. And typically, computer vision systems are trained on about half of the images in this data set and then are tested on the other half. And this is just for producing the label, the name of the object.
If you want something more sophisticated, still pretty decent and humble, you want to also delineate where the object is, then you have to train your system on many, many other examples, in which people carefully annotated example objects. And then they learn from this. And then they can do these other tasks. And if you want more sophisticated things, then when you go and you try to extract more, richer information, then things continue along this way. And you need more examples. And you need them to be carefully, richly annotated in order to get the right information out.
So, single object recognition, in which we had remarkable success, is a relatively simple task, compared to more natural tasks that people can do. People can look at images and, as we discussed, they can get very rich and deep information about people, and about actions, and about goals, and about agents out in the world, and the way they interact. And they can do all of these things very quickly. They learn these things quickly, and typically without supervision and without annotation.
So it's not just the technical problem of who supplies the data. But there is a trend there that shows us something important. And it's worrisome that the more you want, and the deeper the information, more structured, you need to supply more and more data, and to prepare it and annotate the important things that your machine, your system should pay attention to.
In our own work, we've been interested, for these reasons, in examples in which, for machines, they will require a lot of annotated data and help-- and that, on the other hand, that children, even infants, can get very quickly, and can get it without any help, and without any supervision and any annotation.
Let me give you-- I will not go, of course, into any detail, but just sort of in a telegraphic way-- just to tell you about two or three concrete examples, just as examples of what I mean, and try to make the general conclusion from this. So one example that infants do very quickly early on-- and it's surprising for computer vision type of research-- is recognizing hands-- just being able to recognize the object hand. Because it's so flexible, and there are so many different views that a hand can take, that it has been considered in computer vision a particularly difficult problem in object recognition. And it's typically approached by richly supervised and long training of systems in order to accomplish this task.
And yet for infants, it's sort of the second most obvious object to pay attention to. We start maybe with faces. And the second thing that they look a lot at is hands. And they know what they are. And they can recognize them when they're small, and in dynamic and static images. They do it very well, without any supervision, in the very early stages of their life. So this is surprising.
And we think-- and I will not go into all the details-- that we have some special mechanism in our vision system that sort of directs us to pay attention to hands-- to pick up, specifically, images of hands, despite the fact that they are so variable. And the suggestion, based on some computational experiments and some available evidence from infant data, is that hands, they have special sort of dynamic characteristics. They are moving objects that also typically can cause other objects to move.
And it has been known that infants, or human beings, have in their visual systems, from birth-- they have sort of special capabilities, special detectors tuned to this kind of dynamic pattern of activity-- of regions in the image that are set but moving, and they can come in contact with other regions, and carry them around, or cause them to move. And we think that you have specialized detectors for these things. They can detect them. They can use them in order to extract specifically hands, [INAUDIBLE] about hands, and about the properties of hands as active agents that can move things around and so on.
And then hands give rise to other things like direction of gaze, which I want to sort of mention. Because if you think about it, it's surprising-- infants can very quickly, within the first three months of their lives, start to pay attention to the direction of gaze-- what object other people are looking at. And this is something that is interesting, because it's not salient, not obvious. It doesn't exist, in fact, in the image. And somehow, they pick it up. And they learn it. And they can compute it.
And we think, again, that this is directed by a specialized mechanism, and that infants actually pay attention to hands. It's sort of related to hands. And when hands initially grab an object-- they make an initial contact to grasp it, and to move it around-- then they typically-- in fact, we measured it, it's over 90% of the time-- they are looking directly at the object they are making contact with. And we think that our mechanism, it gets sensitive to these particular events. And they watch for hands coming in contact with objects.
And then when they do it, the next thing they do is to look at the face. And they use it as a particular example in which the direction of gaze is known. It's from the eyes to the point of contact. And they use it in order to master the concept of the direction of looking. And they use it in order to be able to deal with this particular problem.
Gaze is, by the way, even used later on for other things. We use it later-- infants use it later on to know what people are attending to. And this is used later on in other things, like language acquisition. It turns out that to disambiguate nouns, this turns out to be a useful cue that people use. If it's a new noun that you don't know what it refers to, then it's more likely to refer to the object that the person, the speaker is now looking at.
So it's also interesting to see this, not just the isolated cues and mechanisms, but the sort of natural developmental trajectories that moving regions can lead to hands, and hands to gaze, and gaze to reference of nouns. And if you build a system correctly, all of these things will unfold.
And, very briefly, just one last thing, which I think refers to, also, things that Laura was talking about, which is the richness of what we get. We're looking at issues of action recognition, and, for example, in the action of drinking-- just taking something and drinking from it-- and it turns out that humans are very sensitive to very subtle properties of objects that people, that agents, are holding. They are very important, and people are paying attention to this.
And they can, for example, in the case of drinking-- they automatically and naturally, spontaneously-- they test whether the object being held in the hand is open from above-- sort of a container that can contain something. Even if it's a new object that you haven't seen before-- but if it conforms to this description and this property, that it will be OK for drinking.
And it's interesting, again, that infants, at a very early age-- what's very surprising to me, one of the first properties and relations they note about objects is whether or not an object can be a container. And it's not something that, as a computer vision person, you would expect. They know about containers. They recognize them. And they know rich semantic information about them-- that they can carry liquid, another object, inside them. All that, already, in the first few months of life.
So I think that we see an interesting contrast here that, at least in an unsupervised setting, the kind of main systems of today-- if you give them many, many images, they will extract mainly things which are statistically salient, not necessarily very meaningful, and will actually label to them. What you really want in more cognitive systems is to give the system many instances, many images-- but you want somehow the system to learn from this the more meaningful things, even if they're not very salient-- like gaze, and containment, and so on-- and use them in order to construct a very rich semantic understanding of what's going on.
So I think that will be an important thing that we'll have to understand and discover and be able to put into our system with our structures. Innate for us, inside the system, that biases us in certain ways. That are the structures already partly there, and they are ready to get out your important information from the sensory data.
Now, how do you discover them? How do we get all of this structure that guides our learning and our most specialized and very general-purpose learning mechanism? There are two ways, I think.
One is the kind of thing that Laura and Liz and others are doing. We can get them from studying humans and trying to understand what goes on in our cognitive systems. I think it can also be studied in sort of more machine learning and formal ways. After all, even these innate mechanisms were learned, in some sense, by evolution and wired into our brains. But you can still think about it as a learning problem, in which you get lots and lots of examples over a long time. And somehow you learn to develop this innate structure.
I think it's something that can be approached formally. But probably not exactly. It will be a somewhat different kind of learning, in terms of what you get, how you start with, and what you end up with. I think it's doable. It's approachable. It's something that should be tried. But it's not exactly the kind of learning which is currently being developed, and being discussed in standard learning algorithms. And it will require some change and some new conceptual approaches. So that's my [INAUDIBLE].
GEOFFREY HINTON: OK. I spoke for several hours recently. And so I will be rather brief. I want to respond to Josh by saying, I'm convinced [INAUDIBLE]. I agree with everything else he said. [INAUDIBLE]
And I want to respond to Shimon by saying, the vision systems I was talking about are disembodied vision systems. They're just trying to [INAUDIBLE]. Of course, a real vision system is embedded in the [INAUDIBLE] that has goals and pays attention to things. And I think a lot of what he wants is going to come out of embodying a vision system in a larger system. It's learning to achieve its ends, [INAUDIBLE].
I think it's [INAUDIBLE] question as to whether you can learn that whole larger system by doing stochastic gradient descent. I think probably you can. I don't think you need much [INAUDIBLE], but that's a different question. I agree with him, but these disembodied vision systems aren't going to get at things like that. [INAUDIBLE]. I'll leave it at that.
TOMASO POGGIO: So there is this question that is the first one in this list. And I think the person [INAUDIBLE] to the kind of West Coast discourse [INAUDIBLE]. I think Geoffrey's West Coast right now. The West Coast, you can not [INAUDIBLE] experience. And the East Coast, maybe because of Chomsky, is you have to have some [INAUDIBLE].
GEOFFREY HINTON: Yeah, I think this form of argument that I attribute to Chomsky goes like this-- because I, Chomsky, [INAUDIBLE].
TOMASO POGGIO: There are more sophisticated forms of nature versus nurture, right?
GEOFFREY HINTON: But mostly, when people say things are [INAUDIBLE], it's because they can't see how they could possibly be learning.
TOMASO POGGIO: That's partly true-- but on the other hand--
GEOFFREY HINTON: [INAUDIBLE].
TOMASO POGGIO: --on the other hand, there is this old idea that [INAUDIBLE] discovered. It's an idea that, back in 1890--
GEOFFREY HINTON: [INAUDIBLE].
TOMASO POGGIO: --Mmhm. [INAUDIBLE].
GEOFFREY HINTON: [INAUDIBLE].
TOMASO POGGIO: Yes.
GEOFFREY HINTON: [INAUDIBLE] even the innate stuff was [INAUDIBLE].
TOMASO POGGIO: Well, it says also that even the innate stuff needs learning. Right? One consequence of that-- almost everything-- if you have an organism that is sophisticated enough to have a general learning machinery-- almost everything it can do is partly genes and partly learning from experience. You should explain [INAUDIBLE].
GEOFFREY HINTON: Really?
TOMASO POGGIO: Yes.
GEOFFREY HINTON: [INAUDIBLE].
TOMASO POGGIO: Uh, sure.
GEOFFREY HINTON: How many people know what the Baldwin effect is? OK, [INAUDIBLE]. Baldwin effect. B-A-L-D-W-I-N. Gosh. How many minutes do I have to explain it? If you give me a number of minutes, I'll explain it in that many minutes.
TOMASO POGGIO: Four.
GEOFFREY HINTON: Four. OK.
PATRICK WINSTON: I could do it in two.
GEOFFREY HINTON: So there's a guy called Darwin, and there's a guy called Lamarck. And Lamarck said you inherit characteristics that you acquire in a lifetime. And Darwin said no you don't. Is this [INAUDIBLE] discourse? And the question is, can you get learning to guide evolution? And the answer is yes.
Can I give you the six minute explanation?
TOMASO POGGIO: Speak about connections and the [INAUDIBLE].
GEOFFREY HINTON: OK, I'll give you a little [INAUDIBLE]. Suppose I have a little neural net. And it has 20 connections in it. And they have to be set to the right values for the neural net to perform well. So suppose it's a circuit for mating. If you set the 20 connections to the right values-- they're on or off-- if you set them to on, or off, whatever the right value is, it will produce lots of offspring. And if you don't, it won't. And you'd like to get that into the DNA of the organism. So you'd like the organism to be born with 20 decisions of the DNA made [INAUDIBLE].
One way to do that is to have-- we're going to have a very naive model where for each decision, you have two alleles. And you have to get the right allele. For each organism to have a whole bunch of genes that start off random, and to do crossover, to run a genetic algorithm.
And it's only one in a million that's going to work, because you have to get 20 decisions right. So obviously, you're going to have to build at least a million organisms, on average, before you get a hit-- before you get any benefit. And for that to-- remember, the population is going to be hard. Because if you don't mate and you lose one of the alleles, you're not going to be fit anymore. And so it's going to be millions you're going to have to build.
But now suppose that each organism can do learning during its lifetime. It has three versions of the allele. It has wired in to a yes, wired in to a no, or leave it alone. Suppose you start off in a population with these more or less equally distributed for each allele, each decision. Then each organism runs for a thousand iterations.
So now what's going to happen with learning-- an iteration of learning is much faster than an iteration of producing offspring-- so now what's going to happen is, the organisms that are born with the hard-wired ones correct with a remainder left to learning, during their lifetime have a good chance [INAUDIBLE] iterations of setting the remainder correctly, and therefore having lots of offspring. And therefore, they will be fitter.
So you can get a fit organism by not making all the decisions-- by making some of the decisions right, and leaving the rest alone. And so you're selecting advantage for getting some of those decisions right. So now you've turned a spike into a hump. And now you can build much faster. And if you don't have [INAUDIBLE] crossovers between them, the whole population will have most of the alleles set right.
So, it's basically, you have to do a million. But you can divide a million into 1,000 organisms each [INAUDIBLE] 1,000 learning trials. And so for the 1,000 organisms, you try different combinations of the hard-wired [INAUDIBLE]. And the [INAUDIBLE] you try different combinations of the [INAUDIBLE]. And between them, you're going to get a good hit, eventually.
And that's just a much more efficient way to get stuff into the DNA than pure evolution. It's an example of a very general thing, principle, which is, if ever you can get a fast loop to help you out with a slow loop, it's a good idea. So if you're doing motor control-- can I go on for another minute? I know I talk too long.
If I'm holding a piece of chalk. Let's suppose this was a piece of chalk.
The thing about dongles is-- you can generalize dongles. But the dongles don't generalize. You need a different one for each Apple.
So, if I'm holding a piece of chalk, and I try to write on the board, and I do it by open loop control, one of two things will happen. [INAUDIBLE] nothing will happen. No, I didn't really want-- [INAUDIBLE]. Either [INAUDIBLE] nothing will happen. Or [INAUDIBLE] the other side of the board, and I'll smash the dongle on the board if I've got non-compliant control.
But if I make my muscles have the right, appropriate stiffnesses, and I am just beyond the board, then the board will stop me. And if you ask how fast in this [INAUDIBLE] is the board stopping me, that's going to loop in the [INAUDIBLE] shockwaves in your arm. So it's an incredibly fast loop in the physics.
And by putting a fast loop like that to help you with the problem, you can do much better than just having pure control. So you control stiffness of muscle so that you've got very, very fast feedback in physics to make up with [INAUDIBLE]. And that's true everywhere in biology. So you have developmental loops in the time period like 20 minutes. And that makes evolution go much faster. Because now evolution doesn't have to solve all the problems. It just has to set up some links so that development can solve the problem. And then you have learning, which is much faster than development, and so on.
So it's a very general principle in biology that we always use something fast to help something slow. And the Baldwin principle is just an example of that, using learning to help evolution.
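[Editorial sketch, not part of the spoken remarks.] The arithmetic behind the Baldwin effect argument above can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the 20 connections and 1,000 learning trials come from the talk, but the genome encoding (`"1"` for hard-wired correctly, `"0"` for hard-wired wrongly, `"?"` for left to learning), the function names, and the model of learning as independent random guessing on each trial are illustrative choices, not something the speaker specified.

```python
# Hypothetical sketch of "learning guides evolution" (the Baldwin effect).
# Assumptions (not from the talk): the genome encoding, the function names,
# and modeling lifetime learning as independent random guesses per trial.

N_CONNECTIONS = 20      # from the talk: 20 on/off connections
LEARNING_TRIALS = 1000  # from the talk: 1,000 learning trials per organism


def chance_pure_evolution(n=N_CONNECTIONS):
    """Probability that a random, fully hard-wired genome gets all n decisions right."""
    return 0.5 ** n


def chance_with_learning(genome, trials=LEARNING_TRIALS):
    """Probability that an organism hits the right setting during its lifetime.

    genome is a list of "1" (hard-wired correctly), "0" (hard-wired wrongly)
    and "?" (left to learning). Each trial guesses every "?" at random.
    """
    if "0" in genome:
        return 0.0  # a wrong hard-wired decision cannot be fixed by learning
    unset = genome.count("?")
    p_hit_per_trial = 0.5 ** unset
    return 1.0 - (1.0 - p_hit_per_trial) ** trials


# Success chance as more decisions are hard-wired correctly (the rest learnable).
# Without learning, only k == 20 would score at all: a spike. With learning,
# partial solutions already pay off: a hump that selection can climb.
hump = [
    chance_with_learning(["1"] * k + ["?"] * (N_CONNECTIONS - k))
    for k in range(N_CONNECTIONS + 1)
]

if __name__ == "__main__":
    print("organisms needed on average without learning:", round(1 / chance_pure_evolution()))
    for k, p in enumerate(hump):
        print(f"{k:2d} decisions hard-wired correctly -> success chance {p:.4f}")
```

Under these assumptions, an organism with half of its decisions hard-wired correctly and the rest left to learning already succeeds more often than not, whereas a fully hard-wired random genome succeeds about once in a million. The sketch leaves out mating and crossover entirely.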
TOMASO POGGIO: This says that almost everything that a sophisticated organism like us can do, if it's important for evolution, is going to be partly learning. There will be no evolutionary pressure for the genes to solve completely the problem if there is a general purpose learning algorithm that--
GEOFFREY HINTON: If you have something where you have to get a big [INAUDIBLE] adaptation, it's going to be very hard to do purely with evolution.
TOMASO POGGIO: Yeah, exactly. But at the same time, a lot of these problems would be solved, at least in part, or to a good extent, by genes. And the story may be just more than speed. I think Shimon may have something to say here. Is there? I'm just guessing.
SHIMON ULLMAN: [INAUDIBLE]. In machine learning, in general, the problem is to find the best function to perform the particular task, minimizing the [INAUDIBLE] within a restricted family of possible functions [INAUDIBLE], often called [INAUDIBLE] space. So it's OK if you are under [INAUDIBLE] space. Then you can use some examples and find the optimal solution.
Now, whether the [INAUDIBLE] space come for us-- if it's in our brain, it's probably wired in by evolution. Evolution has the task of learning over many years what's a good [INAUDIBLE]. And then you're searching for the function within this [INAUDIBLE].
But then there is a nice picture of division of labor between evolution and individual learning. Because now, evolution doesn't have to search every possible solution. Is this good? Is this good? Is that good? You can chunk the search into big [INAUDIBLE]. This big circle, this big circle, this big circle. If you found the right circle, then the rest of the job will be done by the agent--
GEOFFREY HINTON: I completely [INAUDIBLE].
SHIMON ULLMAN: [INAUDIBLE] division. It's not just about speed. [INAUDIBLE]. As a community, what we usually try to do-- let's look for DPM with [INAUDIBLE] and 8 orientation, and 32 [INAUDIBLE], and so on. Did this work? And it doesn't work. We start [INAUDIBLE]. So I think that, eventually, we'll have to find ways of not just assuming that we have the right [INAUDIBLE] and then [INAUDIBLE] that our systems can [INAUDIBLE] this, and everything works right, and we can do this. But we have to somehow automate this [INAUDIBLE] over the two sides of the [INAUDIBLE]-- not just the algorithm, but given [INAUDIBLE] can find the right solution in the [INAUDIBLE] family. But the combined search [INAUDIBLE] the whole thing working together. [INAUDIBLE].
GEOFFREY HINTON: I mean, basically, if you believe in stochastic gradient descent, then you ought to believe in the [INAUDIBLE] level, too. Right now, we're doing stochastic gradient descent [INAUDIBLE] relatively [INAUDIBLE]. And it's working better than things did before. So we made a little step forward.
Maybe the next step forward is going to come from changing the kind of [INAUDIBLE] use in the net. Or maybe it's going to come from changing the [INAUDIBLE]. Each time you get a system to work better, you've made progress. But you shouldn't assume you keep going in the same direction. You have to go off in different directions. I think new directions are going to be different kinds of architecture for the net. But we don't know what new directions are. And we don't know how much is innate.
I just have a sort of basic feeling that the first thing you should try is to do it without making an [INAUDIBLE] assumption. Because [INAUDIBLE] assumption's really easy. Just say it's all innate. We're done.
JOSH TENENBAUM: This is where the [INAUDIBLE] engineer model, to the extent that, at least around here, we're not just telling stories. We are building actual working models, even when we're studying stories-- like Patrick, who's been building actual models of stories [INAUDIBLE]. That's not a free pass.
LAURA SCHULZ: It seems to me that there's an interesting question about whether inference above instance can be learned. And either way, there's a problem about how [INAUDIBLE] develop a computational system that does physics. Whether it's developed by evolution, or learned as an empirical fact, it seems to be clear that there's a lot of evidence that a lot of things really are innate.
There are [INAUDIBLE] animals. They're born, they gallop across the plains, or they fly. Which means they know a lot about scenes, they know a lot about physics, and they know a lot about other agents. And that is innate, insofar as they're not learning it. Right? They're born, and they're doing it.
So given that we have a lot of data saying that this is, in fact, how a lot of organisms behave, it's the kind of thing that makes me sort of think about to what extent are these models showing us the kinds of things these models can do. To what extent should I be taking these models as telling me how to make predictions about how humans do these things, or how other organisms do these things, how the brain is doing these things? Because they could be different kinds of questions. There could be things that you could learn that nonetheless turn out in practice to be innate. And that might support other kinds of learning and other kinds of development.
GEOFFREY HINTON: [INAUDIBLE] but I don't think there's anything we do that can't, in principle, be learned. Because if you take evolution and learning together-- that we came from dust and we do it--
LAURA SCHULZ: But why would you want-- what's the advantage of starting there, I guess? What's the advantage, even for machine learning, of starting there? Right? So humans don't start there. We start with, at least as far as you can tell by testing neonates or [INAUDIBLE] checks to the other kinds of things that happen, organisms don't start as a blank slate. Right? They start with a lot of innate structure.
And that appears to be an advantage as far as evolution is concerned, because that is how, in practice, organisms are. So what's the advantage from a machine-learning perspective, of posing the additional challenge of how you might learn everything, as opposed to starting with those things that really are plausibly innate.
GEOFFREY HINTON: OK, so over the last 40 years, I've heard lots of people say, this has to be innate, that has to be innate, this has to be innate.
LAURA SCHULZ: Not has to be-- just is. Is.
GEOFFREY HINTON: Yes. I've heard lots of people say just is, or has to be. And normally the argument has been that the [INAUDIBLE] has been there's no way you could learn that.
So there was a wonderful NOVA program about language-- I watched it about 10 years ago. And all these Chomsky students-- [INAUDIBLE] senior linguists-- all look straight at the camera. And they all said, if there's one thing we know for sure about language, it's that it's not learned. They were-- I mean, it was like a religious cult. They knew that-- no, I withdraw the like. They knew that it wasn't learned.
And I think it was mainly because they couldn't see how to do it, based on spurious things like Gold's Theorem, where you have to learn perfectly. And as soon as you produce [INAUDIBLE] learning, or something like that, [INAUDIBLE] goes away. And it's now obvious that you can learn an awful lot of language.
It doesn't answer the debate about how much is innate. But it doesn't look like there needs to be much innate. You can learn grammar from [INAUDIBLE]. It's not a big problem.
Another example I like is, when I started in-- before you were born-- they told me orientation detectors-- the fact that neurons detected oriented [INAUDIBLE]-- that was innate. People were pretty confident about that. And then after a while, people learned to extract oriented things.
And then later on they said, well, look, if you look in the cortex of a monkey, you'll find these topographic maps. You'll find, for example, patches that have [INAUDIBLE]. And these patches have color opponent cells, as opposed to what I call black and white cells. They're lower frequency. And they come in patches. And they're typically red-green versus yellow-blue. And that's obviously innate. [INAUDIBLE] obviously innate.
It turns out if you take a simple learning algorithm, with a simple architecture with locality in it, and you just apply it to patches in the visual world, you get a map that looks just like the monkey. You get these continuously changing, high [INAUDIBLE] frequency orientation detectors. You get these colored blobs. And the colored blobs are low frequency. And they're red-green opponent or yellow-blue opponent. And all of that is just there in [INAUDIBLE] of the world.
The combination of the three things-- the statistics of the world, a sense of [INAUDIBLE] learning algorithm, and some architectural restrictions-- [INAUDIBLE] innate, but it was locality, the locality of connections. And you end up with colored blobs. And you would have sworn the colored blobs were innate.
I mean, most neuroscientists would have told-- well, Bob was there. Most neuroscientists would have said the colored blobs were innate. They weren't something you'd develop in response to the structure of the world.
LAURA SCHULZ: And neuroscientists are--
BOB DESIMONE: Every time there's been a debate like this, the answer is always in between. There's always some innate aspects that come with the biology. And then it always gets refined with experience. But I mean, so I guess I don't really understand what this argument is about.
No one wants to argue that there are parts of behavior that are not deterministic in some way, right? So if you allow for supervised learning, you could presumably create a system to emulate anything that humans have evolved to do. So, given that, then what is this discussion about?
Is it what things need supervised learning and what things are unsupervised that are just basically responding to natural statistics and locality?
GEOFFREY HINTON: As a matter of fact--
TOMASO POGGIO: If we want to understand human intelligence-- you know, we don't know what intelligence, in general, is. There is a very practical definition about human intelligence-- basically, the Turing test [INAUDIBLE].
BOB DESIMONE: Yeah, but so, you want the machine to emulate the human. But why do you care how the human got to be that way? The human is the ground truth, so you want your system to emulate it.
TOMASO POGGIO: One implication is that, why should we explore everything that can be possibly learned? It's like saying I start from [INAUDIBLE] dynamics and I want to predict evolution and biology. It's crazy. We have to start from what human intelligence is, and what is determined a priori by the genes, and what is learned from visual experience.
So I think Geoffrey has always been for finding a machinery that can learn everything, and forget about evolution and the genes. And they find that he's crazy. I put you on the spot.
GEOFFREY HINTON: No, my principle has always been, you should first see if there's any way it could be learned before you say it must be innate.
TOMASO POGGIO: OK. OK, agreed.
JOSH TENENBAUM: I think that's a good principle. It's actually, I think, really interesting to look at the recent history of deep learning, right? You were saying, OK, well who knows where the next-- or you were making some speculations about where the next [INAUDIBLE] is going to come. Maybe it's from a new kind of architecture, or something like that.
I mean, the things that you talked about as what are the key steps that make a deep learning system work-- a certain kind of layered architecture, for better or worse. Convolutional architecture is a good one. But maybe the [INAUDIBLE] is good for some things, but not for others, right?
Those are innovations that were made by very smart people, maybe taking various kinds of semi-random steps and trying things out. And things happen. But I think it's interesting to look at that process of discovery that led the field of machine learning and neural nets to come up with these really cool architectures.
In a lot of ways, it looks a lot like evolution, right? A lot of parallel search, of trying things out, and seeing what works. Often, it's not the theory that [INAUDIBLE] those most useful discoveries, but just trying some things out. [INAUDIBLE] are human beings. They have a little bit of intuition that [INAUDIBLE]. Maybe it's more [INAUDIBLE] or Baldwinian there, or something.
In some sense, it seems like it's almost like a parallel route that-- a kind of cultural evolution, basically-- that, I think, when you look in the brain-- if there is a brain correspondence in these systems-- the analogous choices, like how many layers there are, or where the [INAUDIBLE] are, or whether there's convolution-- I think almost everybody believes those are genetic, not learned.
GEOFFREY HINTON: Yeah. I [INAUDIBLE] that.
JOSH TENENBAUM: It does seem-- it's interesting, right? It seems like the things that were most important for getting deep learning to work-- I mean, in addition to, yes, some things are learned-- are actively [INAUDIBLE] evolutionary step.
GEOFFREY HINTON: Yes. Because once you [INAUDIBLE] then what you have to do is figure out what system you need to run it in so that it works. And that's what evolution is doing.
JOSH TENENBAUM: Right.
GEOFFREY HINTON: But what I object to is, or what I'm very suspicious of, is when someone tells me, we have to be born [INAUDIBLE]. Now, I don't know the data you're talking about, so--
SHIMON ULLMAN: Actually, one of the reporting group, [INAUDIBLE] you do not need [INAUDIBLE]. It was more like your [INAUDIBLE]. The point was that [INAUDIBLE] born with [INAUDIBLE]. In this case, for example, you have some learning system that is much more primitive than that. But it knows something about [INAUDIBLE] and leave the rest to [INAUDIBLE].
GEOFFREY HINTON: [INAUDIBLE].
SHIMON ULLMAN: [INAUDIBLE] individual learning [INAUDIBLE] just the right thing without worrying about programming by evolution [INAUDIBLE] solution to [INAUDIBLE].
PATRICK WINSTON: I guess I'm as confused as Bob, because it seems to me that it's very expensive to wire anything in. So therefore, the most useful thing to wire in is the capacity to learn rapidly. The trade off must have something to do with a computational trade off of how expensive is it to import it, versus how efficient is it to learn it. So I think that if we didn't have [INAUDIBLE] elegant ways of learning what a hand looks like, [INAUDIBLE] gradually evolve [INAUDIBLE] because it's a useful thing.
GEOFFREY HINTON: I think some things you can't afford to learn-- like maybe falling off cliffs, or maybe picking up poisonous snakes and things. And there's very good reason why you just can't afford to have one trial and you're dead. And that's not good for learning. And so I'm quite happy to believe there may be [INAUDIBLE] low level systems wired in that give you primitive responses to those things to avoid them.
It seems hard to say you should leave that to learning, because you don't want to fall off a cliff.
PATRICK WINSTON: Well, I admit that things that can be efficiently learned without killing yourself [INAUDIBLE].
Also, let's bring up the question of what intelligence actually is. And that takes us all the way back to Turing, and the Turing test, which I always thought was a bad idea. I thought the Turing test-- I've read Turing's paper maybe 20 times, because I assign it every year.
GEOFFREY HINTON: You don't have to do the assignments yourself.
PATRICK WINSTON: Facing MIT students has a wonderful focusing effect on the mind. Anyway, every time I read it, I become more and more convinced that Turing wasn't actually serious about the Turing test. He really wanted to talk about why all the arguments against computers being intelligent were refutable. And he was a mathematician, so he really just wanted to get around the usual critique of, well, how can you talk about it without defining it?
So the Turing test was a circumnavigation that I think led to several decades of unfortunate distraction. Because in the end, that particular kind of Turing test, which mostly involved reasoning, was not, in my view, anything more than shadows dancing on the wall.
The real defining characteristic, in my view, of human intelligence, is the ability to understand stories. And reasoning is just a special case, [INAUDIBLE]. So I think if Turing had taken Marvin Minsky's course, he wouldn't have even bothered. Because I think Marvin has this notion of suitcase words-- which is a very healthy way, in my view, of thinking about intelligence. Intelligence, like many words-- creativity, emotion, all those kinds of words-- are what Marvin calls suitcase words, because they're really words that cover all kinds of capabilities. And so when you try to define them, or think about what they are, it's better to just say, they are many kinds of things. Let's talk about the different kind you're thinking of.
So if you ask questions like, is Watson intelligent? Then right now, you could say, of course it is, it's intelligent. But it's Jeopardy-- it has a certain kind of intelligence which allows it to do that. But that's only one kind of intelligence. There are many others. So I think what we're all talking about is different aspects of a complex bundle of capabilities that we put this universal label to.
GEOFFREY HINTON: Do you think consciousness is a suitcase concept?
PATRICK WINSTON: Yeah, to a large extent. And I think that, to me-- I have a very simple notion of what consciousness is. It's reciting the story of what's going on around you. It's a kind of a-- well, I feel like when I'm being conscious, I'm telling the story of what's going on. I'm telling the story about how I'm involved in a panel. Josh is sitting next to me. Laura is sitting next to him. There's an audience. I'm saying something. That's the kind of running story that I'm telling to myself. That's how I--
AUDIENCE: Why [INAUDIBLE]? What's the purpose?
PATRICK WINSTON: What's the purpose? [INAUDIBLE] suitcase word, I suppose. Yes?
AUDIENCE: Yes, I've thought a great deal about [INAUDIBLE].
PATRICK WINSTON: Yeah, so, you're reciting what's going on now so you can repeat it in the future.
TOMASO POGGIO: [INAUDIBLE]. I'm curious to know how the panel thinks about a couple of questions there. One is the importance of neuroscience [INAUDIBLE] understanding [INAUDIBLE]. I would like to know what [INAUDIBLE] think [INAUDIBLE].
And the other one is about the role of theory. Right now, deep learning networks are difficult to understand from a theoretical point of view. In fact, [INAUDIBLE]. It's kind of ironic. You know, you have a model of the brain with [INAUDIBLE] but we don't understand why it works as well as it does. Is that the goal?
GEOFFREY HINTON: Is there any a priori reason why you should expect to be able to understand it?
TOMASO POGGIO: Um, no. I mean, biology [INAUDIBLE] principle that [INAUDIBLE]. Evolution is made up of random decisions that [INAUDIBLE] does not make much sense to try to understand with the [INAUDIBLE]. But, you know, people have different opinions.
Who wants to start? Neuroscience, theory. You can choose.
JOSH TENENBAUM: I'll say something about neurons. And again, as a cognitive scientist, I'm not sure by neuroscience, you include cognitive science in that, or not.
TOMASO POGGIO: I was thinking more about the what part.
JOSH TENENBAUM: Yeah, so to me, I can work from the cognitive side. I'm ultimately interested-- I don't think we're going to be satisfied, I'm not going to be satisfied, until I understand how it works in the brains. Because it's just a question of-- as Allen Newell put it-- how can the mind exist in physical universe and physical object in physical universe. It's as great a question as anything on the software, cognitive level.
I think where there's real inspiration, for me-- and here I'm also inspired by some of Geoff's ideas under what's called the Helmholtz machine, and sort of both by some of the kind of work that Tommy, and neuroscientists working with you have done building models of actual brains, and some of Geoff's ideas. To me, when I talk about analysis by synthesis or inverse graphics, or rich or hard problems of scene understanding, the biggest challenge is, how do you get these two things-- how do you get such a rich [INAUDIBLE] of the world so quickly. Right?
We can build these inverse graphic systems-- which, if you're willing to wait long enough, they can perceive incredibly rich structure in the world. To take it to some extreme, if you go to the people at Pixar and you say, snap a picture, like the picture you're taking right here. You go to someone at Pixar and say, OK, we want you guys to build a 3D model that when you, inside your computer graphics system, light it and render it, will look just like this thing that we see here, only it will be an actual 3D physical model. So you can move the people around, and move the things around, and imagine what would happen, and solve the tetrahedron problem, and all that. But they could do that. It'll just take a lot of person power and computer power to do that.
And somehow, it seems to me that the brain is able to do that inverting of graphics just in a few hundred milliseconds. And when you look at the hardware-- and this is where I don't know how a graphics engine, or the inversion machinery is actually implemented in neural circuits-- you get a lot of constraints that you don't get from psychology by looking to see, OK, well here's what these neurons can do at this stage of the [INAUDIBLE] after 10 milliseconds, after 50 milliseconds, after 100 milliseconds.
I mean, you have [INAUDIBLE] classic work, and even some of the most recent developments in it, showing how much people can get a high-level scene understanding in just a relatively few milliseconds. But as far as understanding the machinery that can solve that problem so quickly, that's where I think neuroscience gives us information.
And then for people who don't know Geoff's Helmholtz machine idea, the idea that you can take a rich, generative model, like your graphics model, and learn a bottom-up, very fast approximate inversion of that-- [INAUDIBLE] what he called a recognition model-- that might look a lot like a [INAUDIBLE] pathway in the [INAUDIBLE] stream in the visual system. I think there's something very deep and interesting there to be understood about the combination of learning recognition models to invert something like a graphics engine, and taking constraints from the actual multi-level architecture [INAUDIBLE].
PATRICK WINSTON: Can I try [INAUDIBLE]? The answer, I think, is not always. If you think that Deep Blue is intelligent, in the sense that it beats everybody else in chess, and we got there without understanding how people play chess. [INAUDIBLE] I think if we do understand human intelligence in the human brain, my intuition is that surely will enable us to take many kinds of machine intelligence to another level.
TOMASO POGGIO: What I meant with [INAUDIBLE] kind of intelligence [INAUDIBLE]. Not one of these questions [INAUDIBLE].
PATRICK WINSTON: Then I think we'd be getting into intuitional ground. And many times, intuition is formed by failure to do it any other way. And so my intuition is that that's the right way.
BOB DESIMONE: [INAUDIBLE] argument from another analogy say if you have theories of color vision, it's hard to imagine that your abstract theory of color vision could be useful, without taking into account the properties of the rods and cones. You must understand something about the biology, or you don't understand why the organism works the way it does. Likewise for hearing. There's just so many things that you need to understand.
TOMASO POGGIO: [INAUDIBLE] in principle, you can go from biology to [INAUDIBLE] dynamics [INAUDIBLE]. You cannot go the opposite way, because there are too many possibilities.
SHIMON ULLMAN: [INAUDIBLE] historical mode. Marr was mentioned several times here. [INAUDIBLE] paper by Marr [INAUDIBLE] which was called, "AI: a personal view," [INAUDIBLE] in which he talks about the theory of [INAUDIBLE] nice, concise [INAUDIBLE] from which you can derive everything. And I'm not sure what was the example of type two. [INAUDIBLE].
BOB DESIMONE: [INAUDIBLE].
SHIMON ULLMAN: [INAUDIBLE] what can we do? It's very difficult [INAUDIBLE] reduce [INAUDIBLE]
TOMASO POGGIO: Geoff, what do you think? [INAUDIBLE].
GEOFFREY HINTON: Let me answer the other question about the [INAUDIBLE]. I mean, my hope has always been that, if you want to build a bridge between neuroscience and AI, [INAUDIBLE], I want to figure out how a bunch of neurons can be intelligent. And you can work from the neurons, or you can work from, how do you make things that are smart.
And the hope is that the task will give you quite a few constraints. And just the problem of learning to do things will give you some constraints. And then you'll be able to build this bridge from both ends. Neuroscientists will be building from one end. People like me will be building from the other end.
When neuroscientists are constrained to make [INAUDIBLE] from empirical data-- there's no use having neuroscience theory that posits that every neuron has two [INAUDIBLE] that have different signals. It would be very convenient if they did. But neurons don't seem to be like that. Although, I wouldn't be surprised in 10 years' time if they [INAUDIBLE].
From the sort of AI end, you want to make things that work. And [INAUDIBLE] hope is, when you get these two things close enough, you'll notice that to maybe work better, you need to make it more like the other thing. It'll be like magnets. They get close enough, and once they get close enough, they go clunk. That's the hope. And so this idea of capsules is saying, look, [INAUDIBLE] are very unlike the brain, in that they don't have this [INAUDIBLE] architecture cortex [INAUDIBLE] going on in the mini column, and there's a small [INAUDIBLE] compared to what's going on.
But maybe if you're pushing in that direction, I'll know better. So my constraint is only going directions where [INAUDIBLE] make it work better. And if that makes it more like the brain, that's really nuts. But if it makes it less like the brain, we'll try that, too.
But the hope is that eventually we'll get it so that it's obvious you need to make these go the direction of each other-- you'll get a better neuroscience or you'll get things that work better. In the end, we all believe that if you really understood what was going on in the brain, you could make devices that work really well.
PATRICK WINSTON: [INAUDIBLE] brought up the Marr paper on "AI: A personal view." I read that 20 times, too-- same reason. But what Marr said in the end was he expected that some parts of the theory will be type one, and some parts will be type two. Some parts will have elegant, simple, theoretically elegant solutions. Other parts may be complex messes which are their own best explanation.
I'm hoping for more of the type one. I'm sure there's type two. But if we're all type two-- if we're all explained by a simple backprop net, I think I'd go sell insurance or something else instead.
TOMASO POGGIO: [INAUDIBLE] type two or type one?
PATRICK WINSTON: Type two is the kind where you don't really understand it, it just works.
GEOFFREY HINTON: The point about a simple backprop net is a very simple equation. [INAUDIBLE] with me. So, simple backprop is just like a [INAUDIBLE] turbulence with [INAUDIBLE] equations. Nice equations are really simple equations. On one level, you really understand turbulence, because you run the simulation, and it looks just like turbulence.
PATRICK WINSTON: Well, yeah, it does. But then you change the way the weights are randomized, it works 1% better. Then you change them a little different way, and they work another 1% better. Your papers are full of stuff like that.
GEOFFREY HINTON: The basic principle of what you get when--
JOSH TENENBAUM: When you're saying that it opens up the backprop, you mean [INAUDIBLE] in the actual framework? [INAUDIBLE] backprop in the [INAUDIBLE].
PATRICK WINSTON: What I'm saying is, I hope it turns out like [INAUDIBLE]. [INAUDIBLE] is all about structure and how structure constraint has to be there in order to constrain the learning to make it all work. And that I find elegant. The plain old backprop never struck me as particularly informing in the sense that it helps me to understand something. Yes, I understand how backprop works.
GEOFFREY HINTON: [INAUDIBLE].
PATRICK WINSTON: That is how [INAUDIBLE].
TOMASO POGGIO: What about the analogy from biologists [INAUDIBLE]. There is one idea which is very elegant, which is the double helix. That's type one. It explains [INAUDIBLE].
JOSH TENENBAUM: Maybe that's [INAUDIBLE] natural selection, and a few other principles of selection.
TOMASO POGGIO: It's not the only thing.
JOSH TENENBAUM: Right, right. But in some sense, that is the only algorithm we know for a fact builds in intelligence, right? The physical substrate of the double helix, and the whole [INAUDIBLE] machinery [INAUDIBLE] and multi-celled organisms and-- what?
TOMASO POGGIO: The way they replicate.
JOSH TENENBAUM: The way they replicate, and build a machine that builds itself, and builds a learning machine. And I think the modern view of evolution is that it is an algorithmic process. It's a random one. It's not completely repeatable, predictable, or always optimizing. But that really is an algorithmic process that builds intelligence.
And I hear, maybe echoing something Shimon said-- and I know some things you [INAUDIBLE] up too, Tommy, and yourself on the Baldwin thing-- I think, ultimately, if we had a theory of building, of adaptive systems, or a theory of building intelligence that involved both a theory of how evolution works, and learning how they work together, and their interactions, and the different multi-scale things that go on-- just like in humans there's learning to learn, there's kind of evolving to evolve.
Evolution builds new machinery that can evolve better. I think, ultimately, if we had an understanding of that-- now that would be a very satisfying theory. And even if all the type one stuff was kind of offloaded into the way evolution works, and some learning algorithms, then we couldn't actually understand the actual circuitry of the brain in action on a [INAUDIBLE], then that would still be pretty satisfying.
TOMASO POGGIO: I think we all agree about this.
GEOFFREY HINTON: You have to have development. I think it's very important that you have-- you have evolution in the loop, and then you have development, which is a much faster loop. And then you have learning machines.
JOSH TENENBAUM: Yes, that's what I meant by the machine that builds itself.