CBMM10 Panel: Language and Thought
November 6, 2023
October 7, 2023
Ev Fedorenko, Phillip Isola
Is natural language the language of thought? LLMs as models of human language and thought. Are LLMs aligned with neuroscience and with human behavior? What is still missing?
Panel Chair: J. Tenenbaum
Panelists: E. Fedorenko, S. Gershman, P. Isola, E. Spelke, S. Ullman
JOSH TENENBAUM: Thank you very much, everyone, for joining this. Thanks to the other organizers, Tommy, Jim, Daniela, all the amazing presentations yesterday and this morning. It's really exciting. And we have a fantastic panel here for you.
I'll just say, when we were originally planning this out, I think at some point the idea was it would just be discussion, no slides. And then this thing happens where someone wants to show slides, and then more people want to show slides. But our panel here has bravely resisted the urge. None of us will be showing slides. We will all have an opening statement.
The question that we're getting at here, which the name of this panel is about language and thought, right? It is a remarkable thing about human language and thought that you can just stand up and talk about what you think. And we hope that we will successfully communicate, and when you ask questions and we answer it back and forth, that real knowledge and understanding will grow and unfold before us. That is a distinctive thing about human language and human thinking.
As awesome as vision is, if all the panelists on the previous session had just pantomimed, standing up in front of you, it probably wouldn't have been as successful a discussion on vision, at least. I believe they could have had just as wonderful a discussion if all they were doing was speaking, however.
OK. So why are we interested in language and thought? Well, there are so many reasons why this aspect of human intelligence is distinctive. What I've just talked about is one, but related to that, as Tommy alluded to at the beginning, in every session, even if it wasn't about ChatGPT and LLMs, they've come up somewhere.
In the last couple of years-- really, even in the last year, as far as the broader public is concerned-- the whole landscape and conversation around AI and intelligence more generally has transformed, because, for the first time, we don't just have one example of intelligence in the known universe. We have what might be, or what some people might think, or what maybe is, a second form of human-like intelligence.
At least we have systems-- as Tommy said, ChatGPT passes the Turing test. I would say ChatGPT can often pass what you might call a casual, non-adversarial, or non-collaborative Turing test. But it's easy to come up with Turing tests, even one-word Turing tests, as Tomer and others have done, that ChatGPT or a current large language model doesn't pass.
But just in terms of having an artifact, a machine system, that you can have a conversation with about any topic, and it sort of seems quite sensible in a lot of ways, it's remarkable. It's also the way that AI has always presented itself to the popular imagination, whether it's Turing's original paper-- which really charted the course of what AI should be as a discipline, or what it was aiming for, and introduced the Turing test as a completely legible idea of how you would measure whether something was intelligent-- or pretty much every science fiction movie, Star Trek, Star Wars, everything else. How does AI present itself to humans? How do humans interact with AI?
It's not by coding. It's not by gesturing. It's by talking to it and having a conversation. So it is quite remarkable that we have systems like this. And what the panel is going to talk about-- I mean, this isn't an LLM panel, despite some of the questions up there, though, surely, they will be part of the conversation. Really, this is about language and thought in brains, minds, and machines.
But certainly, in the present moment, it's quite useful and valuable to reflect on the similarities and differences-- and we'll certainly do a lot of that-- between these models of language and thought and the actual original sources of language and thought, which are the human mind and brain. OK.
So I'm Josh Tenenbaum, for those of you who don't know me, I'm the moderator and convener. And I will just say a few more brief words of introduction to lay out not so much a set of questions, but a set of issues that I think the panel will be discussing. And then, again, we will have roughly five-minute presentations for each one, and then we'll go into just a more free-form discussion and we hope you guys will participate in that, too.
Just to lay a little bit of my own cards on the table as moderator and convener, I was really inspired, actually, by the spirit of the last panel, especially Talia's stuff. I think you have to look at any of these models with a sense of awe and wonder, and also with a critical perspective. My view is about systems like GPT-4, especially-- if we're talking about language and thought, no other large language model is in the same league as GPT-4 in terms of how it's really been built to capture kinds of thinking. Personally, I would say GPT-4, as well as, certainly, other large language models-- they aren't intelligent.
They don't think, but they interestingly capture certain aspects of intelligence and certain aspects of thinking, and not others. And they do that in ways that we don't fully understand, and they do that in ways that, to the extent we do understand, are very interestingly different, as well as potentially interestingly similar to the way it works in human brains and minds. So I think the differences and the similarities are both interesting, and we'd like to explore those.
The themes that we'd like to explore around language and thought-- there was a question that Susan Epstein asked. Some of you nicely submitted questions in advance. We read those. We'll probably touch on all of them, to some extent. But Susan asked, what does it mean to actually understand language? And that's, I think, the ultimate question: understanding.
As Tommy pointed out, when he was giving the history of CBMM, we started off more than 10 years ago with this intelligence initiative, and in our very first meeting there, Tommy introduced the topic of intelligence by saying, well, what is intelligence? Well, let's look to the Latin root intelligere, which means, to understand.
Intelligence is often talked about as the ability for a system to assess its situation and respond effectively and adaptively in a wide range of situations, and the more adaptive, effective, and wide-ranging, the more intelligent. But for humans, we often think about understanding. Now, that's a really complex word. What is it to understand? What is it to understand the world? What is it to understand language?
That's really what we're getting at here. And I'm not going to give you an answer, but that's what we should be talking about. So I'd like all the panelists, to some extent, to say, well, what does it mean to understand? Part of that is, what does it mean to think?
So there's conventional ideas in cognitive science that have to do with thinking as like having a model of the world-- What is a world model, exactly? There's a lot of debate about what that means-- and using that to form beliefs and update beliefs and plan actions to achieve goals. So those are all traditional aspects of thinking, and we'd like to understand how those work perhaps similarly or different in brains, minds, and today's machines.
Another part of understanding, though, is understanding language: the words and the phrases and the sentences that I'm generating and turning into patterns of pressure waves in air, from which you're somehow reconstructing some notion of meaning. How does that work? There are traditional notions of reference-- how does a word refer to something in the world? What is the notion of a concept, which might be like a percept?
Our panel here consists of people who study language and vision and the thinking stuff in between. And one notion of meaning that people have put out there is how words, for example, and images might relate to each other. But cognitive scientists have often talked about concepts-- something that isn't just a sensory stream, or even an abstraction from multiple sensory streams, but something else.
So, I hope maybe Liz and Ev and others will talk about that, or Sam or anybody. I hope we'll talk about, what is the notion of a concept? What does it have to do with word meaning? How does that relate to perceptual experience? And how does it also go beyond it, or make it possible, even?
OK, key other themes that have come up in other panels that I think our panelists will address are really questions of how we get the linguistic abilities that we're putting on display here-- the human adult capacity to speak and understand and think, and to do those in and outside of language.
How do we get there? Developmentally, like starting from birth? Evolutionarily, including millions of years that happened in human evolution and well before humans were a distinct species? There's obviously some limited things we can say about that, but there's a lot we can say, looking at comparative studies and other kinds of evolutionary approaches.
And we really shouldn't forget cultural evolution, because human language is the product not only of biological evolution, but of cultural evolution and interplays between those. And human thinking in all the ways that it is singularly transformed by language and enabled also is very much clearly the product of cultural evolution. So I'd like us to talk about what are similarities and differences between today's AI and brains and minds on, let's call it, learning, development, evolution, the whole multiple timescales of how the trajectory gets to the adult state.
There are also issues that Leslie Kaelbling raised yesterday, which are really important ones, like agency. What is an agent? Part of why I personally think it's not really right to say that a language model-- as well as many other kinds of models, but certainly a language model-- is intelligent is that intelligence, to me, is a property of an agent, whereas language models aren't agents. Although they can increasingly be built to imitate them, and people are trying to build agents out of language models, and that is both very interesting and potentially very scary. But I hope that's a thing that we'll talk about.
And then there are so many other topics that have come up, like emotions or consciousness or even morality. I mean, again, some of these might just be category mistakes to talk about in the context of language models, but in the context of human language and thought, absolutely. Learning words for talking about our emotions transforms our emotional experience.
I'm not an expert on consciousness. I don't know if people want to talk about that, but let's just say morality, for example, I mean, it's absolutely clear that there's a deep relation between how we think about morals, how we talk about them, what we say, and what we don't say. And just across all the areas of human experience, language is our singularly best way of expressing our understanding of what's going on in our brains.
And so, it shouldn't be a surprise that models trained on very large amounts of data should start to show some of their own kinds of understanding of these. But, again, looking at the similarities and differences across brains, minds, and machines will be, I think, extremely revealing about many aspects of humanity.
And I also hope some of the panelists, depending on where we get to at the end, will talk about possible computational approaches to issues of language and thought that aren't just based on training very large statistical learning models on very large amounts of data. There's a longer tradition that goes back decades, just like in other areas of intelligence. And it's quite possible-- I think it's quite likely-- that there's, again, kinds of computational ideas that are not simply statistical learning on large amounts of data that will enable us on both the engineering and the science side to get even more deeply at the themes of language and thought than today's large neural networks, as remarkable as they are or have been.
OK, so, enough of my intro comments. We're going to start off with Elizabeth Spelke. And, Liz, why don't you come up here? I'll just briefly and informally introduce Liz. Liz has been a professor at a number of distinguished institutions, including here. I met her first when she was a visiting and then a faculty member here while I was in grad school, and I've been inspired by her work and her thinking and her example just as deep as I could possibly be inspired by anyone.
So it's a great privilege to be able to know you and to work with you and to have you in this discussion. For those of you who don't know Liz's work, she's probably the world's leading expert on the infant mind, and she just wrote probably the best book you can read on this, What Babies Know-- you can read blurbs by Nancy and me on the back of it more or less saying so-- so you should all buy the book. And she's not signing copies outside, although, I suppose if you brought yours, she might.
But anyway, Liz is just a fantastic thinker about how cognition develops, but she's especially interested in language and thought. So we're really happy to have her talking here.
ELIZABETH SPELKE: Thanks so much, Josh. And Tommy and everyone, happy birthday to CBMM. It's been so much fun working with all of you for the last 10 years. It's really been amazing. And I could easily use up all of my five minutes just giving you examples of all the wonderful ways in which that's true.
Instead of doing that, I want to start in a different way, by going back. In thinking about what I wanted to say today, I went back to the first time I ever came to MIT, which was in 1982. MIT had just established its brand new Center for Cognitive Science in a building that I have missed ever since it was destroyed, Building 20.
And I was invited here as a visitor for that whole year. It was my very first sabbatical. And I was thrilled to be here, because I had been studying, really, exclusively visual perception and its development in young infants, and I was really eager to study language and thought. And I walked into the middle of the most amazing and vibrant debates over the nature of language and thought.
But I also came to realize pretty quickly that the debates were getting more and more heated because the fundamental questions that were being debated, we were so far from being able to actually answer them. For example, I'm thinking about questions like, is there a language of thought? And if so, what is that language like? What are its concepts? What are its logical inference structures, how do concepts get combined, and so forth?
Another question, do speakers of different languages think differently? How does the language that we speak influence the sorts of things that we think about and how we think about them? And how do we come to be creatures who think and creatures who communicate by language?
Now, I think today we know a lot more about language and about thinking than we knew back in 1982, but I think the big questions are still open. And one way to see that is to look at the current questions that are being asked about large language models.
I was a little taken aback when I saw the questions that had been submitted, which Josh shared with us yesterday. They were, I think, all, in one way or another, questions about large language models, and all, in one way or another, about as open, I think, as questions about language and thought were at MIT in 1982.
Do they use language, and does language mean the same things for them that it means for us? Do they think the way we do, or are they just much beefed up, more convincing versions of Eliza with no there there underneath all of the things that they're saying to us? Or are they something else altogether? Well, I have a strategy when questions get too hard for me to address, and that is to ask babies, to turn to research on infants and see what perspective that might lead to on these questions.
So babies are born into wildly different, wildly varying physical and cultural and social environments, and they learn in all of them. They begin to perceive, explore, and learn about those environments from birth, from the moment of birth-- in one case I'll talk about, before birth-- and they all start out by perceiving the environment in the same kinds of ways.
They perceive that there's a surface out there that people and other animals move around on, going from place to place; that those places bear certain geometric relations of distance and direction to one another; and that those places, moreover, are furnished with objects. Objects are very diverse in different parts of the world, but they all have certain properties in common.
They're solid. They're persisting over time, including when they're hidden, and they interact with each other on contact. Some of those objects are agents, and the agents that are out there detect things at a distance, formulate goals, and plan actions that achieve those goals, and do so efficiently, but constrained both by the physical environment that they're acting in and by the things that they can see in that environment.
And some of those agents are social beings who experience the world, and who share their experiences with one another by engaging with them, and do so within social networks that infants start learning about already in the first year of life.
So, what I think these examples suggest is that infants already come into the world possessing some of the deepest, most abstract concepts that continue to guide our thinking throughout our lives, and those most abstract concepts are also probably the most useful concepts for a learner to have, because they're going to apply in any environment. No matter how crazy our environments get as our technologies evolve, there are going to be objects in them and agents and surfaces to navigate over and social networks of people acting partly together.
Two other systems that babies are born with are systems for representing number and geometry that serve as foundations for mathematics, and those are going to apply in all of these environments as well.
So, human infants use these initial abstract notions that they have to learn about the particular environment that they've been born into, the particular sorts of objects that they're going to be encountering, the particular sorts of places and how people get from one to another, and so forth. And one of the things they learn is the language that the people around them are speaking.
Now, language learning starts in the womb. We know this thanks to the work of Jacques Mehler, who was another lucky visitor to MIT the year that I was there. Even as fetuses, infants are starting to learn their native language: they hear their mother's speech by bone conduction, and by the time they're born, they prefer to listen to languages that have the prosodic structure of the mother's speech-- either their own language, or other languages, or languages with everything else filtered out except for the prosody.
So Jacques and his academic children studied that ability, and a number of other capacities for language learning exhibited by very young infants. One thing that they found is that infants are also sensitive at birth to a distinction that's fundamental to the syntax of natural languages, and also to the meanings of the sentences that express thoughts in natural language.
And that's the distinction between content words, like nouns, verbs, and adjectives that refer to people, events, and properties, and function words that don't directly refer to anything in the world, but have grammatical roles and semantic roles that indicate the relationships between the reference of the content words within a phrase.
Now, evidence that babies are sensitive to this distinction goes all the way down to newborn infants. They use word frequency and stress to distinguish content words from function words. And very early, as early as about four months of age, they're starting to learn how function words relate to each other across a distance in phrases.
So if you have a phrase like they are dancing, you've got a content word, dance, there between a function word, are, and a morpheme, -ing. And in another phrase, like they have finished, again, there's a different function word and a different morpheme surrounding the verb.
And they presented babies who were actually German-hearing babies with a whole string of sentences like that in Italian, for about 20 minutes. And at the end of those 20 minutes, they introduced violations, violations equivalent to saying things like, they are danced, where you're taking the same functional morphemes and you're mixing them up in the wrong ways. And the German-learning babies responded to the violations in Italian with a kind of characteristic brain response to semantic or syntactic incongruity.
Now, they also tested German adults who did not respond to the incongruity. Function words and their relationships are the hardest things for adult second language learners to learn, but four-month-old infants were quite ready to learn them in a language that they were encountering for the first time over a 20 minute session.
Now, by six months of age, infants have used this distinction between function and content words to learn the basic ordering of words within phrases and within sentences in their language-- to learn whether their language, like English, puts subjects before predicates and, within a noun phrase, puts the determiner, like the article, before the noun, or whether it's like Japanese, doing all of that in reverse.
Now, interestingly, kids are not only learning this in their native language, but if they're in a bilingual environment, or if they're presented with non-native ordering within the laboratory, they're able to learn that as well. So they're very attentive to these relationships. They're learning about them quickly.
Now, all of this learning is occurring before babies have settled on the actual sounds and meanings of any of the content words in their language. And, at Harvard, our new colleague, Elika Bergelson, has been studying this for some time. I'll give you just one example: nose. So she takes a six-month-old baby, shows them a picture of a nose, a picture of a bottle, and says, look, a nose. And they look at the nose. So already that's looking good. They're six months old. They're starting to learn noses have something to do with that picture.
But then if she asks, look at the nays, they look at the nose just as much as they do when she said nose. And if, instead of presenting a nose and a bottle, she presents a nose and a mouth, they look equally at the two. What is this saying? They've gained partial knowledge of the meaning of this word. But now she can go on and ask, when have they really worked out the meaning? And when have they really worked out the sound? Not until 14 months.
So, learning is just glacially slow for figuring out the actual words-- which are, of course, the first thing that the large language models get-- and extremely fast for figuring out the abstract properties of the language, properties that apply to all languages and that are fundamental to how language conveys meaning.
So, I think that's all I want to say about babies' language learning, but I want to go back from that and from the evidence for core knowledge in infancy, knowledge of these abstract properties of places and things and people that research on infants reveals, and ask, how does this compare to what we see in large language models?
And it seems to me that it suggests that they're learning language in a profoundly different way from the way in which infants are. First of all, they start out with the specific words where infants are starting out with these universal and much more general, abstract properties. But second, infants, when they actually do figure out the meanings of different content words, draw on systems of core knowledge for doing that.
And I submit that core knowledge will not be found in a large language model for two reasons. One is that, although it continues to exist in us as adults, it is completely unconscious. If you go back over the history of what people thought was in the minds of infants, you won't find any claims about core knowledge. Nobody thinks babies have it. Nobody thinks they have it, themselves.
And what's more, if you look at large language corpora, you won't find any words or expressions that capture representations from core knowledge. We don't talk about core concepts, because we can't, because they're unconscious, but also because we don't need to, because they're universal. They're in every baby that's been born, and they remain present and functional in all of us.
So it would be extremely inefficient-- we're supposed to speak to each other efficiently-- to be talking about the contents of core knowledge. We'll say things like, I'm walking to area four. We won't say things like, I'm causing my body to move over the ground on an efficient, unobstructed path from my current location to a location at such and such a distance and direction from where I am now.
We don't talk like that. Machines are learning from the things that we say, and I think that, by doing that, they're going to be missing the content that really forms the foundations of our common sense understanding of the world.
But I do think there's one question that's going to be more important to answer, which this very brief excursion into large language models maybe raises. Core knowledge of places, objects, agents, social beings, number, and geometry is entirely shared with non-human animals.
All of the research that's been done on infants, in one way or another, has been done on at least some non-human species, and they all have our core knowledge-- or maybe we should say we have theirs-- down the line. But none of those animals learn language in the ways that human infants do, not even the dogs who are hearing language from their beloved dog owners, at least as much as human infants are hearing them and way more than human infants are hearing language from people living in cultures where it's considered crazy to even speak to an infant, since they can't talk back to you.
No other animal learns what preschool children learn, either-- what preschool children spontaneously learn about the environment that they're growing up in or what children learn in school, either their skills like reading and calculation, or whole new domains of knowledge, like history and cosmology, that take them into spaces that you can't actually perceive directly.
And no other animal possesses this ability that we have as adults to imagine our world as different than it actually is, and then to develop technologies to create the worlds that we've imagined. So, the question that I want to put on the table-- and I hope we can spend some time talking about this, and not entirely about LLMs-- is, what distinguishes us from other animals and makes all of this possible?
And one way of asking that question, I think, is to ask, what would a machine be capable of if, instead of feeding it a heavy diet of human-produced language, we instead endowed it with perceptual systems, as people have been discussing already today, with core knowledge, with a zest for exploring and learning, and with a predisposition to attend to the aspects of human languages that babies attend to. I'll stop there.
JOSH TENENBAUM: OK. So, thank you, Liz, and thank you for letting me interrupt a little bit. Personally, I could have listened to Liz all day talk about that stuff. And, again, you should read some of this in her book in the last two chapters, and a book definitely forthcoming, perhaps soon.
But let's take up the challenges that she put out, the questions there, about what infants start with and how infants learn language, and what that trajectory really looks like-- because, like the development of perception, this can be studied empirically, and we actually have so much data on the development of language of the kind you're citing. So that's really striking. And we'll come back to those questions.
So let's next move to Ev. Ev Fedorenko is a colleague here in BCS. I've known her for many years. And in the interest of time, I'll mostly just hand over to Ev, but I would say, she is one of the world's experts-- in my biased view, probably the world's leading expert. I'm so biased, but I'm so lucky to have what I consider to be the world's leading experts on their topics talking here: one on the development of language and thought, and Ev, who studies language and thought in the human brain. So let me turn it over to Ev.
EV FEDORENKO: I'm going to set a timer.
JOSH TENENBAUM: Very good.
EV FEDORENKO: All right, so a thing that you often hear people say is that eyes are the window to the soul. And, no offense to the eyes, but I think language is a way more powerful window to the soul, because the eyes can tell you how somebody feels or that somebody is thinking, but, using language, I can tell you a lot of rich propositional content like what I'm thinking about, what I believe, what I dread, and things like that.
And given this very tight relationship between language and thought, there has been some conflation going on-- there have been a lot of heated debates, from ancient philosophers to modern-day linguists and psychologists and neuroscientists.
And there are some questions in this space, like, do language and some aspects of thought draw on the very same biological machinery? Can we think without language? Did we evolve language, as a species, to help us think certain thoughts? These are the kinds of questions that brought me to the field a bunch of years ago and, eventually, to work with Nancy Kanwisher, using methods that we now have, as a species, to answer them quite directly.
So, the methods that we have to answer the question of the relationship between language and thought basically come in two flavors. One is we can use brain imaging tools, which allow us, with good, careful experimentation, to zoom in on a particular perceptual or cognitive process, and then ask whether a different cognitive process engages the same bit of machinery.
So, for example, using a few minutes of scanning, I can find the language-responsive areas in any one of your brains, and then I can ask, do these same brain regions work hard when I ask you to do certain kinds of thinking-- like solving a math problem, or solving a logic puzzle, or imagining and planning some complex set of actions in the future, or remembering some rich event from the past, or whatever host of things is encompassed under this broad umbrella of thinking?
And, somewhat strikingly to me, because I came with different beliefs into science, but I am a true empiricist, time and again, we've now found that any thinking task we give participants to do engages systems that are completely distinct from the system that is active when you produce and understand language. So that's one source of evidence we have.
Another powerful source of evidence comes from patients with brain damage. So, sometimes there's natural experiments that happen that can wipe out certain parts of our brains due to stroke or degeneration, and one very interesting class of patients with brain damage are patients with severe aphasia.
So you may have heard of Broca's aphasia or Wernicke's aphasia. These are kinds of aphasias that impair some aspects of language, but there are also some patients who basically lose the entire language system. This is typically due to a massive left hemisphere stroke, which wipes out all of the components of the language system.
So these patients can't understand or produce language. They lost their language knowledge. And you can ask, can these patients still solve a math problem? Can they figure out a Sudoku puzzle? Can they play a round of chess? And, amazingly, it turns out that they can. Of course, sometimes you have patients who lose both language and some of these aspects of thought, because a lot of things are close by in the brain and you can wipe out multiple things, but these dissociations between strikingly impaired language and preserved intelligence are really quite telling.
There is a lot more to say, but in the interest of having a discussion, maybe the last thing I'll say is that one possibility some people still entertain is that we evolved language to think certain kinds of things that we couldn't think otherwise. But given that language doesn't seem to be the underlying format of our thoughts, you can also tell a very powerful story, a story that many people have been telling for centuries, which is that we evolved language to talk to each other.
And this ability to communicate thoughts and knowledge, building knowledge upon knowledge across generations, can very well enable the kinds of cultural revolution that has happened over the last bunch of centuries, allowing us to send rockets into space and figure out how DNA works and things like that.
So I think I'll just stop here and let others talk.
JOSH TENENBAUM: Thanks. That was very sensible. I will say, though, Ev, we should come back to this. Or I don't know if you even just want to say 30 seconds more on this. Do you want to say anything about language and thought in AI and what your research says about that?
EV FEDORENKO: I could. So, I mean, what I would say is that language models have been, obviously, very successful. And-- because language reflects a lot about the world, because we use language to talk about the world, and those are the corpora that are fed into the models, so of course they inherit some knowledge about the world-- there has been this perception that, in addition to linguistic ability, some of these language models also inherit some world knowledge or reasoning capacity.
And that's a reasonable inference based on our experience with language and thought, right? We're not used to interacting with entities that can produce fluent passages about any topic yet don't have anything else there beyond patterns they've picked up. But, of course, as Liz already alluded to, and Josh, too, human intelligence goes much beyond what you can glean from language.
And some of the kind of most exciting directions in AI, I think, are directions that aim to try to solve complex tasks by combining our linguistic prowess with some other capacities, which will have different objectives than predicting the next word. That's not the only thing that drives human learning and human processing, but I'm sure we'll talk more about this in the discussion.
JOSH TENENBAUM: Yeah, thank you. And again, I think we could come back to, depending on how things go, actually-- I mean, there's a parallel line of work to the stuff that we heard a lot about in the morning on using neural net language models and connecting the brain.
Ev's lab has also done things on that. So there are interesting parallels and also dis-analogies, but also looking at how well you can actually predict the kinds of things that Jim and Nancy and others have looked at in the ventral stream and vision in the language areas that Ev has carved out. So to be continued on that.
Let's now have Shimon Ullman. Shimon also has been in and around MIT for many years in many ways. I met him first when I first came as a summer student. He's probably best known in computer vision, or what we used to call computational vision which was the--
Really, the integrated study of the computational basis of visual perception in humans and machines. Shimon's been working on that for 50 years, I think it's fair to say, and has been a world leader in multiple waves of the field. He's been especially interested in where visual perception links up with kinds of thinking, whether it's his classic work on visual routines or his counter-streams model of top-down and bottom-up processing, some of the first models there-- also where vision connects with other aspects of thinking, social cognition, and language.
So it's great to have him here talking about language and thought, and I'll just turn it over to Shimon.
SHIMON ULLMAN: Thank you, Josh. So, I'll focus, indeed, on vision and on visual understanding within this broad scope of language and thought. And we are, in fact, in this field of vision at a particularly interesting and useful moment, in that we had vision without language for a long time. We studied vision by developing pure vision models.
And very recently-- it's about two years ago. Well, language models appeared a bit earlier, but vision-language models appeared just about two years ago. So we have models of vision without language, and then, in the last two years, we have vision combined intimately with language, and we can look at the differences and try to infer certain things: what happened, in what ways language helps vision, how they interact, and so on. So it's a new field, or a new direction.
So there are quite a number of open questions, but we already have some perspective on this: we can compare pure vision and vision-language models. So let me go a little bit through this. What did we find? Some of the thoughts that come out of the experience with this transition from pure vision to combining it with language.
Now, pure vision, or pattern recognition, is what has been done from AlexNet and ImageNet onward, through many models, from simple models to vision transformers and so on. And when we train them, they do intelligent vision, or some smart, very well-performing vision, quite remarkably.
And you can train them, and they can classify correctly and identify objects and attributes of the objects and relations between objects. You can carry them to more complex tasks like action recognition. They can recognize complex actions, like drinking or fixing bicycles, and so on. So that's quite impressive and can be useful for medical images, and even autonomous driving can work with these models.
But, on the other hand, they're very limited, and in some sense, they do not really understand and they do not really generalize. For example, if you teach them the action drinking and you show them drinking with glasses and cups, they will not generalize and understand that a bottle is also used for drinking, or a water fountain, and so on. They do not really understand that drinking has to do with bringing liquid to the mouth rather than holding a particular object next to your mouth.
So they do not really understand the world, and they did not lead to scene understanding. What was done in pure computer vision was to try to extract from an image the full scene graph, everything in the scene, all the objects, relations, and attributes, and hope that this will lead somehow to scene understanding. Maybe cognition will take over, look at this very extensive representation and figure out what's going on.
But, in reality, I think that vision really works intimately all the time with cognition, with understanding, and it cannot be separated. Now, in vision-language models, suddenly we had a single model with two streams-- a visual stream and a language stream-- that can be combined. And suddenly, we could have more understanding. At least functionally, it seemed, you get things like generating good captions for images.
So it's not just recognizing objects: it takes an image and produces some text, which describes what's in the image, and does it pretty well. And, perhaps even more impressively, being engaged in visual question answering. So you have an image and some questions about the image, and the model generates some appropriate answers. And this is a big step forward. Instead of generating a full scene structure, you can have a question about some specific aspect that you are interested in-- you have a goal-- and you can go after it and get an answer to that particular question.
But I suspect that it's doing it by a shortcut. And what I mean is that it's a sort of thinking with embeddings-- and I'll try to explain what I mean. At least in the earlier models-- things changed a little bit later. In the original model, called CLIP, you had these two streams, a vision stream and a language stream, each one producing a top-level representation, an embedding of what the text is or what the image is.
So you have two vectors, one summarizing the image, one summarizing the text, and during learning you try to make them as similar as you can, working directly on these embeddings. And once you have these two vectors, you can start to do things that look quite intelligent.
For example, you can do semantic retrieval of images. You describe a sentence-- I want an image showing this and this and this-- and hope the model can find it. And the way it does it, it simply compares the text vector and looks for the image vector that is the most similar, matches them, and takes that image out.
It can generate captions in a similar way. You give it an image, it gets the summary of the image, and then it looks-- usually these tests were done with a predetermined set of, say, 10,000 different possible answers-- and it will pick the possible answer by directly comparing these summary embeddings of complete situations.
But this is more of what Tommy calls associative memory; it is thinking directly with embeddings. And I think that explains much of what vision-language models do. And I think it's something that we use as well, as humans.
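The retrieval-by-embedding shortcut Shimon describes can be sketched in a few lines. This is a toy illustration, not actual CLIP code: the vectors and the `retrieve` helper are hypothetical stand-ins for the learned image and text embeddings, compared here by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(text_emb, image_embs):
    """Pick the image whose embedding best matches the text embedding."""
    return max(range(len(image_embs)), key=lambda i: cosine(text_emb, image_embs[i]))

# Toy embeddings standing in for CLIP's learned vectors.
text = [0.9, 0.1, 0.0]
images = [[0.1, 0.8, 0.1],    # image of something else
          [0.85, 0.15, 0.05], # image matching the text query
          [0.0, 0.2, 0.9]]

retrieve(text, images)  # → 1: the second image is closest to the text
```

Caption generation by retrieval works the same way in reverse: compare one image embedding against a fixed bank of candidate caption embeddings and pick the nearest.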
JOSH TENENBAUM: You want a minute?
SHIMON ULLMAN: Yeah, OK. I will say that a very current direction-- in fact, not so current: I'll refer to our own work from a couple of years ago, a paper with a more recent version that appeared a week or two ago in PNAS-- in which we try to get rich scene understanding by applying a particular, appropriate program to the visual scene, a sort of set of instructions to the visual model, goal-directed, to extract what you need.
For example, if you ask, is everybody in the room now looking at the speaker? The model will do it automatically. And it will do it not by being trained on 10,000 examples of people listening to speakers and so on, but by understanding what the components are: it will go person by person, extract the direction of gaze, and check if it intersects with the speaker. And that's a way to do it without ever being trained on people looking at the speaker.
And if you think about it, the same logical structure will also tell you how to find the tallest bottle on a table. You actually go through very similar logical steps in order to solve this.
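Both queries share the same logical skeleton: verify a simple visual predicate per object, then apply a logical or arithmetic operator over the results. A minimal sketch, where the dictionaries are hypothetical symbolic records standing in for what a vision front end would extract:

```python
def everyone_looking_at(people, target):
    # The logic step: a "for all" applied over per-person visual predicates.
    return all(p["gaze_target"] == target for p in people)

def tallest(objects, kind):
    # Same skeleton: filter by a visual predicate, then take a maximum.
    return max((o for o in objects if o["kind"] == kind),
               key=lambda o: o["height"])

people = [{"gaze_target": "speaker"},
          {"gaze_target": "speaker"},
          {"gaze_target": "phone"}]
scene = [{"kind": "bottle", "height": 30},
         {"kind": "bottle", "height": 22},
         {"kind": "glass", "height": 12}]

everyone_looking_at(people, "speaker")  # → False: one person is looking away
tallest(scene, "bottle")["height"]      # → 30
```

Neither function was "trained" on the composite concept; each is composed from simpler predicates plus logical operators, which is the point Shimon is making.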
Let me finish in a minute by just saying that I think adding this logical part into the visual analysis is crucial, and it's fundamental in a way that I want to point out. Not everybody pays attention to the fact that doing the logic and doing direct vision are two complementary and very different things.
In direct vision, you map pixels directly to some statements about the world: there is a dog in the image. So you start with pixels and you end up saying something true about the world-- there is a dog in the world. In logic, you do something else: it's a mapping from true statements to new true statements. Like in mathematics, there are things that you take to be true, and then you derive more and more.
Sometimes, amazingly, you derive novel and new things that are also true based on what you know. In the same sense, here, to analyze whether everybody is looking at the speaker and so on, you don't want to be trained in a pattern-recognition way, pixels to predicates, on this thing. You can build it from simpler things which you verify are true in the image-- each person is looking-- and then apply the "for all" operation that exists in logic.
You can solve it. So much of vision, then, occurs not by what is being done now-- more and more direct training from pixels to situations, pixels to actions, pixels to social interactions, and so on-- but by the right combination of direct pure vision with logical structure.
So there is more to say about it, and it's a journey toward deeper and deeper visual understanding. We are just not there yet. It's a journey still unfolding, still in progress. But I think, from there, there is much more to discuss, and maybe it will come later.
But you can see also, within it, that there is a place for language-- it's playing a role, an important role, but probably not the whole story. So a lot is going on beyond language in combining vision and real understanding of the environment.
JOSH TENENBAUM: OK, great. Thank you.
I don't actually know that latest paper, but that's great. We should check it out, the PNAS one. And maybe this is a theme to come back to. It connects to some of the work that Ev did, and that maybe Phillip is about to talk about. It sounds like this is a kind of program, basically. Like what you're calling logic is some kind of program, maybe?
All right, so let's turn it over next to Phillip Isola. Phillip is a faculty member in EECS and CSAIL, and he was a student in brain and cognitive sciences. And since his early days, he's been very, very interested in the connections between brains, minds, and machines.
Phillip is one of the pioneers of what's now called generative AI, I think it's fair to say. Before people used that term, he built some of the first models, in vision especially, and across other modalities too, of using machine learning methods, neural nets especially, to generate not just classifier labels, but really rich images of various sorts.
And he's also one of the pioneers of some of the things that Talia talked about, which are very important in self-supervised contrastive learning. Many of the themes that we're now seeing scaled up in huge ways in industry, Phillip pioneered in his research. So he's really truly one of the leaders in these approaches to AI.
Most people don't think or talk about him as an expert in language and thought, but his perspective-- coming from perception, coming from generative AI, coming from learning, and he's also been really interested in evolution and emergence there-- provides a fantastic perspective here, along with his mix of engineering and science. We're really glad to have him here. So, Phillip, go ahead.
PHILLIP ISOLA: Yeah, I'm maybe not the first person you think of for language and thought. I mostly have worked on vision and embodied cognition, robotics, representation learning. But I want to use that to provide an angle, which is that Josh started the comments with the idea that language is a distinct kind of intelligence, something that might be special to humans. And I want to say that it also has commonalities and intersections with other types of cognition.
So let me give a few arguments in that direction. So one is, we think of large language models. I work on computer science, machine learning, so I'm not going to talk so much about the brain, but more about the large language model perspective. Might not be how the brain works, but that's what I know more about.
So, large language models we think of as being language models, models of natural human languages, but I think that's not quite the right way to think of them because, first of all, if you look at how they're trained, they're trained on data scraped from the internet.
And I don't know the breakdown. I mean, nobody does. Maybe Ilya or somebody can tell us later. But anyway, we don't know exactly what percent of the GPT-4 training data is natural human language versus other types of text on the internet, but my guess is it's mostly other types of text on the internet.
So there definitely is a lot of code-- GitHub code, Python code-- a lot of HTML, a lot of CSS. It probably has a lot of random data streams that are encoded in text but are not natural language. So calling these language models and thinking of them as being only about natural language is not quite right, even though they're very good at that. OK, so that's one thing.
And in fact, we often have debates about how grounded, or how much you need grounding in these language models, but there is textual data on the internet that describes colors, right? There's RGB sequences on the internet. There are things that might look like actual grounding in the text data these are trained on. So it's not very clear to me, exactly, how much these are pure language models. That's one.
Two is a big thing that's been happening in all of machine learning is using language models-- large language models, LLMs-- to solve other tasks. So what you may have seen in ChatGPT and so forth is that these systems are really good at all language tasks, but they're also very good at robotics tasks and vision tasks.
And this gets kind of to what I think Shimon was talking about that some of the best methods right now for solving visual reasoning tasks, for recognizing different objects and where they are in a scene, and analyzing scene graphs and interactions are to convert the image to language. So this is a paradigm of 2023 that's emerged. You take your image. You caption it, or you convert it to language, and then you do processing on the language. And that can achieve state-of-the-art vision results.
So, if you're just somebody who cares only about computer vision-- recognizing cats and dogs or navigating streets-- one of the best working methods right now is to convert the data into a linguistic format and then process it with LLMs. You can also go further and say, actually, the language form isn't quite right, but there is some sub-linguistic structure within the language model that is a good representation of the world, and you can tune that structure to solve a new problem. So you would fine-tune your LLM to solve a new task, maybe in embodied control or in perception. And that's also a very popular paradigm right now.
OK, so language models are not just about language, and somehow they turn out to be very powerful for other tasks that we didn't previously think of as being related to language. There are distinctions, but I want to point out places where I see the commonalities.
One more angle that I'll get at is I've done a lot of work on self-supervised learning. So how do you learn about the world without having labels, without having words? And one of the interesting things that shows up there is that, if you train a system to do something that's self-supervised, meaning it just predicts data. It just models data. It doesn't have any labels.
For example, if you train a system to colorize a black and white photo, just predict the missing colors in a photo, there are no words at all. If it's a neural net, it will just self-organize to discover neurons that act like detectors for things that we have words for. So it'll find detectors for faces and for cats and for dogs.
And over the years, networks have been trained, and they get better and better at being able to linearly separate different linguistic categories-- different words, basically, different nouns. So I think Jim DiCarlo, Nancy, and others have mentioned this, I'm pretty sure.
So, somehow, training a system just to understand the world without language arrives at structures that have some relationship at least to words-- maybe not to grammars and other parts of language, but at least to nouns and at least to words. OK. So that's another commonality.
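The "linearly separate" claim is usually tested with a linear probe: freeze the learned features and fit only a linear readout on top. Here is a toy sketch; the features and the perceptron trainer are hypothetical stand-ins (real evaluations use a frozen network backbone and, say, logistic regression), but the logic of probing is the same.

```python
def train_linear_probe(features, labels, epochs=20, lr=0.1):
    """Fit a linear readout (a simple perceptron) on frozen features.

    labels are +1 / -1; the features themselves are never updated,
    mimicking a probe on a frozen self-supervised backbone.
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: perceptron update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy "frozen features" for two categories the backbone has disentangled.
feats = [[1.0, 0.2], [0.9, 0.1],   # e.g. images of one category
         [0.1, 0.9], [0.2, 1.0]]   # e.g. images of another
labels = [1, 1, -1, -1]
w, b = train_linear_probe(feats, labels)
```

If the self-supervised features really do separate the two categories, the linear probe classifies all four points correctly without ever touching the features themselves.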
So, I'll say that my general quest has been, can we come up with good general purpose representations of the world? And I'm just going to finish up now. People have thought about, what is language good for? It could be about communication. It could be something else.
But what I want to propose, or the way that I'm looking at it, is maybe language is that good general purpose representation of the world that you arrive at from all these angles. If you want a general purpose visual representation, language is a good candidate. If you want a general purpose robotics representation, language is a good candidate. Now, I don't necessarily mean English exactly, but something that has symbols and grammars and discrete structures. So that's the argument that I'll pause at.
JOSH TENENBAUM: OK, thank you.
I could ask you to comment more on what you might say about this, I mean, even just speculatively on language and thought in the human mind and brain. But maybe I won't do that now. I'll just start thinking about that, because I think it's quite latent in what you're saying in a good way.
But just to get everything and everyone on the table, and then turn to our broader discussion. So Sam Gershman is our next speaker. Sam is a professor at Harvard. He was briefly a postdoc here. We worked together there, and he's been an amazing colleague and friend ever since then.
Sam is, I think, also not especially known as an expert on language and thought, but he led an effort here with Nancy and me-- and with Ev also-- that was one of the first attempts to actually link up language and thought on the computational side: an effort to link computational models of language to the human brain, and to link them up to perception.
And I think of Sam as somebody who is kind of a generalist when it comes to intelligence, and that includes pretty much all aspects of human intelligence, but also things that are shared with rats and things that are shared with even single cells as one of his research programs. So he's a very deep thinker about intelligence in the most general terms and how to think about it computationally.
He also is actually a master of language. He's a great speaker, a great writer. I think he's written several books of poetry back in the day. In the way that language and thought really speak to what it is to be human, I think Sam is just a great exemplar of this in so many ways, and I'm always inspired by his perspectives.
And he's been a participant in CBMM at Harvard on the Harvard side from the beginning, and especially, he's always been like really eager to engage on the deep, hard questions in some of our panels. So especially fitting to have him as one of our discussants on this 10th anniversary symposium. So, Sam.
SAM GERSHMAN: Thanks, Josh.
I feel like I'm at the Oscars or something. First, I just want to say thank you to everyone, and particularly Tommy and the other leaders of CBMM, because it's just been a great opportunity to be part of this from the beginning since I was a postdoc. So I'm really appreciative of that.
So in the spirit of unfettered intellectual spontaneity, not only do I not have slides, but I didn't really prepare remarks kind of deliberately, because I wanted to hear what my co-panelists said and try to respond to them. And so, in that spirit, I wanted to return to some of the themes brought up here, and in particular, some of the issues that Liz brought up about development.
And let me try to make this concrete and talk about two developmental phenomena, which I think are really illuminating. So one is, when children are learning the meaning of words, if you show them something like this thing right here and you say, that's a pointer, it's really ambiguous. Right?
Some of you might recognize this as Quine's Gavagai problem, which is that it could refer to all sorts of things. It could refer to the rectangular shape. It could refer to this gray color. But that's not what the meaning of the word that children learn when you say pointer. So if you show them another gray thing or another rectangular thing, they're unlikely to register that as another pointer, right?
Instead, they register the whole object. And that's why it's sometimes called the whole object bias. So it's a very strong and powerful inductive bias. And it's important to appreciate about that inductive bias that there's nothing in our visual input that is an object, right? Objects are constructions of the mind.
They're fundamentally conceptual representations that are, of course, linked to properties of the visual input. So we should really think about that as a kind of conceptual inductive bias that is in play when children are learning language.
A second phenomenon that I love is, if I'm looking at this pointer, but the child is looking at something else, at their toy, and I say pointer, they're not going to now think that the toy that they're playing with is a pointer. So it comes back to something that Tommy mentioned yesterday that maybe we're kind of back to association.
And I would say, if we want to build human-like machines, we can't be back to association, because if you think about language learning as association between sensory input and words, that's exactly not what is happening when children are learning the meaning of words. A crucial step in learning the meaning of words is that they have to establish joint attention with the speaker.
It's only when I'm looking at this pointer and the child is looking at the pointer that they will register this as a pointer. And there are lovely experiments where, for example, if you have some disembodied voice behind a wall saying, pointer, it's not like the child learns that whatever they're looking at is a pointer.
So that example brings up this point that a big part of language learning is not just about associating sensory and verbal input, but about agents, and agents doing things and attending to things. And so, that's another example of a cognitive capacity that's in play during language learning.
And if we kind of take these two examples as part of our frame, now let's ask, so what? Maybe we don't need any of that, in the sense that we now have these technologies like language models that hoover up lots of language input, and then they can do pretty impressive things.
And they never have to deal with conceptual representations of physical objects. They never have to establish joint attention. They have a notion of attention, but it's very much a different notion of attention. I mean, they're not even doing any kind of multi-sensory processing of input because there's only one modality in their training input.
So, I mean this as a kind of question that I want to pose to everybody to discuss here, which is like, all right, maybe we have totally different pathways to intelligence that kind of dispense with all these facts that we know about how human development and cognition work. And so, that could be an argument against the kind of mission that CBMM embodies, which is that, the facts of development and cognition and behavior and all of that really matter for the design of intelligent machines.
And I actually think that that's probably not true. Sorry, I mean, I agree with the mission of CBMM, but the part that's not true is that we can ignore all of those facts. And partly, just really basically, it's the point that-- and this is something that came up in the last panel-- this issue of embodiment and the fact that we are part of a community of other agents. We're using language in the context of communicating with other agents and doing things, and none of that is explicitly part of the language input. So that's kind of where I want to leave things, and we can discuss this more.
JOSH TENENBAUM: OK, perfect. All right, so thank you to Sam. That was great. I think the way I'd like to structure the time we have is spend about half the time with you guys responding to each other, and then the other half of the time with questions here from the audience. Is that good? I don't think you folks will be shy in asking questions.
But I'll try to give one-sentence summaries, because I think each of you was raising questions for the others and also offering potential answers. So Liz basically said two things: children start off with core knowledge that must be really important, and yet core knowledge is shared with other animals, but only human children learn language. Why? Good question.
Ev said language and thought are different in the brain, but clearly related in some ways. Shimon said there's something about thinking that isn't just vectors, that's maybe something more like programs or logic-- he used the word logic-- some kind of logical, symbolic structure that allows you to have a concept like: if I say, hey everybody, look at me. OK, most of you did, but not all of you.
As Shimon has shown, there's this really cool phenomenon: if somebody is looking at that scene, you can tell whether everybody's looking at the speaker, or who isn't. OK, so what does it mean for everybody to be looking at the speaker, and what is the actual explicit symbolic computation-- which might not be just vectors from image models or language models-- that might make that happen?
Phillip also tantalized us with the idea that language models are, in some sense, symbol models, really. Like, there's code, there's structured data. The internet is full of symbols in ASCII character sets, but they reflect a lot more than just language. So is there something special about symbols?
And then Sam gave his own perspective. Well, Sam highlighted a couple of really key aspects of symbolic cognition that are there in how we think about objects and words-- the earliest aspects of word reference, as you know-- and how that is a constituent of our agency.
OK. So you guys go at it. Anyone want to answer one of the other questions, or suggest how some of your ideas might be potentially an answer? Or how would we go about meeting that challenge?
EV FEDORENKO: I mean, I could say a brief thing about the symbolic format of language and thinking. I think a lot of how the language-and-thought debates have gone over the many centuries has been biased by the kinds of fallacies that we have in our reasoning, the ones Josh and others have been studying for many years.
So, for example, language, of course, is very symbolic. And for some reason, that makes a lot of people make this leap to say, oh, because we have language, we learn to think in more symbolic ways-- even though this does not follow at all-- and a lot of thinking requires some form of symbolic representations.
And, to my mind, it's much more plausible that language simply reflects those biases, because I think thinking, in many ways, is pre-linguistic. So if we come with a mind that can think in some ways in symbolic representations, and then we come up with a communication system to actually talk about what's in our minds, of course that system is going to reflect some of those biases. So I just think that take is worth keeping in mind.
JOSH TENENBAUM: So you're saying that there's some pre-linguistic notion of a symbol.
EV FEDORENKO: Yeah.
JOSH TENENBAUM: Would you say that's possibly part of the answer to Liz's question? I mean, that's one idea people have proposed, that somehow there's something distinctively symbolic about human minds.
EV FEDORENKO: Perhaps. I mean, that's a hard question, so hard to test. Like, you can teach symbols to some non-human species in some limited way. So it's not like it's fundamentally not there, but yeah, that's certainly a big distinguishing feature.
ELIZABETH SPELKE: But I think there's actually another problem with the claim that the fundamental thing that distinguishes us is a more general symbolic capacity, which is that research on young children shows they're actually pretty terrible at understanding symbols.
It's a big achievement that occurs in the second or third year when a child is first able to take a scale model of a room and see it as a symbol for that room. And similarly for pictures as symbols. That happens a little earlier, but still. The idea that someone is showing you a picture to give you some information about the thing that's being depicted. The young kids don't have that.
EV FEDORENKO: Sorry, I did not mean to imply that symbolic capacity is innate. I almost certainly think it's not. I think we're fantastic learners and we can get there. I just think, just because language happens to emerge at around the same time, around two years of age, does not imply that language is what gives you those--
ELIZABETH SPELKE: Ahh! Language doesn't emerge at two years of age. It's there in the infants.
JOSH TENENBAUM: More like grammar.
EV FEDORENKO: You start learning words at six months. Elika's work shows that.
ELIZABETH SPELKE: Right.
EV FEDORENKO: You can distinguish prosodic patterns before then, but you start learning words, recognizing words at around six months.
ELIZABETH SPELKE: Right.
EV FEDORENKO: And then towards the first year, you start saying words.
ELIZABETH SPELKE: And around at 14 months is when, I think, you're actually--
EV FEDORENKO: Sure.
ELIZABETH SPELKE: You have the words well--
EV FEDORENKO: I rounded up.
ELIZABETH SPELKE: That's still way before they understand other symbols. Could I put two functions of language on the table that I actually think might be really useful for kids?
EV FEDORENKO: Sure.
ELIZABETH SPELKE: One is the following: because human cultures are so diverse, and a child has to be able to become a competent member of whatever culture they happen by accident to find themselves in, there's a vast space of potential concepts that children have to be able to represent, have to be able to learn.
How should they search through that space to find the concepts that are useful in the particular culture that they've been born into? I mean, one possibility-- and these aren't mutually exclusive possibilities-- is there's just a lot of general purpose learning that's going on about what sorts of things am I seeing in my environment, and what sorts of movements are people making and things like that.
But even after all of that, I think there's going to be many, many things, many, many concepts that members of a culture could single out that children are going to have to sift through in order to figure out, they're pointing to something and they're saying something about it. Are they talking-- the Gavagai problem, right? What is it that they're talking about in that case?
And here, I think it may be the really useful features of language are one, that they're learned. We don't have an innate language that we're all speaking, which at one point, it seemed to me would have been a much more sensible thing to have evolved.
But the fact that they're learned, and they're learned from other people who aim for economy, and for relevance to the situation that they and the person they're talking to are focused on, and for informativeness. Because of that, the words that they're using are going to be eliciting concepts that are useful in that situation, and simple things like word frequency are going to be a good proxy for conceptual usefulness in general. Those should be the first things that children learning language should be learning.
So I think that's a really useful feature of language. And the other useful--
EV FEDORENKO: Communication, right? It's basically communication about the world.
ELIZABETH SPELKE: No, because-- well, OK, that gets to feature two. Maybe it could just be communication. If you grunted and pointed, you'd accomplish the same thing. Kids would see you're pointing more often to the bottle than you're pointing to the plastic that the bottle is made out of or whatever.
But then I think the other really useful feature of language is that it doesn't just directly connect to the world. It conveys our perspective on the things that we're talking about. And it's really hard for me to see how any animal or machine without language could be learning from humans what the useful perspectives are to take on the things that we all see.
So those, it seems, to me, would be just two reasons why language is a particularly useful thing for kids to be learning early and then using to be learning other--
JOSH TENENBAUM: But let's dig into this, because I forgot to mention, communication was another thing that Ev said that's key. And that was also connected to the last thing that Sam was talking about. And so, I think this might be-- so I think symbols are a really important part of why language is important and, more generally, of thought.
But this issue of communication, again it connects to understanding. Like, when we talk about understanding, we talk about understanding the world. We talk about understanding language. But we also talk about understanding each other. People want to be understood. Why do we want to get up here and talk? Because we have something that is in us and it has to get out.
One of my favorite phenomena that hasn't been mentioned here of both development and evolution of language is the work of Susan Goldin-Meadow and colleagues on home sign. If you don't know this work, some of the most important work out there, I think, on language and thought.
Susan and colleagues studied children. This is work that was done, I think, in the 1970s and '80s, a lot of it, and continues. But they studied children who grow up deaf with families who aren't speaking a sign language. So they don't have much, and perhaps any, actual language input.
But they still have human minds, and they have a rich social environment, and for all the reasons that many of the speakers are talking about-- and they have some proto-symbolic capacities for thinking-- they also have the desire to communicate and be understood. And they tend to make up their own proto-language.
So Goldin-Meadow and colleagues mapped out ways in which these kinds of sign systems that children just spontaneously invent to communicate with other people around them, have various proto features of language, aspects of symbolic reference, proto-syntax like noun-verb type combinations. It's really quite amazing that children without any language input at all create proto-language-like systems.
And then a related phenomenon that she and others studied is the development of new sign languages, like Nicaraguan Sign Language, in an orphanage where children had come together who, again, didn't grow up speaking any conventional natural language, but had their own home sign systems. You bring them together, they merge their languages into some kind of a pidgin, effectively, and then, over just a couple of generations, create a whole new sign language that didn't exist before.
So, to me, that's one of the most striking phenomena of how thought and capacities for symbols and the urge to communicate and be understood cause individuals and small groups of humans to create language. It's almost exactly the opposite, interestingly, of what we see in large language models, where they've taken the outputs of vast generations of humans who produced all this language and put all this thought out there in the form of language, and they remarkably kind of back out some interesting approximations to thought. But humans, we start with thought and create language in these ways.
So, again, I think it'd be great for others, especially maybe on the AI side. So what insights could we get from that for, let's say, designing machines that could actually use language and communicate with us better or maybe learn much, much quicker or all the things that we might want from AI?
Shimon or Phillip or anyone who wants to weigh in, or Sam or Liz.
SHIMON ULLMAN: A related comment, although not directly. I think that, first of all, for us in machine learning, or at least for a long time with vision, we learned from looking at the world. But when you think about it, most of the time, we learn by being told. We sit for 12 years in school, and we are being told more and more things. And we take it in, and we learn about the world.
And in large language models, you have something funny: in some sense, they do it, and as was mentioned here, they accumulate sort of rich world knowledge, but it is not put to use at all. So, for example, in the example of drinking that I gave-- drinking, what it really means, is bringing liquid to the mouth with intent to ingest it, or something like this.
So, first of all, these things are more speculation, or playing with GPT-4 and InstructBLIP and so on. They do not understand that in the sense of putting it to use. For example, visually, they can recognize-- because they saw examples the model was trained on [INAUDIBLE]-- typical kinds of drinking, but not, for example, drinking from a bucket. You show it to them, and they have no understanding that this person is drinking. You show it to a five-year-old, or to anyone here in the audience-- you see a person holding a bucket to their mouth and drinking, you would say that's a person drinking from a bucket-- but this is out of distribution for the model.
JOSH TENENBAUM: There's a great-- go ahead.
SHIMON ULLMAN: But if you ask the language model, tell me what people could drink from and so on, it will mention a bucket, it will mention a ladle, and it will mention things that the model does not recognize once it's confronted with them.
JOSH TENENBAUM: Yeah.
SHIMON ULLMAN: So the way that we learn by being told, and then we put it to use, and it can also migrate to the visual domain, and we use it to understand the world around us is very fluent. When we look at the model right now, from what I've seen so far, they have this, in some sense, knowledge that you can retrieve if you ask the right questions. But it's isolated. There is a gap between this and being able to use what you were told to actually interact with the world and understanding.
JOSH TENENBAUM: I think this will start to change, partly from some of the work you talked about-- or, like, Phillip, you mentioned when we were talking before about ViperGPT and other kinds of things like that, where people are using language to construct programs to look at images. But exactly that is missing in current systems-- a key thing that humans do when we talk about visual scenes.
We don't just like look at the scene, embed it in our brain, and then talk about the embedding. It's an active process where our thinking guides what we look for and so on. Whereas it does seem that in a lot of today's models, it's much more this feed-forward thing.
Like the scene is embedded somewhere, and then the language model ticks around on that. And as a result, it's very limited. It's limited to the knowledge that's implicit in the language. One doesn't go back and change how it interprets the scene.
There's an example that Shimon used to show in some earlier CBMM things, and I forget. I know he showed some versions. I showed some versions. I forget who actually showed which version of this, but on the topic of drinking, it's so interesting.
There are these weird straws. Like, you can find images like this on the internet of like a straw that's in the shape of glasses. So you put one end in the bottle, you put another end in your mouth, but in the meantime, it goes around like some funny headdress.
Or if somebody brought in a plastic clear tube-- and we can all do this thought experiment-- like a really long tube, and we snaked it around the side of the table and up into one of those bottles, like Ev's bottle right there, then the other end here and I went [SLURPS] and you saw over time the water level go down, it would be very clear to you I'm drinking in some really weird way.
The tube could snake around the back of the room in here. It doesn't really matter. But it would look so different in any image embedding. It's the common sense understanding of what the action of drinking is. What is the goal? It's getting liquid into your mouth through some physical process, whatever.
So, anyway, that's one concrete example, but it's emblematic of something more general: what is thinking? How does it get realized in language and vision? What is the actual stuff of thinking-- fundamentally, building on core knowledge, the concepts that we want to communicate? Phillip?
PHILLIP ISOLA: Yeah, I have maybe a bit of a response to Shimon that I think also maybe connects on the debate of symbols versus non-symbolic systems. So I think that it's true that GPT 4-- well, I might evaluate it a little higher, but I agree that it has plenty of cases where it can fail. But that's one use of it. That's using it as a language model or a symbol model that takes symbols as input and produces symbols as output.
There's another use, which is also very common, which is to fine-tune these things. So that the actual representation that mattered is the sub-symbolic representation within the transformer, within the neural net. And I think that these two might have very different implications.
And it could be that language is powerful in the human brain because of the actual symbolic words that we use and manipulate and the programs we run in our mind, or it could be maybe that language is more of a supervisory structure. It's something that leads to the emergence of what really matters, which is the sub-symbolic system that does the computation.
And I don't know so much about the brain, but I think we see both of those usages in AI. We have LLMs that we use just as an input/output symbol manipulator, and we also have the internals of LLMs that if you tune those and peel those out, they can sometimes be even more effective. So, yeah, just putting those two perspectives out.
JOSH TENENBAUM: Just one more question for Sam, and then maybe turn it over to you. So, Sam, something you didn't mention in your comments but you had mentioned before, which I think is interesting is just themes of like amortized inference and planning. Like again, an aspect of thinking that came up here in part-- who mentioned it? Oh, Ev mentioned it.
Like the kind of problem-solving you're talking about, like Sudoku or solving math problems. Like, one classic area of thinking is where you have a problem to solve. You have some representation of a world. It could be the actual world, or some mathematical structure, or a fictional world-- a lot of narrative fiction is like this-- and a goal. You make a plan, a sequence of mental steps that you could act on in the world, perhaps, that would get you from where you are to the goal given your beliefs about the world.
That has been one of the central areas of human thinking in cognitive science, and I don't know if you want to comment on either the role of language in that, or ways in which language models might or might not reflect that.
SAM GERSHMAN: Yeah, well, maybe coming back to what Ev said about communication, one of the most important aspects of language from the communication perspective is the efficiency of communication, which people here, like Ev and Ted, have worked on for years. And I think maybe a kind of non-standard interpretation of that is that efficient communication is, in some sense, efficient thought.
Like, in the following sense. So, if I wanted to bake a cake, I guess I could, in principle start from first principles of like matter and how different kinds of molecules interact with each other chemically to produce different kinds of objects. If I knew enough, I could think through that, and figure out from first principles how to put a cake together.
But, of course, none of us do that when we want to bake a cake. We read a recipe. And, in essence, what the recipe is doing is giving some instructions that substitute for all that laborious thought that we want to avoid.
JOSH TENENBAUM: We could use physics, like Ila was saying.
SAM GERSHMAN: Yeah, yeah, that's right.
So, in the same sense that, internally, we have techniques or heuristics for basically figuring out shortcuts to do things that we want to do in order to do things fast and efficiently in the world without getting lost in thought, I think language is part of that apparatus. It's one of the tools that we have for amortizing cognition, if you want to think about it that way. Yeah.
JOSH TENENBAUM: Right, but then we might also say, well, where did the recipes come from in the first place?
SAM GERSHMAN: Right, so that's very important, I think. Like, we have the capacity to do original thought, and then maybe later, compile that into some kind of program in the same way a chef goes to their test kitchen and figures stuff out, and then they write a recipe. And that's what you download from the internet. I think we're doing that kind of thing all the time.
And I think that very much speaks to Ev's point about the difference between language and thought, and why we have a dissociation between those things. Because if we just had language, we wouldn't have the test kitchen of the mind that we really need to have all of our flexibility.
JOSH TENENBAUM: I like that phrase, the test kitchen of the mind. No, excellent. OK. So, yeah, let's just take a couple questions. Ila, go ahead. Yeah.
ILA FIETE: Well, I've been listening to the framing of this amazing discussion. My question is, I mean, in a generative sense, what would it take if I had a collection of agents moving around in the world, what would be the conditions that would require the emergence of language, I guess? What would be the minimal and sufficient conditions for that?
SAM GERSHMAN: And it depends on so many assumptions.
JOSH TENENBAUM: Phillip, you've worked on that, right?
PHILLIP ISOLA: Yeah. So I wouldn't say I've really worked on it, but I've worked on looking at emergence of representations without explicit supervision. Emergence is something that a lot of people have looked at.
So maybe the perspective that I was trying to give, or that I kind of believe in as a hypothesis, is that, if you just apply representation learning algorithms-- and I don't know quite what the right ones are. Maybe it is a max likelihood generative model. Maybe it's an auto-encoder. Maybe it's a next token predictor. But just kind of generic representation learning, like generic data compression on an embodied agent in the world-- that language will pop out.
Now, I think so far we've seen that something pops out. Words pop out, clusters pop out, some structures pop out that have relationships to language. I don't think we've seen English pop out, or something like homomorphic to English. But that would be my hypothesis that eventually that would pop out from just embodied agents trying to compress and represent the world around them.
AUDIENCE MEMBER: So they don't need to communicate? Because there's a question of is it communication?
PHILLIP ISOLA: Yeah.
AUDIENCE MEMBER: I mean, like people have raised many different criteria: internal symbolic thought, which presumably doesn't require any communication with any other agent, then there's communication, there's efficiency--
PHILLIP ISOLA: Yeah, you have to explain why. I mean, I just defer to Ev, but you have to explain, at least if you're trying to explain how human language emerges, why is it that the particular compressions we have are relevant for communicative efficiency and not just any kind of compression of information in the world?
JOSH TENENBAUM: I mean a point that Liz emphasized a little bit and others have emphasized, especially like Michael Tomasello, is the distinctively cooperative nature, I think, of human intelligence, that there's all sorts of ways in which our minds, but also our brains and our bodies, are clearly evolved to be cooperative.
Like we compete with each other, but compared especially to our nearest primate relatives, we do a lot of cooperation, a lot of joint action towards joint goals. Language is a way of aligning each other on joint goals. And why are we incentivized to share what we know?
I mean, again, we want to do that. Some people don't, but a lot of us really want to share what we know and learn from each other, and think that we're smarter together. There's something about humans that seems like that. A point that Tomasello has made-- I don't know who first made this point, but I learned it from him-- is that you can see this in the whites of our eyes. This is a very striking feature that we can all see in our own bodies.
Like, for almost all of you, I can tell whether you're looking at me or looking a little bit away, or I can see who or what you might be looking at, because I have hyper-acuity. I can perceive extremely fine-grainedly your eye gaze direction, based on the fact that you have a white part and a dark part of your eye, which basically no other mammals do, and certainly not other great apes and so on.
So, clearly, evolution made that happen. It has a cost, because we can betray what we want and what we think, and it has a gain in terms of how we interact with each other for joint action. And it's reasonable to think that inside the brain, not just on the outside of our bodies, there are similar kinds of adaptations, which relate to and build on symbols and communication urges, but also something more about the perspective issues that Liz mentioned, and our motivations to make things happen jointly in the world together. That would be important.
And again, if we want to think about how to get that into language models, it does go back to the issue of agency. I don't think you start from text to get agency, let alone agency that is interested. Well, I mean, again, humans have agency and the commitment to joint agency, it seems, as part of maybe both evolving and breaking into language.
Maybe take one. It's always traditional to end a CBMM panel with a question from Manolis, but how about if Jim is quick, then we'll have Manolis end it for us. OK. So something quick from Jim, and then something quick from Manolis.
JIM DICARLO: Maybe this is quick. Maybe each panelist can pick one answer or the other. This came up in the last panel--
JOSH TENENBAUM: True or false.
JIM DICARLO: No, pick one goal or another. One goal is, like, get a digital emulator model that goes from whatever sensory system you want, through whatever proto-language you want, up to language, like a world model, as you would call it. But it would be computationally efficient, available, usable, maybe, for all of the stuff that you could do. That would be goal one, a digital twin. Maybe we're close to that.
But I keep hearing in this context a lot about a different goal, which is: we really need to know where the thing comes from and how it evolved and so forth. And that's a different goal, but it's a goal, too. And those goals often get mixed up in the discussion, to me. The field should do one or the other-- or, of course, we'll say both-- but if you had to pick one, which one would you go for?
JOSH TENENBAUM: You mean individually, or which one you think we should all work on?
JIM DICARLO: As a field, yeah.
PHILLIP ISOLA: Wait, why do we have to pick?
JOSH TENENBAUM: This is the Center for Brains, Minds, and Machines, Jim. Well, I think build an emulator versus understand where the human capacity came from.
JIM DICARLO: You can't use the word understand. So that would be which version counts as understanding: a digital emulator, or a way to get to the thing that is the digital emulator?
JOSH TENENBAUM: So I see two kinds of computational understanding.
JIM DICARLO: I just think, from a scientific process, it might be clarifying to focus our goals on what are we actually after? So I think there's--
JOSH TENENBAUM: So in a model, right?
JIM DICARLO: Sam brought this up at the end. It's like, well, should we really-- is this part of the path? Or is building a comparison to the digital emulator, itself, a hard task? And that's a big part of a comparative study. So, again, you know where I stand, but I'd like to hear where everybody else stands, because they seem to be saying both, or some are saying different things.
SAM GERSHMAN: I'm trying to understand the scope of the problem. So when we say a digital emulator, is it possible to just emulate language? I think one of the themes here is that language doesn't exist in isolation. It exists as an exchange between agents trying to do stuff.
JIM DICARLO: Of course that would be part of the emulation.
SAM GERSHMAN: Wait, so are you saying emulate all of human society?
JIM DICARLO: No, imagine you had a test of humans interacting, however many you want, and you have a model that's sitting in there-- it's essentially a Turing test-- and you're measuring whatever brain levels you want, and say it's all lined up and it's digitally successful. That would be a research program, and it would have many uses, potentially, as a successful program.
Another program is, it doesn't matter how you get there by hook or crook, you just get there. And the world is getting there, possibly one way or another, right now through LLMs, and I think that's one way to view it. And you could say, well, that's a silly goal, and there's a different goal, which is do we really need to know the whole evolutionary arc of humans and the development of humans? And that's going to be far more important and here's why. So those are the two kind of directions.
JOSH TENENBAUM: I think--
JIM DICARLO: Yeah, and they're both valuable. That's why it's a hard question, which is why I'm kind of posing it that way. I'm not trying to give you your answers. I was just trying to separate for the audience that those seem to be both in the discussion.
JOSH TENENBAUM: So from a computational point of view, you can have a model of just language as it exists in adult humans, and that's separate from how you get there. Or you can have a model of how language develops starting from infancy, or how language evolves, whatever.
And I guess a question for everyone is, is it interesting? I mean, again, you don't have to choose. Scientifically, they're all really important, fascinating questions. But I mean, I think everyone would say that. We can go down the line quickly and say either your goal, or is it interesting if you have a model that is just the adult state without how it develops or evolves?
I don't know. Ev?
EV FEDORENKO: Yeah, I want to understand the whole thing. Like, yeah, of course I want to understand how it gets there, like a lot of answers are in the learning and evolution.
JOSH TENENBAUM: Sam, both. Yeah.
SAM GERSHMAN: I mean, are we talking from the perspective of engineering? So like, I think, scientifically, we want to understand both. If you want to just build an artifact that can talk to people in a human-like way--
JOSH TENENBAUM: But is it scientifically--
Suppose we have a model. Like, Jim would say his models are of the ventral stream. He's not making any claims about evolution or development, but he's saying they're scientifically interesting models of the adult state. So is it scientifically interesting to have a model of the adult state of language without a developmental or evolutionary story?
SAM GERSHMAN: Yeah, I mean, in the same way that a psychologist who studies adult language.
EV FEDORENKO: Right.
SAM GERSHMAN: Doesn't need to worry about how exactly that particular adult got their language, right?
EV FEDORENKO: These are some questions.
JOSH TENENBAUM: Maybe the question is to articulate, for Jim and maybe others here, why cognitive scientists have cared so much about language acquisition and evolution. Because, I mean, people have studied visual development and visual evolution, but acquisition and evolution have been at the center of the study of language, and of language and thought, in a way that maybe they haven't been in vision. Is there some reason for that? Why is that?
AUDIENCE MEMBER: Maybe because only we have it.
SAM GERSHMAN: Yeah. I mean, I think for evolution, obviously, there's this striking discontinuity and people want to understand where that came from.
JOSH TENENBAUM: Exactly. I mean, again, it just seems so striking for all these reasons, both because only we have it, and because all of us who've had babies, they don't have it. And yet, they still seem to think. And Liz, again, has pioneered ways of studying thinking in babies, and then you all see that process unfold.
So it is just a really striking phenomenon of human intelligence. And some people would say-- I mean, maybe this goes to your point-- the view that you get, especially in people who are saying large language models look like they're intelligent, is that they are seeing the thinking and the intelligence as the emergent property that comes after learning language.
But, for many of us in cognitive science, the intelligence is what lets you learn language from so much less data, and even invent language when you didn't even get it. So if you're interested in intelligence and what language might tell you about it, you are focused, among other things, very centrally on the acquisition and the evolution.
Let's wrap up with our traditional Manolis question.
AUDIENCE MEMBER: So no pressure here. So, basically, I'm struggling with the following. Basically you were talking about language-specific cognitive capabilities developing by 14 months and so on and so forth. How much of that is the amount of data versus just the cognitive hardware eventually developing? That's one aspect of the question.
The second one is the concept of abstraction, the concept of hierarchical interpretations of the world. I would argue that it's almost inevitable-- and I would even make the crazy claim, and you might prove me wrong here; you might have data that proves me wrong-- that maybe dogs encode visual objects around them in the same hierarchical way that we do, that there's like balls and parts of balls and, I don't know, parts of rooms and tables and legs, et cetera, that we humans have simply given names to those things, and that these hierarchical constructs are almost inevitable.
And, I think, a place perhaps to look in the animal kingdom might be song-learning birds versus innate songbirds where they can kind of look at the constructs that they're making. And I'm also thinking about earlier in evolution. Why did we evolve this giant neocortex? What were the evolutionary pressures prior to language, and is it dance? Is it song?
Is it stringing together constructs, which might have grammatical constructs even before the emergence of language, that then pushed for a hierarchical interpretation machine to the levels that we are now able to just formulate in the context of language? Again, you don't have to answer that.
JOSH TENENBAUM: As a traditional, that was five good questions.
SAM GERSHMAN: Wait, Manolis, do bird songs have reference?
AUDIENCE MEMBER: So, it's a great question for the folks in the room. They probably know a lot more about it than me--
SAM GERSHMAN: I mean, I don't think they do. Right? They're patterns, right? And they might have some kind of grammatical structure, but it's fundamentally different from--
AUDIENCE MEMBER: I mean, again, I can't distinguish song-learning birds from typical birds, but I can see patterns in the birds when the birds sing, I can see that they have specific constructs that repeat in specific ways.
SAM GERSHMAN: Right. Do any of those patterns or constructs or whatever refer to things in the world?
AUDIENCE MEMBER: No, no, no, it's definitely not.
All I'm arguing for is a hierarchical organization.
JOSH TENENBAUM: The best consensus is that they don't, for the most part, but there are really interesting phenomena in some of the birdsong-- things kind of like Liz's example from Jacques Mehler's work, that very young babies can pick up on this statistical structure in just a few minutes.
There are some kinds of birds, like the mockingbird, who can hear a new sound and kind of imitate it. So there is some really interesting circuitry. And Michael Fee and others have long speculated about-- and that's partly why it's interesting to study-- the links between vocal learning in songbirds and vocal learning and imitation in human language.
Does anyone else want to make any concluding comments or address these questions?
ELIZABETH SPELKE: I just want to go for the one about compositionality and abstract concepts, or maybe that was two. Anyway, I think people have been thinking about animal cognition all wrong. The general assumption, I think, is that animals might share some of our sensory capacities and some of our low-level perceptual capacities, but that they don't share our capacities for abstract thought.
And I think actually, exactly the opposite is true: the most abstract concepts that we have, the most general concepts that apply in any environment, have the longest evolutionary history. So animals are going to view events in terms of causal relationships. They're going to view objects in terms of properties that you can't see, like permanence when objects are out of view. They're going to view other animals as having consciousness and experiences, and as sharing them when they engage with each other.
Those are the things that are going to be common across all of these species. What's going to be different about us is our prodigious capacity for remaking the world and for diversifying our societies, which gives children a huge learning problem. How are they going to learn fast enough in order to keep up with all the incredible new challenges that we seem to produce now on a daily basis?
And even in our past, you see great changes over human history and even prehistory. How do you keep up with all of that? And language, I think, can be-- I'll be more general and not just pin it on language. Learning from other people in your social world, who share your culture and have been in it longer than you have as a child, is a really, really good mechanism for doing that.
AUDIENCE MEMBER: I'm going to bring a piece of data here, which is, of course, the--
JOSH TENENBAUM: Manolis, it's OK. I think we're starting to cut into our lunch hour, so it would be a good time to wrap up. But I hope we can continue this over lunch. I do think it's a good question, and I like Liz's answer. In general, you asked, do other animals see bowls and parts the way we do?
And as Liz said, I think this is a good note to end on, especially when we're thinking about technology, dogs may see some of the same things in bowls, mugs, and handles, but we see those things in part because we made them. We made the handles on the mugs and the bowls. We're making the chairs with the legs. We're making the computers and the language models.
So in a world that we are changing and re-changing for our goals and collaboration, language is a distinctive capacity there, but it remains, let's just say, a question for the next 10 years of CBMM.
AUDIENCE MEMBER: Can you take 30 seconds?
JOSH TENENBAUM: No. OK, sorry. I'm just going to conclude the panel on that note: understanding the open-ended, flexible nature of intelligence in the world we're making. OK, thank you.
AUDIENCE MEMBER: Thank you. Thank you.