The Debate Over “Understanding” in AI’s Large Language Models
Date Posted:
April 22, 2024
Date Recorded:
April 2, 2024
Speaker(s):
Melanie Mitchell, Santa Fe Institute
Brains, Minds and Machines Seminar Series
Description:
Abstract: I will survey a current, heated debate in the AI research community on whether large pre-trained language models can be said to "understand" language—and the physical and social situations language encodes—in any important sense. I will describe arguments that have been made for and against such understanding, and, more generally, will discuss what methods can be used to fairly evaluate understanding and intelligence in AI systems. I will conclude with key questions for the broader sciences of intelligence that have arisen in light of these discussions.
Short Bio: Melanie Mitchell is Professor at the Santa Fe Institute. Her current research focuses on conceptual abstraction and analogy-making in artificial intelligence systems. Melanie is the author or editor of six books and numerous scholarly papers in the fields of artificial intelligence, cognitive science, and complex systems. Her 2009 book Complexity: A Guided Tour (Oxford University Press) won the 2010 Phi Beta Kappa Science Book Award, and her 2019 book Artificial Intelligence: A Guide for Thinking Humans (Farrar, Straus, and Giroux) was shortlisted for the 2023 Cosmos Prize for Scientific Writing.
JOSHUA TENENBAUM: We are very, very pleased and fortunate to have Melanie Mitchell speaking with us. As you can probably tell from how many people are here, Melanie maybe needs no introduction. She is very well known at this point as one of the leading public experts on what is going on in today's world of AI, on the different kinds of approaches, different eras, and different ways to think about AI. And I think that's what she's going to tell us about here.
She's a professor at the Santa Fe Institute, where she runs a number of programs that relate both to AI and other kinds of really interesting complex phenomena. She has worked in not just AI, but cognitive science and a range of other related topics. I remember I first met Melanie actually when I was a summer school student at the Santa Fe Institute in 1991, I think it was.
And you were giving lectures on genetic algorithms. And I was just telling someone today-- totally unrelatedly, just coincidentally-- about a talk I remember you giving when I was in grad school here about cellular automata and various like emergent objects, glider guns, and so on.
So whether it's looking at the emergence of intelligence, or things that might be kinds of intelligence (and maybe not, and maybe yes) in today's large language models, or emergent objects and concepts in cellular automata, simple models potentially of neural computation, or the emergence of algorithms and other kinds of intelligent behavior in evolutionary processes, Melanie has studied and communicated in fascinating ways about how structure and function can emerge, under what conditions, and what the phenomena of emergence look like across many different systems and scales. That work has inspired me and many others for a long time.
So I think we're really lucky to have her here, and the field and the general public are very lucky to have her, whether she's writing popular books, tweeting, or going on TV. But she's also really doing the science that connects to that public outreach, and it's some of the most important work over a long time. So we're really lucky to have her here today. And I'll just turn it over to Melanie.
MELANIE MITCHELL: Thanks. Thanks, Josh.
[APPLAUSE]
All right. Thanks. Thanks, Josh. Josh said that I try to explain what's going on in AI. And of course, what's going on in AI is debates. People are at each other's throats trying to determine what is actually going on these days. One journalist described it as a time of quite radical uncertainty, which I think we are all feeling.
So I'm going to be talking about this idea of understanding in large language models, what people think it means, and what we might mean by it. This was an article that came out in PNAS that I did with David Krakauer, my colleague at the Santa Fe Institute.
So do AI systems understand the data they process? Of course, it depends what you mean by "understand." In the pre-generative-AI days, there certainly seemed to be a lot of failures of understanding. And it mattered, because you could get all kinds of errors that I think we would call errors of understanding.
So here's an example. This was from a paper from 2018 showing that if you have a deep convolutional neural network trained on, say, the ImageNet dataset that recognizes school buses with 100% confidence, and you Photoshop the object into a weird pose, the network is now 99% confident it's a garbage truck or a punching bag or a snowplow.
These were problems of understanding visual data in some human-like way. These systems would make errors that were very unhuman-like. And there were also common-sense kinds of problems, like systems trained for self-driving cars on different objects you might see on the road not being able to distinguish whether something is actually an object out in the real world or is just a picture of a bike on the back of a van, as part of an e-bike ad.
Humans are pretty good at figuring out the context and knowing what is a picture on a van and what is in the real world, but these systems had problems with that. Here's another example of that. A person on Twitter posted that they were driving their Tesla with self-driving software and the car kept slamming on the brakes in this area, and he couldn't figure out why.
And then he noticed the billboard-- I don't know if you can see that-- which is an ad with a stop sign and the car just would slam on the brakes. This is an edge case, obviously, it's not something that happens all the time. But there's always some edge case that's going on somewhere. And these cars, it's hard to figure out how to train a system to understand this common sense kind of representation of what's going on in the world.
We've also seen lots of instances of machine learning systems that don't understand what it is we're trying to teach them. This is an example from a paper where a group was trying to train neural networks to classify whether pictures of skin lesions were malignant or not. And the system learned that if there's a ruler in the image, it's much more likely to be skin cancer, because the ruler is there when someone is trying to measure a suspicious lesion. So what the system learned was that rulers are a good predictor of skin cancer.
And we also see problems with natural language processing systems. So Google Translate-- this is actually only a few weeks ago. I tried to get it to translate this sentence, "the legislator accidentally left a copy of the important bill he was writing in the taxi," into French.
The word "bill" is ambiguous, perhaps. Does anyone here speak French? Yeah. So this is the wrong translation: "facture" means the kind of bill you would get from your plumber, not the kind of bill a legislator would write. And in fact, translation errors have had real-world impacts when people use these translation programs and trust them too much, for instance in translating asylum applications from Afghan languages into English.
But now we're in a new era of AI. We have these large language models. They have been trained on vastly more data than any of these earlier systems. And they seem to have achieved a richer human-like understanding.
So for example, if I ask ChatGPT to translate that same sentence, it gets the right translation. And I can even ask it how it knew to translate it that way, and it gives me a very verbose explanation telling me that it knew it was dealing with a legislator and a legal document, et cetera, et cetera.
So Terry Sejnowski, who most of you have probably heard of, a neuroscientist who was a pioneer of neural networks back in the 1980s, recently wrote an article in which he is also grappling with what is going on in AI. And he says these things are not human, but they're superhuman in their ability to extract information, and some aspects of their behavior appear to be intelligent. But if it's not human intelligence, what is the nature of their intelligence?
And I think this is the question that kind of we're all grappling with, we're all obsessed by. What is the nature of these systems' intelligence and how do we figure it out? So some people think that they're intelligent in certain ways and they have real understanding.
So Blaise Agüera y Arcas is an executive at Google and also an AI researcher. He wrote this article arguing that these models do, in a very real sense, understand a wide range of concepts, even though they're informed only by text. He even wrote another article saying that they are, I think, getting closer and closer to being conscious. I'm not going to talk about that here.
But on the other hand, the philosopher Jake Browning and Yann LeCun write that a system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe. So people have extremely polarized opinions on this topic. Maybe they're using the words "intelligence" or "understanding" differently; I don't know. It's hard to tell.
But this group, a couple of years ago, did a survey of the natural language processing community and asked people who had published in NLP conferences to agree or disagree with the statement that some generative models trained only on text, given enough data and computational resources, could understand natural language in some nontrivial sense. And the results were split right down the middle, essentially 50/50. So there is radical uncertainty, a lot of polarization and disagreement.
So what would it mean to "understand?" Well, Ilya Sutskever, one of the co-founders of OpenAI, put out this hypothesis about what understanding means and how large language models understand. He said that when we train these systems to predict the next word in a huge amount of text, what happens is that the system learns not just to predict text, but it learns a model of the world.
And he said-- he kind of said it poetically-- it learns of people, the human condition, their hopes, dreams, and motivations. It learns a compressed, abstract, usable representation of that. OK. So that's a hypothesis, right? What's the evidence for that?
Well, he didn't give any evidence. But some people have tried to look for evidence of so-called world models in these trained large language models. These are two papers that looked at a toy problem using the game of Othello. Some of you have probably seen this work: they trained a transformer on sequences of tokens representing moves in this little game called Othello, with the objective of predicting legal moves.
Then they looked at the internal representations of the transformer and found that those representations seemed to be encoding the state of the Othello board, which the model had never been trained on but somehow had emergently figured out. Now, this was very interesting and provocative, but it wasn't really a large language model. It was a relatively small language model, and Othello is a pretty simple, closed world. So it's not clear that this will generalize to things like ChatGPT.
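To make the probing recipe concrete, here is a minimal sketch in PyTorch, with random tensors standing in for the real activations and board labels (so nothing here is the papers' actual code or data): fit a linear probe from the transformer's hidden states to the board state, and treat high held-out probe accuracy as evidence that the board is encoded in the representations.

```python
import torch
import torch.nn as nn

# Stand-in data (hypothetical shapes): hidden states from a move-sequence
# transformer, and the true Othello board state (64 cells, each
# empty / black / white). Real probing would use actual activations.
hidden = torch.randn(10_000, 512)            # [num_positions, d_model]
board = torch.randint(0, 3, (10_000, 64))    # [num_positions, 64 cells]

# One linear readout per board cell: can each cell's state be decoded linearly?
probe = nn.Linear(512, 64 * 3)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    logits = probe(hidden).view(-1, 64, 3)                    # [batch, cell, state]
    loss = loss_fn(logits.reshape(-1, 3), board.reshape(-1))  # per-cell classification
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (probe(hidden).view(-1, 64, 3).argmax(-1) == board).float().mean()
print(f"probe accuracy: {accuracy:.3f}")  # ~chance on this random stand-in data
```

With random stand-in data the probe of course learns nothing; the point is the recipe. If a simple probe trained on real activations predicts held-out board states well above chance, that is taken as evidence that a "world model" of the board has emerged inside the network.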
But people took this and extrapolated. Chris Olah, who is at Anthropic, said he thought it was clear evidence of language models forming internal world models. And Andrew Ng went further and said that the Othello-GPT project showed that they build models of the world, "which makes me comfortable saying they do understand." So they're extrapolating quite a lot from this rather limited experiment.
On the other side, we have Yann LeCun again, asking: Are these models actually building abstractions that they use to reason? Or are they doing what people have called approximate retrieval, which I think means something like taking the problem they're faced with, doing some kind of pattern matching against memorized training data, and using that training data to solve the problem, rather than really coming up with general abstractions or world models. He's saying there's a continuum between the two, and he thinks LLMs are largely on the retrieval side, the memorization side.
And Rao Kambhampati, who is at Arizona State-- if you are on Twitter, you should follow him, because he's hilarious and very snarky about all of this. He said, "Emergent abilities, noun. The preferred euphemism for what your LLM does when saying 'approximate retrieval' sounds too unsexy." So these are the voices on the other side of the fence.
So how can we go about evaluating this question of understanding? This, I think, is difficult. One way is the old-fashioned way: just look at their behavior, give them a Turing test. Or say it's an ill-posed question and simply define understanding as seeming to understand.
But of course, we know that people are very prone to project understanding onto anything that communicates with us in natural language. We saw that way back in the 1960s with ELIZA, and with chatbots ever since. So we can be misled.
As a more objective approach, we can test these systems on natural language understanding benchmarks, and people do that all the time. But benchmarks can allow shortcuts, like the ruler in the skin cancer example. One of the more popular natural language understanding benchmarks a few years ago was the General Language Understanding Evaluation, or GLUE, a set of different tests for language understanding, and its successor, SuperGLUE.
And this is a recent leaderboard from SuperGLUE. Remember, this is general language understanding; that's what we're testing here. The top seven entries on the leaderboard are large language models, and here are humans. So that's where we are with general language understanding.
Well, OK. Maybe this isn't exactly a real test of general language understanding, because I think we humans are still probably better than most language models at that. But it turns out that there can be shortcuts, meaning subtle statistical correlations among the words or tokens in these benchmarks that can predict the answer, or that these networks can learn to exploit to predict the answer, without doing anything like human-like understanding.
And we've seen a lot of papers saying that benchmarking in natural language understanding is broken. There are artifacts all over the place, and what these systems are doing is a kind of Clever Hans act, after the horse that supposedly could do arithmetic but was actually responding to subtle cues in the body language of its trainer. So that's been an issue with these natural language processing benchmarks.
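One standard way such shortcuts are exposed is a partial-input baseline, sketched below with toy, made-up data rather than a real benchmark: train a trivial classifier that sees only part of each example (say, only the hypothesis of an entailment pair), and if it beats chance, surface statistics are leaking the labels and a model can score well without the intended understanding.

```python
# A partial-input baseline, sketched on toy data (hypothetical examples, not a
# real NLI benchmark). If a bag-of-words model that never sees the premise can
# predict the label, the benchmark contains artifacts a system could exploit.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

hypotheses = [  # hypothesis-only inputs; the premises are deliberately withheld
    "The man is not outside.", "A dog is sleeping.", "Nobody is eating.",
    "A woman plays guitar.", "The child is not smiling.", "People are dancing.",
]
labels = ["contradiction", "entailment", "contradiction",
          "entailment", "contradiction", "entailment"]

vectorizer = CountVectorizer().fit(hypotheses)
classifier = LogisticRegression(max_iter=1000).fit(vectorizer.transform(hypotheses), labels)

# On a real benchmark this would be scored on held-out data; accuracy far above
# chance signals an annotation artifact (here the word "not" gives the label away).
print(classifier.score(vectorizer.transform(hypotheses), labels))
```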
More recently, people have started giving these language models standardized tests. You've all seen the headlines: "ChatGPT Gets an MBA," it passed Wharton MBA exams and did better than a lot of students, it might be smart enough to graduate law school, it passed medical licensing exams "without cramming." I'm not sure whether memorizing the whole internet counts as cramming, but--
[LAUGHTER]
Anyway, people have since done studies questioning whether there is data contamination, that is, whether some of the test questions were actually in the systems' training data. And even if not, it may not be the case that performance on these tests correlates well with performance on the corresponding tasks in the real world. So I wrote an article about this, "Did ChatGPT Really Pass Graduate-Level Exams?", where I show some ways in which those claims are a bit of hype.
Just the other day, somebody, I think he was at Harvard, wrote an article re-evaluating the bar exam performance and showed that it was way overstated. And this group wrote on their blog that these are just the wrong answers to the wrong questions; these are not appropriate ways to evaluate these systems.
OK. So another way to evaluate their understanding, if we think of understanding as building up abstract world models, is to evaluate them on tasks that require abstraction and reasoning. But then, if they can perform these tasks, we have to ask how robust their abstractions really are.
So this paper, done by some people here and some people at BU, is called "Reasoning or Reciting?" They took several kinds of reasoning tasks, which you can see there, that GPT-4 was doing very well on; those are the blue bars showing accuracy. And they said, well, what if we come up with counterfactual tasks?
Those are variants of the original tasks that require the same reasoning abilities but have different content, content less likely to be similar to what's in the training data. As an example, one of their tasks was code execution. GPT-4 is very good at taking a little snippet of Python code and saying what it will print out.
As an aside, I heard a talk the other day on prompt engineering where someone said you have to give the large language model a pep talk. You have to say, you are a genius.
[LAUGHTER]
Or you are an expert programmer, or you are a Nobel Prize-winning physicist or something. And then you ask it the question.
So GPT-4 was very good at answering these little Python code-snippet questions. But what this group did was make a counterfactual version: they tell the model that it can readily adapt to new programming languages and that there's a hypothetical programming language called ThonPy, which is just like Python but uses 1-based indexing instead of 0-based indexing. And you can see that the performance of GPT-4, the red bar, went way down. I don't think they did human studies, but you would expect a human programmer to be able to adapt to this pretty well.
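To illustrate the flavor of that contrast, here are made-up default and counterfactual prompts of my own (not the paper's actual items); only the indexing convention changes, while the reasoning required stays the same.

```python
# Default task: predict what ordinary (0-based) Python prints.
default_prompt = """You are an expert programmer.
What does this Python program print?

a = [10, 20, 30, 40]
print(a[1])
"""
# Correct answer under Python's 0-based indexing: 20

# Counterfactual task: the same snippet, but in a hypothetical language,
# ThonPy, that is identical to Python except that indexing is 1-based.
counterfactual_prompt = """You can readily adapt to new programming languages.
ThonPy is identical to Python except that sequence indexing is 1-based.
What does this ThonPy program print?

a = [10, 20, 30, 40]
print(a[1])
"""
# Correct answer under 1-based indexing: 10
# GPT-4's accuracy dropped sharply on counterfactual items like this one.
```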
Another group did a similar study. I don't know if you guys saw the "Sparks of AGI" paper from Microsoft. Well, this was called "Embers of Autoregression."
[LAUGHTER]
What they were saying is that because of the autoregressive training of these language models, they will have certain properties and certain limitations. And they did a similar kind of study: they looked at several different kinds of reasoning tasks and compared performance when the content of the task was likely to be common in the training data with performance when it was likely to be uncommon. As you can see, the performance falls off, even though the reasoning abilities needed for the task are the same.
Now, humans are also sensitive to the content of reasoning tasks. But their hypothesis, although they weren't actually testing humans, was that humans, at least in some cases, would be able to adapt to the less common content. One of my collaborators and I studied this in the context of analogy making.
There was a paper that came out last year from Taylor Webb et al. at UCLA called "Emergent Analogical Reasoning in Large Language Models," where they showed that GPT-3 was able to perform analogy making at a level that exceeded a lot of UCLA undergrads. And this got a lot of press: "GPT-3 aces tests of reasoning," undergrads get beaten on questions like those that help them get into college. OK.
One of the task domains they looked at was these so-called letter-string analogy problems. For example, if the string abcd is transformed to abce, do the same transformation to ijkl. Here's another one: a string with two copies of b is reduced to one copy of b; what do you do for this one? And another: agcd goes to abcd, so you sort of fix up the sequence.
So they had a bunch of problems like this. And they showed that if they look at GPT-3 versus UCLA undergrads, UCLA undergrads don't come off too well [LAUGHS] for accuracy on this task.
OK. So what we did, and this is a recent paper from my collaborator Martha Lewis and myself, is take this counterfactual task paradigm. First, we tried to replicate their results. We used Prolific, which is a crowdsourcing platform, rather than UCLA undergrads, and our humans actually did better. Who knows.
[LAUGHTER]
Of course, our participants were getting paid, and the undergrads were just getting course credit as psychology majors, whatever. So, meh. But then we said, OK, How would we make a counterfactual version of this? Well, one way is to say, What if, instead of using the regular alphabet, we mix it up a little bit? We swap some letters.
So we told people, the alphabet might be in an unfamiliar order; complete the pattern using this order. Here, for instance, we have a sequence where m now appears in a different position, and if you look at it, the problem still makes sense. And people actually did pretty well on these.
We also gave them a different alphabet, an alphabet made of symbols instead of letters. These are really the same analogy problems; they're just encoded in a different way. And we asked, How do people do? In this plot, the blue points are the humans, with error bars, plotted against the number of letters permuted in the alphabet, and here are the results for the symbol alphabet.
The other dots are different language models. And we found that humans are able to adapt pretty well to these counterfactual alphabets; their performance doesn't really change. But the language models drop way down.
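As a rough sketch of how a counterfactual item like this can be generated (illustrative code of my own, not the code from our paper), you can swap a few letters of the alphabet and then pose a successor-style analogy in that permuted ordering.

```python
import random
import string

def permuted_alphabet(num_swaps, seed=0):
    """Return the lowercase alphabet with num_swaps random pairs of letters swapped."""
    alphabet = list(string.ascii_lowercase)
    rng = random.Random(seed)
    for _ in range(num_swaps):
        i, j = rng.sample(range(len(alphabet)), 2)
        alphabet[i], alphabet[j] = alphabet[j], alphabet[i]
    return alphabet

def advance_last_letter(s, alphabet):
    """Replace the last letter of s with its successor in the given alphabet."""
    return s[:-1] + alphabet[alphabet.index(s[-1]) + 1]

alpha = permuted_alphabet(num_swaps=3)
source = "".join(alpha[0:4])   # analogue of "abcd" in the permuted order
target = "".join(alpha[8:12])  # analogue of "ijkl" in the permuted order

prompt = (
    "The alphabet is in an unfamiliar order: " + " ".join(alpha) + "\n"
    f"If {source} changes to {advance_last_letter(source, alpha)}, "
    f"what does {target} change to?"
)
print(prompt)
print("Expected answer:", advance_last_letter(target, alpha))
```

A human given the permuted ordering can still apply the "successor" rule; the question is whether a language model can do the same when the content no longer matches its training data.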
So I guess the take-home message here is that these systems are better, often dramatically better, at solving reasoning tasks that are similar to those seen in their training data. And I would venture that, in many cases, humans are able to adapt better to changes. This reflects some failures of abstract understanding.
So how can we get machines to learn and use human concepts and abstractions? Well, what are human concepts? A lot of people think of them as mental models of categories. If I think of the concept of a lecture hall, I have a mental model of that category that I can use to reason and to predict things, like predicting that people sometimes walk out of a talk.
[LAUGHTER]
Happens every time. And situations, events, we have mental models of all of these too. So let's take, just as an example, the mental model of something on top of something else, a simple spatial concept, OK?
Well, we know that concepts are compositional. If you understand one thing on top of another, you can build up things on top of other things. You can compose the concepts. And if you understand a cat on top of a television, you can also understand a television on top of a cat.
And I actually tried to get DALL-E to draw a television on top of a cat, and it absolutely refused to do that. It always drew the cat on top of the television. But humans could do it.
And concepts have causal structure. This is really important, I think; it's what enables us to make predictions, to test our hypotheses, to reason, and to have common sense. So even at a very young age, humans can reason about what's going to happen when things are on top of each other and they do some kind of intervention.
And they can also reason about how to get on top of something else. This is something that kids learn very early, too early. [LAUGHS] And you can get into trouble. They also learn about a sort of 3D topology, that your shoes go on top of your socks, so you have to put your socks on first. Very common-sense kinds of things.
But concepts can be abstracted via metaphor and analogy to new situations. To continue with the "on top of" idea: it's a spatial concept, but we also use it metaphorically. We say things like, I'm on top of the world, or at the top of one's voice, or on top of a social hierarchy, and so on.
Lawrence Barsalou, a cognitive psychologist who studied concepts, defined a concept as a competence or disposition for generating infinite conceptualizations of a category. Notice that he's distinguishing between the words "concept" and "category," which are sometimes used synonymously, more informally. We often think of categories as things that you discriminate: woman versus man, dog versus cat, on top of versus on the bottom of, and so on.
But what he's saying is that concepts are really more generative: our mental models are able to generate examples of them, in fact, infinite conceptualizations.
Lakoff and Johnson and others famously talked about how abstract concepts are learned via metaphors involving core concepts. For instance, they showed that people use physical language to describe social concepts. She gave me a warm greeting: it's not literally warm, but we conceptualize it that way. Or status is metaphorically thought of in terms of physical location: she's two rungs above me on the corporate ladder, things like that.
And there's a hypothesis, due to a lot of people, that our mental models allow us to simulate concepts. That concepts are, in fact, the ability to simulate situations, and that even abstract concepts are simulatable, situated understandings of situations.
And Josh and his coauthors have for a long time talked about understanding as something achieved through mental simulation. And Spelke and other developmental psychologists have proposed that concepts build on what she has called innate systems of core knowledge. These include things like the notion that the world is divided into objects, and that we understand things about how objects interact and how they might behave.
Numerosity: we understand concepts like something being greater than something else, or something being taller than something else, and we understand small numbers like one, two, and three. Basic geometry and topology: things being contained in other things or surrounding other things. Agents and goal-directed behavior: some things in the world are directed by goals and some things are not.
And so she and others proposed that these kinds of core knowledge systems are behind all of our concepts, and that they are actually innate in babies. How innate these things are is a controversial topic, but I think most people in cognitive science agree that they are important and are either innate or learned very early on.
And Francois Chollet at Google took this idea and said, let's figure out a way to evaluate understanding using ideas from these core concepts. He built what he called the Abstraction and Reasoning Corpus, ARC, which is a domain you can use to test both humans and machines, to see how well they understand by seeing how well they can abstract. So here's an example of a task in this domain.
Each of these tasks consists of a small number of demonstrations. You can think of these as the training examples, if you like. So I have three training examples: one that shows one grid changing into another grid, here's a second example, here's a third example, and I'm sure you see a common concept here. And you could probably say what we should do to the test input to implement that concept in that context, right?
So you've only seen three training examples, but it's very easy for you to see what's going on, what the underlying rule is. It might be a little hard to articulate it in language, but there's some object pointing in a certain direction, and out of it comes some kind of ray, a line, that goes all the way to the boundary in that direction. We're relying on all of this core knowledge we have about objects and space and so on to solve this.
Here's another example. Here you can probably see easily what's going on: different kinds of shapes are being colored different colors. And it doesn't matter what orientation or size the shape is; it's a more topological property that remains invariant.
So those are the kinds of problems Chollet posed. He created 1,000 of these tasks, published 800, held out 200 as a hidden test set, and put this on the Kaggle platform as a challenge for AI. If you don't know it, Kaggle is a platform for hosting machine learning challenges, and you can get prize money for solving them. In the end, about 900 teams submitted programs to solve these tasks.
Those programs were then tested on the hidden test data, and the winning program got about 20% accuracy. Each program gets three guesses per task, and if any one of them is right, the task counts as solved. The ensemble of the top two programs did a little better, 31% accuracy.
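To pin down that scoring rule, here is a tiny sketch with made-up grids (not the actual Kaggle evaluation harness): a task counts as solved if any of a program's three guessed output grids exactly matches the target, and accuracy is the fraction of tasks solved.

```python
def task_solved(guesses, target):
    """guesses: up to three predicted output grids (lists of lists of ints); target: the true grid."""
    return any(guess == target for guess in guesses[:3])

def arc_accuracy(all_guesses, all_targets):
    """Fraction of tasks for which at least one of the three guesses is exactly right."""
    solved = sum(task_solved(g, t) for g, t in zip(all_guesses, all_targets))
    return solved / len(all_targets)

# Toy example: the first task is solved on the second guess, the second is missed.
guesses = [
    [[[0, 1]], [[1, 1]], [[0, 0]]],   # task 1: three guessed grids
    [[[2, 2]], [[2, 0]], [[0, 2]]],   # task 2: three guessed grids
]
targets = [[[1, 1]], [[3, 3]]]
print(arc_accuracy(guesses, targets))  # 0.5
```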
They didn't actually do a formal study on humans, but they assumed that humans can do pretty well on these little tasks. And now there's more money available, if you want to enter the competition. A Swiss lab is offering 1,000 Swiss francs for every percentage point your program gets above 31%, which is pretty good. And I think 1,000 Swiss francs is about $1,000 at today's exchange rate.
OK, so it shouldn't be too hard, right? Well, they had a competition in 2023 and nobody won; nobody got above that threshold. What's the problem? These are simple problems. Well, I think a lot of them are actually kind of hard for humans. Our group got interested in this, and by going through all the tasks we noticed that some of them seemed quite hard.
And also, we didn't think the tasks really systematically tested understanding of concepts. If a machine can solve one of those tasks that I showed you, does that mean it understands shape invariance and all of those concepts? Not really; it could have been using some kind of shortcut.
So what we did was build a little benchmark we called ConceptARC, in this same domain but with new tasks. These are my two collaborators on that. The tasks were meant to test understanding in a more systematic way: if you have a concept like "on top of," it's tested in many different ways, with many different conceptualizations, not just one.
We looked at 16 different concepts, basic spatial and semantic concepts, which you can see here, and for each concept we created 30 tasks designed to be easy for humans. So we had a total of about 480 of these tasks. Here's an example. Hopefully easy, right? Delete the bottom object.
And we tested people on these tasks, on Prolific. We also tested the winning program from the Kaggle challenge on them, and we tested GPT-4. We tested GPT-4 using a text version of the grids, just using numbers to signify the color of each cell.
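Here is a minimal sketch of that kind of text encoding (my own illustration; the exact prompt format we used may have differed): each grid becomes rows of digits, one digit per cell color, and the demonstration pairs plus the test input are concatenated into a single prompt.

```python
def grid_to_text(grid):
    """Render a grid of color indices (0-9) as rows of digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def task_to_prompt(demonstrations, test_input):
    """Concatenate demonstration input/output grids and the test input into one text prompt."""
    parts = []
    for i, (inp, out) in enumerate(demonstrations, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}\nTest output:")
    return "\n\n".join(parts)

# Toy task in the spirit of "delete the bottom object," shrunk to 3x3 grids.
demos = [([[1, 1, 1], [0, 0, 0], [2, 2, 2]],
          [[1, 1, 1], [0, 0, 0], [0, 0, 0]])]
print(task_to_prompt(demos, [[3, 3, 3], [0, 0, 0], [4, 4, 4]]))
```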
And all of those got this task correct. So all these programs understand the concept of top and bottom, right? Well, no. We have to keep testing to make sure they're not using some shortcut. So here's another example. It was interesting: when we were designing these problems, it was kind of a challenge to get my collaborators to design problems that were easy for humans. They really wanted to design ones that were hard for humans, because they thought, oh, these are just ridiculously easy.
And it turns out they are easy for humans, but they're not easy for machines. On this one, where you color the top row of the object red, all the programs got it correct. But here's another one, where you delete the top and bottom objects. While humans still got 100%, these programs got it incorrect.
Now the Kaggle-winning program was designed to do these ARC tasks. GPT-4, obviously, wasn't designed to do it, although who knows what was in its training data. These were not in its training data because we designed these and never released them until we tested this.
Here are just a couple more examples, for a concept like "in the center of." Here we're taking the pixel in the center of these things; everybody got it right. Here's another example of "in the center of": extract the object in the center of the grid. Everybody got it; humans were at 100%.
Here's another: move the pixel to be aligned with the center of the horizontal line. These programs got it incorrect. And here's another paper we did, where we followed up by testing GPT-4 and GPT-4 with vision on the visual version of this. These were the overall results for humans on all of the concepts, along with the Kaggle first-place program and GPT-4 with text only. If you look at the bottom line, that's the average.
And if you've ever run an experiment on a crowdsourcing site, you know that 91% is basically 100%, because a lot of people are just not trying; it's a very noisy measure, I would say. But still, humans are way better than either of the programs. And a lot of people said, well, this is the text-only version, it's unfair: you're giving humans visual input and GPT-4 text-only input. Just wait until the multimodal version of GPT-4 comes out; then it'll solve them.
And we tried the multimodal version. We didn't have much money left after this, so--
[LAUGHTER]
--we only did it on the easiest tasks in our set, what we called the minimal tasks, on which the text-only version got almost 70% right. The vision versions actually did much worse. So the vision versions are really not very good at abstract visual reasoning at all.
So can AI understand the world? Well, I think the answer is yes, in principle; I don't see any reason why not. But up to now it has had a lot of failures of understanding and still hasn't achieved human-like abstract understanding. It may be that to understand the world in a human-like way, systems need something like these human-like core knowledge systems and the concepts that result from them.
And I don't know the answer to this, but can they achieve that without some kind of embodiment and/or active interaction in the real world? I think that's an important question that we just don't know the answer to.
Now, notice I said human-like. A lot of people will say, can't it understand in some nonhuman-like way? Can it be just a different kind of understanding? And I would accept that. But the problem is, if its understanding is different from human-like understanding, it may be hard to get these systems to actually work with us in our world, in our human world. And that would mean we wouldn't have trustworthy AI systems.
So I'll leave these questions up. I'll stop here, and I'm happy to have any discussion. Thank you.
[APPLAUSE]