Language Models as World Models
Date Posted:
September 9, 2024
Date Recorded:
August 11, 2024
Speaker(s):
Jacob Andreas, MIT
Brains, Minds and Machines Summer Course 2024
JACOB ANDREAS: What this talk is broadly about is understanding whether neural sequence models that are trained to generate text build representations of the meaning of that text and maybe even of the world described by that text. So I want to start with, at this point, a sort of oldish example from Marcus and Davis, originally actually from the late Eugene Charniak's PhD thesis.
The example goes as follows. "Janet and Penny went to the store to get presents for Jack. Janet said, 'I will buy Jack a top.' 'Don't get Jack a top,' said Penny. 'He has a top. He will, dot, dot, dot.'" And in 2020, if you took this example and fed it into what was then a state-of-the-art language model-- I think GPT-3, not ChatGPT-- it would complete it as follows. "'He will get a top.' 'I will get Jack a top,' said Janet."
Now, there are many things that are remarkable about this example. It stays on topic. It knows who's involved in this situation. It knows enough about the structure of dialogues and social conventions to know that it's Janet that's likely to speak next. And we get all of this complicated behavior just from training a generic next token predictor on a bunch of text, which was not true in 2019 and, I think, sort of unimaginable as recently as 2014 or 2015.
The other sort of interesting thing about this passage is that it is total nonsense, right? Penny says, "don't get him a top. He has a top. He will get a top." Janet says she's going to get him a top. This is not, actually-- whoops-- a conversation that you can imagine real human beings having with each other. This example is certainly fixed in modern language models, but you don't actually have to work that hard to get things that are in the neighborhood of this.
And so what all of this raises is the big question of what's actually going on under the hood to support this kind of text generation in a way that might both explain the failures and the successes. And in particular, is this just sort of babbling? Is this just a really good model of the surface statistics in text? Or is there some sort of representation and some sort of reasoning about the situation that's being described in text?
And like I said, this is an old example. Modern models can generate longer documents. They can generate documents describing sort of weird counterfactual states of the world involving unicorns that speak English that live in Peru. They can sustain, in some cases, hours-long conversations with humans that are mostly coherent over those windows.
So, again, what's going on here? And I think the mental model or a popular mental model of language models, at least in their earlier forms, was that they were just really good models of linguistic surface data, that they knew a lot about co-occurrence statistics, that they could generate grammatical sentences as long as they weren't too long, and did a really good job of manipulating strings in sort of realistic ways without needing to go through any sort of computation that required them to understand what those strings meant.
But I think once you start being able to tell long stories about complicated situations in counterfactual worlds like the famous unicorn example, I think it becomes increasingly difficult to sustain a picture or a model of how language models work that is just based on string manipulation.
And it's become increasingly popular to instead talk about world models or situation models that live inside these language models and to make claims that modern language models are building and representing and reasoning about and manipulating explicit structured representations of situations and states in order to generate the text that they generate today.
And so what this talk is about is probing into that and trying to see what we can say and what kinds of empirical evidence we can produce for or against the presence of things that look like world models inside these language models. So that's going to be most of this and then maybe a little bit of more sort of higher-level philosophizing at the end about what it really means to have a world model. But yeah, let's dive right in, starting with the sort of actual empirical questions about what's going on here.
So just to very briefly set the stage, and I imagine this is a review for people in the room at this point, by language model, we're talking about transformer autoregressive next token prediction models. So we have a sequence of words as input. To every sort of word in the sequence, we're going to assign some sort of high-dimensional vector representation using something that looks like an attention mechanism and some feed-forward layers.
We're going to stack a bunch of these things on top of each other such that if I look at the final representation of the final word in any piece of text, I can use that to make a prediction, say, place a distribution over all the words that might come next.
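As a minimal sketch of that readout step-- with made-up dimensions and a random stand-in for the stack's output, not any particular model's actual weights-- the next-token prediction at the final position might look like this in Python:

    import torch

    d_model, vocab_size = 768, 50257                  # assumed, GPT-2-like sizes
    hidden_states = torch.randn(1, 12, d_model)       # stand-in for the stack's output: (batch, seq, dim)
    unembedding = torch.nn.Linear(d_model, vocab_size, bias=False)

    last_token_state = hidden_states[:, -1, :]         # representation of the final word
    next_token_logits = unembedding(last_token_state)  # one score per vocabulary item
    next_token_probs = torch.softmax(next_token_logits, dim=-1)  # distribution over next words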
OK, so what would it mean for a model, and especially for a model that looks like this, to reason about a situation that's being described in text? And here's one cartoon, which is certainly not the only way to implement this, but that comes from the sort of dynamic semantics literature in linguistics.
When we say Janet and Penny went to the store to get presents for Jack, we're going to build some sort of explicit symbolic representation of the state of the world that includes both all of the entities that we know about in this sort of world described in the story. There's a store. There's a person named Janet. There's a person named Jack. There's a top. We know some information about the relationships between these things. We know that Janet is going to or is maybe already at, after this sentence, the store. We know that the top is located in the store that she's going to buy.
And as I'm drawing these sort of graph-shaped state representations, it's important to pay attention both to what's represented here, which corresponds to what we know based on what's been said so far, and what is not represented here, what we don't know about this situation. So it is possible in this current state of the world that Jack already-- or that the top that's in the store is purple, that Jack is already in the store and has already bought that same top, and so on and so forth-- so various pieces of information that are compatible with this world but that haven't yet been described.
So we're going to call these things information states. They represent everything that we know about the current state of the world in these very simple sort of graph-structured relational terms. We have some objects. The objects have some properties. The objects have some relations between them.
And if I now add a sentence to this document, if I say Jack already has a colorful top, I'm going to think of what that sentence does is specify some sort of update to this underlying state representation. So I know now after the second sentence that there's some other top in the world, that that top belongs to Jack, that the top is colorful, and so on and so forth.
And similarly, if the sentence were-- or if we instead added a sentence that said she gave it to him, in addition to various other consequences that we're not showing, one of the side effects of this sentence is that Jack is now going to possess that top that Janet possessed earlier on. And this is something that's never explicitly stated in the text, but that sort of follows logically from it. If A gives B to C, then as a consequence of that, A no longer has B, and C does. OK, good.
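A toy rendering of such an information state-- my own illustration, not anything from the talk's slides-- is just a set of (subject, relation, object) triples, with a sentence like "she gave it to him" modeled as an update function:

    # Information state as a set of relational triples.
    state = {
        ("Janet", "at", "store"),
        ("top1", "located_in", "store"),
        ("top1", "belongs_to", "Janet"),
        ("top2", "belongs_to", "Jack"),
        ("top2", "is", "colorful"),
    }

    def update_give(state, giver, item, recipient):
        """'giver gave item to recipient': transfer possession, a consequence never stated in the text."""
        new_state = {t for t in state if t != (item, "belongs_to", giver)}
        new_state.add((item, "belongs_to", recipient))
        return new_state

    state = update_give(state, "Janet", "top1", "Jack")
    assert ("top1", "belongs_to", "Jack") in state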
So this cartoon comes from the dynamic semantics literature in linguistics and the philosophy of language. For those of you who have encountered formal semantics-- I think especially people in NLP and computational linguistics-- it's more often the Montague-style, truth-conditional picture, where the meaning of a sentence is a function from possible worlds to truth values. Here, we're thinking of what a sentence does as specifying some sort of update to one of these world models, basically specifying a transition in this underlying state dynamics.
So what I want to claim is that this is at least a useful framework for starting to think about what world models might look like inside neural sequence models. And one reason to think that this might be part of what's getting encoded or something that would be useful to encode in the process of language generation is if I had access to a state representation that looks like this, it would help me do all kinds of downstream language generation tasks.
It's easy for me to figure out what I'm allowed to say next or what I'm allowed to refer to by sort of consulting what entities are already available here. It's easy to figure out what's allowed to happen next by simulating this model forward in time and looking at the kinds of states that could result and then describing those states in text. It's easy for me to do other tasks that we care about, like maybe an entailment judgment in NLP, by just comparing these graphs to each other, and so on.
Now, obviously, the models that we have don't actually build nice, discrete, algebraic graph-structured meaning representations like this. There's no supervision for them. And more fundamentally, there's nowhere to put a representation that looks like this inside a big neural model. But to the extent that we think these things are useful, it's reasonable to ask whether these sort of graph-structured meaning representations or something like them are maybe represented implicitly, maybe even in a half-baked way inside this bag of vector representations.
Another perspective that I think is useful for understanding what kinds of representations language models might be constructing is just to think about language generation generally as a latent variable problem. How does a document or at least a story get written? Well, the world is in some underlying state. It passes through some sequence of other state transitions that we want to describe.
There are rules that are not rules of language, that are rules of physics and how social interactions work and so on and so forth that specify what kinds of transitions are allowed up top. And if we know what those rules are, then we can place a distribution over plausible state transitions.
And so when we're generating text, figuring out what sentences might come next given some prefix that we're trying to extend involves inferring what states might have been compatible with the initial sentences that we saw, what kinds of states might result, and what things I might be allowed to say about those.
And so fundamentally, a good way to do language generation and maybe the only way to really, really, really reliably do language generation is to solve some sort of inference that looks like this that involves figuring out what was the underlying state of the world described by text, simulating that forward, and then figuring out how to talk about it.
So if language models are solving this next sentence prediction problem, by doing this kind of inference, then we would expect them to produce representations that encode this distribution over possible states of the world. And so what we're going to try to do now, finally, concretely, is to look for some sort of representation of that distribution.
So the setup for this is going to be as follows. We're going to look at a language model. We're going to be sort of picturing these things generically as encoder decoder models. But if you want to think about a modern autoregressive thing, just think of this as everything up until the point where we're predicting a single next token, or whatever transformer computation is happening downstream of the representations that we're going to look at.
We're going to gather up a bunch of documents where we have access to some ground truth representation of the underlying state of the world, either because the text was machine generated from these states or because some human went in and hand annotated documents with a bunch of these states or hand annotated-- we showed them a sequence of state representations, and then they labeled those with text-- but some paired data set where we have language that looks like this on one hand and state representations that look like this on the other hand.
And what we're going to try to do is figure out whether we can decode these sorts of things from the internal representations that are being built by our language model. So we're going to train what's now called a probing model, which is basically a teeny, little-- basically, we're going to take our big language model. We're going to freeze its parameters. And we're going to train some teeny little decoder that's going to look at the internal states of the big language model and try to read this structured representation off.
These are big, complicated objects. So we need to be a little bit clever about how we do this. And in particular, we're just going to reconstruct or we're going to read the state of the world off one edge, one proposition at a time, right? So if I want to figure out whether in the state of the world that I'm looking at here, there is a locked door, I'm going to train some little model that's going to take as one input a representation of just this edge, the door is locked, as another input, some hidden state from inside my neural model. And I'm just going to try to predict whether this thing ought to be present in my state representation or not.
Concretely, the way we're going to do this is we're going to represent these edges also as little natural language descriptions. We're going to have some other language model that just encodes those and gives us vector representations of these propositions.
For looking inside the language model itself, remember that what these models actually do is assign a separate representation to every word in the document. We're just going to pick one of these representations to probe, and we'll come back to the choice of which representation to look at later on.
And then, the actual machine learning part of this model that makes these predictions here is going to be the simplest thing possible. It's just going to be a little matrix, a little linear model that takes in, on one hand, this vector, on the other hand, this vector, and just assigns some sort of scalar score to whether this edge is likely to be present in the state of the world or not.
And it's really important here that this is a very, very, very simple model. If this was a whole sort of arbitrary, deep neural network or whatever, we wouldn't necessarily be able to convince ourselves that what we were seeing was evidence that the language model was building these representations. Instead, we'd be seeing evidence maybe that this model is itself learning to parse these documents and generate these kinds of structured state representations for us.
But because we're just going to learn a linear map here, what this means is that if we can do this task reliably, if we can read off these state representations from the LM, then those things are already encoded, up to a linear transformation, in the representations produced by the big LM.
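As a rough sketch of this probing setup-- with assumed hidden sizes and random tensors standing in for the frozen LM states and frozen proposition encodings-- the only trained object is the single matrix W:

    import torch

    d_lm, d_prop = 1024, 1024                                  # assumed hidden sizes
    W = torch.nn.Parameter(0.01 * torch.randn(d_lm, d_prop))   # the probe: one linear map
    opt = torch.optim.Adam([W], lr=1e-3)

    def probe_score(h_lm, e_prop):
        """Scalar score for 'this proposition holds in the state encoded by h_lm'."""
        return h_lm @ W @ e_prop

    # One training step on a (hidden state, proposition encoding, label) example.
    h_lm = torch.randn(d_lm)       # frozen LM hidden state, e.g. over a mention of the door
    e_prop = torch.randn(d_prop)   # frozen encoding of "the door is locked"
    label = torch.tensor(1.0)      # the proposition is true in the annotated state
    loss = torch.nn.functional.binary_cross_entropy_with_logits(probe_score(h_lm, e_prop), label)
    loss.backward(); opt.step(); opt.zero_grad()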
So to the extent that you don't believe a linear model on its own could do all of this complicated semantic inference, then all this probe is really doing is translating. And we'll see some evidence later on, beyond just the simplicity of the probe, that that's actually the right way to think about what's going on.
So just to say this one more time, we have our language model. Our language model produces some representations. We're training a little linear model to try to decode those representations into these structured state representations one edge or one node label at a time. Yeah?
AUDIENCE: The only thing being trained are the weights of the classifier?
JACOB ANDREAS: That's right. The only thing being trained is this one matrix W. And in particular, we're going to share this across every representation, or every node or edge label, over here.
AUDIENCE: So on the left is some kind of large model?
JACOB ANDREAS: Yeah, so on the left is the language model about which we want to claim it has a world model or it doesn't have a world model. So think of-- in the first experiments I'm going to show, because this is oldish work, these are smallish models. These are BART, T5 kinds of things. Later in the talk, we'll see some larger-scale models. But yeah, this is a big pretrained language model that we downloaded from somewhere. And this probe is the thing that we're training.
AUDIENCE: And the one on the top is just a smaller--
JACOB ANDREAS: So for the experiments that I'm going to show now, these are actually the same model. You could also learn all of these encodings separately. And again, we'll see some variations on that later on. Yeah?
AUDIENCE: Yeah, what kind of texts-- what is the training data we're going to use?
JACOB ANDREAS: So the training data, like we said before, looks like this. So we have a bunch of documents. We have a bunch of--
AUDIENCE: So are they some variants of this text?
JACOB ANDREAS: Sorry?
AUDIENCE: Some variety of these kind of texts, like a locked door text?
JACOB ANDREAS: Yeah, yeah, yeah. But we're going to train, we're going to evaluate on totally held-out situations.
AUDIENCE: OK.
JACOB ANDREAS: Yeah, and I'll say a little bit more about the training data in a minute. Yeah?
AUDIENCE: When you say semantics, is there a way to make the difference between lexical semantics and the composed semantics of the phrase? Because, I guess, we have the words door and lock, and so in light of the similarity there, are you making a distinction between the two, or are both OK?
JACOB ANDREAS: Well, so what we're going to need to-- so what does it take to actually do this task to say 100% correctly? It certainly requires not just knowing that door and lock are similar to each other, but that after, I guess, going back to the original example here, after you unlock the door, then the representation of the door should change to reflect the state change. Now, here, this is a simple example. Maybe it will be easiest to think about this actually in the context or by looking at what some of the real environments look like.
So the experiments that I'm about to show are on two different data sets. One of them is this alchemy data set that basically describes sequences of operations on beakers full of colored liquids. So you have some initial state representation or some initial textual description of a state that says, there's a beaker with two units of green liquid, a beaker with one unit of red liquid, and so on and so forth. All the model is going to see is a sequence of things like pour the green beaker into beaker two, then into the first, and then mix them.
And if you're really modeling what's going on here, if you're representing these situations, then you need to know that as a consequence of pouring the last beaker into beaker two, the last green beaker into beaker two, the world is going to look like this after mixing. Oh, no, and then you pour it into the first. It's going to look like this after mixing. The color is going to change.
And importantly, you sort of expect to need to learn these things just to be a good language model for sequences of instructions like this because the instructions are never going to ask you to do impossible things or nonsensical things, like mixing a beaker in which everything is already the same color or emptying out an empty container or pouring this container into a container that's already too full and that would cause it to overflow.
But these kinds of inferences require certainly more than just lexical semantics because they do need you to keep track of the underlying state of the world. So this is one of the environments we're going to look at. And then the other one that looks more like the examples we were seeing before are these sorts of text adventure games where you're sort of walking around an environment. You're picking up objects. You're opening or closing doors and so on.
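For concreteness, here is a toy alchemy-style simulator-- purely illustrative, with an assumed initial configuration-- showing the kind of state tracking the instructions implicitly require:

    # Beakers hold colored units of liquid; instructions update the state.
    beakers = {1: ["green", "green"], 2: ["red"], 3: []}   # assumed initial description

    def pour(state, src, dst):
        state[dst].extend(state[src])
        state[src] = []

    def mix(state, b):
        if len(set(state[b])) > 1:        # mixing same-colored liquid would be nonsensical
            state[b] = ["brown"] * len(state[b])

    pour(beakers, 1, 2)   # "pour the green beaker into beaker two"
    mix(beakers, 2)       # "then mix it" -> the color changes
    print(beakers)        # {1: [], 2: ['brown', 'brown', 'brown'], 3: []}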
OK, so what happens when we try to train this probing model? The first thing to note-- so what we're looking at here is, for all of the objects that are mentioned in one of these stories, either one of these sequences of beaker instructions or one of these playthroughs of one of these text adventure games, on a state-by-state and object-by-object basis, for what fraction of objects can I perfectly recover the true state of the object, so all of its properties and all of its relations with other things?
And the main thing to notice here is that you can actually do this quite well, even in relatively small, sort of circa-2019 models. In this alchemy environment, you can get the underlying states of these beakers with about 75%, 76% accuracy. In these TextWorld environments, you can do much better, 95%, 97% accuracy.
Now, one important thing to say here is that there are a bunch of very, very simple baselines that also get reasonably high scores. In these alchemy environments, if you assume that nothing ever changes its state, that already gets you 63% accuracy. And if you just guess that things are in their most frequent state in the entire training data set without really building any kind of language model representation at all, that gets you at least non-trivial accuracy.
You can also evaluate, rather than what fraction of objects I got exactly right, what fraction of entire states of the world I got exactly right. The numbers are all much lower, maybe unsurprisingly, but the same goes for these baselines. And similar things hold over in TextWorld. The main takeaway here, I think, is just that you can do this surprisingly well, even with relatively simple models. Question, yeah?
AUDIENCE: So how do you know it's detecting that particular state versus something that co-occurs with that state?
JACOB ANDREAS: What do you mean by co-occurs with?
AUDIENCE: Someone walking through the door could also co-occur and have a high similarity to the door unlocking.
JACOB ANDREAS: So maybe, I mean, another thing that you can do to evaluate this is to see whether you can actually control model generation. And we'll see an example of that in a minute. It is true that, fundamentally, if event X and event Y always co-occur, then maybe I don't even expect in the underlying state representation those things to be distinguished from each other. And there's no reason for the model to learn separate representations of I walked through the door, and the door is open. I walked through the door implies that the door is open. So you probably after seeing that sentence do want your state representation to encode this thing.
A sort of deeper question here is whether you-- what we're reading here are representations of basically I'm piling up on top of every object all of the things that have been said about the object, and the probe is just saying, can I see whether a particular thing has been said about this object or not?
And I think that is actually probably true, or at least what these world models, as we're seeing them in these experiments, look like under the hood is basically keeping around the most recent text, and all of the implications of the text, that was predicated of this particular object. But they are structured and localized and causally implicated in model behavior in a way that we're going to see in a minute. Yeah?
AUDIENCE: And the fact that you get surprisingly good results with a no-language model linear predictor, doesn't it mean that the linear predictor has maybe surprised [INAUDIBLE] while building a [INAUDIBLE] model?
JACOB ANDREAS: Well, I think the way to-- so, first of all, these, neither of these things are actually training the linear predictor. These are doing other kinds of trivial things. Another thing that you can do that maybe gets at that question more is to say, what happens if I, rather than taking a pretrained model off the shelf, take a randomly initialized model and try to train the same probe?
Yeah, and again, you do surprisingly well non-trivially but maybe just comparing these things side by side, not actually better at least in this alchemy environment, which is a slightly more complex one, than not having access to a language model at all.
Oh, like an even simpler class of probes to-- yeah, well, so we'll-- I mean, what that would actually look like, I guess you can force it to go through some super low-dimensional bottleneck or do some fancy MDL thing. I mean, yeah, there are lots of other tweaks you could make to the probe architecture and to how you're actually estimating it. But for now, we're just looking at linear probes and sort of comparing them to these baselines.
So one of the things that we glossed over before was this question of which representation we should actually pull out of this model in order to figure out what the current state of some entity is. And so what we're going to look at now is actually what the implications of that particular choice are.
And the way we're going to do this is we're going to take our probe-- say we're trying to probe the state of beaker number three right here-- and we're going to point it at different sentences in the initial state description, the description setting up the initial state of the world in this beaker task.
So maybe we pointed at the word "has" in the sentence describing the third beaker, and we get 64% accuracy. We point it at all of the other words, both in the sentence describing the third beaker, and maybe some sentences describing other beakers instead. And we repeat this experiment.
And what we see is that there are actually pretty significant differences in the accuracy that you get in these places. So to the extent that the model is representing information about the state of this blue beaker, it seems to be localizing it to this initial description of its state even though what we're trying to probe out here, what we need to predict at the end of this document is not that the beaker contains four blue things, but instead, that it winds up empty.
And so coming back to the question of what this tells us about how these representations are organized, it says something about them being localized to mentions of the objects that are being discussed here. You can do this experiment also looking at final mentions rather than initial mentions. And the accuracies are pretty similar, suggesting that, at least in these sort of encoder-decoder models-- where, importantly, this guy gets to attend to this, which is not true in modern, purely autoregressive models-- you localize the information across all of the mentions.
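A sketch of that localization experiment, with assumed shapes and random stand-ins for the trained probe and the per-token LM states, is just to apply the same probe at every position and compare:

    import torch

    seq_len, d_lm, d_prop = 40, 1024, 1024
    hidden_states = torch.randn(seq_len, d_lm)   # stand-in for per-token LM states
    e_prop = torch.randn(d_prop)                 # encoding of "beaker 3 is empty"
    W = 0.01 * torch.randn(d_lm, d_prop)         # a trained probe would go here

    scores_by_position = {
        pos: torch.sigmoid(hidden_states[pos] @ W @ e_prop).item()
        for pos in range(seq_len)
    }
    # Aggregating these over a test set, position by position, gives the kind of table in
    # the talk: mentions of the entity itself decode its state far better than other tokens.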
And so the last question is whether what we've found here is telling us anything about model behavior and not just some sort of correlation that the probe has picked up on that's not really sort of causally implicated in the model's predictions at all. I'm going to start by doing a very crude version of this experiment now, and then I'll show you a fancier version of this later on.
What we're going to do here is we're just going to take a pair of documents, one of which has as its final consequence that the first beaker is empty, and another document, which has as its final consequence that the first beaker is not empty, and the second beaker is empty.
And if this hypothesis that we've made about how the language model is sort of representing the underlying state of the world is right, then what I expect is that if I build a sort of Franken representation that takes the representation of this first beaker from the first document and of the second beaker from the second document, and I just paste these things together in a way that doesn't actually correspond to any text that I could have fed into the model at all, I should nonetheless wind up with a representation of the world that looks like this that has as its final consequence that both of these things are empty.
And in fact, if you look at the text that the language model generates in this state, so sort of concretely, we expect that it will say things like, empty the third beaker because that's well-formed. We don't expect that it will say things like stir the red beaker because the red beaker is now empty. And if you do this, you see that, in general, you get text that's consistent with this state much more often than the other two states.
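A rough sketch of the splicing intervention, with assumed token positions and shapes rather than the paper's exact setup:

    import torch

    seq_len, d_lm = 60, 1024
    states_doc_a = torch.randn(seq_len, d_lm)   # document where beaker 1 ends up empty
    states_doc_b = torch.randn(seq_len, d_lm)   # document where beaker 2 ends up empty
    beaker1_positions = [3, 4, 5]               # token positions mentioning beaker 1 (assumed)

    spliced = states_doc_b.clone()
    spliced[beaker1_positions] = states_doc_a[beaker1_positions]
    # Decoding continuations from `spliced` should now describe a world in which BOTH
    # beakers are empty, even though no real document produced this memory.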
So another thing that you can do here, once you have this sort of basic machine for probing out what the model thinks is true about the world at a particular state in time, you can also start to predict in a finer grained way what kind of text it's going to generate. And so here, we're going to try to use this as a tool for predicting ahead of time whether a language model is going to hallucinate or not in the sense of generating some text that contradicts the input.
And the way we're going to do this, if we're hypothesizing that now we have a piece of text that says "Gordon focuses on cases pertaining to business litigation, business reorganization, bankruptcy litigation, small business restructuring, blah, blah, blah, blah, blah, blah, Gordon has the occupation of dot, dot, dot," we expect that if the model has accumulated all of this information, has inferred that Gordon is a lawyer, that the representation of this last word, Gordon over here, will actually encode that he's a lawyer and not some other thing.
And so we're just going to encode a bunch of propositions, "is an attorney," "is an accountant," "is a judge," and figure out which of these things looks most similar, or to which our probing classifier assigns the highest score, when applied to this word, "Gordon." And if this is higher than all of these other things, then we predict that the model is going to generate correct text. And if it picks one of these other things instead, then we predict that it's going to hallucinate in the sense of contradicting this input.
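As a sketch of that hallucination predictor-- the candidate strings, shapes, and random vectors here are placeholders, not the real dataset or trained probe:

    import torch

    d_lm, d_prop = 1024, 1024
    h_gordon = torch.randn(d_lm)            # LM state over the final mention of "Gordon"
    W = 0.01 * torch.randn(d_lm, d_prop)    # trained probe

    candidates = ["is an attorney", "is an accountant", "is a judge", "is a nurse"]
    encodings = {c: torch.randn(d_prop) for c in candidates}  # frozen proposition encodings

    scores = {c: (h_gordon @ W @ e).item() for c, e in encodings.items()}
    predicted = max(scores, key=scores.get)
    will_hallucinate = predicted != "is an attorney"   # the property supported by the input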
And here, for this real example, this is the proposition that gets the highest score. And in fact, the model generates the word attorney in this context. And maybe more interestingly, you can do this in, or rather, this works in cases where the language model is going to hallucinate.
This comes up especially in cases of bias. One of the motivating examples here is this particular model that we're looking at, which I think is GPT-J, thinks that a person named Anita is a nurse no matter what other context you provide about that person's biography. And we can actually see that happening, that if you just look at the final representation of the word "Anita," no matter how much additional context you pile on, nurse is still the most likely prediction here. Cool.
And another-- so here, we've been looking at hallucination in the sense of contradicting information provided in the input. You can also use this as a way of probing models' background knowledge about the world by just taking sort of just the string "Sundar Pichai" or just the string "Carlos Bocanegra" and looking at what kinds of things that the model is predicating of that.
And so what's cool is that this tool that we originally developed for sort of monitoring dynamic state in stories, you can apply just in exactly the same way to also figure out things about how models encode their background knowledge about the world that they learn from the training data rather than that they learn from the input in representations of words in the input.
So so far, I've been showing qualitative examples, but just here's some sort of numerical things about how well you can do with this. Let me show the very last one here. The pink bars are two different ways of training this probe that we've been looking at before. The gray bar here is what happens if, rather than even trying to learn that probe at all, you just fix it to be an identity matrix.
And that actually works surprisingly well, which tells us now something about the actual encoding scheme being used by this model, that the representation of "Sundar Pichai" looks like maybe a sum of all of the other things that the model knows about him, including the encoding of works for Google. And this is maybe something that should not be super surprising if we think about what's known about word embeddings and analogies with word embeddings even in much simpler models. Cool.
And coming back to the state manipulation experiments that we were doing before, another thing that you can do with this tool is actually use it to control generation. So if I know the representation, or the sort of direction in representation space, that corresponds to being the CEO of Google, I can subtract that out and add some other thing back in instead and cause the model to generate text that describes Sundar Pichai not as the CEO of Google but as the CEO of Apple instead.
And a cool thing about this is that it works sometimes even in cases where just providing a textual prompt to the model doesn't work. And so again, in a sort of smallish model-- I think this was GPT-J-- if you just prompted it with the sentence, "Sundar Pichai works for Apple," and then asked it to generate completions, the completion that you get is "Sundar Pichai is the CEO of Google." So it's ignored this initial piece of input. Whatever background knowledge it had has overridden the information that you've provided in the context-- the same phenomenon that we were seeing with Anita the nurse before. And it just ignores what you wrote.
And actually, one of the motivating examples for this entire paper here was Evan, one of the students who was working on this, was trying to get the model to write a story about a 27-year-old firefighter named Barack Obama. And it just absolutely refused to do it. Bigger models will do this now.
But importantly, even in these smaller models, once you understand how their knowledge is represented and how that gets encoded in embeddings, we can just manipulate the representation directly so that the probe predicts that Sundar Pichai works for Apple and doesn't predict that Sundar Pichai works for Google. And if you do that, you actually get text that's consistent with this modified state of the world, even in cases where prompting doesn't work.
And so you can do this for sort of biographical things. You can cause puff pastry to have been invented on the internet. You can turn Dodge into a plane company. And in particular, this works about as well, and maybe a little bit more precisely, than what was then a state-of-the-art model editing approach. Yeah, in the back?
AUDIENCE: Yeah, sorry. Maybe you mentioned this, but where or how exactly do you intervene on the representation?
JACOB ANDREAS: Yeah, I guess I went through this pretty quickly. So because we have the probe represented as a linear transformation that looks like this, all you have to do is-- basically, if you hypothesize that this blue thing here consists of the fact encoding times W plus a bunch of other stuff, then all you have to do to make the change is subtract out the old fact encoding times W and add in the new fact encoding times W. Yeah?
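Written out as code-- a sketch with assumed shapes, following the description above rather than the exact implementation:

    import torch

    d_lm, d_prop = 1024, 1024
    W = 0.01 * torch.randn(d_lm, d_prop)   # the trained probe matrix
    h_entity = torch.randn(d_lm)           # hidden state over "Sundar Pichai"
    e_old = torch.randn(d_prop)            # encoding of "works for Google"
    e_new = torch.randn(d_prop)            # encoding of "works for Apple"

    # Swap the fact: remove the old fact's contribution and add the new one's.
    h_edited = h_entity - W @ e_old + W @ e_new
    # Generation then continues with h_edited substituted for h_entity at the chosen layer.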
AUDIENCE: I was wondering after-- so I had two questions. So one was I guess you could intervene on many layers. And so is the true value sort of like a union bound over the probability of all of the layers? Or is it like-- do you find the semantics for one layer? Or how does that work?
JACOB ANDREAS: Yeah, that's a great question. So for-- and this has changed a little bit as we've gone through iterations of this. So for these experiments, we're using, I think, just the last layer of the model. And then, for these experiments and everything that comes after, we're-- part of the training procedure is also a search over what's the right layer in which to either do the read or to do the intervention.
Interestingly, in a lot of cases, you want to do-- and there's been work going on in parallel, especially for background knowledge about the world. There are specific layers that are intermediate to the models where that knowledge seems to get encoded. And most of these interventions work best if you do them actually before the knowledge readout.
And if you actually look at what the size of the intervention that you need to make in this new fact direction to cause changes in the models, it's quite large. And we think basically what's going on is you have to both suppress whatever knowledge retrieval mechanism the model would natively invoke on this entity and add, sort of supply the new information on its own.
AUDIENCE: And then, I was wondering also, after you've changed it, so after you prompt it with "Sundar Pichai is the CEO of Apple," then if you ask again, who is the CEO of Google, does it still say Sundar Pichai, or does it say, I don't know--
JACOB ANDREAS: Great question. So because we are only manipulating the representation of this entity right here-- in particular, all of the experiments that I'm showing right now, we're sort of manipulating representations and not weights inside the model. So a lot of the other work that's gone on editing pushes this all the way back into weights. And there, people have found--
Well, and so here in particular, because we're only manipulating this and we haven't changed the representation of Google at all, certainly, it's going to continue-- if you say who's the CEO of Google and these words don't show up in the input, it's going to say he's the CEO of Google still. And you would need to do a corresponding change to the representation of Google if you wanted that to be globally coherent.
An interesting thing is that basically all of the weight editing methods that exist right now, even though in principle they can make global changes like that, in practice, they don't seem to. And we'll come back at the end of the talk to why I think that is and what we would need to do to fix it. Yeah?
AUDIENCE: In the first example, do I understand correctly that the fact that it changed his role, so that he's no longer the CEO, is an undesirable effect, at least in theory? Maybe this is a small detail, but I think the only thing we would want to change is the company.
JACOB ANDREAS: Oh, you mean that it says he's the vice president rather than the CEO? Yeah.
AUDIENCE: Like ideally-- I understand that this is minor-- but ideally, we would want him to remain a CEO of Apple. And so that means that the factual--
JACOB ANDREAS: Well, yeah. I mean, I think yes. Actually sort of formalizing why that's the right intuitive thing, basically what is this edit operation supposed to do--
AUDIENCE: Although you changed the "works for" to Apple.
JACOB ANDREAS: Yeah. But basically, in the counterfactual world, what's the semantics of these edits is actually a very complicated question and one that we were not treating in a very precise or formal way here. And so yeah, I think maybe you would at least expect at baseline that it will keep his rank in the company the same and just move him somewhere else. And that doesn't happen here. So it is a little sloppy. And of course, it doesn't always work. So here's an example down at the bottom where we try to move Putin to Denmark and fail.
You can also-- a fun thing that you can do here is to use this to redefine words rather than just change sort of factoid knowledge about famous things. So here's, I think, modifying the definition of the word "fork" so that it is used for chopping wood. And one of the cool things that you get-- and I think this, coming back to what Brian was saying, mostly has to do with correlations between features that the model has seen in training-- is that if you intervene on the representation of fork so that the state representation thinks a fork is used for chopping wood, that also increases the probability that it has a handle, that it's used for cutting, that it's used for killing, that it's dangerous.
Interestingly, it also increases the probability of something else somewhere on here-- now I can't find it-- I think it's "is made of wood" or "has leaves" or something that is basically just picking up on co-occurrence with trees rather than actually being a killing instrument. So definitely, nothing that I'm showing you here is surgically precise, and you do get bleed-over into things that look more like surface correlations instead.
So one question that you might still have at this point is, can we say any-- so we've shown that we can read this information off of models with linear probes. We've shown that if you do sort of fairly crude interventions into the models by editing back through these probes, you can change model behavior in a controllable way. But that doesn't actually say whether the computation that the model is performing internally actually looks anything like the computation that our probe is performing.
And so a sort of question that all of this leaves open is, how are the LMs themselves actually decoding the information that is written into these representations and using that to inform next token prediction? And a sort of reasonable baseline hypothesis given-- yeah, so just to say, again, what we know from all of this is that in a sentence like "Miles Davis plays the--" where the model is able to predict trumpet, we know that you can read off "is a trumpet player" from these representations of the name Miles Davis. And you can do that linearly.
You also know that what the language model is actually doing on top of this is some complicated thing involving a bunch of attention mechanisms and a bunch of multilayer perceptrons and so on and so forth. Can we say anything more precise about the actual form of this computation or what's going on inside these models?
And a reasonable hypothesis to have given how effectively all of these linear probes work is that the internal computation being performed by the language model itself is also basically linear, that as much as the sort of language model is able to express more complicated, interesting things, what it is doing is also just linearly reading information off of these representations and feeding that directly into the prediction mechanism.
And a way you can test this is just to say, well, let me try to approximate my entire language model to first order in contexts requiring these predictions. So let me see if I can explain the language model's predictions, again, by approximating them with a single weight matrix that I'm going to derive now, not by training a supervised probe, but instead by taking a sort of first-order approximation to the model itself in a bunch of contexts that I expect to involve this "retrieve the instrument that this person plays" prediction.
And so I'm just going to compute this Jacobian here. This thing is just a matrix that is this first order approximation to what the language model is actually doing. And if this hypothesis is right, then what I expect this matrix to encode is exactly this plays the instrument relation where, previously, we might have been training some sort of supervised probe to compute this relation.
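A sketch of that first-order approximation, using a stand-in function for the frozen transformer computation downstream of the subject representation (the real thing would be the model itself, not this toy nonlinearity):

    import torch

    d_lm = 1024

    def lm_readout(h_subject):
        # Stand-in for "run the rest of the transformer and return the state used to
        # predict the object (e.g. 'trumpet')"; here just a fixed nonlinear function.
        return torch.tanh(h_subject) * 2.0

    h_subject = torch.randn(d_lm)   # state over "Miles Davis" in some context
    J = torch.autograd.functional.jacobian(lm_readout, h_subject)   # (d_lm, d_lm)
    bias = lm_readout(h_subject) - J @ h_subject
    # For a new subject h_new, the linear approximation predicts J @ h_new + bias; the
    # question in the talk is whether this matches the model's actual behavior for the relation.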
And so you can do this, you can do this for a bunch of different relations. And here's what happens. What we're looking at on the x-axis here are just a bunch of different relations for which we tried to find these linear representations. And on the y-axis is how well these things actually work at predicting the associated property for new entities that weren't involved in the training of these probes or the computation of this Jacobian.
And maybe, the really-- I mean, I think the interesting and surprising thing here is both that this works super well a lot of the time. And this doesn't work at all a lot of the time, even in cases where models are actually able to generate the right prediction. So the way to read this is that it's saying--
So actually, the highest scoring thing is, what's the occupation that's stereotypically associated with a gender? Models are great at that, and it's linearly encoded in the representation of the occupation. Similarly, sort of lower-level linguistic stuff like comparative forms of adjectives, largest cities and countries, all of this is linearly decodable from word representations. And that is actually what the model is doing to do next word prediction.
If we look over at the other end, CEOs of companies, parents of famous people, evolved forms of Pokemon for the Pokemon players in the room, models know a ton about this stuff as well. And it does not seem to be at least linearly read out, that if you do this first order approximation, that doesn't give you a good description of model behavior. And it doesn't actually let you do these predictions.
And so I think this paints at least a much more complicated picture of the story that I've been building up to this point, that there's a bunch of stuff that models do encode in this nice, clean, linear way, and there's a bunch of stuff that they just don't, either because they don't think of company CEO as corresponding to a single coherent relation or because that readout is just not linear at all.
And building on this, you might ask, well, how far can you push this notion that what we're actually interacting with here, via these linear probes, is the language model's knowledge representation? And another way of getting at this is just saying, well, what happens if I take a big question answering data set and compare two things: putting the questions in this data set into the model and measuring the model's accuracy, versus training some sort of probing classifier, like we've been training before, to just read the answers to these question-answer pairs off the language model itself, without actually requiring the language model to generate any text at all.
And if it's really the case that all of the knowledge is accessible to these probes and linearly decodable and all of that, then you would expect these things to agree basically all the time, right? In cases where the language model is answering the question right, the probe should be able to see that it's answering the question or that it sort of knows the right answer to the question, and vice versa.
And in fact, what we see is that this isn't the case really at all. So these probes actually do better, suggesting that there are cases where models sort of encode some piece of information internally but don't generate that when asked a question. So a model might know that Sting is not a police officer.
But still, if you ask, "Is Sting a police officer," produce the answer yes. And there are lots of reasons we can give for why this might be the case. But in general, these sort of probing methods are a little bit better at recovering from models what's true about the world than just asking these models questions.
On the other hand, there's just a lot of disagreement between these things. So what I'm highlighting here, on a fact verification data set, are cases where the probe is correct with high accuracy and the language model is incorrect with very high accuracy, and cases where the probe is extremely uncertain and the language model nevertheless gives the right answer with extremely high accuracy.
And so again, I think what all of this points to is just that there's quite a lot of heterogeneity in how prediction works inside these models and that this nice, clean, sort of linear readout story is part of the story but definitely not a complete description, both of what language models encode in their representations and how they encode it. Yeah?
AUDIENCE: What's the sampling procedure here to get the actual output? So was there a particular temperature?
JACOB ANDREAS: I think these are the just most probable outputs from the model.
AUDIENCE: Is there any kind of setting that we can think about where it would actually be more similar to the probe, where it would actually encode what the probe is saying? In some sense, what we're seeing here is that the most probable output doesn't necessarily match what's actually encoded. But when you actually sample across the distribution, you are maybe considering weighting different values. I don't know if that's--
JACOB ANDREAS: Yeah, I mean, so it's a language model. So every string gets some non-zero probability, and the right answer is going to get assigned some probability. I mean, so I think I have a slide for this. Yeah, one goofy thing that you can do is just ensemble the probe and the model together, knowing that they're right on different answers.
And this in some cases does actually give you a little bit of a boost in accuracy on hard question answering tasks. This is a weird thing to do and not necessarily something you would want to do in the real world, but it at least suggests that these things really are complementary. I don't know if that actually answers your question. OK.
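A toy version of that ensemble, with made-up probabilities and a made-up mixing weight purely for illustration:

    candidates = ["yes", "no"]
    lm_prob = {"yes": 0.62, "no": 0.38}       # model's generation probabilities (assumed)
    probe_score = {"yes": 0.20, "no": 0.80}   # probe's confidence that each answer is true (assumed)

    alpha = 0.5
    combined = {c: alpha * lm_prob[c] + (1 - alpha) * probe_score[c] for c in candidates}
    answer = max(combined, key=combined.get)   # here the probe flips the LM's answer to "no"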
OK, so just to say this again, there is a substantial amount of disagreement between factual knowledge as recoverable by these probes that we train and the answers to queries that you get from the model. And in particular, it's not that one of these or the other is always consistently better.
Sometimes, language models "know" things-- with big scare quotes around "know"-- in the sense that you can read that stuff out with a probe, even though they don't assign those things high probability. And sometimes, there's stuff that we don't know how to probe for that language models are nevertheless quite good at doing.
Cool, when am I supposed to stop? Sorry? Noon? OK, cool. Yeah, questions before we go on.
AUDIENCE: Does the probe-- people designed the probe, right? Maybe what people are probing is different from what the model is looking at.
JACOB ANDREAS: Yeah, no, I think that's right. And I think, in particular, that shows up here, where I glossed over this. The design of this experiment is a little bit different in the sense that we're learning a different linear operator for every one of these relations that's described here. And so the assumption is that, whenever the language model wants to retrieve the comparative form of an adjective, it has a coherent notion that the relation between big and bigger and the relation between good and better are the same.
And it could be the case that it represents all of these things linearly, but we're just not grouping the relations in the same way that the model internally groups the relations. So it knows that there's a company CEO relation, but it has one version of that relation for companies based in the US and one version of that relation for companies based outside the US or one for companies that start with the letter P and one for companies that don't start with the letter P.
So yeah, I suspect actually a lot of what's going on is just that the specific representations or the specific data that we're using to train these things makes assumptions about the conceptual structure of the world that doesn't exactly align with what's going on inside these models. Yeah?
AUDIENCE: Did you relate it to the structure that some things maybe are non-linearly represented, like as a graph relationship, like hierarchies?
JACOB ANDREAS: Well, yeah. I mean, certainly things are being represented non-linearly inside these models. And there's lots of other sorts of algebraic structures that you'd like to be able to embed inside them that would require you to not just encode all relations linearly. And I think what we're seeing is that there is some amount of that.
AUDIENCE: If we know that it can encode it linearly, then it means it's not part of a hierarchy or a graph. And if it's encoded non-linearly, then--
JACOB ANDREAS: Yeah, well, I think it's more complicated than that, that you can encode lots of graph relational structures linearly. There's work on this going back to the late '90s, early 2000s and lots of other sort of interesting algebraic structures that you can encode with simple operations like this but not everything.
I don't think it's as simple as if the probe with one parameterization works, then you can say something universal about how the model represents that internally, both because we're seeing in some cases there's disagreement between the predictions that you get from the probe and predictions that you get from the model, and on the other hand, things like this where even though we know you could, in principle represent this thing linearly, it's not. Yeah?
AUDIENCE: This might be a stupid question, but can we somehow-- or are we distinguishing relationships that could be coming from syntactic relationships from relationships that require more than that? Because, for example, like occupational gender-- I'm just looking at some examples-- some of the relationships could really just be guessed from the syntactic structure of the phrase or the words. And then for some of the relationships, you really need to have an understanding of the scene or the participants. And I'm trying to figure out whether that distinction shows up in this particular setup or not.
JACOB ANDREAS: Yeah, well, so I think this sort of comes back to maybe also what Brian was saying at the beginning is, can you give an account of what this model is doing that is purely about syntax manipulation but that, nevertheless, gives you structured representations that have all of the properties that we're saying here that sort of world models ought to have?
And I think for some of these tasks, that's definitely true, right? So thinking about the alchemy task, if I just remember, about every beaker, was it ever the dependent of the word "empty," that's going to allow you to predict, some reasonably large fraction of the time, whether the state of the beaker is empty right now.
And this was, in fact-- I think I have this paper here-- a point that was being made in this paper: that in this alchemy task in particular, you can actually get a lot just by sort of keeping track of whether this word participated in a particular relationship with this other word at some point. And if you've written that into the word representation, then you can linearly read off, in this fashion, the state of the world.
I think whether you can then give an entire account of meaning and state tracking and all of that purely in terms of these sort of word-level operations is actually a big, interesting question. It seems plausible that you can, and that it isn't really possible to make a hard cut between "this is a really good model of syntactic dependencies" and "this is a simulator of the underlying state of the world."
At the same time, you can't actually get all the way up to the accuracy of the probe with at least any of the purely syntactic heuristics that we've been trying here. You can do quite well but not quite as well as the probe, which is, I think, a little bit of evidence that something like that is going on.
But again, I think if there's one thing to take away from all of this is that things are super messy, and they're still super messy, and that we certainly don't have the ability to at any of these levels of representation exactly predict how the model is going to behave, exactly predict what kinds of representations the model is going to build, exactly predict the correspondence between those representations and downstream behavior, and that probably what's going on is a mix of really surface syntax-y things and some amount of deeper structured model building. Yeah?
AUDIENCE: So just a hypothetical question. When we look at human memory and concept learning, there's this idea that the more sort of relationships we understand-- so instead of looking at, let's say, trees and leaves as just close together because they're both in nature, rather looking at it as leaves are on trees, and flowers are next to leaves, in sort of these relational representations-- our ability to recall items improves when the representation is bigger and more robust, when there are more items and more connections.
So is there any way to evaluate almost the size of the representations that the model is learning? So going back to that previous, the first example you gave about the top, would that change if there was more sort of information in that representation? And might that underlie that sometimes we see names of CEOs are maybe not aligning with the probe but other things are?
JACOB ANDREAS: Yeah. Yeah, no, that's a great question. So I think one thing to remember with all of these experiments is that we're looking at, for the most part, models that were trained first on a huge amount of random text on the internet and then either fine-tuned or not fine-tuned on data relevant to the task of interest.
And so maybe one piece of evidence I can give in favor of what you're saying is that the probe performs much better in models that have been pretrained on the conceptual structure of the entire internet and then fine-tuned on these individual tasks, versus models that have been trained from scratch in these relatively narrow domains, where you can probably get away with just memorizing things rather than building this larger hierarchy.
That being said, this is a super coarse, messy experiment -- not least because, for the bigger models we were looking at in the second half of the talk (though not for these smaller ones), we don't even always know exactly what went into the training data. And certainly it's too big and complicated an object for us to hold in our heads everything about the world that it communicates.
I think a really interesting project would be a much more systematic study of how scaling things up -- in diversity, in sheer size, in the complexity of the situations described, or whatever -- affects these kinds of representations. But yeah, we have not done that.
OK, so just to talk about some other cool follow-up work. This paper basically argues that our alchemy task was too easy -- you can get most of the probe's accuracy, though not quite all of it, without really keeping track of how much liquid is in each beaker, just whether it has been emptied or not -- and it builds a much, much harder version of the task and evaluates modern models on it.
And a cool thing is that this still works even in the harder version of this task. At least behaviorally, models can do it. But it seems to actually be really important that those models are first pretrained on code and not just language data. And there's a big separation between models that have code in their training sets and models that don't have code in their training sets in terms of their ability to do these more challenging tasks.
So this, I think, also comes back to the question of how training data influences all of this -- in fact, data that's seemingly unrelated to the task but is much more explicitly about the relevant kinds of reasoning can be useful as a form of supervision.
People have found similar kinds of linearly decodable state representations in non-linguistic tasks. There was the Othello paper that some of you may have seen, which reads representations of the board state in the game Othello off of the model's hidden representations, and some very recent work from Charles Jin and Martin Rinard at MIT looking at models trained to evaluate programs and asking whether you can find correlates of the state of the program execution as the model generates the program line by line -- and finding that you can.
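The shared recipe in these probing studies is roughly: run the model over a prefix, collect a hidden state, and fit a linear classifier against ground-truth state. Here's a hedged sketch of that recipe assuming the HuggingFace transformers API; the model name, layer index, and the commented-out supervision step are placeholder choices, not the setup from any particular paper.

```python
# A hedged sketch of the generic linear-probing recipe (assuming the HuggingFace
# `transformers` API; the model name and layer index are arbitrary choices).
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def hidden_at_last_token(text, layer=6):
    """Return the layer-`layer` hidden state of the final token of `text`."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer][0, -1].numpy()

# Supervision step (placeholder): pair each transcript prefix with the true
# latent state -- e.g. which player occupies a given Othello square, or the
# value of a program variable -- then fit a linear classifier:
#   X = np.stack([hidden_at_last_token(p) for p in prefixes])
#   probe = sklearn.linear_model.LogisticRegression(max_iter=1000).fit(X, labels)
```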
And finally, one of the cool things -- whoops -- yeah: we had this original finding that information about entities' dynamic state is localized to their mentions in documents. In parallel, there was the ROME paper finding that models' background knowledge about entities is also localized to those entities' mentions within documents.
And so this actually motivated the knowledge-manipulation experiments in the second part of the talk, but I think it really points to the fact that there does seem to be some uniformity in the way information provided by training data and information provided as input get represented and get integrated with each other. Yeah, so these are the same picture.
And finally, there's a bunch of work on not just these kinds of relational world models but other, more interesting graded structures -- so work finding, for example, that the three-dimensional color space is encoded pretty well in models that are trained only on text. And -- I think Philip is going to talk more about this in the afternoon -- you can actually find correspondences between representations of text and representations of images of the situations described by that text, in real continuous embedding spaces.
So yeah, and this goes for other things: you can find maps of the US encoded in models' representations of city names. We have started building a benchmark at MIT that tries to make it possible to evaluate, in a much, much finer-grained way, what these state representations look like in situations that actually require physical reasoning, spatial reasoning, reasoning about the dynamic properties of objects and materials, and things like that -- by having sentences like "the piano is in front of Ellie; Ellie turns left; is it more likely that the piano is now to the right of Ellie or to the left of Ellie?" in a way that really requires you to simulate the spatial situation and not just pile up syntactic relations between words, as was being asked before.
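For intuition, answering that piano question correctly amounts to running a tiny egocentric simulation rather than matching word patterns. Here is a hand-written sketch of that simulation; the relation vocabulary and the example are mine, not the benchmark's actual format.

```python
# A tiny, hand-written egocentric simulator for questions like the piano example
# (the relation names and example are illustrative, not the benchmark's format).

# If the agent turns 90 degrees to the left, an object that was at `key`
# relative to the agent ends up at `value`.
AFTER_TURN_LEFT = {"front": "right", "right": "back", "back": "left", "left": "front"}

def relation_after_turn_left(relation):
    """New egocentric relation of a stationary object after the agent turns left."""
    return AFTER_TURN_LEFT[relation]

# "The piano is in front of Ellie. Ellie turns left."
print(relation_after_turn_left("front"))  # -> "right": the piano is now to Ellie's right
```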
An interesting thing here -- and we have this in a bunch of different domains -- is that the social domains are pretty easy, but these spatial relation problems are actually quite hard. These are all open models, but the big Stanford benchmarking group just ran all of the LLMs on this dataset -- I think that's going to be published soon-ish -- and this spatial relations domain is still really hard.
So again, coming back to the point that the picture is not as clean as we've made it out to be: even the best models we have today struggle with this modeling problem, and I don't think we expect there to be nice, clean internal representations for it.
So to sum up: there's some evidence that language models, for some tasks, some of the time, produce rudimentary representations of situations and world states. We can read these off, in many cases linearly, either by taking linear approximations to model behavior or by training supervised linear probes with some hand-annotated data.
And importantly, these are not just correlational. You can actually use them to intervene on representations in order to exert predictable control over generation, sometimes even in cases where you can't get the same degree of control through textual input to these models.
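As a schematic illustration of that intervention idea (not the exact procedure from any of the papers), you can register a forward hook that pushes a layer's hidden states along a probe-derived direction and then observe how generation changes. The layer index, the scale, and the direction `w` are all assumptions here, and the `model.transformer.h` layout assumes a GPT-2-style HuggingFace model.

```python
# Schematic sketch of a representation-level intervention (not the papers' exact
# method): shift hidden states along a probe direction during the forward pass.
# Assumes a GPT-2-style model whose blocks return a tuple with hidden states as
# the first element; the direction, layer, and scale are placeholder choices.
import torch

def make_steering_hook(direction, scale=5.0):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] + scale * direction.to(output[0].dtype)
        return (hidden,) + output[1:]
    return hook

# w = torch.tensor(probe.coef_[0], dtype=torch.float32)   # direction from a linear probe
# handle = model.transformer.h[6].register_forward_hook(make_steering_hook(w))
# ... generate text and compare against the unsteered output ...
# handle.remove()
```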
To very briefly wrap up, I want to come back to a question that was asked at the very beginning of the talk, which is, what do we even mean by world model? And why is this the right-- or is this the right language to talk about the kinds of representations that we've been pulling out right now?
And in particular, this is a paper from our colleagues in Max Tegmark's lab at MIT saying that you can find timelines showing how historical events are ordered with respect to each other, and things that look like maps showing how objects located in space are oriented with regard to each other. This shows up even in much, much, much simpler models -- [INAUDIBLE], even sort of pre-neural, LSA-type things -- you can decode things that look like maps from them.
And this sparked a bunch of debate online about whether a map is really a model at all -- whether we want to think about this as just a reflection of the fact that co-occurrence statistics of words have some kind of interesting structure, or as evidence that language models are actually building world models.
And so with regard to what's the right kind of thing to call a model, I think it's useful to draw analogies to other model-building activities that we participate in as human societies, the kinds of things that we're willing to call models. So here's an example of another map, of the solar system in this case.
It doesn't show us the entire state of the solar system. It shows us the relative sizes of the planets up top, in a way that doesn't correspond to their spatial locations at all. It shows us the locations of the planets, but not their sizes, down at the bottom here -- with some big bars, because there's a range of distances, and these things are never, or only very, very rarely, actually lined up in a long line like this.
And I think we want to think of maps as being-- and so what good is a map in society? Why do we create these things? And the reason is that if I have this thing, then there's actually a very large set of questions that I can ask about the solar system that I don't have to pre-compute or write down in some sort of lookup table, but that I can get out of this map with very little additional computation on my part as a user.
If I want to know how many times the Earth could fit inside Jupiter's red spot, I can answer that question with this map. If I want to know how much farther from the sun Saturn is than Mercury, I can answer that with this map. There are many things that I can't answer, including anything requiring modeling the dynamic state of the situation: where are these things now? Where are they going to be three weeks from now? What would happen to the entire state of the solar system if I were to pick up Jupiter and Mars and swap them today?
But at the same time, there's a lot here that's a compressed, low-dimensional representation of the state of the system, one that allows us to ask a bunch of important questions about it. Incidentally, I went for a run this morning, and there was also a map of the solar system right here in Woods Hole, on the little rail trail that goes up toward Falmouth, where they have these signs with the planets to scale at appropriate distances and so on.
So this is maybe the simplest kind of model that we build that lets us answer just these sort of questions that basically have to do with static snapshots of systems. And think of this as really analogous to everything that we were showing about models sort of linearly representing background factual knowledge about the state of the world.
Here's another model of the solar system. If you haven't seen one of these before, it's called an orrery. It's a little mechanical device. There's a crank on it that I think is not actually pictured here that you can turn to simulate the solar system forward in time in a way that will correctly preserve the relative locations of at least the Earth, the moon, the sun, and the inner planets here.
And this model, which maybe we're a little more willing in general to call a model, lets us answer a richer set of questions than a map. Now we can answer -- not counterfactual questions, sorry -- but questions about the dynamics of the system: when will the planets next all be in a straight line with each other, or, given that there's an eclipse today, how far in the future will the next eclipse occur, and things like that.
And so by adding a little bit of expressive power to the system, we've given ourselves the ability to answer a richer set of questions but at the cost of making setting up the system more complicated. We have to now do a little bit of work on the outside, turning the crank to get it into the right initial state.
And just like the map, there are a lot of questions that we might like to ask about the state of the system that we can't ask using this, because it hardcodes a bunch of contingent facts like the shape and size of the Earth's orbit.
So if we imagine a counterfactual world in which, I guess, Earth and Mars were to swap places, or the Earth and the moon were to swap places, or, I don't know, Mars had never existed at all, something like that -- we can't actually answer those questions with this system without breaking it to pieces, recomputing what all of the planets' orbits would have been, and putting it back together to reflect those new orbits. And so we can ask conditional questions and dynamic questions, but not arbitrary counterfactuals.
And if we want to go all the way to arbitrary counterfactuals, then we need to do arbitrary N-body simulation of the system using all of the fancy techniques from modern physics. And this, again, comes at a significant cost, computationally, in terms of how much it takes to run the system.
I can build the map in the Stone Age; I can build the orrery in a Renaissance goldsmith's shop; and this, you can't really do until you've already invented semiconductors and all that. And maybe more importantly, this requires a lot more work to set up and to specify the initial conditions, and it gives you a much more complicated language that you need to use in order to ask whatever question it is that you're trying to ask.
And so I think when we talk about world models inside of language models, or world models as human mental models, or anything like this, it's really useful to think of being a model not as a binary property but as something a bit more graded, where models live on a spectrum: how complicated are the questions they allow us to answer, which questions about the system being modeled can or can't be answered within a given model, and, conversely, for the questions we can't answer, how much work do we have to do outside the system to get it to actually produce those answers.
And so, coming all the way back to the question from earlier in the talk: when we reach into the model and cause Sundar Pichai to become the CEO of Apple rather than Google, why isn't the representation of Google also changed to reflect that he's now the CEO of Apple?
Well, I think that's because we're trying to do something that's morally equivalent to ripping out a piece of this map, pasting it down somewhere else, and expecting the rest of the system to be internally consistent in a way that it certainly wouldn't actually be in the world. When we have these more complicated questions -- really counterfactually, what would all of the things be that would be true in a world where Sundar Pichai was the CEO of Apple? -- you have to do a lot more work than you can do with a single linear transformation. And we should, therefore, expect that we'll need editing tools that are much more sophisticated than the ones we have right now.
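To see why a single linear change has this local character, here's a toy illustration -- a generic rank-one update, not the actual editing method used in these papers: it remaps one key vector to a new value while leaving directions orthogonal to that key untouched, so nothing else in the "world" gets recomputed.

```python
# Toy illustration (not the actual model-editing algorithm): a rank-one update
# to a linear map W that forces W @ k == v_star for one key k. It rewires that
# one association, but it does not propagate any of the downstream consequences
# that would hold in a genuinely counterfactual world.
import torch

def rank_one_edit(W, k, v_star):
    """Return W' with W' @ k == v_star via a minimal rank-one change to W."""
    residual = v_star - W @ k
    return W + torch.outer(residual, k) / (k @ k)

W = torch.randn(8, 8)
k, v_star = torch.randn(8), torch.randn(8)
W_edited = rank_one_edit(W, k, v_star)
print(torch.allclose(W_edited @ k, v_star, atol=1e-4))           # True: this fact changed

# Keys orthogonal to k are untouched -- nothing else gets "recomputed."
k_orth = torch.randn(8)
k_orth = k_orth - (k_orth @ k) / (k @ k) * k
print(torch.allclose(W_edited @ k_orth, W @ k_orth, atol=1e-4))  # True
```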
So to the extent that we want to place LLMs in this hierarchy of model expressiveness or model complexity, I think we should think of them as being something like this: somewhere between maps and orreries -- simple information lookup, or very, very simple forms of simulation, during ordinary text generation -- and maybe starting to move up toward these more complicated questions only when you give them extra computational power, either via tools or via work on the scratchpad, like Aaron was talking about yesterday, to think out loud and do intermediate work.
But I think this also gives us a kind of roadmap for what we want to do with these models in the future, what kinds of research might make them better. And that's basically trying to bring all of these things into alignment with each other, right? We would like for it to be the case that even when they're in map mode, the map doesn't contain any internal contradictions.
We can imagine a version of this map of the solar system where the bar describing Mercury's orbit, or maybe the size of Mercury, was too large, implying that it was going to run into Venus sometimes in a way that doesn't actually happen in the world. And these are the kinds of things that we expect to be able to read off statically, and maybe even enforce statically from the outside: whatever representation we're working with, it should at least exhibit internal consistency, or as much internal consistency as we know how to compute.
Similarly -- and I think this shows up in both cases -- the answers should be correct: if the model believes some proposition P, it should also believe everything that P entails. And coming back to the linear decoding experiments that we were looking at before, we would like these things to all be represented in the same way as much as possible, even though they aren't right now -- for interpretability purposes, and probably also because it would make models generalize better and have fewer weird edge cases and sharp edges and things like that.
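One very simple way to operationalize that entailment constraint from the outside -- a sketch under my own assumptions, not a method from the talk -- is to elicit a belief score for P and for an entailed Q (from a truthfulness probe, a yes/no prompt, or whatever is available) and flag pairs where P is believed but Q is not.

```python
# A hedged sketch of a static consistency check on (P, entails-Q) pairs.
# `truth_prob` is a placeholder for whatever belief-elicitation method you use
# (a truthfulness probe over hidden states, a yes/no prompt, etc.).

def consistency_violations(entailment_pairs, truth_prob, threshold=0.8):
    """Return (P, Q) pairs where the model believes P but not its consequence Q."""
    return [
        (p, q)
        for p, q in entailment_pairs
        if truth_prob(p) >= threshold and truth_prob(q) <= 1 - threshold
    ]
```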
And so I think there's a huge amount of interesting research being done trying to figure out how to enforce all of these properties at training time in a way that we're clearly not getting from the learning algorithms and the model architectures that we have right now. Yeah, and with that, I will wrap up.
As always, all the credit here goes to the students who actually did this work -- Belinda and Max, who did the original probing work I was talking about; Belinda and Evan, who did the representation-editing work later on; Kevin and Stephen, who did the work on what happens when truthfulness probes and queries to the models disagree; and Evan, Arnab, Tal, and Kevin, who looked at the linearity of decoding -- as well as a bunch of faculty -- Martin Wattenberg, Yonatan Belinkov, David Bau, and Dylan Hadfield-Menell -- at Harvard, the Technion, Northeastern, and MIT, for making all of this work possible.
[APPLAUSE]