A Theory of Appropriateness with Applications to Generative Artificial Intelligence
Date Posted:
March 17, 2025
Date Recorded:
March 4, 2025
Speaker(s):
Joel Leibo, senior staff research scientist at Google DeepMind and professor at King's College London
All Captioned Videos Brains, Minds and Machines Seminar Series
Description:
Abstract: What is appropriateness? Humans navigate a multi-scale mosaic of interlocking notions of what is appropriate for different situations. We act one way with our friends, another with our family, and yet another in the office. Likewise for AI, appropriate behavior for a comedy-writing assistant is not the same as appropriate behavior for a customer-service representative. What determines which actions are appropriate in which contexts? And what causes these standards to change over time? Since all judgments of AI appropriateness are ultimately made by humans, we need to understand how appropriateness guides human decision making in order to properly evaluate AI decision making and improve it. In this talk, I will present a theory of appropriateness: how it functions in human society, how it may be implemented in the brain, and what it means for responsible deployment of generative AI technology.
PRESENTER: It's great to have Joel Leibo today as our Quest CBMM speaker. Joel is currently a senior staff research scientist at Google DeepMind and a visiting professor at King's College London. Correct? And Joel I think was set in his career early on when he was at MIT. He-- I think he overlapped with Demis Hassabis back in 2011.
JOEL LEIBO: Nine.
PRESENTER: Nine. Yeah. And so that's the reason he's in London today. He was telling me a funny story that back in 2010 or so, at our coffee machine on the fifth floor, he was the one to introduce Geoffrey Hinton to Demis. Geoffrey was visiting and Demis was there as a post-doc.
So anyway, Joel started at the time to explore questions about object recognition, and I think there were a couple of very interesting pieces of work that posed questions that I think are still open. One is why domain-specific regions exist in cortex, like the fusiform face area, and Joel at the time proposed that they arise due to the importance of building representations that are robust to class-specific transformations, like rotation in 3D of a face or changes of expression of a face.
And he also at the time, in work with Winrich Freiwald, showed that a specific type of learning rule can account for the particular properties of area AL in the monkey. You have ML, AL, and AM. AM is invariant to viewpoint, ML is not, and AL in between has neurons that are specific to one view and its mirror-symmetric view. That specific learning rule would explain that kind of observation, though these are things that still have to be checked experimentally and explained.
Anyway, since then he has looked into reverse engineering cultural evolution and economics. And one of his recent major projects is the Concordia Generative Social Simulation platform, a tool that uses large language models to create world simulations. I'm sure he will speak about it, because it serves as the foundation for the theories he will speak about today, and it's an approach to computational cognitive modeling through generative agents.
So, Joel, it's very nice to have you today, and I have established an alliance with your mother here to try to get you back more often.
[LAUGHING]
[APPLAUSE]
JOEL LEIBO: Thank you for that introduction. It's really great to hear about the work I did in object recognition and neuroscience during my PhD. I do think there's at least some kind of aesthetic-level connection to what I'm going to talk about now; the patterns in how I think about research were set back then. But spelling that out is probably best left as an exercise for the listener, because it would be difficult for me.
But so I'm going to talk about this, which is a theory of appropriateness with applications to generative artificial intelligence. In some sense, this is the story of what my team and I started doing after we started working on language models, maybe a year and a half or two years ago, something like that. And this has been a chunk of the work we've done since then. And, yeah, I'll launch into more from there.
So why appropriateness? Appropriateness is important for artificial intelligence right now because AI systems are now talking, and they often say inappropriate things. As they start doing things more, they're also going to do inappropriate things, and we want to have something more to say about that than just that, obviously, because conflicts arise when people have different conceptions of what's appropriate. And conflicts can be bad and destabilizing. And more broadly, the AI risks that my team is interested in are the kinds of risks that are mediated through humans changing their behavior.
Humans care a lot about what's appropriate in context, and so in order to advance AI and do it safely, we have to think about humans and what humans think is appropriate. So this is a long way around of saying that most of this talk is really about humans, though I think it has implications for generative AI. And we also are going to have a generative-AI-related model of humans, as you'll see. But it's really meant to be a statement about humans, and it's a cognitive-science-level thing.
So this is the outline of what I'm going to say. I'm going to start by talking about equilibrium selection and the general picture that's been animating my team for some time. And then I'll move into the actual theory: the stylized facts I want to explain, how we explain them, and so on.
So this is part of the way I justify the team, basically. I say: AI is causing lots of changes in society. There have been other cases like that, where we've released some technology and then people have changed their behavior as a result. Wouldn't it be great if, in the long run, we could eventually come to make some decisions in advance based on how people might change their behavior in response to some new technology?
Maybe then we would change what we do. Social media is the obvious example, but you could give many other examples where some technology is released, and it's not just that the technology itself causes something to happen. It's that people change their behavior because the technology is there, and that's what we're actually interested in trying to say things about, and what we're worried about risks relating to.
So obviously we have now put out some new technology with generative AI, and it's affecting all kinds of things. And I would say probably no one would disagree that human behavior has not yet adapted to it, so more change in society is clearly coming, even if the AI technology itself were to stop changing, which obviously won't happen either. But I'm interested in what happens as humans adapt.
And societal changes are not slow, but they're also not instantaneous. They're mediated by people changing behavior, and they happen at the speeds at which people change their behavior. But there are often positive feedback dynamics here: the more people who have started to do something a certain way, the more people get pulled in. So changes can go rapidly. They can have tipping point effects and bandwagon effects and all these kinds of dynamic things that people talk about.
More examples on the slide.
And there are other places where we have a similar perspective. I take it from the area that's interested in social-ecological systems, associated a lot with the work of Elinor Ostrom, if people know that name. The idea there is that you can make predictions about different social-ecological systems, which are typically things like fisheries and forests and pastures and irrigation systems, based on thinking about them as social dilemmas but also as things that are affected by the technologies that are available.
On the slide, there is a picture of dynamite fishing. Obviously, you get a very different equilibrium in a fishery if people are fishing with dynamite versus if they're fishing with nets and things. So that's the kind of thing that that literature is interested in. And I think that's broadly analogous to the kinds of things that we're thinking about with AI technology. These are pictures where there's some overall equilibrium shift that may happen because of a change in technology.
Obviously, there are lots of things we're talking about right now that are very current and important, like whether our democracies will be able to cope with the epistemic strain from disinformation being very easy and cheap to produce. That's a picture where we change the cost of taking different actions, and that has an effect on the equilibrium that society ends up in. There are other questions there about mental health and AI-generated romantic partners, which is a thing now.
And the next one is about cybersecurity equilibria, which you can see as an equilibrium between attackers and defenders; how is that affected by zero-day exploits becoming very cheap and easy to find? And then there are labor market type questions, which are very natural to think about in terms of equilibria. So the point here is that there's lots of equilibrium selection risk to think about.
Why am I calling it equilibrium selection risk? This is a traditional game-theoretic way to think about it. I'm not the biggest game theory fan, I'll say, but I do think this is a reasonable way to describe the idea very quickly.
And the way you're supposed to read a normal form game, which is what this is, is to say that there are two players, player one and player two, the row player and the column player. The row player picks the row. The column player picks the column. The row player gets a reward, which is the first of the two numbers, and the column player gets the second one.
And then you're supposed to reason: OK, if we were in one place, let's say we both picked A, does either of us have an incentive to change? Well, the row player has no incentive to change, because if they changed away from the AA option, they would get a zero with either choice, and the same thing holds for the column player. So if you're at the AA option, you would stick there and not change. And you can see the same thing is true for BB and CC, but not for any of the others. So there are three pure-strategy equilibria here.
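To make that check concrete, here is a minimal sketch in Python of the reasoning just described: a cell is a pure-strategy equilibrium when neither player can do better by deviating unilaterally. The payoff numbers below are hypothetical stand-ins for a coordination game like the one on the slide, not the actual slide values.

```python
def pure_nash_equilibria(row_payoffs, col_payoffs):
    """Return all (row, col) cells from which neither player gains by deviating unilaterally."""
    n_rows, n_cols = len(row_payoffs), len(row_payoffs[0])
    equilibria = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Row player's check: no alternative row gives a strictly better payoff in this column.
            row_ok = all(row_payoffs[r2][c] <= row_payoffs[r][c] for r2 in range(n_rows))
            # Column player's check: no alternative column gives a strictly better payoff in this row.
            col_ok = all(col_payoffs[r][c2] <= col_payoffs[r][c] for c2 in range(n_cols))
            if row_ok and col_ok:
                equilibria.append((r, c))
    return equilibria

# Hypothetical coordination payoffs: matching on A, B, or C pays (1, 1); mismatching pays (0, 0).
row = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
col = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(pure_nash_equilibria(row, col))  # [(0, 0), (1, 1), (2, 2)] -> the AA, BB, and CC equilibria
```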
And the reason it's called equilibrium selection is because there's a choice of which one you select, and game theory has no way to tell you which. This is another picture, which is also an equilibrium selection picture. But here, the equilibria are different from each other.
In the first one, it didn't matter which one you picked, because everyone gets the same reward regardless; as long as you pick any of them, everyone gets the same thing. But here it matters, because in this case the CC equilibrium is clearly better than the AA equilibrium: they get a 3 instead of a 1.
And other things can happen too. You could have unfair equilibria, like here, where there's AA and BB and both of them are unfair in the sense that one player gets more than the other. So equilibria are arbitrary in the sense that you could have picked other ones, as far as this logic goes.
But they're not arbitrary in the sense that no one cares about the difference, because people clearly do care about the difference. The row player clearly wants to be in the AA equilibrium here, and the column player clearly prefers to be in the BB equilibrium. So my general picture of societal progress is one where you shift from equilibrium to equilibrium.
And what's interesting about this is that it's a picture where you can't change unilaterally. There's a balance of opposing forces in each of these things. You're stuck in a particular equilibrium just because, if you were to change, you would do worse, so there's always something saying, no, I should stay here.
And in this picture, you can see that there's both equilibrium risk and equilibrium selection opportunity. If you start in the BB equilibrium, the (2,1), then you could go down, or you could go up. And societal progress is, hopefully, picking better equilibria over time.
OK, so there's another picture that I think is very natural from this, which is really about, what is a society in the first place? I would say a society is a stable group that contains lots of disagreement potentially but is just somehow stable, potentially because of conflict resolution mechanisms or conflict avoidance mechanisms or something. Something prevents the group from falling apart, and that's what makes it a group.
And this way of defining what a society is is associated with people like Chantal Mouffe and others who are cited on the slide. And I think it's a really powerful way to think about what a society is. It's not actually too common, at least in the AI risk community. People tend to have a different view, which is more like: a society ultimately has some kind of shared goal.
And I really think that's a dangerous and different perspective. This one is a picture where people really do disagree with each other and still live in the same society, but they're somehow stable, and figuring out how, and what the mechanisms of that stability are, becomes the game that we're talking about. And the health of those stability mechanisms is, of course, the important thing.
And then the obvious candidates for what I mean by these conflict resolution stability mechanisms are governance mechanisms, courts and parliaments and these kind of things, or appropriateness concepts, which is what I'm also going to talk about. It's going to be a very generic thing as well.
OK, so I just want to talk for one second about how I got here. I've been interested for a long time in cooperation. But we were previously studying cooperation in a different framework, multi-agent reinforcement learning. That was a world where what we wanted to do was to keep the game-theoretic insights I just talked about, about equilibrium selection and things like that, but do it in a world where everything was happening with agents that were seeing pixels and making choices that were low-level motor actions.
But they, over long sequences, would add up to doing something that was cooperation or selecting this equilibrium or whatever. And we did this kind of thing for a long time. And we did things like model different social dilemmas in this type of space.
And when language models became easy to work with, we had a choice on our team where we were like, well, now the agents can talk to each other. And do we study cooperation? Or do we study multi-agent reinforcement learning?
And we were like, well, it's better to study cooperation in a world where agents can talk to each other and figure out what the implications for our research program are, rather than stick with the kind of thing we had become well known for doing. The other thing that was interesting about this, and is still true in our new work, is that this was another way of complexifying a game-theoretic social dilemma. This was saying, let's make it spatiotemporally complex. Let's abstract the real world in a different way that's not the same as the way game theory does it, but then try to play in the same space and make predictions about the same kinds of things.
Now we're going to abstract things in new ways, in a more fully language-like world. Part of the reason we wanted to do this came from thinking about this distinction I've started to use of thin versus thick morality. Obviously, this comes from Clifford Geertz's thin versus thick description.
But really, a place I was inspired from was this Michael Walzer book, which is cited on the slide there. And what they mean by thick morality there is that it's particularistic. It's contingent on the details of the history of the society. It's full of detail. It's encultured.
It includes all the details of who deserves what and why. And it includes lots of rules that are materially irrelevant, apparently, like, what color do you wear to a funeral? When you enter a temple, do you take off a hat or put on a special hat?
Do you shake hands with the right hand or the left hand? All these apparently arbitrary things are normative, because it would be a norm violation to do the wrong thing. Imagine someone wearing hot pink to a funeral or something like that. Someone might get mad.
You could imagine someone might have some righteous anger at that. So what I think is interesting is that it's often very difficult for people to tell apart which parts of their own morality are things that are specific to their culture, these silly rules, as we could call them, versus which things feel like they have some deeper reason behind them. I think it's very important that whatever picture we have can accommodate the fact that our morality has a lot of this stuff.
Now, there's also thin morality, which is meant to be the intersection of everyone's morality. That's like what the utilitarian philosophers are talking about when they try to say the thing that's really in common that's behind all the rest. And I think a lot of people are searching for this common core thin morality to build it into machines or something like that.
But what I think would be a better guiding principle would be to think about how multicultural groups, which contain lots of different people with different thick moralities, manage to get along anyway. And that's the picture I would like to have. Now, I'm relatively theoretical, I would say; I don't have an account of exactly how to do any application. But that's the guiding principle.
OK, so: how can we live together? That's it in a single sentence, stolen from Aristotle or something. Yeah, OK. So now this is a good point where I should show what the actual data from these models looks like. What do the models I'm talking about actually produce?
Because I've given versions of this talk where I've totally skipped this, and then people have had no idea what I was talking about later. So I'm going to do this really quickly, even though the details are not terribly important for the rest. We have these systems that are built from language model agents that talk to each other. Think of it like collaborative storytelling, or writing a play together, or something like that.
And this is some list of events here. In this case, it was a whodunit murder mystery thing on a train where someone named Charlie killed someone named Victoria. Charlie's trying to get away with it, and the others have some reason to investigate and figure out why. And you see, there's some different record of events.
But we can zoom in to one of these events and see how it happened. Let's look at that one where Charlie, apparently, said he'll go to the lounge car and have a drink. Remember, Charlie's the murderer. And you can expand any one of these events and see, how did this happen?
And there's a whole set of things underneath. So that's the whole prompt that led to Charlie saying, I'm going to go to the lounge car and have a drink. And as you see, it has all these parts, including a justification part here. Charlie's thinking about his justification for the murder.
Apparently, he said he was just doing what was best for the country. And Victoria was a corrupt politician. Whatever. But then how did that happen?
Where did this justification come from? You can zoom into that further and see this whole long chain of thought, which, it looks like, starts with, when Charlie was 12 years old, he was elected class president, and eventually gets down to, he did it for the country. And then it's a bunch of agents that are interacting with each other in this world.
And things happen, and they affect things later on. At this point, Charlie was having a drink. And then you can see later on, Donald tried to interview Charlie, but Charlie was drunk because he had been drinking since the other time point.
So there are temporal dependencies and things here. And you can also see how I'm trying to build back up to something that looks a little bit like the kind of multi-agent reinforcement learning picture I had before, where there was an environment and agents, and you could study them and do experiments. But we're doing it in a world that's fully language model agents talking to each other, basically.
OK, that's what the data looks like. OK, so now what about appropriateness? So remember, I said that this is mainly about humans. Josh has a question. Yeah.
So Josh is asking, and I'm just repeating for people who can't hear, whether there's any world representation other than what the language models say. And yes, in our setup there is. Concordia is a framework for coding that in a way where you can flexibly move between whatever you want to be hard-coded versus made up, and you have the ability to go back and forth, basically.
AUDIENCE: In natural language, or what kind of representation?
JOEL LEIBO: In code. But both, actually. A mixture, really. Yeah. OK. But it's not terribly important for the rest of the talk, so I don't want to go too much into the details of how the code works. The talk is meant to be more general theoretical stuff. OK. So right. So we care about humans. We're going to think about what appropriateness means for humans.
All right. So why did I even start talking about appropriateness? I'll say right now, most of this talk is about social norms. I think social norms are the important part. But we started using the term appropriateness ourselves because it was meant as a kind of umbrella term. And this is really about semantics, right?
This is a choice that we made. You could have cut these things up in a million other ways; you could define things how you like. But the way we wanted to define it was that we wanted norms to be impersonal things, and it still felt like appropriateness was broader than that. So it felt like there were two different things: there was appropriateness with friends and family, and there was appropriateness in behavior with strangers.
And we're only going to use the word norm for the strangers part. In this talk, I'm mostly going to focus on the strangers part, which is the norm part, and which I also think is the most important part. But because I'm not going to say much else about it, I'll say very quickly that part of the difference here ends up being similar to the difference between evolutionary game theory and iterated normal-form game theory, with two agents playing the same game together forever.
That's the difference where friends are seeing the same other person over time, whereas strangers are just once. OK. But I'm really probably for this talk going to focus on the stranger side. OK. So what do we want to explain with our theory of appropriateness? These are the stylized facts. So the goal is to explain all this.
And the first one is that appropriateness is context dependent, culture dependent, and role dependent, which is just to say that what's appropriate for me to do here as a speaker in this room is different from what's appropriate for a listener in the room, and what's appropriate for a judge in a court is different from what's appropriate for a defendant. It's also different here versus in China. All of those types of properties have to hold.
Also, appropriateness has a sense of arbitrariness, similar to what I was saying earlier in the talk about the possibility of silly rules and also about the possibility of equilibrium selection of different things that might have gone either way. Then the next one is that acting appropriately is usually automatic.
Now, that's not to say that we never have to think hard about it, but it is to say that most of the time we can just do what's appropriate without thinking hard about it, and robustly under distraction. It's not as if being distracted makes us always act inappropriately. Maybe sometimes, but a large part of it, probably most of it, is relatively automatic and robust. Then the next property is that appropriateness can change rapidly, or not. It can stay stable for long periods of time, but when it does change, it can change rapidly.
So there should be tipping points and bandwagon effects and these kinds of things. And then the next one is that appropriateness is desirable: it's somehow related to sanctioning, and I'll say much more precisely what I mean by that. It's supported by sanctioning, and we have a particular meaning of sanctioning in mind. So there's some sense of social encouragement and discouragement that ends up supporting it.
So these are the properties we want it to satisfy, and I'll try to convince you that it does, I guess. OK. So what's the actual model we have in mind here? At this level of thinking about it, I will argue that it's fine to say we're going to model humans as being basically like a large language model, or as having one in their heads, or something like that, and there's a particular way to make this work. I think it's fine.
We're not really saying that humans deeply are anything in particular. We're not saying that the brain works this way or whatever. We're just saying that it's a computational model and it produces some predictions, and you can check the predictions against data and see how well they work. You can talk about its parsimony or not, and all the other properties.
An alternative would have been to use a reinforcement learning agent as a model of humans, and you can capture some data that way. We're also not saying that humans are reinforcement learning agents; that would be a dumb interpretation if we thought we were saying that. Here we're just saying, here's another computational model. Let's use it to see what we can explain with it.
So OK, how does it actually work? The idea is that we have some kind of global workspace, a term we took from the consciousness literature. The idea is that there are a bunch of different brain regions that can be simultaneously active and communicating with one another, and maybe some of them are more perceptual and some of them are more motor.
And for us, that's basically going to be the prompt. It's going to be like the chat window for the language model, where you can do pattern completion from one part of it to the rest, maybe from the perception side to the action side. We can also think of it as addressed: the parts of this buffer, this global workspace, have roles, and some part is the part that perception gets written to.
Some information comes into your eyes and then gets represented somewhere in the brain. That's a chunk of this global workspace. And then other information gets read out from another part, like your motor cortex or something, and that gets sent to muscles eventually. And that's the speech part of it. Right?
So it's one kind of buffer: you can do pattern completion across it, one part gets written into, and another part gets read out of. That's how we're thinking about it. There's one more thing to say, which is that you can also have retrieval of long-term memories, maybe from the hippocampus or something, and you can think of those as a prefix to this whole little structure.
So you get some memories, like I remember this stuff about myself, and then I see this right now, and then complete from there to, OK, so therefore, I'll do that. That's what the picture is. And of course, since it's a multi-agent picture, we assume everyone is doing something like this and then they're interacting with each other.
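As a rough sketch of this global-workspace-as-prompt idea, here is one way it could be coded. It is illustrative only: `complete` stands in for any language model call, and the class and field names are shorthand, not Concordia's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalWorkspace:
    memories: list = field(default_factory=list)   # retrieved long-term memories, the "prefix"
    percepts: list = field(default_factory=list)   # the part that perception gets written into

    def as_prompt(self) -> str:
        # Memories come first, then current observations; the action is whatever
        # the model pattern-completes from there.
        return "\n".join(self.memories + self.percepts)

def act(workspace: GlobalWorkspace, complete) -> str:
    # The "read out" side: complete from the perception side to the action side.
    prompt = workspace.as_prompt() + "\nGiven the above, what happens next?\nAnswer:"
    return complete(prompt)
```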
All right. There's one more step. You need to turn this into a kind of decision maker agent model that we're familiar with. And so we took this inspiration from this really old book, sociology book by Goffman, where he thinks about people as kind of actors in a play. And what we do is we model an agent, say, named Alice, as an actor playing the role of Alice.
So that means that there's some prompt there, saying, what would Alice do? And then you take whatever gets completed, and then you take that as the action of the Alice agent in the simulation. And you're doing that for a bunch of agents, and they interact with each other and things happen.
Here's an example. The global workspace could include sentences like: Alice is hungry. Alice likes to eat apples. Alice sees an apple and a banana on the table in front of her. And then it could say: Question: Given the above, what should Alice do next? Answer. And that's where the language model takes over, and it would reasonably complete with something like, Alice eats the apple.
You can see this has some interesting properties. It's almost like some kind of Aristotelian syllogism. If you did put in Socrates is a man and all men are mortal, it would complete the right thing, of course. Most of the time. Maybe not always, and it would break sometimes. But it would reasonably do that.
But it would also work in these more vague situations reasonably well. Also not perfectly and whatever, but it does OK. Yeah. And it would also change in response to changes in reasonable ways. Right? If you said instead that maybe a few minutes ago, Alice's friend Bob said save the apple for him, it would then switch its answer to maybe then I'll eat the banana instead.
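Here is a minimal sketch of that "actor playing the role of Alice" move, using the example from the talk. Again, `complete` is a stand-in for any language model call, and `my_model` in the comment is a hypothetical handle, not a real API.

```python
def alice_decides(complete, extra_context=()):
    # The agent "Alice" is modeled as an actor playing the role of Alice:
    # we ask the model what Alice would do and take the completion as her action.
    workspace = [
        "Alice is hungry.",
        "Alice likes to eat apples.",
        *extra_context,
        "Alice sees an apple and a banana on the table in front of her.",
        "Question: Given the above, what should Alice do next?",
        "Answer:",
    ]
    return complete("\n".join(workspace))

# With no extra context, the model plausibly completes "Alice eats the apple".
# Injecting one more data sentence shifts the completion, e.g.:
# alice_decides(my_model, ["A few minutes ago, Alice's friend Bob asked her to save the apple for him."])
```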
So it responds reasonably to stimuli, where stimuli here means injecting more sentences, data sentences, someone-did-something-type sentences. And also, importantly, sanctions will be something like that. OK. More things we can say. Another thing that makes this a little bit more powerful is this idea of a component of the overall prompt, where you can have an internal question that gets asked and then answered internally, and then you keep going in the generation.
And so you could make choices by asking yourself some set of questions that you could effectively pre-program. Like, I'm going to make all my choices by asking myself, what should I do next? Or you could say, I'm going to make all my choices by asking myself, what kind of situation is this, and then what should I do next? Or whatever.
So we have a notion of a component, which can do some memory retrieval and then some summarization, put that into the prompt, and continue. And that's what I was showing before when I kept zooming in on the data and asking, why did the agent say this? You can keep following it down, because there were components there.
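A rough sketch of the component idea, under stated assumptions: `memory` is any object with a `retrieve(query)` method returning strings, and `complete` is a language model call; the names are illustrative, not Concordia's real API.

```python
class Component:
    """One internal question: retrieve relevant memories, answer the question,
    and contribute the answer as one more line of the growing prompt."""

    def __init__(self, question, memory, complete):
        self.question = question
        self.memory = memory        # assumed: retrieve(query) -> list of strings
        self.complete = complete    # assumed: language model call

    def __call__(self, context):
        retrieved = "\n".join(self.memory.retrieve(self.question))
        answer = self.complete(
            f"{retrieved}\n{context}\nQuestion: {self.question}\nAnswer:"
        )
        return f"{self.question} {answer}"

def decide(components, context, complete):
    # Each component appends its question-and-answer to the context, and the
    # final action is completed from the accumulated prompt.
    for component in components:
        context = context + "\n" + component(context)
    return complete(context + "\nGiven the above, what should the agent do next?\nAnswer:")
```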
OK. So another thing that is cool and useful from a kind of a methodology of computational social science perspective is you can compare things that were previously hard to compare. There's one obvious way. So if we're going to be making decisions by asking ourselves a sequence of questions, the obvious one to do is like, OK, what are my alternatives? What are the consequences of those alternatives?
How do I value those consequences? Which alternative has the highest expected value? And then say, OK, I'll do that one. So that's the rational actor picture in this model. But what's interesting is you could also use any other set of questions you want. Most of the time, in a lot of different computational social science methodologies, you're stuck with something that looks like the rational actor picture, because you have this deeply embedded need to calculate an expected value of something.
But here, since it's all just a bunch of questions, you can really put whatever you want in there. So you can do this other one that we got from this paper, which was called the logic of appropriateness. It's not exactly what we mean by appropriateness, but it was super inspirational for us. And what they said is that people make their decisions by asking themselves, what kind of person am I?
What kind of situation is this? What should a person like me do in a situation like this? And then they make their choices that way. And what's great is that now these are on completely even footing. We can just compare these two agents, one that makes its decisions by the logic of consequence and one that makes its decisions by the logic of appropriateness, or a hybrid agent that does one in some situations and the other in other situations, and see what we get.
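The two decision procedures reduce to two different question sequences fed through the same machinery, which is what makes them directly comparable. A hedged sketch follows; the question wording is paraphrased from the talk, and `complete` is again a stand-in for a language model call.

```python
LOGIC_OF_CONSEQUENCE = [
    "What are my alternatives?",
    "What are the consequences of each alternative?",
    "How do I value those consequences?",
    "Which alternative has the highest expected value?",
]

LOGIC_OF_APPROPRIATENESS = [
    "What kind of person am I?",
    "What kind of situation is this?",
    "What should a person like me do in a situation like this?",
]

def run_agent(questions, context, complete):
    # Ask each question in turn, appending the answer to the running context,
    # then act on the accumulated reasoning.
    for question in questions:
        answer = complete(f"{context}\nQuestion: {question}\nAnswer:")
        context = f"{context}\n{question} {answer}"
    return complete(f"{context}\nTherefore, the agent's next action is:")

# run_agent(LOGIC_OF_CONSEQUENCE, scenario, model) and
# run_agent(LOGIC_OF_APPROPRIATENESS, scenario, model) can then be compared on the
# same scenarios, or mixed into a hybrid agent that switches by situation.
```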
Construct different models that do different things. And I think that's super exciting. It's methodologically exciting because it gives us a way to compare things that were previously in different languages within one framework. That's useful. Another comment on this is that it is a reward-free theory, but I don't mean there are no incentive effects.
Obviously it's meant to capture all the same data that we captured with other methods. Right? It will only be deemed successful once we've shown that it can capture the kinds of things we could capture with reinforcement learning agents and everything else. The only thing that's different is that if that works, then we'll have done it in a way that doesn't require writing down a scalar reward function.
And I think that could be powerful, because there are obviously lots of distortions that come from writing down a scalar reward function, especially from starting with one. You can tell a version of that same story about game theory: in game theory, you have to get the numbers in the matrix from somewhere. This is a picture where you don't have to make up numbers like that.
You make up something like this instead. So it's different. And the other thing that you're assuming, of course, is that you have a pre-trained language model. But I think there are lots of ways to make that reasonable here. We're saying we're starting from a model that knows something about the culture. It's a different starting place. There are also interesting things that I think would be easier here.
One, there's a whole family of things that are hard to explain with reinforcement learning and with any kind of rational actor model: anything that involves preferences changing, because preferences are typically exogenous in those models. If I want to have Alice prefer apples to bananas, usually I put that in the reward function.
Now, you can always have a more complex model where things change on different time scales or whatever, and you can do that. But that's why this is a parsimony argument that we're making. We're not saying that we can necessarily capture anything different, or that we're not equivalent in some sense. There is a deep equivalence between this and other things, of course.
But we're saying that it is a theoretical language that might be more parsimonious for explaining certain kinds of things, especially things that have to do with preference change. How can it be good at preference change? Well, one way to see how that could work: let's say I'm faced with the question of whether I like apples or bananas more, or maybe a better example, whether I like rock music or classical music.
And you could have an agent whose life you're simulating, and maybe different things happen to them in their life. They're exposed to more classical music or whatever, and they come to answer this question, what kind of person am I? I'm the kind of person that likes classical music. And they answer that question by retrieving a bunch of memories of all these happy life experiences of listening to classical music.
And that's the way these kinds of models can work, and we see them work. So why might you think that this can capture the same kind of data as reinforcement learning models, the completeness of it? Well, one way to make it at least semi-obvious is that you can, of course, just say: your goal is to maximize this, right?
And then it's a sentence. Your goal is to maximize money or fame or whatever, and then it's not really that different. So that's one way to reduce the models to each other. So it should be comparable. OK. But this gives us a nice kind of social-construction-of-the-individual picture. I think we have another question.
AUDIENCE: I'm just a little stuck on, what's the difference between saying, what should I do and what should a person like me do?
JOEL LEIBO: Good question. Well, it's a language model, right? It's a little bit easier to see there: it's representing what everyone should do. I think that's actually important about the picture. It's one of the ways in which it's different. I have another slide where I talk about this later, but I'll say it now, maybe.
In reinforcement learning, everything is very self-directed, right? You get a reward or a punishment, and it's like something happened and it affected me right now. Now I'm changing my behavior as a result of it. But here, the equivalent things, the sanctions, they don't just change what you do. They change a language model, which includes what everyone in that role should do and would be sanctioned for doing.
So you can hear, OK, the president did this inappropriate thing. I'm not the president myself and I have no intention of being the president, but I still have opinions about it. I might tell other people, isn't this appalling, this inappropriate thing the president did? So it's a picture that's more like training a language model than doing reinforcement learning on an individual.
That's how it's different. I'll say more things; maybe that'll get clearer as we go on. I want to see if I have to speed up here. I guess I have to speed up to get to what in the world appropriateness is. I haven't gotten there yet. OK. So for individuals, there's a picture here about the individual aggregating all the influences on them. We call that the guidance.
There's a story about social identity and so on. But I think I need to move to what appropriateness actually is. For appropriateness in behavior with strangers, we define that as normative. And remember, there are some definitions here; these are choices that we make. We decided that the word norm should be restricted to these impersonal, behavior-with-strangers kinds of cases. We wanted norms to not be sensitive to the details of personal relationships.
Yeah. So the definition here-- and there's a bunch of words in here that are all made up that I'm going to have to tell you what they mean in the next slides. But the definition is behavior is normative if it's encouraged or its complement discouraged by a generically conventional pattern of sanctioning. And so I have to tell you now what generically conventional pattern of sanctioning means.
So there's really two parts. There's the convention part and the sanctioning part. So now the traditional way to define conventions in this type of game theory infused social theory, whatever it is, is to say that when you have a game with multiple equilibria, then you call the equilibria conventions. That's the traditional thing to say. And this is compatible, but we want it to have a reward free way of doing it. Right? Because this is supposed to be a reward free theory.
We found this definition in a paper by Ruth Garrett Millikan, which I think is a really nice way to do that. The way she defined conventions is that they have to satisfy two properties. One is that it's a pattern that is reproduced: there is a source pattern it's created from, and if the source pattern were different, then the reproduction would also be different.
So the first one is meant to be a kind of counterfactual statement. And the second one is that the reason for the reproduction is the weight of precedent, in some sense. We make this precise in the paper; that's overall what it is. There are other comments to make about this. It doesn't mean that everyone has to do the same thing.
There could be partner dances where there's the male part and the female part, but they're one convention. Things like that. It has all the right properties you want for a concept of convention. Another thing that we talk about is the scope of the convention, we call it, which is like, who's involved in this convention?
And specifically, there could be narrow scope conventions, which are just within a family. Like, you could come up with a pet name for your partner and you'll both answer to it and it works like a name, but no one else in the world knows that. And you can do things like that in a friendship group. People you know well, you can build up conventions that are kind of unique to the friendship or family. But there's other conventions that are more like at the societal level.
Those are the wide-scope, the generic-scope conventions; that's what we call them. Now, what are sanctions? Sanctions are just things that change the probability of the agent producing some output after the sanction. So it's another counterfactual statement that more or less says that. We have a more technical way of talking about it on another slide.
But the way to think about it is just that something gets put into the context of this language model that we're using to model a person, and it says: don't do this, or do more of this, or I don't like it when you do that, or I don't think the president should do that, or I don't think this is what good conduct for a judge is like. Sentences like that are sanctions. They could be positive or negative in this picture.
They're almost like reasons here, right? There could also be non-social reasons, but these are social ones. So our definition of a norm is: a behavior is normative if it's encouraged, or its complement discouraged, by a generically conventional pattern of sanctioning. We're saying that if there's a pattern of sanctioning that is itself conventional, so it's reproduced according to the weight of precedent, then the behavior it targets is the thing we call a norm.
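A sketch of that sanction definition in code: a sanction is any utterance that, once added to the agent's context, changes the probability of some target behavior. Here `logprob(prompt, continuation)` is a hypothetical helper returning the model's log-probability of the continuation given the prompt; the example strings are made up for illustration.

```python
def sanction_effect(context, sanction, target_behavior, logprob):
    # Compare the probability of the target behavior with and without the sanction in context.
    before = logprob(context, target_behavior)
    after = logprob(context + "\n" + sanction, target_behavior)
    return after - before   # negative: the behavior is discouraged; positive: encouraged

# e.g. sanction_effect(
#     "Alice is about to speak loudly in the library.",
#     "A stranger says: people don't do that here.",
#     "Alice speaks loudly.",
#     my_logprob,   # hypothetical log-probability function
# )
```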
And then it's appropriate with strangers. That's our picture. This is the slide where I talk about how it's different from reinforcement learning. The main thing is that it's really about training a language model rather than being self-directed, like hitting the individual for doing the wrong thing. It's more like saying, this is what a person in this role should be, and it doesn't just affect your own behavior; it also affects who you sanction as a result of knowing that.
Like, you can say that everyone around me thinks that the president shouldn't act this way, and then you talk about that with others and sanction along those lines. And so it can immediately bind people in uncommon roles. Like the moment someone becomes the president, they have some representation of what a president should be. They don't have to do reinforcement learning to figure that out, which would be a very weird model.
Obviously there are other ways to capture all these effects, but this is the way we're doing it. Yeah. So this definition again. These are things I'm not going to have time to talk about, but there are more technical ways of describing all the same things, and there are implications of these ways of defining things. Like, when there's a convention, you can predict what one individual will do from knowledge of what the rest of the individuals are doing.
And you can say that if there's a population with established norms, they'll tend to stay. And you can say that if new actors enter the population, they'll tend to adopt the norms, things like that. Another thing that's really important, which I definitely want to make sure I have a chance to talk about, is that there's a story about explicit versus implicit norms here. Explicit norms are things that can be verbalized in language.
Think of them as an explicit rule that's written down, like a law or a proverb or something that has a particular verbal form that creates a particular standard that you can have in the prompt of a language model and then complete from and it will respond appropriately and implement the rule. That's an explicit one. There's also implicit norms, which are still a standard, but they're mediated by having been absorbed into the weights of the language model.
So we have a picture that I didn't really say much about just now. The picture is not just that an individual learns by constantly appending more and more sentences to its memory. The individual also learns by doing some fine-tuning on those sentences. That's how we're thinking about it. We could also see it as a model of consolidation or something, if we want to. And with that way of thinking about it, you also end up with implicit norms.
So if you have a bunch of examples that are all describing a particular standard, then those will be absorbed into the weights of the language model that does the prediction in the first place. And this is meant to capture things that are harder to articulate as a precise standard, like how close you stand to a conversation partner. These things are also culture dependent in all the same kinds of ways, and arbitrary. They have all the same properties.
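A sketch of the explicit-versus-implicit distinction under these assumptions: an explicit norm lives in the context window as a verbalized rule, while an implicit norm lives in the weights after fine-tuning on many experience sentences. `model.complete` and `model.fine_tune` are hypothetical stand-ins for whatever inference and training interfaces are available, not a specific library's API.

```python
def apply_explicit_norm(model, rule, situation):
    # The rule is just one more sentence in the prompt; the model itself is unchanged.
    return model.complete(f"{rule}\n{situation}\nWhat happens next?")

def absorb_implicit_norm(model, experiences):
    # Many concrete examples of sanctioned and approved behavior get consolidated
    # into the weights, so the standard shapes predictions even when no rule
    # appears anywhere in the prompt.
    return model.fine_tune(experiences)
```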
So the kinds of norms we're talking about are deeply not just the moral kind, but we also think those work the same way; everything that feels normative is here. And the difference between implicit and explicit is also meant to explain effects like the Jonathan Haidt moral dumbfounding experiments, where you get these particular protocols in which people find it hard to explain why they have a certain moral judgment, but they still have the judgment.
The idea here is that it's hard to defeat an implicit norm with explicit reasoning, but you can defeat an explicit one. So the implicit norms trump the explicit norms in this picture. And this is what I said a second ago about in-context versus in-weights changes. I think this can explain the stylized facts, that these things are all context dependent, culture dependent, and role dependent, both the implicit norms and the explicit norms.
I was just arguing that. Obviously, for the explicit norms, the rules are different in different places, so you end up with different patterns of behavior. For the implicit norms, the experience people get is different in different places, so those end up different too. And appropriateness has the same arbitrariness properties, as we were just talking about.
And it's automatic here in the sense that if we think that most things are handled by implicit norms-- you don't have to have in your global workspace at all times an explicit representation of the form of the rule that you're following. That would be a very weird way to accomplish things. It's like you're constantly saying to yourself, like, I should go through life by saying, follow this rule, follow this rule, over and over again so I don't forget it.
The picture is that you don't have to keep rehearsing internally and doing things explicitly. Instead, it's implicitly in the weights of the network: you just wouldn't make that prediction from this context to that output. Also, it can change rapidly, because in the dynamics of these things, that's what happens; it's very bandwagon-effect-like.
And it's all underpinned by sanctioning. So that's how it fits. And like I said, this was just the part with strangers. There's a different story with friends and family, because there you have to take into account the precise dynamics; something like tit for tat is important with friends and family, because you're seeing the same person back and forth over time. That's not possible with strangers.
And right, this was all meant to talk about how we can live together. I'm trying to wrap up because I feel like I'm out of time, probably, or I want to take questions, I guess. But I'll try to quickly say something about the implications for AI, though. I really do think that this is an important picture for AI. But the way to understand it is that it's a picture about humans, right?
And the fact that humans have this attitude toward what's appropriate has these implications for how we should design AI systems. So now I'm no longer talking about the AI systems I'm using to model the humans. Now I'm just talking about technology. And so I think the top level thing is the importance of context and culture and everything of that type.
And there's a weird thing about the AI community where the quest is for generality, so it's very easy to forget that different people want different things, and they want different things in different contexts. I've been using this thought experiment to talk about it with people a lot: you could imagine that there are two apps on your phone, just two different buttons, right?
Two icons you click on. One is labeled 'comedy writing assistant' and the other one is labeled 'tech support helper' or 'search engine' or whatever. And maybe behind the scenes, those two apps are literally the same language model, exactly the same. Nothing is different whatsoever. But just the fact that the human clicks on the comedy writing assistant clearly makes a huge difference for what it is appropriate for that bot to say.
And I think that's something that's really important and not as commonly understood in AI circles as it really should be. And I think it has a lot of implications for how we organize things and govern them and many other downstream things. And I will probably leave it there. That seems like a good spot to take questions. Yeah.
AUDIENCE: Cool.
[APPLAUSE]