November 20, 2023
November 14, 2023
Peter Dayan, Max Planck Institute for Biological Cybernetics
All Captioned Videos Brains, Minds and Machines Seminar Series
Abstract: Much existing work in reinforcement learning involves environments that are either intentionally neutral, lacking a role for cooperation and competition, or intentionally simple, when agents need imagine nothing more than that they are playing versions of themselves or are happily cooperative. Richer game theoretic notions become important as these constraints are relaxed. For humans, this encompasses issues that concern utility, such as envy and guilt, and that concern inference, such as recursive modeling of other players. I will discuss some of our work in this direction using the framework of interactive partially observable Markov decision processes, illustrating deception, scepticism, threats and irritation. This is joint work with Nitay Alon, Andreas Hula, Read Montague, Jeff Rosenschein and Lion Schulz.
Bio: Peter Dayan studied mathematics at Cambridge University and received his doctorate from the University of Edinburgh. After postdoctoral research at the Salk Institute and the University of Toronto he moved to the Massachusetts Institute of Technology (MIT) in Boston as assistant professor in 1995. In 1998, he moved to London to help co-found the Gatsby Computational Neuroscience Unit, which became one of the best-known institutions for research in theoretical neuroscience, and was its Director from 2002 to 2017. He was also Deputy Director of the Max Planck/UCL Center for Computational Psychiatry and Ageing Research. In 2018, he moved to Tübingen to become a Director of the Max Planck Institute for Biological Cybernetics.
Research Interests: Peter Dayan’s research focuses on decision-making processes in the brain, the role of neuromodulators as well as neuronal malfunctions in psychiatric diseases. Dayan has long worked at the interface between natural and engineered systems for learning and choice, and is also regarded as a pioneer in the field of Artificial Intelligence.
Selected Awards: In 2012, Peter Dayan received the Rumelhart Prize for Contributions to the Theoretical Foundations of Human Cognition and, in 2017, the Brain Prize from the Grete Lundbeck European Brain Research Foundation. In 2018, he was elected as a Fellow of the Royal Society of the United Kingdom. In 2019, he became Fellow of the American Association for the Advancement of Science (AAAS). Peter Dayan was also awarded an Alexander von Humboldt Professorship, Germany's most highly endowed research prize, for which he will join the Department of Computer Science at the University of Tübingen.
SPEAKER: Welcome. I'm really just delighted to be introducing Peter Dayan to you, who will be speaking today. Peter directs the Max Planck Institute for Biological Cybernetics. We're really happy to have him back at MIT, where he was on the faculty from 1995 to 1998.
He's received the Rumelhart Prize. He's a fellow of the Royal Society in the UK and the American Association for the Advancement of Science. You can read all about that on his webpage.
The thing I really want to say to welcome Peter is, he's been an inspiration, I think, to many generations of students in two very distinct ways, one, intellectual. His research has really deeply synthesized and integrated work at the computational, cognitive, and neural levels of analysis, really bridging across those areas very substantively over many decades. And two, I remember him as just a very generous mentor, staying up late at NeurIPS, talking to students until 2:00 in the morning and answering some of our hardest questions and inspiring us for years.
So just really delighted to have Peter here. Without further ado, Peter.
PETER DAYAN: Thank you very much indeed for that very generous introduction. Unfortunately, I'm now so old that I don't think I ever see 2:00 in the morning except when I fly. That's so sad. But no, thank you. It's really a real great pleasure to be back at MIT.
In my day, we were over in E25, and the department was also in E10, this really old building. So it's really nice to see what fantastic new digs you have. And MIT has changed so much, even that [INAUDIBLE] has now disappeared. Josh tells me that you've gone up market, even above my favorite breakfast spots.
By the way, please do interrupt me and ask questions in the middle. If you have a question, just raise your hand. So I'm going to talk about work done with a number of people. Right at the end, I'll talk about some work I did with Andreas Hula, who's now in Austria, and Read Montague, who's at Virginia Tech. And then I'll talk about some work also with Nitay Alon, and he was a student between me and Jeff Rosenschein at Hebrew University, and Lion Schulz, who's just submitted his thesis yesterday at [INAUDIBLE].
OK, so I don't need to say this. The audience know that we are always interacting in a-- frequently interacting with intentional social agents. The agents that we work with from a perspective are-- they have beliefs, desires, and knowledge of their own. And that then leads to a bunch of rich phenomena that we would like to explore.
So on one hand, we could have cooperation, so therefore we know these entities we'd like to interact with in a positive way. Not at MIT but at other institutions, they obviously have competition where there may be competitive interactions with organisms too. And then, also, it's very important for various aspects of psychopathology, so autism, personality disorders. I'll talk a bit about borderline personality disorder at the end.
Whereby, because we're a social species, of course many aspects of psychopathology, then, affect the way that we interact with others in positive and negative ways as well. So what social characteristics might we be worried about? There of course are a very, very large number, and we're not going to be able to cover them all. In fact, the field as a whole has not covered them. It has only, in some sense, scratched the surface of them.
But we might have things like social propensities, so the thing we'll think about at the end are aspects to do with, say, envy and guilt. So I might be envious if in an interaction you get more money than me, or you do better than me in some way. I might feel guilty about doing better than you.
And those social propensities, they may be factors of our own utility functions, which are therefore going to have a big impact on the way that we might interact with each other. And then if I know, for instance, that you're guilty, then I could then, for instance, exploit that. So if I can learn that you're guilty, I could exploit that in an interaction. Or if you know that I'm guilty, you could exploit that in interaction with me.
We may have ignorance about the state of information that our partners have, so I might know what? I might know my state in the world, but I might not know the state that you have in the world or what you've observed. That's something else you might like to learn about. Then, in particular, we can have-- and that's going to be a real theme running through this talk is, we can have aspects of recursive modeling.
What I mean by that is that, on the one hand, we can have just a one player who might have-- in this case, there may be one of three possible policies shown by these colors. We'll see what that means later when we play an ultimatum game, or this will play an ultimatum game with it. But this person that we're interacting with or that one person might interact with-- so then the other player could have a model of the person, the player 1.
So if we think of this as being player 1 that just plays in a particular way, player 2 has its own beliefs and desires, but also could have a model of player 1. And then player 1 could have a model of player 2's model of player 1, and player 2 could have a model of player 1's model of player 2's model of player 1, and so forth. We can just go up and up in this recursive hierarchy, and at the end, if I have enough time, I'll play you a little clip from The Princess Bride, which many of you may know, which nicely illustrates this interaction between these players.
And so what that means is that now because these entities-- because these players-- they have their own intentions. They have their own beliefs and desires. We can then essentially plan through-- we can try and plan through our model of them. But we have to worry that they're also planning through our model of them, too. And that's when some of the richness of this recursive reasoning will become important. And we'll talk a bit about that or show you some examples of that here.
So we're studying it-- so a handy framework, a handy, crisp framework to study this comes from game theory, comes from economics, essentially, where these issues have been studied for very many years. And so we're taking advantage of some of the understanding that comes in game theory, but making it more cognitive because the theory in game theory, like Nash equilibria, or Bayes Nash equilibrium in the case that we have these unknown propensities is a beautiful but crystalline theory that slightly shatters when you meet the real world of humans playing games.
So here, we'll use very simple games of deception, generosity, and trust. And, of course, we are very far from the only people to work on this. And so, in fact, a lot of the work here has come actually from people at MIT and Harvard. So I just put some of the names down. Many of your colleagues are here. So we very much acknowledge the tremendous contributions that have come from many individuals here and probably others as well. But this is the framework we're going to look at.
And, in particular, the framework that we're going to apply is a thing called interactive partially observable Markov decision processes, which came from Doshi and Gmytrasiewicz in the early 2000s. And what they were interested in doing is taking-- if you're planning in the face of ignorance, then that's essentially a form of partially observable MDPs, MDP for the normal planning component. And then what makes it partially observable is that you don't have full knowledge of the environment. So here, the environment could be aspects of the real environment.
But it could also be aspects of the other individuals. I don't know how guilty you are or how envious you are when I play with you. And so I'm going to have to probe and do many things to work out what you're like. And then what makes it interactive is to think about this hierarchy as we go up through these stages. That leads to lots of computational complexity, which I will mostly finesse. But I think there's some really interesting questions that come up there, and I know that [INAUDIBLE] and others are thinking about these at the moment.
So to introduce this, I think some of you may be familiar with this game. So who's heard of the p-Beauty game? Hands up. Oh, a few of you, not all of you-- good. So this comes from the work of Keynes, and I think it's been most beautifully done by Rosemarie Nagel in various contexts.
But the idea of this game is all of you in this audience, you're going to choose in your mind a number between 1 and 100. And the rules of the game are I'm going to give a prize of some sort to the person whose number is closest to 2/3 of the mean. So I invite you to think about what number you're going to choose. So we just calculate all the mean. We calculate all the numbers. We calculate the mean. And then I'm going to give you a prize, the one who's closest to 2/3. I won't get you to shout out your number. That would be a bit ear shattering.
But if you look at what people do-- so the thought [INAUDIBLE] there's a couple of really nice experiments done, actually, by newspapers for some reason. So this was one done by The Financial Times. And what they're doing here, they had-- whatever it is-- 1,500 people send in their answers to this. And these are the numbers they got, actually, between 0 and 100 in this exact game with the 2/3 case here. And you can see what distribution of things that people generated. So some people went for 100. That's interesting. They're the ones who-- yes.
These are FT readers. Maybe they're the ones who didn't pay their subscription. And then what's striking is you see discrete peaks at, in this case, 33, 22, and then it degenerates, and then there's some numbers around towards 1 or 0. And so you could imagine-- and this is why it's relevant for recursive reasoning-- a reasoning process that goes like this. Some people think, well, the average is going to be 50, and they just say 50. They didn't think very hard.
And then other people think a little bit harder, just literally one step harder, and think, well, if the average is going to be 50, then I should go for 2/3 of 50. Then I'll be in the money. So they go for around 33.
And then other people think, well, those smart people, they'll go for 33, but I'm one click smarter than them, so therefore, I should go for 2/3 of 33 and get to 22 or whatever, and then so forth. And then, there are other people, the smart alecks-- again, none of whom are at MIT. There's the Caltech version. They think, oh, well the Nash equilibrium of this is I should go to 0 or the lowest number possible because then-- if everyone went for that. But, of course, they don't win either because they're not living in an environment where everybody is-- goes to Caltech.
Fortunately. So the average, as you see, in the FT group was about 19, and the winning number was, therefore, 13. [? And then ?] somebody got that. So it wasn't only done in the hapless Brits in FT. This was done also in Germany, where you see, again, the same thing discrete peaks at the people who misunderstood the question, the discrete peaks at 33, 22, and then, there, the average was about 22, and the winning number was about 14. And then this was done in a Spanish newspaper, and again, you see they didn't have 0 as an option, just had 1. And again, you can see these discrete peaks.
And so what this invites us to understand is that there are these sequential aspects of reasoning that people go through, and I think it's really nicely shown by these discrete reasoning. And so when we think about a model that you might have of somebody else, that's exactly what they're doing. They're building a model of the other people, but they only build a one-click model. They don't build a two-click model or a three-click model. That's, I think, the only way you can imagine that you get these very discrete peaks at those numbers. So that's what we're going to think about in our games.
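The step-by-step reasoning described above can be written down in a few lines. This is only a minimal sketch: the starting guess of 50 and the factor of 2/3 come from the talk, while the function names and the choice of four reasoning steps are illustrative assumptions.

```python
def level_k_guesses(anchor=50.0, p=2 / 3, depth=4):
    """Guesses of reasoners who think 0, 1, ..., `depth` steps ahead.

    A level-0 player just says the naive average (`anchor`); each higher
    level best-responds to the level below by multiplying by `p`.
    """
    guesses = [anchor]
    for _ in range(depth):
        guesses.append(p * guesses[-1])
    return guesses


def winning_target(mean_guess, p=2 / 3):
    """The number that wins the game, given the mean of all submitted guesses."""
    return p * mean_guess


guesses = level_k_guesses()
print([round(g, 1) for g in guesses])  # [50.0, 33.3, 22.2, 14.8, 9.9]
print(round(winning_target(19.0)))     # 13, matching the FT figures quoted in the talk
```

The first three values line up with the peaks at 50, 33, and 22 mentioned in the talk; iterating forever would approach the Nash equilibrium at the lowest allowed number.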
So here's the first game. So here, we're actually taking a nice example that comes from Jara-Ettinger, Julian Jara-Ettinger, et al, and actually-- and Josh. So here, what we imagine is we have two players, a buyer and a seller. And so we're adding-- we're going to make it a bit more manipulative than the way that Jara-Ettinger did it.
So here, and there are two items that we can go for. There's an apple and an orange. And the idea is it's a supermarket where the buyer comes into the supermarket at a particular-- in this case, we imagine that there's-- on a linear track, and comes in somewhere, and there's an apple-- could be close, could be far. Here, the apple is close, and the orange is far away. And the buyer has a preference between apples and oranges. In fact, we can imagine that has a reward value. The sum of the reward values of the apples and oranges is going to equal 10, and there's the 10-step manipulation here.
And so, in the end, the buyer could just go in and can go and choose to walk and go and get the apple or go and get the orange, and then it gets a reward value, which is the difference between the reward and the distances gone. But meanwhile, he's being observed by the evil Amazon in the sky, the seller here, who's watching him moving around his store. And then so here, he chose the orange, and he gets this utility for the orange.
So then the seller observes this, and then the buyer is going to then go and, actually, having got this one for free, is then going to go and get one he's actually going to have to pay for. And the seller gets to choose a price-- gets to choose a price for the oranges and price for apples. Then the buyer then has to go and purchase-- we have either the apple or the orange now based on the price that the seller chooses. So here, he chooses the apple.
Then the second utility for the buyer is then the difference between the reward and how much you had to pay. That's the cost that the seller got. And the seller then gets the value of the price that he's able to charge for the apple.
And the reason this is an interesting task, in a sense, is the buyer here is essentially expressing a preference by moving towards the apple or the orange. So he has a preference between the apples and the oranges. That's the reward value itself. And if he were to go a very long way in the environment to go and get the orange, for instance, here, he's telling the seller I really, really value the orange. And so the seller thinks aha, well, that means I can now charge you a lot for the orange, and then when you are going to have to buy the orange at the end, I'm therefore going to make more money this way.
And so we have the opportunity to learn-- the seller has the opportunity to do what we might think of as being or what is exactly a form of inverse reinforcement learning to work out what the reinforcement value is to the buyer of the orange or the apple based on how far he's willing to go. And then he can therefore take advantage of the buyer in the future.
And so now the buyer, if he could do one click more reasoning than the seller, can say, aha, I'm not going to be exploited by this evil seller, so I'm going to change my behavior in order not to reveal the information to the seller that would then allow the seller to take advantage of me. So this is you hiding your preferences on Amazon using cryptographic techniques, for instance.
So we live in a very simple case, where the sum of the rewards or the sum of the distances and so forth-- or you can [INAUDIBLE] just make life very simple. And so we have the buyer-- so here, we ground this at a very simple buyer, where the buyer is just a level-- we call this a theory of mind or sometimes depth of mentalization minus 1, where we imagine that's just a simple reinforcement learning agent that doesn't plan, just thinks just one step-- basically thinks I just want to get the apple or the orange based on my reward.
We then have a seller who is, then, observing the buyer, seeing what the buyer does, and therefore can then set the prices. Then we now have the theory of mind 1 buyer, who is now thinking, aha, this is what the seller is going to do. And so I can, therefore, change my behavior to try and confuse that evil seller. Then the seller can think about what the buyer is going to do, and then the buyer can-- so forth. So we can then scan our theory of mind as deeply as we like in this context.
So what's the policy look like? So what you might call-- what Jara-Ettinger called this-- a naive utility calculus. Just says that if we think of this theory of mind minus 1, we then think about how likely are we to choose the apple. That's shown by blue here-- and then as a function of the distance of the apple and the reward value of the apple. So you might think that if either the distance of the apple is very small, or the apple is very valuable, then our naive buyer is going to buy the apple and, otherwise, is going to buy the orange. And so we get this very nice here. We have a little bit of softmax noise just to soften out the decision curve.
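A rough sketch of this simplest buyer may help make the picture concrete. The talk specifies that the two rewards sum to 10, that utility is reward minus distance walked, and that there is some softmax noise; the temperature value, the separate `dist_orange` argument, and the function name are assumptions made only for illustration.

```python
import math

TOTAL_REWARD = 10.0  # from the talk: apple and orange rewards sum to 10


def p_choose_apple(reward_apple, dist_apple, dist_orange, temperature=1.0):
    """Probability that the naive (level minus 1) buyer walks to the apple.

    Utility of each fruit is its reward minus the distance walked; the
    choice is a two-way softmax, which reduces to a logistic function of
    the utility difference. `temperature` is an assumed noise parameter.
    """
    u_apple = reward_apple - dist_apple
    u_orange = (TOTAL_REWARD - reward_apple) - dist_orange
    return 1.0 / (1.0 + math.exp(-(u_apple - u_orange) / temperature))


# A valuable, nearby apple is chosen almost surely; equal utilities give 50/50.
print(p_choose_apple(reward_apple=9, dist_apple=1, dist_orange=9))
print(p_choose_apple(reward_apple=5, dist_apple=3, dist_orange=3))
```

Lowering `temperature` sharpens the decision curve toward a hard threshold; raising it softens it.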
So if you're a seller, then what you're going to do is you now get to set the price for the apple and the orange according to what the buyer did-- you observe the buyer. And what you essentially say is, if the buyer went to get the apple, then as a function of the distance to the apple that the buyer went in that first stage, you can say how much you are going to charge for the apple in the second stage. And as you might imagine, if the buyer went a very long way to the apple in the first stage, then the seller is going to charge a lot for the apple in the second stage, and therefore less for the orange. And that's what happens. And so, therefore, you're seeing this preference be expressed. And then this is the seller taking advantage of the buyer in that context.
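The inference the seller makes from the buyer's first move can be sketched as a small Bayesian update. This is a hedged illustration, not the model from the talk: the uniform prior over integer rewards 0 to 10, the softmax temperature, and all function names are assumptions, and the seller's actual pricing rule is left out because the talk does not spell it out.

```python
import math

TOTAL_REWARD = 10.0  # from the talk: apple and orange rewards sum to 10


def p_apple(reward_apple, dist_apple, dist_orange, temperature=1.0):
    """Assumed naive-buyer likelihood: softmax over reward minus distance."""
    diff = (reward_apple - dist_apple) - ((TOTAL_REWARD - reward_apple) - dist_orange)
    return 1.0 / (1.0 + math.exp(-diff / temperature))


def posterior_over_apple_reward(chose_apple, dist_apple, dist_orange):
    """Seller's posterior over the buyer's apple reward after one observed choice."""
    rewards = list(range(11))         # assumed grid of possible apple rewards
    prior = 1.0 / len(rewards)        # assumed uniform prior
    unnormalized = []
    for r in rewards:
        likelihood = p_apple(r, dist_apple, dist_orange)
        if not chose_apple:
            likelihood = 1.0 - likelihood
        unnormalized.append(prior * likelihood)
    total = sum(unnormalized)
    return {r: u / total for r, u in zip(rewards, unnormalized)}


def posterior_mean(posterior):
    return sum(r * p for r, p in posterior.items())


far = posterior_over_apple_reward(chose_apple=True, dist_apple=8, dist_orange=2)
near = posterior_over_apple_reward(chose_apple=True, dist_apple=2, dist_orange=8)
print(posterior_mean(far), posterior_mean(near))
```

Walking a long way to the apple pushes the inferred apple reward up much more than picking up a nearby apple does, which is the information the seller exploits when setting prices.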
So what does the buyer do to protect himself? So this is the optimal policy of the buyer who's then deciding to protect himself against the seller's evil intent. And so what you can see here, for instance, is if the buyer is very close to the apple, then even if he doesn't like apples very much, and so he likes oranges much more, he's still going to choose the apple to avoid expressing his strong preference for the orange to the seller. So, therefore, the seller can't manipulate him, can't exploit his preference quite so well. You see, there's something else interesting-- a bit more bizarre happens in the center, and then it's obviously symmetric if the buyer likes the apple-- the orange instead.
And so, now, the seller, if he's now one click above that, now gets to set the prices to try and to say, OK, well, I now understand what the buyer is going to do. And now you see, for instance, that the seller says, well, the fact that he went for the apple, even though the apple was very close, actually tells me nothing about the preference for the apple versus the orange because the buyer who's this smart, this theory of mind, depth of mentalization 1, is going to go for the apple if he's very close to the apple anyway.
So I therefore can't read anything out about the value of the apple or the value of the apple or the orange from the buyer's choice. And, therefore, he does not change his price that he's willing to-- he's asking the seller to pay-- ask the buyer to pay by choosing the apple if the apple is very close. And then we can just recurse again and again. We'll see it basically makes almost no difference at this point-- very small changes to the policy there.
And one way of looking at what goes on is to use an information-theoretic understanding of how you get manipulated by this or how you change your idea about the value of these outcomes here. So here, what we're looking at is the mutual information between the buyer's desire-- the buyer's reward function for apples or oranges-- and the action of going for an apple or orange, as a function of the distance.
What you see is if the buyer is a low theory of mind-- so buyer is this minus 1 theory of mind, then you get this nice pattern of how it provides information. So, as you imagine, you provide the most information if it's very close to 50/50, but then, even if it's a close, you get-- even if it's not exactly 0, you still leak some information about your preferences. And that's what the seller is exploiting by changing the value in this case here.
But if we now have a smarter buyer, one who's higher up these recursive levels of reasoning, you can now see that, for instance, in this region of distance, where it doesn't matter whether he likes apples or oranges, he's always going to choose the apple because he's very close to the apple. As you imagine, he actually provides no information about his value because this is the action he's going to do anyway, which means that this curve is then flat. He provides no information through his responses, and that's why the seller can't exploit the information that he's not providing during that time. And then you can see that it gets even a little more funky for different values of the theory of mind.
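To make the leaked-information idea concrete, here is a small sketch that computes the mutual information between the buyer's reward and his binary choice. It assumes a uniform prior over integer apple rewards 0 to 10 and a softmax naive buyer; both are illustrative assumptions, as are the function names. The "hiding" buyer is simply one who picks the apple regardless of reward, which is the flat-curve case described in the talk.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a coin with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def mutual_information(p_apple_given_reward):
    """I(reward; action) in bits, with a uniform prior over the listed rewards.

    Uses I = H(action) - H(action | reward).
    """
    n = len(p_apple_given_reward)
    marginal = sum(p_apple_given_reward) / n
    conditional = sum(binary_entropy(p) for p in p_apple_given_reward) / n
    return binary_entropy(marginal) - conditional


def naive_policy(dist_apple, dist_orange, temperature=1.0):
    """Assumed naive buyer: P(apple | reward) for apple rewards 0..10."""
    probs = []
    for r in range(11):
        diff = (r - dist_apple) - ((10 - r) - dist_orange)
        probs.append(1.0 / (1.0 + math.exp(-diff / temperature)))
    return probs


revealing = mutual_information(naive_policy(dist_apple=5, dist_orange=5))
hiding = mutual_information([1.0] * 11)  # always picks the apple, whatever the reward
print(revealing, hiding)
```

A policy that does not depend on the reward carries zero bits, so a seller observing it has nothing to update on.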
Another way you can look at this is how the seller gets to be skeptical about what the buyer is doing. So here, this is the Kullback-Leibler divergence between the prior that the seller has about the reward function of how much the buyer likes the apple versus the posterior. So how much has it been moved around, essentially-- one way of how much he's been moved around by the choice of the buyer.
And, as we'd imagine, that just reflects this mutual information. So if the buyer leaks a lot of mutual information-- as a function of how far the buyer moves in the first place-- then the seller's posterior is quite different from the seller's prior. And, in the case that the buyer protects himself by changing his behavior in this precise way, you can see that, therefore, the seller is skeptical. He doesn't move from his prior because there's just no information available based on the choice of the buyer, in that case.
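The seller's "belief movement" can be sketched as a Kullback-Leibler divergence between posterior and prior. The talk does not state the direction of the divergence, so the choice of KL(posterior || prior) below is an assumption, as are the uniform prior, the example likelihoods, and the function names.

```python
import math


def bayes_update(prior, likelihood):
    """Posterior proportional to prior times likelihood, over a discrete grid."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on the same grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


prior = [1.0 / 11] * 11  # assumed uniform prior over apple rewards 0..10

# An informative choice: the likelihood of "apple" rises steeply with the reward.
informative = [1.0 / (1.0 + math.exp(-(2 * r - 10))) for r in range(11)]
# An uninformative choice: the buyer would have picked the apple anyway.
uninformative = [0.9] * 11

moved = kl_divergence(bayes_update(prior, informative), prior)
skeptical = kl_divergence(bayes_update(prior, uninformative), prior)
print(moved, skeptical)
```

When the likelihood is the same for every reward value, the posterior equals the prior and the divergence is zero, which is the skeptical seller in the talk.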
And then we can look at the issues to do with the error that the buyer can enforce in the seller. So here, what we're doing is looking at the Kullback-Leibler divergence between the seller's expectation about what the buyer would do versus what the buyer actually does when, here, the buyer is where-- the seller's expectation is based on it being a dumb buyer, theory of mind minus 1 buyer, when, in fact, the buyer is a theory of mind 1 buyer here.
And then this shows this KL divergence is quite big, which means the buyer is able to convince-- to fool the seller quite substantially in this one when you go from minus 1 to 1. But in the case that we now, instead of being minus 1 to 1, we now look at 1 to 3, so instead, we say the seller is now the level theory of mind level 2 instead, now the degree of seller error is much lower, and that's consistent with the fact that basically everything's stabilizing in this particular direction.
So what we have here in this very simple example already is-- a simple example in terms of the mechanics of the operation-- is we have quite complex signaling and planning happening. So signaling is happening because we're signaling through the buyer, signaling through the choices that he's making. The very first choice, already, is signaling quite a lot. And planning is happening because he has to plan-- the buyer is planning to see what the seller is going to do with the information it has.
So we have inverse RL. So the very first thing that the seller does is to do a form of Bayesian inverse reinforcement learning. That's the naive utility calculus that I mentioned from Jara-Ettinger [INAUDIBLE] or from Baker and Tenenbaum. So planning through this-- and then the buyer then plans through the inverse reinforcement learning that the seller is doing.
We have partial observability because it starts off where the seller doesn't know what the buyer's like, and then we model it through this interactive POMDP mechanism that I mentioned from Gmytrasiewicz and Doshi. And then the question, of course, comes, well, what happens if the buyer has a bit more agency? So here, the buyer doesn't have a huge amount of agency, just makes this very first choice. And so we'd like to then look at some other games to look at what else might happen.
So here's another game that is the foundation of the next game we're going to look at. So I'm sure you're all familiar with the dictator game. So here, one player just gets $10, and then it chooses a split-- then, it's a dictator, so it gets to enforce this on other people. So this is a good one for the oldies in the audience. So this is what the fraction that's given-- and this is a huge study that came-- a huge meta-analysis done by one of my colleagues, Max Planck colleagues, called Christoph Engel of 20,000 observations or so.
So you can see that this is, then, the distribution of how much people give in the dictator game. So there's a peak at 0. There's a peak around 50%. And some people are extremely generous, [? one, ?] and then that's the distribution here.
And I think the average here is, like, 28% is what gets given by the dictator. If you do it in a way that people can see who's doing it, then people get more generous. So then there's a big peak at 50/50, if the dictators know that people are watching them as they go.
If you tell a sob story, so you talk about it-- there being associated with a charity, for instance, people get a lot more generous, which is good. And nonstudents are much more generous than students-- nothing to do with their poverty, of course. If you look across ages-- again, this is nice, this meta-analysis from Christoph Engel-- so children are very mean. Students are even meaner. Middle-aged people are fairly generous, and old-age people, like myself, are extremely generous. So if you want money, you have to tap the oldies amongst us.
But the game that we're going to talk about actually is the ultimatum game, which is basically a bit like the dictator game, except that now you have two players A and B. So A proposes a split, and B gets to say yes or no. If B says yes, then that split is implemented. If B says no, then neither of them gets anything. So here's a table. So you may not be able to read it. So this is a nice-- again, another meta-analysis by another group of people who looked at many, many studies here to see, in this instance, what fraction across these different cultures and different studies, how much people give.
So you can see that in the UK and the US, the mean offer is about 40. They're a bit more generous in the West Coast, interestingly. And then the mean reject-- so that's like how much do you have to offer such that people don't reject it-- is quite different between the East Coast and the West Coast, at least in these studies, where you get rejected if you offer less than 17%. That varies hugely amongst different countries. And then this table is a table about Gini coefficients and so forth. It's kind of interesting.
Anyhow, so we're going to do an iterated version of the ultimatum game, where we have 10 rounds where you always have the sender and the receiver. And this is a notional game here. And then we have some discounted utility function as well. And then the receiver we imagine has a straight utility. So the receiver-- just whatever the utility function receives is just whatever the money that the sender is going to give him.
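The rules of the iterated game can be sketched as follows. The accept/reject payoff rule and the receiver's purely monetary utility come from the talk; the pot being normalized to 1, the discount factor of 0.9, and the function names are assumptions for illustration.

```python
def round_payoffs(offer, accepted):
    """(sender, receiver) shares of a pot of size 1 for one ultimatum round."""
    if not 0.0 <= offer <= 1.0:
        raise ValueError("offer must be a fraction of the pot")
    return (1.0 - offer, offer) if accepted else (0.0, 0.0)


def discounted_receiver_utility(offers, decisions, gamma=0.9):
    """Receiver's discounted sum of accepted offers; `gamma` is an assumed discount."""
    total = 0.0
    for t, (offer, accepted) in enumerate(zip(offers, decisions)):
        total += gamma ** t * round_payoffs(offer, accepted)[1]
    return total


# Rejecting a low first offer costs the receiver now but may pay off later.
offers = [0.1, 0.3, 0.4]
decisions = [False, True, True]
print(discounted_receiver_utility(offers, decisions))
```

The trade-off between an immediate loss from rejecting and better offers later is what makes the horizon and the discount factor matter for when the receiver starts accepting.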
And the sender-- we're going to imagine, actually, three different senders. Two of them are what we call threshold senders. So here, the utility for the sender is just the amount that she'll get, which is 1 minus the amount that she gives to the receiver. And then we have a threshold as well, minus a value eta, and that eta is going to be either, say, 0.1, which means that the sender is going to have to keep at least 10%, like 0.1 of the money, to actually have a positive utility at all.
We have a sender who's a bit more mean, so the sender who needs to keep at least 50% of the money in order to have any reward at all. And we'll also imagine a sender who's uniformly random, which means that the sender just chooses any random number all the time. And the existence of this uniform random sender is going to make for interesting dynamics in the game, as you'll see in a minute.
So let's think about the sender. So, again, the whole format is grounded on what the level minus 1-- so the baseline, the base level decision maker is. So we imagine the sender-- obviously, the random sender just acts randomly, just sends money randomly.
The threshold sender, we imagine, is a bit smarter and maintains, basically, a lower and upper bound for-- because, essentially, the sender is having her offers rejected or accepted. And so we imagine that she has lower and upper bounds for what she needs to offer in order to get any money. And so if an offer gets rejected, the lower bound will increase. If it gets accepted, then the upper bound will decrease, and so forth. So basically, there's a very simple way that the sender is working out what the value is going to be.
And then we imagine we have a softmax policy within the bounds. So, of course, the sender would like-- she'd like to get more money, but she obviously realizes there's a rejection process happening as well. So that's what this depth of mentalization, this very simple sender is going to do.
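The bound-tracking sender with a softmax policy can be sketched as follows. This is only an illustration under stated assumptions: offers live on a hypothetical grid of tenths of the pot (integers 0 to 10), the temperature value is made up, and the function names are not from the talk.

```python
import math

def update_bounds(lower, upper, offer, accepted):
    """Offers and bounds are integers in tenths of the pot (0..10)."""
    if accepted:
        # An accepted offer shows she need not offer more than this.
        upper = min(upper, offer)
    else:
        # A rejected offer shows she must offer more than this.
        lower = max(lower, offer + 1)
    # Keep the interval non-empty (a hypothetical tie-break).
    return min(lower, upper), upper

def offer_probabilities(lower, upper, eta, temperature=0.1):
    # Softmax over the sender's utility, restricted to offers within the bounds.
    candidates = list(range(lower, upper + 1))
    utilities = [(10 - o) / 10 - eta for o in candidates]
    best = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp((u - best) / temperature) for u in utilities]
    total = sum(weights)
    return {o: w / total for o, w in zip(candidates, weights)}

if __name__ == "__main__":
    bounds = (0, 10)
    bounds = update_bounds(*bounds, offer=2, accepted=False)  # lower bound rises
    bounds = update_bounds(*bounds, offer=5, accepted=True)   # upper bound falls
    print(bounds, offer_probabilities(*bounds, eta=0.1))
```

Within the bounds the softmax favours the lowest offers, since those leave the sender the most money, while the bounds themselves encode what she has learned from rejections and acceptances.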
And you can imagine that a good idea for a smarter sender is to try and pretend to be random. And the reason for that is that, remember, the receiver would like to get more money. So the receiver would like the sender to send more money. And the receiver can only signal that by rejecting. The sender offers him some money. He says, no, that's not good enough, basically. And that's telling the sender that she has to up the offer, which is what would happen through these upper and lower bounds.
But if the receiver is convinced that the sender is random, then there's nothing he can do. There's no point in trying to push a random sender around because the random sender is not going to change her behavior at all. So, therefore, it's in the interests of senders to pretend to be random in order to fool the receivers into essentially accepting any old nasty offer that they make. And that's exactly what we'll see happens.
So here's a sample plan, a sample path for-- in this case, we have the simplest sender, either the random sender or the threshold sender, and then we have the simplest receiver, who here just has linear money as being their utility function. And here, you see the offer of the random sender, which, of course, is random. Here are the offers of the threshold senders, so either the 0.1 or the 0.5 sender.
And what you're seeing here is when there's an open circle, that's rejected. When there's a closed circle, it was accepted. So the random sender basically makes a very high offer to begin with. That gets accepted, and then you just carry on like this, and you see this interaction.
The threshold senders, they start low because they want to get money, and then they get rejected. And so they get pushed up and up and up until they hit the value at which it makes sense for them to offer. And then the receiver will start accepting, at some point, which depends on the horizon and the discount factor and a whole bunch of factors like that. So that's when you get this acceptance happening. And then, here, because it's 10 rounds, they just carry on like that.
And now we can look at the beliefs of the receiver. So what has the receiver inferred about the sender during this time? So now, here, the receiver is doing inverse reinforcement learning. And here, it's doing inverse reinforcement learning about the type of the sender. Is it a threshold sender with 0.1 or 0.5? Or is it a random sender? And it does a great job. So here, if it's a random sender, the receiver infers it's random. If it's a threshold sender, it infers its threshold accordingly.
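The receiver's inference over sender types can be illustrated with a plain Bayesian update. This is a heavily simplified stand-in, not the IPOMDP computation from the talk: the threshold sender's likelihood here is a made-up softmax over offers that leave her positive utility (ignoring her bounds), offers sit on a hypothetical grid of tenths, and all names are illustrative.

```python
import math

OFFERS = range(11)  # offers in tenths of the pot
# Sender types: None marks the random sender, integers are eta in tenths.
TYPES = {"random": None, "threshold 0.1": 1, "threshold 0.5": 5}

def offer_likelihood(offer, eta_tenths, temperature=0.2):
    if eta_tenths is None:
        return 1.0 / len(OFFERS)  # random sender: uniform over the grid
    # Assumption: a threshold sender only makes offers with positive utility.
    candidates = [o for o in OFFERS if 10 - o - eta_tenths > 0]
    if offer not in candidates:
        return 0.0
    # Softmax over utility (10 - o - eta)/10; the eta term cancels out.
    weights = {o: math.exp(-o / (10 * temperature)) for o in candidates}
    return weights[offer] / sum(weights.values())

def update_belief(belief, offer):
    # One step of Bayes' rule over the sender types.
    unnorm = {t: p * offer_likelihood(offer, TYPES[t]) for t, p in belief.items()}
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}

def infer(offers):
    belief = {t: 1.0 / len(TYPES) for t in TYPES}  # uniform prior
    for offer in offers:
        belief = update_belief(belief, offer)
    return belief

if __name__ == "__main__":
    print(infer([9, 2, 7]))  # erratic, generous offers point to the random type
    print(infer([1, 0, 1]))  # consistently low offers point to a threshold type
```

Even in this toy version, a few high offers are enough to push the belief towards the random type, which is the lever the smarter sender exploits.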
But now, if we have a smarter sender-- so now she's on level 1, playing with a level 0 receiver-- now we see exactly what I mentioned before. So now here, again, we still have the random sender that makes no difference.
But now if we have the smarter sender who's the threshold sender, what she does to start with is offer a lot of money. So she offers a very large amount of money in this case. So that's costly to the sender because, of course, she'd like to offer less money, particularly in the case of these threshold senders.
But because the receiver has too low a level of mentalization to realize that he's being manipulated in this case, that convinces the receiver that the sender is actually a random sender. And what do you do with a random sender? Accept any old rubbish that they offer, and that's exactly what happens.
So now the sender can, in the end, offer less money and still have it accepted by the receiver because they've convinced the receiver that they're random. And you can see that happening here, whereby the beliefs of the receiver are then-- that in each of these cases, even if it's a threshold sender of 0.1 or 0.5, this receiver has then been fooled by the sender to think that it's a random sender, and therefore, the job is therefore to accept everything in these cases.
So if we look at the consequence, then, in terms of the average offer, for instance-- so here, you're seeing the average first offer by the sender. Of course, if it's a random sender, the average first offer is 50%. There's no smarts there. And what you see is if the sender is level minus 1, then the sender sends less money because the sender is not trying to fool the receiver, doesn't understand that, whereas if the sender is a level 1 sender, so a depth of mentalization 1 sender, it sends more money to try and convince the receiver that it's random instead.
One consequence now is if you have a smarter receiver-- so now we have a level 2 receiver instead of a level 0 receiver-- then that receiver looks at information-- so here, this is actually where the sender really was random, but at what point does the receiver become suitably convinced that it was a random sender? How many trials does it take? So if it's a simple receiver, then they decide quickly that it's a random sender. If it's a smarter receiver, then the receiver is more skeptical about the potential evil sender, and therefore, it takes more trials to agree to the fact that the sender really is the random sender.
And, of course, that has a cost to both the sender and the receiver. So here, we're now asking how much reward you get if you have a smart or a dumb receiver playing with a threshold sender in this case, like a simple threshold agent. And you see that if the receiver is level 0, so the receiver is not building this sophisticated model of the sender-- in fact, they actually do rather better than they do if you have a smarter receiver because here, you have a mismatch in the level. Remember, the receiver at level 2 is thinking they're playing with a sender of level 1. In fact, here, they're playing with a sender of level minus 1. And that actually has a cost to the sender.
So, in some sense, we can see this play out again. So here, we ask what happens if you have a receiver who's either a level 0 receiver or a level 2 receiver playing with a sender, in this case, who's a level minus 1, who's random? So here, it takes the-- here, you're seeing the inference happen about the receiver's beliefs about the sender, so whether the sender is a threshold sender or a random sender. So here, in the case, it truly is a random sender. You can see that both receivers are making inferences about this random sender, but it's a bit slower for the level 2 receiver because they're a bit more skeptical, as I mentioned.
When you play with the threshold agents, here, it's obviously harder to make inferences about the threshold agents because the threshold agents have these other characteristics, these level 1 threshold agents. You can see that, then, it takes longer for both of them. And, in fact, here, if you're a level 0 receiver, you're completely fooled by the threshold agent, who's a level 1 threshold agent, for the reasons that we talked about before. The level 2 is not fooled, but it still takes a long time to be convinced because, of course, there's this attempted manipulation happening.
But the consequences of being-- essentially, the consequences of being too smart-- so what you're seeing here is the acceptance probability of this level 2 receiver. And, oddly enough, you might think, it actually accepts all the stuff that comes from the sender no matter what. And what's happening is a bit complicated. But if you think about it, the level 2 receiver thinks that it's playing with a level 1 sender. That's how it goes. And the level 1 sender, it thinks, is playing against a level 0 receiver. So we have this planning that this level 2 is doing about the level 1. It thinks about the thinking that the level 1 is doing about the level 0.
So what does the level 1 think? Well, the level 1 sender thinks that she can fool the level 0 receiver. That's their job to do that. And that means that when the level 0 receiver rejects, the level 1 sender essentially ignores those rejections because it's trying to pretend that it's one of these random agents. And so, therefore, it's going to ignore the rejection just like a random agent does, which means that the level 2 receiver doesn't think that it's worth trying to manipulate the level 1 sender because the level 1 sender is unmanipulable-- nothing they can do.
And therefore, the level 2 receiver just accepts anything that happens. So here's basically a penalty of being too smart in this wrong way. So by not thinking about the fact that it might be a level minus 1 sender instead, the level 2 receiver actually makes mistakes, if you like, and therefore just gives up the ghost and says, oh-- basically, it's too skeptical essentially about what happens. They think there's nothing I can do in this circumstance and, therefore, gets manipulated out of money by this actually not-so-smart sender.
So we can see that here. Here, we have a level minus 1 sender playing with a level 2 receiver. So we have a really smart receiver, apparently, but, in fact, it's basically too smart for its own good. You can see that, in the end, with the threshold sender, because the level 2 receiver doesn't try to manipulate what it thinks of as the level 1, it doesn't actually make any progress. And here, in fact, therefore, it actually loses money relative to the level 0 receiver, which actually understands what's going on.
So here, you can see the amount of money they get if you play against a random agent-- of course, there's nothing. It doesn't make any difference because it doesn't try to manipulate it. If you play against these threshold agents-- in fact, the level 2 receiver does worse than the level 0 receiver. So, in some sense, this is a penalty of being too smart. There's almost a penalty that comes along here.
Now, of course, we could do other things. So here, we have a very simple manipulation model going on. So we could imagine the agent could treat this depth of mentalization as an intentional variable. So maybe the agent could say, OK, I don't have to stick at being a level 2. I could pretend to be a level 1. I could pretend to be a level 0. I could update my own level in order to fool the other subject too, and that's certainly something you could do as well.
And then the other thing you could do is to convince the level 1-- so here, we have a very simple policy that the level 0 receiver has, and if we can convince the level 1 sender that the level 0 receiver had a different policy, then, of course, we could do that. So those are the mechanisms themselves.
So that's one aspect of being too smart. Now there's another thing that might happen with being not smart enough. So here, we have this level 0 who might like to try and protect itself against the level 1-- so level 0 receiver would like to protect itself against the level 1 sender.
But remember, it can't build a model of what the level 1 sender is like. It doesn't have a conception of this more complicated entity. A bit like in the p-Beauty game, those of you who might have gone for 33 couldn't have a conception that other people might have gone for 22 in order to try and exploit you, essentially, because they don't have that level of reasoning.
So what can you do to protect yourself in these circumstances? So here, again, we're seeing the same belief dynamics, if you have this very simple dynamics where the level 0 receiver is essentially being fooled by this level 1 sender to think that he's playing with a random sender instead. You can see that fooling happen here.
Well, so one thing you could do is even though you don't have a model of what they are, you do have a model of what they should be like. So you have an idea about what you expect senders to be. It's a model-based reinforcement learning mechanism, actually, running underneath this. And so you could use that model to ask where that gets violated.
Now, oddly, if you have in mind that the sender might be random, then, of course, every sequence of bids has the same likelihood under a random sender. Whether the sender gives you 1's all the time or 10's all the time, each individual sequence has an equal likelihood.
So we can use other statistics of what the sender might be like in order to avoid getting exploited. So, for instance, here, you might monitor the average that you've received from the sender and say that if that average gets too much less than 0.5, then it couldn't have been a random sender. And that's something which you can monitor if you have a model of what the sender might be like. And so here, we can imagine monitoring that. And then you have some threshold to do with the variance or the standard deviation of what you expect and say that when the sender violates that, the receiver can say, even though I can't understand what the sender is doing, I can understand that I've not got the money I'd expect if it really was a random sender, and so you could refuse to cooperate.
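The running-average check can be made concrete with a small sketch. It assumes the random sender draws offers uniformly on [0, 1], so each offer has mean 0.5 and variance 1/12; the function name and the two-standard-deviation cutoff are illustrative choices, not values from the talk.

```python
import math

def mean_offer_alarm(offers, n_sigma=2.0):
    """Flag offers whose average is implausibly low for a uniform random sender."""
    n = len(offers)
    if n == 0:
        return False
    mean = sum(offers) / n
    # The mean of n uniform[0, 1] draws has standard deviation sqrt(1 / (12 n)).
    std_of_mean = math.sqrt(1.0 / (12.0 * n))
    return mean < 0.5 - n_sigma * std_of_mean

if __name__ == "__main__":
    # A generous opening offer followed by a run of stingy ones.
    print(mean_offer_alarm([0.9]))
    print(mean_offer_alarm([0.9, 0.1, 0.1, 0.1, 0.1, 0.1]))
```

Because the tolerance shrinks as more offers arrive, a sender who opens generously and then offers very little will eventually trip the alarm, even though no single sequence is less likely than any other under the random model.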
And then, also, so that's another thing you could do is look for typicality, look for typical sets, essentially, of responses that the sender might have. And if you see violations of typicality, then you think, well, again, that's something which is not consistent with my model of what the sender should be, and use that to prevent yourself from being exploited.
And then, what you need to be able to do, and that's what we haven't really worked out, I think, in the optimal way yet, is to make a credible threat, essentially, and say, OK, if you're exploiting me, I need to have a threat to say, OK, well, I'm going to do something which might harm me but is also definitely going to harm you in an interaction, so that you then have an incentive not to awaken the monster, if you like, and therefore not to engage with that process, so that we can retain cooperation.
So we imagine we have this X mechanism which detects violations of the model itself. And we have a policy associated with that mechanism, which says, in this case, for instance, just stop cooperating altogether. And if you know that I have that potential in my policy, if that's something that I might do, then you have an incentive not to awaken that possibility.
And therefore, you're going to cooperate more. So even though I can't model you, I'm not smarter than you, I'm actually dumber than you, nevertheless, I can still protect myself against the manipulation that you would otherwise do by essentially threatening to throw my toys out of the pram and, therefore, your toys out of the pram too. So for those of you who know about zero-determinant strategies, it has some similarity with that process.
And so here, what you're seeing happening is what happens with the regular IPOMDP, where you're too dumb, and the XIPOMDP, where you have this threat. And if we look at what happens in terms of the payout ratio-- so here, we're now measuring the ratio that the sender gets compared to the receiver. So, of course, the sender, in the end, usually gets more money here because the sender is in charge.
So here, if you just have the IPOMDP, you have this smarter sender. The sender gets 3 times more money than the receiver does. If we now have this X mechanism, we now have this credible threat associated with the receiver, then that relative payout ratio goes much closer to 1, which means the receiver has successfully avoided this exploitation by the sender. If you have an unaware sender-- so that means the sender doesn't know, it's no longer a credible threat if the sender doesn't have in mind the fact that the receiver might do these nasty things.
Then the ratio-- although the ratio goes down because, essentially, the receiver just refuses to play ball, you see that, basically, both parties suffer dramatically. So here, the cumulative rewards for both the sender and the receiver are lower in this third case, where you have an unaware sender. So here, this credible threat has to be credible. Both parties have to know about that threat, this mutually-assured destruction essentially, and that's what keeps the peace, but also keeps, in this case, the receiver doing reasonably well, even though the sender is this smarter sender.
So I just have one more thing I want-- one more piece of work that I want to talk about before I stop. So the interim summary-- so here, we've gone from inverse reinforcement learning-- so here, the receiver can build a model of the sender. The sender can build a model of the receiver. But those models can get hacked, basically. And we're interested in trying to say, can we avoid the consequences of being hacked and, therefore, protect ourselves? Can we make threats that keep our partners in line.
In something like the iterated ultimatum game, there's an asymmetry. The sender is sort of in charge because the sender gets to choose how much to do. And that's going to have some issues. There's, obviously, a very critical dependence on this minus 1 strategy-- what is the bottom level strategy. So here, we have these three simple strategies for the sender. But again, if you had different strategies in that sender, things would be difficult.
There's a danger of being too smart. There's a paranoia, if you like, that takes over, where you're just overworried-- the sender could be overworried, or the receiver could be overworried about the sender, and, therefore, not cooperate early enough. And, therefore, they suffer the costs of paranoia itself.
And there's also this difficulty of being insufficiently smart. You could be manipulated. And for that, we needed to detect manipulation, which may not be able to use likelihood methods, as I mentioned for the random sender. And then we need a policy.
We call this X policy, which says what are you going to do if you're being manipulated? And what we haven't done-- I think, for instance, in this X policy, what you'd like to have is a way of expressing a moderate threat, say, to signal to the-- the receiver would like to signal to the sender that you're irritating me. And then that will then be enough to keep the sender in line, essentially.
So to look at that, actually, we want to return to some earlier work that we had done together with Andreas and [INAUDIBLE] Read Montague, looking at a slightly different multiround game called the Multi-Round Trust Game, which has basic elements of both the dictator game and the ultimatum game.
So in this game, again, you have two players, an investor and a trustee. And the rules of the game are the investor gets $20 from the experimenter and can choose how much he wants to invest. Let's say he keeps $10 for himself, invests $10. Then, whatever he invests gets trebled by the experimenter.
Then the trustee here would be sitting, say, with $30, and then the trustee can choose how much to send back to the investor. So here, at that point, the trustee is playing a dictator who just has money-- can choose how much to send back. But because we're going to play this 10 rounds in a row, where the trustee is always the trustee and the investor is always the investor, the trustee has an incentive to return some money to the investor to try and persuade the investor to carry on investing.
So, again, the investor is in charge. The investor could walk away with $200-- so 10 rounds of $20. But that's not efficient. The experimenter is quite happy to make that into $600 if they can only share. But from a Nash equilibrium point of view, of course, in the last round, the trustee never has an incentive to return anything to the investor, which means that the investor has no incentive to send anything to the trustee.
But that means that therefore the last round is null. So that means, therefore, we can now move to the ninth round, where that's going to be null too. So it's one of these games that telegraphs, that collapses right from the end to the beginning. And so, in that case, it would say, from a Nash equilibrium point of view, you should never invest. But that's actually not what happens in practice. And Read has now run this with very many people.
So how are we going to model this? So we might start with a utility function here, for the investor, which is just a linear utility and how much money they make. So here, this is just 20 minus whatever they invest plus whatever the trustee is good enough to return. And then the trustee gets three times what the investor invests minus what the trustee gives back to the investor. So that's the linear utility. So that's what, then, from a Nash equilibrium point of view, would crumble right from the end.
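The linear payoffs for one round of the trust game, as described above, can be written out directly. The constants come from the talk ($20 endowment, investment trebled, 10 rounds); the function name and the example amounts are illustrative.

```python
ENDOWMENT = 20   # dollars the investor receives each round
MULTIPLIER = 3   # the experimenter trebles whatever is invested
ROUNDS = 10

def round_payoffs(invested, returned):
    """Return (investor, trustee) linear payoffs for a single round."""
    if not 0 <= invested <= ENDOWMENT:
        raise ValueError("investment must lie between 0 and the endowment")
    if not 0 <= returned <= MULTIPLIER * invested:
        raise ValueError("trustee cannot return more than she holds")
    investor = ENDOWMENT - invested + returned
    trustee = MULTIPLIER * invested - returned
    return investor, trustee

if __name__ == "__main__":
    print(round_payoffs(10, 15))                           # invest $10, get $15 back
    print(ROUNDS * sum(round_payoffs(0, 0)))               # never investing
    print(ROUNDS * sum(round_payoffs(ENDOWMENT, 30)))      # always investing everything
```

Never investing leaves $200 in total over the 10 rounds, all with the investor, while investing everything every round makes $600 available to share, which is the efficiency gap the talk refers to.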
So what we imagine happening, the reason why it doesn't crumble, is that maybe the parties are guilty. So this idea about guilt-- so maybe the investor is going to be grumpy to some extent if the investor ends up with more money than the trustee. So think you've done this 10 rounds in a row.
The investor, the experimenter, was so generous to you, and you walked away with $200. You'd feel a bit guilty, exploiting the trustee in that way. And similarly, the trustee might be a bit guilty relative to the investor and think, OK, well, if I keep all this money, then that's bad news too.
And it's this guilt, for instance, of the trustee which then allows the investor to have some assurance. If I knew that you were guilty, I could safely invest some money with you because I'd think you would be too guilty not to return any to me. So this guilt factor, this guilt parameter, this alpha of T, is something which I can use to establish cooperation in this game. So this utility function comes with-- sometimes known as the [INAUDIBLE] utility, one component of it, the component that's relevant for this game. And that's what inspires cooperation in this game.
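To see how a guilt parameter can sustain returns, here is a one-round sketch. The functional form is an assumption made for illustration: the trustee's utility is her own payoff minus alpha_t times the amount by which she ends up ahead of the investor. The talk only says that guilt attaches to ending up with more than the other party, so the exact form, the names, and the numbers below are hypothetical.

```python
ENDOWMENT = 20
MULTIPLIER = 3

def trustee_utility(invested, returned, alpha_t):
    own = MULTIPLIER * invested - returned
    other = ENDOWMENT - invested + returned
    # Assumed guilt term: a penalty for ending up ahead of the investor.
    return own - alpha_t * max(own - other, 0)

def best_return(invested, alpha_t):
    # Myopic one-round choice over whole-dollar returns.
    options = range(MULTIPLIER * invested + 1)
    return max(options, key=lambda r: trustee_utility(invested, r, alpha_t))

if __name__ == "__main__":
    print(best_return(10, alpha_t=0.0))  # a guilt-free trustee
    print(best_return(10, alpha_t=0.8))  # a guilty trustee
```

Under this assumed form, a guilt-free trustee keeps everything, while one with alpha_t above 0.5 returns enough to equalise the two payoffs, which is the kind of assurance the investor is relying on.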
And what's interesting in the IPOMDP setting is that if you think-- I don't know how guilty. As an investor, I don't know how guilty you are. So if I were, then, in a regular POMDP, what I would do is probe. I might say I'm going to invest a minimum amount of money, $1, see what happens to that $1, and if I can then work out how guilty you are from that small investment, then I can afford to exploit the interaction in the future and make more money.
But, of course, if I just send you $1-- there you are, the trustee, thinking here he is, sending one measly dollar when he had all these $20 he could have sent instead, that means I'm setting some bits in you about what you think-- how guilty you think I am too. And in these interactive POMDPs, it's this interaction, it's the fact that you're working with these intentional agents that make for these rich interactions amongst these different parties.
So the way that we'd use it-- I [? can't ?] see this. It may be a bit too dim. The way that we'd use this was to work on people who have borderline personality disorder. So this is a very unfortunate psychiatric disorder in which people who suffer from it seem relatively normal in normal interactions, but under circumstances of stress or if they're threatened, you see that the interaction breaks down in a very dramatic way. And so that leads to violence and anger. In fact, Read works with a lot of people who are on remand, or we have data from people who are on remand in the English prison service-- so basically people who've been either convicted or in the probation service in the process of being released.
So what we'd hoped to do was to use this as a-- by playing with people with borderline personality disorder either as the trustees or the investor, to see if the interaction changed when-- so, for instance, imagine you have a borderline personality person who is the trustee. Maybe the way the investor interacts with them would be different. And therefore, it would always be like a way of understanding something about-- almost like a signature, like a canary of the nature of the interaction. Even though you have a healthy control investor playing either with a trustee who's with borderline personality disorder or a healthy control trustee, there would be a difference in the investment pattern. And that's what they found.
So here, you're seeing what the investment is like with a-- where you have a healthy control trustee and investor, or a healthy control investor and a borderline personality trustee. You can see, in the end, if you just look on average, you see this worse interaction.
Another way of looking at that is-- we then built a very simple IPOMDP model of how this interaction might work. And you can see that our model did a reasonable job of what happened when you had a healthy control investor and a healthy control trustee. But it did a relatively poor job when you had a healthy control investor but a borderline personality trustee. And it basically overexpected the interaction. So our model overpredicted how much the investment would be. So basically, we hadn't captured the essence of what's happening in this BPD.
And another way of looking at that is in a particular interaction, you could imagine-- so here's what our model assumed about what the process of the investment and the return would be. And what you can see is that there are cases where the investment crashed to 0-- in this case, in round 4, and then, actually, again in round 7. It was then outside the bounds of what our model expected, which is shown by this blue fringe. And, of course, if the investor doesn't invest anything, the trustee doesn't have any money to return, so he couldn't return anything.
So we somehow missed, in our model, this idea that there could be a collapse of cooperation, essentially, in this process. And what we did was-- this is where we first had this notion of this X mechanism, this notion of irritation. And so what we did is that, essentially, this is a case where the trustee has been unable to coax the investor to invest.
And so what we imagined is we had a trait, irritability, and a state of irritation, such that if the investor got back less from the trustee than he expected, then his state of irritation would go up. And if he got back more than he expected, his state of irritation would go down. And how much it would go up and go down would be a function of the degree of the irritability of the other player.
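The trait and state distinction can be sketched as a simple update rule. This is not the mechanism from the underlying work, which the talk skips over; it only assumes that irritation lives in [0, 1] and moves by a fixed step equal to the irritability trait, with a hypothetical function name.

```python
def update_irritation(irritation, received, expected, irritability):
    """Move the irritation state up or down by a step set by the irritability trait."""
    if received < expected:
        irritation += irritability   # got back less than expected: more irritated
    elif received > expected:
        irritation -= irritability   # got back more than expected: calms down
    # Assumption: the state is clipped to the range [0, 1].
    return min(1.0, max(0.0, irritation))

if __name__ == "__main__":
    state = 0.0
    state = update_irritation(state, received=5, expected=10, irritability=0.3)
    print(state)  # irritation rises after a disappointing return
    state = update_irritation(state, received=12, expected=10, irritability=0.3)
    print(state)  # and falls again after a generous one
```

A player with irritability 0 never changes state, so in this sketch the trait controls how quickly a run of disappointing returns can tip the interaction over.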
So, in the interest of time, I won't go through the entire mechanism we had. But just to show you that what happens is that we have-- so here, what this gold-- what I've shown you is two interactions where we program in the starting two interactions here, where we have the investor, in this instance, starts out-- gives a 50/50, and then invests everything in the second round. And here, the trustee gives less back in the second round than they did in the first round as a proportion. So we program that in, and then said what happens in subsequent interactions as a function of whether you have this idea about irritability or not?
And here, if you have an irritable investor, so an investor who gets irritated, but here, you have a trustee who understands the possibility of irritation, then what the trustee does is give a lot more money back in the next round. And that, then, essentially removes the irritation. The irritation goes up because of the programmed-in defection, goes down again because the trustee was more generous, and then they carry on cooperating till the end.
But here, you have an irritable investor with a trustee who doesn't understand that. And so here, in fact, the trustee thinks, oh, well, I got even less this time. I'm going to give even less back the next time. Then the investor's irritation becomes very high, remains high throughout the rest of the interaction, and, therefore, the interaction between the two parties fails.
So now when you put that in, we can now simulate better the interaction between the investor and the trustee. So now when we have this borderline-paired investor, we captured this idea that the investor might crash in the degree of investment because we have this characterization of irritability. And then we can then sort out our trustees by this mechanism.
So let me just sum up. So there are really rich complexities of these intentional agents. So we have this theory of mind. So we build this model-- we have this model that we have of our opponents, essentially, and we think of this very much as parallel to the same metacognition we apply to ourselves. So this is like an other person metacognition. But some of the same ways that we think about metacognition for ourselves, I think, are very much in sync with this. And so that's a route that we're exploring at the moment.
We talked about some components you might have, like guilt and envy, as a way of thinking about what might happen. That then leads to threats and retaliation and protection and so forth-- so a very richly structured interaction that comes even from the incredibly simple components that we talked about.
We needed to have this anomaly detection, this X mechanism. So how you protect yourself against a smarter opponent is by having these credible threats to keep your partners in line. You can have what amounts to a paranoia from overestimation of the theory of mind. I think if I have too high a level, I interpret everything you do in the terms of I'm always thinking about what the angle you have of the interactions you have. That then leads to bad-quality interactions too.
In the case of our personality disorder, this BPD, then, in this instance, if you actually-- if the person with BPD is actually the investor, their interaction looks perfectly normal. And the reason we think for that is, there, the investor is in charge. So the trustee, who's the healthy control trustee, is never going to provoke the irritability or the irritation of that investor. And so, again, who's in charge really matters in these cases.
Of course, you can only reveal some behaviors if you have an appropriate partner. So one of the troubles with these interactive games is if you have a partner who doesn't inspire you to be irritated, then you're not going to see the effect of irritation or irritability. And likewise, in many of the other cases, you need an appropriate partner to reveal this. And that, then, is a big complexity for us as we think about running these games online, for instance, where you have to design partners to expose all these factors.
A technically interesting issue that I think we still need to do more work on is that we-- so in Bayes' Nash equilibria, this idea from Harsanyi, a very old idea in economics-- you have the idea that you have essentially a type of a partner, which is typically a utility type. It's their utility function. That's their notion of intentionality. So we have that in guilt.
But we also have a policy type. That's your irritation and the irritation mechanism or the X mechanism-- is a policy type rather than a utility type. So that works fine in the IPOMDP world. How that looks in the Bayes Nash world is something we have yet to really fathom. We need to think about that a little bit more.
There are huge computational complexities, I think, therefore, in these models. So I only gave you the simple picture. So we've used various planning mechanisms. So it's very nice to look at how people here, with Josh and [INAUDIBLE] and a recent paper by Tim and Shu looking at amortization, how you can then improve the complexities of doing some of these things. And so that's, I think, some of the directions that we want to follow. So thank you very much.