Common Sense Physics and Structured Representation in the Era of Deep Learning
March 3, 2021
March 2, 2021
Prof. Murray Shanahan, Imperial College London / Google DeepMind
All Captioned Videos Brains, Minds and Machines Seminar Series
The challenge of endowing computers with common sense remains one of the major obstacles to achieving the sort of general artificial intelligence envisioned by the field’s founders. A large part of human common sense pertains to the physics of the everyday world, and rests on a foundational understanding of such concepts as objects, motion, obstruction, containers, portals, support, and so on. In this talk I will discuss the challenge of common sense physics in the context of contemporary progress in deep reinforcement learning, and the question of how deep neural networks can learn representations at the required level of abstraction.
JOSH TENENBAUM: Today, we are very pleased to be hosting Murray Shanahan, who is both a Professor of Cognitive Robotics at Imperial College London, as well as a senior research scientist at DeepMind. So yeah, I've been a fan of Murray's work for quite a long time. He's one of the people who I admire, because of the way he has had his eye on the big picture questions in intelligence, and hasn't just been restricted to a single paradigm.
Earlier in his career, he primarily worked with symbolic approaches to AI. Then he discovered more continuous learning-based approaches. Most recently, he's been very interested in neurosymbolic approaches. He did some of the first neurosymbolic reinforcement learning. And he's been actively involved in creating interesting new benchmarks, challenge tasks, and domains for getting machines to learn common sense. For example, common sense physics, as you're going to hear about here.
I think we'll hear from him on his own narrative of where he's come in his work. I remember hearing from him what got him excited about new developments in deep reinforcement learning and the neural networks in AI, that led him to join DeepMind. But also why he continued to keep his academic affiliation, because he's really interested in just the big picture of all the different ways that we can think about intelligence computationally.
I think it's a really great fit for both the style and the content of the way we like to think about things in CBMM. He doesn't primarily work on brains and minds. But the way he approaches questions of machine intelligence is deeply rooted in thinking about the fundamental questions of intelligence that have arisen originally in biological brains and minds. And so it's just a perfect fit. So without further ado, Murray, give us your talk.
MURRAY SHANAHAN: Thank you very much for such a generous introduction, Josh. And thank you for the invitation, and for giving me an opportunity to talk about some of the things that are exercising me at the moment. So just a little quick slide of thanks to my colleagues at DeepMind and elsewhere.
So I'm going to work through four interrelated topics. So I'm going to begin by trying to characterize common sense, which is quite a tricky thing in itself: common sense in the sense of interest to us as AI researchers. Then I'm going to talk about transfer learning, and what I call the transfer gap between some training tasks and test tasks, which I am very interested in as the basic measure that an agent actually possesses a certain common sense concept. Then I'm going to talk a bit about architecture and representation.
So first of all, defining common sense. So the whole challenge of endowing computers with common sense has been with the field of artificial intelligence pretty much since its inception. And John McCarthy had the earliest paper on this in 1959, a paper called "Programs with Common Sense," where he characterizes it as follows.
He says, a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told, and what it already knows. And then he goes on to say, certain elementary verbal reasoning processes so simple, that they can be carried out by any non-feeble minded human, have yet to be simulated by machine programs.
And that's still the case, I think. So John McCarthy's paper inaugurated a major paradigm of AI research, that culminated in the whole expert systems way of thinking. So a very important paper. But I think it's very interesting to note the emphasis on language and reasoning in his definition there. So I'm giving you a little potted history really, of some of the landmark attempts to define it that have influenced me.
So then moving on to a paper that had enormous influence on me by Pat Hayes, the "Naive Physics Manifesto." In the "Naive Physics Manifesto," Pat Hayes says, I propose the construction of a formalization of a sizable portion of common sense knowledge about the everyday physical world, about objects, shape, space, movement, substances, solids and liquids, time, etc. Such a formalization could, for example, be a collection of assertions in a first-order logical formalism or-- And he has some other suggestions.
So this was a very, very important paper as well, which launched entire research programs, such as the whole little subfield of qualitative reasoning. And the CYC project, which those who are aficionados of the history of the field will know about. And so the paper is multifaceted. And one of the things it does is to advocate logical formalization of common sense, so trying to capture things in axioms.
In my opinion, that was flawed. Although, I was very drawn to that approach for a long time, but I decided it was flawed about 20 years ago. However, in that paper, Pat Hayes also pinpointed the foundations of common sense, in a way which I thought was brilliant and is still right. And he still has a great influence on me today.
So here's my definition. So to have common sense is to display an understanding of the principles of operation of the everyday world, in particular, the physical and social environments. And certain core concepts pertaining to everyday life have such universal application, they deserve to be called common sense. And mastery of a common sense principle or concept doesn't necessarily entail its internal representation in some language-like form, rather, it's going to be manifest in behavior.
This is a key thing that I'm interested in discussing: how is it manifest in behavior? And how can we actually test for the presence or possession of a fundamental common sense concept, such as that of object permanence, say, or similar things? Or more likely, it's going to be the lack of some aspect of common sense that will be manifest in behavior, in a failure to perform some task, a transfer task, for example, as we'll see.
So I just want to say a couple of things about language, the relevance of language here. So in contemporary discussions of common sense, many, many of them appeal to language examples, as John McCarthy's original characterization did. So this is an old paper, but it's still representative of the kind of thing that people think about, even, for example, in the GPT-3 paper very recently. So we have similar examples.
So this is the kind of common sense fact that people think might be of interest. So to open a door, you must usually first turn the doorknob. So if you are trying to do some story understanding, or next word prediction, or something. And somebody approaches a door, then you may well see your GPT-3 predicting that the person is going to turn the doorknob next.
But to really understand this, you have to know what a doorknob is. And you have to know what turning is. And to understand what a doorknob is, you have to know what an object is. And to understand what turning is, you have to understand motion and space. And to my mind, these are the most fundamental things. These underlie any manifestation of common sense in language.
So I think that when we find failures of common sense at the language level, as I already mentioned, we even see with GPT-3, which does amazing things. But when it comes to some of the common sense understanding tests, it doesn't do so well. And I think those failures of common sense at the language level, they're symptoms of shortcomings at a deeper level. And even success at testing common sense, tests of common sense at the language level, those successes don't guarantee that this deeper level of understanding is present.
Now my view is that it's essential to develop common sense in our agents that we want to build as AI researchers. To develop common sense from the ground, foundations first, and then core domains, and then higher capacities like language. So the common sense priors that I'm interested in, capture the structure of our world at the deepest, deepest level.
So I'm not interested in language right now, is what I'm trying to say. Although of course, we want to build artificial general intelligence, AGI. That ultimately, of course, we want to build. We want to endow our AI systems with linguistic capabilities. But to my mind, there's something that comes first. And these common sense foundations come first.
So this foundational common sense, I would claim, is conceptually prior to language. And animals have it to a degree, and I'm going to talk about that a little bit. And infants have it to a degree, as studied by a developmental psychologist such as Liz Spelke and her notion of core knowledge.
So to my mind, these foundational common sense concepts, I like to divide them into two domains, the physical domain and the social domain. And in the physical domain, they're just absolutely fundamental things. They're the kinds of things that are so obvious, that we don't think about the fact that we have an actually deep understanding of these things. So things like objects, motion, paths, obstacles, portals, containers, those sorts of things. Surfaces, support, and things like that.
And in the social domain, I think we have some equally foundational concepts such as being with another person or agent, possessing something, giving or taking something, cooperating or competing. Those sorts of things are fundamental. But I'm mainly going to be talking about physics, the physical domain, rather than the social domain here.
And so to my mind, these common sense foundations underpin human general intelligence. Moreover, I would say that a small repertoire-- and this echoes the things that Pat Hayes said in the "Naive Physics Manifesto." A small repertoire of common sense concepts is a sufficient foundation upon which to build a hierarchy of abstraction that encompasses all human understanding.
And parenthetically, I'm also very influenced by the work of people like George Lakoff, Metaphors We Live By. And the idea that these sorts of foundational concepts underpin our ability to construct abstractions at a much, much higher level.
So what is it? So I'm using these dodgy words here, like understand. Now, understand is a difficult word. So when I talk about an agent. Or we might talk about an animal understands the concept of an object, or a container, or something like that. What does it mean to understand? To understand a foundational common sense concept? We really need to operationalize this idea of understanding. And that's where I think methodological principles from the empirical cognitive sciences can be very, very useful.
And in particular from animal cognition, I've been very influenced by animal cognition. As of course, the interesting thing about animals is that they don't have language in any human sense. So that's one thing. And then the other question, of course, is how can we endow our AI agents-- and the agents I'm interested in are reinforcement learning agents. How can we endow these agents with this understanding?
And basically, I talk about a two-pronged attack on that question of how we can endow our agents with an understanding of foundational common sense. I think it's partly to do with having the right curricula of tasks, and partly to do with having the right architecture. So I'll discuss both those aspects a bit.
So first of all, so going back to this business of operationalizing this notion of understanding, and importing methodological ideas from, say, animal cognition. So I'm just going to work through that idea, with some illustrations of some of the work that I've been doing at DeepMind recently.
So the really interesting thing is that with the advent of deep reinforcement learning in 3D environments, then we can actually put agents, our AI agents, in situations that are very much reminiscent of the daily lives of animals, humans and other animals, particularly other animals. And that means to say that we can import experimental paradigms and insights from animal cognition. So we can actually construct in virtual reality an analog of an experimental setup, that we find in the animal cognition literature.
So in the animal cognition literature, suppose animal cognition researchers design some apparatus, whose purpose is to test whether an animal understands the concept of connection. The connection between an object and something that it's pushing against, for example. Or the concept of support, holding something up against gravity.
So an animal cognition researcher might ask the question, does this animal, does this particular species understand the concept of supports, say, or the concept of connection? And design an apparatus, whose purpose is to distinguish an associative account of an animal solving a problem, versus a cognitive account.
So what you would like to isolate is a task that the animal can only solve. And we can only account for its being able to solve that problem by appealing to cognition. And in particular, to its understanding some concepts such as support. So we can do this thing with deep reinforcement learning.
So I'm just going to show you a little video of an animal example. So this is a fairly recent animal tool-use experiment. And it's actually a tool construction challenge. So the crow-- In the transparent tube there, which is inaccessible, there is a food item. And the crow has to construct a long tool in order to slide the thing out, and thereby, gets the food.
Now of course, we know that this experiment, or this little video, is in the context of something that's been published in a peer-reviewed paper, as an example of animal cognition research. So we know that certain methodological questions have been addressed. But suppose we were just to see that video on its own, as when you find a video on YouTube of some animal doing some clever thing.
And of course, as a scientist your first question should always be, ah, but how is that animal doing that thing? Has it been just trained very, very carefully? Has it been trained through some behavior shaping to do that fancy thing? Or is it a spontaneous activity in the face of a novel situation that displays some genuine cognitive capability?
So that's, of course, the question that we should always be asking when we see those videos, whether they're fun or not. If the scientist has asked those questions? And of course, we should be asking exactly the same questions about deep reinforcement learning agents.
And let me just show you. So here's one of our deep reinforcement learning agents at DeepMind. And I'll just let you watch it, and I'll play again in a second. So what's going on here? So what you're seeing is the avatar. This avatar moves around the table. And this avatar gets rewards for consuming green apples, these green spheres. And in this particular case, there's a green sphere in the box. And the box is on that table, but it's out of reach.
Now, it's important to understand a little bit about how this avatar works. So the avatar has an invisible tractor beam, which is of a certain length. And this invisible tractor beam allows it to pick up objects and move them around. Now in this particular setup, the green apple, the reward item, is inside a box. And the box is on the table. And the box is out of reach of the tractor beam.
So it can't get it just by trying with its tractor beam. It's out of reach. But there's a stick nearby or a plank nearby. And he can pick up the plank, and he can use the plank as a tool to slide the box off, and thereby get the reward item. So it's very analogous to the animal experiment that you saw just a moment ago. So let me just play it again, because it's fun to watch these things. Off it goes. Looking around, finds the plank, knocks the box off, and the reward item is inside the box. The box ends up on its head, which is just quite funny.
Now we should, of course, be asking ourselves, exactly how was this agent trained? If this agent was trained to solve exactly this task, with exactly that setup, it really is no achievement at all. But actually, that's unfair. It's some achievement, but it's not the achievement that we're really interested in. And moreover, as you can tell from the title here, this is a heavily cherry picked example. Anyway, I'll discuss all of this in a little bit more detail in one second. But bear in mind that of course, we should be asking that question, how is it trained?
So this is the point I'm just making. So isolated examples of behavior are not really indicative of cognitive capabilities. We need to know exactly how the agent was trained. And in particular, the larger the transfer gap between training tasks and test tasks, the more impressed we are with test time success. And good experiments involve held-out transfer tests that are passed if the agent has learned underlying causal structure in training, as in this famous trap tube experiment by Amanda Seed and colleagues from some time back.
Or not just underlying causal structure, but if it has perhaps learned or has an understanding of some relevant concept, foundational common sense concepts such as connection, or object permanence, and so on. So that's the thing that we really should be looking for. Now I'm going to talk about the training that was involved in-- actually, not in the agent you've just seen. So I'm going to present two experiments.
So in the first experiment, we have three training conditions. So in one training condition, the apple is behind a panel. And the agent, the avatar, is here. And the second task, the apple is in a box, and the agent is somewhere around. And the third one, the apple is under a table, and the agent is around. So these are the three training tasks. And then we have four held-out tasks, which are the transfer tests. They are quite modest transfer tests, so a small transfer gap.
So first of all, we have a case where the apple is inside the box and behind a panel. It's under the table and behind a panel. Inside a box under the table. Or inside the box under the table, and behind the panel. Now here's an absolutely key thing-- so let me just go back to the training. So in the case of these--
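One way to read the design of this first experiment is that each training task imposes a single obstruction, while each held-out task combines two or more of them. The Python sketch below enumerates that split under this reading. The condition names are illustrative labels, not identifiers from the actual DeepMind environment.

```python
from itertools import combinations

# Illustrative labels for the three obstructions described in the talk.
CONDITIONS = ("behind_panel", "in_box", "under_table")


def training_tasks():
    """Each training task uses exactly one obstruction."""
    return [frozenset([c]) for c in CONDITIONS]


def held_out_tasks():
    """Each held-out task combines two or more obstructions."""
    return [
        frozenset(combo)
        for size in range(2, len(CONDITIONS) + 1)
        for combo in combinations(CONDITIONS, size)
    ]


if __name__ == "__main__":
    for task in held_out_tasks():
        print(sorted(task))
```

Representing tasks as sets of conditions makes it easy to confirm that no held-out combination was ever presented during training.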
So we're going to train our agents on these three tasks. But it's essential to do anything interesting to have procedural variation. So this is a term which I've never really heard until I joined DeepMind, and it comes from the gaming industry. But procedural generation basically means, just randomizing the initial configuration. So we want to randomize the initial configuration in lots of ways, to give lots of variance to the task.
So in particular for each object (the panel, the table, the box), we vary the color, the size, the aspect ratio. We vary the material and texture. The material affects its physics, by the way; some are heavier, denser than others. And the texture just affects the appearance.
We're going to vary the position and the orientation. For the room itself, we're going to vary the size of the room, the wall height, wall color, the floor color, the lighting. And we're also going to spawn the avatar in random locations. And we're going to vary the size of the apple, the reward item. So lots of procedural variation in both training and in the held-out tests.
So just a little note about this procedural variation, it's actually not so easy to do. It's a bit of an engineering exercise to do very well. And we can't just vary sizes and locations randomly, because we are interested in particular task conditions.
So for example, in the apple under the table condition, the apple has to be under the table. And in this case, not out of reach. Not too big to fit, given the height of the table. So we can't just vary the table's height randomly, and the apple's height randomly, and expect to have a meaningful task at the end of it. And the more complex the setup we have, the more careful we have to be about how these various constraints, how they interplay with this requirement for procedural variation.
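The interplay between randomization and task constraints can be sketched with simple rejection sampling: draw a random layout, then keep it only if the apple fits under the table and stays within reach. This is a minimal illustration, not the generator used for the experiments. The function name, the numeric ranges, and the `reach` parameter are all hypothetical.

```python
import random


def sample_under_table_task(rng, reach=1.0, max_tries=1000):
    """Rejection-sample a randomized 'apple under the table' layout.

    All ranges and the default `reach` are made-up values for illustration.
    """
    for _ in range(max_tries):
        table_height = rng.uniform(0.3, 1.5)      # clearance under the table top
        table_half_width = rng.uniform(0.5, 2.0)
        apple_diameter = rng.uniform(0.1, 0.8)
        # Horizontal offset of the apple from the table centre.
        apple_offset = rng.uniform(-table_half_width, table_half_width)
        # How far the apple sits in from the nearest table edge.
        apple_depth = table_half_width - abs(apple_offset)

        fits = apple_diameter < table_height      # not too big to fit
        reachable = apple_depth <= reach          # not out of reach
        if fits and reachable:
            return {
                "table_height": table_height,
                "table_half_width": table_half_width,
                "apple_diameter": apple_diameter,
                "apple_offset": apple_offset,
                "apple_depth": apple_depth,
            }
    raise RuntimeError("no valid layout found; constraints may be too tight")


if __name__ == "__main__":
    print(sample_under_table_task(random.Random(0)))
```

Rejection sampling is easy to write but wastes draws as constraints multiply, which is one reason more complex setups need more careful generators.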
So how well do agents do? I should point out that on the training tasks, all of the agents get to pretty nearly 100% success. So what we're interested in is these held-out tasks where there's a modest transfer gap. So we find that basically putting things behind the wall, the agent isn't really bothered by that. It gets pretty nearly 100% of those, even though it hasn't seen that combination of things being behind the wall before.
But with the example of the reward item being inside a box and under the table, it doesn't do so well. Although, if we give it more time, it's going to do better. It's really not doing too great here. So only 34.7%, roughly 35%, of the time is it successfully getting the apple, if it's inside the box and under the table.
So one lesson that I think is really important to take on board with this work, is to always look at the videos. And look at a lot of videos, because one of the fascinating things is to see how much variation there is in the way these agents solve these problems, or fail to solve these problems, in this case. So I'm going to just show you a couple of videos here.
So this is a success case. I hope everyone can see this. So basically, it's got hold of this box. And what it's doing-- And it gets there, we'll let it finish. They're quite painful to watch, sometimes, these things. Not great camera work, but that's because it's all automated.
And then we'll watch this one as well. So bigger table. You can see the variation in the size of the table, this is a huge table. Bear in mind, these are two of the 35% of success cases. So again, it gets it, but it's under the box. It has to flip it twice, and then chases after the thing, which is quite [INAUDIBLE]. So very good. It's succeeding, but maybe you can tell why it's succeeding. So in this particular case, it's succeeding because it's essentially learned to flip the boxes.
And in the training tasks, it's never seen a box under a table, so it's never had to pull a box out first. And so it's got no meaningful conception that there's a table in the way, so it's just trying to flip the box. So it tries to flip it, but trying to flip the box causes it to bang on the underside of the table. Now the box banging on the underside of the table, if it keeps on doing that, eventually, some of the time, it's going to just jiggle out. And then the reward item will bounce out.
So that's what happens. 35% of the time it's just getting lucky through doing the stupidest thing, which is just to try and flip the box, even though the box is under the table. So that's an interesting lesson in itself, looking at success rates doesn't always tell you exactly what's going on.
So that's experiment one, and here's a second experiment. So in the second, this is a much more complicated one. So in this second experiment, there are nine training tasks. So we have the same ones we have before, the apple is inside the box, under the table, on the table. Now in the box and under the table is a training task, not a held-out task. Similarly, in the box and on the table is a training task. And now we have a number of tool-use variations.
So in this case, the apple is out of reach, but there's a stick that it can use to knock the apple off or poke it off. Similarly here, the apple is out of reach and it's under the table, but there's a stick. And here are two variations of this, which we present statistically more often than they would occur by chance, which is hardly ever. Where the stick that is useful for knocking the reward item off, happens to be right next to the reward item, on the table or under the table here.
So this is a way of shaping the behavior of the agent, in much the way that animal trainers will shape the behavior of an animal to learn a task. So that's our set of nine training tasks. And now we have a couple of held-out conditions, but this is the interesting one.
So the interesting one is where the apple is in a box on the table, and the box is out of reach, but there's a stick here. So just to go back to the previous slide to see. You can confirm that combination does not occur in any of the training tasks. So it's never seen an apple in a box where the box is out of reach, and it needs to use a stick to get it.
But however, of course, it has encountered boxes on tables, and it's encountered objects that are out of reach, and so on. So all the elements of this challenge, it's tackled before. So how well does it do? All the training tasks, again, we can get up to very close to 100%.
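The claim here has two parts: the held-out combination never appears in training, yet every element of it does. The Python sketch below checks both, using one reading of the nine training tasks as described in the talk. The feature labels and the exact encoding of each task are assumptions made for illustration.

```python
# One reading of the nine training tasks from the talk, encoded as feature sets.
# The labels are hypothetical, not identifiers from the actual environment.
TRAINING_TASKS = [
    {"in_box"},
    {"under_table"},
    {"on_table"},
    {"in_box", "under_table"},
    {"in_box", "on_table"},
    {"on_table", "out_of_reach", "stick"},
    {"under_table", "out_of_reach", "stick"},
    {"on_table", "out_of_reach", "stick", "stick_adjacent"},
    {"under_table", "out_of_reach", "stick", "stick_adjacent"},
]

# The held-out test: apple in a box on the table, out of reach, stick available.
HELD_OUT = {"in_box", "on_table", "out_of_reach", "stick"}


def is_novel_combination(task, training):
    """True if this exact combination never occurs in training."""
    return all(task != seen for seen in training)


def elements_all_seen(task, training):
    """True if every individual feature occurs in some training task."""
    seen_features = set().union(*training)
    return task <= seen_features


if __name__ == "__main__":
    print("novel combination:", is_novel_combination(HELD_OUT, TRAINING_TASKS))
    print("all elements seen:", elements_all_seen(HELD_OUT, TRAINING_TASKS))
```

A held-out task that passes both checks tests recombination of familiar elements rather than exposure to something wholly new.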
This is another held-out condition I didn't show you the picture for. But in these two held-out conditions, it's terrible. It really does really, really badly. When you saw that first video and I said it was cherry picked, it was very much cherry picked. It was one of these 4.6% of successes.
It's interesting that when you look at that video, it has the appearance of intentionality. It seems to know what it's doing, and it doesn't seem to get it by luck. But in fact, it is pretty lucky, because most of the time it doesn't do that. It often picks up the stick and attempts to use it, sometimes on the box. But overall, it's really pretty bad.
So I would say that if we are to approach AGI, we really need agents equipped with enough foundational common sense to cross these transfer gaps. So part of the failure here is certainly due to the fact that our agents lack these common sense concepts. The agent doesn't have any kind of understanding, when it succeeds, that it succeeds because it's using the stick to push the apple off the table, to bring it within reach. So it's simply carrying out a purely reactive policy that it's learned, with no understanding of the underlying foundational common sense concepts.
So I would say that we really need to equip our agents with these foundational common sense concepts, in order to be able to cross these transfer gaps. Moreover, I would say that these transfer tests, of the sort that I just outlined, and this kind of challenge, these percentages are so low. I think it's just the challenge that we should be trying to meet in deep reinforcement learning, and in machine learning, and AI in general. So how do we close this transfer gap? So how do we address these shortcomings?
So I propose, as I mentioned at the beginning of the talk, a two-pronged attack. So first of all, I'm interested in devising suitable tasks and curricula. The experiments that we just saw, we looked at a very, very limited range of tasks really at training time. So I think that range of tasks is too small for us to possibly hope that our agents are going to acquire these deep and important foundational common sense concepts. Like solid surfaces, and connection, and support, and all these sorts of things. Motion and so on.
And so one aspect of-- one half of the approach that I want to take is by developing a large repertoire of assets. Different kinds of objects that we can drop in these environments, with all kinds of rich affordances. And to introduce large-scale procedural generation of challenges, involving this large repertoire of assets. So greatly enlarging the set of tasks that we throw at our agents.
Now it's interesting, I think that there will be some people, the fans of big data, fans of the GPT-3 type approach-- which I admire hugely-- might think that this is enough. That if we have a big enough set of tasks, that will be enough to endow our agents.
If we have enough computation, and we do enough training, and we have a big enough architecture. Simple straightforward relatively homogeneous architecture with enough tasks, [INAUDIBLE] just enough data that we'll be able to-- that our trained agents will generalize and be able to solve those challenging transfer tasks.
And that may be the case. I think that's an empirical question, whether we can realistically do things that way. But my bet is that it's not the case. And that we need not only this large repertoire of assets, and tasks, and rich curricula, and so on. But we also need to work on the architecture. So the architecture, incidentally, of the agents that you just saw is an off-the-shelf but state-of-the-art deep reinforcement learning architecture, V-MPO. So a V-MPO agent.
But the other prong of my two-pronged attack is to work on the architecture, to try and develop better architectures. And there are two aspects of this that interest me, and that I think are really important. So one is all to do with compositionality. And I'm going to discuss compositionality quite a bit in a second. And the second is having an appropriate world model. So being able to make predictions of the effects of a series of actions into the future. Josh, I'm glad I mentioned your name on one of my slides, as you introduced me.
So the work that Josh has done with intuitive physics, is very much addressing the second aspect of architecture. But I'm going to mainly talk about compositionality here. So now we think about architecture. I'm going to set aside the whole issue of the curricula and tasks, as I want to think about architectures. So I think the essential thing that we need to put into our architectures, is to enable our agents to learn essentially the right level of abstraction, abstract representations. The essential thing is compositionality.
The idea there, of course, is that we want our agents to learn about certain elements such as objects, and events, and so on. And to learn how to compose these elements via relations, to make up compound scenes. And via narratives, composing events with temporal ordering and so on to make up whole narratives. And that's just an example, of course. But essentially, I think we want our agents to be able to learn about aspects of the world, and learn how to compose those aspects of the world, and exploit the combinatorics that results in.
So the advantages of doing this-- and of course the representations that we see in classical symbolic artificial intelligence, they all have this compositionality that's at the very heart of symbolic AI. And the kinds of representations that are used in symbolic AI. In the good old days of symbolic AI that I grew up in, that was the number one argument for why symbolic representations were useful.
I don't want it to-- what's the word I'm looking for? I don't want to bring back from the dead-- There's a word I'm looking for, but it's escaped my mind. I don't want to reanimate the classical symbolic AI. What I want to do is import certain elements of it into a deep learning framework.
Exhume, that's the word I was looking for. I don't want to exhume classical symbolic artificial intelligence. But rather, I want to import certain ideas into deep learning from symbolic AI. So the essential advantages that I'm interested in are of compositionality, compositional representations, for example. That we have reusable components that we can recombine in different ways.
And it mitigates overfitting, because suppose that we learn about rabbits and we learn about cars, well, we may never have seen a certain kind of rabbit in combination with a certain kind of car before. But the fact that we've learned about those two things separately, means that we can always combine new instances of the elements. Not a great example, bear in mind, should have had a better example of what I think it is.
But there are a number of challenges to achieving compositionality in a deep learning framework. One of these challenges, a big challenge, is modularity. So a problem with the training of deep learning is that the way things are typically set up, when some particular part of the system is training, then you get gradients flowing throughout the whole system. So everything's changing.
So you don't have modular parts that train separately and independently on some aspect of the world, and then consolidate their thing. Rather what's happening is that as the network learns about something else, it will perturb all of the parameters in that first component, because of the holistic nature of backpropagation and stochastic gradient descent.
I think that it's very important to develop architectures that have a bit more modularity than that. And obviously, I'm not the only person to think of this, and there's plenty of work along these lines. And it also is important for continual learning, because the problem that I just set out is what brings about catastrophic forgetting. So when a network learns, for example, to solve some particular task, you then give it a new task. And then if you don't continue training it on the first task, then it will catastrophically forget how to solve the first task.
And if we want to have a complex curriculum of learning, then it's essential that we are able to solve this problem with continual learning. Because we want our agents to learn to solve one task, and then move on to another one, and then to another one. And then to reuse some aspects of the very first task, in order to solve some compound tasks later on. So we want to get rid of catastrophic forgetting. And that's another reason why we need modularity. So those are two of the challenges that we face.
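As a rough illustration of the forgetting problem described here, the following is a minimal sketch, not anything shown in the talk. It assumes a toy model with a single shared weight and two made-up tasks (task_a and task_b are hypothetical) that pull that weight in opposite directions. Training on the second task without revisiting the first overwrites the solution to the first.

```python
# Toy illustration only: one shared weight w, two conflicting tasks.
def train(w, data, lr=0.1, epochs=200):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of the model y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]    # hypothetical task A: y = 2x
task_b = [(x, -2.0 * x) for x in xs]   # hypothetical task B: y = -2x

w = train(0.0, task_a)
loss_a_before = mse(w, task_a)   # close to zero: task A has been learned

w = train(w, task_b)             # move on to task B, no further training on A
loss_a_after = mse(w, task_a)    # large: the single shared weight was overwritten

print(loss_a_before, loss_a_after)
```

With only one parameter there is nowhere for the first solution to be protected, which is the extreme case of the lack of modularity discussed above.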
So I'll just very briefly point out that there are different varieties of compositionality. So we can have compositionality at the feature level. So if we've experienced red spheres and blue cubes, then we ought to be able to handle red cubes, so recombining the shape features and the color features. And we can have spatial compositionality. So if we've experienced apples in boxes and sticks on tables, then we ought to be able to handle apples on tables.
And temporal compositionality. So if we've experienced picking up a box and then going through a door, then we ought to be able to handle going through a door and then picking up a box. So there are different kinds of compositionality. And ideally, we'd like our agents to be able to deal with and exploit compositionality for each of these sorts.
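The three varieties just listed can be sketched as held-out recombinations of familiar parts. This is an illustrative sketch only; the function names novel_combinations and novel_orderings are hypothetical, and the examples are the ones from the talk encoded as tuples.

```python
from itertools import permutations, product


def novel_combinations(seen):
    """Tuples built only from values seen in each position, but never seen as a whole."""
    slots = [sorted(set(values)) for values in zip(*seen)]
    return sorted(set(product(*slots)) - set(seen))


def novel_orderings(seen_sequences):
    """Reorderings of seen event sequences that were never experienced in that order."""
    seen = set(seen_sequences)
    reordered = {p for seq in seen for p in permutations(seq)}
    return sorted(reordered - seen)


# Feature level: shape and color recombine.
feature = novel_combinations([("red", "sphere"), ("blue", "cube")])

# Spatial: objects and spatial relations recombine.
spatial = novel_combinations([("apple", "in", "box"), ("stick", "on", "table")])

# Temporal: the same events in a different order.
temporal = novel_orderings([("pick up box", "go through door")])

print(feature)
print(("apple", "on", "table") in spatial)
print(temporal)
```

An agent that exploits compositionality should cope with these unseen cases, since every part of each one has been experienced separately.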
So achieving this kind of compositionality, so there are two potentially orthogonal ways of doing it. So one is at the level of neural organization. So we might want to try and implement independent specialist modules. So I talked about modularity earlier on. But modules that are actually separate collections of neurons, that maybe consolidate their weights when they've learned their thing, their specialism, and are not going to be affected by other modules learning their specialism.
So the idea there is that you're encapsulating the processing of the module, and encapsulating the training as well. And canalizing the information flow. So I think that's also another important property, you want the information to flow along channels and not to interfere too much.
So there are a number of examples in the contemporary literature of architectures that do this sort of thing. So for example, Anirudh Goyal and co-authors, their "Recurrent Independent Mechanisms" recently. Global workspace architecture, which is something that I worked on for quite a long time, is another example of that.
And incidentally, Anirudh has just with Yoshua Bengio released a paper on arXiv, I think today actually, where they describe a global workspace architecture that works in this way as well, which is very, very interesting to see. And a much older kind of idea, mixture of experts models, have this kind of property. So it's the thing people have been looking at for a while. But that's something that interests me very much.
And orthogonally, you can try to achieve compositionality with representational structure. So learning representations that are structured, in this way that mimics classical logic in a way. Where you have representations, you learn representations that are structured in terms of objects, relations, and propositions. And so that's a direction that I've explored recently in some work that was published in ICML this year. And I'll talk about that very briefly, just to wrap up in a second.
So I'm just running through, in a sense, the architectural things that interest me at the moment, and the approach that I'm interested in taking. So there are different approaches. So focusing on this challenge of learning compositional representations, so there are different approaches to that as well.
So one approach is a hybrid architecture, so where you have a combination of symbolic AI components and deep learning. So for example, you might have a neural image encoder at the front end that takes the raw images, and transforms the raw images into some symbolic representation. And then at the back end, you have a symbolic component, a much more classical AI type, that processes those symbolic representations.
So in his introduction, Josh mentioned some of the early neurosymbolic work that I did with my PhD student and colleague Marta Garnelo. And that was the approach that we took in that paper. But nowadays, I'm more interested in fully differentiable architectures. So not so much taking that hybrid approach at the moment. Although I'm very interested in work along those lines, and respect it, and so on.
But I'm more interested in fully differentiable architectures. So basically, that means having neural components throughout the architecture where everything is learned. All of the parameters of the model might be learned, of the architecture might be learned end-to-end. But you might partially pretrain some of them.
So again, that's the methodological choice. I don't have strong preferences there. Some people do, but I don't. But what I'm interested in is that all of the elements of the network, of the architecture, are neural networks and are learned. So that's currently what I favor.
But the key thing to achieve these compositional representations, is that we want representations that are sufficiently articulated, so have parts to support compositionality. And yet if you make everything end-to-end differentiable, or you make everything differentiable, not necessarily end-to-end, but where everything is learned, then you have blurred boundaries there. The boundaries are learned and blurred.
For example, objecthood is learned not engineered. And incidentally, I was very influenced by a recent book by Brian Cantwell Smith, that I think is called The Promise of Artificial Intelligence, which I highly recommend, where he discusses this difficult notion of objecthood in a very eloquent way.
Apologies for quoting myself, or quoting ourselves, with Marta Garnelo again. So this is the thing that I'm interested in. And this is the thing that I think potentially will help to achieve success on crossing the transfer gap that I characterized earlier on, which is what we need to endow our agents with this common sense.
So I would say that a truly satisfying synthesis of symbolic AI with deep learning, would give us the best of both worlds. Its representations would be grounded, learned from data with minimal priors. It would be able to learn representations comprising variables and quantifiers, as well as objects and relations. It would support arbitrarily long sequences of inference steps, using all those elements, like formal logic. But it would not be constrained by the rules of formal logic, things would be more learned, in other words. And it would be able to learn forms of inference that transcend the strictures they imply.
And of course, you're making a bit of a sacrifice when you say that, in terms of what you can prove about these things. But given an architecture that combined all these features, we claim the desired properties of data efficiency, and powerful generalization, and human interpretability would likely follow.
So that's the thing that I've been interested in. So I think I'm getting towards the end of the talk now. So I just want to give you a little survey of work along these lines really. And then just characterize, just describe a bit of my own work, my ICML paper, just very briefly.
For the particular topic of objects and relations, or of objects, so there's quite a lot of ongoing work. It's quite a hot topic at the moment. So many of the approaches use slot-based representations, with slot-based memory, where one slot represents one object. And the central challenge there is to be able to learn a component, which will transform a raw pixel input into that slot-based representation. So there's a number of really nice papers that have come out in the last couple of years that do that. For example, from DeepMind there's MONet by Chris Burgess, and Alex Lerchner, and others.
I really like this "Slot Attention" paper by Locatello et al. Then there are other systems, other models that do similar things like IODINE and GENESIS. None of those models really deal well, I think, with object permanence. Or at least with the idea that an object can disappear and reappear later, and you want to re-identify it as being the same object.
So one of my colleagues at DeepMind, Toni Creswell, has been working on a component which will do that called AlignNet. And we've got some new work coming up using that. So that's objects. And then what about relations? So again, there's been a number of really nice recent papers that have tackled the question of how-- obviously, I'm not presenting a comprehensive survey, this is very selective. So apologies to anybody in the audience whose work I've omitted.
But there have been some great papers. For example, in 2017 there was a great paper by DeepMind colleagues Adam Santoro and others on relation networks. Then there's people-- that was before or around the time of the transformers and self-attention revolution. And very quickly, people realized that you could use self-attention to help with this thing. So the key-query attention that's used in transformers can be used to help to extract and exploit relational information.
So this paper by Zambaldi et al, again, DeepMind colleagues, was along those lines. And then my own little contribution to that was a system called the PrediNet, which was published in ICML this year. And I'm just going to talk about that very quickly. To articulate this earlier point I made about the representations that we'd ideally like to have, reflecting a structure that we see in the representations of classical symbolic AI. So in classical AI, representations-- so the context here is, I'm describing the motivation behind this PrediNet architecture, which I've proposed.
So in classical AI, representation is founded on predicate calculus. And of course, going back to the "Naive Physics Manifesto," which I mentioned long ago in the talk, what Pat Hayes wanted to do was to represent common sense using predicate calculus. So that's not quite what I want to do. But what I do want to do, is I want to build reinforcement learning architectures, or deep learning architectures that will learn representations that have a symbolic-like structure. So a structure that echoes what we find in classical AI.
So what we might find, the system we might see in classical AI, is based on predicate calculus. Predicate calculus structures its representations in terms of objects and relations. So we have a predicate argument structure, with variable binding and quantifiers. And we have then sets of propositions, and then logical operations between those propositions, like and, or, not, and so on.
So the thing that we might want to do is take an image, a raw image or pixels, to extract from the raw pixels a collection of objects with their features. So this tall blue bottom right. It shouldn't be. It should be tall blue middle, shouldn't it? And so there's a mistake here. So this object F1 is meant to be tall, and blue, and in the middle. And then we have a short green object in the middle left, and so on.
So we have a collection of features, and objects, and then we can compose those into elementary propositions that describe the relations between those objects. F1 is in front of F2. F2 is shorter than F1. And so on. And then we have functions of these elementary propositions. So some particular object is the tallest, another is the furthest away, and so on. And of course, we can conjoin and disjoin these propositions and so on. So that's the thing that we might be interested in. So bear that in mind.
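The kind of structure being described can be written down directly in a few lines. This is only an illustrative sketch of a hand-built symbolic representation, not something a network has learned: the numeric heights and depths are invented, and the third object F3 is hypothetical, added so that "tallest" and "furthest away" have something to range over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    colour: str
    height: float  # invented units
    depth: float   # invented distance from the viewer


# F1 and F2 follow the slide; the numbers and F3 are made up for illustration.
scene = [
    Obj("F1", "blue", height=3.0, depth=1.0),
    Obj("F2", "green", height=1.0, depth=2.0),
    Obj("F3", "red", height=2.0, depth=3.0),
]


def in_front_of(a, b):
    return a.depth < b.depth


def shorter_than(a, b):
    return a.height < b.height


# Elementary propositions: every binary relation that holds between two objects.
relations = {"in_front_of": in_front_of, "shorter_than": shorter_than}
propositions = {
    (rel, a.name, b.name)
    for rel, holds in relations.items()
    for a in scene
    for b in scene
    if a is not b and holds(a, b)
}


# Functions of the elementary propositions.
def tallest(x):
    return all(not shorter_than(x, y) for y in scene)


def furthest(x):
    return all(not in_front_of(y, x) or y is x for y in scene if y.depth > x.depth) and \
        all(x.depth >= y.depth for y in scene)


# Propositions can be conjoined like any other Boolean values.
both = (("in_front_of", "F1", "F2") in propositions
        and ("shorter_than", "F2", "F1") in propositions)

print(sorted(propositions))
print([o.name for o in scene if tallest(o)], [o.name for o in scene if furthest(o)], both)
```

The point of the talk is that one would like a network to arrive at representations with this articulated shape by learning, rather than having them engineered as they are here.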
Now the previous work that I mentioned earlier on, for learning to extract and exploit relational information, such as relation networks and the self-attention networks. These were shown to be very effective on tasks requiring relational reasoning, such as the CLEVR dataset and Box-World. But the learned representations are obscure. And they're not compositional in any explicit way or abstract. Although the networks, the models have clearly extracted relational information in a way that can be exploited to solve the task. The relational character of that information is not there explicitly in the structures that it's learned. It's ended up as just a big mushed up vector.
So what I was interested in was, trying to build an architecture that learns explicitly relational representations. So that's what this PrediNet architecture looks like. And I'll just describe that very briefly and then finish. So basically you take a raw image, and you process it through a convolutional neural network to produce a whole collection of features. So if you imagine the tube, each tube through here is one feature. And so you can think of that as a big list of features, one corresponding to each location here.
So that list of features is then passed into this PrediNet module, which we're going to look at in a second. And that produces a propositional-like representation, a representation that could be interpreted as a conjunction of relations between objects. And then that representation is passed through a multi-layer perceptron to give my final answer.
Now I'm just going to look at this module in a little bit more detail. In particular, I'm going to look at one head, and you should bear in mind that there are multiple heads of this sort. So this is what it looks like. So this is our list of incoming features. So each vector, each row, you can think of each row as representing one particular location in space. And it uses a key-query attention mechanism.
But what happens is that there is a shared key space. So this key, these keys here, belong to a shared key space that's shared across all the heads. So remember that there were multiple heads here. So shared across all of these heads, the same key space. But this particular head we're looking at here, produces then a pair, its own pair of queries. So we get two queries out.
And then we do the classic key-query softmax attention thing to give us an attention mask, by doing the dot product of the query and the key, and then applying softmax. And then that attention mask, we then apply that-- Unlike transformers, we don't have another layer producing values. Rather, we apply that mask directly to the original input. This L is the same as this L here. We apply it to the original input, and this is going to give us a pair of entities, of entity representations.
Now these entity representations go through another couple of linear layers, to produce essentially another representation of these two entities. And we then apply a comparator, which is element-wise subtraction, all of these, of each element here, to produce this final output.
And then the point is that each element in this final output, can be interpreted as representing a relationship between the selected pair of objects. And so then, because we have multiple heads, we have a whole collection of relations between different pairs of objects. So that's basically what it looks like.
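The computation just described for one head can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the published implementation: the weights are random rather than learned, the dimensions are invented, the queries are assumed to be a linear function of the flattened feature list, and the final projection W_S is assumed to be per head. The names head, W_K, W_Q1, W_Q2 and W_S are all hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def head(L, W_K, W_Q1, W_Q2, W_S):
    """One relational head: attend twice over L, project, and subtract."""
    K = L @ W_K                    # (n, d) keys, one per location, shared by all heads
    flat = L.reshape(-1)           # (n * m,) whole feature list, used to form the queries
    projected = []
    for W_Q in (W_Q1, W_Q2):
        q = flat @ W_Q             # (d,) this head's query
        mask = softmax(K @ q)      # (n,) attention mask over locations
        entity = mask @ L          # (m,) mask applied directly to the input, no value layer
        projected.append(entity @ W_S)   # (j,) linear re-representation of the entity
    return projected[0] - projected[1]   # comparator: element-wise subtraction


rng = np.random.default_rng(0)
n, m, d, j, heads = 25, 8, 16, 4, 3      # invented sizes
L = rng.normal(size=(n, m))              # stand-in for the CNN feature list
W_K = rng.normal(size=(m, d))            # shared key projection
params = [
    (rng.normal(size=(n * m, d)), rng.normal(size=(n * m, d)), rng.normal(size=(m, j)))
    for _ in range(heads)
]

# One j-dimensional vector of relation values per head, concatenated.
relations = np.concatenate([head(L, W_K, *p) for p in params])
print(relations.shape)
```

One consequence of using subtraction as the comparator is visible in the sketch: if a head's two queries coincide, it attends to the same entity twice and its output is zero.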
So you can think of the PrediNet module as mapping an image to a set of answers, to elementary binary relational questions. How far is x above y? And so on. And the key-query matching operates as both a selective attention mechanism and a variable binding mechanism.
Now I have to come clean here. When I say this, what I should have said is ideally, the PrediNet module does that. We'll come on to that at the end, in my honesty moment. So the results that we got from this are, we applied this to a number of tasks which I called the Relations Game. Which were about determining whether certain propositions, certain properties, held between the various objects, relational properties between the various objects in a scene. And it did very well on this and outperformed the baselines, including the relation network, and the multi-head attention, and so on.
So this is where the shortcomings come in. The truth is that the relations learned by the PrediNet are usually not as crisp and interpretable as we would like. In typical deep learning fashion, the network would learn some cheating way of representing an object, which would smuggle in an encoding of the answer somewhere.
And then it would exploit that encoding of the answer at the final stage. And you'd be scratching your head to understand how it had managed to do this. Because the nice clean beautiful interpretation of the output, in terms of propositions, and relations, and objects, was often very obscure.
So the truth is that although it worked very well and outperformed the baselines, it didn't quite do what I wanted it to do. And so building networks that do, that produce properly interpretable, meaningful, and explicitly structured representations is still an outstanding problem, I think.
Just a quick mention of my ongoing work. So I'm very interested in this whole question of these curricula of tasks for acquiring common sense concepts. And that's work I'm doing with Ben Beyret. I'm very interested also in these object-centric agent architectures, and Toni Creswell at DeepMind is working on that. And I'm also interested in architectures for continual learning, to have the modularity that I was talking about earlier. And Christos Kaplanis is working on that with me at DeepMind.
So thank you very much. I'll leave you with some required reading, which I'm embarrassed to say is all my own papers. But you can look up all those many other bits of work that I mentioned, by going back through the recording. So thank you very much indeed.