Is compositionality overrated? The view from language emergence
June 24, 2020
Marco Baroni, Facebook AI Research and University Pompeu Fabra Barcelona
Brains, Minds and Machines Seminar Series
Abstract: Compositionality is the property whereby linguistic expressions that denote new composite meanings are derived by a rule-based combination of expressions denoting their parts. Linguists agree that compositionality plays a central role in natural language, accounting for its ability to express an infinite number of ideas by finite means. "Deep" neural networks, for all their impressive achievements, often fail to quickly generalize to unseen examples, even when the latter display a predictable composite structure with respect to examples the network is already familiar with. This has led to interest in the topic of compositionality in neural networks: Can deep networks parse language compositionally? How can we make them more sensitive to compositional structure? What does "compositionality" even mean in the context of deep learning? I would like to address some of these questions in the context of recent work on language emergence in deep networks, in which we train two or more networks endowed with a communication channel to solve a task jointly, and study the communication code they develop. I will try to be precise about what "compositionality" means in this context, and I will report the results of proof-of-concept and larger-scale experiments suggesting that (non-circular) compositionality is not a necessary condition for good generalization (of the kind illustrated in the figure). Moreover, I will show that often there is no reason to expect deep networks to find compositional languages more "natural" than highly entangled ones. I will conclude by suggesting that, if fast generalization is what we care about, we might as well focus directly on enhancing this property, without worrying about the compositionality of emergent neural network languages.
PRESENTER: Thank you, Marco, for agreeing to give a talk today. And it is my pleasure to introduce you to our CBMM audience. So Marco got a PhD in linguistics from UCLA in 2000. And since then, he has had several positions in industry and academia. He is now a research professor at the Catalan Institute for Research and Advanced Studies in Barcelona. And he is also a research scientist at Facebook AI Research in Paris.
Previously, Marco worked extensively on distributed and multimodal representations of words and sentences. And his current interests are in language evolution, both from a linguistic perspective and as a way to build more powerful artificial intelligence agents. And that's what he will tell us about today. So I'll turn it over to Marco now.
MARCO BARONI: OK. Great. Thanks for the introduction. Thanks for inviting me. I'm really excited to have a chance to talk to this audience. I would also say that I'm glad to be able to do it from the comfort of my living room, except that I didn't consider the fact that today, here in Barcelona, it's the Vigil of St. John, which means that lots of people are out in the street with firecrackers. So I'm actually hiding in the darkest corner of my apartment with all the doors closed. And still, if you hear sort of bombs exploding in the background, probably, it's nothing too bad. It's just a local tradition.
OK. But let's get started. And I will start by thanking the people in the FAIR Evolution in Language team-- Emmanuel Dupoux, Rahma Chaabouni, Diane Bouchacourt, Roberto Dessi, and Eugene Kharitonov, since what I'm presenting today is a work that I've been doing with them.
Hector suggested that I could talk about work on compositionality. And since nowadays I'm really excited about emergent language research, I decided that I would look at compositionality in the framework of emergent language research, which means that I will have a brief introduction on why I think emergent language in deep networks is an interesting thing to explore, something about how it's being explored, and what kind of things we are finding.
I will then go in depth in the discussion of compositionality and its relation to generalization in the context of emergent languages, ending with some take-home messages that hopefully go beyond this specific field of research. Why is this thing not proceeding?
So I think I don't need to remind this audience that in the last 10 years or so, deep artificial neural networks have been shown to be able to do amazing things, ranging from recognizing objects, thousands of different objects in natural images, to beating human champions at challenging games like go, to providing very high-quality machine translation.
And yet, these networks are really good at doing things, but they're really good at doing one thing at once. In a sense, they are like-- I mean, of course, in a much more powerful way-- but they're a bit like the old Unix command line tools that, as the slogan went, do one thing and do it really well. So what happens when we actually want, in some way, to leverage their combined power? Well, what happens nowadays, a bit like we would do by typing on the command line, is that we have to manually combine these networks in order to get them to do something more powerful.
And again, there is a lot of amazing work in this direction. But it's not very flexible. The ultimate dream, perhaps, would be to have an almighty neural network that, sort of like what in the good old times was called artificial general intelligence, will be able to do everything with superhuman intelligence. That would be nice. But I think even the most enthusiastic proponents of deep learning will agree that this is still something that is very far away.
And so as a still very ambitious but perhaps a bit more realistic goal, our team, and several other people in the field, have instead thought of the idea of endowing neural networks with a communication system-- with a language-- that might allow these very specialized networks that are good at doing specific things to communicate, to solve together more complex tasks flexibly, just like we human beings have experts in all sorts of fields who can accomplish things that they could not accomplish alone just by talking to each other and sharing knowledge.
Now, this is the grand dream. Much more modestly and concretely, what this has meant is that in the last few years, several people have been looking at traditional communication games from philosophy and linguistics, where you will have two or more agents that need to establish some kind of communication in order to solve a task together, but now plugging deep networks as the agents into these setups.
In particular, a very simple version of these communication games, which is the only one that I will consider today, is one in which there are just two networks: a sender network, which sees some input-- for example, in this figure, a target image-- and sends a message to the receiver network. The receiver gets the sender's message, and it might get its own input-- for example, here, it gets as input the same image that the sender is seeing, plus a distractor-- and the receiver has to complete a task, to perform an action. For example, in this case, it might have to point to the correct target image.
Importantly, the first way in which we constrain this setup to be human language-like is that we require the message to be a single discrete symbol, or a sequence of discrete symbols, from a fixed alphabet-- rather than, say, some kind of continuous vector, which would actually turn the system into a larger-- but really just a single big-- network.
And the second, and I would say even more important, constraint: we train the networks by giving a reward-- a training signal-- only for task success, with no supervision on the messages that are generated by the sender, which means that we really let the networks evolve their own language. And there are several reasons for this. But the most obvious one is that if we want neural networks to evolve a language, that means that we don't already have a language that we could use to generate training data for these kinds of scenarios.
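The training loop just described-- discrete messages, reward only for task success, no supervision on the messages-- can be sketched in a few lines. The sketch below is a hypothetical toy version, not the actual setup from these experiments: real work in this line uses deep sender and receiver networks trained with REINFORCE or Gumbel-softmax (e.g., via the EGG toolkit), whereas here simple score tables and a success-based reinforcement rule stand in for the networks.

```python
import random

random.seed(0)
N_OBJECTS, N_SYMBOLS, EPS = 4, 8, 0.1

# Score tables updated only via task reward -- no supervision on messages.
sender_scores = [[0.0] * N_SYMBOLS for _ in range(N_OBJECTS)]
receiver_scores = [[0.0] * N_OBJECTS for _ in range(N_SYMBOLS)]

def pick(scores, options, explore):
    """Epsilon-greedy choice among `options`, scored by `scores`."""
    if explore and random.random() < EPS:
        return random.choice(list(options))
    return max(options, key=lambda o: scores[o])

def play_round(explore=True):
    target, distractor = random.sample(range(N_OBJECTS), 2)
    # Sender sees only the target and emits one discrete symbol.
    symbol = pick(sender_scores[target], range(N_SYMBOLS), explore)
    # Receiver sees the symbol plus both candidates, in random order.
    candidates = [target, distractor]
    random.shuffle(candidates)
    guess = pick(receiver_scores[symbol], candidates, explore)
    reward = 1.0 if guess == target else 0.0
    if explore:  # reinforce the chosen symbol and guess on success only
        sender_scores[target][symbol] += reward
        receiver_scores[symbol][guess] += reward
    return reward

for _ in range(20000):
    play_round()

# Greedy evaluation: how often does communication succeed?
accuracy = sum(play_round(explore=False) for _ in range(1000)) / 1000
print("greedy test accuracy:", accuracy)
```

After training, the emergent "language" can be read off `sender_scores` as the argmax symbol per object-- and, as with the real experiments, nothing in the objective forces that code to be interpretable.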
OK. So this is the general setup. As an early example of these, with Angeliki Lazaridou, who is now a research scientist at DeepMind-- a few years ago, we used this framework to ask one of the most basic questions that you could ask about an emergent language, namely, whether this language could develop words that, just like the words in human language, refer to general semantic categories such as, I don't know, cats, dogs, cars, pianos, and so on and so forth.
To do this, we designed a game where the sender would see two natural images that depict instances of different categories-- in this example, a dog and a cat-- and was told that one of the two images was the target. The sender then produced a single symbol. This symbol was sent to the receiver, which was also shown the two images and had to point to the correct target image. At training time, this would be done with a subset of images.
And then, at test time, we would evaluate the ability of the agents to communicate about the new instances of the same categories. And since they were able to do this, to use these words to refer, say, to dogs that they had not seen during training, we concluded that, yes, this emergent language evidently had words that were able to refer to general categories.
However, it turns out that we were a bit too optimistic. A few years later, with Diane Bouchacourt, we went back to exactly the same game, using exactly the same agents and exactly the same training data-- so images of things like cats, dogs, and pianos at training time. But now, at test time, instead of feeding them more cats, more dogs, and more pianos, we would feed them blobs of noise.
And what we found out is that actually, the agents were as happy to communicate about these blobs of noise using the same, quote unquote, "words," as they were to communicate about cats and dogs, which suggested that really, their words were not referring to general semantic categories, but probably were encoding some kind of really ad hoc, low-level strategy to solve the specific task.
And in parallel with our work, there were several other papers that basically showed in other ways a similar result-- that maybe, it is really not enough to have the agents succeeding at some task by using some emergent language, but we have to be careful about what kind of language emerges if we hope that it's something of any generality.
And so the focus has shifted: rather than scaling up the simulations to ever more ambitious scenarios, many recent papers concentrate on analyzing the emergent language. And in particular, what has really attracted much attention is the topic of whether these emergent languages are compositional. So on this slide, you just see a few examples of the many, many recent papers that debate whether emergent languages are or are not compositional.
Why does compositionality seem such an important property? Well, because we know that perhaps the most useful thing human language does is allow us to talk about things that we have never encountered before-- to generalize zero-shot. Let's say that neither you nor I have ever seen a blue banana before. Still, I could zero-shot say "the blue banana," and be pretty sure that you would point to the right object in this image.
And the standard and reasonable story goes: the reason natural language is so good at generalization is that human language is super compositional. When we were learning English, we learned that banana is the word for bananas and that blue things are called blue; and then, given this knowledge, we can compositionally refer to a blue banana that we have not seen before as a "blue banana." So it seems very desirable to have this property of compositionality.
However, if we want to study compositionality in emergent languages, first of all, we need a useful definition of what it means for an emergent language to be compositional-- where, note, compositional is not just a synonym for able to generalize. Otherwise, we could just directly study generalization and stop worrying about compositionality.
Now, actually, in the new literature, you don't really see many explicit definitions of what the researchers really mean when they say that the language is compositional. And it turns out that actually, the standard definition of compositionality, coming from philosophy and linguistics, the one that says that a linguistic expression is compositional if its meaning is a function of the meaning of its parts and possibly the rules used to combine them, turns out to be quite useless in the analysis of these deep network emergent languages.
And we've been having a lot of discussions about why this doesn't really seem to work. And I think we came to the conclusion that the reason why it doesn't really work is because this is a definition that focuses entirely on the meaning side. It basically assumes that we know what are the parts of the linguistic expression and the ways to combine them, and then, we look at whether this combination resulted in something that was indeed compositional on the meaning side.
But actually, in emergent language simulations, the meaning composition is typically trivial-- it's pretty much always some kind of conjunction of properties, such as blue and banana. And our job is really to discover which parts of the linguistic expression are the ones that refer to the atomic parts of the meaning, and how they combine to refer in this way. So our intuition is that when people talk about emergent languages being compositional, what they really mean is that a compositional language is one where it is easy to read out which parts of a linguistic expression refer to which components of the input.
So for example, if in response to a blue banana our agents said "ALM," a very compositional language would be one where we can nicely, fully segment this string into, say, "AL" meaning blue and "M" meaning banana. A language where the whole string "ALM" refers to blue banana, with no possibility of further analysis, would of course not be compositional. And cases in the middle-- say, where L plays a double role, contributing to the blue meaning together with A, but to the banana meaning together with M-- we would say are less compositional than the first analysis.
So in our last few papers, we have tried to make this idea more precise, proposing a notion of naive compositionality, where we make the assumption that, on the meaning side, the only way to combine primitive input elements is to assemble them in a collection-- which, for example, would mean that the meaning is just a list of attribute-value pairs, like the one I'm showing in the example on this slide, where composition just assembles the attributes' values.
Or it could be a set of objects. It could be a list of properties, such as blue and banana. And we do not apply any transformation to these items on the meaning side, which is limiting, for example, if we want to look at natural language. But I think it's fair to say that it covers nearly all the current emergent language simulations. So if we make this assumption on the meaning side, then we can say that a language is naively compositional if the atomic symbols in its expressions refer to single input elements independently of either the input or the linguistic context.
Now, notice that this definition does not really mention composition explicitly. But it actually results in compositionality, because if a collection of inputs has to be referred to by symbols in a context-independent way, then really, the only way to refer to a composite input will be by the juxtaposition of the corresponding atomic symbols.
And it is a naive notion, both because, as I already stated, on the meaning side we are only considering the assembly of meaning primitives, and because on the form side-- the form of the linguistic message-- we only consider juxtaposition of atomic symbols as formal composition. You could say that there is no phonology, or no morphophonology, if you like. And notice that it's exactly these restrictions that make for an easy readout of what is being composed, because I can immediately say: OK, L must refer to whatever value-- 12-- no matter in which context it occurs.
So for example, again, given that our agent is shown this list of attribute values and says ALM, we could say that it is speaking a naively compositional language if we find out that A is always referring to the value 29, L is referring to 12, and M is referring to 31-- where, by the way, notice that this could be a bag-of-symbols representation, where A always refers to a certain attribute value no matter where it occurs. Or it could be a rudimentary syntactic representation, where A refers to that value only when it occurs as the first symbol.
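As an illustration, naive compositionality in this positional sense can be checked mechanically. The sketch below is my own toy formalization, not code from the papers discussed: it assumes the meaning is a tuple of attribute values and the message has one symbol per attribute, and it searches for an assignment of message positions to attributes under which each position's symbol is a function of a single attribute's value, regardless of context.

```python
from itertools import permutations

def find_readout(language):
    """language: dict mapping meaning tuples (one value per attribute)
    to message tuples of the same length (one symbol per position).
    Returns a position->attribute assignment witnessing naive
    compositionality, or None if no consistent readout exists."""
    meanings = list(language)
    n_attr = len(meanings[0])
    for perm in permutations(range(n_attr)):
        consistent = True
        for pos, attr in enumerate(perm):
            seen = {}  # attribute value -> symbol observed at this position
            for meaning, message in language.items():
                if seen.setdefault(meaning[attr], message[pos]) != message[pos]:
                    consistent = False
                    break
            if not consistent:
                break
        if consistent:
            return perm
    return None

# Position i always encodes attribute i's value: naively compositional.
compositional = {(a, b): (a, b + 10) for a in range(3) for b in range(3)}

# Each symbol mixes both attributes: entangled, no clean readout.
entangled = {(a, b): ((a + b) % 2, (a * b) % 2) for a in range(2) for b in range(2)}

print(find_readout(compositional))  # (0, 1)
print(find_readout(entangled))      # None
```

Note that the entangled language above may still be "compositional" in the traditional, rule-based sense-- the message is computable from the parts of the meaning-- but it fails the naive readout test, which is exactly the distinction being drawn here.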
Again, if the expression can only be interpreted as a whole, and it's not possible to break it down to get at its meaning, then we say that it's not compositional. But now take a language which actually is still compositional in the traditional sense, because you can put together the parts in some systematic way to get the intended meaning, but where there are a lot of complicated rules-- say, A refers to 29, but only if it is followed by L; L, in turn, refers to 12, but only if the other input values are odd; and M refers to 31, but only if the first symbol was A. We would say that such a language is non-naively compositional. And for our purposes, we don't want a language like this to emerge, because it's very, very difficult to understand.
So equipped with this definition, we tried to systematically study the relation between compositionality and generalization, in a paper led by Rahma Chaabouni and Eugene Kharitonov that they are going to present at ACL next week, I think. In this paper, we consider a communication game where the input is composite-- our usual list of attribute-value pairs. The sender produces a multiple-symbol message, and the receiver has to reconstruct the input attribute values.
Now, at training time, the agents will see a subset of the possible attribute-value combinations. And then, at test time, they will be tested on attribute-value combinations that they have never seen at training time-- which means that, basically, test accuracy will directly be a measure of how good the agents are at generalizing to unseen combinations. And at the same time, we can inspect the emergent code after convergence to check whether it is naively compositional.
How do we quantify naive compositionality? We used three separate measures in the paper; I'll go over this very fast. Here, for the quantitative results, I will only use one, which is what we call positional disentanglement. It is basically a strong form of naive compositionality, where we measure to what extent a symbol in a certain position-- for example, the first symbol in the sequence-- univocally refers to the different values of one and the same attribute. So not only whether there is a correlation between symbols and input values, but whether this correlation is position-dependent-- so that, say, the first symbol in a message will systematically refer to a value of the second attribute of the input.
And as I said, we had other measures, and we got similar results, so the findings are not particularly dependent on this measure. At the same time, it is intriguing that, to the extent that our emergent languages were compositional, they were compositional in this positional way, which suggests some kind of liking for a rudimentary positional syntax. But I won't go further into that here.
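To make the measure concrete, here is a small reconstruction of positional disentanglement as I understand it from the description above (the exact formulation in the paper may differ in detail): for each message position, compute the mutual information between the symbol in that position and each input attribute, and score the gap between the two most informative attributes, normalized by the symbol's entropy. A score of 1 means each position tracks exactly one attribute; 0 means no position favors any attribute.

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def mutual_info(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def posdis(meanings, messages):
    """meanings: list of attribute-value tuples; messages: list of
    equal-length symbol tuples, aligned with meanings."""
    n_attr = len(meanings[0])
    scores = []
    for pos in range(len(messages[0])):
        symbols = [msg[pos] for msg in messages]
        h = entropy(symbols)
        if h == 0:
            continue  # a constant symbol carries no information
        mis = sorted(
            (mutual_info(symbols, [m[a] for m in meanings]) for a in range(n_attr)),
            reverse=True,
        )
        scores.append((mis[0] - mis[1]) / h)
    return sum(scores) / len(scores) if scores else 0.0

meanings = [(a, b) for a in range(4) for b in range(4)]
perfect = [(a, b) for a, b in meanings]                 # position = attribute
tangled = [((a + b) % 4, (a - b) % 4) for a, b in meanings]

print(posdis(meanings, perfect))  # 1.0
print(posdis(meanings, tangled))  # 0.0
```

The tangled language is telling: every symbol is fully determined by the input, and the input is recoverable from the message, yet posdis is zero-- generalization-friendly codes need not look disentangled.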
So the first question to ask, of course, is: are agents able to develop languages that generalize? The answer is yes, but with an important caveat, illustrated by this plot. It shows test accuracy-- so, generalization accuracy-- as a function of the input size, that is, how large the space of possible meanings is. And we see that there is a super strong correlation between how varied the input to the agents is and how good they are at generalizing. They really need a rather large input space to develop a language that generalizes.
And this, in a sense, is an obvious result, right? If you have a very small input space, then it's just easier to memorize everything. But still, I think it's a good thing to keep in mind, because nowadays there are many studies of compositionality in neural networks, including some that I ran before, that rely on really small toy tasks. And then, when we conclude that neural networks don't generalize, maybe it's not because there is really an issue with generalization in neural networks, but because the input was not varied enough to discourage just memorizing-- getting stuck in local optima, if you want to call them that.
But anyway, generalization happens. Does generalization correlate with compositionality? I think this is the most interesting result of our paper. The answer is a resounding no: there is no relation between generalization and compositionality. What you see on this plot is our measure of compositionality on the y-axis and generalization accuracy on the x-axis. This is for a particular configuration, but the result holds in general, robust to many hyperparameter variations, and so on, and so forth.
You can see that basically there is no nice relation; languages that generalize are all over the map. They can be relatively compositional-- notice, also, that the largest compositionality scores are around 70%, not at the maximum of 100%.
But also, you can have languages that are not compositional at all and still generalize very well. What's going on here? It's actually very difficult to tell exactly, because the non-compositional languages are very, very entangled, so it's very difficult to analyze them. But after spending a week or so with them, doing some kind of qualitative analysis-- some kind of linguistics fieldwork-- what I found is that, in general, what seems to happen is this.
In general, deep agents, in order to successfully converge, actually need more expressive power than would be strictly necessary in a perfectly compositional language. So what does this mean? It means that, for example, if the input consists of two attributes with 100 values each, you would expect that agents could learn with a language that allows messages of two symbols over a vocabulary of 100.
But actually, they will need, for example, a language allowing messages of three symbols with a 100-symbol vocabulary. Which means that there is going to be some kind of redundancy. There's going to be some extra leeway that, then, is used by the agents to find solutions that are not fully compositional.
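A quick back-of-envelope check makes the slack explicit (using the example numbers from the talk; the exact channel sizes in the experiments may differ):

```python
n_attributes, n_values = 2, 100
n_meanings = n_values ** n_attributes      # 10,000 distinct inputs

vocab_size = 100
minimal_channel = vocab_size ** 2          # two symbols: exactly 10,000 messages
observed_channel = vocab_size ** 3         # three symbols: 1,000,000 messages

assert minimal_channel == n_meanings       # zero slack: a bijection is forced
print(observed_channel // n_meanings)      # 100-fold redundancy to play with
```

With the minimal channel, every meaning must get a unique message, so the code is maximally constrained; the extra symbol multiplies the message space a hundredfold, and it is exactly this leeway that the agents exploit for non-compositional solutions.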
So for example, in what was actually one of the most compositional languages, what I found was that positions were specialized, such that one position in a message would mostly denote the value of one attribute, and another position would mostly denote the value of the other attribute. But still, some kind of ambiguity was left, and then the third symbol would-- often in very bizarre ways-- fine-tune the referent.
So maybe, if the message started with A, we could conclude that A meant either value 53 or 72 of attribute 1, and then the third symbol would tell us which of the two it had to be.
So the general conclusion, as I said, is that we didn't find any relation between degrees of compositionality and degrees of generalization. The question, then, is: is compositionality good for something? Is there maybe something else for which compositionality is good?
But even before running any experiment, of course, one reason why we may want languages to be compositional-- and, in particular, naively compositional-- is exactly because naive compositionality basically means that the language is easy to understand, easy to interpret. And maybe, eventually, we want to have humans having conversations with neural networks, and we want to be able to interpret them, to make sure that they are not developing evil plans to destroy the world behind our backs.
So sure, if interpretability is your main goal, then compositionality might be something worth pursuing. But how about from a more general scientific perspective? Is compositionality good for something?
Quite a useful observation here might be that, if we go back to the plot that I showed you before, compositionality might not be necessary for generalization, but it might be sufficient, in the sense that if you look here at the top left, there is no high-compositionality language that fails to generalize. So it looks like once the agents, perhaps by chance, discover a compositional language, we can be pretty sure that that language will be able to generalize.
And so following up on this observation, we explored the hypothesis that compositional languages are like good viruses, or good memes, that may originate by chance but can then easily spread in a population. Indeed, in our experiments, at least, that seems to be the case. What we did is we froze our senders so that they could not modify the language, and then we exposed them to new receivers, even receivers with different architectures. And we found that there was a super high correlation between whether a language was compositional and how fast the new receivers would learn the language, and also how well the new receivers would use it-- in the sense of how good the receivers would then be at using the language to generalize.
So I think this is interesting, because one of the nicest and most robust results in the experimental study of language evolution is that cultural transmission favors the emergence of compositionality. There are classic experiments by Kirby, Smith, and others in this respect.
And what we are finding here is we're inverting the arrow of causality and saying, it's not only that cultural transmission causes compositionality, but maybe compositionality is also favoring cultural transmission, even when, maybe, it arises by chance.
And this, actually, would be a good thing, because our end goal would be to breed large communities of interacting agents that have a shared language, and so a language that is easy to use to train new agents would be a good thing.
Unfortunately, this is just something that happened in the particular work that I just talked about. It is not a general result. And if you think about it, it's actually sort of obvious that it is not a general result.
How viral a language will be will actually depend a lot on the kind of input you have. It will depend on the task that the agents have to solve. And perhaps most interestingly, it will also depend on whether compositionality protects the agents from something that is hard for them. If, actually, entanglement-- failure of compositionality-- is the consequence of something that is easy, that is aligned with the biases of our agents, then compositionality will not be a particularly desirable property.
And we have a recent short paper with Eugene Kharitonov, where we present a few neat thought experiments about this. So for example, in one, we look at a simple environment where inputs are coordinates within the unit circle. And then we consider two manually crafted languages that we use to train receivers. In one language, the one that we call the coordinate language, each utterance is made of two symbols that directly correspond to discretized coordinates.
So if you look here at the top left, you have a point that is at coordinates (1, 0). In the naively compositional language, the message corresponding to this input would be something like 1, 0, or some symbolic equivalent of that. In the rotated language, instead, the symbols correspond to discretized coordinates after rotating the axes by a certain angle, so that now the same point would be denoted by two symbols that are discrete equivalents of 1 divided by the square root of 2 and minus 1 divided by the square root of 2-- which basically makes this language maximally entangled, in the sense that identifying either element of the input will always require looking at both symbols in the message.
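The two hand-crafted languages are easy to reproduce. The sketch below is a hypothetical reconstruction-- the bin count and the 45-degree angle are my choices, not necessarily the paper's: the coordinate language discretizes the raw coordinates, while the rotated language discretizes the coordinates after rotating the axes, so the point (1, 0) comes out as the discretized versions of 1/√2 and -1/√2.

```python
import math

def discretize(x, n_bins=20, lo=-1.0, hi=1.0):
    """Map a coordinate in [lo, hi] to one of n_bins discrete symbols."""
    x = min(max(x, lo), hi - 1e-9)
    return int((x - lo) / (hi - lo) * n_bins)

def coordinate_language(x, y):
    # Naively compositional: one symbol per raw coordinate.
    return (discretize(x), discretize(y))

def rotated_language(x, y, theta=math.pi / 4):
    # Maximally entangled: symbols encode coordinates in rotated axes,
    # so recovering either original coordinate needs both symbols.
    xr = math.cos(theta) * x + math.sin(theta) * y
    yr = -math.sin(theta) * x + math.cos(theta) * y
    return (discretize(xr), discretize(yr))

point = (1.0, 0.0)
print(coordinate_language(*point))
print(rotated_language(*point))
```

Since the rotation is just a fixed linear map of the inputs, a receiver only has to absorb that map into its first layer-- which foreshadows the punchline below about why the two languages turn out equally easy to learn.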
Now, we look at how easy these languages are to learn and to use by a new receiver. We measure ease of learning by how fast convergence is during training. Blue here means coordinate, highly compositional; red means rotated, not highly compositional at all. You see that the two curves are virtually indistinguishable. And ease of use-- that is, are the new receivers able to use these languages to generalize to new points that they did not see during training? Again, no difference between the coordinate and the rotated languages.
So what's going on here? Well, what's going on is, we think, quite clear. In the rotated language, you can use a linear transformation to link the symbols-- or rather, the numerical values that correspond to the symbols-- and the inputs. And linear transformations are basically something that neural networks are doing all the time.
So in a way, a linear transformation is, we could say, virtually transparent to a neural network. It might be as easy to learn an arbitrary linear transformation as it is to learn an identity mapping, so that for the listener, the rotated language is as easy as the perfectly naively compositional coordinate language. This is a little toy experiment to highlight the point that there is no universal fact about more compositional languages being easier to learn.
OK, so this leads me to my conclusions, my take-home messages. At a very high level, I think that trying to evolve a shared language to empower communities of specialized neural networks to communicate is an interesting pursuit-- both from an AI perspective, because it could be a more flexible layer for putting together specialized networks, and because it's a very interesting scenario for studying the emergence of a communication code from the point of view of cognitive science. So I hope that you guys, if you are not familiar with this kind of literature, now got a bit more curious about it.
At the same time, I would say that an important thing to keep in mind is that it's very, very difficult to predict what this kind of emergent language would do. As I say here, emergent neural network languages will do tricky things. They will converge to very strange solutions. They will be highly entangled, and so on, and so forth. So you have to be always very careful about not only checking whether communication is successful, but really, what kind of communication it is, if you hope that it's something of some generality.
Based on these kinds of considerations, a lot of recent work has actually focused on the question of whether emergent language is compositional. However, if we think that this is something important, we should first have a clear, useful, actionable definition of what we mean by compositionality in this context. We proposed naive compositionality, but that is clearly not the last word. What I think we cannot do is just take a definition that worked very well in formal semantics and expect that it will automatically be useful in studying these languages.
At the same time, our experiments showed that many of the desirable properties that we think are linked with compositionality, in [INAUDIBLE] emergent languages, after all are not. Which means that if what you really want is, say, good generalization or fast learning, you might directly focus on those properties, rather than worrying about how compositional the emergent language is, given that compositionality might not be a proxy for this or that [INAUDIBLE].
Now, the linguist in me would say: hey, hey, hey, you're going too fast here. Human language, which is the best communication system on the planet, is almost defined by its compositionality. So what do you mean that we should not look at compositionality?
Well, let's really look at how compositional human language is. We just said that the wonderful thing about human language is that we know the meaning of "banana," we know the meaning of "blue," we put them together, and we search Google for "blue banana." And the first hit for "blue banana" on Google turns out to be this thing over here, which is a geographical region covering all the major areas of urbanization in Western Europe, going from northern England to northern Italy. So not that compositional.
So I would like to stress that human language is also full of non-compositional expressions: metaphors like this blue banana thing, idioms, lexicalized constructions. And I think the best evidence for that is that if human language were that compositional, we would not have lots of very, very smart people, for 50 years, working really hard at trying to come up with a compositional account of English and other languages, and, while making great progress, succeeding only for small fragments of these languages.
So a general conclusion is that maybe the right goal, not only for emergent language research, but for neural network research in general, might not be full compositionality, but rather a human language-like, opportunistic, efficient mixture of compositional and non-compositional means of expression. And with that, I'm done, and I thank you for your attention.
PRESENTER 1: Great. Thanks very much, Marco. Roger Levy has his hand up. Roger, I'm going to unmute you if you'd like to ask your question.
ROGER LEVY: Hi, can you hear me?
MARCO BARONI: Yes.
PRESENTER 1: Hi, Roger.
ROGER LEVY: Hi, Marco. Thank you for a really wonderful talk, as always. I really liked your provocative ending. But I want to engage with you on exactly what you want to commit to.
So I think the blue banana example that you give is really wonderful, but to me, I'm not sure I would call that non-compositional. Rather, I would say that it's a great example of how language has mechanisms that build on top of the output of compositionality.
So in this case, it's not an accident that the name for this region is called the "Blue Banana." It's precisely because of the compositional meaning of "blue banana." But then, language can-- in addition to computing the output of a composition of elements, it can then reify that particular expression and then build on top of it. And so I guess I would think of this as an instance of the rich means of departures from compositionality or additions on top of compositionality the language offers. And those are often very systematic.
So for example, metaphors-- I mean, you're highlighting frozen metaphors. But for every frozen metaphor, there are many, many metaphors that are not frozen, and there is a lot of systematicity in the nature of metaphors. And so I guess I think of-- I mean, obviously, non-compositional expressions do exist. "Kick the bucket" is the paradigm example. But it's hard to find ones that are that non-compositional.
So I guess it seems to me that what I would describe as human language-like is maybe not a mixture of compositional and non-compositional means of expression, but rather a sort of compositional system with a number of other mechanisms for flexibly and productively building on top of that and storing the outputs of the resulting computations.
So I wonder, is that consistent with the picture that you're presenting, or is it in conflict? And if so, how?
MARCO BARONI: So thanks. I think you're making a very good point here. And I mean, I guess here, I was deliberately being a bit provocative. And of course, human language is very compositional.
And I would say-- my impression, maybe because I come from morphology and the like, is that what language does, very often, is to be partially compositional. So you have, say, syntactic constructions that are largely productive: largely, you can understand their meaning. But then they have certain forms that are more lexicalized, more frozen. And still, you can often see the meaning. And it seems like there is a lot of flexibility, a lot of gray areas in human language.
In neural networks-- so I guess my main point here is that, when we look at neural networks, I think sometimes we expect them, or want them, to speak in logic or mathematics, more like a formal language than a natural language. And I think it's not by chance that we humans don't speak in logic, right? I mean, it would be extremely cumbersome.
And so I think also for neural networks, when we go, "oh, another failure of compositionality here; this could not possibly be a good model for humans, or it could not be a way into AI," I would say, well, let's be a bit more open-minded here.
Let's understand in which way it is a failure of compositionality. Is it a failure of compositionality that actually, maybe, resembles these phenomena in human language-- which, as you correctly pointed out, once we look at them more deeply, are actually not failures of compositionality but sophisticated forms of compositionality? I guess that is my main point.
The blue banana example, I mean, I put it because it was cute and because it's real-- because when I was googling for blue banana images, this is the first thing that popped up.
ROGER LEVY: [LAUGHS]
MARCO BARONI: But I guess, probably, a better example is that human language could also decide to call banana, I don't know, "yellow, curved, tube-shaped fruits," and that would be more compositional than "banana." But because we eat bananas all the time, at a certain point, we say, OK, let's just call that thing "banana," right?
And I think that's, maybe, closer to what I would expect our neural networks to do, and what I actually would like them to do. Let's say that if some concept is frequent, even if you could decompose it into its parts, it's better to just have a name for it. I guess in this case, it would be a proper name. But even a generic noun that just denotes this combination of primitive attributes.
ROGER LEVY: Yeah, thanks. Yeah, that all makes a lot of sense. Thank you.
PRESENTER 1: Thanks, Roger. Next question, we have one from Ava, who sent it in through the Q&A, but we're going to have her ask it in person. So Ava, go ahead.
AVA: Hi, can you hear me?
MARCO BARONI: Yes.
AVA: All right, great. Hi, Marco. Thank you so much for the talk. This is super interesting. So I had a question about the relationship between ambiguity and compositionality, when you're defining this idea of compositionality. I'm coming at this from a linguist's perspective, and natural language is notoriously ambiguous when it comes to the mappings between sign and meaning. But this has never really stopped it from being compositional, and if anything, it makes natural language more efficient when it comes to the size of our vocabulary and also the length of messages that we can send.
And so I was wondering, when you talk about this difference between naive compositionality and non-naive compositionality, it seems to me that that really has to do with whether or not you are introducing ambiguity into a vocabulary. And I wondered whether or not the definition of compositionality that you're using is conflating these notions of compositionality and what we'd call discreteness in linguistics. So what is the mapping of meaning to sign? So I was wondering if you could comment a bit about the relationship that you're envisioning about compositionality, discreteness, and ambiguity.
MARCO BARONI: Yeah, thanks. I mean, this is another wonderful question. And I have many things I would like to say about it. Let's see the ones that I remember. So one is, yeah, totally. Natural language is very ambiguous. And indeed, if you read nearly all the recent papers on compositionality in these neural network languages, actually, what they would like to do is to get rid of ambiguity.
So following up to what I was saying with Roger before, it's like, that's another property that, actually, we do not want to lose, right? Just like we want some words not to be compositional, so we may also want ambiguity, because as you say, I mean, there are many kinds of ambiguity, and they all have, clearly, a function. So we want to have them there.
Having said that-- actually, something that I would have loved to put in the ACL paper is whether these languages are ambiguous in human-like ways. That's not really the case. I mean, I suspect it's also because these [INAUDIBLE] domains, like attribute values, are too simple for interesting forms of ambiguity to emerge. But what you get there are really these kinds of bizarre things where, say, one word could refer to two values, and another word could also refer to two values, and then, when you intersect them, you are able to disambiguate.
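The disambiguation-by-intersection pattern Marco describes can be sketched in a few lines of Python. This is an invented toy (the word names and value sets are hypothetical, not from the actual emergent languages): each symbol is ambiguous on its own, but the pair of symbols jointly picks out a single value.

```python
# Each "word" in the emergent language is ambiguous on its own:
# it denotes a set of possible attribute values.
word_a = {3, 7}   # first symbol could mean value 3 or value 7
word_b = {5, 7}   # second symbol could mean value 5 or value 7

# A listener hearing both symbols intersects their denotations.
denotation = word_a & word_b
# The intersection is the singleton {7}: jointly, the message
# is unambiguous, even though each word alone is not.
```

This is exactly the "bizarre" kind of ambiguity in the talk: no single symbol has a stable meaning, yet communication still succeeds.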
So I think that, yes, we want ambiguity. Do neural networks have the right kind of ambiguity? For the moment, from my little fieldwork, I would say no. Naive compositionality versus compositionality: another way, maybe, to see what I'm saying-- and I don't have a final answer here; it's more what I, Rahma, Eugene, and others have been thinking about-- is that maybe the question is also a terminological one, about what we mean when we talk about compositionality with respect to these languages.
So I think that what AI researchers would like to have here, as I say, is really easy readout-- something that makes it easy to understand what these agents are saying. And in that sense, even from a purely practical point of view, maybe a stupid language that is really concatenative, pure juxtaposition, totally unambiguous, would be a horrible language, but it would be one that is very easy to understand. And it's clearly not the one that these networks tend to naturally converge towards. And perhaps they are right not to.
So maybe-- and I do hope that linguists and people doing these deep learning simulations talk to each other more. I organized a one-week workshop on compositionality last summer, with super good people in linguistics, neuroscience, AI, and so on and so forth. And my only really clear take-home message was that we are all talking about different things when we talk about compositionality.
Everybody loves it. Everybody thinks that it's a core property. But then, we are really talking about different things. And so even if my contribution turned out to just be terminological here, I think it would be useful to say, hey, maybe we are looking at something else.
And speaking of "something else," I think the other thing that we have to keep in mind here is that we don't even know when we are looking at phonology and when we are looking at morphology slash syntax. The first job to do here is to understand: is A a phoneme that gets combined with other symbols to then get a meaning? Or is A an autonomous word or an autonomous morpheme that already has a meaning in itself? And depending on that, you can come up with different analyses.
So actually, again, maybe some of the things we should be thinking of here are not compositionality in the sense of semantics, but compositionality in the sense of how much phonology and how much morphophonology is happening-- is there a form of double articulation or not-- these kinds of questions.
AVA: Thank you.
MARCO BARONI: Thanks.
PRESENTER 1: Great, thank you, Ava. Our next question is from Josh Tenenbaum.
JOSH TENENBAUM: Hi, can you hear me?
MARCO BARONI: Yes.
JOSH TENENBAUM: OK, thanks, Marco. A great talk-- really clear, really interesting. My question is in the spirit of this idea that by studying neural network models, there can be a kind of comparative linguistics, or comparative psycholinguistics. Because while these are, in some ways, very much like biological learners, in other ways, they're very different.
And I wonder if maybe this could help us understand the difference. For example, in human studies of the cultural evolution of language systems, as you were talking about, compositionality seems to be really valuable in speeding the development of a language and making it more learnable and more developable over generations. But you don't find that on the neural network side. Could that point to some of the differences between the way neural networks learn and the way we understand humans to learn?
So in particular, if I understand what you're doing, where you're basically training everything with gradient descent, except for the discrete symbolic bottleneck between the two agents, these agents don't have the kinds of inferential capacities that humans, and human children, use to learn words from just one example, or to resolve ambiguities with pragmatics-- the various kinds of sophisticated probabilistic inference that many people have studied, which can account for our ability to do one-shot learning and resolve many kinds of pragmatic inferences.
So if, maybe, compositionality is distinctly valuable together with those abilities, that would allow not only the evolution of learnable languages, but languages that can be learned much more quickly and used in more sophisticated ways to come up with meanings beyond the literal meaning. And maybe there's a way to test that by basically doing the same kind of artificial multi-agent communication simulation, but where, instead of neural networks, you have agents that have these kinds of abilities to learn language and resolve ambiguities with, say, for example, probabilistic inference on structured representations, and seeing whether their compositionality would emerge in the space where you didn't necessarily force it to emerge, thereby paralleling what we've seen on the behavioral side.
MARCO BARONI: Thanks. Again, very interesting points, and I actually like many of them. Let me just start from the simplest thing, which is that in our study, what we show is that compositionality does not necessarily speed up cultural transmission. But in the other direction, there are a bunch of recent studies showing that for neural networks, as in these human experiments, it goes the other way around.

So cultural transmission does lead to more compositional languages. I mean, they're all smallish toy studies, so I don't think we have the last word on that. But I would not want you to get the impression from me that that's not the case: we don't have evidence against it, and we have some weak evidence for it.
But of course, that's independent of your more general points. But one thing about gradient descent and this sort of--
JOSH TENENBAUM: Yeah, just to clarify, because I'm not so familiar. The difference is whether it's just two agents iterating back and forth, versus like a chain of agents teaching and learning from each other? Is that the difference?
MARCO BARONI: So the difference is this, that what we did is we took a compositional language, and we asked, will this spread faster?
JOSH TENENBAUM: Yeah.
MARCO BARONI: And the answer is, in certain cases, yes, but in certain other cases not. Whereas the more traditional approach is the one where you actually take a random language at the beginning--
JOSH TENENBAUM: OK, and then you track it. Yeah, good.
MARCO BARONI: Having it spread around, and eventually it becomes more compositional. And there are some papers that show, at least to some extent, that this also happens with neural networks. I don't have the references at the top of my head, but I think our colleagues Kirby or Smith are among the [INAUDIBLE] there, too.
Concerning the point about the way of learning, I really think-- I mean, unfortunately, I don't have any kind of mathematical grasp of what's going on, but I definitely have an intuition that one of the problems with gradient descent is that you can never really rearrange. There is no "aha" moment in which you get something, you realize that the real solution is another one, you wipe out your earlier mistakes, and you have everything nice and clean.
But by looking at these languages, I really think that that's what's going on here: because gradient descent is such a gradual process, these networks start with lots of little bad ideas, things that locally solve part of the problem. And then, by the end, even when they generalize, there are still a lot of vestigial remains of these earlier solutions, and they make the language very entangled. So I would really like to collaborate there with people who are studying the mathematics behind these things to understand what's going on, because I suspect that's related.
Other kinds of learners-- that is definitely something we should try. As you said, at a certain point, I feel like the problem with some of these more probabilistic learners is that I would be worried that you are putting compositionality into the model by the way you build it. And then it would not be so surprising if it emerges. But again, I would be happy to chat later about what could be the most agnostic form of these models that we could try, one that doesn't incur these risks.
JOSH TENENBAUM: Yeah, I think that's very interesting. I mean, it is really interesting, and there are lots of people at CBMM who are trying to understand more about the mathematics and the dynamics of gradient descent. And there are ways in which it's like biological learning, but also ways in which it's more like evolution, and has that property, like evolution does, of accumulating-- maybe being suboptimal, inefficient, but still somehow cobbling together a good solution. So yeah, trying to understand that, teasing out the evolutionary dynamics versus the more inferential learning dynamics-- those would be really interesting. Thanks a lot, again.
MARCO BARONI: Sure.
PRESENTER 1: Great. Thanks, Josh. We have a question submitted anonymously. I'll go ahead and read it to you: "Do you think forcing networks to develop 'compact'-- in quotes-- languages will lead to more correlation of compositionality and generalization?"
MARCO BARONI: That's a very interesting question. So one thing that we do know from our experiments is that actually, counter-intuitively, if you limit the capacity of the language, if you make the language less expressive but still expressive enough that a compositional solution would be able to talk about all the inputs in the domain, what happens simply is that the networks fail to converge.
So making the language more compact actually results in not learning, and the way we have to make them learn is by giving them a really large linguistic capacity. And then, with very large linguistic capacity, they can become more or less compositional. They can become able to generalize.
But this kind of thing, which would be very intuitive-- that you just give them less capacity in the language, or even less capacity in the model-- we did not find that. But again, I suspect that is just something to do with the dynamics of gradient descent, and the fact that it might-- how is it called? Yeah, now, if my coauthors happen to be in the audience, they will be very ashamed of me, because whenever I mention this, it just shows how little I understand the maths of this thing.
But people talk about this lottery ticket hypothesis-- I don't know if you've heard about it-- that really, the reason why you need these very large networks is that during learning, you're finding a small subnetwork inside the very large one. And you need the very large one just to finally chance upon the right smaller architecture.
And then, after learning, you can also prune a lot of stuff. So just metaphorically speaking, I suspect that something similar is happening here: the networks need to have a lot of words, a lot of linguistic space to try things out, to then finally find a way to communicate. And just reducing that space with standard gradient descent will not lead them to a better language. It will just lead them to not learn.
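Marco's analogy can be sketched with a toy linear model: an over-parameterized fit is trained, then pruned by weight magnitude, and the small surviving "subnetwork" still solves the task. This is an illustrative sketch of the train-then-prune idea only (the data, sizes, and threshold are invented), not the actual lottery ticket procedure, which additionally retrains the pruned subnetwork from its original initialization.

```python
import numpy as np

rng = np.random.default_rng(1)

# The target depends on only 2 of 20 input features,
# but we fit an over-parameterized linear model using all 20.
X = rng.normal(size=(500, 20))
true_w = np.zeros(20)
true_w[[3, 11]] = [2.0, -1.5]
y = X @ true_w

# "Train" the large model (ordinary least squares).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prune: keep only the largest-magnitude weights (the surviving "ticket").
mask = np.abs(w) >= 0.5 * np.abs(w).max()
w_pruned = np.where(mask, w, 0.0)

# Only the 2 relevant weights survive, and the pruned model
# still fits the data almost perfectly.
n_kept = int(mask.sum())
error = float(np.linalg.norm(X @ w_pruned - y))
```

The extra capacity was needed to find the solution, not to represent it, which loosely parallels the finding that shrinking the emergent-language channel up front prevents the networks from converging at all.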
PRESENTER 1: Great. Thanks, Marco.
MARCO BARONI: Thank you.
PRESENTER 2: All right, it seems like there's no more questions. So I just want to thank you again, Marco, for having made the time to give us this really interesting talk.
MARCO BARONI: Thanks, and thanks for the very nice questions.
PRESENTER 2: All right, and everybody else-- and you included-- please continue to stay safe.