Boundary conditions for language in biological and artificial neural systems
Date Posted:
November 12, 2021
Date Recorded:
November 9, 2021
Speaker(s):
Andrea E. Martin, Lise Meitner Group Leader, Max-Planck Institute for Psycholinguistics
Brains, Minds and Machines Seminar Series
Description:
Human language is a fundamental biological signal with computational properties that are markedly different from those of other perception-action systems: hierarchical relationships between units (e.g., phonemes, morphemes, words, phrases), and the unbounded ability to combine smaller units into larger ones. These and other formal properties have long made language difficult to account for from a biological systems perspective, and within models of cognition. I focus on this foundational puzzle – essentially “what does a system need to represent information that is both algebraic and statistical?” – and discuss the computational requirements, including the role of neural oscillations across time, for what I believe is necessary for a system to represent and process language. I build on examples from cognitive neuroimaging data and computational simulations, and outline a developing theory that integrates basic insights from linguistics and psycholinguistics with the currency of neural computation, in turn demarcating the boundary conditions for artificial systems making contact with human language.
Research website: www.andreaemartin.com.
ANDREA E. MARTIN: I'm delighted to be here. Thank you all. So I want to talk about the properties that the system needs to represent and process human language. And in order to do that, I'm going to present the labors of love of my research team and then allow myself to wax poetic a bit about what the findings might mean for both biological and artificial systems.
So let me first start by explaining my title a bit. So we can define a space of systems that can have language and describe that with a differential equation and then try to find the conditions under which a solution to that equation can be found. Right? That's very dry. But the conditions, nonetheless, are what I want to go after today, in a more conceptual way.
So we can divide systems that can have language from those that can't. And in doing that, we can find boundary conditions, which, I want to argue, are going to be useful for us in formulating our theories of how systems like language might work. So the solution to the boundary value problem that delineates these two spaces-- another way of asking that question is, what do the systems that fall into the teardrop here have in common with each other that systems that don't fall into that space lack?
And then we can say that the boundary value returned along the curve that defines these spaces can help us get an insight into what those conditions might be. Now, to think about that in a more abstract way-- we can see that it already gets tough. Because-- spoiler-- it turns out that human language is put forth and utilized by the human brain, which is a frustratingly complicated, fascinating system if there ever was one.
Now I have to talk about some of the disconnect between the most basic aspects of the system, so neuroscience and language science. This is actually fodder for a whole other debate or series of talks. And many people have talked about this before. But the main point here, as others have emphasized, is that we don't know how to connect these dots. But the work of many people over the last 40 years has taken foundationally important steps to try to bring these two distinct disciplines and concepts together.
So what I want to highlight here just briefly, without going into all the important work that I'm tangentially referring to here, is that despite all the progress we might have made in the brain, language, and cognitive sciences-- of which there's much to be glad about, and which we should not look past-- there's still a persistent gap to be filled. And that gap requires us to transduce concepts where they are and as they may be already. And I want to make the case for that, for filling the gap earnestly in a piecemeal way, simply by starting with things like cleaving to the minimal formal facts of language.
So now for today's audience-- this problem that I've yet to dive into but which I'll explain in a moment-- explodes actually, as we try to add other relevant fields, for example, from other areas of cognitive science and their own taxonomies. So how will we ever get our dots to align, if we already struggle between neuroscience and language, when we also add in other concepts from artificial intelligence or from natural language processing?
So I suggest that we start with what we do know about language. And we know that what we know matters. So we know that associations, statistical regularities, and predictions are extremely important for both language as an object of study and for the brain. For example, I'm not talking about specific findings here, but rather huge bodies of work, showing that things like priming-- at the phonological, lexical, semantic, and syntactic levels-- and statistics are extremely important and definitional for language acquisition and development, and for very casual things like speaker adaptation. When you speak to someone, you adapt to their, I guess, orchestra of speech sounds.
Things like the word superiority effect, various semantic illusions, pop-outs. There's a huge, rich literature on all of these interesting things. More contemporarily, we tend to focus on things like cloze probability or surprisal or entropy, which we find to be highly predictive of both behavior and brain activity.
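To make those last notions concrete, here is a minimal sketch in Python of how surprisal and entropy fall out of next-word probabilities; the probabilities below are made-up numbers for illustration, not values from any corpus or model.

import math

# Hypothetical next-word distribution after some context, e.g. "I eat ...".
next_word_probs = {"cake": 0.50, "lunch": 0.30, "slowly": 0.15, "marmosets": 0.05}

# Surprisal of the word that actually occurs: -log2 p(word | context).
surprisal = {w: -math.log2(p) for w, p in next_word_probs.items()}

# Entropy of the whole distribution: the expected surprisal before the word arrives.
entropy = -sum(p * math.log2(p) for p in next_word_probs.values())

for w, s in surprisal.items():
    print(f"surprisal({w!r}) = {s:.2f} bits")
print(f"entropy over the next word = {entropy:.2f} bits")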
So this is just to give you a whirlwind message to say, look, associations and statistical regularities, and predictions based on all of the above, matter. So what we know matters.
However, the [INAUDIBLE] of what we as humans know, especially when it comes to language, is structure and meaning. And we can talk more about whether we want to see structure and meaning and statistics about them as separable at all. And I think there are compelling ways to think about them as orthogonal and also as not orthogonal, in interesting ways.
But the point here that I want to make is that we as a species seem to know how to make structure and meaning in pretty much any context. We have a bias to see the world in a structured way. So if we take just the phrase, complex structure and meaning, the identical sequence can easily come to have two different formal meanings. And that's because of the hidden structure of language, so the fact that language is compositional, that it can be built up and broken down into parts.
So this is an awful, inelegant cartoon, linguistics cartoon, to illustrate the point. So in the first tree, the nouns, structure and meaning, are separated in the tree itself. And they have their own individual branch. And the adjective, complex, only modifies structure. So we get the meaning, complex structure, but not complex meaning.
So I've circled in red the part of the structure that gets combined with complex in each structure. So in the second case, the nouns share a branch in the structure, leading to both being predicated by complex, such that both structure and meaning are complex in this denotation. The point here is that in order for there to be a difference in linguistic structure-- sorry-- in linguistic meaning, we have to have the structural distinctions to describe them.
But now the question is, how are we bringing basic notions of linguistic structure-- so in this case, of constituency-- veridical or not, so whether or not you buy into this cartoon or a different one-- into our theories of language and language processing and into theories of how we represent this kind of information in the brain? So in other words, not only are the formal descriptions at stake, but how we use them and bring them into our models and modeling interpretations of data is at stake, as well.
Well, the short answer is, these days, we don't so much. As we all know from daily life-- from tools like autocorrect and Google Translate and from language models that we might use professionally-- they don't always get language right. They all can miss out on core aspects. And in the extreme, this can actually be dangerous. It can also sometimes be comical.
But I believe that language models don't get it right because of how they represent language-- or the claims they make about linguistic representation-- so basically, at its core, as a sequence of associations. So now, no matter how highly dimensional or conditionalized that is-- whether it's one word related to another, or the probability of one word given another, or a distribution in a large corpus-- it's not necessarily a structure or grammar, although [INAUDIBLE] [? are an ?] [INAUDIBLE] [? grammars. ?]
But what a purely statistics-only approach misses out on is actually what's miraculous about language, and that's that we can understand things that are not highly probable nor predicted. So we can understand things like, for example, the phrase, marmoset getting a toothbrush head rub. So I can't be sure, but it's unlikely that you've ever heard the phrase, marmoset getting a toothbrush head rub before, unless you've heard a version of this talk. But I can be sure it was not a predictable utterance for you to hear.
So the point is, models based on associations alone can't explain how we do this, without extra machinery, at least, nor do they explain how we say things or interpret things that are novel or unexpected, or even just variants of what we've heard. Even if we go for a purely association-driven language model, we still need to explain how we span the distribution, why we say some more improbable or unpredicted things in certain contexts than in others.
So it's obvious, in this case, that statistical association and prediction alone, while they play an important role in language processing, can't tell us the whole story. Yet, in the dominant models in psychology, neuroscience, and AI, that's the main thing we go on-- not, per se, knowledge structures and grammar. The main thing we put into these models, first of all, is distributional information.
But at least, from a formal point of view, structure is what explains how we know what a toothbrush head rub is. So this is the problem that I want to focus on for the rest of the talk. And it comes down to the fact that language, from a mathematical point of view, is actually very interesting because it's an example of a system that's both algebraic and statistical. I think we don't pay enough attention to that wondrous fact.
So I think one of the best primers on this comes from the mathematician Tai-Danae Bradley, who works on category theory, among other things. And I think she's really carved [INAUDIBLE] joints here and says it more succinctly than many in our field of cognitive science do. So she says language is algebraic, since words can be combined to form longer expressions. And language is also statistical because some expressions occur more frequently than others. So her pithy observation leads to what I see as the main challenge for accounting for the facts of language in the brain, namely, how are we going to put together a system that can be both statistical and algebraic in a neural network?
So there's already so much that's known from basic linguistics and psychology, that-- per the same argument I'm making-- has not successfully yet been incorporated into our computational and neurobiological work on language. So I think that's actually quite a fertile ground for all of us to be focused on. I'm going to talk, not only about the challenge of structure and statistics in one model, but also about how timing-- something we know that is so important from psychology and behavior-- also follows the same pattern, how our models might benefit from incorporating both timing and conscious efforts to incorporate structure into them.
All right. So the approach I'm going to outline today can be summarized as trying to transduce concepts from different areas to constrain others in order to fill the gaps I was talking about in the beginning. So we're going to set out some desiderata that might be a little bit trying at times. But we're going to try to stay faithful to the key basic properties of formal linguistics. So for me, in this talk, it's going to be the notion of constituency.
We're going to try to also remain faithful to tenets from psycholinguistics and psychology, namely that computation is spread throughout time. Similarly, another really important thing that I've been inspired to think about a lot through collaboration with some great colleagues here in [INAUDIBLE]-- so Professor [INAUDIBLE] [? Roy ?] and Dr. [? Mark ?] [? Blackpool-- ?] is computational tractability, which is a whole other elephant in the room. But also then to maybe make the lives of our models even harder, we also want to try to make them as neurobiologically and neurophysiologically informed as possible. What we precisely mean by those statements, we can debate at length.
All right. OK. So how does this constraint and transduction of different known principles from different fields actually work? Well, arguably, you can just say, by using what has to be there. So we know that certain basics have to be there. So our descriptions from formal linguistics-- we have to have something that's structure-- and the processing tenets from psychology. We want these to shape our work in neuroscience and in computation.
So I can say that very easily to you and enthusiastically. We need both structure and statistics. But the how of that is going to be the rub. But I'll tell you about a model where we try to do a precursor of that, at least, so talking about how to have variable value independence in a neural network.
OK. Then I'm going to give you some empirical results that round us up to grounding the models that we're thinking about in what we know so far about neural readouts related to language and then talk about what kind of formal steps can we actually take to help us decide what we mean by structure and statistics and how we can realize them in neural networks.
All right. So when I talked about this approach, of trying to use principles from different areas of cognitive science that touch on our problem to constrain our models of that problem, we start inching forward to other, I guess, meta science, or philosophical problems, that are also very interesting. So we really want our models to start with a capacity that's to be explained, so not necessarily as interested in a model of a given task or a given phenomenon, per se. We want to get at the principles, or the causal structure, of the system that gives rise to the capacity that we want to explain.
Of course, then what is an explanation in cognitive science and cognitive neuroscience? Of course, it's an intensely interesting and difficult question. But to me, the modeling endeavor is about trying to establish, again, the first principles of the neural system that has the capacity you want to explain, in my case, human language. So we want a model that explains how language and language behavior arise, not a system that already asserts implausible principles that can mimic anything through statistical approximation. That's, I don't think, getting us closer to the goals that we've set out for ourselves.
So I think the way around that comes twofold. So first, holding on to your principles but also being clear that the goal of the model is to explain, not necessarily only to predict, and that successful prediction is not an explanation. Usually, when I have more time, I go into a great example about tide tables and theories of gravitation and the moon. But I won't put you through that today.
In any case, it's really important that we remember that predictions are not explanations. It's very easy to forget, because we're looking at what our models predict, or whether they predict the data. And that inferential link between the data we observe and the model we've put forth can often get turned around on itself.
So I think one of the best safeguards against that is holding on to your first principles or the constraints of the system that you know you can't disregard. So when we're setting out to make a model of language in the brain, we're not necessarily setting out to make up a new definition of what a formal language is or what a human language is. So inasmuch as holding on to those principles can help us design and interpret the neural readout or design the models that we are going to put forth about the brain and interpret the readout that we get, I think that those principles can serve us well.
Right. So this leaves us with the big dream, right, which is developing theoretical and computational models that are faithful to the formal properties of language writ large, at this point, and to what we know about the brain. So I want to start from the principle that brain computations are extended in time and are distributed across neural networks, whose dynamics reflect their computation, which, in turn-- again, a big spoiler-- I believe expresses linguistic structure. It's the dynamics of those networks.
So I'm going to devote this part of the talk now to trying to show instances of this. So each subsection of the talk is also a little microcosm of the approach I'm trying to espouse, where each subsection imposes constraints from one domain onto another. So in the first section, computation and psychology, we're going to try to use time in a neural network to encode structure.
Then in our neuroscience section, we're going to take instances of neural readout about linguistic structure and neural oscillations, or brain rhythms. Then in the third section, I'm going to talk about bringing time and an internal language model to an oscillating neural network. And then I'll synthesize everything together in a temporally-informed formalism that we've been developing in the group and a theoretical model.
OK. So to be pithy, we know a lot about the brain circuits involved in understanding language but not as much about the computations and processes that lead to the generation of structure in those networks. But representing structure isn't simple. And we, as language researchers, aren't the only ones in cognitive science who face this problem. So we can actually learn a lot from how other disciplines have faced it. But we do have to face it.
So to pursue the difference in the examples I gave you earlier with the complex structure and meaning, when I was talking, you likely exploited grammatical cues-- so the rhythm and intonation in my speech, or the signs of the sign language interpreters if we had them-- to map incoming speech or sign onto the grammatical structures and resulting meanings that your brain builds. And it's likely that we do this by combining our internal knowledge of language with the speech or sign sensory representation as it comes in, in a cue integration or synthetic process.
But where are we starting from? Essentially, all we have are bursts of energy in speech that we perceive as a sequence of words in time, or a sequence of movements and forms across space-time in sign. So it's not surprising that psychological models of language processing have always stressed the importance of timing and rhythm in perceiving structure and meaning.
So timing, rhythm, and prosody are critical for generating structural representations and interpreting the meanings that stem out of them. This is an ancient idea in psycholinguistics, but it's seldom been taken up in computational modeling and neurobiological theories. These ideas are starting to be reflected in those domains, of course, and there are always important historical exceptions. So here, I want to advocate that we should all be systematically trying to exploit time and rhythm in processing and computation to try to understand how structure is uncovered and represented in our neural networks.
So first, a quick primer on brain rhythms-- so already here, we can see how we need to incorporate a constraint from psychology, time and rhythm, into how we think about understanding neural signals related to language, and also constraints from neuroscience at the same time. So the brain can compute with a distributed neural network that cycles in and out of activation.
So we-- here-- see, again, a cartoon of raw EEG that can vary in its phase and power. And this is a typical lecture-style cartoon that you would see to illustrate how classically-measured signals like event-related brain potentials, which were used for so many decades in language research, actually can be composed of many underlying signals.
And the point here is simply to show you that brain rhythms contain more, and different, information than event-related brain potentials. And that's because the brain is an inherently rhythmic organ. It's rhythmic in its processing and computation. And that's simply due to the cellular facts of a complex synaptic system like this.
Similarly, speech is quasi-rhythmic. So by asserting that time and computation are important for the system, can we get a step closer then to bringing together how linguistic structure would be extracted from a rhythmic signal like speech or sign by a rhythmic computation device like the brain?
So now, to bring all of this together, I'm going to tell you about a single example of this constraint approach. And it's based on a highly influential study from Nai Ding and colleagues that was published in 2016, where synthesized speech and continuous presentation of the synthesized speech stimulus were used. Synthesized syllables were presented at an isochronous, or fixed, rate of one every quarter of a second, which produced a 4 hertz signal. And you can see that most clearly in the panel labeled B here on the figure, where you see a peak in spectral power at 4 hertz. That's what's actually in the stimulus that the participants heard.
But the important contribution of this work is that if you compare panel B to panels C and A, you see that the brain is tracking and, arguably, extracting information on top of, or in addition to, that 4 hertz signal, namely that there are power peaks at 1 hertz and 2 hertz, which correspond to the phrase and sentence grouping of these syllables, which, from an information point of view, is not in the physical stimulus. You can see that in panel A. So then the authors argued, quite influentially, that this signal, this readout, reflects the cortical tracking of words, phrases, and sentences-- how the brain is grouping this information.
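To make the logic of that inference concrete, here is an illustrative sketch, not the original analysis: build an impulse train at the 4 hertz syllable rate and a hypothetical response that adds activation at phrase and sentence boundaries, and only the latter shows power at 1 and 2 hertz. That is the sense in which those peaks reflect grouping that isn't in the acoustics. All numbers below are assumptions chosen to mirror the paradigm: four syllables per second, two-syllable phrases, four-syllable sentences.

import numpy as np

fs = 100.0                      # sampling rate in Hz (arbitrary choice)
dur = 40.0                      # seconds of "stimulation"
t = np.arange(0, dur, 1 / fs)
n_syll = int(dur * 4)           # one syllable every 250 ms

# Acoustic-like envelope: an impulse at every syllable onset (4 Hz).
syllable_onsets = (np.arange(n_syll) * 0.25 * fs).astype(int)
envelope = np.zeros_like(t)
envelope[syllable_onsets] = 1.0

# Hypothetical "grouping" response: extra activation at each phrase onset
# (every 2nd syllable -> 2 Hz) and at each sentence onset (every 4th -> 1 Hz).
grouping = envelope.copy()
grouping[syllable_onsets[::2]] += 1.0   # phrase-rate component
grouping[syllable_onsets[::4]] += 1.0   # sentence-rate component

def power_spectrum(x, fs):
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    return freqs, power

for name, sig in [("acoustic envelope", envelope), ("grouping response", grouping)]:
    freqs, power = power_spectrum(sig - sig.mean(), fs)
    for f0 in (1.0, 2.0, 4.0):
        idx = np.argmin(np.abs(freqs - f0))
        print(f"{name}: power at {f0:.0f} Hz = {power[idx]:.1f}")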
But now I want to ask, how are those phrases and sentences generated? So what is this readout actually reflecting? Is it the brain just tracking their timing or their occurrence? Or is it actually a reflection of computation, and that's what's causing the 1 or 2 hertz readout?
So in order to ask this inferential question, we first created a computational model that uses time to bind information and gives off oscillations as it computes. And we want to ask, OK, if we assert that this is a computational principle of the system, can we reproduce the signal? So the work I'm describing to you is actually part of a larger body of research. And it was developed in collaboration with Alex Doumas-- the creator of the DORA model, which I'll tell you about shortly-- and my colleague and good friend from the University of Edinburgh.
So the background that goes into this model is actually a whole-- at least one other talk, maybe two. But I'm going to give you the short rendition today. And it's based on the idea that other areas of cognitive science-- specifically here, the world of analogy and relational reasoning and concepts-- have long struggled over how to represent structured representations in a neural network. And we, as language scientists, can actually profit from that.
So they wanted to resolve this problem for other reasons. But I think at the end of the day, related reasons-- so being able to do things, like generalize out of distribution and to solve relational reasoning problems that you can't solve based on distributional information alone. So John Hummel and Alex Doumas have pioneered this so-called symbolic-connectionist tradition, which represents, in my mind, this wonderful synthesis of what's often seen as a great ontological divide, or epistemological divide. They've unified it in their symbolic-connectionist models.
And of course, again, so much more to say about this-- really fascinating how LISA functions and represents predicate calculus and how DORA learns that. Great to talk about. I'm not going to go into detail today. But I'm happy to answer any questions about it.
The necessary thing to know right now is that DORA is the model of relational reasoning and analogy. So it was created to basically simulate behavior and also the developmental trajectory of that behavior, so how kids learn to do relational reasoning tasks. So it's been used to simulate, at this point, probably more than 40 reasoning tasks and the developmental trajectory with great success.
So now I'll just talk briefly about what distinguishes DORA and what principles it espouses that set it apart from other connectionist networks. So historically, early symbolic-connectionist networks-- namely, LISA-- focused on just instantiating a role-filler binding calculus in a neural network but not on learning that information, while DORA learns that. That's interesting and powerful for other reasons and also brings forth some other problems. But it's a very powerful ability.
OK. But the key thing is that DORA is a network that's really focused on tuning and settling. It's a settling network, so it's focused on the internal representations at hand. So it's not in a feed-forward, input-output situation. It's really focused on, what do we settle on in our network? And that's the information that the model needs to perform.
OK. In order to achieve what it does, it needs these functionally separable banks of units, just basically so you can control the flow of information and inhibition in separable ways. Otherwise, it just uses Hebbian learning. Crucially, it's sensitive to time as an informational degree of freedom.
So when things happen, or how they relate in time-- the model can learn from that information. That's what allows it, in the end, to be able to predicate things. So that sensitivity to time is key. That's realized through lateral inhibition, but also through having integrative inhibitors on different timescales.
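To illustrate what time as an informational degree of freedom buys you, here is a toy Python sketch, not DORA itself: units that are bound together fire in the same time slot, a phase set, and distinct bindings fire out of phase, so the two readings of "complex structure and meaning" differ only in their firing schedule. All names and slots are made up for illustration.

def make_schedule(phase_sets):
    """Assign each unit the index of the time slot its phase set fires in."""
    schedule = {}
    for slot, units in enumerate(phase_sets):
        for unit in units:
            schedule[unit] = slot
    return schedule

# Reading 1: complex modifies only "structure".
reading1 = make_schedule([["complex", "structure"], ["meaning"]])
# Reading 2: complex modifies the conjunction of both nouns.
reading2 = make_schedule([["complex", "structure", "meaning"]])

def bound(schedule, a, b):
    # Binding is carried by *when* units fire, not by a dedicated binding node.
    return schedule[a] == schedule[b]

print(bound(reading1, "complex", "meaning"))   # False
print(bound(reading2, "complex", "meaning"))   # True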
So back to our toy example of this in this project that we did-- we wanted to ask, if we have a computational model that uses time to bind information and gives off oscillations as it does that, could we reproduce the signal that we found in language-related neural readout? And can that give us a clue into the interpretation of the readout-- namely, is this a way for a neural network to represent structured information?
So here's a simulation of the Ding, et al. paradigm using DORA, compared to an RNN. And the blue line represents the human data from [INAUDIBLE], and the dotted red spikes represent the activation of DORA. So we also used an RNN that doesn't use time in the same way and doesn't end up creating structures in its internal representations. That's shown in black.
So crucially, both models performed to the same criterion on predicting the next word, which is the standard performance metric. But the RNN didn't produce functional structures or oscillations to do so. DORA did. So from a behavioral point of view, they're identical. The point is, what did the models do to achieve those states?
And so this finding suggests that brain rhythms could reflect the binding of words and phrases over time. Again, this is not deductive. So I'm not telling you this is what it means. I'm just showing you a proof of concept, that this is a possible interpretation of the results.
This leaves us with a candidate mechanism-- which I'm going to explore further throughout the talk-- the idea that it's this time-based binding, using the timing of what you can think of as population codes going off in the network, that binds information across time through the layers of a neural network. That can give rise both to the kinds of neural readouts that we observe-- I'll give you a few more examples-- and to an account, a proto-account, of how perceiving structure from a sequence or combining words into phrases could produce these human-like oscillations.
So now we've constrained-- I mean, in the most ham-handed or writ large way-- we've put time in. We've gotten a version of structure. I wouldn't call it exactly linguistic, yet. But we've got a form of a role-filler binding calculus structure there. We've got these precursors, at least, to what we might need as linguistic ingredients-- for example, predicates-- into our model, or our world of models, so far. All right. Again, I'm happy to answer any questions in the Q&A about more details on that project.
All right. So now, I'm going to ask, can we observe the modulation of brain rhythms related to linguistic structure and meaning over and above the brain's response to things like sentence prosody and lexical content-- these classical things that we know from psychology and psycholinguistics are so important, right?
So if we really want to get closer to linking our models, the principles we espouse in them, our computational models, and our readout, do we want to have multiple constraints on these things, such that we have-- OK, we can create the representations we need, and we can give off readouts that look like the brain? If we're going to get closer to those facts, then we want to say, OK, are these readouts indicating more than just being supremely driven by speech? Because we know the brain is really driven by speech.
OK. So I'm talking about the work of Greta Kaufeld, who's recently graduated from the lab. And she asked whether mutual information between brain response and stimulus increases as a function of linguistic content. So here's just a cartoon again-- we're fans of cartoons-- of her processing pipeline, where you can see that basically, you take your speech stimulus and you take your brain data, and you do a similar signal processing pipeline to both of them to ask, how likely is it, or how much does it seem, that they're drawn from the same underlying probability density function?
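As a schematic of the general idea, not Greta's actual pipeline, here is a plug-in mutual information estimate in Python between a stimulus feature and a response signal, computed on synthetic stand-in data: a response that partially tracks the stimulus shows higher mutual information than one that ignores it.

import numpy as np

rng = np.random.default_rng(0)

def mutual_information(x, y, bins=8):
    """Plug-in MI estimate in bits from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))

# Synthetic "stimulus annotation" and a "brain response" that partially
# follows it (plus noise), versus a response that ignores it entirely.
stimulus = rng.standard_normal(5000)
tracking_response = 0.6 * stimulus + 0.8 * rng.standard_normal(5000)
unrelated_response = rng.standard_normal(5000)

print(f"MI(stimulus, tracking response)  = {mutual_information(stimulus, tracking_response):.3f} bits")
print(f"MI(stimulus, unrelated response) = {mutual_information(stimulus, unrelated_response):.3f} bits")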
So her stimuli were things like a sentence, so "Timid heroes pluck flowers and brown birds gather branches," versus these two other crucial controls, which you see coming up a lot in psycholinguistics, namely something called jabberwocky, which is something that sounds a lot like a sentence in a language but consists, in this case, of words that don't exist in that language. But it maintains the prosodic structure of a sentence and in this case, the morphology. And again, the study was run in Dutch, so I forgive you for not knowing what that means, either.
But the point here is that we included this jabberwocky condition because the prosody of the sentence and the prosody of the jabberwocky are indistinguishable from the modulation spectrum point of view. So here, we have a good control for how the brain might be driven by prosodic responses without the full richness of sentence processing. Then she had a word list condition, where the lexical content was identical to the sentence but had a different prosody because the words were read in a list. And this allows us to use these two conditions as proxies and to say, OK, is the mutual information between these signals-- these stimuli and the brain-- higher for the sentence condition? If so, then we've controlled for, in a sense, the contributions of prosody and lexical content.
So what Greta found is that, indeed, brain rhythms tracked linguistic structure. And again, this is not so surprising, given the Ding findings. But what's important here is that they do, even when you control for the effect on the brain of prosodic, acoustic, and lexical properties of speech. So we wanted to use natural speech stimuli, in the sense that we didn't want to have something that was synthesized, because we thought the brain is actually exploiting or taking advantage of the distributional cues in the speech, the coarticulation, and the fact that information is really smeared in speech, smeared in time, and that that actually is beneficial for the brain.
So we really wanted to leave-- have that natural speech stimulus. But in order to be able to know what we could infer, we needed these controls. So Greta is showing that when you cut the brain response up into phrase length, word length, and syllable length parts-- so you can see that in the different columns here, so phrase, word, and syllable-- that on the timescale of words and phrases, the brain is more attuned to linguistic structure than to prosodically and lexically-matched controls.
So you see this by looking at the fact that the sentence condition, the green dot, and its distribution is always higher than the orange, the prosodic control, and the purple, the word control. So this demonstrates that it's not acoustic, prosodic, or lexical properties alone that are driving the response or accounting for the mutual information between stimulus and response. So this pattern is true, even when we used a hyper-reduced annotation of structure. So this suggests-- we did all this footwork to really make sure that the stimuli were physically matched and matched on important prosodic and lexical factors.
But when we used a hyper-reduced annotation of structure, we get the same result. So that suggests that we didn't even need to model the fine detail of the acoustic response to detect the role of structure in shaping the brain response. So there's just more mutual information between these hyper-reduced annotations and the brain data when structure is present, versus when it's not.
OK. So to sum up, the most structured and compositional thing we presented organized the brain response most, so beyond the acoustic and prosodic signals. And again, perhaps not so surprising but an important step forward: the tracking is about linguistic content, not just the timescale that it's occurring on.
So I just told you that phrases are tracked more strongly when they're meaningful in their linguistic content. But what aspects of this linguistic content affect the tracking of phrases in natural speech? So PhD student Cas Coopmans wanted to follow up on this question by trying to get at the "what content" part. So as I've just shown you with this example and in the Ding study, but many, many others in the literature, temporal regularities are being tracked in speech, even if those become abstract and have to be inferred, like in Greta's PhD work.
We know they're being tracked. And abstract phrases seem to be tracked more strongly when they're linguistically organized into constituents, et cetera. But have we then zoomed in on what exactly is being tracked yet? Arguably, not really. So that's what Cas was after.
And to this end, he created an experimental design where we had a parametric modulation of linguistic content, so comparing sentences to idioms, syntactic prose, jabberwocky sentences, and word lists. So now we've exploded our conditions out a bit. But the key thing here is that we tried to parametrically vary the amount, or the degree, to which the interpretation or the meaning of the sentence is extracted from its form.
So that means that for all the conditions, the amount of lexical semantics is the same. So all conditions except for jabberwocky have lexical semantics. Good. And you can see the [? Xs ?] in there. Good. OK. All conditions except word lists have the same syntactic structure. And only for sentences does the combination of word meanings and their phrase structure yield a compositional interpretation.
So it's also possible to derive compositional interpretations of idioms, syntactic prose, and jabberwocky. But these interpretations are either not the intended meanings, or in the case of idioms, they're semantically odd or underspecified. We also tried again to match for acoustic differences. But in this case, the difference between sentences and word lists is even more exaggerated than in Greta's study, probably because of the particular stimuli.
OK. So for each recording, the stimuli were manually annotated and converted to the average frequency at which the phrases were presented. So this is similar to what was done in Greta's work. And then we used an annotation for each stimulus of how many syntactic phrases close at each word. And we used this as an abstract representation of syntax that doesn't contain acoustic information, similar to the reduced annotations I just mentioned.
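To give a concrete sense of what such an annotation might look like, here is a minimal sketch of the general idea, not the actual annotation code: it turns a bracketed parse into a per-word count of how many syntactic phrases close at each word, which serves as an abstract, acoustics-free regressor.

def closing_bracket_counts(bracketed):
    """Count the phrases that close at each word in a bracketed string."""
    counts, current_word, pending = [], None, 0
    for token in bracketed.replace("[", " [ ").replace("]", " ] ").split():
        if token == "[":
            continue
        elif token == "]":
            pending += 1              # this closure belongs to the last word seen
        else:
            if current_word is not None:
                counts.append((current_word, pending))
            current_word, pending = token, 0
    if current_word is not None:
        counts.append((current_word, pending))
    return counts

# "[[timid heroes] [pluck flowers]]": one phrase closes after "heroes",
# and two (the verb phrase and the sentence) close after "flowers".
print(closing_bracket_counts("[ [ timid heroes ] [ pluck flowers ] ]"))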
So you can see here. Right. So the key thing to focus on here is the idioms and syntactic prose, the two conditions which have a less straightforward, one-to-one relationship between structure and meaning. So again, we find stronger mutual information for sentences than for both syntactic prose and jabberwocky. So still, the sentences have more mutual information with the stimulus than those other conditions do. But again, we've got our word lists that are so acoustically different.
So when we compute mutual information for the abstract annotations of bracket count, we find, again, stronger mutual information for sentences than for jabberwocky and word lists, again, but no differences between the other conditions. So at first, this seems to suggest that the stimuli don't really affect cortical tracking of phrase structure-- the fine details of phrase structure. And we did a variety of post hoc tests and debriefs with the participants to probe what their interpretations of the idioms were. And then we computed the ERPs elicited by the sentence-final word in all syntactically-structured conditions, as well, and also found differences.
And these sentence-final differences show that the N400 for the sentences was smaller than for both prose and jabberwocky, which is not so surprising. So sentences come together in a compositional way that's more coherent. And these ERP findings just underline-- or emphasize-- the fact that participants did notice the semantic differences between conditions. So in other words, despite a clear difference between conditions in the extent to which compositional interpretation of the sentences yields a semantic interpretation-- the clear differences in the ERP are there-- it didn't seem to affect the cortical tracking, which is interesting.
And so if we have to answer the question, which aspects of linguistic structure are reflected in the cortical tracking of phrase structure? The best we can say at this point is that it's the properties of the input to structure building that are affecting the tracking response, not necessarily the compositional output. So it's the fact that you're getting things to put together into the structure that can be composed. Now, whether they're composed in a coherent way, that's not necessarily what the tracking is reflecting, or at least that's what we can say from this exercise, or this data set.
So it's important to show that the brain is, of course, sensitive to the compositional output. But the tracking response doesn't seem to be the readout that's relevant for that. And so this conclusion is in line with models in which cortical tracking of linguistic structure reflects the generation of that structure but that the structure building itself is partially lexicalized or driven more incrementally as words come in.
All right. I'm a little bit behind on my timing. But I'm going to speed up now. Now I want to show you that when we force-- so my last neural readout bit. When we force the stimulus to take up the same amount of time and be either a phrase or a sentence, we can learn something about the readouts that we're measuring here.
So to better understand how our internal representations, or expressions of our internal language model, are shaping the stimulus as it comes in-- another PhD student in the group, Fan Bai, compared synthesized stimuli that were exactly the same length in time but ended up composing a phrase or a sentence. And what he showed with these stimuli-- something like, the red vase, versus the vase is red-- is that you can make these stimuli as physically similar as possible and they still will sound like phrases or sentences to a native speaker.
But this is handy because now you can say, OK, from a physical point of view, nearly identical stimulus, what does this then do to the neural response and how can we then better understand the readout? So to hint at our main takeaway, we find that phase synchronization between sentences and phrases is clearly different in the brain when you have a maximally similar physical stimulus, with there being just more phase synchronization for sentences than for phrases.
And this is a coarse-grained prediction of a system being organized the way that I talked about DORA being organized at the beginning, where there are phase sets, or distributed codes that fire in a temporal pattern with one another-- for example, a word in relation to another word, or both in relation to a phrasal representation. In this firing of information in time to form a phase set, sentences have more of these kinds of phase set relations than phrases do, at least in how you would represent the constituents in a model like DORA. So in terms of the organization of neural assemblies, and inasmuch as it can be reflected in phase synchronization-- which you see here between these two time frequency plots-- phase synchronization is increased for the cases where you have more of these phase set relations, i.e. sentences, even when they take up the same amount of time and have the same power envelope shape as a phrase.
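As a rough illustration of the kind of measure involved, not Fan's actual analysis, here is one common way to quantify phase synchronization, inter-trial phase coherence, computed in Python on synthetic trials: a condition with consistent delta-band phase across trials yields a higher value than one with jittered phase.

import numpy as np

rng = np.random.default_rng(1)
fs, dur, freq = 200.0, 2.0, 2.0                  # sampling rate, seconds, delta-ish Hz
t = np.arange(0, dur, 1 / fs)

def simulate_trials(n_trials, phase_jitter):
    """Narrowband trials whose phase varies across trials by phase_jitter (radians)."""
    phases = rng.normal(0.0, phase_jitter, size=n_trials)
    return np.array([np.cos(2 * np.pi * freq * t + p) + 0.5 * rng.standard_normal(len(t))
                     for p in phases])

def inter_trial_phase_coherence(trials, freq, fs):
    """Length of the mean unit phase vector across trials at one frequency."""
    freqs = np.fft.rfftfreq(trials.shape[1], 1 / fs)
    idx = np.argmin(np.abs(freqs - freq))
    spectra = np.fft.rfft(trials, axis=1)[:, idx]
    return np.abs(np.mean(spectra / np.abs(spectra)))

cond_a = simulate_trials(60, phase_jitter=0.3)   # tight phase alignment across trials
cond_b = simulate_trials(60, phase_jitter=2.0)   # loose phase alignment across trials
print(f"ITC, consistent-phase condition: {inter_trial_phase_coherence(cond_a, freq, fs):.2f}")
print(f"ITC, jittered-phase condition:   {inter_trial_phase_coherence(cond_b, freq, fs):.2f}")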
Fan has also been working on encoding models of these data to better understand the readout and is getting good separability by hemisphere and in the theta and delta bands, suggesting that theta and delta play key roles for the brain's processing of phrases versus sentences, even when the physical stimulus is nearly identical. And finally, Fan wanted to examine one of the clearest predictions of existing models of speech processing by neural oscillations-- that's Giraud and Poeppel-- which says that phase amplitude coupling, or the relationship between phase and power across theta and gamma, is crucial for segmenting syllables and decoding phonemes.
This is also, of course, related to [INAUDIBLE] work, [INAUDIBLE] in the audience. But here, we just wanted to emphasize, we didn't find any difference in theta-gamma coupling between phrases and sentences. It's not sensitive to that distinction. We do find increased theta-gamma coupling when participants were listening to speech, compared to resting state. So it's an interesting-- it's a good readout or signal of speech processing, but it doesn't seem to be sensitive to higher-level variables, like whether you're hearing a phrase or a sentence.
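For concreteness, here is a back-of-the-envelope sketch of one standard way to quantify theta-gamma phase-amplitude coupling, the mean vector length of gamma amplitude with respect to theta phase, run on a synthetic signal; this is not the study's pipeline, and all parameters are illustrative choices.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

rng = np.random.default_rng(2)
fs, dur = 500.0, 20.0
t = np.arange(0, dur, 1 / fs)

# Build a signal whose 40 Hz amplitude waxes and wanes with 5 Hz phase.
theta = np.cos(2 * np.pi * 5 * t)
gamma = (1 + 0.8 * theta) * np.cos(2 * np.pi * 40 * t)
raw = theta + 0.3 * gamma + 0.5 * rng.standard_normal(len(t))

def band(x, lo, hi):
    b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
    return filtfilt(b, a, x)

theta_phase = np.angle(hilbert(band(raw, 4, 7)))
gamma_amp = np.abs(hilbert(band(raw, 30, 50)))

# Mean vector length: high when gamma amplitude depends on theta phase.
pac = np.abs(np.mean(gamma_amp * np.exp(1j * theta_phase))) / np.mean(gamma_amp)
print(f"theta-gamma coupling (mean vector length): {pac:.2f}")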
All right. So a quick summary of the readout section: tracking does seem sensitive to abstract structure. Exactly what that structure is, we're not totally sure yet, but clearly it's the generation of structure as it unfolds that seems to matter. And this is consistent with a principal role for these endogenous brain rhythms-- as proposed in a series of recent papers that I list here-- where signals appear more rhythmic, despite the pseudo-rhythmicity of speech, because of the contributions of your knowledge of language, or an internally-generated series of linguistic representations, which I've lately been talking about as your internal language model.
OK. So this brings me to my whirlwind last section, which I hopefully can get through, where I'll talk about what might that internal language model look like. And then how can we then use that to, again, go back to our constraint approach and try to constrain the way we think about and model these problems?
OK. So this is the work of Dr. Sanne Ten Oever, a postdoc in my group, who started working on modeling a corpus of spoken Dutch. So these are conversations that are recorded during customer service calls, so full of natural speech, not so many long sentences, lots of pauses and interruptions. So Sanne was interested in explaining how timing in speech and computation in a model might relate, with the ultimate goal of relating a model that we're developing to neural data from natural speech.
So first, we started by incorporating linguistic structure constraints as feedback. And those were extracted from a spoken corpus using a deep net. And at the same time, she incorporates neuroscientific principles-- so inhibition-- and an internal rhythm to computation that mimics, in a very coarse way, how the brain might compute in time. And she wants to see if adding these components to the model improves its performance.
So let's see if I can get this animation to work. So she has her speech signal and her neural oscillations, and she wants them to align in the model to see if the pseudo-rhythmicity that you see in speech is actually a consequence of the linguistic content in speech and how the internal model is serving that up. Right. This animation is going better than I thought. OK. Now it's going to repeat, unfortunately.
Yes. So speech isn't isochronous. We need our words to vary [INAUDIBLE]. OK. So how then can we add these content predictions to our existing oscillatory tracking models? So we wanted to use a model with the following components. So we have our stimulus input. Then there's a main processing level that has oscillations, inhibition, and feedback based on the predictions that we've extracted from a large corpus.
And then we can train the model with a simple sentence like, "I eat." And the sentence is going to have a lot of different predictions about what the next word can be. It can be something like, "I eat cake," which is highly predictable. Or "I eat delicious cake," or "I eat nice cake," which are less predictable.
So in the output of the model, you can see the sensory input "I eat" and the feedback level after the word "eat"-- I think you can see that in the model outcome, in the cartoon of the model. So the feedback from the prediction goes back to the main processing buffer. And then you have oscillations, or inhibition, that gate that process.
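Here is a highly simplified toy in Python inspired by this kind of architecture, my own sketch rather than the published implementation: an ongoing oscillation gates when an incoming word can reach threshold, and feedback proportional to the word's predictability lets predictable words cross threshold earlier in the cycle than unpredictable ones. All parameter values are invented for illustration.

import numpy as np

fs = 1000.0
t = np.arange(0.0, 0.5, 1 / fs)              # one roughly word-length window
oscillation = -np.cos(2 * np.pi * 4 * t)     # 4 Hz excitability, rising from its trough at word onset
threshold = 1.2
sensory_drive = 1.0                          # constant input while the word is on

def recognition_time(predictability):
    """Time at which total drive first crosses threshold (None if it never does)."""
    feedback = 0.5 * predictability          # prediction-based pre-activation
    total = sensory_drive + feedback + 0.5 * oscillation
    crossed = np.nonzero(total >= threshold)[0]
    return t[crossed[0]] if crossed.size else None

for word, p in [("cake (after 'I eat')", 0.8), ("nice (after 'I eat')", 0.1)]:
    print(f"{word}: predictability={p:.1f}, crosses threshold at "
          f"{recognition_time(p) * 1000:.0f} ms")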
So just simply with these very broad, architectural constraints, the model's been able to explain a bunch of behavioral effects showing different categorization, dependent on the phase of presentation. And it needs all of its components to do that. So it can explain these so-called phase-dependent categorization effects that Sanne has found behaviorally. And we're also working, of course, to ratchet this up-- to be able to interface with MEG data.
But here, the cool thing is that we were able to take linguistic content, or statistical information derived from a large corpus, that largely reflects whether a word is predictable or not, incorporate that with a very basic notion of oscillation, and produce a bunch of behavioral effects but also other benchmark principles of how the model should behave, which we outline in the paper, which I won't go through all the details of now. But it's actually the stronger rhythmic response when the input timing matches the internal model that causes the rhythmic response to speech, even though the speech itself is not isochronous, right? So in other words, we get a stronger rhythmic response from the network when the input timing matches the internal model's predictions.
And interestingly, this is enhanced when you combine it with an internal model making onset predictions, the endogenous oscillation, and the fact that the input is coming in a non-isochronous way. That actually gives you a stronger rhythmic response overall in the network because of the contribution of the internal model. And from a principles point of view, that's nice because it explains, OK, how can the brain respond so rhythmically to what is essentially a non-rhythmic signal? And if it's doing so because of the fact that it's deploying its linguistic knowledge, that seems to dovetail quite nicely.
OK. Now I'm going to try to integrate these other sections into my world view here. And I'll close with talking about the work of Karthikeya Kaushik, who is a master's student and RA in the group, who's done a phenomenal amount of work in his short time with us and who is on the market for a PhD position and is a fabulous, fabulous scientist.
So Karthikeya got interested in the fact that most of the formalisms that we would talk about in lab meeting and such for language always considered the end state representation. So we're talking about modeling, we're talking about how to interpret neural readout, how to build our models. And we were always going back to the fact that, oh, it'd be lovely if we had a formalism that could constrain how we put our models together that wasn't just the end state of processing.
So he's been working on a way to tackle this problem using category theoretic descriptions of how grammatical sequences of words can relate to grammatical syntactic structures. And that's what this figure represents. So far, Karthikeya's been able to express the rules of a grammar as discoverable, based on abstractions made over sets of constituents. So you can think of the sets as your way of itemizing things through time.
So this is a category theoretic pushout here that's used as a form of glue between sets of linguistic representations, so here, words and phrases, over time. So evidence for membership in an equivalence class, here denoted by q, is incrementally obtained in the sets Ti*. So here, then, the grammar is simply a way of identifying or restricting valid structures from an infinite set of possible structures. But the process is now extended in time. So how you go through this pushout in time is the formal structure, rather than what the end state is.
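For readers who haven't met the construction, here is the standard definition of a pushout in the category of sets, which is the general device being used here; the specific objects from the work in progress, the sets Ti* and the class q, are not reproduced.

% Pushout of sets along f: C -> A and g: C -> B (general definition only).
\[
  A +_{C} B \;=\; (A \sqcup B) / \sim, \qquad f(c) \sim g(c) \ \text{for all } c \in C,
\]
% Universal property: for any X and maps u: A -> X, v: B -> X with
% u \circ f = v \circ g, there is a unique map A +_C B -> X through which
% both u and v factor.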
So we're hoping that this formal device will help us begin to include time-- now, in a more, I guess, 2D way-- or at least a notion of incrementality into our formal descriptions of language. So that'll allow us to mathematically describe how a given syntactic structure relates to a given sequence of words. And so we aren't there yet, but one could imagine that if you're taking a language model, that you could make at least a control structure or reference point, where you had these different descriptions of how a sequence in the language model might relate to a given structure that you don't yet have in your language model. At least then, you can track, as things unfold in time, where you are in that trajectory. Or at least that's my hope.
All right. So now I'll close by quickly summarizing the theoretical model that I've been developing since 2016. So here again, we have this internal model of language that's combined with sensory input like speech or sign. And the key principles that I focus on here are how gain and inhibition could be used to package information in time. As I've talked throughout the talk-- so like in a model like DORA or in a formalism where you divide things up into sets and step through time-- it's that same concept.
And here, this is a conceptual model where I talk about how gain and inhibition might be used to do that. And it's this packaging that leads to these rhythmic signals that we observe in neural readout. And so my goal here was to try to explain how hierarchical structure and system properties like compositionality might be able to arise in neural networks bounded by known computational principles in neuroscience. So I wanted to do that while taking into account time and incrementality, as you can see. It's a theme. So that results in this analysis-by-synthesis flavor.
So the latest incarnation of this model has recently come out in [? GOCN. ?] And I'll just summarize the core claims here. So here, the core claim is that linguistic structure arises in the temporal dynamics of neural networks. The way that it arises is a form of perceptual inference. That's important because it allows us to account for language in a way that's not fundamentally different from how other brain computations and perception might work, which I think is important.
The sensory percepts cue your internal model of language. All the representations that are relevant here are distributed codes. And it's the temporal distribution of these activation patterns-- which you can think of as population codes-- that then results in a functional linguistic structure, a la DORA, so all of the work I briefly spoke about with Alex Doumas.
And again, it's gain modulation and, crucially, inhibition-- these are the assembly-level computations that produce representations of linguistic structure from sensory inputs. And in the paper, I have more figures, pseudocode, a glossary, and a range of predictions. And it's really an attempt to synthesize concepts from systems neuroscience and mathematical models of computation and psycholinguistics. So the model ends up describing nonlinguistic representations as trajectories in a manifold that render linguistic structure building an expression of a coordinate transform in the brain-- an operation that repeats across the brain but is multiplexed in time.
OK. So to leave you with some takeaways-- if we want a future where we can explain how language becomes algebraic through statistical experience, an abduction over that experience, and in which we can discover the properties of the language system in the mind and brain in a way that stays faithful to what we know about them, how do we get there? Well, I've tried to give a rough outline here, basically from using mutual constraint from different areas in cognitive science.
And one way to think about that is to ask yourself, what are the minimal claims about the representations that I'm dealing with in the phenomena I'm trying to model? And for us interested in language, that means it's both structure and statistics. And as I alluded to earlier, there's some really interesting questions behind that.
So what does it actually mean? How are those actually different? There are many ways that they could be expressed, as things are multiply realizable. But really getting into the nitty-gritty of that can be really rich and helpful, I think. And the way to do that, of course, is to try to ground our models in formal and conceptual analysis, and again, also to take constraints from both sides and try to situate our computational-level theories within the constraints of our implementations, in our neurobiological models.
OK. Right. So my last self-indulgent list of takeaways-- so we want independent or orthogonal evidence for a given computation beyond just our traditional kinds of things in neural readout, where we have a region, or a model, or a network alone. We want generalization of our predictors beyond tasks. Because again, what I mean by this is that we're trying to get at a capacity that we're explaining, right, not a particular task or effect.
To be even more broad, this might mean-- as I've alluded to a little bit-- that we have to "brain" the notions that we have from linguistics and psychology in order to get them into our models. So we need to do more than, for example, build localized box models, or try to find bracket counts in a region of the brain, or draw a tree structure on the brain, as I've done here. We need to maybe think a bit more broadly about how the brain might be able to represent things in a way that doesn't deny the formal facts but that might not be as traditional as we thought.
For example, one of the great topics for this, I think, is the distinction between syntax and semantics in the brain. So I fully believe that they're formally orthogonalizable. But I'm not convinced anymore that the brain draws that distinction in a functionally orthogonalizable way. But I don't think that that means that syntax and semantics aren't separable. Anyway, OK. I think this also means that we need to move away more from a modularist view of the neural networks that we study.
Right. OK. I promise, last slide of takeaways. So successful predictions are not explanations. So good model performance isn't evidence that what we think is being represented is actually being learned, or that the mechanism we want to be doing it is there. And we so easily fall into this way of thinking all the time. I mean, it happens to me daily. I almost have to wake myself up from it. But this is chiefly because of the principle of multiple realizability, which I'm happy to talk about further.
So what I've tried to persuade you of today is that we want to compute structure and be guided by statistics. We want to have time and rhythmic computation in our models. And that-- spoiler-- phase synchronization and network dynamics are important for representing structure and meaning in the brain. And all this, I hope, will help us stumble towards a theory of how neural systems represent language beyond statistical association. So thank you, and thank you to my group.