Characterizing complex meaning in the human brain
Date Posted:
May 1, 2023
Date Recorded:
April 25, 2023
Speaker(s):
Leila Wehbe, Carnegie Mellon University
Brains, Minds and Machines Seminar Series
Description:
Abstract: Aligning neural network representations with brain activity measurements is a promising approach for studying the brain. However, it is not always clear what the ability to predict brain activity from neural network representations entails. In this talk, I will describe a line of work that utilizes computational controls (control procedures used after data collection) and other procedures to understand how the brain constructs complex meaning. I will describe experiments aimed at studying the representation of the composed meaning of words during language processing, and the representation of high-level visual semantics during visual scene understanding. These experiments shed new light on meaning representation in language and vision.
Bio: Leila Wehbe is an assistant professor in the Machine Learning Department and the Neuroscience Institute at Carnegie Mellon University. Her work is at the interface of cognitive neuroscience and computer science. It combines naturalistic functional imaging with machine learning both to improve our understanding of the brain and to find insight for building better artificial systems. Previously, she was a postdoctoral researcher at UC Berkeley, working with Jack Gallant. She obtained her PhD from Carnegie Mellon University, where she worked with Tom Mitchell.
PRESENTER: Hi, everyone. Nice to see this good crowd here. It is a huge pleasure to introduce Leila Wehbe today. And I can't resist starting with a kind of irrelevant but cute story of how I met Leila, which is around 15 years ago. I was traveling in Syria. Never mind, I just was traveling in Syria. And I stopped someplace to read my email. And there's an email from this undergrad student at American University in Beirut. And she looked really impressive in her email and wanted to come work in my lab.
And I said, well actually, I'm in Syria. You're in Beirut. I'm going to be in Damascus in a few days. If you can figure out how to get there, we can meet in Damascus. And she said sure. And so we met in a coffeehouse in Damascus. And this incredibly magnificent city-- I hate to think what's going on there now-- but we had just an amazing day hanging out in Damascus and chatting. And lucky me, later, the next summer, Leila joined my lab.
She did a bunch of fancy mathematical stuff with my then grad student Ed [INAUDIBLE]. I can't even remember something about optimal Bayesian decision making, I don't know. Anyway, it was great to have her around. She then went on to finish her undergrad degree at AUB and go on to grad school at Carnegie Mellon, where she did a bunch of machine learning and cool work with Tom Mitchell and then a post-doc with Jack Gallant at Berkeley.
And then she was hired back at Carnegie Mellon five years ago-- did I get that right? And so Leila's work tackles one of the most important, fundamental, and fascinating questions one can ask, which is, how is meaning represented in the brain? And her particular angle on this is to use fancy math, machine learning methods, and artificial neural network models to try to understand in particular how we compose meaning-- not just the meanings of individual words, but how those words go together to capture more complex meanings in sentences, and also how meanings of individual objects go together in complex images.
And I won't try to tell you about what she does. Instead, I will turn it over to Leila, whose title is "Characterizing complex meaning in the human brain." Thank you, Leila. Welcome.
[APPLAUSE]
LEILA WEHBE: Thank you so much, Nancy, for this great introduction. And thank you for allowing me to work in your lab 15 years ago and setting me off on this great adventure. Thank you so much. Hi, everybody. As Nancy said, I'm at CMU. I'm in the Machine Learning Department and the Neuroscience Institute. And I'm going to talk to you today about characterizing complex meaning in the human brain. I'm going to talk mostly about language processing, but about other modalities as well.
And just to think a little bit about the complexity of meaning in language, we can look at this very interesting data set called the Winograd Schema Challenge. So this was developed as basically a task for NLP models to test whether they understand language or not. It's kind of a Turing test if you would like. So you can read the sentence with me. "The large ball crashed right through the table because it was made of Styrofoam."
And if I ask you what was made of Styrofoam, was it the ball or the table? Who thinks it was the table that was made of Styrofoam? Most people do. So you solved this correctly. You pass the Turing test. And if you look, on the other hand, at this sentence, "The large ball crashed right through the table because it was made of iron," and we ask what was made of iron, most of you would say the ball.
So let's think of what you need to do in order to understand the sentence and do correct pronoun reference resolution. So in order to know exactly what the word it means, you have to first, of course understand the meaning of those individual words and basically the syntactic structure of the sentence, how it's combined together. But then you also need to remember the properties of balls and tables. And you need to also maybe do some visualization. So really, it's a lot of really complex stuff.
And you can recruit many other cognitive modalities other than language per se, like reasoning or maybe some imagination, things like that. And we are actually very far from a computational model of language processing that can tell us exactly what the operations are that the brain does in order to solve something like that. And when I say a computational model of language processing, I think of something like what we have in vision.
Now, of course, in vision, we're very far from understanding everything that's going on. But we do have a good working model of what the individual neurons in different parts-- for example, in the retina, V1, V2-- are doing: what kind of representations they have, how they compute them, and how they pass them to the other regions upstream. But in language, the field is very different, mostly because we don't have good animal models of language-- or any true animal models of language.
And so we do know which areas are involved in language processing, but we don't know exactly how these areas are connected or how they pass information between each other. We also don't know exactly what the code is-- what exactly are they representing? There are definitely many theories or models out there that say at the coarse level what function is processed where, like syntax or semantics.
But we don't really know what the neurons are doing-- what the analogs of edge detectors are in the domain of language, and how a neuron computes them. So we still have to answer this question: how does the brain compute and represent complex meaning? But also, a reminder: with the methods we have, like noninvasive brain imaging and mostly human participants, we can't really answer "how" that easily.
Most of the time, we're basically still answering questions like where, when, and what is being represented. So let's just start with kind of a first-level question, which actually was still not resolved. Once you have a sentence and you actually understand that sentence, you generate new meaning by understanding it-- where does this meaning go? Where is it represented? Until recently, this was still unclear. It was suggested to be the ventromedial prefrontal cortex by Liina Pylkkanen.
But it was still unclear. A lot of people have studied composition in neuroscience, mostly studying when those operations happen and where they happen. But what about the new representation that emerges after you do the composition? That was still unclear. So we're going to focus on that question for now: specifically, where is this complex meaning represented? And I'm going to simplify it into a set of binary questions so that we can keep track of them during the talk.
So let's assume you understand a sentence, and you generate a new meaning by understanding it. Is this new meaning going to be stored in the same regions as the more simple meanings, like the meanings of the individual words? You might think yes-- it's not a different kind of meaning, it's just something you generated, and you use the same machinery to represent it. Or you can think that no, maybe there's some hierarchical meaning that you're building, where other regions are representing the more complex meaning.
Now, in order to design an experiment to study this complex meaning, you need to choose your modality. So does it matter which recording modality you choose to study this complex meaning-- [INAUDIBLE] do you get different results if you use, for example, fMRI or MEG imaging? You might think that yes, it matters, because these different modalities are recording basically different aspects of brain activity, and so they might tell you really different information about meaning composition.
Or you can think that no-- of course they're recording different aspects, but it is still the same underlying process, so whatever you decode from these two modalities will be very correlated with each other. Also, still thinking about the methods: do you always need to approach the problem in a theory-driven way, where you define hypotheses first and then test that set of hypotheses? Or can we learn the relevant representations directly from the brain?
So again here, you can think that yes, you just need enough data from the brain and a good modeling approach-- this is maybe the optimistic view of the problem. Or you can think that no, you can have a model that predicts brain activity, but many versions of this model-- if you change them a little bit, or you rotate the features, et cetera-- will still be able to predict the same brain activity. So it's unclear how much the ability to predict brain activity tells you about the brain.
But let's assume you actually do this, and you manage to learn some representations from the brain activity. Would these representations-- these models that you learn directly from the brain-- be useful for other things, for actually building better AI systems? And here again, the optimistic view is yes: the brain is the only system that understands language, and so it makes sense that training a model on the brain will give you clues on how to build a system that understands language.
And on the other side, the more pessimistic view is that the brain and AI don't need to approach the problem in the same way, and so maybe it's not useful at all-- it's just the airplanes-and-birds version of this. These are the questions we're going to address. And my approach is made of two components.
The first component is naturalistic experiments. These image the brain while individuals process complex information. So basically, I run experiments in which people are reading in the scanner, but sometimes also producing language, engaging in conversation, watching movies, et cetera. And the idea here is that there are no clear conditions in the data set. So at the end of the experiment, you get something like a time series of words and a time series of brain measurements.
So you end up with basically a stream of these two types of data. You can think of the brain measurement as just a row of voxels, where every voxel is going to be modeled independently. And now, in order to make this work, you also have to replace the words with another computational representation. We call this a feature space. And here, this is a very simple feature space-- a simple syntactic feature space, just a toy example.
And the idea is that you can annotate every point in the experiment with information in that manner and then build a model that goes from this feature representation of language to the brain activity. These are called encoding models, and they are powerful tools to investigate brain representations. So this is the second aspect of my approach: we replace the stimulus with a feature space, build a predictive model, and then test this predictive model on held-out data.
So in this case, if we're using syntactic features, you replace the text, the held-out text with the syntactic properties. You use the encoding model that you trained on your training data. And you predict the data for the held-out set. Of course your held-out set would be much longer than just four time points. And so here, if the predicted data has nothing to do with the real data, you can't say much about that brain region.
But if there is a strong correspondence between the predicted data and the real data, which you can measure using something like correlation, then you can conclude that, because you have high prediction performance using, in this case, syntactic features, this voxel represents syntactic information, or at least information that's related to syntax. And so this is basically just a very quick definition of what an encoding model is.
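To make that concrete, here is a minimal sketch of such an encoding model in Python. The feature matrix X, the brain matrix Y, and the ridge-regression setup are illustrative assumptions, not the exact pipeline used in the talk.

import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV

def fit_encoding_model(X_train, Y_train, alphas=(0.1, 1.0, 10.0, 100.0)):
    # One regularized linear model per voxel; RidgeCV handles the
    # multi-output case, so every voxel is still modeled independently.
    model = RidgeCV(alphas=alphas)
    model.fit(X_train, Y_train)
    return model

def evaluate_encoding_model(model, X_test, Y_test):
    # Correlate predicted and measured held-out activity, voxel by voxel.
    Y_pred = model.predict(X_test)
    return np.array([pearsonr(Y_pred[:, v], Y_test[:, v])[0]
                     for v in range(Y_test.shape[1])])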
Now, encoding models are much more popular, and you've probably used them or encountered them before. In my lab, we do have a few works on improving encoding models specifically. I'm not going to talk about them, but these are more computational works on how to train them better or how to test them better. I can talk about this later if someone is interested. And I think the most exciting thing about an encoding model is actually the feature space.
So this is what allows you to actually test your hypotheses. With the same experiment, you can test multiple hypotheses by simply changing the feature space. For example, you can have very explicitly defined feature spaces, where you design each feature space to be something specific that you want to model. You can care about syntax or semantics, or maybe more high-level narrative things like what the characters are doing.
In another work, we also used behavioral measures as our feature space. This was basically self-paced reading times and eye-tracking measures of people reading stories. And we used that information to model comprehension difficulty in another set of participants who were listening to those stories in the scanner. And finally, we can also use implicitly defined features learned from large corpora of data-- most of the time, neural network language models.
So in this work, back in 2014, we took a recurrent neural network language model. This has a context vector that's latent, and from that, it predicts the next word in a sequence-- that's what a language model does. It also has a fixed representation of each word, like the word Harry here. So basically, every time it sees a new word like Harry, it combines it with the previous representation of the context and generates a new prediction for the next word.
And it keeps going in the same way. And so now, you can think of a person reading these words one by one, and you can think of how you can use these vectors-- representing either the fixed meaning of the word or the meaning of the previous context-- in order to study exactly these representations in the brain. And so here, we used MEG activity in order to trace both how context is represented and modified in the brain as you see the new word, as well as the perception of that new word.
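As a rough illustration of where those two feature spaces come from, here is a hedged sketch of a generic recurrent language model that exposes both a fixed word embedding and a running context vector. The architecture and sizes are placeholders, not the 2014 model.

import torch
import torch.nn as nn

class TinyRNNLM(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)    # fixed representation of each word
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)         # predicts the next word

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        word_vecs = self.embed(token_ids)             # one vector per word
        context_vecs, _ = self.rnn(word_vecs)         # running context after each word
        next_word_logits = self.out(context_vecs)
        return word_vecs, context_vecs, next_word_logits

# word_vecs[:, t] and context_vecs[:, t - 1] can then serve as the word and
# context feature spaces in an encoding model like the one sketched earlier.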
And we see, going gradually from the back of the head to the front of the head, the perception of the word moving forward, as well as an update of the latent context in the brain. And later on, around 2018, this method was also applied to EEG to study grammatical or syntactic processes, and also to fMRI to study how much context is maintained in different regions of the brain.
And it's very exciting, because since then there's been so much work in this area, including great work by people here-- a very exciting development. It's really becoming its own field. These are just a few of the papers up until 2021, and there's much more since then as well. So it's really become its own subarea: using NLP models to model brain activity.
But it's not a solved problem in itself, because even though NLP models are a very promising source of language features, their forte is that, even though they were constructed without any linguistic theory input, they're still super good at what they do. And that's actually a double-edged sword, because they're also very uninterpretable.
So if you find a strong relationship between a neural network layer and an area in the brain, it's still unclear what that tells you, and it's unclear how you're going to make scientific conclusions just by doing this alignment. I think a solution to this comes from augmenting your encoding models to improve your ability to ask scientific questions. And so I'm going to talk in this talk about two ways you can do that.
The first one is something we call computational controls. My previous student, Maria, and I kind of came up with that term, and we're hoping it catches on. What we mean by that is that instead of designing a controlled experiment, where you embed your controls in the experiment in terms of a condition of interest and a control condition, you just collect a lot of data in the most naturalistic way you can, and then you later isolate the effect computationally. I'll show you very soon an example of how that can be done.
And another method that can also help improve the ability of your encoding models to make inferences is end-to-end modeling. Here, instead of getting a model, fitting it on a large data set, and then later mapping or aligning between that model and the brain activity, you fit the model directly on the brain data. In some sense, you learn the feature space that's relevant for the brain directly from the brain data.
And then you can investigate what you learn and see what it tells you about the brain activity. All right, so now I'm going to talk about two separate projects. The first one is about language: aligning natural language processing in the brain with natural language processing in machines. The second one is about visual processing-- more specifically, hypothesis-neutral models of higher-order visual regions in the human cortex.
So starting with the language work, I'm going to highlight mostly this particular study, which was recently published, where we're using these computational controls that I told you about with natural text to reveal different aspects of meaning composition. So again, remember, we're thinking about the end product of meaning composition: once you combine words together and you get this new meaning, where is it represented?
More specifically, if you look at this quote here, "The little boy finally finished his pasta," you understand that the little boy actually finished eating his pasta, even though the word eating is not in the sentence, right? So this is a new meaning that you got to by combining the words together. We're going to call this new meaning the supra-word meaning-- just another term we use to refer to this specific meaning. And I want to know where it is represented.
So again, we use naturalistic imaging, where participants are reading a text in the scanner. And as a feature space, we use a context embedding and a single-word embedding from a language model that was pretty big at the time-- it's no longer, I think, that big. This was ELMo. And so when we look at these two vectors, remember, the context vector is representing all the words that occurred before the current word.
And the word embedding is representing the current word. And we're trying to see where in the brain do these vectors predict activity well. And the areas that are in red are predicted by the context vector. And the areas that are in blue are predicted by the word vector. And the areas that are in white are predicted by both of them. And as you can see, there's a lot of overlap between what is predicted by the context and what is predicted by the current word. And that's actually not surprising at all.
Because the way that the language model works is by computing a context vector that's most predictive of the next word. So it makes sense that there's a very strong correlation between the embedding of what's happened before and the embedding of what is going to come later-- that's by design. But this shared information between the context and the current word, as well as the previous word, complicates how we're going to interpret the result of this encoding model.
So what we're going to use here is the computational control that I told you about, and we're going to construct a computational representation of this supra-word meaning. First, we learn a linear function, G, that predicts the context vector from all of the words that went into computing the context vector. Remember, the context vector is computed by a language model that has a lot of nonlinear operations.
But here, we're trying to explain everything in that vector that's linearly predictable from the individual words. And then we're going to remove everything that's linearly predictable from the context vector, and what will remain for us is just an embedding that has the other information in the context vector-- information that's orthogonal to the individual words. We think of this as the new meaning that's not present in any of the individual words but that comes from combining the words together.
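In code, this computational control amounts to a residual regression. The following sketch assumes concatenated word embeddings and context vectors as inputs, with shapes chosen purely for illustration.

import numpy as np

def supra_word_residual(word_feats, context_vecs):
    # word_feats:   (n_samples, k * d) concatenated embeddings of the words
    #               that went into each context vector
    # context_vecs: (n_samples, d_context) contextual embeddings (e.g. from ELMo)
    # Fit the linear map G once over the whole data set (as discussed in the
    # Q&A below), then keep only what G cannot explain.
    G, *_ = np.linalg.lstsq(word_feats, context_vecs, rcond=None)
    return context_vecs - word_feats @ G   # residual = supra-word meaning features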
So this is our representation of the supra-word meaning, and we're going to use it to build an encoding model using both fMRI and MEG activity. After we do that-- effectively disentangling these two representations-- we find that there are many regions that are still predicted by information that's unique to the context and not present in the individual words. So there are many areas that are predicted by the supra-word meaning.
And once we combine the results across participants, we see that in the language regions-- this is before we do this control, this is after the control-- reliably across individuals, both the anterior temporal lobe and the posterior temporal lobe represent the supra-word meaning. And we repeat this analysis with data from another lab, the Courtois NeuroMod group in Montreal, which has a different paradigm: people are watching a movie, Hidden Figures.
And we take all the text, all the subtitles of the movie, and we run the same analysis on that data. It's a different machine, a different paradigm, et cetera. Even with all these differences, we still get that both the ATL and the PTL are significantly predicted by the supra-word meaning. And this was also fMRI. But even though the fact that these two regions are significantly predicted by supra-word meaning in fMRI is very reliable, we see something quite different when we repeat the experiment in MEG.
So here, we have subjects reading the same text as the first experiment. And we do exactly the same analysis. We find actually very different results. So here, remember that MEG is actually recording data at a very fast temporal resolution. So we can look at sub-word processes. So here, we're recording a time series from every part of the brain. And you can think of the red line as the onset of a word. So we can really look at sub-word events.
And when we look at the result, we see a very different story. At the top, we see that the representation of the current word, the word embedding, is very predictive of brain activity both before and after we control for the shared information with the other words. However, we also see that the previous word, the word that occurred before this word, is also very predictive of brain activity, and it remains predictive even after we control for shared information with the current word.
So you're still processing the previous word when you're reading the current word. However, if we look at the representation of context-- everything that happened before the current word-- we see that it is predictive of brain activity when the current word is on the screen. But as soon as we control for shared information with the current word and the previous word, this context representation is not predictive at all anymore.
So the supra-word meaning is not predictive of MEG activity at all. You can think of the supra-word meaning as meaning that's maybe processed more slowly over time, or more distributed over time, because you're building it slowly or at different points in time. And whatever it is, it just doesn't seem to be maintained using a neural process that leads to a strong MEG signal. So it's basically silent in MEG.
And so here, if we go back to our questions, we see that for the first question, we did see that complex meaning is stored in the same regions as more simple meanings, because the ATL and PTL-- or at least the PTL-- are really hypothesized to be the location where simple meaning, or the meaning of individual words, is represented, and we see that they are also representing complex meaning. And we also see that there's a big difference depending on which modality you use.
So we saw that the sustained representation of meaning is not visible in MEG. And this allows us to conclude that maybe the sustained representation of meaning is processed in different ways than ways that lead to strong MEG signals. So we know that, for example, the synchronized firing of cells is what leads to strong MEG signal so that's probably not how the supra-word meaning is represented. And I think also, this has some interesting implications for brain computer interfaces.
So if you're going to build a brain-computer interface to decode meaning from brain activity, you're probably not going to use fMRI, because it's very slow and very impractical. You're probably going to use something that's related to electrophysiology, something that shares some properties with the MEG signal. And so if you cannot decode meaning that's more distributed in time-- if it's totally silent in the MEG data-- that could be a big problem. Yeah.
AUDIENCE: So in the part where you said you subtract-- you train the predictor on the individual words and subtract that from the context-- I was wondering, do you train a single model on every example of context and words? Or for every single case do you have to learn a new model? [INAUDIBLE]
LEILA WEHBE: No, we do it over the entire data set that we have for this experiment. So the whole data set that goes into training the encoding model is the one on which we fit this linear model and subtract it. And that guarantees that the variance that's in the supra-word meaning is orthogonal to the variance of the individual words.
AUDIENCE: [INAUDIBLE] You do have an individual model, and you optimize every time the context and the words, you'll get something [INAUDIBLE] but then you subtract the actual contribution of all of those to the right context versus white noise for all of the data [INAUDIBLE] an average.
LEILA WEHBE: Oh sorry, I misunderstood your question. Yeah, so we do remove it on a sentence-by-sentence basis, but we learn the model to remove it over all of the data set. Any other questions? Yeah.
AUDIENCE: Just to clarify something, at which point are you expecting the extra meaning to be invoked? Is it at the word pasta, after finishing? Or does finishing set up an expectation that there's going to be some additional meaning, but you don't know what?
LEILA WEHBE: Yeah, that's a great question. Unfortunately, because we're looking at a data set that's not really controlled-- by allowing, for example, additional time at the ends of sentences or ends of words-- it's not optimized to look for the places where, I would imagine, like the ends of sentences, some of that new meaning arises.
However, because we are looking at basically the data at the end of the word, at the beginning of the word, the middle of the word, et cetera, if it's systematically there at the end of the word, it should show up. But it doesn't seem to be systematically there for every word.
AUDIENCE: Every word at the end of the sentence.
LEILA WEHBE: No, every word in general. We did try to look at the ends of sentences. But again, maybe this experiment just doesn't have enough power to reveal a more subtle thing like that, for example at the end of the sentence. That would be a great next experiment to run. Yeah. Does that answer your question?
AUDIENCE: Sort of.
LEILA WEHBE: OK, do you have another question? OK. Yeah.
AUDIENCE: I'm just curious if you would clarify regarding simple meanings. And I guess my question is twofold. One is regarding the brain: why would we expect simple meanings to necessarily be represented in the brain, just looking from your projection, in almost visual, medial regions? And then the second part is that the way you obtain the simple meanings is based on decontextualized word embeddings. So, yeah, do we even really know what these decontextualized word embeddings really get at?
Could they be correlated with other properties of the linguistic input, or? Yeah.
LEILA WEHBE: Right, so I guess by simple meaning here, I do mean lexical meaning-- the meaning of individual words-- that's just how I define it for this talk, even though the meanings of individual words can themselves be distributed and have multiple senses as well. So that's a fair point. Just from using word embeddings, historically it seems like they do correlate with all of these-- or at least the similarity of words in an embedding seems to be very similar to the similarity of brain responses to words.
So that's where that reasoning is coming from. But maybe-- what do you mean by word embeddings?
AUDIENCE: I know that's true for GloVe and other word embedding models, but these use transformer word embedding models.
LEILA WEHBE: Sure, yeah-- oh, that's what you mean. Yeah, so we've used models that come from simpler models like RNNs, but also word embeddings like word2vec or GloVe or things like that. And they seem to do something very similar.
AUDIENCE: The same thing as
LEILA WEHBE: Yeah, yeah, yeah. I feel like this is a bit better of a control because they're trained in the same way and trained together in the same model. But yeah. So we also had a project a couple of years ago going in the other direction, so trying to use natural language processing in the brain to improve or interpret natural language processing in machines. So the idea is that of course, we don't understand a lot about the brain. But we understand some things.
And what if the way that a network maps onto the brain can tell us something not about the brain, but actually about the network itself? This was again with my student Maria, who is now faculty at MPI. Here, we wanted to look at, for example, a transformer model, which is made of multiple layers of self-attention. These operations are trained over a very large corpus to combine words together in a very specific way.
And it's very expensive to train these layers of attention. But we wanted to know how important this attention is, given how expensive it is to train, and whether it is of the same importance at every layer. So we started by taking the representation at a given layer, removing the attention and replacing it with just uniform averaging, and seeing how that harms our ability to predict brain activity.
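Conceptually, the manipulation looks something like the following sketch, which stands in for the actual surgery on a specific BERT implementation: the learned attention pattern for one layer is replaced by an unweighted average over the context.

import torch

def uniform_attention(values):
    # values: (n_tokens, d) token representations entering one layer.
    # Instead of the learned pattern softmax(Q K^T / sqrt(d)) @ values,
    # every token simply receives the unweighted mean of its context.
    n = values.shape[0]
    uniform_weights = torch.full((n, n), 1.0 / n)
    return uniform_weights @ values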
And so this is the average change in how much we're able to predict brain activity after we remove the attention from a given layer. Every line here corresponds to a layer. And we can see that in general, the more context we have-- the more words we consider-- the worse we become at predicting brain activity without attention, which makes sense: we're kind of messing with the model, we're removing the weights that it learned, and we see that we don't predict the brain as well anymore.
But what we didn't expect is that for the first six layers of the model, we're able to predict the brain activity better by removing the attention, which doesn't make sense at all. Again, you don't have to focus too much on the graph, but what this is telling us is that if I take a model that's trained in a very expensive way, and I just mess with it-- I remove what it learned in specific ways, in specific layers like layers one to six-- I actually predict the brain better.
So here, we're like, OK, it seems like once I destroy the network in this specific way, it's a better brain model-- so is it also a better language model? And this is kind of a bet in some sense: if you make a network more like the brain, it will actually understand language better. So these were some experiments we wanted to try. And to do that, we used a specific set of syntactic tasks that are used to test whether a given model knows grammar, basically.
And so this is the base model that was pre-trained before-- we didn't harm it. And these are the experiments where we remove the attention from different layers. When we remove the attention from, for example, layer 11, the model performance drops, and this is in sync with what we saw with the alignment with the brain. However, when we remove the attention from the first six layers, we saw that we actually improve performance on eight out of the 13 tasks, and we stay the same on the remaining tasks.
So actually, it did pay off. We did win the bet in this case: even though the model was changed from how it was trained, if we change it in the way that leads to better brain prediction, it also leads to better performance on these NLP tasks that have nothing to do with the brain. And again, this is a simple experiment-- definitely just a simple proof of concept. But still, it was the first one that shows that aligning NLP models with brain activity makes them better at NLP tasks.
So this is kind of very encouraging for going into this era of using inspiration from the brain to build better AI models. Yeah.
AUDIENCE: [INAUDIBLE] Something I didn't quite understand about that because essentially to get this result, so maybe I'm just confused. Don't you also need to train all these different models? So essentially, if you already train these models, you can directly evaluate them on the data set and directly get an answer, right?
LEILA WEHBE: Oh, these ones? No, these are not-- they're just kind of tasks that you don't need to train. These, I think, you just basically ask BERT to choose between two alternatives.
AUDIENCE: But don't you, sorry I'm confused. When you replace attention with the linear layer, do you retrain it on the line?
LEILA WEHBE: No, we don't retrain it.
AUDIENCE: Oh, OK.
LEILA WEHBE: That's why it's unintuitive because we're changing.
AUDIENCE: OK, but even then, so I guess the general question is, here you're proposing, for example, let's come up with different variants of networks that can either be retrained or not and then compare them to the brain data. Find out which one is better, and maybe that's predictive of what's going to work on some AI task. But since you already have these variants, why not just directly, if improving AI is a goal, why not just directly apply them to AI.
So you can just skip whether or not they improve the brain part because evaluating an AI is not the hard part, right?
LEILA WEHBE: Oh, I see. Yeah, yeah, you're right. This is not necessary-- you can take any model, do these changes to it, and then see if it leads to a better AI model, is that what you're saying? You don't have to go through the brain.
AUDIENCE: I mean, going through the brain, in this particular scenario, doesn't save you compute, right? Ideally--
LEILA WEHBE: No, it doesn't save the compute. But we wouldn't have thought of doing it if it wasn't for this experiment. We were just trying to study the alignment between them, and we wouldn't have thought of doing it otherwise. We thought about it because we saw that weird, non-intuitive thing, which is an improvement in performance. And finally, I want to talk very briefly about this other work in which we tried to change BERT, in this case by fine-tuning it on data from the brain.
But before that, I want to talk a bit more about one of the methods I mentioned at the beginning of the talk, which is end-to-end modeling. And I want to say that the classical way of using encoding models that I was talking about before is not hypothesis-free. We always embed our hypothesis in the feature space.
And so even if we're using a neural network-- for example, if I train the neural network to be a language model-- then the hypothesis is that it's going to generate feature vectors that are good for predicting the next word. If I train, for example, a convolutional neural network to be an image or object classifier, it's going to learn representations that are good for classifying objects. So in general, the cost function that I use to train a neural network is what dictates the hypothesis that I'm trying to study in the brain.
So if we're always doing this, we may be adding a strong bias about what we're looking for in the brain activity and missing something important. The alternative is to do end-to-end modeling. We can think of this as an assumption-free approach to reverse engineer brain representations. An end-to-end model is a model that's built from the ground up, without any pre-training, to predict brain activity from the stimulus.
And the idea, in the case of language, is: if you give enough data to the model, can the model learn how the brain is actually combining these words together? And then once it learns that, can you interrogate it and see how the brain is doing that-- can it become like an in-silico brain, and with that, allow us to extract some principles of what the brain is using to understand sentences?
And most importantly, is this even possible with fMRI? Can you actually train a model from the ground up to predict how the brain would react to specific sentences, without pre-training, and then go look in the model and see, oh, this is how the brain is doing this and that, and these are the principles that the brain uses? So again, this is a bet, because what you're saying is that if an algorithm is able to predict brain activity very well, then it must actually predict brain activity under different cases-- like, for example, in a metaphorical setting.
It needs to know how the brain would react. It needs to predict the brain when the brain is surprised, or predict the brain when the brain is making some types of inferences. So in order to perfectly predict the brain as much as you can, you kind of need to understand language. And so the bet is that by training an algorithm to predict brain activity, it will implicitly learn the rules that the brain is using to work.
And even if this model is not better than, say, GPT-4-- which is probably going to be hard-- maybe it's going to be more accurate in specific settings where models trained without brain activity are not accurate. And maybe it would be more efficient, for example, or more generalizable to other situations. And so in the experiment we did back in 2019, we didn't have enough data to train a model from scratch.
But we did take a pre-trained BERT and try to fine-tune it on predicting fMRI and MEG data. We were successful in making it a better encoding model-- it was better at predicting data for new subjects and data across modalities-- but we were not yet able to make it a better AI model. So here, it's still inconclusive whether we can actually build an end-to-end model in the domain of language from brain data and whether these representations will actually be useful for AI systems.
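The fine-tuning idea can be sketched roughly as follows; the pooling choice, the model name, and the linear head here are illustrative assumptions rather than the exact 2019 setup.

import torch
import torch.nn as nn
from transformers import AutoModel

class BrainTunedBERT(nn.Module):
    def __init__(self, n_voxels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, n_voxels)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        pooled = hidden.mean(dim=1)      # one vector per text chunk
        return self.head(pooled)         # predicted voxel responses

# Fine-tuning minimizes, e.g., the mean squared error between these predictions
# and the measured responses, with gradients flowing into BERT itself; the open
# question is whether the tuned BERT is then also a better NLP model.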
But we actually were successful in doing this in the domain of vision. So this is the second part I wanted to talk about. In the domain of vision, people have been aligning the visual system with neural networks for a long time. And here, with Meenakshi Khosla, who is also a post-doc here at MIT, we thought of training neural networks directly to predict brain activity-- basically trying to see if we can characterize response properties in high-level visual cortex systematically, in a hypothesis-neutral fashion.
So you can think of this as an alternative tool to uncover neuronal tuning properties in a data-driven manner and basically verbalize what those neurons-- or rather, what areas of the brain, because we're using fMRI here-- are doing. Specifically, what we use is the Natural Scenes Dataset, which you might already know about. It's a very large data set of almost 40 hours of recording per subject, which leads to 10,000 images per subject that are each repeated three times-- very good, very clean data.
And the nice thing about this is that these images sample a very large set of stimuli, with many different complex scenes, et cetera. What we do here is train a convolutional neural network from scratch to take as input an image and predict activity in different brain regions. We focus on four ROIs: the FFA, the visual word form area, the body area (EBA), and the RSC, which is a place area.
And so for each one of these areas, we're going to build a different neural network that's optimized to predict the activity of that area, taking the different images as inputs. And it turns out to be quite successful. We do very well at predicting these different areas-- we're very close to the noise ceiling in all of them. These networks actually do as well as state-of-the-art convolutional neural networks and much better than just categorical models.
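A minimal sketch of such a response-optimized network, assuming an illustrative architecture and a simple voxel readout rather than the exact model from the paper:

import torch
import torch.nn as nn

class ROIEncoder(nn.Module):
    def __init__(self, n_voxels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.readout = nn.Linear(128 * 4 * 4, n_voxels)  # one output per voxel

    def forward(self, images):                # images: (batch, 3, H, W)
        x = self.features(images).flatten(1)
        return self.readout(x)                # predicted ROI voxel responses

# Training minimizes, e.g., mean squared error between predicted and measured
# responses over the ~10,000 NSD images for one subject, with no pre-training.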
But actually, we do better than state-of-the-art networks in other ways. For example, our models are very easy to generalize to new subjects: you need fewer than 100 images for a new subject to generalize the model to that subject-- or rather, even at 100 images for the new subject, you do much better than the other models. So this is a very reliable model, and it skips the usual step where a typical network would be pre-trained, for example on ImageNet, to become a convolutional neural network that can classify images.
Here, we're skipping this pre-training step and training it directly to predict brain activity from images. So it's great that it can predict well, but that doesn't tell us much about what it's using to make those predictions. In order to interrogate this network, we're going to use network dissection, which is also a method from here. What this does is look at the output units that predict the voxels in different regions, let's say the FFA.
Then we're going to look at a specific data set in which every image is labeled at the pixel level. We put each image through our network and see which areas of the image activate a specific voxel the most, and we upsample that to the size of the original image. So here at the top, I have the image annotated with which areas are activating the voxel the most.
And here, I have the segmented image. So I can compute the overlap between the objects in each image and the parts that are activating a specific voxel. I end up with, for each area, this median overlap telling me which types of objects most activate each voxel. And the results make a lot of sense-- it's basically a hypothesis-free way to show how selective these areas are.
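The overlap computation can be sketched as follows, assuming the activation map has already been upsampled to image size and the segmentation gives one category label per pixel; the threshold is an illustrative choice.

import numpy as np

def category_overlap(activation_map, segmentation, threshold_quantile=0.99):
    # activation_map: (H, W) unit activation, already upsampled to image size
    # segmentation:   (H, W) integer object-category label per pixel
    # Returns {category: IoU between the top-activated region and that category}.
    mask = activation_map >= np.quantile(activation_map, threshold_quantile)
    overlaps = {}
    for cat in np.unique(segmentation):
        cat_mask = segmentation == cat
        union = np.logical_or(mask, cat_mask).sum()
        inter = np.logical_and(mask, cat_mask).sum()
        overlaps[cat] = inter / union if union else 0.0
    return overlaps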
So the FFA has a lot of selectivity for things like heads, skin, people. The body area cares a lot about people and skin. The network that's trained to predict the RSC, which is a place area, has a lot of window preferences, which is kind of a cute thing-- I'll show you in a minute. And the network that's trained to predict the visual word form area has a lot of selectivity for signboards, which is also a very nice result.
Because actually, the original data set didn't have a lot of writing in it in the first place. So the fact that we were able to get a network that has such a clear preference for signboards is quite impressive. You can see this data another way. Here, we're looking at the top category for each voxel. We see that the FFA has mostly head detectors. The visual word form area, which has some overlapping voxels with the FFA, also has some head detectors, but also signboards.
The body area again, has a preference for people, heads, et cetera. And the RSC has some preference for sky but also mostly for windows, which is again, an interesting thing. And you see these results are quite different from basically random networks that don't show this kind of selectivity. We can see this result again in a different way. So here, for each of the top predicted voxels, we're looking at images that activate them the most, but also at what activates those voxels in those images.
So very clearly here, the network learns that these parts of these images are going to activate the fusiform face area voxels the most. And here, this is the word form area-- you can see, again, a lot of signboards, a lot of writing on those signboards. The extrastriate body area is activated by all these body parts, and again, for the RSC, the windows.
My hypothesis as to why windows are important for predicting RSC activity is that they tell you both about the layout of what you're looking at and whether you're indoors or outdoors. That's what I think is happening here. So again, remember, this is from the ground up. We didn't have a definition of any of these categories in the training. We really just trained the network to predict fMRI activity, and it learned all this selectivity automatically from the brain activity.
We also interrogate this network in a way similar to [INAUDIBLE] et al. from 2019, in which a network with a linear decoder that predicts neural activity is optimized to find which inputs maximize the voxel or neuron activity. So we do the same thing here, but with our human data, and the difference is that we train the network end-to-end.
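A hedged sketch of this kind of activation maximization, assuming a trained encoder like the one sketched earlier (here called roi_encoder, a hypothetical name) and gradient ascent on one voxel's predicted response:

import torch

def maximally_activating_image(roi_encoder, voxel_idx, steps=200, lr=0.05):
    # Start from noise and do gradient ascent on one voxel's predicted response.
    image = torch.randn(1, 3, 128, 128, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=lr)
    roi_encoder.eval()
    for _ in range(steps):
        optimizer.zero_grad()
        response = roi_encoder(image)[0, voxel_idx]
        (-response).backward()     # maximize the predicted response
        optimizer.step()
    return image.detach()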
And so these are the maximally activating images for the different areas. Each one of those is a different voxel. You can see that for the face area, you have things that, maybe if you squint, do look like they could be relevant for faces-- a lot of concentric circles. For the body area, you have more elongated shapes, maybe looking like arms in some places.
The retrosplenial cortex, RSC, has a lot of rectilinear features, in contrast with the more curvilinear features for [INAUDIBLE] FFA. And this also makes sense-- this is what we think these areas are representing. And I think the coolest one for me is the visual word form area's maximizing images, which look like scribbles, like a figure eight, et cetera. So I see all of these as an even stronger demonstration of how selective these areas are.
But then we thought about it more. OK, we are training a network to predict FFA, let's say. If FFA is just responding at level one when there's a face and level zero when there is no face, then this is just training a face detector that outputs one if there's a face and zero if there's no face. So we're not really using the brain activity in any interesting way-- we're just using it as a set of labels for faces, maybe a noisy set of labels.
So maybe it's not so interesting that we're getting this-- the face area is just a detector for faces. But if that's the case, then if I remove all the faces from my data set and run the same experiment, I shouldn't be getting face selectivity anymore. So we did just that: we removed all of the faces from the training set to see what would happen. And we saw that, when we do that, we get exactly the same results.
So we train the network on a set that had no images of faces at all and just images of objects that had no faces. And somehow, the patterns that emerge in those objects made the network learn that actually, its activity would be maximized if it saw faces. And so these are the results on that particular experiment. So here again, remember, there's no faces in the training set. And yet, the network just learns that these particular aspects of the image are the ones that are going to maximize the voxels in that region.
So that means that even when there are no faces, the brain is still responding in some way that's characteristic of shapes that are important for faces. So there's this kind of domain-general response to these images that is not just characteristic of one category. And we see the same thing for EBA, where we again train the network with images with no people, and we again find the same selectivity.
All right, so very quickly: we also wanted to see-- OK, these networks are trained on these different areas that actually do different tasks. So once we do this training, do they actually learn to do different things with different accuracy? And we saw that, for example, when we use the representations from these networks to do a face classification task, the network that's trained on FFA greatly outperforms the other networks.
So it has learned from the FFA a representation that's useful for doing face identification. We also see that the network that's trained on the place area, on RSC, does much better-- or the representations from that network do much better-- than the other networks at a spatial task, which is classifying room layout. So this is pretty cool. Going back to our set of questions, we saw that yes, it is true that we can learn the relevant representations directly from brain data.
And data sets like NSD, which are pretty clean and strong, are good enough for doing this. What we concluded from this is that we found a very strong characterization of selectivity, in a way that was very hypothesis-neutral. And we also won the bet here again: we saw that brain activity could help build better AI models.
So in this case, with this set of results, we didn't find that we do better yet than state-of-the-art models for face detection, face identification, or room layout classification. But we do much better than the other networks. So there's definitely a lot of research there to be done. But since I'm out of time, I think I'm going to stop here and just thank you all for your time.
[APPLAUSE]