Introduction to transformer architecture and discussion
Date Posted:
October 17, 2022
Date Recorded:
October 11, 2022
CBMM Speaker(s):
Brian Cheung, Phillip Isola
Description:
Phillip Isola and Brian Cheung, MIT
This is meant to be an informal discussion in which Phillip and Brian will give an overview of transformer networks and then we will open the floor for questions and discussion. It is likely that we will have another meeting at a later time discussing what transformers may contribute to neuroscience.
TOMASO POGGIO: If we look at history, models of the brain have always followed the most fashionable technology. It was hydrodynamics in the 17th century and digital computers a few decades ago. Ten years ago it was convolutional neural networks, and now transformers, of course.
So in preparation for this, we have an introduction to transformers. I think this is something that Michale Fee, chairman of this department here, and Jim DiCarlo asked for. Of course, they're not here, but Michale actually asked me to record it. Is that possible?
AUDIENCE: It's recording now, as long as that's OK with you guys.
AUDIENCE: Yeah, cool.
TOMASO POGGIO: Everybody agrees to be recorded? All right. So with this, we have an introduction to transformers, and this probably will be the first of several research meetings where we'll speak about transformers. Next time, perhaps more in comparison with, or in the framework of, models of the brain. But this time, it's more on just the technical aspects of transformers. And so Phillip Isola, CSAIL, and Brian Cheung, this building, will tell us how. Jim is here, after all. OK, so lupus in fabula.
AUDIENCE: What was that?
TOMASO POGGIO: Lupus in fabula exactly means what happened, that I was speaking about you, and you came in.
AUDIENCE: Oh, magic.
TOMASO POGGIO: So Phillip will start, and then we have Brian, and then perhaps discussion. And in the meantime, of course, very welcome to ask a lot of questions. All right.
PHILLIP ISOLA: OK, great. Thank you, Tommy. So this can be, as far as I'm concerned, very informal. I'm happy to go back and forth. I took slides from a lecture-- hour and a half lecture on transformers I've given and tried to pick a few slides.
So it's going to be a quick intro, but we can go into as much detail as you want. And happy to make it also just more discussion. But I'll start by talking about what are transformers for those of you who haven't encountered them yet.
So I think that there are two key ideas that are not new, but are relatively popularized by transformers, and the first is called-- the idea of tokens. And then the second is going to be the idea of attention. There's a few other bits and pieces in transformers, but to me, those are the two kind of key ideas.
So tokens are essentially a new data structure, and they're a replacement for neurons. So in artificial neural nets, neurons are just real numbers, just scalar numbers, and tokens are vectors. So that is the distinction, as far as I'm concerned. So it's just lingo for a vector of neurons, a vector of scalars.
But when you start thinking of your primitive units in a network as tokens rather than neurons, then sometimes the math will look a little bit different. And I think it's just a powerful way of working with new kinds of neural networks made out of tokens instead of neurons. Again--
TOMASO POGGIO: It's a patch in an image, right?
PHILLIP ISOLA: A patch in an image can be a token. I'll give a few examples. And not a new idea. Vector neurons, vectors of features for a unit inside a network, that's an old idea. Shows up in graph neural networks, if you've seen those, too.
But the way I like to think of it is we used to work with arrays of neurons, and now in transformers, we work with arrays-- tensors of various shape of tokens. But they're encapsulated vectors of information. So you can tokenize just about anything, and that is the big trend right now, is just show how to turn whatever data you have into a sequence of tokens.
So here's an example Tommy is mentioning, of tokenizing an image. You could do this in a million different ways. But the way that it's very common is you start with an image, you break it into patches. You flatten all the patches to create-- just take all the rows of those pixels and just concatenate them into a really long row, or a really long column vector.
And then you get this thing called a token, which is just an n-dimensional vector representing that patch. And this arrow up here, if you can see my mouse, could just be flattening or concatenation. But commonly, it will actually be a linear projection to some fixed dimensionality. So project to a 128-dimensional vector, for example.
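To make that patch-tokenization step concrete, here is a minimal NumPy sketch. The 16x16 patch size, the 128-dimensional token, and the random projection weights are illustrative assumptions, not values from the slides.

    import numpy as np

    def tokenize_image(image, patch_size=16, token_dim=128, rng=None):
        """Split an image into non-overlapping patches, flatten each patch,
        and linearly project it to a fixed-dimensional token vector."""
        rng = rng or np.random.default_rng(0)
        H, W, C = image.shape
        patches = []
        for i in range(0, H - patch_size + 1, patch_size):
            for j in range(0, W - patch_size + 1, patch_size):
                patch = image[i:i + patch_size, j:j + patch_size, :]
                patches.append(patch.reshape(-1))          # flatten into one long vector
        patches = np.stack(patches)                        # (num_patches, patch_size**2 * C)
        W_proj = rng.standard_normal((patches.shape[1], token_dim)) * 0.02
        return patches @ W_proj                            # (num_patches, token_dim) tokens

    tokens = tokenize_image(np.random.rand(224, 224, 3))
    print(tokens.shape)  # (196, 128): a set of 196 tokens, one per 16x16 patch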
AUDIENCE: Should I think of a token as an embedding? Is it the same or what are--
PHILLIP ISOLA: Same thing, yeah. So a vector embedding of a patch would be a token in this context, yeah. So there's a lot of names for this, but to me, once you start thinking of things in terms of tokens, that's how I like to think of it. So you can tokenize anything.
That's often a linear projection. We used to operate over neurons, now we operate over tokens. And just a few more examples. So whenever you can take your signal and chop it up into chunks, and project each chunk to a fixed dimensionality vector, then you can tokenize that data.
So for language, people do this as well. They chop up into little chunks, which are often two characters, or a few characters at a time, and they project those into 128-dimensional vectors. With sound, you just chop up into little tiny snippets of sound.
So you can tokenize anything. And once you've tokenized it, then it's just a sequence-like representation, a sequence of vectors. Although really, the way transformers work is they think of it as-- they treat it as a set, not a sequence. But people often talk of it as a sequence.
So that's tokenization. I think that's the first critical idea. And then everything ends up being operators defined over tokens, as opposed to over neurons. So rather than taking linear combinations of neurons, which is the common linear layer in a network, we take linear combinations of tokens, where you just take a weighted combination of vectors instead of a weighted combination of scalars.
So I'm not, in this-- because it's only a little intro-- going to go into the math in detail, but we can always come back if you're interested. So in standard artificial neural networks, there's two key layers. There's the linear combination layer, the linear layer, and there is the pointwise nonlinearity, which might be a ReLU or a sigmoid function.
So in tokens, it's the same thing. There's a linear layer, which we have here. Just a linear combination of vectors. And then there's a tokenwise nonlinearity, which is completely analogous to a neuronwise nonlinearity. So this is a neural net that does a pointwise nonlinearity over neurons. It applies the same function ReLU to every neuron in a list. And the token net applies the same function, f, to every token in a list or in a set.
Usually f is going to be itself a multilayer perceptron. It will be some parameterized function. You can equivalently think of this as a convolution over the tokens. So transformers, CNN's, it's all the same ideas rehashed. We can talk about that if people are interested.
But here's what it looks like. It's just a token-wise operation that slides along the list of tokens. So you can see that looks like convolution sliding across the signal.
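As an illustration of the tokenwise nonlinearity just described: the same small MLP f is applied to every token independently, like a one-by-one convolution sliding along the token list. A minimal NumPy sketch, with illustrative dimensions and random weights:

    import numpy as np

    def tokenwise_mlp(tokens, W1, b1, W2, b2):
        """Apply the same two-layer MLP f to each token (row) independently."""
        hidden = np.maximum(0, tokens @ W1 + b1)   # ReLU, applied tokenwise
        return hidden @ W2 + b2                    # project back to the token dimension

    rng = np.random.default_rng(0)
    d, d_hidden, n_tokens = 128, 512, 196
    tokens = rng.standard_normal((n_tokens, d))
    W1, b1 = rng.standard_normal((d, d_hidden)) * 0.02, np.zeros(d_hidden)
    W2, b2 = rng.standard_normal((d_hidden, d)) * 0.02, np.zeros(d)
    out = tokenwise_mlp(tokens, W1, b1, W2, b2)
    print(out.shape)  # (196, 128): every token transformed by the same f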
TOMASO POGGIO: It's usually one layer, right?
PHILLIP ISOLA: Is it usually one layer? I think that it's usually nonlinear rather than linear.
[INTERPOSING VOICES]
PHILLIP ISOLA: Yeah. So it's a nonlinear function. Yeah, it just has to be nonlinear so you get nonlinearities into the system.
TOMASO POGGIO: [INAUDIBLE]
AUDIENCE: I have a--
PHILLIP ISOLA: MLP--
TOMASO POGGIO: One layer of [INAUDIBLE].
PHILLIP ISOLA: Well, two linear layer MLP, but sure, you could call it something else. Yeah, question?
AUDIENCE: I have a quick question for the last slide. I think I have missed what is the z here like in the last slide?
PHILLIP ISOLA: Oh, z is the token code vector, meaning the vector of neurons inside that token. It has this code vector that lives inside it. Yeah.
AUDIENCE: OK, thank you.
PHILLIP ISOLA: You're welcome.
AUDIENCE: Yeah, what's the analogy with convolution here?
PHILLIP ISOLA: So convolution applies the same operator independently and identically to every item in a sequence. And this is doing the same thing. So it's like slide the filter across. Here, slide this nonlinear filter across the sequence. But think of it as a one by one kernel because it doesn't-- the receptive field in the sequence dimension is just looking at one token at a time.
OK. So here's a neural net, and here is what I'll call a token net. So it's just like a neural net, alternating linear combination and pointwise nonlinearity, but now it's a linear combination of tokens and a tokenwise nonlinearity. And transformers are in that family.
Another name for that family is graph neural networks. They have the exact same structure. But the terminology is just different for graph neural nets. But they're the same thing. Token nets, graph nets, and transformers are all the same thing. You could say that token-- sorry, transformers are a special kind of graph net if you prefer to look at it that way.
So many connections that could be made. But I'm going to zoom right ahead and we can maybe discuss all the connections. So idea number one is build things out of tokens, vector valued units, as opposed to out of neurons, scalar-valued units. Idea number two to me is attention. Maybe this is the most famous idea of transformers, is they have this new layer called attention.
So let's look at what attention is. So in a neural net, or what I'm calling a token net, you'll have linear combinations of inputs to produce an output, a weighted sum of the inputs to produce an output. And you will parameterize that mapping with weights, w, and this will be learnable.
And in attention, you have a weighted combination of inputs to produce an output, but the parameters, or the values in the matrix A are not learnable. Instead, they're going to be data-dependent. They're going to be a function of some other data. So when you have data-dependent linear combination where the weights are a function of some other data, then that's called attention.
And so A is the weight matrix, and it's just a function of something else that tells you what to attend to, how much weight to apply to each token in the input sequence. And you'll take a weighted sum, which is just matrix multiply A by the tokens. Notation, don't worry about it.
So here's the intuition. So I can have attention given to me by some other branch of my neural network. Maybe it's going to be a branch that parses a question.
And that question will tell me what patches in the input should I attend to. The input patches are represented by tokens, and I will say these are the patches that I'm going to place high weight on, and then I'll take a weighted sum of the code vectors inside those patches to produce an output. So I could ask a question, what is the color of the bird's head? I'll attend to the bird's head.
What is the color of the vegetation? I'll attend to the background. So it's just saying which tokens or patches are going to be getting a lot of weight to make my decision. And then I'll report [? that ?] [? screen, ?] because the token code vectors will have represented some information like the color of the token.
So this is going to be a little too much detail to fully understand in just a few minutes, but here's the most common kind of attention layer, just to show you the mechanics really quickly. The question submits a query that is matched against a key in the data that you're querying. So the data you're querying is a set of tokens, and each token has a key vector, which gets matched against the query vector of the question.
You compute the dot product to get the similarity between the key and the query, and that dot product, that similarity, becomes the weight that you apply to another transformation of your tokens, which is called the value vector, and you take a weighted combination of the value vectors. And so all the fancy math here is just to say the question will tell me which tokens to weigh heavily in a weighted sum to produce an output. But it'll be via these three transformations of the token vectors, which are the key, the value, and the query transformations.
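A minimal NumPy sketch of the query/key/value mechanics just described, assuming the common scaled dot-product form; the dimensions, random weights, and function names are illustrative. The same function covers self-attention, where the queries come from the same tokens as the keys and values:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(query_tokens, data_tokens, Wq, Wk, Wv):
        Q = query_tokens @ Wq                        # queries, e.g. from a question branch
        K = data_tokens @ Wk                         # a key vector for each data token
        V = data_tokens @ Wv                         # a value vector for each data token
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # data-dependent weights A
        return A @ V                                 # weighted sum of the value vectors

    rng = np.random.default_rng(0)
    d = 128
    data = rng.standard_normal((196, d))             # e.g. image patch tokens
    question = rng.standard_normal((8, d))           # e.g. tokens of a question
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
    out = attention(question, data, Wq, Wk, Wv)      # attention driven by the question: (8, 128)
    self_out = attention(data, data, Wq, Wk, Wv)     # self-attention: data attends to itself, (196, 128)
    print(out.shape, self_out.shape)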
So that's a little mechanistic detail. We can go back to that if we want to discuss the nitty gritty. But the way it looks like is this, and this is common to most of the transformer architectures. You have a bunch of-- you break into patches, then you have a bunch of layers which are basically these one by one convolutional layers.
And then every now and then, you have something that tries to mix information across space, across tokens, and that's called attention. And that is just the tokens based on their values, queries, and keys will decide what should I-- which of the other tokens should I average together to produce a new representation in the output. So this head might attend-- say I should look at the other heads to decide how I can better recognize what's going on in this patch.
AUDIENCE: When you say one by one convolution, you mean a patch is a pixel?
PHILLIP ISOLA: A patch is-- so you vectorize a patch, and then you do one by one convolution across all of those tokens where the channel dimension is the token vector.
AUDIENCE: Yeah.
PHILLIP ISOLA: So here's MLPs, and then transformers are two changes. One is that rather than scalars, they have tokens. And the other change is rather than having parameterized linear weights, they have data-dependent linear weights that are given by that special attention operator. And that attention operator itself has parameters that define the key, the query, and the value, and those are the learnable parameters of the system.
TOMASO POGGIO: Phillip, just a small detail, but you start with self-attention, and then you have a one layer, a multilayer perceptron. But in the previous slide, you had the opposite order. Is that--
PHILLIP ISOLA: Well, so it'll vary a lot between different architectures. Yeah, you could alternate using different orders. And I did gloss over the detail. This is attention from an external question. But the more common thing is self-attention, where attention is coming from the image itself, from the data itself.
And that's-- this is like attention from the question branch, but I could just have the data choose what in its own token sequence to attend to. Each token chooses which other tokens to attend to, and that's called self-attention. This picture here where this patch will decide what to attend to get a better representation of that patch, essentially.
So this is self-attention, really. That's why the dependency of the weights on the data is like this. It's attending to itself, as opposed to the attention coming from some external source.
So I think this is the last one, just to connect it to one of the most common demonstrations of transformers is to do sequence modeling. But really, transformers are more about set to set operations. But you can represent a sequence as a set, so it's fine. But-- oh no, I think I got the wrong slide here. Let me pull up the right one.
Yeah, this is what I wanted to show. So here's how it looks for doing next word prediction. And you can do next protein prediction in a sequence of proteins, or next sound wave prediction. You can do next [? to ?] [? me ?] prediction. That's a really common framework.
So this is like a one-layer transformer. We're going to say colorless green ideas sleep. Put that into the transformer; we want to predict the blank. The words attend to each other, then you pass through this tokenwise nonlinearity, and then you make a prediction at the very end.
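A rough NumPy sketch of that next-word-prediction flow: embed the context words, let them attend to each other, apply a tokenwise nonlinearity, and read a prediction off the last position. The weights are untrained and the tiny vocabulary is made up; this only shows the data flow, not a working language model.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["colorless", "green", "ideas", "sleep", "furiously"]
    d = 32
    embed = rng.standard_normal((len(vocab), d)) * 0.1

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    context = ["colorless", "green", "ideas", "sleep"]
    Z = np.stack([embed[vocab.index(w)] for w in context])     # one token per context word

    # one self-attention layer: the words attend to each other
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    A = softmax((Z @ Wq) @ (Z @ Wk).T / np.sqrt(d))
    Z = A @ (Z @ Wv)

    # tokenwise nonlinearity, then predict the blank from the last token
    Z = np.maximum(0, Z @ rng.standard_normal((d, d)) * 0.1)
    logits = Z[-1] @ embed.T                                    # score every vocabulary word
    print(vocab[int(np.argmax(logits))])                        # untrained, so the choice is arbitrary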
So that's just one example of what you may have seen as transformers for sequences, but really transformers are more general in that they're not just about sequence modeling. This is just one way you could use them. So that's the 10-minute overview of transformers. Let's see, should we do questions, or Brian, do you want to just jump right in?
BRIAN CHEUNG: Maybe you can do questions while I set up.
PHILLIP ISOLA: OK. Questions while Brian sets up?
TOMASO POGGIO: What is the history of these terms-- query, value, keys?
PHILLIP ISOLA: Probably other people here know better than I do, but I think it's all coming from the data retrieval, database literature. So you have a database where you have knowledge stored in cells. And you can query that database. So you say, I want to find things related to giraffes.
And then every cell will have a label. The key will be like, this is a cell for mammal, and this is the cell for giraffes, and here's a cell for plants. And you'll match the query to the key. And then the stuff inside the cell that you retrieve will be the value. So I think it's coming from that.
AUDIENCE: This isn't like a fully formed question, but I guess in the framework of having question key value, if-- let's say the question was like, what is the color of the turkey's head? And for some reason, the turkey's head is a feature that's complex enough that it can't be like well represented within a single token, does that cause issues with attending to the right thing, or weighting the right thing properly when--
PHILLIP ISOLA: Yeah. I suppose it could. One answer would be that you often do multi-headed attention, where you'll take your sequence, or your set of tokens, and then you'll have K different query vectors, and K different value vectors, and K different key vectors. And each query can be asking a different type of-- it can say, I want to match to the same color.
Another one could say, I want to match the same geometry. And when you optimize the parameters of this network, it will somehow self-organize that, well, if it's useful to factor things into geometry and color, then there'll be one attention head that cares about color, and one that cares about geometry.
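A minimal NumPy sketch of the multi-headed attention being described: K independent sets of query/key/value projections run in parallel and their outputs are concatenated. The head count and dimensions are illustrative:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(Z, heads):
        outs = []
        for Wq, Wk, Wv in heads:                       # each head has its own projections
            A = softmax((Z @ Wq) @ (Z @ Wk).T / np.sqrt(Wk.shape[1]))
            outs.append(A @ (Z @ Wv))
        return np.concatenate(outs, axis=-1)           # concatenate the heads per token

    rng = np.random.default_rng(0)
    d, d_head, n_heads = 128, 32, 4
    Z = rng.standard_normal((196, d))
    heads = [tuple(rng.standard_normal((d, d_head)) * 0.02 for _ in range(3))
             for _ in range(n_heads)]
    out = multi_head_self_attention(Z, heads)
    print(out.shape)  # (196, 128): one head might track color, another geometry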
AUDIENCE: Just one quick thing on the transformer architecture and the nonlinearity. If I recall correctly, you do a normalization of each token, right?
PHILLIP ISOLA: Yeah.
AUDIENCE: What's the intuition for that? Why do you do that? I haven't seen that too much before transformers.
PHILLIP ISOLA: Yeah, that's a great question. So normally, you want your weighted sum to add up to one. So the weights should add up to one. And people achieve that via the softmax over the weights. Is that what you're referring to?
AUDIENCE: No. No, I'm referring to post the residual connection you do as part of the-- right before the--
[INTERPOSING VOICES]
PHILLIP ISOLA: Oh, the layer norms.
AUDIENCE: Yeah.
PHILLIP ISOLA: So my understanding of transformers has progressed to the level of tokens and attention. And all the rest, layer norm and residual connections, at this point, I don't grok. I don't know why. Those feel like tricks that usually help neural networks train. It's good to normalize things. It's good to have residuals. I haven't understood anything specific to transformers about those ideas yet. They're generally useful tricks.
AUDIENCE: OK.
BRIAN CHEUNG: Transformers are very hard to train, especially ones with several layers, like 12 and up, if you don't have residual connections or layer norm.
AUDIENCE: Yeah, OK. OK
BRIAN CHEUNG: I think the training isn't very stable unless you [? do ?] those tricks.
AUDIENCE: So it's not-- yeah. [INAUDIBLE].
PHILLIP ISOLA: Also, attention involves multiplication, and that might not be as stable as normal, non-attentional layers that just do addition.
AUDIENCE: Yeah, that's the softmax, right?
PHILLIP ISOLA: The softmax can probably help with that.
BRIAN CHEUNG: Yeah.
AUDIENCE: So how do you decide what patches to use? For instance, in the image or the text, you also had what seemed to be arbitrary segmentation of the data.
PHILLIP ISOLA: Yeah, that's a great question-- like, how do you do the tokenization? How do you design that? I think it's super hacky right now, so that feels like somewhere that people could do a lot of work.
One thing that does seem to happen is that the smaller you make your tokens-- if you have tokens that are a single pixel-- the better you tend to do. So maybe what will happen is we'll just stop having clever tokenization and go down to whatever the atomic units of the data are, like a single character, a single pixel, and that just makes the choice for us because you can't go below that, really.
So I think the smaller-- in vision transformers, the thing I've seen is that the smaller the tokens are, the better they tend to perform. But because the attention mechanism is every token attends to every token, it's like n squared; if you make the tokens too small, there's too many tokens, and then you run out of memory. So there's probably clever tokenization schemes like superpixels, or segmentation. Or in language, there's a lot of tokenization schemes-- Byte Pair Encoding is the name of one, and [INAUDIBLE]-- you could use that as your first tokenization layer. But yeah, that feels like a hacky area right now to me.
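As a concrete illustration of Byte Pair Encoding, named above, here is a toy sketch of its core loop: repeatedly merge the most frequent adjacent pair of symbols. Real BPE tokenizers are trained on large corpora with many merge rules; the example string and merge count here are made up.

    from collections import Counter

    def bpe_merges(text, num_merges=10):
        tokens = list(text)                               # start from single characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(tokens, tokens[1:]))      # count adjacent symbol pairs
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]      # most frequent pair
            merges.append(a + b)
            merged, i = [], 0
            while i < len(tokens):                        # replace every occurrence with the merged symbol
                if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens = merged
        return tokens, merges

    tokens, merges = bpe_merges("the theater near the thermal path", num_merges=6)
    print(merges)   # frequent chunks like "th" and "the" become single tokens
    print(tokens)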
AUDIENCE: You can usually try to leverage at least some degree of topology by tokenizing in a way that respects spatial coordinates for images, or language order for words. Because there's also something called position encoding, which gives you knowledge about the topology-- why this element is in this position versus this other position-- and that tells you a lot about the structure of the image, or at least where this token is with respect to the overall structure of the image.
PHILLIP ISOLA: Yeah. And then another thing that seems missing right now is that the vision transformers, which I'm most familiar with, usually break the image into non-overlapping tokens. But we know from ConvNets and signal processing that you'll get a ton of aliasing in the filtering operations. If you have these huge strides, breaking into non-overlapping patches, you really would want to have overlapping patches, or blur the signal. So all that signal processing stuff I think has been thrown away, and probably it's a good idea to put it back in there.
AUDIENCE: Someone has tried overlapping patches.
PHILLIP ISOLA: OK, I guess--
AUDIENCE: It does help, yeah.
PHILLIP ISOLA: And by the way, everything I say, as you probably know, there's 10,000 transformer papers now. So I'm sure everything that you could imagine has been tried. Yeah.
AUDIENCE: So the part about working with sequences versus sets confuses me a little bit, and this relates to position encoding. So I think it is important when you send these tokens to the transformer to actually inform it where it is in the sequence.
PHILLIP ISOLA: Yeah.
AUDIENCE: So do transformers operate on sequences, or sets? Like how-- I'm--
PHILLIP ISOLA: Brian, do you mind if I pull--
BRIAN CHEUNG: Oh, yeah. Sure. Sure.
PHILLIP ISOLA: --take the slides over one more time? I do have some slides that I think clarify that. So yeah. So there's this idea of positional encoding, which I think is the third big idea of transformers, but it's been used in other contexts, too.
So maybe it's not just about transformers. But if you have a ConvNet and you don't want it to be shift-invariant-- convolutionally shift-invariant-- you can just tell the filter where you are in the image by adding this positional code. So I can just say I'm at the bottom right of the image, and then my mapping will end up being conditioned on the position.
Transformers, it's the same thing. I can tell you where the token comes from. And if you do that, then you do get sequential order modeled in the sequence, because you're telling the token, I'm the first item in the sequence, or the second item. If you don't do that, then you have the property of permutation equivariance, which is why I said it's really a set-to-set operation.
So if you don't tell the tokens where they come from-- you don't give them positional codes-- then if I take the tokens on the input layer and I permute them into any order, I will simply permute the tokens on the output layer. The mapping will be the same up to permutation. It takes a little thinking to see why that's true, but essentially the reason is because the attention layer is permutation equivariant.
Because the way attention works-- the way self-attention works is it looks at the color of this and the color of that, and it makes some similarity comparison. And then the weight of the weighted sum that goes to here is just going to be something about the similarity between the color, the query, and the key of this. So no matter where you move that edge around, as long as the input and output are orange and blue, it will get the same weight.
So you can kind of work it out and see that attention is permutation equivariant. The tokenwise operator is pointwise, so that's permutation equivariant, too. And that means the whole transformer is a permutation-equivariant function, and you can make it model sequences by telling it the position of every token in that set. But if you don't have position encoding, it's more appropriate for set-to-set mapping.
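A small NumPy check of the property being described: without positional codes, permuting the input tokens of a self-attention layer simply permutes its outputs, and adding a positional code to each token breaks that, which is what lets the model see order. The dimensions and random weights are illustrative.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(Z, Wq, Wk, Wv):
        A = softmax((Z @ Wq) @ (Z @ Wk).T / np.sqrt(Z.shape[1]))
        return A @ (Z @ Wv)

    rng = np.random.default_rng(0)
    n, d = 10, 16
    Z = rng.standard_normal((n, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    perm = rng.permutation(n)

    # permutation equivariance: f(P Z) equals P f(Z)
    print(np.allclose(self_attention(Z[perm], Wq, Wk, Wv),
                      self_attention(Z, Wq, Wk, Wv)[perm]))        # True

    # with positional codes added to the tokens, the equality no longer holds in general
    pos = rng.standard_normal((n, d))
    print(np.allclose(self_attention(Z[perm] + pos, Wq, Wk, Wv),
                      self_attention(Z + pos, Wq, Wk, Wv)[perm]))  # False, generically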
AUDIENCE: One more question or comment: you started with the idea of tokens, but just a few sentences ago, you said something that sounds like the idea of tokens is rather weakened-- that transformers work better when you have individual bits of data, like pixels. So it feels to me that the idea of tokens is not really essential to the whole concept, but attention is the most critical part. And the idea of being able to attend dynamically anywhere in the picture, if it is visual data, is what gives these algorithms their power, not the token per se?
PHILLIP ISOLA: Yeah, I don't know. And I think that's kind of the open debate. It'd be fun to keep discussing it. So I think the field is kind of split right now between people thinking that it's the tokens, the encapsulated vectors, that are important to this, versus the attention-- and really, if you just have an attentional mechanism, you could still be operating over neurons. It doesn't really have to be about tokens.
I feel like both probably are important to the success. But one counterexample to attention mattering is there are these other architectures now-- MLP-Mixer is one architecture in this family-- which use tokens, but not attention. So it's basically a ConvNet over tokens, one-by-one convs over tokens. They call it [? MLP-not-conv. ?] That's super confusing, but anyway, forget the terminology, because these things are all just small transformations of each other.
But anyway, this MLP-Mixer thing doesn't have attention. But it's still, in my view, a token net, and it seems competitive on some tasks with attention networks. So maybe attention is what matters, maybe it's not. [INAUDIBLE].
AUDIENCE: Related to this question, so it seems like, especially in the self-attention mode that you're describing, if you unrolled that, it's like a series of matrices I think going on there that could just be implemented as a straight up feedforward chain, if I'm following you. Is it just very deep? It's going to be skips, I think, but can you-- does that idea make sense to you?
PHILLIP ISOLA: Yeah. You can always-- you can always take any one of these arrows and just say, oh, that's actually just a matrix multiply. But it's a matrix with special structure. It's not a full rank matrix. And I think that that's one way of just understanding all the--
AUDIENCE: Of course I didn't mean it was any matrix, but I'm just like, you could express this as another-- I was trying to understand what the re-expression form might look like and then generalize that to-- and then you're back in standard mode again. Maybe that's closely related to what you were saying.
PHILLIP ISOLA: Yeah.
AUDIENCE: Where you have the second branch with the q, that's different. That's our active state versus this kind of deterministic processing based on--
BRIAN CHEUNG: It's kind of multiplicative there. That's where it differs from standard MLPs where they don't have this kind of multiplicative interaction. That means that you're dependent on the own-- you're self-contexting yourself, meaning that you're dependent on your own input when you process this particular input. So it's kind of like a hyper-network, I guess, in that case, where-- because we change the representation based on what your representation already is.
PHILLIP ISOLA: Yeah. So it's not like-- yeah, you can't just rewrite it exactly as these linear combinations on a regular feedforward network because it does do these multiplies. Maybe that's the one mathematical atomic unit that's different. But I think you can express these in different languages.
And I actually do like thinking of it as just a special kind of matrix. The matrix weights come from this other source, this self-attention mechanism. But then it's just a matrix. But that matrix has special structure.
And understanding that low-rank structure-- that's one way of understanding interesting architectures. What is the special structure you're imposing on the linear transformations, on the matrices? But yeah, this dot product here between the query and the key involves multiplication, which is not something that you would directly get in a regular network.
AUDIENCE: [INAUDIBLE] related to this, and also the permutation equivariance-- that's within a block. So what about across transformer blocks?
PHILLIP ISOLA: Yeah. So you could have-- you can have any set of tokens attend to any set of tokens, and they could come from one layer of the net and another layer of the net, or one block, another block. It could come from a text processing network, and an image processing network, and then the text attends to the tokens of the image.
And then it could come from just the text tokens attending to themselves, and that's mostly what I talked about. But-- actually, sorry, did that answer-- is that your question? I might've gone the wrong way.
AUDIENCE: Actually--
AUDIENCE: You said that you could permute within a block, the operations.
AUDIENCE: Yeah.
AUDIENCE: Pretty much. I'm just saying, if you have 12 transformer blocks, it's been shown that some of the earlier blocks learn more surface-form-type features, while later blocks learn--
PHILLIP ISOLA: Oh, yeah.
AUDIENCE: --high level things. And I'm just wondering about can you also permute--
PHILLIP ISOLA: Yeah. I think no, probably.
AUDIENCE: No? No.
PHILLIP ISOLA: So I don't think you can permute depth-wise. I think just you can permute within--
AUDIENCE: Within the--
PHILLIP ISOLA: --the input sequence, yeah.
BRIAN CHEUNG: Actually, with multi-headed attention, you know how they concatenate all the heads together to go to the next layer? Don't they lose the permutation invariance, or the permutation equivariance, when you do the concatenation process?
PHILLIP ISOLA: I don't think so.
BRIAN CHEUNG: Because the concatenation has to have some ordering.
PHILLIP ISOLA: Oh, interesting. Yeah, maybe. OK. So it's possible that transformers aren't--
BRIAN CHEUNG: At least not--
PHILLIP ISOLA: Always--
[INTERPOSING VOICES]
BRIAN CHEUNG: --multi-headed transformers.
PHILLIP ISOLA: --equivariant.
BRIAN CHEUNG: The individual transformer heads are equivariant.
PHILLIP ISOLA: Yeah, I hadn't thought about the multi-headed thing. OK, yeah. Because in the multi-headed, you take a weighted sum of the heads, and they're not--
BRIAN CHEUNG: The feature vectors, they concatenate together.
PHILLIP ISOLA: And that's a parameterized sum. And if you change the order up-- yeah, I think you're right. OK, interesting.
TOMASO POGGIO: Did anybody try, instead of using the Q and K matrix or the Wq and Wk, use a single matrix?
BRIAN CHEUNG: Yeah.
PHILLIP ISOLA: So one thing you can do is just get rid of queries, keys, and values, and just have your code vectors, z, and take the inner product-- or the outer product-- of Z with Z as your attention. And I think that can sometimes work just as well. I haven't really followed the latest on that, but you can have queries, keys, and values that are linear functions of your code vector z, or they can be nonlinear functions, or they can be the identity. And I'm not sure that there's a consensus on when you need which.
So one thing that's kind of interesting is that if you use the identity to create your queries and keys, you're basically creating a Gram matrix of all of your token vectors with themselves. So it's going to act like-- it's going to cluster the data, and there's been some analysis of how identity queries and keys will create this clustering, like a spectral-clustering-type matrix.
That will create this similarity matrix. When you hit the data with that similarity matrix, it'll group things that are similar. And maybe we can understand a little bit of what's going on from that perspective. Linear queries and keys are just some projection of that kind of thing.
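A minimal NumPy sketch of that identity-queries-and-keys case: the attention weights reduce to a softmaxed Gram matrix Z Z^T of the tokens with themselves, so each token gets averaged with the tokens most similar to it. Sizes and random data are illustrative.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    Z = rng.standard_normal((50, 16))                 # 50 tokens, 16-dim code vectors

    gram = Z @ Z.T                                    # similarity of every token to every token
    A = softmax(gram / np.sqrt(Z.shape[1]))           # attention with identity queries and keys
    out = A @ Z                                       # each token pulled toward similar tokens
    print(A.shape, out.shape)                         # (50, 50) (50, 16)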
TOMASO POGGIO: And how does that work if you don't have the Wq and Wk matrix, but just the identity?
PHILLIP ISOLA: Actually, yeah. Do you know, Brian, or does anyone know?
BRIAN CHEUNG: So ResMLPs are actually like this-- more exactly like this spectral matrix because what you do is you process the input, transpose it, process it again. If you work out both operations in sequence, they end up becoming Z-transpose Z with a matrix in the middle. So I think it does work, I guess.
TOMASO POGGIO: That was my-- maybe not as well, but it does work.
PHILLIP ISOLA: I think it works, just maybe not as well.
BRIAN CHEUNG: Yeah.
PHILLIP ISOLA: Because you still get interleaving with these nonlinear operations, you still get expressive power.
TOMASO POGGIO: But one other question. You mentioned, and there is a paper, I think, from DeepMind about using a single pixel as a token. It seems very little in terms of being able to establish similarities with other tokens because you have just color and intensity, right?
PHILLIP ISOLA: So my feeling there is that yes, on the first layer, that's a very bad token. You can have a 128-dimensional vector that codes just the color of a single pixel. It's not going to do much.
But then when you do this linear combination operation, it's like taking a one by one patch and making a bigger patch out of it. I mean, it can learn to mix information across the whole image, and say all of these similar colors-- maybe all the white stripes in the zebras will go with all the black stripes in the zebras because the black keys will match the white queries. So it could create these abstracted tokens as you go deeper into the network and build up--
[INTERPOSING VOICES]
TOMASO POGGIO: Many years ago in computer vision, there was the idea of using, instead of a patch, the value of the pixel and its derivatives at a single pixel.
BRIAN CHEUNG: Yeah.
TOMASO POGGIO: Which means, essentially, you're having non-local information, because the derivatives give you information about neighboring pixels-- spatial derivatives.
PHILLIP ISOLA: It's kind of like a little token.
TOMASO POGGIO: That's right.
PHILLIP ISOLA: It vectorized--
[INTERPOSING VOICES]
TOMASO POGGIO: It was a vector of-- and it's very much like a token, yeah.
PHILLIP ISOLA: Yeah. Yeah, so I don't think that these are new ideas, really. It's just it helped me to coalesce around the idea of a token. When I think of all the operations in terms of tokens, but we used to talk about hypercolumns, and feature vectors, and there's a million names for the same concepts. And Brian, do you want to pull up just so--
BRIAN CHEUNG: Sure.
PHILLIP ISOLA: --we have time for you too? Yeah?
AUDIENCE: You made the connection earlier with graph neural networks. I don't recall which one you said is more general of the two.
PHILLIP ISOLA: Well--
AUDIENCE: But I also don't feel the connection, because for graph neural networks, the connections in the graph, the structure, are very important. Whereas in transformers, you said it is less the case.
PHILLIP ISOLA: So yeah. I think it depends on how you define these things. But I like to think of graph nets as a broader class. And so transformers can be graph nets with a fully connected graph. So every token talks to every token, and the aggregation function that decides how to do the weighted sum of incoming messages is given by this attention operator. So the similarity between node A and node B tells you how much the messages from node B will be summed up into node A.
AUDIENCE: And you can have attention graph neural networks. That's why you feel it's--
PHILLIP ISOLA: I think it was independently invented over there, or maybe even first invented over there. So graph net people have done attention I think before transformers made attention really popular. Although, of course, attention is an old idea as well, and it has shown up a lot of times.
BRIAN CHEUNG: So I guess I can start now. Yeah. Phil gave a great overview, which actually is going to make this a lot easier, where I think I'm going to go over more of the AI machine learning community's perspective of attention and the developments happening in that community, and go over some kind of unintuitive things about transformers that I think are surprising, I guess, to a lot of people now.
To go over that, let's go over how it started. So originally, the transformer was a language model, and the language model paper had possibly the most elegant title you could imagine, saying that attention is all you need. Now, little did they know that the title would turn out to have a little more truth to it than I think most people realized, including, I think, myself-- which is that this is now a model that covers, as Phil mentioned, language, speech, vision. It's basically a universal model now for all the modalities of data that people work with.
And again, attention is not a new idea. This is something that was proposed even before this paper. But this paper proposed something that was very, very similar to the vision transformer attention where they created, essentially, attendable feature maps at the last layer or the next to last layer of a VGG network that was still spatial. And they're able to use this to show that when you do image captioning, it would attend to the correct parts of an object.
So the reason why I think this seminar is important is because, classically, whatever works well is always something we ask ourselves about: is the brain also doing this? That's been the classic pattern for a lot of the past 10 years of research, actually, a lot of it from Jim DiCarlo's group, and I think we have to understand what's going on to understand why transformers are so important, and potentially why we should either care about them or not care about them. But I think there's this all-too-familiar aspect where whatever is working extremely well, we need to compare that to the brain now. And I think we should get ahead of it this time, because in this case, we want to see where they're going.
And again, one of the things that's interesting is that when people compare these attention models, specifically certain versions of them, like CLIP, to human behavior, you get substantially improved performance in terms of explaining the misclassification errors of the human visual system. And I think the issue is we don't know why this is happening, in the sense that in this paper, they mention that the particular vision transformer model that they tested was far better than all their other models for explaining human behavior. But they weren't sure why at the time. This was from a 2021 paper, so I believe they're still not sure. All they found was that this particular model, CLIP, was an outlier in their comparisons to human behavior.
And also, there's been, like I said, this natural tendency to now compare convolutional networks to transformer networks, because the transformer networks are starting to outperform ConvNets on a lot of vision tasks. So the question now is which models are more similar to human vision in general, whether at the level of neuroscience or cognitive science. Are ConvNets more similar, or are transformers more similar?
And intuitively, you would think that there are certain properties that make a convolution a convolution, and a transformer a transformer. And what we believe makes a convolution a convolution is equivariance, meaning, as Phil mentioned, when you apply an operation to the input, the operation applied to the output should be the same type of operation.
Now the issue is that this turns out not to be exactly true. It turns out that after training, a vision transformer is more equivariant to translation than a ConvNet, which I think is quite surprising. So Phil might actually already know the issues with convolution not being equivariant-- one thing being that a lot of the pooling and other operations, even the nonlinearity, contribute to hurting the equivariance.
So this plot here is showing this paper's measure, which is derivative-based, of the equivariance error of a ConvNet-- this is a ResNet-50-- versus a vision transformer. And it turns out that after training, a vision transformer is more translationally equivariant than a ConvNet, which I think is quite non-intuitive, in the sense that we built the equivariance in specifically for the ConvNet. Yet here we are, and vision transformers are more equivariant after training.
And then the example on the right shows this is not just specific to these two architectures, but holds for many different architectures. And it's also interesting that they also measure MLP-Mixers, which, as Phil mentioned, is a new architecture that also has surprisingly high equivariance after training. And this equivariance improves as you increase ImageNet test accuracy. Does anyone have any questions here?
PHILLIP ISOLA: Brian, one question. Do you have positional encoding on these networks?
BRIAN CHEUNG: Yeah, you do have position encoding.
PHILLIP ISOLA: So it shouldn't be equivariant.
BRIAN CHEUNG: Oh, well this is like a measure of the equivariance.
PHILLIP ISOLA: Yeah. No, no, I mean, it shouldn't be like--
BRIAN CHEUNG: Oh, it shouldn't be. Yeah, it shouldn't be
PHILLIP ISOLA: It'd have to learn to be.
BRIAN CHEUNG: Well, in that case, equivariance is very strictly by the patch. In this case, they're doing small transformations.
PHILLIP ISOLA: Oh, I see. OK.
BRIAN CHEUNG: So it's not the size of a patch. But since you asked that question, there's another paper that looks at invariance. And this is for much larger translations.
And they show that actually after training, a vision transformer, which is a transformer model, is as invariant to translation shifts as a ResNet-18, which is a ConvNet. And this is interesting, because then the question is, what's going on here? We build these things into our models, but apparently not building them in also gives us the thing that we wanted.
And I think this kind of goes to the issue that maybe might be controversial, but has been kind of true, which is the idea of scalability. And are people familiar with the bitter lesson here? Or has anyone heard of the bitter lesson?
This is a classic controversial statement. Actually, I think a former professor here actually gave a retort to this thing by Rich Sutton, who wrote that the lesson that we should take away-- and this was written, I think, in 2019 or 2018, is that we should not be working on inductive biases for models, as much as we should be working on inductive biases for learning. And I think one of the key directions that organizations like OpenAI are going towards is the idea that we shouldn't really try to bake in these priors that we think are really, really useful because at some point, your data might give you that prior anyway. And that could be what's happening here.
And to give you a flavor for what's going on in the machine learning community, they're essentially optimizing the upper right-hand corner of this curve. And the x-axis here is the compute required to train these models, and the y-axis is the negative log perplexity, which is essentially the task error. Or task performance, I guess, not error.
And in this case, what they're doing is they're trying to find architectures that go all the way up here in the performance curve. And they're essentially paying for that performance, not by incorporating inductive biases into their model, but by trying to absorb those inductive biases through data.
So I think to give you a glimpse of what's happening in the future is that there probably will be less inductive biases in this community of machine learning because they're willing to pay this cost of compute to not have to build in the inductive bias that they would normally have to build in for smaller data sets.
And as Phil alluded to-- or as Tommy also mentioned-- these architectures are getting more and more general purpose. So in this case, this is a transformer architecture. But unlike a standard vision transformer, this is an architecture that doesn't attend over patch-level tokens, but treats each pixel as a token.
And what's interesting about this is that they actually trained this on ImageNet without any image prior. So the position encoding that Phil mentioned was actually learned by the model, and they could actually get performance on the order of normal image prior-oriented models like ConvNets and patched vision transformers. So the remarkable consequence of this is that there's no actual image prior, or modality prior built into this model. It learned it on its own. And even in that perspective, it's still competitive on ImageNet. So this is not necessarily a very large data set, either.
AUDIENCE: Do you know if the perceiver, Brian, also has a texture bias, for example, or if in general robustness?
BRIAN CHEUNG: I don't think they've looked into that. I think the robustness qualities are probably not too different from standard trained models. In terms of adversarial examples, transformers are-- well, I think it's controversial, but I think people will say transformers are a little bit more robust to adversarial examples than ConvNets. But in the grand scheme of things, they're both very susceptible to adversarial examples.
AUDIENCE: I mean, just to finish, I was just curious to see if-- there are many models now that can all do, whatever, 75% on ImageNet, but not all of them will see like a human, whether that's a goal or not, right?
BRIAN CHEUNG: Right. That's a good question. And I think we have to now be aware that this is where the field is going. Is that where we're going as well as a field? Not-- yeah.
AUDIENCE: What does it mean to learn position features in the case of an image?
BRIAN CHEUNG: So you randomly initialize the position encoding to be just from a random Gaussian or something.
AUDIENCE: Well, the position of the pixels in the image? So it's an unordered set of pixels?
BRIAN CHEUNG: Right. So you treat each pixel as an unordered element in a set.
PHILLIP ISOLA: Wait, wait, Brian. Is it the case that the top left pixel will always get the same positional encoding? It'll be a learned--
BRIAN CHEUNG: Yeah.
PHILLIP ISOLA: So--
[INTERPOSING VOICES]
BRIAN CHEUNG: So it's not like you shuffle every image independently of every other image, and then get--
PHILLIP ISOLA: So I think it's still inserting a lot of information there.
BRIAN CHEUNG: Well, the prior that they're inserting is that the topology is consistent across samples, which I think is reasonable for most modalities. You would imagine that the positions shouldn't be unique to every single sample. Because in that case, the problem would be really, really hard. I don't know if you can even learn anything--
PHILLIP ISOLA: If you just randomly shuffle all the pixels in an image, there's no--
BRIAN CHEUNG: Yeah, I don't know what you're going to be able to learn. Every image has its own permutation that you give it to the model. It could learn higher statistics, I guess, over what [? mixes ?] together. Yeah.
TOMASO POGGIO: Another number is how many training examples it takes to train this compared to a convolutional network?
BRIAN CHEUNG: Right. So I think one of the things that we know is that it takes more data to train a vision transformer--
TOMASO POGGIO: But do you have the number for this?
BRIAN CHEUNG: This is, I think, just ImageNet with augmentations, actually. This isn't a larger data set. This is just augmentation providing the extra information.
PHILLIP ISOLA: I don't have the numbers, but the [INAUDIBLE] I have seen tend to be-- do we have a whiteboard? So the ConvNet starts here and goes kind of flat, and the transformer starts down here and goes up. So it scales better, but for low data, it's doing worse.
BRIAN CHEUNG: Right. So I think that's one of the reasons why the community-- industry, especially-- is going towards this direction: because they're willing to pay for the scale via compute cost rather than relying on built-in priors. Because what happens is that convolutions, after a certain level of scale, will start saturating, while with transformers, increasing the parameter count still gives you positive returns in performance.
What I didn't discuss in this talk is that there have also been a lot of other variations of transformers, and it's kind of strange, actually. At least Google has done a meta-analysis on transformers and the whole zoo of variations of them. And at least in terms of scalability, it seems like the original transformer actually scales the best, which is a little bit odd.
AUDIENCE: Which ones scales the best?
BRIAN CHEUNG: Vision transformers. Well, there's vision transformers, and also sparse mixture of expert transformers. Those apparently both scale--
AUDIENCE: ViT. ViT, though.
BRIAN CHEUNG: ViT. Well, I don't remember if they were specifically doing vision in this case. I think it was more language tasks, but yeah.
But I think this tells you a story about where the machine learning community is going, which is not that the architecture matters the most, but that data is actually the really important aspect. And I think they are building architectures that aren't necessarily good at working well without being trained, but that work well at absorbing data a lot more efficiently than other architectures. And I think we should think about the things that lead to the resulting model we work with these days: it's not just the model itself with its architecture. There are priors about the compute, the capacity of the model to train over that data, and also, more importantly now, the nice thing is that, as Phil mentioned, a lot of these models don't require supervision. So they can absorb larger and larger data sets without much economic cost to the person training them.
And another aspect of transformers is that the way they were created was mainly as an alternative to recurrent networks, and the reason for that is because, as you see on the left here, I mentioned hardware. One of the things about transformers is that they're much more parallelizable on GPUs than a recurrent network is, and they run much better on massively parallel hardware than a recurrent network does. That's also why they've become very popular-- the parallelization is much easier than for other sequence models.
AUDIENCE: One comment, if I may, on how I view this. I feel that when you go to transformers, just by introducing the idea of attention dynamically handled by the data, you kind of have-- it may be an overstatement, but about as general an architecture as you can have. You have a really powerful architecture, and now to train it, you need a lot of data. And this is really where we are now, and more data does better just because the architecture is extremely general, much more so than convolutional, as you said. So I don't know if there would be any other [INAUDIBLE] architecture that would be [INAUDIBLE].
PHILLIP ISOLA: Maybe two comments on that. One is they are lacking in one thing, which is that vanilla transformers don't have memory. They don't have feedback connections, and so they're not Turing complete in the same way that an RNN is. Of course, people are adding memory and recurrence to transformers, but still, the majority of them don't have that. So that's actually, I think, a big limitation. They don't have memory.
And then two is, yeah. Eventually, lookup tables will perform best, or nearest neighbor will perform best because the limit of infinite data does work, and we don't know if these work in the limit of infinite data. So yeah, I agree it's not surprising that you need less structure with more data.
AUDIENCE: Brian, you were saying there's choices to be made if we're interested in models of brain systems.
BRIAN CHEUNG: So the question is, what are we interested in, in terms of if these architectures become less biased towards being structurally more relevant to neuroscience, but just being more task-relevant to neuroscience? Are we going to be stuck at some level of understanding that's only functional?
AUDIENCE: What makes them structurally less relevant? That's why I was asking these kind of weights questions? Why do you say that?
BRIAN CHEUNG: Well, I don't-- yeah, so I think the tricky part is that because things work well, people will find ways to say that they're more structurally relevant. I don't know if transformers are more structurally relevant than ConvNets. Obviously there's the Fukushima Neocognitron, which is inspired by neuroscience. But attention itself was never inspired by-- or at least transformers were never inspired by neuroscience.
So I don't know if they're actually more neuroscience-friendly in terms of similarity. But I think the bigger picture is that they're going to be more and more generic, and they're going to take less inspiration from the structural biases that we know about, necessarily-- though Phil mentioned the idea of recurrence and feedback. And those things become more-- because actually one thing that's interesting is that there hasn't been much-- I mean, people have proposed these architectures, and they do work.
But people aren't really using feedback versions of transformers, mainly because there's no recurrent nature to them. So for example, all these things are still feedforward architectures that still process from the bottom up. But I think the divergence is going to be: are we going to be interested in the same models that the AI and machine learning communities are interested in? Or are we going to be interested in specific models that work well for neuroscience, but don't necessarily have a functional performance equivalent to what these models have?
PHILLIP ISOLA: One question I have for the neuroscientists in the room is what is the scale of data that the human brain is trained on when they reach adulthood? And my rough estimate or understanding is it's similar to what the biggest transformers are currently trained on, or a little bit smaller than that, but much more data than ConvNets were trained on. And so all this stuff about how-- in which data regime do you get what kinds of performance, well, the data regime that seems most relevant to neuroscience to me seems more like this transformer regime. But I don't know if that's true. How many images do humans see compared to these models?
TOMASO POGGIO: This is not true for text.
PHILLIP ISOLA: Text, I think they're trained on much more data. Yeah. But for--
[INTERPOSING VOICES]
TOMASO POGGIO: A human cannot possibly read everything on the internet.
BRIAN CHEUNG: But there was a paper, I think, from Fred [? Aranki's ?] group recently that showed that even if you train a GPT-2 on a 10-year-old's amount of text data, it explains fMRI responses almost as well as having a lot more data.
TOMASO POGGIO: That's because fMRI is bad.
PHILLIP ISOLA: Oh.
AUDIENCE: [INAUDIBLE] training [INAUDIBLE] around 10 million tokens, which is like what a child would be exposed to at the age of 10, while most transformers are trained on billions of [INAUDIBLE]. And the-- yeah, it [INAUDIBLE] data are [? not ?] equally well. But yeah, [INAUDIBLE].
AUDIENCE: There's an assumption under this question, which is: are we actually interested in models of the system in the adult state, or are we interested in models of how the system gets to be in the adult state? Those are not the same question, right? So there may be a shift here between models that are-- and you called it out: hey, instead of having to hand-design things in, this is the bitter-lesson version-- we'll just lean on the data with a general, flexible thing and let the data push it, as long as our compute can handle that and we have enough data.
And I think the question that's interesting to some of us is about that end state: which of those end states looks more like the adult end state? That's agnostic to the path-- neither of them probably followed the same biological path-- but even under that assumption, what is the state of affairs? I don't think we know what the state of affairs is for vision transformers relative to ConvNets on alignment with even visual processing.
I mean, somebody here was asking about similarity-- maybe that was you. And at the neural level, that also requires mapping assumptions, and those get more complicated with transformers, right? But behaviorally, it sounds like in [? Gario's ?] paper, which you were pointing out, there's maybe some better alignment. But I don't know how they compare against the latest AT-trained ConvNets.
PHILLIP ISOLA: So yeah, I don't know about-- I think those [? studies ?] of actual alignment with neural recordings haven't been done, to my knowledge, and of course I'm sure people here will do that. But alignment in terms of functional capabilities does seem quite a bit better, just anecdotally, to me. Because, OK, ConvNets-- what can they do? What was demonstrated with the ConvNets of 10 years ago? Classify 1,000 animals, cats and dogs, and [INAUDIBLE] categories.
Well, sure, you can make ConvNets that grow bigger, but the current generation of the best models are transformers like CLIP. Of course, there's a non-transformer version of CLIP, but let's just say CLIP. And that seems much closer to the functionality of the human visual system, in that it can recognize millions of categories, or at least way more than thousands of categories.
And you can recognize compositions of categories-- you can type in "a red ball" and have it recognize the red ball from just one example of that. And these networks are getting to that point. So at the psychophysical level, I think they're getting closer. I don't know about the neural embedding level.
AUDIENCE: So something that I'd like to share is that there was one paper that my co-author William and I wrote and submitted to [? NeRDs. ?] It actually got rejected, but we just resubmitted to ICLR. It was about a transformer model that achieved state of the art on Brain-Score for area V4, which is kind of interesting because we went to the Brain-Score competition at the beginning of this year just hoping to participate. And all of a sudden, William trained this transformer-- a dual-stream transformer with adversarial training and rotations-- and we unexpectedly broke the record in V4 and wrote a paper about that.
In any case, what I think was interesting is that the same model, the exact same architecture, if you trained it another way-- just classical SGD on ImageNet, no fancy augmentations or adversarial perturbations-- the score wasn't that great. So I wonder, in general, should we just be thinking about the transformer model, or about the interaction of transformer models with a particular training regime, or maybe a fancier loss function that we haven't even conceived of? And suppose we do hit the explained variance, or a correlation of one, on Brain-Score for IT.
How do we even reverse-engineer from that? Because the model is just so big. I'm playing devil's advocate on my own work here. The model's so big-- how do we even go back and-- it's an open-ended question. I don't know if anyone has any ideas.
[INTERPOSING VOICES]
[? PHILLIP ISOLA: ?] [INAUDIBLE].
BRIAN CHEUNG: I mean, I think one of the things we often forget is that a model isn't just its architecture. As the slide before showed, a model is also its data. And once it's interacting with data, we have to understand the data, too, to understand what that model is doing. You can't just understand the architecture.
And I think as these architectures become more generic, data is going to play a larger and larger role, and we're back to trying to understand the data, not just the architecture. And I don't know if that's easier or harder.
TOMASO POGGIO: But also, the message is that for the last 10 years, until transformers came along three or four years ago, the success story in deep learning was convolutional networks. There was one architecture. Now there are several. So we have quite a few options, and they all perform pretty well.
PHILLIP ISOLA: So--
TOMASO POGGIO: And I think if you just compare functions, input-output, or how well they fit neurons, you'll find they're all doing OK. So you need a lot of other constraints, which means asking what can be implemented by neurons and synapses and what cannot, or is very difficult to see how.
PHILLIP ISOLA: Yeah, I agree. I gave this guest lecture in Tommy and Brian's class, and I was calling it the Anna Karenina conjecture, that as systems get more and more intelligent, they converge on the same representations, abstractions, models, and so forth, which other people have put forth.
And I think it's kind of the same here. I don't actually think the difference between transformers, ConvNets, and MLPs is that dramatic. I think it's more that as we get more and more data and optimize more and more toward success at some objective, the models will converge.
TOMASO POGGIO: Well, that's one way. But the other way is, as I said at a meeting a few weeks ago, it could be like flight.
PHILLIP ISOLA: It could be.
TOMASO POGGIO: You have a model of a bird. But that's not really good for everything. What is important is to understand the principles of aerodynamics. Then you can understand how birds fly, and how to build airplanes and other things.
Or maybe how a fly flies, which is different from birds because the aerodynamics involved are different. So I think principles are much more important than the specific implementations, which can be quite different. And the question is, what are the principles here?
PHILLIP ISOLA: Yeah, and I think they're similar principles. I think--
TOMASO POGGIO: Principles, yes.
PHILLIP ISOLA: All of these architectures are just reweightings of the same few ingredients. Factorization is in all of them, hierarchy is in all of them-- right? I don't know. Even in transformers and ConvNets.
Transformers can be rewritten as 90% convolution, with just a few little layers that are attention. If you look at the actual operations, almost every operation is a convolution in the sense of being one-by-one: chop the signal up into patches and process each one independently and identically. So I think the principles are going to turn out to be very similar.
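A minimal sketch of the point being made here, assuming a standard Vision-Transformer-style setup (the code is illustrative, not from the talk): the per-token MLP inside a transformer block applies the same weights to every patch independently, which is exactly a 1x1 convolution over the patch grid.

```python
# Illustrative only: shows that a per-token Linear layer (as inside a transformer
# block) computes the same thing as a 1x1 convolution over the patch grid.
import torch
import torch.nn as nn

d = 64                                  # token (embedding) dimension (assumed)
h, w = 14, 14                           # patch grid, e.g. 224x224 image with 16x16 patches
tokens = torch.randn(1, h * w, d)       # (batch, num_tokens, d)

# Per-token MLP layer: same weights applied to every token.
fc = nn.Linear(d, d)
out_mlp = fc(tokens)

# The same operation written as a 1x1 convolution over the patch grid.
conv = nn.Conv2d(d, d, kernel_size=1)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(d, d, 1, 1))
    conv.bias.copy_(fc.bias)

grid = tokens.transpose(1, 2).reshape(1, d, h, w)            # (batch, d, h, w)
out_conv = conv(grid).reshape(1, d, h * w).transpose(1, 2)    # back to (batch, num_tokens, d)

print(torch.allclose(out_mlp, out_conv, atol=1e-5))           # True: identical computation
```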
BRIAN CHEUNG: My question is, which principles should we care about now, given this heterogeneity in architecture but similarity in functional performance? Because I think it becomes easier if the community has something it cannot do-- whether it be flight, for example-- which forces the question: how do we achieve something we can't currently do?
I think the general spirit of the machine learning community is, oh, we're done-- we can just keep making these models bigger and keep doing this, and we'll be fine. But I don't think that's true. Right now, though, the spirit is in that direction, and that's why we keep saying, oh, of course this must be like the brain.
Of course this is what we care about. Of course all these post-hoc explanations are working. But once we hit a wall, I feel like then we'll know what's wrong and what's correct. Otherwise, it becomes kind of hard to tell right now.
AUDIENCE: On the topic of data efficiency, this might have an obvious answer, but I was wondering, when you have multimodal data, whether learning becomes a lot more efficient if you have, say, X bits of information from visual data and X bits of text-- captions associated with an image or something-- versus 2X bits of just visual data, and then [INAUDIBLE].
PHILLIP ISOLA: I don't know how well those things have been estimated, but the language-vision models are a lot better on certain benchmarks than the vision-only models, and it does seem like language must be incredibly valuable per word-- there's a lot more information in a word than in a pixel. So it seems like a lot of the recent success is just from leveraging language, at least in computer vision. Same in robotics and a few other areas.
BRIAN CHEUNG: I think what made CLIP a lot more powerful-- at least for the results of that psychophysics experiment from [? Garris ?] et al-- was the fact that CLIP was trained not on classification, but on caption similarity matching: matching to a text caption, which carries a lot more information than a single label for the entire image. A caption can tell you things about geometry. It tells you what's on the left, what's occluded, what season it is, or what time it is. It tells you a lot more than a single ImageNet class label would.
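A minimal sketch of the CLIP-style objective being described, under standard assumptions about the contrastive setup (the function name and shapes are illustrative, not from the talk): instead of predicting a single class label, the model scores every image in a batch against every caption and is trained to match each image with its own caption.

```python
# Illustrative CLIP-style contrastive loss: image-caption similarity matching
# rather than single-label classification.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings for paired images and captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))                # i-th image matches i-th caption
    # Symmetric cross-entropy: match images to captions and captions to images.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random "embeddings" standing in for image/text encoder outputs.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```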
PHILLIP ISOLA: Another-- OK, this is a little anecdotal, but what I've heard is that for training diffusion models, if you train them without language-- just an unconditional diffusion generative model of imagery-- it's really expensive, and we all thought, OK, we're not going to get into that game. It's not for us, it's for Google.
But if you train them text-conditional, then according to the students I've talked to, they train much faster, because in the text-conditional models the text gives you so much leverage. And so they said, no, no, we can train text-conditional models-- DALL-E-type models, like Stable Diffusion. Those things are within the budget of MIT.
TOMASO POGGIO: Could you do that to train conditional--
PHILLIP ISOLA: Because you have to have text-image pairs. So there's just a huge source of supervision there, as opposed to just random images. And if you have that, then you're in the hundreds-of-thousands-of-dollars range to train one of those big models, as opposed to the tens-of-millions-of-dollars range. This is the anecdotal [? thing that ?] students are saying right now. They might just be trying to get some GPUs. I'm not sure.
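A minimal sketch of what "text-conditional" means here, under common assumptions about the standard diffusion training setup (the toy denoiser, dimensions, and noise schedule below are illustrative, not the speakers' models): the denoiser receives the paired caption embedding as an extra input at every denoising step, so the text-image pairing supplies supervision throughout training.

```python
# Illustrative text-conditional diffusion training step: the denoiser is
# conditioned on a caption embedding from the paired text-image data.
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to x_t, conditioned on a caption embedding."""
    def __init__(self, image_dim=64, text_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, image_dim),
        )

    def forward(self, x_t, t, text_emb):
        # Concatenate noisy image, timestep, and caption embedding.
        return self.net(torch.cat([x_t, t[:, None].float(), text_emb], dim=-1))

model = TextConditionedDenoiser()
x0, text_emb = torch.randn(4, 64), torch.randn(4, 32)   # paired image + caption embedding
t = torch.randint(0, 1000, (4,))
noise = torch.randn_like(x0)
alpha = 1 - t.float()[:, None] / 1000                   # toy noise schedule
x_t = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
# Train the model to predict the noise, given the caption for this image.
loss = ((model(x_t, t, text_emb) - noise) ** 2).mean()
loss.backward()
```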
TOMASO POGGIO: By the way, I'll address a question to both of you: an important feature of transformers compared to previous networks is the fact that you don't have to worry about labeling when you use text, right?
PHILLIP ISOLA: Yeah.
TOMASO POGGIO: It's--
BRIAN CHEUNG: Well, I think that's the issue with the definition of supervision: I've never found a consistent definition of what a supervised task is versus an unsupervised one, besides economic cost. How much you spent to acquire the data seems to be the only consistent label.
TOMASO POGGIO: But if you speak about neuroscience, it's not only cost, right?
BRIAN CHEUNG: Right. But then, yeah, this is another divergence between the two communities, which is--
TOMASO POGGIO: But the real problem is the brain. Come on. Everything else is--
PHILLIP ISOLA: Well, more questions? I'm happy to keep chatting. I'm not sure when we're supposed to end.
TOMASO POGGIO: I think we've gone for quite some time. We can adjourn until the next iteration, sometime in the next few weeks. OK, thank you.