The Platonic Representation Hypothesis
Date Posted:
September 13, 2024
Date Recorded:
August 11, 2024
Speaker(s):
Phillip Isola, MIT
Brains, Minds and Machines Summer Course 2024
PHILIP ISOLA: So I'll talk about this position paper that we put out at ICML and this general argument that we call the platonic representation hypothesis. And this is work that's with Minyoung and Brian and Tongzhou. Brian is over here and was your moderator last night. So you can ask him the hard questions as well. OK.
So I want to start with this paper that I really like. This is a little bit older, 2015 paper on object detectors and how they seem to emerge when you train a network to do scene classification. So what this paper did is they trained a neural network to recognize different scenes and label them with kitchen or bathroom or whatever.
And then you can do what Jacob was talking about. You can do linear probes on different layers of the network. You can try to look at what signals in the input-- that's not really a linear probe, but it's related. What signals in the input most activate neuron A? And what images most activate neuron B?
So you can kind of imagine what it might be. It should have something to do with scenes. And what it turned out to be is objects. So there exist object detector neurons that fire selectively whenever they see an object. This neuron A fires whenever it sees a dog face. And this neuron B fires whenever it sees a robin.
And this was really cool because this is like a very concrete version of what I might be willing to call emergence. And it was one of the first that really convinced me at the time that deep nets are learning some interpretable internal structure. They're not just like a black box mystery thing. It's actually learning something structured that makes sense. To recognize this scene, you need to parse it into the object content.
OK, so a lot of other researchers went forward with that. We had this paper a little bit later in which we trained a network to colorize photos-- so take a black and white image and predict the missing colors. And we did the same experiment. So let's take neurons and see what input images most activate those neurons.
And so you can try to guess. What are going to be these units, these detectors in a colorization network? Anybody have a guess? What do you think it might be? It was dog faces and robins for scene classifiers. But what was it going to be for colorization?
AUDIENCE: Leaves maybe.
PHILIP ISOLA: Leaves? Yeah. Anything else? Maybe something about low level photometric statistics of the world. OK. No, it's dog faces. It's robins. It's flowers. It's the same thing. It doesn't matter if you train it to do scene classification or you train it to do image colorization, two things that seem very, very different. The network learns internal structure that is surprisingly similar.
And this is a story that we've seen over and over again. It's probably not that new to many of you. This is really textbook knowledge. These are figures from a textbook that we just published. So this is textbook knowledge. This is well-known.
So over the years, we've seen that different neural networks trained in different ways often learn the same types of internal units. And we're just stating that as a more provocative hypothesis of let's see how general that is.
Different neural networks with different architectures trained on different data sets with different optimizers, to what degree do they converge on the same way of representing the world? And our hypothesis is they converge to some degree.
OK. OK. So I know what you're all thinking. You're all thinking, well, of course they converge. They're all trained on the same data. It's all about the data, right? Machine learning, deep learning, it's all about the data set.
All our models are trained, in the previous examples I showed, on ImageNet and data sets like that. They're all trained on the same set of photos, so of course they learn the same representation. Or now all our models are trained on the internet. Of course they're going to just learn an internet model.
It's going to be the same because all the companies train their models in the same way on the same data set. OK, so that's one possibility. Another possibility is, no, it's not just the data, it's the architecture. We're all using transformers.
So of course we're going to get some kind of convergence. Another possibility is, let's see, the optimizer. OK? We're all using ADAM and SGD. Maybe it's the people. Maybe it's because we talk to each other. It's a sociological thing. We converge because we all go to the same conferences, we meet here, and I recommend that you use the same methods as I use. It could be all of those things.
But I'm going to say something maybe a little stronger, which is actually, the data differs between different models and the architectures differ between different models, but we still see some degree of representational similarity. What is the same between all these models most fundamentally? The world.
So this is the gist of the whole argument I'll make. I'll come back to this. It's all about the world. Any model that you train in this universe should learn structure related to the physics and the statistics of this universe. That's the basic argument at the highest level. OK.
So the outline of the talk is going to be-- first, I'll present evidence of convergence. Then I'll talk about some forces in machine learning that could be driving this kind of convergence. Then I'll speculate a bit and have a kind of toy mathematical model of what we might be converging to. If models are converging to similar representations, what is that endpoint to that convergence?
And then I'll talk about implications and limitations of this analysis. And there should be plenty of time for questions and interruptions and people saying, this is all wrong. So feel free to do that throughout the talk. It doesn't have to just be at the end.
OK. So let's first look at evidence of convergence. OK, so I already showed you some with training neural networks and computer vision to do different tasks. We always get the dog face detector or the human face detector. We always seem to get neurons that detect those types of patterns, regardless of the objective.
And I think the most classical and striking version of this still today is the emergence of these Gabor-like edge filters in the cat cortex and in AlexNet, shown on the right. So this is a biological system. If you look at V1, it will have these Gabor-like detectors.
Maybe that's not all there is, but that's part of it. In AlexNet, this is literally all it is because we can characterize all the filters in the first layer of that neural network. And these are the receptive fields or these are the filters themselves, in fact. And they look quite similar.
So that's the classical result. But that's some kind of convergence. The first layer of processing in convolutional neural networks looks like it converges on the same set of filters. And there's kind of proofs for when that might happen, going back to things like Olshausen and Field and sparse coding and so forth.
But people have carried that story a lot further forward to the point where it's no longer provable, but it's just an empirical observation. And here's one recent paper that I like. This is a paper that identified these things that they call Rosetta neurons.
And so Rosetta neurons are, like the Rosetta Stone, neurons that are paired between different neural networks, that can be used to bridge or translate between different neural networks. So in all of these networks, in the columns, there exist these four different neurons.
These are kind of convolutional response maps. So it's showing where in the image the neuron fires-- that's what the heat map is showing. And there exists a neuron in StyleGAN, which is a generative model, and a neuron in ResNet, which is a classifier, that will fire whenever they see a red Santa hat.
OK, this is a redness or maybe a Santa hat detector. And there exists a neuron that's a cat eye detector. And they found that there were about 20 or so of these Rosetta neurons that were just the same receptive field, the same response pattern across all the different models that they tried. So our conjecture is actually it's much more than 20. That we'll just go in and all of them will be the same in the end. But they didn't go quite that far.
So that's one more piece of evidence. I'm only going to be selecting a few different papers to highlight here before I get to some of our new results, because there's really a lot of people that have studied this question and I can't cover all of it.
So here's another one. This is a paper that popularized this approach called model stitching. And what they do is they take a computer vision system, so a convolutional network, and they break it into two halves. They train the first half with data set A and they train the second half with data set B, or even changing the objective in some cases. Like train the first with a cross entropy objective, the second with a regression objective or like a least squares objective.
And then they stitch them together with a linear layer. And the question is whether this network can correctly operate-- correctly classify the input photos you give it-- when it's only learning this linear transformation between the output representation of the first half and the input representation of the second half.
So if this representation is incompatible with that representation, then a linear mapping wouldn't be able to align them and operate correctly. But if they're actually learning fundamentally the same information just up to a linear transformation, then this should work. And what they found is that this works.
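To make the setup concrete, here's a minimal PyTorch-style sketch of the idea, not the exact code from that paper; the two pretrained halves, the channel counts, and the 1x1-conv stitch layer are placeholders for whatever that work actually used:

```python
import torch.nn as nn

class StitchedModel(nn.Module):
    """Bottom half trained on task A, top half trained on task B;
    only the linear 'stitch' layer in between is trained."""

    def __init__(self, bottom_half_a, top_half_b, channels_a, channels_b):
        super().__init__()
        self.bottom = bottom_half_a.requires_grad_(False)   # frozen, trained on A
        self.top = top_half_b.requires_grad_(False)         # frozen, trained on B
        # A 1x1 convolution acts as the per-location linear map between representations.
        self.stitch = nn.Conv2d(channels_a, channels_b, kernel_size=1)

    def forward(self, x):
        return self.top(self.stitch(self.bottom(x)))
```

If the two halves learn compatible representations up to a linear transformation, training only the stitch layer should recover close to end-to-end accuracy; if not, there will be a large stitching penalty.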
OK, so here, this is a random baseline, and this is the stitching penalty I get-- the decrease in performance due to stitching versus just training the whole thing end to end-- plotted against the fraction of the layers at which I'm cutting off the bottom half of the network and then stitching on the top half of the network.
And you can see there's a big stitching penalty for random networks. But for these pre-trained networks trained on self-supervised tasks, on supervised tasks, on all different types of tasks, there's really barely any stitching penalty.
So these flat lines are saying that if I train network half A on one task and network half B on a different task and I put a linear layer in between them, then it just works. It works just as well as end-to-end training the whole thing. That's more or less what this is saying. OK.
So that's another line of prior evidence for this. Somehow these different vision systems trained in different ways on different data sets are learning representations that are compatible up to a linear transformation.
Another very popular way of characterizing representations is to look at how they measure distance between different data points. And so this is going to be the main one that we look at for the rest of the talk. And I'll tell you the technical details of how this works now.
So this is sometimes called representational similarity analysis or kernel methods. So what you do is we will first restrict our attention to representations which are vector embeddings. So a mapping from some data to a vector in R^d, let's say.
Yeah, I should move so that you can actually see the equations here. So we're going to characterize such a representation in terms of its kernel. So what is the kernel? The kernel is the function that specifies how does that neural network, that embedding, measure distance between different data points?
So it's the inner product between the embedding for data point i and the embedding for data point j. So if these are images, I embed the two images to get two vectors, I take their inner product-- their dot product-- and this creates the kernel, the similarity function.
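To make the definition concrete, here is a minimal sketch in PyTorch-style Python; the embedding function `f` and the batch of inputs are placeholders, not any particular model from the talk:

```python
import torch
import torch.nn.functional as F

def kernel_matrix(f, xs, normalize=True):
    """Kernel K[i, j] = <f(x_i), f(x_j)> over a batch of inputs xs."""
    with torch.no_grad():
        z = f(xs)                      # (n, d) embeddings
    if normalize:
        z = F.normalize(z, dim=-1)     # optional: cosine-style kernel
    return z @ z.T                     # (n, n) similarity matrix
```

Something like `kernel_matrix(dino_encoder, images)` and `kernel_matrix(clip_encoder, images)` (placeholder names) would give the two matrices compared a few slides later.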
OK, so visually it looks like this. Here again with an image example, but you could do this on any type of data. We will create the kernel for a vision system by asking, what is the similarity of the embeddings-- the representations-- for the apple and for the orange? So the apple and the orange are quite similar to each other, according to this particular model.
And so the kernel will say that these have high similarity. However, the apple and the elephant are going to be lower similarity. So the apple and the elephant or the orange and the elephant will be darker. So this is what the kernel tells us. How does the neural network represent the similarity or difference in representation space between different inputs?
OK, so this is a really important object for understanding and representation. How does it measure distance? It shows up all over in machine learning, kernel methods, and so forth. I mean, I know these are a little out of fashion, but they're still important.
And in neuroscience, the name for this is representational similarity analysis, working with representational dissimilarity matrices. OK, so a lot of fields have come up with the same thing. Now, in order to characterize whether the representation of one neural network is similar to the representation of another neural network, we're going to measure the similarity between the kernels induced by those two neural networks.
So it's a similarity of similarities. OK, so this is one neural network. This is the DINO neural network. It's a computer vision system. Here's another one called CLIP. It's another computer vision system. I can create the DINO kernel evaluated over these emojis. I can create the CLIP kernel evaluated over these emojis.
These are maybe slightly different. But if these two vision systems represent the data in the same way up to similarity analysis, then these two kernels should be similar. And our kernel alignment metric or our kernel similarity metric will just be some distance between the kernel matrices for two different neural networks. Any questions? That's going to be the technical object that we look at for the rest of the talk. Cool. OK. Yeah, one.
AUDIENCE: I do have a question about the stitching part. When you are saying stitching, are you referring to the same architecture or different architectures?
PHILIP ISOLA: Yeah. So in the stitching work, I believe you need to have two architectures where you can take the output of one architecture and put a linear mapping to the input to the second half of the other network. So it doesn't have to be that they're identical architectures. And I think probably some of the stitching work did not use the same architecture for the A and the B half of the network.
But they have to be compatible up to a linear transformation. So if the first one outputs an N dimensional vector and the second one takes as input an M dimensional vector, then we'd have a linear map from N to M. So they don't have to be the same architecture. I don't remember what they did in that paper that I showed. They might have just used the same architecture. Yeah.
AUDIENCE: What is the rationale for doing this kernel similarity as opposed to something else like a transferability [INAUDIBLE] resolutions? How do you think the orientation or [INAUDIBLE] for example?
PHILIP ISOLA: Yeah so an alternative to this would be to try to embed the images with DINO and then learn a mapping that tries to predict what would the equivalent representation be. And if it's a linear mapping, that's again like linear probes and things Jacob was talking about.
So I think that's perfectly fine to do and that would be complementary. The main reason we didn't do that here is it's costly because you have to train those models. And yeah, I would just say this is another choice that's cheap to analyze. But I actually do think we should do this. So yeah. OK, good? So moving on. Yeah, one more question.
AUDIENCE: [INAUDIBLE] Like, this doesn't account any dimensionality, number of nodes, like [INAUDIBLE], right? Is there any way that [INAUDIBLE] number of dimensions, number of activation nodes that we use to calculate [INAUDIBLE].
PHILIP ISOLA: So we're really trying to characterize representation in a way that's invariant to some of those details. So maybe those details matter. Like, what is the dimensionality of the embedding? Maybe that matters for certain reasons. And that actually might matter with the metric trying to predict the embedding from one net given the embedding from another net.
But this factors that away. It reduces everything to a characterization of the representation which has the same dimensionality, the same format, regardless of what original architecture you used.
So regardless of what the dimensionality N is of the embedding, the kernel will be the same size. So we consider that to be a positive. This means that we can analyze things that differ in their dimensionality.
You might also be familiar with work that talks about how SGD finds similar parameter vectors that minimize the function or that all live in some basin, in some sense. And that's looking at convergence of parameters. And this is completely invariant to parameters. This is like a function space representation-- function space characteristic of the network, as opposed to a parameter space representation. So we're getting away from the number of units, the number of parameters. And we think that's a positive. It abstracts things a little bit. Yeah.
AUDIENCE: So do you use the embeddings from a particular layer or is it all of the layers?
PHILIP ISOLA: Yeah, so that's another good question. So for most of the computer vision systems, there's kind of a canonical layer that is used as the representation. It's often right before the logits, or sometimes in models like SimCLR, there's a projection head on top.
But there's often a canonical representation that people use. And that's the one that we use. For other models, like language models-- which we'll get to later-- there's not a canonical embedding layer of like GPT-3 or whatever. And for those models, we just kind of concatenate a bunch of layers together.
It's a bit ad hoc, but we didn't have a principled way of deciding which layer to use. OK, so how do you measure the similarity between two matrices? There's a lot of choices. You could just take the distance between them. However, the one that we found that actually shows the trends the most cleanly is this one, which is kind of a nearest neighbor based metric.
So what we do is we take a set of data points, we embed them with neural net f. We take that same set of data points. We embed them with neural net g. And we ask, what percent of the nearest neighbors of a given data point are shared between f and g?
So here we have the nearest neighbors to the blue point are red and yellow, and here they're also red and yellow. But this one differs in the purple data point. So this means that two out of three of the nearest neighbors are shared between the embeddings for f and g.
OK, so this is one way of measuring the similarity between two kernels. Because the kernel tells you what the nearest neighbors are. Essentially, it tells you the distance to all of the data points in embedding space. And this is saying, do the kernels induce the same set of nearest neighbors between network f and network g? Or how many of the neighbors are the same?
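As a rough sketch of that metric, assuming the two kernels were computed over the same n data points and with the number of neighbors k as a free parameter:

```python
import torch

def mutual_knn_alignment(K_f, K_g, k=10):
    """Average fraction of k-nearest neighbors shared between two kernels."""
    n = K_f.shape[0]
    eye = torch.eye(n, dtype=torch.bool)
    # Exclude self-similarity so a point is never its own neighbor.
    nn_f = K_f.masked_fill(eye, float("-inf")).topk(k, dim=-1).indices
    nn_g = K_g.masked_fill(eye, float("-inf")).topk(k, dim=-1).indices
    # For each point, check which of f's neighbors also appear among g's neighbors.
    shared = (nn_f.unsqueeze(-1) == nn_g.unsqueeze(-2)).any(-1).float()
    return shared.mean().item()
```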
So that's our particular metric that we use in our experiments. But in the literature, there are a few other metrics. One is called CKA, Centered Kernel Alignment. They all kind of roughly tell a similar story. I'll mention a point about CKA a bit later. Yeah.
AUDIENCE: Yes. I was just going to ask, have you tried other ones that work less well? If so, do you have any explanations on what works better?
PHILIP ISOLA: Let me say that. So CKA works less well. Some of the trends are the same, but some of them differ. And I'll bring that up in the limitations section. So we'll come back to that. Did I see one more question? Yeah.
AUDIENCE: When you say in this comparison that one of them works less well than the other, how are you comparing that to? What's the ground truth?
PHILIP ISOLA: Yeah. So what I mean by that will become clearer on the subsequent slides, but basically I'm going to show some trends, and these trends show up when we use this metric. They don't show up as cleanly when we use some of the other metrics. So the type of convergence that we will observe is convergence in the nearest neighbor structure of two representations. It's not convergence in certain other types of structures of two representations.
AUDIENCE: But how do you know that that's an actual proof, like a trend that's actually [INAUDIBLE] something serious.
PHILIP ISOLA: We could have kind of hacked the metric to find the one. I think that's a good question to keep in mind. That maybe is a limitation that-- yeah, maybe we found the one metric that shows what we wanted to see. But keep thinking about that. I would say it the other way around. I would say we found a sense in which representations are converging, and that sense is the nearest neighbor structure. OK.
So of course, there's a lot of research on this same question in neuroscience. So I want to go over this a little bit before getting to our new experiments. So this is the representational similarity analysis line of work. And what they do there is the same type of thing.
They take two images. They show them to a monkey or a human or some animal. And they measure the activations inside the brain. OK, so we get an embedding in the brain, some activation vector. And we can look at the similarity between the activation vector for face and face, and it's high. Or this is dissimilarity, so the dissimilarity is low.
And the dissimilarity between face and house is higher. These are not considered similar by the animal's brain. So that's how you extract kernels for the brain. You can use neural recordings for that.
And the interesting thing there is that people have found that the kernel for a computer vision system, an artificial network, and the kernel for the macaque brain are to some degree quite similar looking. So we could measure-- OK, sorry.
This is the kernel for IT in the macaque, I believe. And this is the kernel in a deep net from a few years ago. And they look quite similar. You could measure the alignment in different ways. You could use the nearest neighbor metric. They were using something else at the time. But just visually you can tell they're quite similar. So there's a cluster of things which is faces and a cluster of things which is houses. And that same cluster appears in the monkey.
OK. So deep nets and the primate visual cortex seem to organize data in somewhat similar ways. That's this other line of research. And I'm not going to get too much into the neuroscience. There's a lot of controversies about exactly how to measure that. But we thought, we'll just look at artificial networks and we'll do the same analysis there. OK.
So you can also do this with behavioral studies. And this is actually some work that we have done in my group some years ago, or I was involved in some of this work. So instead of measuring probes in the brain, I can just ask a human, how similar are these two images and how similar are those two images?
So we're going to do that test with you right now. We'll just do a little game. This is work that we did a few years ago. It led to this metric called LPIPS, which you might have used before. So the question is, how similar does a human think this image is to that image?
And we'll make a neural net model that will output the same similarity as a human would output if you just asked them. So there's behavioral data behind this. We didn't just ask them how similar are these two images. But we had various just-noticeable-difference type tests and two-alternative forced-choice type tests. But those are just details.
Somehow, we asked humans, how similar are these two images? So I'm going to ask you that now. So here is the reference image. And I'm going to ask you how similar is this image to the left image and then the right image. So here's the left image and here's the right image. And what I want you to do is clap if you think the left image is more similar to the middle image than the middle image is to the right image. So clap if you think that.
[SPARSE CLAPS]
OK, a few people. Sorry, there we go. You did that. OK, now clap if you think the right image is more similar to the middle image than the middle image is to the left image.
[CLAPS]
A lot more people. OK, so that's what you said. You're all human. And this is what our participants say too. OK, but why is that? Interestingly enough, it's not trivial. If you look at the Euclidean distance between this image and that image in pixel space, they're actually quite similar.
And this image and this image have a higher Euclidean distance because it's warped. The pixels are all a little bit misaligned. And this is the most standard, classical way of measuring how similar two images are-- it's what's behind metrics like PSNR. There's other metrics you might have heard of, like SSIM. Same thing.
They think this image and that image are similar. So classic image processing people didn't know how to measure similarity. It looked like the brain is doing something dramatically different. But what about just the deep nets? Do they learn a notion of similarity that is the same as humans or not? OK.
So the way we can do that is, again, construct the entries in the kernel matrix for a computer vision system. So we pass an image and another image through the neural net. And we basically take the difference between the activations at all layers-- so here, we're not using just the final embedding, but all the deep layers of the network-- do some normalization, and average.
And then finally, you get a scalar out, which is just saying, what is the distance between this image and that image, according to the activations of this neural net? So this is building one entry in the kernel matrix.
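Schematically, one entry of that perceptual-distance kernel looks something like the sketch below; it leaves out the learned per-channel weights of the actual LPIPS metric, and `backbone` is a placeholder that is assumed to return a list of feature maps, one per layer:

```python
import torch.nn.functional as F

def perceptual_distance(backbone, x0, x1):
    """Rough LPIPS-style distance: compare activations at every layer."""
    feats0, feats1 = backbone(x0), backbone(x1)
    d = 0.0
    for f0, f1 in zip(feats0, feats1):
        f0 = F.normalize(f0, dim=1)        # unit-normalize along the channel dimension
        f1 = F.normalize(f1, dim=1)
        d = d + (f0 - f1).pow(2).mean()    # average squared difference at this layer
    return d                               # scalar distance between the two images
```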
OK, so here were the empirical results. I'll show a bunch of networks, and on the y-axis we're going to be measuring how often those networks agree with humans as to which of the two possible images in these triplets is more similar to the reference in the middle.
So if the network says that image A is more similar to image B than it is to image C, and humans also say that image A is more similar to image B than it is to image C, then we'll get a high agreement with humans.
So I already told you that just the Euclidean distance between the pixels doesn't work very well. Well, 70%. OK, SSIM and these classic metrics don't work too well. And yeah, question?
AUDIENCE: The last one, human one, is just human versus human?
PHILIP ISOLA: This last part is human versus human. It's a noise ceiling. How often do two humans agree? And these are AlexNet and VGG, some old neural networks, right? I mean, this is a little bit of older work.
And these aren't trained to fit human data. This is a classifier. This is an image classifier, again. And yet they're better predictors of human similarity judgments. They have more similar kernels to the human behavioral psychophysical kernel than all the kind of classic similarity metrics.
And it doesn't really matter what architecture you use or if you train these with supervised data or these are self-supervised methods. It's all kind of the same. Well, not quite the same, but close enough.
So there's some details. I'm not going to talk about each of these methods and the differences. The main effect is just that if I train a neural network, a deep neural network, it learns a notion of similarity which is related to the human notion of similarity by behavioral response.
So we did another paper on this just last year. So this is a little bit more up to date. And in this paper, we wanted to-- in the LPIPS paper, the last one I showed you, we were looking at the kind of low-level notions of similarity.
Like if I blur an image, do I consider that to be less similar than a distorted image? And here we're just saying, take two random photos, and they can differ in a lot of different ways. The background can differ and so forth. How similar does a neural net think those two are and how similar do humans think those two are? And do these trends still hold up in 2023?
OK, so let's play the game again, just because it's kind of fun. So we're going to do the same thing. So you're going to first clap if you think this image is more similar to this image than this image is to that image. OK, clap if you think the left one is the more similar image. OK? No? Right one.
[CLAPS]
OK. So, good. This is actually what that model-- the previous one I talked about-- decided. Because it was really only able to model low-level similarity very well. And AlexNet and VGG, they kind of agreed with humans on blur and so forth, but they didn't agree on these types of more complicated data. OK, another one. OK, clap if you think it's the left. Right.
[CLAPS]
You see that you have a lot of agreement. And again, LPIPS disagreed. So it wasn't sufficient. It actually didn't model humans as well as we thought. OK, this is a little bit harder. Think about it. Left.
[CLAPS]
Right. I think there might be some weird bias going on, because in this case, actually-- oh, wait. Oh no, I got it wrong. No, you're right. Because I chose all these-- these examples are all examples where humans disagree with LPIPS. So actually, humans did think that this one is it. I might be mixing left and right up because I think I'm rotated.
OK. This is pretty easy. OK, this image? This image.
[CLAPS]
Yeah. OK. So I gave it an easy one, and actually, LPIPS agreed with humans on that one. So LPIPS is not terrible. But now the question is, OK, fine. So AlexNet actually only captured low-level similarity that agrees with humans.
What about the newest networks? What about CLIP and DINO and these foundation models, the newest generation of computer vision systems? Do they agree with humans on these more complicated images? And it turns out they do.
So again, this trend is just holding. As time passes, these bigger, better computer vision systems are agreeing with humans on more types of data. So again, the same graph. Agreement with human judgments, some different models. You can see the LPIPS one I showed you has a medium agreement with humans on these complicated images. But the latest computer vision systems, which are using architectures which are five or six years after LPIPS, are agreeing with humans much better.
So just by waiting for better computer vision systems to come out, we get better agreement with human psychophysics. You might be asking yourself, wait, but does DINO also agree on those low level blur and so forth? And yes, it does, but only about as well as LPIPS. It seemed like LPIPS already kind of saturated that. Yeah, question.
AUDIENCE: These models were not trained on the same data, right? They have vastly different data.
PHILIP ISOLA: Yeah.
AUDIENCE: So it's not about the architecture. They just see more statistics about the world.
PHILIP ISOLA: Yeah. That's the rough hypothesis we'll get to, that they're converging to something which is like the statistics of the world, natural image statistics, rational analysis kind of argument. Yeah.
AUDIENCE: Like the transformer training, medium images and the image that [INAUDIBLE].
PHILIP ISOLA: Yeah. So these are trained on much more data than these were trained on. And when you train on more data, you get a model which better matches human perception. But they're not trained to match human perception. That's the point. OK, yeah.
AUDIENCE: Just want to make sure I understand. So how do you teach these models to output the pair of similarity, like A versus B and A versus C? What do you do with that?
PHILIP ISOLA: Yeah. So you measure the distance according to the model between A and B and you measure the distance between A and C, and you compare those two. And whichever one is bigger is the model's choice.
AUDIENCE: Yeah, but the distance is decided by you or by the model?
PHILIP ISOLA: The model. The model outputs a distance. Given a pair of images, the model gives you a distance. So I'll just quickly show-- well, maybe I've gone back too many slides.
AUDIENCE: It's just the Euclidean distance between the embeddings.
AUDIENCE: Well, why Euclidean? This is [INAUDIBLE]. What are they?
PHILIP ISOLA: You can choose other metrics. They're all going to be in pretty high agreement with each other. But yeah, I believe in fact we're using cosine similarity. OK, yeah. Sorry, here it was. OK. In this paper, it was the L2 distance. But anyway, it's kind of a detail. I don't think it matters.
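So, concretely, the model's "choice" in the two-alternative test is just a comparison of two such distances; a sketch, with `dist` standing in for whichever distance function the model defines:

```python
def model_prefers_left(dist, ref, left, right):
    # The model "agrees with humans" on a triplet when this boolean
    # matches the majority human judgment for the same triplet.
    return dist(ref, left) < dist(ref, right)
```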
OK, so let's get to some new 2024 experiments from this platonic paper. OK, so this was all history and prior work, which I think argues for this hypothesis. And we wanted to test it more directly.
OK, so we have some similarity between different computer vision systems. What we want to look at next is, is this similarity increasing over time? And if it is increasing over time, then that suggests there's some kind of convergence going on. Over time, as we train bigger and better models on more and more data, if they agree more and more with each other and potentially with humans as well, then it's like we're all converging to something.
And what is that and how far will that go? And the hypothesis is an investigation of that question. OK, so we're going to now run a study with all of the 2024 models looking at the kernel alignment between each of these models with each of the other models. We're asking, is one vision system becoming more alike to other vision systems as a function of time?
So hypothesis one is that, no, not necessarily. Actually, there are different ways you can fundamentally represent the visual world. You could represent it in terms of edges and textures or objects and events. There's many different possible ways of being visually intelligent. That's hypothesis one.
OK, maybe Tommy thinks this is the case. And I don't like putting forth a hypothesis. What I'm saying is meant to be thought provoking, but not necessarily something I strictly believe is true. OK, so there's probably some truth to this. But the other possibility is no, all strong visual representations are alike. There's only one way to do vision. There's only one way in this physical world, to do computer vision or human vision or whatever it is. There's just one way to do it.
So here's the first new study. So what we did is we took a bunch of different computer vision systems. Some are ViTs, vision transformers, some are CNNs. None are RNNs. But anyway, a few different architectures. They're trained on different data sets. Some are even trained on synthetic data as opposed to real data.
And in this plot here, I'm going to sort the models by their performance on this general competency benchmark called VTAB. So VTAB is just measuring a vision system by how well it does at a bunch of different things-- classification, bounding box prediction, counting objects, and so forth.
And I'm going to group the models into five bins of competency according to the VTAB performance. And then we're going to ask, what is the variability within each bin between the kernels? And so on the y-axis is within the bin, how similar are the representations learned by those different models?
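A sketch of that analysis, using the `mutual_knn_alignment` helper sketched earlier and assuming `vtab_scores[i]` and `kernels[i]` belong to the same model:

```python
import numpy as np

def within_bin_alignment(vtab_scores, kernels, n_bins=5):
    """Average pairwise kernel alignment within each bin of model competence."""
    order = np.argsort(vtab_scores)                 # sort models by VTAB performance
    bins = np.array_split(order, n_bins)            # five competence buckets
    results = []
    for b in bins:
        pairs = [mutual_knn_alignment(kernels[i], kernels[j])
                 for i in b for j in b if i < j]
        results.append(np.mean(pairs) if pairs else float("nan"))
    return results                                  # one alignment score per bin
```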
OK, so here's the result. Well-performing models have very similar representations on their embedding layers, not in their outputs. Poorly performing models are all different. So you may recall that this sounds like the first line of Anna Karenina now, right? All happy models are alike. All unhappy models or all poorly performing models are different. They're poor in their own way. Yeah.
AUDIENCE: But when you do kernel, you have this 1 over n, this representation. So the fluctuation in the kernel is order of 1 over square root of n. So when you have models with wider layers, you have the fluctuation is smaller.
PHILIP ISOLA: Yeah, you mean the dimensionality of the embedding is kind of creating a bias. Yeah, that's a good question. We should systematically look at that. I don't think that's going to explain much of it because most of these models have similar-- I don't know. Brian, did we fix the--
AUDIENCE: [INAUDIBLE]
PHILIP ISOLA: Oh yeah, it's also the nearest-- well, we should think about it a little bit more.
AUDIENCE: What happens when you try to do it with the Gaussian process?
PHILIP ISOLA: So I'm not sure if we've tried that. But we have tried doing this with just randomly initialized networks, which will vary in their embedding dimensionality. And those don't align with any of these models.
So if it were just about as dimensionality goes to infinity, you approach some limit which is convergent, then you would expect a random network would also have that property. And that doesn't happen. We can talk a little bit more in detail about it. I think it's something we should look at more systematically. But I don't think that's explaining the data.
So here's just another way of looking at that same result. OK, I'll come back in just a second. This is a UMAP. So if you know t-SNE, it's similar to t-SNE. It's just an embedding where two models with similar representations are near each other. In this scatter plot, each point in the scatter plot is a different model.
And you can see that the kind of main thing that controls how models cluster, which models have similar representations, is not their architecture, it's not their training data, it's their performance. So competent models are all alike in their representations. OK, maybe not too surprising, but this is the data. Yeah.
AUDIENCE: What's the set of images over which you're computing the distribution of [INAUDIBLE] similarly? Like, how sensitive is this?
AUDIENCE: Yeah, that's [INAUDIBLE].
AUDIENCE: [INAUDIBLE] versus all of our lines versus your physically impossible scenes versus--
PHILIP ISOLA: Yeah. So the kernels are evaluated in general over this Wikipedia image captioning data set. Brian, is that true for this experiment? This is Wikipedia data? Yeah. So it's images found on Wikipedia.
I don't know that we've really looked at the sensitivity to different data distributions for evaluation. They're all trained on different data. But for evaluating the kernels, yeah, it was always Wikipedia. That's a good thing. We should check that. OK. Yeah. One more.
AUDIENCE: Does this sort of dissolve the [INAUDIBLE] distract us from those thoughts [INAUDIBLE]?
PHILIP ISOLA: We'll get to that in the implications. But you would think, well, OK, if all good models are somehow converging, then just take them all and ensemble them. It should work well. They should already be kind of aligned and ensemble-able. So I think that is an interesting thing that you could try.
But in order to ensemble, you'll have to deal with a symmetry, which is that you can get the same kernel from differently rotated embeddings. Any isometric transformation of the embedding space gives the same kernel. So you have to get rid of that symmetry somehow.
So that's something that I think is interesting. But a lot of you might be saying, OK, that's kind of obvious. Like, two different computer vision systems that perform well in the same set of tasks have similar representations. Maybe it has to be that.
We're not proving this at the level of the neural collapse in the previous talk, where we are really showing that it has to be that. But I think intuitively it kind of-- yeah, if my representation is good for the same set of things, then it's going to be a similar representation.
But maybe it could have been otherwise. It could have been that there are equivalently good representations for the same set of things. But it's not unreasonable. But this is, I think, going to be the more surprising-- at least to me, this was a more surprising experiment. Are language models learning the same kernel as vision models?
So hypothesis one, well, no, of course not. Language models model language. They're going to learn about syntax and next word prediction. They're not going to really have anything to do with vision. And the better the language model, the more it will be a specialist in just the superficial characteristics of language.
Hypothesis two: no, actually, it's what Jacob was saying. Language models are world models and they learn general knowledge about the world. And so the best language model will be the best model of the world, which will also be a good vision model. And maybe there's a strong form, which is like yeah, the best language model is literally the best vision model. They actually converge to exactly the same thing.
So here, we're going to measure the similarity between two kernels, but now it's a vision kernel and a language kernel. So how do we measure between two different modalities? What we do is we use paired data. So we take the image of an apple and the image of an orange to create the vision kernel. And we take the word apple and the word orange to make the corresponding paired matched language kernel.
So importantly, the language models we'll look at are trained only on language data, no vision. The vision models we'll look at are trained only on images, no language. But in order to evaluate the similarity of how they represent the world, we'll use paired data to align the two kernels in this way.
So here's the Wikipedia data that we ran this on. And we take a bunch of images online. We take their captions, so it's an entire sentence. We embed the images with the vision system. We embed the captions with a language system. We extract some layer of the language representation. There's technical details there.
We get these embeddings. And then we will measure: is the distance between the image embedding for Half Dome and the image embedding for Yosemite Valley the same as the distance between the text embeddings for the words Half Dome and Yosemite Valley, or for the sentences "Half Dome at sunset" and "Yosemite Valley"?
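A minimal sketch of the cross-modal version, reusing the helpers sketched earlier; `vision_encoder` and `language_encoder` are stand-ins for whatever unimodal models are being compared, and `images[i]` is assumed to be paired with `captions[i]`:

```python
def cross_modal_alignment(vision_encoder, language_encoder, images, captions, k=10):
    # Build one kernel per modality over paired data, then compare them with
    # the same mutual nearest-neighbor score used for vision-vision alignment.
    K_vision = kernel_matrix(vision_encoder, images)
    K_text = kernel_matrix(language_encoder, captions)
    return mutual_knn_alignment(K_vision, K_text, k=k)
```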
OK. So on the x-axis is going to be the competency of the language model. That's going to be measured as just how good you are at next character prediction, next word prediction. It uses a metric called bits per byte, but those are details. Basically, the log likelihood of the next character that you're predicting.
On the y-axis is going to be the kernel alignment between each of the language models that we look at versus DINO, which is a vision model. OK, so what's the trend going to be? OK, we made it into a line by choosing the right metric.
Different metrics will be not quite as linear, but they go up and to the right. So these are a bunch of different language models. We looked at different sizes-- bloom, 560 million parameters, llama, 65 billion parameters. Each point is a different language model.
And so llama has a similar kernel to DINO. It aligns with DINO, in the sense that I gave you. And the worst language models align worse. So better language models learn representations that measure distance in ways that are more alike to a given vision model. And it also goes the other way.
The better the vision model, the stronger the alignment to the language model. So DINO Giant is a better vision model than DINO Small. And DINO Giant has higher alignment to llama than DINO Small. OK.
So you might wonder, what about other vision models? What about not DINO? Well, basically-- oh, sorry. What's going to happen in the future? We really don't know. OK. I think this trend will keep increasing. But maybe we're going to overfit to language soon and just fall off a cliff. So it might not hold up into the future.
But the same story is true for a bunch of different vision models. This is Masked Autoencoder. This is CLIP. CLIP is interesting because CLIP is trained to align vision representations to text representations.
So you might have expected that the CLIP kernel will be really well aligned to a llama kernel. But it's only a little more aligned to the llama kernel than is DINO. OK, MAE is a bit worse. It's point one on our metric and CLIP is point two. But the trend is the same between all of these different systems. They're all going up and to the right.
OK. Another interesting thing is, this is true not just for log likelihood of the next character. But now I'm switching the axis. I'm going to have on the x-axis the alignment of a language model to vision and on the y-axis is the downstream performance of the language model.
So the more similar a language model is to a vision model in terms of this kernel, the higher performance that language model ends up getting on this common sense reasoning benchmark. And the trends are all kind of aligned.
So bigger models are more aligned to vision. Stronger models are more aligned to vision. And more recent models are more aligned to vision, because those three things are all correlated. So this does suggest something we haven't done, but we really want to do, which is, what if I fine tune llama to be more aligned with DINO? Will performance of llama go up?
If I fine tune a language model to be more like a vision model, will I do better at language modeling? This is a correlation. We haven't done that causal experiment, but hopefully we'll have something to say on that soon. Yeah.
AUDIENCE: Can you read the actual numbers on the axis? Because I'm having trouble seeing whether 0.2 is like a large score or I guess a reasonable score, or is it just like we're hitting noise with correlation, where you kind of understand.
PHILIP ISOLA: Right. 0.2, is that good alignment or bad alignment? It's definitely not-- the highest you can get on this metric is 1. This means that 20% of the nearest neighbors of model A are the same as the nearest neighbors in model B's embedding.
But that depends on your dictionary size-- how many possible candidate neighbors there are. I think the dictionary has 1,000. So of the top five nearest neighbors, when there are 1,000 possible candidates, 20% are shared.
So that's technically what it means. But I think it's easier to say, well, CLIP, a model trained to align vision to language, learns embeddings which are not that much more aligned to a language model than these pure vision models. So I think that gives you a calibration.
AUDIENCE: If I recall correctly, somehow llama is like 0.8. But here you show it's 0.2. Isn't that too low?
PHILIP ISOLA: Sorry, which one?
AUDIENCE: Somewhere back in the slides, you have this human alignment. The score is around 0.8.
PHILIP ISOLA: Oh, I think-- oh, in the vision modalities, it was 0.8? Yeah, so there's technical differences between the experiments and the metrics are maybe not directly comparable. But a vision system is more aligned to another vision system than it is to a language system. So the numbers here are going to be lower.
AUDIENCE: As a reference point, the model that you do not train, [INAUDIBLE] untrained is like 0.035.
PHILIP ISOLA: Yeah. So chance is like 0.03 if you just randomly initialize a network.
AUDIENCE: Do you know if you use the text to confirm what that would be?
PHILIP ISOLA: The text embedding from CLIP will be much higher. Yeah, this is the vision embedding from CLIP against llama. The text embedding from CLIP against llama, I think it's higher, but I don't remember. Brian, do you know? OK, we never ran it. So I just assume it's higher.
AUDIENCE: Yes, the vision embedding from CLIP, which presumably is like super, super, [INAUDIBLE].
PHILIP ISOLA: Yeah, I think it should be pretty high. Yeah. OK. OK. So yeah, there's a lot of other work in this space. So I guess this is just a slide saying, read the paper if you want to get more references.
There's a whole field around representational alignment. There's workshops. At NeurIPS, there's going to be a representational alignment workshop on these questions. But I want to move on now to some of our hypotheses and explanations for why this might happen. OK.
So first, I'm going to go over a few kind of basic ML 101 reasons why you would expect convergence to happen in theory. And then I'll go into an idea for what it might all be converging to, if this is true.
So the first effect from ML 101 is this. Here's my hypothesis space I'm searching over, and I'm training a model to solve a task. So here's the set of hypotheses that solve task 1. If I train a model to solve that one task, then it will arrive at something in this space.
If I train a model to solve task 2, it will arrive at something in this other space. And if I train a model to multitask, to solve task A and task B, or equivalently to do well on data set A and data set B, then we'll be at the intersection of these two.
So the more constraints you put on your machine learning system, the more objectives it has to satisfy, the more data points it has to fit. You get a strictly smaller subspace of the hypothesis space, which is going to fit all of those constraints. So this is one effect that could lead to convergence, training on more tasks, on more data-- and that's what we're doing now-- should result in fewer and fewer valid functions that satisfy all of those constraints.
OK. This has also been called the contravariance principle. It's roughly the same idea. And also, again, others have referred to this as kind of an Anna Karenina. Happy families, happy representations have to satisfy a lot of things. Any one thing that goes wrong, and they'll be unhappy. But if you have to solve a lot of things, then it will force this convergence.
OK, so this observation has been out there for a while. But I think that this is part of it. OK, here's another, I think, important condition for us to get convergence. So we need to have big enough models.
Because if you think about it, if I have two models that are kind of small-- so here's the possible space of all functions. And here's neural net A hypothesis space. It only parametrizes some functions. Neural net B hypothesis space, it parametrizes other functions.
And this ambient space, actually the minimum in the ambient space of the machine learning problem is over here. It's not within either of these hypothesis spaces. Then they won't converge because this guy will learn that point and this guy will learn that point.
But if I simply scale up the models, I have a greater chance of finding a solution-- a greater chance that the two models will overlap on the optimum in the ambient space. So if I scale up two hypothesis spaces, they're going to overlap more. And therefore, there will be the chance that they can actually arrive at the same solution.
So I think that this is part of it too, that we are making bigger and bigger models. And that should have the effect that there is the possibility of convergence, if I also have other constraints that force me to find solutions in a small subspace.
OK. And then another hypothesis is that, well, here's the set of functions that solve all the tasks, but it might still have a lot of symmetries. It might be that this is a huge space, there's a million functions that fit any-- n data points can be fit with a million different curves, right?
So I need to have some regularizer that chooses which of those curves I'm actually going to pick within this space. These are the points that fit the data, the parameters that fit the data. But which one am I going to choose?
So we know that in deep nets and all of machine learning, you always have some regularizers, implicit or explicit. And these impose some bias towards simple functions. And so we'll converge. Maybe that bias prefers smooth, simple functions, and so it will push us toward some particular corner of this space. So that could be affecting convergence too. And there's a lot of work-- I think we had some talks on it earlier-- showing that deep nets do have these types of biases toward low norm, low rank solutions, and so forth.
So these are kind of machine learning 101 type explanations for why this might happen. But I think that they could be part of it. So next, I want to talk about what is the endpoint of all of this.
This is the most speculative part of the talk. This is all just position talk. It is all speculation, but with some results. But the most speculation is here. So here is why we came up with this name, the platonic hypothesis.
So it goes back to the idea of Plato's cave. So you might have heard of that story. Plato imagined prisoners in a cave whose only experience of the outside world is the shadows projected on the cave wall. And the prisoner's task is to somehow infer what is going on outside in the real world.
And it's meant as an allegory because Plato is saying that's how it is for all of us in real life. We don't see reality. We just see photons and waveforms. And these are projections, partial shadows of the true platonic ideal.
And so he had some ideas around what that platonic ideal might be. It's cones and squares and mathematical objects. We're not making any specific claim there, but just saying that all of the data we see is some projection or sampling from an underlying world.
And that world is common between all the different ways of sampling the data. So here we have our real world out there, our platonic ideal. And we can take a photo of it or we can talk about it with language or we can listen to it.
And all of the different modalities are projections somehow, directly or indirectly, of that world. So here's an indirect projection. I go take a photo and then I caption the photo. But the information comes from that world.
And if I train representations unimodally, on language alone and on vision alone, well, they become similar because ultimately that data is just a function of that underlying world. That's the rough idea. Not an incredibly novel idea, but that's the rough idea we're promoting.
OK. So we have a kind of toy mathematical model for starting to work with this and make more precise statements. So we're going to imagine a world which works as follows. The world consists of a discrete set of events, z. z is that causal variable that generates our observations.
OK, so there's a distribution over z. We observe data, and that data is mediated through observation functions. We put a huge constraint on these functions: we're going to assume that they are bijective, meaning the observation-- the image, say-- contains all the information in the world.
Which is not true, but that's going to be our toy model for now. In this world, we're going to model co-occurrences over observations. So how often do I see red cone next to blue cone, for example?
And this modeling co-occurrences over observations is super common. So in computer vision, we do a lot of this contrastive learning. We sample two patches, two co-occurring patches in an image. And we try to align the two co-occurring patches and move apart two patches from two different images.
In language, there are similar contrastive learners, which try to say that car and street are similar because they co-occur together. This is like distributional semantics, like meaning is use kind of ideas. So this modeling co-occurrences and learning representations from it is one of the standard things that people do.
OK. So you can show-- and people have shown-- that contrastive learning with the NCE objective converges to the pointwise mutual information between your observations. So what is contrastive learning doing? Contrastive learning is learning an embedding f such that the inner product between image A and image B will be related to this probability ratio, which is how often A and B co-occur divided by how often you would expect them to co-occur by chance.
This is saying, if I have two image patches that co-occur, I'm going to bring them together, make them have high similarity. Two image patches that don't co-occur, I'll try to push apart, make them have low similarity. And so that objective is equivalent to the pointwise mutual information between these two objects.
OK, so contrastive learning of this particular form in this particular world boils down to learning an embedding in which similarity in the embedding space, the kernel of that embedding, is equal to the pointwise mutual information between these observations.
So that's just saying that if apples and oranges co-occur in the kitchen together, then they will embed close to each other under this learning objective. And elephants will embed far away because they don't co-occur with these two things.
OK. Now, the really interesting thing is that if my observation functions are bijective and my world is discrete, then these projections don't end up changing these probabilities. So what this ends up meaning is that the PMI over the observations x is equal to the PMI over the underlying events z.
So this is a really toy world. But in this toy world, you will expect that you'll get exactly the same kernel if you learn from images versus if you learn from text. OK, so this is one toy construction in which you will get convergence, theoretically.
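Spelling out that step: a bijection on a discrete space just relabels the events, so it preserves the probabilities that the PMI is built from,

$$
P(x_a, x_b) \;=\; P\big(g(z_a), g(z_b)\big) \;=\; P(z_a, z_b)
\quad\Longrightarrow\quad
\mathrm{PMI}(x_a, x_b) \;=\; \mathrm{PMI}(z_a, z_b),
$$

and since the right-hand side does not depend on which observation function you used, every modality recovers the same kernel in this toy world.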
It matches roughly what we do in contrastive learning, except that real data is not sampled in a bijective fashion. And so that's the big assumption that's wrong in this model.
But I hope it's a starting point. And this does actually hold to some degree in real data. So here's a simple experiment. I think, actually, Jacob showed some results along these lines as well.
So you can measure the co-occurrence of color values within images and compute the PMI between pairs of colors. Green and blue have high pointwise mutual information; they co-occur a lot. Red and green have lower; they co-occur less.
And you'll recover an embedding in which blue and green are near each other because they co-occur a lot, and blue and red are far from each other because they co-occur less. So that's roughly similar to how you see color.
You think green and blue are more similar than red and green. It's roughly similar to the LAB color space, which is kind of how humans see color. And if you do the same on words, the word red and the word blue and the word green and the word yellow, you'll get roughly the same kernel, which gives you roughly the same embeddings.
OK, so this is a replication of some prior work from Abdou et al., but we ran it on a few newer models, a contrastive language learner and a predictive language learner, and they do in fact learn color kernels similar to the ones you get from co-occurrence over pixel values.
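Here is a rough sketch of the pixel side of that experiment, just to make the recipe concrete; the quantization into color bins, the "co-occur if they appear in the same image" convention, the smoothing, and the 2D embedding are all my own illustrative choices, not the exact setup of the paper or of Abdou et al.

import numpy as np

def color_pmi_embedding(images, n_bins=64, dim=2):
    # images: list of (H, W) integer arrays of quantized color-bin ids (hypothetical format)
    counts = np.ones((n_bins, n_bins))                 # add-one smoothing
    for img in images:
        hist = np.bincount(img.ravel(), minlength=n_bins).astype(float)
        counts += np.outer(hist, hist)                 # colors co-occur if they appear in the same image
    joint = counts / counts.sum()
    marginal = joint.sum(axis=1)
    pmi = np.log(joint / np.outer(marginal, marginal))
    # embed colors so inner products approximate the PMI kernel (classical MDS-style eigendecomposition)
    vals, vecs = np.linalg.eigh(pmi)                   # pmi is symmetric
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

Run the analogous counting over color words in sentences and you can compare the two kernels, or the two low-dimensional embeddings, directly.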
So just to make that a little more intuitive, all I'm saying is that if you sampled pixels from an image, then you'll be likely to see two shades of blue co-occurring. And that will tell you if you train a contrastive learner, that will make the two shades of blue have a high similarity.
If you sample color words from sentences, then you'll also probably have the two shades of blue co-occurring, because people will describe the scene and they might describe the colors. And if they describe blue, they're likely to describe turquoise. OK, so you'll get a similar kernel in both of those cases, because both of these are descriptions or observations of the same underlying world.
And as long as those observations satisfy some properties, in the strictest sense being bijective and so forth, then you'll get this result. But in the more relaxed, real setting, I think you might get a more relaxed version of that result. OK.
So finally, I'll talk a little bit about implications and limitations, because you're probably all thinking, OK, this is cool, but you're way overstating it. There's a lot of details that are wrong. And it's true, there are a lot of details that are wrong. OK, so let's look at some limitations.
So one is that, hold on, we can't get perfect convergence between language and vision, because there are some images-- some visual experiences that are just ineffable. You can't describe them in language. And there are some verbal concepts which can't be visualized.
OK, so for example, how would I talk about the experience of seeing a total solar eclipse? Can you raise your hand if you've seen one? Did you see the one a few months ago? Yeah, OK. Can you describe that in words? You can't really tell a friend; "it was so magical" is about all you can say. It's ineffable. OK, so clearly, your visual experience and your sensory response was just fundamentally different from just talking about it.
What about this? I believe in the freedom of speech. What is the visual equivalent of that? It's an abstract concept. Vision is not good at showing abstract concepts. I mean, I could take a photo of the text, but that's a little bit of a cheat.
OK, so these are all cases where you don't have this bijection between the world and the observation. The text is some abstraction; information is lost when you abstract. In the vision case, maybe people never talk about solar eclipses with the level of detail necessary to really convey that experience, so you don't capture that information in text. Nobody talks about it.
OK. The kind of weird thing about this is, maybe vision and language are different, but our best vision systems are trained to be aligned with language. So we're kind of training our computer vision systems to reduce the world to the same information as a sentence, as a caption.
And that's working the best on a lot of tasks. So OK, maybe there are some narrow edge case differences, but there's a lot of shared information too. And that might be the majority of it. OK.
Oh yeah, here's just an experiment kind of investigating that in a little bit more detail. Certainly the word orange can't capture the same representational complexity as a picture of an orange. But what about a paragraph talking about that picture of the orange? Maybe that is going to have enough information to actually capture the same kind of representation.
And in fact, we do see evidence of this. So if we look at alignment between sentences and images, the longer the sentence is, the higher the alignment. So if I look at the alignment between embeddings of five-word sentences and images, the alignment is mediocre. And if I look at the embeddings of 30-word sentences and images, it's higher.
And if I had the embedding of an entire Shakespearean play and an image of a rose, maybe then Shakespeare has described it in so much detail that you'll have very good alignment. So an image is worth a thousand words, is the idea here.
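Just to be concrete about what "alignment" can mean here, one simple metric is a mutual nearest-neighbor overlap between paired embeddings; this is a generic sketch (cosine similarity and the choice of k are illustrative), not necessarily the exact metric behind the plot I'm describing.

import numpy as np

def mutual_knn_alignment(emb_a, emb_b, k=10):
    # emb_a[i] and emb_b[i] are embeddings of the same item in two representations
    # (e.g., an image and its caption); higher overlap of neighbor sets = higher alignment.
    def knn_ids(emb):
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sim = emb @ emb.T
        np.fill_diagonal(sim, -np.inf)                 # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]
    nn_a, nn_b = knn_ids(emb_a), knn_ids(emb_b)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))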
OK. I think for the sake of time, I'll say-- yeah, we can talk about that offline. The alignment's not perfect. There's different ways of measuring it. These are technical details. But this one comes to mind for a lot of people.
OK, maybe I buy the story about convergence, but I don't think you're converging to reality. I don't buy the platonic part of it. You're converging to whatever the internet reflects about the world, which might be biased and not actually about the physical truth, but just some kind of weird misinformation online.
So maybe all these models are converging and the alignment is increasing, but that's because they're converging to something superficial or not factual, these kind of BS machines, just bad language models. And that could be because we're training on data that doesn't really represent everything we care about.
It's limited in its own way. Or maybe our paradigm, our transformers, they're incapable of doing certain things. And so all these models are incapable of the same things, so it looks like convergence, but not to something good.
And there also could be sociotechnical reasons for this, because again, we share all of our ideas and everybody wants to do well on ImageNet classification. So we converge to a visual representation, which is good at that, but not at other things. OK.
But there are a lot of interesting implications. I think one was pointed out before that if these models learn similar representations, you should be able to share data and knowledge between models. You should be able to ensemble them, distill one to the other.
So in particular, it should help if you train your language models on images. And it should help if you train your image models on language. People are doing this, in fact. So unfortunately, we're not going to be able to give much advice because people are already doing it. But we're confirming that that was a good thing to do.
It should be possible to translate between modalities with minimal paired data because the kernel should be kind of a bridge which is invariant between the two modalities. And all you have to do is map to the kernel and then directly map to the other modality.
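Here is a toy version of that kernel-as-bridge idea, under strong assumptions: you have a handful of paired anchor examples, and the kernels really do match across modalities. Represent every item in each modality by its similarities to the anchors, then match across modalities by nearest neighbor in that shared similarity space. This is a sketch in the spirit of kernel-matching and relative-representation style methods, not any particular published system.

import numpy as np

def translate_by_kernel(src_embs, tgt_embs, src_anchors, tgt_anchors):
    # src_anchors[i] and tgt_anchors[i] embed the same anchor concept in the two
    # modalities; everything else is unpaired.
    def anchor_profile(embs, anchors):
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
        return embs @ anchors.T                        # similarity of each item to each anchor
    p_src = anchor_profile(src_embs, src_anchors)
    p_tgt = anchor_profile(tgt_embs, tgt_anchors)
    # match each source item to the target item with the closest anchor-similarity profile
    dists = ((p_src[:, None, :] - p_tgt[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)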
And there are some interesting papers that have used minimal paired data and kernel-type methods to do these cross-modal translation tasks. There's also this old question, Molyneux's problem. Molyneux wrote a letter to John Locke a few hundred years ago and asked: if a person who was born blind were given sight, would they immediately be able to tell a cone apart from a cube from sight alone, having only had experience with touching these objects in the past?
So do you get this knowledge transfer from one modality to the other? And there's recently some interesting work from Pawan Sinha and others at MIT, where they did find, in fact, that if you give sight to children who were born blind-- you do surgery, you correct their cataracts-- then after only a little bit of learning, they are able to associate images with their previous concepts.
And so it says that you can't do it with no learning at all, but you can with a little adaptation. And I think that's consistent with this hypothesis. If I already have this kind of platonic representation of the world from touch, then for a new modality all I have to do is learn how to map into that representation.
I don't have to learn the representation in the first place, so it should be a lot more data efficient. But the main implication is just that if there really is some kind of thing we're converging to, then we should understand it. We should characterize it. We should know what that is. It's an important object.
So if the hypothesis is true, at least in part, then I think this is an important thing to study. OK, so I'll end there. I had a set of extra slides on world models in case there was time, but we already heard about world models in language, so we'll skip that. We can talk about it offline. OK, so thank you.
[APPLAUSE]